regex — Understand and Apply Regular Expressions

Regular expressions (regex) are a compact language for searching, validating and replacing patterns in text. Instead of matching fixed strings, you describe whole classes of matches with wildcards, repetitions and anchors: think "four digits in a row" or "an email address". You meet them everywhere: in grep, sed and awk on the command line, in editors and in practically every programming language. Keep in mind that several dialects exist (BRE and ERE in the classic Unix tools, Perl-style syntax in Perl itself, in the PCRE library used by PHP, and in many modern languages), so details around escaping and extensions can differ. This reference shows you the building blocks you reach for most in practice; unless noted otherwise, the syntax shown is the Perl/PCRE style.

Character Classes

. — Match any single character (except newline by default).

h.t matches hat, hit, hot

\d — Match any digit (0-9). Equivalent to [0-9] in ASCII mode; some Unicode-aware engines also match other decimal digits.

\d{3} matches 123, 456, 789

\D — Match any non-digit character.

\D+ matches abc, hello

\w — Match any word character (letter, digit, underscore). Equivalent to [a-zA-Z0-9_] in ASCII mode; Unicode-aware engines also match non-ASCII letters and digits.

\w+ matches hello_world, var123

\W — Match any non-word character.

\W matches !, @, spaces

\s — Match any whitespace character (space, tab, newline, carriage return and similar).

hello\sworld matches 'hello world'

\S — Match any non-whitespace character.

\S+ matches any run of non-whitespace characters, e.g. hello or v1.2

[abc] — Match any one character in the set.

[aeiou] matches any vowel

[^abc] — Match any character NOT in the set.

[^0-9] matches any non-digit

[a-z] — Match any character in the range.

[a-zA-Z] matches any letter
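
To see the character classes above in action, here is a minimal sketch in Python's re module. The sample text and variable names are made up for illustration. It also shows the Unicode caveat: for str patterns Python treats \d and \w as Unicode-aware unless re.ASCII is passed.

```python
import re

text = "Order 66: ship 3 crates to dock_7 by 2026!"  # illustrative sample text

digits = re.findall(r"\d+", text)               # runs of digits
words = re.findall(r"\w+", text)                # runs of letters, digits, underscore
vowels = re.findall(r"[aeiou]", "regex")        # any one character from the set
non_digits = re.findall(r"[^0-9]+", "a1b22c")   # negated set: everything but 0-9

# Unicode caveat: U+0663 is ARABIC-INDIC DIGIT THREE.
arabic_three = "\u0663"
unicode_digits = re.findall(r"\d", arabic_three + "4")          # both characters match
ascii_digits = re.findall(r"\d", arabic_three + "4", re.ASCII)  # only "4" matches

print(digits)      # ['66', '3', '7', '2026']
print(vowels)      # ['e', 'e']
print(non_digits)  # ['a', 'b', 'c']
```

If you need strictly ASCII digits in a Unicode-aware engine, writing [0-9] explicitly is the most portable choice.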

Quantifiers

* — Match 0 or more of the preceding element (greedy).

ab*c matches ac, abc, abbc, abbbc

+ — Match 1 or more of the preceding element (greedy).

ab+c matches abc, abbc but NOT ac

? — Match 0 or 1 of the preceding element (optional).

colou?r matches color and colour

{n} — Match exactly n occurrences.

\d{4} matches exactly 4 digits: 2026

{n,} — Match n or more occurrences.

\d{2,} matches 2 or more digits

{n,m} — Match between n and m occurrences.

\d{2,4} matches 12, 123, or 1234

*? +? ?? — Lazy (non-greedy) versions: match as few as possible.

<.*?> matches <b> in '<b>text</b>' (not the whole string)
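
The difference between greedy and lazy quantifiers is easiest to see side by side. The following sketch uses Python's re module with made-up sample strings; other Perl-style engines should behave the same way for these patterns.

```python
import re

html = "<b>text</b>"

greedy = re.findall(r"<.*>", html)    # .* runs to the last ">" in the string
lazy = re.findall(r"<.*?>", html)     # .*? stops at the first ">" it can

optional = re.findall(r"colou?r", "color colour colr")
star = re.findall(r"ab*c", "ac abc abbbc")
plus = re.findall(r"ab+c", "ac abc abbbc")

# Without anchors or boundaries, {2,4} happily matches part of a longer number.
bounded = re.findall(r"\d{2,4}", "7 12 123 12345")
bounded_whole = re.findall(r"\b\d{2,4}\b", "7 12 123 12345")

print(greedy)         # ['<b>text</b>']
print(lazy)           # ['<b>', '</b>']
print(bounded)        # ['12', '123', '1234']
print(bounded_whole)  # ['12', '123']
```

Note how \d{2,4} picks 1234 out of 12345; combine quantifiers with anchors or \b when the length limit should apply to the whole token.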

Anchors & Boundaries

^ — Match the start of a line/string.

^Hello matches 'Hello World' but not 'Say Hello'

$ — Match the end of a line/string.

world$ matches 'hello world' but not 'world hello'

\b — Match a word boundary (a position between \w and \W, or between \w and the start or end of the string).

\bcat\b matches 'cat' but not 'category'

\B — Match a non-word boundary.

\Bcat\B matches 'concatenate' but not 'cat'

^...$ — Match the entire string (combined anchors).

^\d{5}$ matches only exactly 5 digits
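
A short sketch of anchors and boundaries in Python's re module, with illustrative inputs. One detail is specific to Python and some other engines: $ also matches just before a single trailing newline, so re.fullmatch is the stricter choice when validating whole strings.

```python
import re

five_digits = re.compile(r"^\d{5}$")

ok = bool(five_digits.search("12345"))             # True
too_long = bool(five_digits.search("123456"))      # False

# "$" tolerates one trailing newline in Python; fullmatch does not.
with_newline = bool(five_digits.search("12345\n"))       # True
strict = bool(re.fullmatch(r"\d{5}", "12345\n"))         # False

sample = "cat category concatenate"
# Start offsets show which occurrence of "cat" each pattern picks.
whole_word_at = [m.start() for m in re.finditer(r"\bcat\b", sample)]  # [0]
inner_at = [m.start() for m in re.finditer(r"\Bcat\B", sample)]       # [16]
```

Here \bcat\b finds only the standalone word at offset 0, while \Bcat\B finds only the occurrence buried inside "concatenate".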

Groups & Alternation

(abc) — Capturing group: group and capture for back-references.

(\d{3})-(\d{4}) captures the two parts of 555-1234 (555 and 1234) separately

(?:abc) — Non-capturing group: group without capturing.

(?:https?://)? optionally matches http:// or https://

a|b — Alternation: match a OR b.

cat|dog matches 'cat' or 'dog'

\1 \2 — Back-reference: match the same text as a previous group.

(\w+)\s\1 matches repeated words like 'the the' (wrap it in \b...\b to avoid hits that start or end inside a longer word)

(?<name>abc) — Named capturing group.

(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}) captures year, month and day by name (Python's re module writes (?P<year>...) instead)
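
The group constructs above can be sketched in Python as follows. Note the dialect difference: Python's re module spells named groups (?P<name>...) rather than (?<name>...). Sample strings and variable names are illustrative only.

```python
import re

# Named groups (Python spelling: ?P<name>)
iso = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
parts = iso.search("Released on 2026-03-19.").groupdict()

# Numbered capturing groups
first, second = re.search(r"(\d{3})-(\d{4})", "call 555-1234").groups()

# Back-reference: \1 must repeat exactly what group 1 captured
doubled = re.search(r"\b(\w+)\s\1\b", "it is the the best").group(1)

# Non-capturing group: groups the optional scheme without adding a capture,
# so findall returns only the host captured by the one real group.
hosts = re.findall(r"(?:https?://)?(\w+\.\w+)", "http://a.io and b.org")

print(parts)    # {'year': '2026', 'month': '03', 'day': '19'}
print(doubled)  # the
print(hosts)    # ['a.io', 'b.org']
```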

Lookahead & Lookbehind

(?=abc) — Positive lookahead: match if followed by abc (without consuming).

\d+(?= USD) matches '100' in '100 USD'

(?!abc) — Negative lookahead: match if NOT followed by abc.

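\d+(?! USD) matches '100' in '100 EUR'; in '100 USD' it still matches '10', because the engine backtracks by one digit. Use \d+(?!\d| USD) to reject '100 USD' entirely.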

(?<=abc) — Positive lookbehind: match if preceded by abc.

(?<=\$)\d+ matches '50' in '$50'

(?<!abc) — Negative lookbehind: match if NOT preceded by abc.

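(?<!\$)\d+ matches '50' in 'EUR 50'; in '$50' it still matches '0', because that digit is preceded by '5', not by '$'. Use (?<![$\d])\d+ to reject '$50' entirely.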
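
Negative lookarounds combined with a quantifier have a well-known pitfall: the engine is free to shorten or shift the match until the assertion passes. The sketch below, using Python's re module and made-up amounts, shows both the naive patterns and one possible way to tighten them.

```python
import re

amounts = "100 USD, 200 EUR"

usd = re.findall(r"\d+(?= USD)", amounts)           # digits followed by " USD"

# Naive negative lookahead: backtracks from "100" to "10", which is
# followed by "0" rather than " USD", so the assertion passes.
naive_ahead = re.findall(r"\d+(?! USD)", "100 USD")

# Tightened: the match may not be followed by another digit either.
not_usd = re.findall(r"\d+(?!\d| USD)", amounts)

prices = "$50 and EUR 75"
dollars = re.findall(r"(?<=\$)\d+", prices)         # digits preceded by "$"

# Naive negative lookbehind: the "0" in "$50" is preceded by "5", not "$".
naive_behind = re.findall(r"(?<!\$)\d+", "$50")

# Tightened: the match may not start right after "$" or another digit.
not_dollars = re.findall(r"(?<![$\d])\d+", prices)

print(naive_ahead)   # ['10']
print(not_usd)       # ['200']
print(naive_behind)  # ['0']
print(not_dollars)   # ['75']
```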

Flags / Modifiers

i — Case-insensitive matching.

/hello/i matches Hello, HELLO, hello

g — Global: find all matches, not just the first.

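/\d+/g finds all numbers in a string (JavaScript and Perl syntax; other languages use a find-all function instead of a flag, e.g. re.findall in Python)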

m — Multiline: ^ and $ match start/end of each line.

/^start/m matches 'start' at beginning of any line

s — Dotall: make . also match newline characters.

/start.*end/s matches across multiple lines
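
The /pattern/flags notation above comes from Perl and JavaScript. As a rough sketch of how the same flags look in Python's re module: i, m and s become re.IGNORECASE, re.MULTILINE and re.DOTALL, and the g flag has no counterpart because re.search returns the first match while re.findall returns all of them. The sample text is illustrative.

```python
import re

text = "start here\nStart again\nthe end"

case_insensitive = re.findall(r"start", text, re.IGNORECASE)
inline_flag = re.findall(r"(?i)hello", "Hello HELLO")     # flag written inside the pattern

first_only = re.search(r"\d+", "a1 b22 c333").group()     # like a match without g
all_numbers = re.findall(r"\d+", "a1 b22 c333")           # like a match with g

line_starts = re.findall(r"^\w+", text, re.MULTILINE)     # ^ matches after each newline

without_dotall = re.search(r"start.*end", text)           # "." stops at the newline
with_dotall = re.search(r"start.*end", text, re.DOTALL)   # "." crosses newlines

print(case_insensitive)   # ['start', 'Start']
print(line_starts)        # ['start', 'Start', 'the']
print(without_dotall)     # None
```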

Common Patterns

^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}$ — Email address (basic validation).

user@example.com, name.tag@sub.domain.org

^https?://[\w.-]+(?:/[\w./?%&=-]*)?$ — URL (HTTP/HTTPS).

https://example.com/path?q=search

^\d{1,3}(?:\.\d{1,3}){3}$ — IPv4 address (basic format check).

192.168.1.1, 10.0.0.255 (the pattern also accepts out-of-range values such as 999.999.999.999)

^#?([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$ — Hex color code.

#fff, #1a2b3c, 00ff00

^\d{4}-\d{2}-\d{2}$ — Date in ISO 8601 format (YYYY-MM-DD).

2026-03-19 (the pattern also accepts impossible dates such as 2026-99-99)

^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$ — Password: min 8 chars, uppercase, lowercase, digit.

Passw0rd, MyS3cret!
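
Several of the patterns above check only the shape of the input, not its meaning. One common approach, sketched here in Python under the assumption that a format check plus a small semantic check is good enough, is to let the regex handle the shape and ordinary code handle the values. The helper names is_ipv4 and is_iso_date are hypothetical.

```python
import re
from datetime import date

# re.ASCII keeps \d limited to 0-9; fullmatch replaces the ^...$ anchors.
IPV4_SHAPE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}", re.ASCII)
ISO_DATE_SHAPE = re.compile(r"(\d{4})-(\d{2})-(\d{2})", re.ASCII)
PASSWORD = re.compile(r"(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}")


def is_ipv4(value: str) -> bool:
    """Shape check by regex, then a range check on each octet."""
    if not IPV4_SHAPE.fullmatch(value):
        return False
    return all(int(octet) <= 255 for octet in value.split("."))


def is_iso_date(value: str) -> bool:
    """Shape check by regex, then let datetime reject impossible dates."""
    match = ISO_DATE_SHAPE.fullmatch(value)
    if not match:
        return False
    try:
        date(*(int(group) for group in match.groups()))
    except ValueError:
        return False
    return True


print(is_ipv4("192.168.1.1"), is_ipv4("999.999.999.999"))   # True False
print(is_iso_date("2026-03-19"), is_iso_date("2026-02-30"))  # True False
print(bool(PASSWORD.fullmatch("Passw0rd")))                  # True
```

For email addresses and URLs the same idea applies: the basic patterns are fine as a first filter, but a dedicated parser or a confirmation step is usually more reliable than a longer regex.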

Conclusion

Regular expressions are powerful, but they deserve a careful hand. Mind the dialect first: BRE, ERE and PCRE differ in which characters you have to escape and which extensions (such as lookbehind or named groups) are available at all, so a pattern that runs in PHP may fail under grep. Also keep the difference between greedy and lazy quantifiers in mind: .* grabs as much as possible by default, which often matches more than intended; the lazy form .*? or a narrower character class such as [^>]* usually fixes that. And beware complex, nested patterns with overlapping quantifiers such as (a+)+: they can trigger "catastrophic backtracking" (ReDoS) and bring a backtracking engine to a near halt on certain inputs. Keep patterns as simple as you can and test them against real data.
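
To get a feel for catastrophic backtracking, the following sketch times a nested quantifier against an equivalent flat one in Python's re module. The helper time_search and the input length are arbitrary choices for illustration, and the input is kept short on purpose. Timings depend on machine and engine version, so none are quoted here; in a backtracking engine the nested form typically gets much slower with every extra character.

```python
import re
import time


def time_search(pattern: str, subject: str) -> float:
    """Return the wall-clock seconds one re.search call takes."""
    started = time.perf_counter()
    re.search(pattern, subject)
    return time.perf_counter() - started


# Almost matches, then fails on the final "b": the worst case for backtracking.
subject = "a" * 18 + "b"

nested = r"^(a+)+$"   # many ways to split the run of a's between the two "+"
flat = r"^a+$"        # accepts the same strings with a single quantifier

slow = time_search(nested, subject)
fast = time_search(flat, subject)

print(f"nested: {slow:.4f}s, flat: {fast:.6f}s")
```

Both patterns accept and reject exactly the same inputs; only the amount of work on a failing match differs. Rewriting nested quantifiers into a flat form like this is often the simplest defence.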