regex — Understand and Apply Regular Expressions
Regular expression syntax reference — character classes, anchors, quantifiers, groups and lookarounds. The regex building blocks explained concisely.
Regular expressions (regex) are a compact language for searching, validating and replacing patterns in text. Instead of matching fixed strings, you describe whole classes of matches with wildcards, repetitions and anchors – think "four digits in a row" or "an email address". You meet them everywhere: in grep, sed and awk on the command line, in editors and in practically every programming language. Keep in mind that several dialects exist – BRE and ERE in the classic Unix tools, Perl-style syntax in Perl itself and, through the PCRE library, in PHP and many other tools – so details around escaping and extensions can differ. This reference shows you the building blocks you reach for most in practice.
Character Classes
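A minimal sketch of how the classes in this section behave, written in Python as an assumption (the reference itself is engine-neutral). Python's re module treats \d and \w as Unicode-aware for text strings, so the re.ASCII flag is passed to get the [0-9] and [a-zA-Z0-9_] behaviour described here. The sample strings are made up for illustration.

```python
import re

text = "Order 66: hat, hit, hot; id=var_123!"

# . matches any single character except a newline
dot_matches = re.findall(r"h.t", text)

# \d+ matches runs of digits; re.ASCII limits \d to 0-9
digit_runs = re.findall(r"\d+", text, flags=re.ASCII)

# \w+ matches runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text, flags=re.ASCII)

# [aeiou] matches one lowercase vowel
vowels = re.findall(r"[aeiou]", "hat")

# [^a-z\s] matches anything that is neither a lowercase letter nor whitespace
others = re.findall(r"[^a-z\s]", "ab C-3")

print(dot_matches)  # ['hat', 'hit', 'hot']
print(digit_runs)   # ['66', '123']
print(words)        # ['Order', '66', 'hat', 'hit', 'hot', 'id', 'var_123']
print(vowels)       # ['a']
print(others)       # ['C', '-', '3']
```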
. — Match any single character (except newline by default).
h.t matches hat, hit, hot
\d — Match any digit (0-9). Equivalent to [0-9] in ASCII-only matching; Unicode-aware engines may accept other digit characters as well.
\d{3} matches 123, 456, 789
\D — Match any non-digit character.
\D+ matches abc, hello
\w — Match any word character (letter, digit, underscore). Equivalent to [a-zA-Z0-9_] in ASCII-only matching.
\w+ matches hello_world, var123
\W — Match any non-word character.
\W matches !, @, spaces
\s — Match any whitespace (space, tab, newline).
hello\sworld matches 'hello world'
\S — Match any non-whitespace character.
\S+ matches any non-space word
[abc] — Match any one character in the set.
[aeiou] matches any lowercase vowel
[^abc] — Match any character NOT in the set.
[^0-9] matches any non-digit
[a-z] — Match any character in the range.
[a-zA-Z] matches any ASCII letter
Quantifiers
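The quantifiers in this section can be tried out with a short script. This is a sketch in Python (an assumption, any engine with the same syntax would do) using the same example patterns as the entries, with made-up input strings. The last two lines show the difference between greedy and lazy matching.

```python
import re

candidates = ["ac", "abc", "abbc"]

# * allows zero or more b's, + requires at least one
star = [s for s in candidates if re.fullmatch(r"ab*c", s)]
plus = [s for s in candidates if re.fullmatch(r"ab+c", s)]

# ? makes the preceding element optional
optional = re.findall(r"colou?r", "color colour")

# {n,m} bounds the repetition; \b keeps it from matching part of a longer number
bounded = re.findall(r"\b\d{2,4}\b", "1 12 123 1234 12345")

# Greedy .* runs to the last '>', lazy .*? stops at the first one
html = "<b>text</b>"
greedy = re.search(r"<.*>", html).group()
lazy = re.search(r"<.*?>", html).group()

print(star)      # ['ac', 'abc', 'abbc']
print(plus)      # ['abc', 'abbc']
print(optional)  # ['color', 'colour']
print(bounded)   # ['12', '123', '1234']
print(greedy)    # <b>text</b>
print(lazy)      # <b>
```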
* — Match 0 or more of the preceding element (greedy).
ab*c matches ac, abc, abbc, abbbc
+ — Match 1 or more of the preceding element (greedy).
ab+c matches abc, abbc but NOT ac
? — Match 0 or 1 of the preceding element (optional).
colou?r matches color and colour
{n} — Match exactly n occurrences.
\d{4} matches exactly 4 digits: 2026
{n,} — Match n or more occurrences.
\d{2,} matches 2 or more digits
{n,m} — Match between n and m occurrences.
\d{2,4} matches 12, 123, or 1234
*? +? ?? — Lazy (non-greedy) versions: match as few as possible.
<.*?> matches <b> in '<b>text</b>' (not the whole string)
Anchors & Boundaries
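A short sketch of the anchors and boundaries in this section, in Python as an assumption. Match positions are printed instead of the matched text so that \b and \B can be told apart. One Python-specific detail: $ also matches just before a trailing newline, so re.fullmatch is the stricter choice when validating a whole string.

```python
import re

# ^ ties the pattern to the start of the string
at_start = re.search(r"^Hello", "Hello World") is not None
not_at_start = re.search(r"^Hello", "Say Hello") is not None

# \b requires a word boundary on that side, \B forbids one
text = "cat category concatenate"
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
inside_word = [m.start() for m in re.finditer(r"\Bcat\B", text)]

# ^...$ accepts a trailing newline in Python; fullmatch does not
loose = re.search(r"^\d{5}$", "12345\n") is not None
strict = re.fullmatch(r"\d{5}", "12345\n") is not None

print(at_start, not_at_start)  # True False
print(whole_word)              # [0]  -> the standalone word
print(inside_word)             # [16] -> the 'cat' inside 'concatenate'
print(loose, strict)           # True False
```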
^ — Match the start of a line/string.
^Hello matches 'Hello World' but not 'Say Hello'
$ — Match the end of a line/string.
world$ matches 'hello world' but not 'world hello'
\b — Match a word boundary (between \w and \W).
\bcat\b matches 'cat' but not 'category'
\B — Match a non-word boundary.
\Bcat\B matches 'concatenate' but not 'cat'
^...$ — Match the entire string (combined anchors).
^\d{5}$ matches only exactly 5 digits
Groups & Alternation
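The sketch below walks through the group types in this section using Python, which is an assumption rather than the reference's own dialect. Python's re module spells named groups (?P<name>...), so that form is used here. The input strings are invented for illustration, and the back-reference pattern adds \b on both sides so that it only fires on whole repeated words.

```python
import re

# Capturing groups are numbered from 1 in the order of their opening parenthesis
m = re.search(r"(\d{3})-(\d{4})", "call 555-1234 now")
first, second = m.group(1), m.group(2)

# A non-capturing group groups without creating a numbered capture,
# so findall returns only the one real capture group
hosts = re.findall(r"(?:https?://)?(\w+\.com)", "http://a.com https://b.com c.com")

# Alternation picks either branch
pets = re.findall(r"cat|dog", "cat, dog, bird")

# \1 must repeat exactly what group 1 captured
repeated = re.search(r"\b(\w+)\s\1\b", "this is the the end").group(1)

# Named groups are read back by name
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2026-03-19")
parts = date.groupdict()

print(first, second)  # 555 1234
print(hosts)          # ['a.com', 'b.com', 'c.com']
print(pets)           # ['cat', 'dog']
print(repeated)       # the
print(parts)          # {'year': '2026', 'month': '03', 'day': '19'}
```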
(abc) — Capturing group: group and capture for back-references.
(\d{3})-(\d{4}) captures area code and number separately
(?:abc) — Non-capturing group: group without capturing.
(?:https?://)? optionally matches http:// or https://
a|b — Alternation: match a OR b.
cat|dog matches 'cat' or 'dog'
\1 \2 — Back-reference: match the same text as a previous group.
(\w+)\s\1 matches repeated words like 'the the'
(?<name>abc) — Named capturing group. The spelling varies by engine; Python's re module, for example, writes it (?P<name>abc).
(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}) captures the parts of a date such as 2026-03-19 as year, month and day
Lookahead & Lookbehind
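Lookarounds interact with backtracking in ways that are easy to miss, so a short experiment helps. The sketch below uses Python (an assumption) and made-up currency strings. It shows that a negative lookahead or lookbehind after or before \d+ can still match part of a number unless the pattern also pins the number's edge.

```python
import re

# Positive lookahead: digits only when ' USD' follows; the suffix is not consumed
usd = re.findall(r"\d+(?= USD)", "100 USD, 200 EUR")

# Negative lookahead pitfall: \d+ backtracks to '10', which is followed by '0', not ' USD'
naive_ahead = re.findall(r"\d+(?! USD)", "100 USD")
# \b forces the match to end at the end of the number
fixed_ahead = re.findall(r"\d+\b(?! USD)", "100 USD, 200 EUR")

# Positive lookbehind: digits directly after a dollar sign
dollars = re.findall(r"(?<=\$)\d+", "$50 and EUR 70")

# Negative lookbehind pitfall: the '0' in '$50' is preceded by '5', not '$'
naive_behind = re.findall(r"(?<!\$)\d+", "$50")
# Excluding a preceding digit as well closes that gap
fixed_behind = re.findall(r"(?<![$\d])\d+", "$50 and EUR 70")

print(usd)           # ['100']
print(naive_ahead)   # ['10']
print(fixed_ahead)   # ['200']
print(dollars)       # ['50']
print(naive_behind)  # ['0']
print(fixed_behind)  # ['70']
```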
(?=abc) — Positive lookahead: match if followed by abc (without consuming).
\d+(?= USD) matches '100' in '100 USD'
(?!abc) — Negative lookahead: match if NOT followed by abc.
\d+\b(?! USD) matches '100' in '100 EUR' but nothing in '100 USD' (without the \b, \d+ backtracks and still matches '10')
(?<=abc) — Positive lookbehind: match if preceded by abc.
(?<=\$)\d+ matches '50' in '$50'
(?<!abc) — Negative lookbehind: match if NOT preceded by abc.
(?<![$\d])\d+ matches '50' in 'EUR 50' but nothing in '$50' (with only (?<!\$), the '0' after the '5' would still match)
Flags / Modifiers
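The /pattern/flags notation in this section comes from languages with regex literals. As a sketch of the same flags in a language without them, the example below uses Python (an assumption), where flags are passed as constants and "global" is not a flag at all: re.search returns the first match and re.findall returns every match. The input strings are invented.

```python
import re

# i -> re.IGNORECASE
ci = re.findall(r"hello", "Hello HELLO hello", flags=re.IGNORECASE)

# g -> no flag in Python: choose search (first match) or findall (all matches)
first = re.search(r"\d+", "a1 b22 c333").group()
every = re.findall(r"\d+", "a1 b22 c333")

# m -> re.MULTILINE: ^ also matches right after each newline
lines = "start one\nmiddle\nstart two"
without_m = len(re.findall(r"^start", lines))
with_m = len(re.findall(r"^start", lines, flags=re.MULTILINE))

# s -> re.DOTALL: . also matches '\n'
block = "start\nmiddle\nend"
no_dotall = re.search(r"start.*end", block)
dotall = re.search(r"start.*end", block, flags=re.DOTALL)

print(ci)                  # ['Hello', 'HELLO', 'hello']
print(first, every)        # 1 ['1', '22', '333']
print(without_m, with_m)   # 1 2
print(no_dotall is None)   # True
print(dotall.group() == block)  # True
```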
i — Case-insensitive matching.
/hello/i matches Hello, HELLO, hello
g — Global: find all matches, not just the first.
/\d+/g finds all numbers in a string
m — Multiline: ^ and $ match start/end of each line.
/^start/m matches 'start' at beginning of any line
s — Dotall: make . also match newline characters.
/start.*end/s matches across multiple lines
Common Patterns
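The patterns in this section are format checks rather than full validators. The sketch below, in Python as an assumption, copies them verbatim into a small lookup table and probes a few inputs, including two that pass although they are not valid values. The helper name looks_like and the dictionary keys are hypothetical, chosen only for this illustration.

```python
import re

# Patterns copied from this section; re.ASCII keeps \w and \d to ASCII characters
PATTERNS = {
    "email": r"^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}$",
    "url": r"^https?://[\w.-]+(?:/[\w./?%&=-]*)?$",
    "ipv4": r"^\d{1,3}(?:\.\d{1,3}){3}$",
    "hex_color": r"^#?([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$",
    "iso_date": r"^\d{4}-\d{2}-\d{2}$",
    "password": r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$",
}
COMPILED = {name: re.compile(p, re.ASCII) for name, p in PATTERNS.items()}


def looks_like(kind: str, value: str) -> bool:
    """Return True if value passes the basic format check for kind."""
    return COMPILED[kind].search(value) is not None


print(looks_like("email", "name.tag@sub.domain.org"))          # True
print(looks_like("url", "https://example.com/path?q=search"))  # True
print(looks_like("hex_color", "#ffff"))                        # False (4 hex digits)
print(looks_like("password", "password"))                      # False (no uppercase, no digit)

# Shape-only checks: these pass although the values are not valid
print(looks_like("ipv4", "999.999.999.999"))                   # True
print(looks_like("iso_date", "2026-13-99"))                    # True
```

Where real validation matters, a format check like this is usually followed by a semantic one, for instance parsing the date or range-checking each octet.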
^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}$ — Email address (basic validation).
user@example.com, name.tag@sub.domain.org
^https?://[\w.-]+(?:/[\w./?%&=-]*)?$ — URL (HTTP/HTTPS).
https://example.com/path?q=search
^\d{1,3}(?:\.\d{1,3}){3}$ — IPv4 address (basic format check; octets above 255, such as 999, still pass).
192.168.1.1, 10.0.0.255
^#?([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$ — Hex color code.
#fff, #1a2b3c, 00ff00
^\d{4}-\d{2}-\d{2}$ — Date in ISO 8601 format (YYYY-MM-DD); checks the shape only, not whether month and day are valid.
2026-03-19
^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$ — Password: min 8 chars, uppercase, lowercase, digit.
Passw0rd, MyS3cret!
Conclusion
Regular expressions are powerful, but they deserve a careful hand. Mind the dialect first: BRE, ERE and PCRE differ in which characters you have to escape and which extensions (such as lookbehind or named groups) are available at all – a pattern that runs in PHP may fail under grep. Also keep the difference between greedy and lazy quantifiers in mind: .* grabs as much as possible by default, which often matches more than intended – .*? fixes that. And beware complex, nested patterns with overlapping quantifiers: they can trigger "catastrophic backtracking" (ReDoS) and bring an engine to a near halt on certain inputs. Keep patterns as simple as you can and test them against real data.
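To make the backtracking warning concrete, here is a small sketch in Python (an assumption). It compares a nested pattern, ^(a+)+$, which is a commonly cited example of overlapping quantifiers, with the simpler ^a+$ that accepts the same strings. The input length of 18 is arbitrary and deliberately small; the work for the nested pattern grows quickly with every extra character, so treat the timings as indicative only.

```python
import re
import time

nested = re.compile(r"^(a+)+$")  # overlapping quantifiers
simple = re.compile(r"^a+$")     # accepts the same strings without nesting


def timed(pattern, text):
    """Return (matched, seconds) for one search call."""
    t0 = time.perf_counter()
    matched = pattern.search(text) is not None
    return matched, time.perf_counter() - t0


# Both patterns agree on which strings match
samples = ["a", "aaaa", "aab", ""]
agree = all(
    (nested.search(s) is None) == (simple.search(s) is None) for s in samples
)

# On a near miss, the nested version tries many ways of splitting the run of a's
near_miss = "a" * 18 + "b"
nested_hit, nested_secs = timed(nested, near_miss)
simple_hit, simple_secs = timed(simple, near_miss)

print("same results:", agree)
print(f"nested: matched={nested_hit}, {nested_secs:.6f}s")
print(f"simple: matched={simple_hit}, {simple_secs:.6f}s")
```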
Further Reading
- Wikipedia: Regular expression – introduction to the theory and syntax
- regex101 – interactive online tester with explanations
- MDN: Regular expressions – in-depth guide (JavaScript dialect)