Regex Tutorial — From Basics to Lookarounds and Catastrophic Backtracking

Hands-on regex tutorial covering character classes, quantifiers, groups, lookaround, and the catastrophic backtracking patterns to avoid.

~10 min read

Regex looks like an alien language at first glance, but thirty minutes of patient reading covers 80% of the text-validation, extraction, and replacement work you'll ever need. This guide moves from basics through groups and lookaround into the patterns you must never ship (catastrophic backtracking). Paste each example into Regex Tester as you read.

Literals vs metacharacters

Most characters match themselves. The pattern cat matches the string "cat". Fourteen characters are special:

. ^ $ * + ? ( ) [ ] { } | \

Escape them with \ to match the literal — e.g. \. for a literal period.
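When the literal text comes from a variable, escaping by hand is easy to get wrong. A minimal Python sketch using the standard library's re.escape; the sample strings are illustrative only:

```python
import re

# re.escape backslash-escapes every regex metacharacter in a literal string.
literal = "3.14"
escaped = re.escape(literal)  # the dot becomes \.

# Unescaped, the dot matches any character, so "3x14" slips through.
unescaped_hit = re.fullmatch(literal, "3x14") is not None
escaped_hit = re.fullmatch(escaped, "3x14") is not None
exact_hit = re.fullmatch(escaped, "3.14") is not None

print(escaped, unescaped_hit, escaped_hit, exact_hit)
```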

Character classes

Built-in

Pattern     Meaning
.           Any character except newline
\d          Digit (0-9)
\w          Word char ([A-Za-z0-9_])
\s          Whitespace (space, tab, newline)
\D \W \S    Negations of the above

Custom classes

[aeiou]        # one vowel
[a-z]          # one lowercase letter
[A-Za-z0-9]    # one alphanumeric
[^abc]         # not a, b, or c (^ negates)

Most metacharacters are literal inside brackets — except ] \ ^ -.
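A short Python sketch of the built-in and custom classes in action. The sample text is made up; note also that for str patterns Python's \d and \w match non-ASCII digits and letters unless re.ASCII is passed, so the ASCII definitions in the table apply to the ASCII-only input used here.

```python
import re

text = "Order 66 shipped to Bay_7 at 09:30"

digits = re.findall(r"\d", text)     # every single digit
words = re.findall(r"\w+", text)     # runs of letters, digits, underscore
vowels = re.findall(r"[aeiou]", "regex tutorial")
non_abc = re.findall(r"[^abc]", "aabbccdd")  # ^ inside brackets negates

print(digits)
print(words)
print(vowels)
print(non_abc)
```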

Quantifiers

Pattern   Meaning
?         0 or 1
*         0 or more
+         1 or more
{n}       Exactly n
{n,}      n or more
{n,m}     n to m

\d{3,4} matches 3 or 4 digits. https?:// matches "http://" or "https://".
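The two patterns above can be checked quickly in Python. This is only a sketch with invented test strings; fullmatch is used so the whole string has to fit the quantifier.

```python
import re

three_or_four = re.compile(r"\d{3,4}")
length_check = {
    s: three_or_four.fullmatch(s) is not None
    for s in ("12", "123", "1234", "12345")
}

# With search, the greedy {3,4} grabs four digits when four are available.
first_hit = re.search(r"\d{3,4}", "12345").group()

scheme = re.compile(r"https?://")  # the ? makes the "s" optional
scheme_check = [
    scheme.match(u) is not None
    for u in ("http://a", "https://a", "httpss://a")
]

print(length_check, first_hit, scheme_check)
```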

Greedy vs lazy

Quantifiers are greedy by default — they match as much as possible. Add ? for lazy — as little as possible.

<.+>      # greedy: matches <b>text</b> entirely
<.+?>     # lazy: matches just <b>

For HTML-tag-style "stop at the next closing character" patterns, lazy is almost always what you want.
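A minimal Python sketch of the same greedy and lazy patterns, run against the tag example used above:

```python
import re

html = "<b>text</b>"

greedy = re.search(r"<.+>", html).group()   # runs to the last ">"
lazy = re.search(r"<.+?>", html).group()    # stops at the first ">"
all_tags = re.findall(r"<.+?>", html)       # lazy finds each tag separately

print(greedy)
print(lazy)
print(all_tags)
```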

Anchors

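  • ^ matches the start of the string, or of each line when the m flag is set
  • $ matches the end of the string, or of each line when the m flag is set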
  • \b — word boundary (between a word char and a non-word char)

^[A-Z] matches lines starting with a capital letter. \bcat\b matches the word "cat" but not "category".
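A small Python sketch of the anchors. In Python's re module, ^ matches only at the start of the string unless re.MULTILINE is passed, which the first two calls illustrate; the sample strings are invented.

```python
import re

text = "Alpha\nbeta\nGamma"

# Without MULTILINE, ^ anchors to the start of the whole string only.
string_start = re.findall(r"^[A-Z]", text)
# With MULTILINE, ^ also matches right after each newline.
line_starts = re.findall(r"^[A-Z]", text, re.MULTILINE)

# \b keeps "cat" from matching inside longer words.
whole_words = re.findall(r"\bcat\b", "cat category concat cat.")

print(string_start, line_starts, whole_words)
```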

Groups and captures

Parentheses do two things: group a sub-pattern and capture the match.

(\d{4})-(\d{2})-(\d{2})

Against "2026-05-16", group 1 = "2026", group 2 = "05", group 3 = "16". Access via match[1] in JS, m.group(1) in Python.
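The date example can be sketched in Python as follows. The re.sub call is an extra illustration, not part of the original example: it reuses the captured groups as \1, \2, \3 in the replacement string.

```python
import re

m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", "2026-05-16")

whole = m.group(0)      # group 0 is always the entire match
year = m.group(1)
parts = m.groups()      # all numbered groups as a tuple

# Captures can be reused in a replacement to reorder the date.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2026-05-16")

print(whole, year, parts, reordered)
```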

Named groups

(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})

JS: match.groups.year. Python: m.group('year'), but note that Python's re module spells the group (?P<year>…) instead of (?<year>…).
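A Python sketch of the same date pattern with named groups. Python's re module uses the (?P<name>…) spelling, so the pattern is adjusted accordingly:

```python
import re

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = DATE.fullmatch("2026-05-16")
year = m.group("year")
fields = m.groupdict()   # every named group as a dict

print(year, fields)
```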

Non-capturing groups

(?:https?|ftp)://

Groups without capturing. Keeps the result-array indices clean.
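To see what "clean indices" means in practice, here is a Python sketch comparing a capturing and a non-capturing scheme group. The URLs and the extra (\S+) host group are invented for the illustration.

```python
import re

text = "see ftp://files.example.org and https://example.com"

# Capturing the scheme yields a tuple per match: (scheme, rest).
with_capture = re.findall(r"(https?|ftp)://(\S+)", text)
# Non-capturing scheme: only the part we care about is returned.
without_capture = re.findall(r"(?:https?|ftp)://(\S+)", text)

print(with_capture)
print(without_capture)
```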

Alternation

cat|dog|bird

One of the three. Usually combined with a group: (cat|dog|bird).
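One reason to wrap alternation in a group is precedence: | binds more loosely than anchors and sequences. A Python sketch with made-up inputs:

```python
import re

pets = re.findall(r"cat|dog|bird", "dog, bird, fish, cat")

# Without a group, this reads as (^cat) OR (dog$).
loose = re.search(r"^cat|dog$", "catalog")
# Grouping applies both anchors to every alternative.
strict = re.search(r"^(?:cat|dog)$", "catalog")

print(pets, loose.group(), strict)
```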

Lookaround

Checks "X must be followed/preceded by Y" without including Y in the match.

Pattern    Meaning
X(?=Y)     X if followed by Y (lookahead)
X(?!Y)     X if not followed by Y (negative lookahead)
(?<=Y)X    X if preceded by Y (lookbehind)
(?<!Y)X    X if not preceded by Y (negative lookbehind)

\d+(?= USD) matches "100" in "100 USD" but the unit "USD" itself stays out of the match.
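All four lookaround forms can be sketched in Python. The sample strings are invented, and the \b anchors in the negative cases are an addition: without them the engine can backtrack to a shorter run of digits and still satisfy the negative condition.

```python
import re

amounts = "100 USD, 200 EUR, 300 USD"
usd = re.findall(r"\d+(?= USD)", amounts)          # followed by " USD"
not_usd = re.findall(r"\b\d+\b(?! USD)", amounts)  # not followed by " USD"

prices = "$40 and 15 items"
after_dollar = re.findall(r"(?<=\$)\d+", prices)   # preceded by "$"
no_dollar = re.findall(r"(?<!\$)\b\d+", prices)    # not preceded by "$"

print(usd, not_usd, after_dollar, no_dollar)
```

Python's re module only accepts fixed-width lookbehind, which the single-character \$ satisfies.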

Flags

  • i — case-insensitive
  • g — global (JavaScript)
  • m — multiline (^/$ match each line)
  • s — dotAll (. matches newline)
  • u — Unicode (JavaScript)
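In Python the same flags are passed as constants rather than letters after the pattern. A sketch of the equivalents, with invented sample strings; Python has no g flag (findall and finditer return every match) and str patterns are Unicode by default.

```python
import re

# i: case-insensitive
case_hit = re.search(r"regex", "REGEX tutorial", re.IGNORECASE) is not None
# m: ^ and $ match at each line
line_starts = re.findall(r"^\w+", "one\ntwo\nthree", re.MULTILINE)
# s: dot also matches newline
dot_default = re.search(r"a.b", "a\nb") is not None
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None
# g: not needed, findall already returns all matches
every_o = re.findall(r"o", "foo boo")
# Flags can also be written inline at the start of the pattern.
inline_hit = re.search(r"(?i)regex", "REGEX") is not None

print(case_hit, line_starts, dot_default, dot_all, every_o, inline_hit)
```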

Real examples

Pragmatic email

^[^\s@]+@[^\s@]+\.[^\s@]+$

Perfect RFC 5322 takes dozens of lines, but this covers 95% of real validation. The only true validation is "send a verification email and wait for the click".
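A Python sketch of the pragmatic pattern, with invented addresses. One Python-specific detail: $ also matches just before a trailing newline, so this sketch uses fullmatch, which rejects that case.

```python
import re

EMAIL = re.compile(r"^[^\s@]+@[^\s@]+\.[^\s@]+$")

def looks_like_email(value: str) -> bool:
    """Shape check only; it does not prove the mailbox exists."""
    return EMAIL.fullmatch(value) is not None

checks = {
    "user@example.com": looks_like_email("user@example.com"),
    "user@example": looks_like_email("user@example"),            # no dot part
    "a b@example.com": looks_like_email("a b@example.com"),      # whitespace
    "user@@example.com": looks_like_email("user@@example.com"),  # double @
    "trailing newline": looks_like_email("user@example.com\n"),
}
print(checks)
```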

URL

https?://[^\s/$.?#].[^\s]*
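A Python sketch of the URL pattern used for extraction. The sample sentence is invented, and the rstrip clean-up is an assumption about what a caller might want, not part of the pattern: because [^\s]* runs to the next whitespace, sentence punctuation right after a URL is swallowed.

```python
import re

URL = re.compile(r"https?://[^\s/$.?#].[^\s]*")

text = "Docs at https://example.com/path?q=1 and http://foo.bar."
found = URL.findall(text)

# Trim punctuation that belongs to the sentence rather than the URL.
cleaned = [u.rstrip(".,;:!?") for u in found]

print(found)
print(cleaned)
```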

Hangul (Korean syllables)

[가-힣]+

With the Unicode flag, a script property is more correct: \p{Script=Hangul} in JavaScript, or the shorter \p{Hangul} in PCRE.
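A Python sketch of the range-based pattern. Python's standard re module has no \p{…} support, so only the [가-힣] form is shown; the last line illustrates one limit of the range, since standalone jamo such as ㅋ fall outside it.

```python
import re

HANGUL = re.compile(r"[가-힣]+")

text = "Hello 안녕하세요 world 감사합니다"
syllable_runs = HANGUL.findall(text)

# Standalone consonant jamo are not in the precomposed-syllable range.
jamo_runs = HANGUL.findall("ㅋㅋ")

print(syllable_runs, jamo_runs)
```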

Slug validator

^[a-z0-9]+(-[a-z0-9]+)*$

kebab-case slug — matches what Slug Generator produces.
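A Python sketch of the slug validator with a few invented inputs, showing which shapes the pattern rejects:

```python
import re

SLUG = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_slug(value: str) -> bool:
    """True when value is lowercase alphanumerics joined by single hyphens."""
    return SLUG.fullmatch(value) is not None

results = {
    s: is_slug(s)
    for s in ("regex-tutorial-2026", "Regex-Tutorial", "double--dash",
              "-leading", "trailing-")
}
print(results)
```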

Catastrophic backtracking — patterns to avoid

When a match fails, the regex engine backtracks to try alternatives. Some patterns explode exponentially in input length, turning a 100-character string into minutes of CPU — known as ReDoS (Regex Denial of Service). Real production incidents have brought down Cloudflare, Stack Overflow, and others.

Dangerous pattern 1: nested quantifiers

(a+)+$

Against "aaaaaaaaaaaaaaaaaaaaaaaa!" (where the trailing ! makes the match fail), the engine tries every way to split the a's between the inner and outer + before giving up, which is exponential in the input length.

Dangerous pattern 2: alternation + greedy

(a|a)+$

Same character in both alternatives. Each position now has two choices, multiplying at every step.
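The blow-up can be observed with a small timing harness. This is a sketch that assumes a backtracking engine such as Python's standard re module; the input sizes are deliberately small, and absolute timings depend on the machine and Python version, so none are quoted here.

```python
import re
import time

NESTED = re.compile(r"(a+)+$")

def time_failed_match(n: int) -> float:
    """Seconds spent failing to match n a's followed by '!'."""
    subject = "a" * n + "!"
    start = time.perf_counter()
    result = NESTED.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing '!' makes a match impossible
    return elapsed

# Keep n small: on a backtracking engine each extra 'a' roughly doubles the work.
timings = {n: time_failed_match(n) for n in (14, 16, 18, 20)}
for n, seconds in timings.items():
    print(f"n={n:2d}  {seconds:.4f}s")
```

Raising n by a few more characters is usually enough to make the growth obvious; avoid large values outside a throwaway process.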

Mitigations

  • No nested quantifiers. (a+)+ rewrites to a+.
  • Atomic groups / possessive quantifiers — PCRE and Java support (?>a+)+ or a++ to disable backtracking. JavaScript does not.
  • Cap input length. For user-supplied patterns, enforce a max length (e.g. 10 KB).
  • Match timeouts. Java needs a thread + interrupt pattern; Go's regexp package (RE2-style) is immune by design: it runs in linear time, but supports no backreferences or lookaround.
  • Test before shipping. Run patterns through a ReDoS checker or measure runtime in Regex Tester.
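Two of the mitigations above, flattening the nested quantifier and capping input length, can be sketched in Python. The names guarded_match and MAX_INPUT_CHARS are hypothetical helpers for this illustration, and the 10,000-character cap is an assumption loosely mirroring the "10 KB" suggestion.

```python
import re

MAX_INPUT_CHARS = 10_000  # assumed cap; tune for the application

FLAT = re.compile(r"a+$")  # flattened form of (a+)+$

def guarded_match(pattern: re.Pattern, text: str,
                  max_chars: int = MAX_INPUT_CHARS):
    """Refuse oversized input before the regex engine ever sees it."""
    if len(text) > max_chars:
        raise ValueError(f"input too long: {len(text)} > {max_chars} chars")
    return pattern.match(text)

ok = guarded_match(FLAT, "aaaa")
# The flattened pattern fails after a single linear backtrack.
rejected = guarded_match(FLAT, "a" * 5000 + "!")

print(ok is not None, rejected is None)
```

A length cap alone does not rescue an exponential pattern, which can stall on a few dozen characters, so it works best combined with the rewrite.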

Recap

  • Memorize the fourteen metacharacters; everything else is literal.
  • Quantifiers are greedy by default. Add ? for lazy when you want "up to the next closer".
  • Groups serve two purposes: capture and grouping. Use (?:…) when you don't need the capture.
  • Lookaround checks context without consuming characters — great for splitting and substitution.
  • Never ship nested quantifiers like (a+)+ — that's ReDoS waiting to happen.