Catastrophic backtracking and other regex traps
How backtracking regex engines blow up on hostile input, why nested quantifiers cause ReDoS, and the anchors, atomic groups, and linear engines that fix it.
A regular expression that runs in microseconds on a million inputs can hang a request thread for seconds on the million-and-first. The cause is almost always the same: a pattern with overlapping quantifiers fed an input that almost matches. The engine explores an exponential number of ways to fail before giving up. This post explains how the blowup happens, how to recognize the patterns that cause it, and what to do instead.
How a backtracking engine evaluates a pattern
Most regex engines you use day to day are backtracking NFA engines:
PCRE, Java's java.util.regex, JavaScript's RegExp, Python's re,
Ruby's Onigmo, .NET. They share one operating principle. When a
quantifier like + or * can match a span in more than one way, the
engine picks one, continues, and if the rest of the pattern fails, it
backs up and tries the next way.
Concretely, matching a+ against aaa:
a+greedily consumes all threeas.- Whatever follows in the pattern is tried at that position.
- If it fails,
a+gives back oneaand retries. - This repeats until the pattern matches or
a+is down to onea.
For a single quantifier the number of retries is linear in the input length. The trouble starts when two quantifiers can claim the same characters.
Catastrophic backtracking and ReDoS
Consider this pattern:
^(a+)+$
It means "one or more as, grouped, repeated one or more times, to end
of string." Against aaaa it matches instantly. Against
aaaaaaaaaaaaaaaaaaaaaaaaX it does not match — the X breaks it — and
that non-matching input is the worst case.
The inner a+ and the outer (...)+ overlap completely. A run of n
as can be partitioned between the two quantifiers in 2^(n-1) distinct
ways: one big group, two groups, n singletons, and every combination
between. Because the string ends in X instead of end-of-input, every
one of those partitions fails the $ anchor. The engine dutifully
tries all of them before concluding there is no match. Add one more a
and the work doubles.
The same shape hides in patterns that look more realistic:
^(\w+\s*)*$ # whitespace-separated words
^(\d+)*$ # repeated digit groups
(.*,)* # comma-separated anything
^([a-z]+)*$ # the classic
Each contains a quantified subexpression nested inside another quantifier where the two can match the same characters. The failure is triggered by long input that nearly matches — typically input an attacker controls.
That is what makes this a denial-of-service vector, not just a
performance footgun. ReDoS (regular-expression denial of service)
happens when a regex used in input validation — an email field, a
User-Agent parser, a URL route matcher — is exposed to attacker
input. The attacker sends a crafted string a few dozen bytes long, the
server spends seconds or minutes in the regex engine, and a handful of
requests exhaust the thread pool. Validation regexes copied from
forums are a recurring source, because they are written to accept
well-formed input and never tested against malicious near-misses.
Greedy vs lazy is not the fix
A common reflex is to swap greedy quantifiers for lazy ones — .* to
.*?, + to +? — on the theory that lazy "does less work." Lazy
quantifiers change the order in which the engine tries possibilities:
greedy tries the longest match first and backs off, lazy tries the
shortest first and extends. They do not change the number of
possibilities. A catastrophically backtracking pattern backtracks just
as catastrophically when made lazy; you have only changed which end of
the search space the engine starts from. Lazy quantifiers are a
correctness tool (matching the smallest span), not a performance one.
Mitigations that actually work
Remove the ambiguity. The real fix for ^(a+)+$ is to recognize
that it is equivalent to ^a+$. Most catastrophic patterns have a
flat, unambiguous rewrite. If you need words separated by spaces,
^\w+(\s+\w+)*$ is unambiguous where ^(\w+\s*)*$ is not, because the
separator and the word cannot match the same characters.
Anchor the pattern. Anchors bound the search. An unanchored pattern is implicitly tried at every starting position, multiplying the work and frequently the source of both slowness and false matches.
Use atomic groups or possessive quantifiers — if your engine has
them. An atomic group (?>...) matches its contents and then throws
away the backtracking positions inside it: once matched, the engine
will not give those characters back. A possessive quantifier a++ is
the shorthand for the same idea on a single quantifier. Rewriting
(a+)+ as (?>a+)+ or a++ collapses the exponential search to
linear. Engine support varies:
| Engine | Atomic groups (?>…) |
Possessive a++ |
Notes |
|---|---|---|---|
| PCRE / PCRE2 | yes | yes | full support |
Java java.util.regex |
yes | yes | since Java 4 / 5 |
| Ruby (Onigmo) | yes | yes | |
| .NET | yes | no | possessive not supported |
Python re |
yes | yes | atomic + possessive since 3.11 |
JavaScript RegExp |
no | no | TC39 proposal only; emulate with lookahead |
In JavaScript, atomic groups and possessive quantifiers are still a
proposal, not part of the language. The standard workaround is to
emulate atomicity with a lookahead and a backreference:
(?=(a+))\1 captures the greedy match and refuses to give it back.
Switch to a linear engine. RE2 (Google), the Rust regex crate,
and Go's regexp use a different execution model — they simulate the
NFA without backtracking, guaranteeing match time linear in the input
length regardless of the pattern. The trade-off is feature loss: no
backreferences, no lookaround, because those features are what force
backtracking in the first place. If you are running untrusted patterns
or untrusted input through a regex, a linear engine removes the entire
ReDoS class by construction. That is the right default for anything
user-facing where the regex or its input is not fully under your
control.
Anchors, lookaround, and their costs
^ and $ match string (or line) boundaries; \b matches a word
boundary. Beyond their performance role, missing anchors cause
correctness bugs: \d{4} matches inside abc12345, so a "year"
validator without ^...$ will accept garbage1999garbage. Anchor
deliberately and know whether $ matches end-of-string or end-of-line
under your flags (MULTILINE changes this).
Lookahead (?=...) / (?!...) and lookbehind (?<=...) / (?<!...)
assert that something does or does not appear without consuming
characters. They are useful — password rules, "match X not followed by
Y" — but each assertion runs a sub-match at its position, and
lookbehind in particular can be expensive when its contents are
variable-length. They also force a backtracking engine, ruling out RE2
and friends. Reach for them when the alternative is genuinely worse,
not by default.
When not to use regex at all
Regex matches regular languages. Nested or recursive structure is not regular, and trying to match it leads to patterns that are fragile, unreadable, or quietly wrong:
- HTML and XML — use a real parser. The tag-soup regex always misses a case.
- JSON, YAML, source code — nesting depth is unbounded; a regex cannot count balanced delimiters.
- Arithmetic expressions, balanced parentheses — same problem.
- CSV with quoted, embedded commas — a dedicated CSV reader is shorter and correct.
Regex is the right tool for tokenizing, validating flat formats, and search-and-replace over lines. The moment you find yourself trying to match things that nest, stop and reach for a parser.
Test before you ship
The cheapest defense is to test every validation pattern against long, nearly-matching, hostile input before it reaches production — not just the happy path. Our Regex Tester lets you run a pattern against sample strings and watch how it behaves, so you can catch an ambiguous quantifier before an attacker does. Combine that with an unambiguous rewrite, an explicit anchor, and a linear engine for any input you do not control, and the catastrophic case never fires.