BlogDeveloper

Catastrophic backtracking and other regex traps

How backtracking regex engines blow up on hostile input, why nested quantifiers cause ReDoS, and the anchors, atomic groups, and linear engines that fix it.

A regular expression that runs in microseconds on a million inputs can hang a request thread for seconds on the million-and-first. The cause is almost always the same: a pattern with overlapping quantifiers fed an input that almost matches. The engine explores an exponential number of ways to fail before giving up. This post explains how the blowup happens, how to recognize the patterns that cause it, and what to do instead.

How a backtracking engine evaluates a pattern

Most regex engines you use day to day are backtracking NFA engines: PCRE, Java's java.util.regex, JavaScript's RegExp, Python's re, Ruby's Onigmo, .NET. They share one operating principle. When a quantifier like + or * can match a span in more than one way, the engine picks one, continues, and if the rest of the pattern fails, it backs up and tries the next way.

Concretely, matching a+ against aaa:

  • a+ greedily consumes all three as.
  • Whatever follows in the pattern is tried at that position.
  • If it fails, a+ gives back one a and retries.
  • This repeats until the pattern matches or a+ is down to one a.

For a single quantifier the number of retries is linear in the input length. The trouble starts when two quantifiers can claim the same characters.

Catastrophic backtracking and ReDoS

Consider this pattern:

^(a+)+$

It means "one or more as, grouped, repeated one or more times, to end of string." Against aaaa it matches instantly. Against aaaaaaaaaaaaaaaaaaaaaaaaX it does not match — the X breaks it — and that non-matching input is the worst case.

The inner a+ and the outer (...)+ overlap completely. A run of n as can be partitioned between the two quantifiers in 2^(n-1) distinct ways: one big group, two groups, n singletons, and every combination between. Because the string ends in X instead of end-of-input, every one of those partitions fails the $ anchor. The engine dutifully tries all of them before concluding there is no match. Add one more a and the work doubles.

The same shape hides in patterns that look more realistic:

^(\w+\s*)*$          # whitespace-separated words
^(\d+)*$             # repeated digit groups
(.*,)*               # comma-separated anything
^([a-z]+)*$          # the classic

Each contains a quantified subexpression nested inside another quantifier where the two can match the same characters. The failure is triggered by long input that nearly matches — typically input an attacker controls.

That is what makes this a denial-of-service vector, not just a performance footgun. ReDoS (regular-expression denial of service) happens when a regex used in input validation — an email field, a User-Agent parser, a URL route matcher — is exposed to attacker input. The attacker sends a crafted string a few dozen bytes long, the server spends seconds or minutes in the regex engine, and a handful of requests exhaust the thread pool. Validation regexes copied from forums are a recurring source, because they are written to accept well-formed input and never tested against malicious near-misses.

Greedy vs lazy is not the fix

A common reflex is to swap greedy quantifiers for lazy ones — .* to .*?, + to +? — on the theory that lazy "does less work." Lazy quantifiers change the order in which the engine tries possibilities: greedy tries the longest match first and backs off, lazy tries the shortest first and extends. They do not change the number of possibilities. A catastrophically backtracking pattern backtracks just as catastrophically when made lazy; you have only changed which end of the search space the engine starts from. Lazy quantifiers are a correctness tool (matching the smallest span), not a performance one.

Mitigations that actually work

Remove the ambiguity. The real fix for ^(a+)+$ is to recognize that it is equivalent to ^a+$. Most catastrophic patterns have a flat, unambiguous rewrite. If you need words separated by spaces, ^\w+(\s+\w+)*$ is unambiguous where ^(\w+\s*)*$ is not, because the separator and the word cannot match the same characters.

Anchor the pattern. Anchors bound the search. An unanchored pattern is implicitly tried at every starting position, multiplying the work and frequently the source of both slowness and false matches.

Use atomic groups or possessive quantifiers — if your engine has them. An atomic group (?>...) matches its contents and then throws away the backtracking positions inside it: once matched, the engine will not give those characters back. A possessive quantifier a++ is the shorthand for the same idea on a single quantifier. Rewriting (a+)+ as (?>a+)+ or a++ collapses the exponential search to linear. Engine support varies:

Engine Atomic groups (?>…) Possessive a++ Notes
PCRE / PCRE2 yes yes full support
Java java.util.regex yes yes since Java 4 / 5
Ruby (Onigmo) yes yes
.NET yes no possessive not supported
Python re yes yes atomic + possessive since 3.11
JavaScript RegExp no no TC39 proposal only; emulate with lookahead

In JavaScript, atomic groups and possessive quantifiers are still a proposal, not part of the language. The standard workaround is to emulate atomicity with a lookahead and a backreference: (?=(a+))\1 captures the greedy match and refuses to give it back.

Switch to a linear engine. RE2 (Google), the Rust regex crate, and Go's regexp use a different execution model — they simulate the NFA without backtracking, guaranteeing match time linear in the input length regardless of the pattern. The trade-off is feature loss: no backreferences, no lookaround, because those features are what force backtracking in the first place. If you are running untrusted patterns or untrusted input through a regex, a linear engine removes the entire ReDoS class by construction. That is the right default for anything user-facing where the regex or its input is not fully under your control.

Anchors, lookaround, and their costs

^ and $ match string (or line) boundaries; \b matches a word boundary. Beyond their performance role, missing anchors cause correctness bugs: \d{4} matches inside abc12345, so a "year" validator without ^...$ will accept garbage1999garbage. Anchor deliberately and know whether $ matches end-of-string or end-of-line under your flags (MULTILINE changes this).

Lookahead (?=...) / (?!...) and lookbehind (?<=...) / (?<!...) assert that something does or does not appear without consuming characters. They are useful — password rules, "match X not followed by Y" — but each assertion runs a sub-match at its position, and lookbehind in particular can be expensive when its contents are variable-length. They also force a backtracking engine, ruling out RE2 and friends. Reach for them when the alternative is genuinely worse, not by default.

When not to use regex at all

Regex matches regular languages. Nested or recursive structure is not regular, and trying to match it leads to patterns that are fragile, unreadable, or quietly wrong:

  • HTML and XML — use a real parser. The tag-soup regex always misses a case.
  • JSON, YAML, source code — nesting depth is unbounded; a regex cannot count balanced delimiters.
  • Arithmetic expressions, balanced parentheses — same problem.
  • CSV with quoted, embedded commas — a dedicated CSV reader is shorter and correct.

Regex is the right tool for tokenizing, validating flat formats, and search-and-replace over lines. The moment you find yourself trying to match things that nest, stop and reach for a parser.

Test before you ship

The cheapest defense is to test every validation pattern against long, nearly-matching, hostile input before it reaches production — not just the happy path. Our Regex Tester lets you run a pattern against sample strings and watch how it behaves, so you can catch an ambiguous quantifier before an attacker does. Combine that with an unambiguous rewrite, an explicit anchor, and a linear engine for any input you do not control, and the catastrophic case never fires.