Percent-encoding: reserved characters and the double-encoding bug
How URL percent-encoding works, why space is %20 in a path but + in a form body, encodeURIComponent vs encodeURI, and how the double-encoding bug produces %2520.
A URL is allowed to contain only a limited set of ASCII characters.
Anything outside that set — a space, a non-Latin letter, or one of the
characters that the URL syntax reserves for structure — has to be
represented with percent-encoding: a % followed by two hexadecimal
digits naming a byte. The byte sequence comes from encoding the
character as UTF-8 first, then writing each byte as %XX. So a space
(byte 0x20) becomes %20, and é (UTF-8 bytes 0xC3 0xA9) becomes
%C3%A9. This post covers which characters need encoding, the %20
vs + ambiguity that breaks form handling, the JavaScript functions
that get misused, and the double-encoding bug that shows up when
encoding happens in more than one layer.
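The UTF-8-then-hex step described above can be sketched by hand. This is an illustration only, assuming a runtime that provides TextEncoder (browsers and current Node.js); percentEncode is a hypothetical helper name, and in real code the built-in encoders covered later are the better choice.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of `text`
// except the unreserved ASCII characters A-Z a-z 0-9 - . _ ~
function percentEncode(text) {
  const unreserved = /^[A-Za-z0-9\-._~]$/;
  const bytes = new TextEncoder().encode(text); // string -> UTF-8 bytes
  let out = "";
  for (const byte of bytes) {
    const ch = String.fromCharCode(byte);
    if (byte < 0x80 && unreserved.test(ch)) {
      out += ch; // safe as-is
    } else {
      // write the byte as % plus two uppercase hex digits
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(percentEncode(" "));      // "%20"
console.log(percentEncode("\u00E9")); // "%C3%A9" (U+00E9 is é, two UTF-8 bytes)
```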
Why URLs need it
RFC 3986 defines the grammar for URIs, and that grammar permits a small, fixed character repertoire. Characters fall into three groups:
- Unreserved (always safe, never need encoding): A-Z a-z 0-9 - . _ ~
- Reserved: legal in a URL but carry structural meaning, so they must be encoded when they appear inside a value rather than as delimiters.
- Everything else: spaces, control characters, and all non-ASCII, which has no place in the raw URL and must be encoded.
The reason reserved characters are the tricky group is that the same
byte can be either a delimiter or data depending on position. A /
between path segments is structure; a / inside a single segment's
value is data and has to become %2F.
Reserved characters
These are the characters that change a URL's meaning if left raw inside a value. Encode them when they are part of the data:
| Char | Encoded | Structural role |
|---|---|---|
| space | %20 | not legal raw; ends the URL in many parsers |
| ? | %3F | starts the query string |
| # | %23 | starts the fragment |
| & | %26 | separates query parameters |
| = | %3D | separates key from value |
| / | %2F | separates path segments |
| + | %2B | means space in form-encoded data |
| % | %25 | introduces a percent-escape |
The + and % rows are the ones that cause silent corruption. A
literal + in a query value that you forget to encode will be read
back as a space by anything doing form decoding. A literal % that
isn't encoded to %25 either errors out or, worse, gets interpreted
as the start of an escape that was never intended.
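Both failure modes are easy to reproduce. The sketch below assumes a runtime with URLSearchParams (which decodes with form rules); tryDecode is a hypothetical wrapper added only so the thrown error can be shown as a value.

```javascript
// A raw + in a query value is read back as a space by a form decoder.
const raw = new URLSearchParams("q=c++");
console.log(JSON.stringify(raw.get("q"))); // "c  " (two spaces)

// Encoded as %2B, the plus signs survive.
const safe = new URLSearchParams("q=c%2B%2B");
console.log(safe.get("q")); // "c++"

// Hypothetical wrapper: report success or the error name instead of throwing.
function tryDecode(value) {
  try {
    return { ok: true, value: decodeURIComponent(value) };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

console.log(tryDecode("100%"));     // { ok: false, error: "URIError" }
console.log(tryDecode("100%25"));   // { ok: true, value: "100%" }
console.log(tryDecode("50%20off")); // { ok: true, value: "50 off" }, an escape nobody intended
```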
The %20 vs + ambiguity
This is the single most common source of encoding bugs, and it comes from two different specifications disagreeing about how to represent a space.
- In a generic URI (the path, and the query string under RFC 3986) a space is %20. Full stop.
- In application/x-www-form-urlencoded data (the body of an HTML form POST, and by long convention the query string when it carries form fields) a space is +. A literal + in that context is %2B.
So q=hello world can correctly appear as either:
?q=hello%20world generic URI rules
?q=hello+world form-encoded rules
Both are valid; what matters is that the encoder and the decoder
agree. The bug appears when one side encodes spaces as + (form
rules) and the other decodes with generic URI rules, leaving literal
+ characters in the data — or when a value containing a real +
(a phone number +1 555..., a search for c++) is decoded by
something that turns + into a space.
Practical rule: if you control both ends and are building a query
string by hand, prefer %20 and encode literal + as %2B. If you
are submitting a real form body, the browser will use + and your
server framework expects it.
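To see the two conventions side by side, the following sketch encodes the same value both ways and then decodes with each kind of decoder. It assumes URLSearchParams is available and uses it as a stand-in for a form-rules encoder and decoder.

```javascript
const value = "hello world";

// Generic URI rules: a space becomes %20.
const generic = encodeURIComponent(value);
// Form rules: URLSearchParams serializes a space as +.
const form = new URLSearchParams({ q: value }).toString();

console.log(generic); // "hello%20world"
console.log(form);    // "q=hello+world"

// Mismatch: a generic decoder leaves the + in the data.
console.log(decodeURIComponent("hello+world")); // "hello+world"

// A form decoder accepts both spellings of a space.
console.log(new URLSearchParams("q=hello+world").get("q"));   // "hello world"
console.log(new URLSearchParams("q=hello%20world").get("q")); // "hello world"
```

This is why %20 is the safer choice when you are unsure what sits on the other end: a form decoder still reads it as a space, while + only works if the decoder uses form rules.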
encodeURIComponent vs encodeURI in JavaScript
JavaScript ships two encoders, and they exist for different jobs.
- encodeURI(url) is for encoding a whole URL that is already structurally complete. It leaves the reserved structural characters intact (; / ? : @ & = + $ , #) because those are doing their job as delimiters.
- encodeURIComponent(value) is for encoding a single piece of data that will be dropped into a URL: one path segment, one query value. It encodes the reserved structural characters too, because in a value they are data, not structure.
encodeURI("https://x.com/a b?q=c/d&e=f")
// "https://x.com/a%20b?q=c/d&e=f" (slashes, ?, & left intact)
encodeURIComponent("c/d&e=f")
// "c%2Fd%26e%3Df" (everything escaped)
Use encodeURIComponent for every individual query value and path
segment. Use encodeURI only when you have a complete URL string and
just want to escape stray spaces and non-ASCII without touching its
structure — a narrower need than most people assume.
One gotcha: encodeURIComponent always writes a space as %20, never
as +, and it does escape a literal + to %2B. That output is safe
under both conventions, because form decoders also accept %20 as a
space. The function that leaves + alone is encodeURI, so a value
passed through encodeURI, or concatenated into a query string by
hand, can carry a raw + that a form-decoding server turns into a
space. If you need exact form-style output, convert only the spaces:
encodeURIComponent("a b+c").replace(/%20/g, "+") // "a+b%2Bc", form-style
encodeURIComponent("a+b") // "a%2Bb", literal + already protected
Pick one convention deliberately rather than relying on the default.
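If form-style output is the convention you pick, it can be wrapped in one small function so the choice is made in a single place. This is a minimal sketch: formEncodeValue is a hypothetical name, and URLSearchParams stands in for a form-decoding server.

```javascript
// Hypothetical helper: encode one value using form rules
// (space -> +, literal + -> %2B).
function formEncodeValue(value) {
  // encodeURIComponent has already turned + into %2B and space into %20,
  // so only the spaces need converting.
  return encodeURIComponent(value).replace(/%20/g, "+");
}

const encoded = formEncodeValue("a b+c");
console.log(encoded); // "a+b%2Bc"

// Round trip through a form-rules decoder restores the original.
console.log(new URLSearchParams("v=" + encoded).get("v")); // "a b+c"
```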
Encode the path and the query differently
The path and the query string have different reserved sets, so encode
them separately rather than running one function over the whole
string. In a path segment, / is a delimiter and must be %2F when
it is part of a single segment's value; + and = are ordinary data.
In the query, & and = are delimiters and must be encoded inside
values; / is usually allowed raw.
The safe approach is to build the URL from already-encoded parts: run
encodeURIComponent on each path segment and each query value
individually, then join them with the raw delimiters you control.
Never encode the assembled string a second time — which is exactly
where the next bug comes from. If what you actually need is a
URL-safe identifier for a path (no % escapes at all), normalize to a
slug instead with our URL slug generator.
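The build-from-encoded-parts approach might look like the sketch below. buildUrl and its parameters (origin, segments, params) are hypothetical names chosen for illustration, and the function assumes origin is already a valid scheme-plus-host string.

```javascript
// Hypothetical helper: encode each part exactly once, then join with raw delimiters.
function buildUrl(origin, segments, params) {
  const path = segments.map((segment) => encodeURIComponent(segment)).join("/");
  const query = Object.entries(params)
    .map(([key, val]) => `${encodeURIComponent(key)}=${encodeURIComponent(val)}`)
    .join("&");
  return `${origin}/${path}` + (query ? `?${query}` : "");
}

console.log(buildUrl("https://example.com", ["files", "a/b c"], { q: "x&y=z" }));
// "https://example.com/files/a%2Fb%20c?q=x%26y%3Dz"
```

The / inside the second segment and the & and = inside the query value come out escaped, while the delimiters added by the join calls stay raw.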
The double-encoding bug
Because % itself encodes to %25, running an encoder over an
already-encoded string mangles it. hello world encodes once to
hello%20world. Encode that result again and the % in %20 becomes
%25, producing hello%2520world. Decode that string once and you
get the literal text hello%20world instead of hello world.
The signature to look for is %25 followed by what should have been
a single escape:
hello world original
hello%20world encoded once (correct)
hello%2520world encoded twice (bug — %20 became %2520)
hello%252520... encoded three times
It happens whenever encoding runs at more than one layer and nobody tracks how many times a string has been touched: a frontend encodes a query value, an API gateway or reverse proxy re-encodes the forwarded URL, and a backend framework encodes it a third time before storing or redirecting. Each layer is individually "correct"; the stack is wrong.
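The layering effect can be reproduced by calling the encoder on its own output, which is effectively what each extra layer does:

```javascript
const original = "hello world";

const once = encodeURIComponent(original); // frontend
const twice = encodeURIComponent(once);    // a second layer re-encodes
const thrice = encodeURIComponent(twice);  // and a third

console.log(once);   // "hello%20world"
console.log(twice);  // "hello%2520world"
console.log(thrice); // "hello%252520world"

// A single decode of the double-encoded value does not get back to the original.
console.log(decodeURIComponent(twice)); // "hello%20world"
```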
To detect it, decode a suspicious value once and check whether the
result still contains %XX escapes. If a single decode leaves visible
%20 or %2F in your data, it was encoded at least twice. The fix is
architectural, not a second replace: encode exactly once, at the
boundary where raw data becomes a URL, and treat the value as opaque
everywhere after that. Strip a layer rather than adding one — never
"fix" double-encoding by decoding twice, because a value that
legitimately contains %25 will be corrupted by the extra pass.
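The detection step can be sketched as a small check for use in logs or tests. looksDoubleEncoded is a hypothetical helper, and it is only a heuristic: a value whose real content includes text such as %20 will also be flagged, so treat a hit as a reason to inspect the pipeline rather than as proof.

```javascript
// Hypothetical helper: decode once and see whether escapes are still visible.
function looksDoubleEncoded(value) {
  let once;
  try {
    once = decodeURIComponent(value);
  } catch {
    return false; // malformed escape: not a cleanly encoded value at all
  }
  return /%[0-9A-Fa-f]{2}/.test(once);
}

console.log(looksDoubleEncoded("hello%20world"));   // false
console.log(looksDoubleEncoded("hello%2520world")); // true
console.log(looksDoubleEncoded("a%252Fb"));         // true (decodes to "a%2Fb")
```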
The same discipline applies to other transport encodings; the failure mode where a layer mistakes encoded data for plaintext also drives the confusion covered in "Base64 is not encryption".
Summary
Percent-encoding maps unsafe characters to % plus the hex of their
UTF-8 bytes. Encode reserved characters whenever they appear inside a
value, remember that space is %20 under generic URI rules but +
under form rules, reach for encodeURIComponent on individual
components and encodeURI only on complete URLs, and encode each
value exactly once to avoid the %2520 family of bugs.
To encode or decode a value and see exactly which bytes change — and to catch double-encoding by decoding a layer at a time — use our URL encoder/decoder.