Slugifying URLs: Unicode, diacritics, and collisions
How to turn a title into a URL-safe slug: the lowercase-normalize-transliterate pipeline, why diacritics and non-Latin scripts break it, and how to handle collisions.
A slug is the human-readable, URL-safe identifier you derive from a
title: "Café René!" becomes cafe-rene. It exists so a URL can be
typed, shared, indexed, and remembered without %XX escapes, and so
the path itself carries meaning. Generating one is deceptively simple
in the common case and full of edge cases the moment your input leaves
ASCII. This post walks the pipeline step by step, then covers the two
things that actually bite in production: scripts the naive approach
can't transliterate, and collisions.
What a slug is for
A slug is a stable, opaque-enough key that happens to be readable. Three properties matter:
- URL-safe: every character is in the unreserved set (A-Z a-z 0-9 - _ . ~), so no percent-encoding is needed.
- Readable: /posts/cafe-rene tells a human and a search crawler what the page is. /posts/8a3f does not.
- Stable: once minted, it never changes, because changing it breaks every link and every backlink that points at it.
That last property is the one people get wrong, and it's covered in its own section below.
The pipeline, step by step
A slugifier is a small ordered transform. Order matters — normalize before you strip, strip before you collapse.
"Café René! — 100% done"
1. lowercase → "café rené! — 100% done"
2. NFKD normalize → "café rené! — 100% done" (é now = e + ´)
3. strip combining marks → "cafe rene! — 100% done"
4. transliterate rest → "cafe rene! - 100% done" (— → -)
5. non-alnum → hyphen → "cafe-rene----100--done"
6. collapse + trim → "cafe-rene-100-done"
The same steps in Python (transliterate is the mapping step covered in the next section; the regex in step 5 already merges runs of junk, so the collapse in step 6 is a safety net):
import re
import unicodedata

def slugify(text, maxlen=80):
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)      # decompose
    text = "".join(c for c in text                  # drop combining marks
                   if unicodedata.category(c) != "Mn")
    text = transliterate(text)                      # ø, ł, ß, CJK, …
    text = re.sub(r"[^a-z0-9]+", "-", text)         # everything else → -
    text = re.sub(r"-{2,}", "-", text).strip("-")   # collapse + trim
    return text[:maxlen].rstrip("-")                # cap length, no trailing hyphen
category(c) != "Mn" is the load-bearing line. NFKD splits é into a
plain e followed by U+0301 (combining acute accent), which has
Unicode category Mn (Mark, nonspacing). Stripping the Mn marks
leaves the bare base letters. NFD works identically for accents; NFKD
additionally folds compatibility forms (the ﬁ ligature, U+FB01, → fi;
full-width digits → ASCII digits), which is usually what you want for a slug.
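A quick way to see the decomposition for yourself, using only Python's standard unicodedata module. The helper name strip_marks is made up for this illustration; it is just steps 2 and 3 of the pipeline with the normalization form as a parameter.

```python
import unicodedata

def strip_marks(text: str, form: str = "NFKD") -> str:
    """Normalize with the given form, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

# é (U+00E9) decomposes to e followed by U+0301, and U+0301 has category Mn.
decomposed = unicodedata.normalize("NFKD", "\u00e9")
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']
print(unicodedata.category("\u0301"))           # Mn

# NFD and NFKD agree on accents; only NFKD folds compatibility forms.
print(strip_marks("caf\u00e9", "NFD"))       # cafe
print(strip_marks("\ufb01n", "NFD"))         # ligature survives unchanged
print(strip_marks("\ufb01n", "NFKD"))        # fin
print(strip_marks("\uff11\uff12", "NFKD"))   # 12
```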
The diacritics problem is bigger than diacritics
Normalize-and-strip works because accented Latin letters decompose into base + mark. A large class of characters does not decompose at all, and for those the pipeline silently produces nothing.
| Input | After NFKD + strip | Why |
|---|---|---|
| é à ü ñ | e a u n | decompose to base + Mn mark |
| ø | ø | no decomposition; it's an atomic letter |
| ł | ł | atomic Polish L-with-stroke |
| ß | ß | atomic; needs a special case → ss |
| Æ Œ | Æ Œ | atomic ligatures → ae oe |
| 日本語 | 日本語 | CJK; no Latin form at all |
| Привет | Привет | Cyrillic; needs romanization |
| مرحبا | مرحبا | Arabic; needs romanization |
Everything below the first row survives normalization untouched, then
gets wiped by the [^a-z0-9]+ step. The failure is quiet and total:
slugify("Smørrebrød") → "sm-rrebr-d" (ø replaced by a hyphen, not transliterated)
slugify("Łódź") → "odz" (ł dropped; ó and ź decompose to o and z)
slugify("日本語") → "" ← empty slug
slugify("Привет мир") → "" ← empty slug
An empty slug is a real bug: you end up with /posts/ or a route that
collides with every other untranslatable title. The fix is a
transliteration step that runs after mark-stripping and maps the
atomic characters explicitly:
TRANSLIT = {"ø": "o", "ł": "l", "ß": "ss", "æ": "ae",
"œ": "oe", "đ": "d", "þ": "th", "ð": "d"}
For non-Latin scripts you need a romanization table per script —
日本語 → nihongo (or ri-ben-yu, depending on your romanization
choice), Cyrillic via a GOST/BGN table, Arabic via a standard
transliteration. There is no single correct answer; Japanese alone has
Hepburn vs. Kunrei, and Chinese has Pinyin with or without tone marks.
Libraries that ship these tables (the various slugify/unidecode
families) make a choice for you, which is fine until your audience
expects a different one. The honest position: full-coverage
transliteration of arbitrary Unicode is a localization problem, not a
string-cleaning problem, and a slugifier can only approximate it.
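For the Latin atomics, at least, wiring the table in is short. The sketch below is one possible body for transliterate, not a reference implementation: it uses str.translate with a table built by str.maketrans, which handles one-to-many mappings such as ß → ss. It covers only the characters in the table above and deliberately does nothing for other scripts.

```python
import re
import unicodedata

# Atomic Latin letters that NFKD does not decompose (same table as in the text).
TRANSLIT = {"ø": "o", "ł": "l", "ß": "ss", "æ": "ae",
            "œ": "oe", "đ": "d", "þ": "th", "ð": "d"}
_TABLE = str.maketrans(TRANSLIT)

def transliterate(text: str) -> str:
    """Map atomic letters to ASCII; anything not in the table passes through."""
    return text.translate(_TABLE)

def slugify(text: str, maxlen: int = 80) -> str:
    text = text.lower()                      # so only lowercase keys are needed
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = transliterate(text)
    text = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
    return text[:maxlen].rstrip("-")

print(slugify("Smørrebrød"))  # smorrebrod
print(slugify("Łódź"))        # lodz
print(slugify("Straße"))      # strasse
print(repr(slugify("日本語")))  # '' because no romanization table is plugged in
```

Because the input is lowercased first, the table only needs lowercase keys. A romanization library would slot in at the same point in the pipeline.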
If transliteration produces an empty result, fall back to a generated identifier (a short hash or a counter) rather than emitting an empty slug.
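A minimal sketch of that fallback, assuming a short SHA-256 prefix is an acceptable identifier. The names basic_slug and slug_or_fallback and the "post" prefix are invented for the example, and basic_slug is a cut-down pipeline with no transliteration step so the empty case is easy to trigger.

```python
import hashlib
import re
import unicodedata

def basic_slug(text: str) -> str:
    """Cut-down pipeline: lowercase, NFKD, strip marks, hyphenate. No transliteration."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")

def slug_or_fallback(title: str, prefix: str = "post") -> str:
    """Return the slug, or prefix plus a short hash of the title when the slug is empty."""
    slug = basic_slug(title)
    if slug:
        return slug
    digest = hashlib.sha256(title.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}-{digest}"

print(slug_or_fallback("Café René"))  # cafe-rene
print(slug_or_fallback("日本語"))       # post- followed by 8 hex characters
```

Hashing the title keeps the fallback deterministic, but two records with the same untranslatable title still get the same value, so it does not replace collision handling. Hashing a record ID instead is a reasonable alternative if one exists at slug time.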
Collisions
Distinct titles routinely collapse to the same slug, because slugifying is lossy by design:
slugify("C++") → "c"
slugify("C#") → "c"
slugify("C") → "c"
slugify("Node.js") → "node-js"
slugify("Node JS") → "node-js"
You cannot prevent collisions by being cleverer about the transform — the information that distinguished the inputs is exactly the punctuation you're throwing away. Handle them at write time instead:
- Check-and-suffix: query for the candidate slug; if taken, append -2, -3, … until one is free. Readable, but requires a uniqueness check and a retry loop.
- Short hash suffix: append a few characters of a hash of the unique key (node-js-7f3a). Almost always unique in one shot, slightly less pretty. A truncated hash can still collide, so keep a uniqueness check behind it.
- Unique constraint + retry: let the database enforce uniqueness and retry on conflict. The only race-safe option under concurrency.
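The first two strategies can be sketched without a database. Here a plain set stands in for the "is this slug taken?" query, and the key "post:1841" is a made-up example of a unique record key; both are assumptions for illustration.

```python
import hashlib

def check_and_suffix(candidate: str, taken: set[str]) -> str:
    """Return candidate, or candidate-2, candidate-3, ... whichever is free first."""
    if candidate not in taken:
        return candidate
    n = 2
    while f"{candidate}-{n}" in taken:
        n += 1
    return f"{candidate}-{n}"

def hash_suffix(candidate: str, unique_key: str, length: int = 4) -> str:
    """Append the first `length` hex characters of SHA-256(unique_key)."""
    digest = hashlib.sha256(unique_key.encode("utf-8")).hexdigest()
    return f"{candidate}-{digest[:length]}"

taken = {"node-js", "node-js-2"}
print(check_and_suffix("node-js", taken))   # node-js-3
print(check_and_suffix("c", taken))         # c
print(hash_suffix("node-js", "post:1841"))  # node-js- plus 4 hex characters
```

Note that check_and_suffix on its own is not safe under concurrency: two writers can both see the same slug as free before either one saves.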
Pick check-and-suffix for content where readability matters and writes are rare; pick the constraint-plus-retry for anything concurrent.
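For the constraint-plus-retry route, here is a sketch using the standard sqlite3 module with an in-memory database. The posts table, its columns, and the function name are hypothetical; the point is only that the UNIQUE constraint, not application code, decides who wins.

```python
import sqlite3

def insert_with_unique_slug(conn: sqlite3.Connection, title: str, base_slug: str,
                            max_attempts: int = 50) -> str:
    """Insert a post and let the UNIQUE constraint arbitrate; retry with -2, -3, ..."""
    for attempt in range(1, max_attempts + 1):
        slug = base_slug if attempt == 1 else f"{base_slug}-{attempt}"
        try:
            with conn:  # commits on success, rolls back if the insert raises
                conn.execute("INSERT INTO posts (title, slug) VALUES (?, ?)",
                             (title, slug))
            return slug
        except sqlite3.IntegrityError:
            continue  # slug already taken, possibly by a concurrent writer
    raise RuntimeError(f"no free slug for {base_slug!r} after {max_attempts} attempts")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts ("
             "id INTEGER PRIMARY KEY, title TEXT, slug TEXT NOT NULL UNIQUE)")
print(insert_with_unique_slug(conn, "C++", "c"))  # c
print(insert_with_unique_slug(conn, "C#", "c"))   # c-2
print(insert_with_unique_slug(conn, "C", "c"))    # c-3
```

There is no separate "is it free?" query to race against: the insert itself is the check.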
Stability: mint once, never regenerate
The most damaging slug bug is regenerating the slug whenever the title
is edited. Someone fixes a typo in "Cafe Renee" → "Café René", your
code recomputes the slug, the URL silently changes from cafe-renee to
cafe-rene, and every external link and every accumulated SEO
signal now points at a 404.
The rule: store the slug as its own column when the record is created, and treat it as immutable. Editing the title does not touch the slug. If you genuinely must change a slug, mint the new one, keep the old one, and serve a 301 redirect from old to new. A slug is part of your URL contract, not a derived view of the title.
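One way to picture the rule, as an in-memory sketch: a live-slug map plus a redirect map. SlugStore and its methods are hypothetical stand-ins for a slug column and a redirects table, not any particular framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class SlugStore:
    """In-memory stand-in for a posts table plus a slug-redirect table."""
    current: dict[str, int] = field(default_factory=dict)    # live slug -> post id
    redirects: dict[str, str] = field(default_factory=dict)  # old slug -> live slug

    def create(self, post_id: int, slug: str) -> None:
        if slug in self.current or slug in self.redirects:
            raise ValueError(f"slug {slug!r} is already in use")
        self.current[slug] = post_id

    def rename(self, old: str, new: str) -> None:
        """Deliberate slug change: the old slug stays alive as a redirect."""
        if new in self.current or new in self.redirects:
            raise ValueError(f"slug {new!r} is already in use")
        self.current[new] = self.current.pop(old)
        # Repoint earlier redirects so every old slug is one hop from the live one.
        for src, dst in self.redirects.items():
            if dst == old:
                self.redirects[src] = new
        self.redirects[old] = new

    def resolve(self, slug: str) -> tuple[int, str]:
        """Return (HTTP status, slug to serve or redirect to)."""
        if slug in self.current:
            return 200, slug
        if slug in self.redirects:
            return 301, self.redirects[slug]
        return 404, slug

store = SlugStore()
store.create(1, "cafe-rene")
store.rename("cafe-rene", "cafe-rene-bistro")
print(store.resolve("cafe-rene"))         # (301, 'cafe-rene-bistro')
print(store.resolve("cafe-rene-bistro"))  # (200, 'cafe-rene-bistro')
```

Editing a title never calls rename here; only an explicit decision to change the URL does, and retired slugs are never handed out again.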
Length, numbers, reserved words, stop words
A handful of smaller decisions round out a real implementation:
- Length: cap the slug (60–80 chars is typical) and trim on a hyphen boundary, then rstrip("-") so you never end on a hyphen.
- Leading digits: "2026 review" → 2026-review is fine for URLs, but if the slug is ever used as a programming identifier (CSS selectors for anchor IDs, generated variable names) a leading digit is invalid or needs escaping, so prefix it in those contexts.
- Reserved words: block slugs that collide with your own routes (new, edit, admin, api). A post titled "New" must not slug to a path your router already owns.
- Stop words: stripping the, a, of, and shortens slugs and is common in CMSs, but it's a trade-off: it hurts readability for short titles ("The Office" → office) and only ever applies to one language. Most modern slugifiers leave stop words in.
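The first three items are small enough to sketch as post-processing helpers. The function names, the "-post" suffix, and the "s-" prefix are arbitrary choices for this example; the reserved set is the one from the list above.

```python
RESERVED = {"new", "edit", "admin", "api"}  # route names from the list above

def truncate_on_hyphen(slug: str, maxlen: int = 60) -> str:
    """Cap length, cutting at the last hyphen inside the limit when there is one."""
    if len(slug) <= maxlen:
        return slug
    cut = slug[:maxlen]
    # If the cut landed mid-word, back up to the previous hyphen.
    if slug[maxlen] != "-" and "-" in cut:
        cut = cut[:cut.rfind("-")]
    return cut.rstrip("-")

def avoid_reserved(slug: str, suffix: str = "-post") -> str:
    """Append a suffix when the slug equals one of the router's own path segments."""
    return slug + suffix if slug in RESERVED else slug

def as_identifier(slug: str, prefix: str = "s-") -> str:
    """Prefix slugs that start with a digit, for contexts that need a letter first."""
    return prefix + slug if slug[:1].isdigit() else slug

print(truncate_on_hyphen("the-quick-brown-fox", maxlen=12))  # the-quick
print(avoid_reserved("new"))                                 # new-post
print(as_identifier("2026-review"))                          # s-2026-review
```

A single word longer than the cap has no hyphen to back up to, so it is simply cut at maxlen.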
Slugs versus percent-encoding
A slug and percent-encoding solve overlapping problems differently. If
you keep the original title in the path, a URL encoder escapes the
unsafe bytes: Café René becomes Caf%C3%A9%20Ren%C3%A9 — correct,
lossless, but unreadable and ugly in a share preview. A slug instead
stays inside the unreserved set so no escaping ever happens, at the
cost of being lossy. They're the two ends of the same trade: preserve
fidelity and escape, or preserve readability and discard. The
percent-encoding write-up covers the
escaping side in detail, and if you've ever wondered why
Base64 isn't encryption, it's the
same theme — making bytes safe to transport is not the same as making
them mean less.
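The two ends of that trade are easy to see side by side with urllib.parse from the Python standard library:

```python
from urllib.parse import quote, unquote

title = "Café René"
encoded = quote(title)            # UTF-8 bytes, then %XX for anything unsafe
print(encoded)                    # Caf%C3%A9%20Ren%C3%A9
print(unquote(encoded) == title)  # True: the original is fully recoverable

slug = "cafe-rene"
print(quote(slug) == slug)        # True: nothing in a slug needs escaping
```

The encoded form round-trips exactly; the slug cannot be turned back into the title, which is why the title has to be stored separately.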
For one-off conversions, our URL slug generator runs this whole pipeline — normalize, strip, transliterate, collapse — so you can paste a title and see exactly what it resolves to, including the cases where a non-Latin title would otherwise slug to nothing.