Slugifying URLs: Unicode, diacritics, and collisions

How to turn a title into a URL-safe slug: the lowercase-normalize-transliterate pipeline, why diacritics and non-Latin scripts break it, and how to handle collisions.

A slug is the human-readable, URL-safe identifier you derive from a title: "Café René!" becomes cafe-rene. It exists so a URL can be typed, shared, indexed, and remembered without %XX escapes, and so the path itself carries meaning. Generating one is deceptively simple in the common case and full of edge cases the moment your input leaves ASCII. This post walks the pipeline step by step, then covers the two things that actually bite in production: scripts the naive approach can't transliterate, and collisions.

What a slug is for

A slug is a stable, opaque-enough key that happens to be readable. Three properties matter:

  • URL-safe: every character is in the unreserved set (A-Z a-z 0-9 - _ . ~), so no percent-encoding is needed.
  • Readable: /posts/cafe-rene tells a human and a search crawler what the page is. /posts/8a3f does not.
  • Stable: once minted, it never changes, because changing it breaks every link and every backlink that points at it.

That last property is the one people get wrong, and it's covered in its own section below.
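The URL-safe property is easy to check mechanically. A minimal sketch, assuming you want to validate a single path segment; the name is_url_safe is made up for this example:

```python
import re

# RFC 3986 unreserved characters: letters, digits, hyphen, period, underscore, tilde.
_UNRESERVED = re.compile(r"[A-Za-z0-9._~-]+")

def is_url_safe(segment: str) -> bool:
    """True if the segment is non-empty and needs no percent-encoding."""
    return _UNRESERVED.fullmatch(segment) is not None

print(is_url_safe("cafe-rene"))       # True
print(is_url_safe("caf\u00e9 ren\u00e9"))  # False: é and the space fall outside the set
```

A check like this is a useful assertion at the end of any slugifier: whatever the pipeline did, the output should pass it.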

The pipeline, step by step

A slugifier is a small ordered transform. Order matters — normalize before you strip, strip before you collapse.

"Café René! — 100% done" 
  1. lowercase             → "café rené! — 100% done"
  2. NFKD normalize        → "café rené! — 100% done"   (é now = e + ´)
  3. strip combining marks → "cafe rene! — 100% done"
  4. transliterate rest    → "cafe rene! - 100% done"   (— → -)
  5. non-alnum → hyphen    → "cafe-rene----100--done"
  6. collapse + trim       → "cafe-rene-100-done"

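The same steps in Python. transliterate is an identity placeholder here; a mapping for the atomic letters is introduced in the next section:

import re
import unicodedata

def transliterate(text):
    # Placeholder: returns text unchanged until a mapping table is plugged in.
    return text

def slugify(text, maxlen=80):
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)       # decompose: é → e + U+0301
    text = "".join(c for c in text                   # drop combining marks
                   if unicodedata.category(c) != "Mn")
    text = transliterate(text)                       # ø, ł, ß, CJK, …
    text = re.sub(r"[^a-z0-9]+", "-", text)          # everything else → -
    text = re.sub(r"-{2,}", "-", text).strip("-")    # collapse + trim
    return text[:maxlen].rstrip("-")                 # cap length, no trailing hyphen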

category(c) != "Mn" is the load-bearing line. NFKD splits é into a plain e followed by U+0301 (combining acute accent), which has Unicode category Mn (Mark, nonspacing). Stripping the Mn marks leaves the bare base letters. NFD works identically for accents; NFKD additionally folds compatibility forms (the ﬁ ligature → fi, full-width digits → ASCII digits), which is usually what you want for a slug.
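To see the decomposition for yourself, you can print the code points and categories that NFKD produces. This is a small sketch; decompose and strip_marks are helper names invented for the example:

```python
import unicodedata

def decompose(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode category) for each character after NFKD."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c))
            for c in unicodedata.normalize("NFKD", text)]

def strip_marks(text: str) -> str:
    """NFKD-normalize, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(decompose("\u00e9"))                # é precomposed → [('U+0065', 'Ll'), ('U+0301', 'Mn')]
print(strip_marks("caf\u00e9"))           # cafe
print(strip_marks("\ufb01\uff11\uff12"))  # ﬁ ligature + full-width 1 2 → fi12
```

The last line is the compatibility folding that NFD would not do: under NFD the ligature and the full-width digits come back unchanged.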

The diacritics problem is bigger than diacritics

Normalize-and-strip works because accented Latin letters decompose into base + mark. A large class of characters does not decompose at all, and for those the pipeline silently produces nothing.

Input     After NFKD + strip   Why
é à ü ñ   e a u n              decompose to base + Mn mark
ø         ø                    no decomposition; it is an atomic letter
ł         ł                    atomic Polish L-with-stroke
ß         ß                    atomic; needs a special case → ss
Æ Œ       Æ Œ                  atomic ligatures → ae oe
日本語     日本語                CJK: no Latin form at all
Привет    Привет               Cyrillic: needs romanization
مرحبا     مرحبا                Arabic: needs romanization

Every row after the first survives normalization untouched, then gets wiped by the [^a-z0-9]+ step. With no transliteration entry for these characters, the failure is quiet and, for whole scripts, total:

slugify("Smørrebrød")  → "sm-rrebr-d"   (each ø becomes a hyphen, not an o)
slugify("Łódź")        → "odz"          (ł dropped; ó and ź reduce to o and z)
slugify("日本語")       → ""             ← empty slug
slugify("Привет мир")  → ""             ← empty slug
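Because the failure is silent, it can help to surface the characters that will be lost before they reach the regex. A minimal diagnostic sketch; unhandled_chars is a hypothetical helper, not part of any library:

```python
import unicodedata

def unhandled_chars(text: str) -> list[str]:
    """Letters and digits still outside ASCII after lowercase + NFKD + mark stripping."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = (c for c in decomposed if unicodedata.category(c) != "Mn")
    # Punctuation and spaces are expected to become hyphens, so only
    # alphanumeric survivors are reported.
    return [c for c in stripped if c.isalnum() and not c.isascii()]

print(unhandled_chars("Café René!"))   # []
print(unhandled_chars("Smørrebrød"))   # ['ø', 'ø']
print(unhandled_chars("Łódź"))         # ['ł']
```

Logging this list for real titles is a cheap way to find out which entries your transliteration table is missing.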

An empty slug is a real bug: you end up with /posts/ or a route that collides with every other title that failed to transliterate. The fix is a transliteration step that runs after mark-stripping and maps the atomic characters explicitly:

TRANSLIT = {"ø": "o", "ł": "l", "ß": "ss", "æ": "ae",
            "œ": "oe", "đ": "d", "þ": "th", "ð": "d"}
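One way to apply a table like this is str.translate, which handles one-to-many mappings such as ß → ss. A sketch that assumes the input has already been lowercased and mark-stripped:

```python
# Table copied from the mapping above; extend it for your own content.
TRANSLIT = {"ø": "o", "ł": "l", "ß": "ss", "æ": "ae",
            "œ": "oe", "đ": "d", "þ": "th", "ð": "d"}

# str.maketrans accepts single-character keys and replacement strings of any length.
_TRANSLIT_TABLE = str.maketrans(TRANSLIT)

def transliterate(text: str) -> str:
    """Replace atomic letters using the table; leave everything else untouched."""
    return text.translate(_TRANSLIT_TABLE)

print(transliterate("smørrebrød"))  # smorrebrod
print(transliterate("straße"))      # strasse
print(transliterate("łodz"))        # lodz
```

Characters without an entry pass through unchanged, so they still reach the final regex and become hyphens.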

For non-Latin scripts you need a romanization table per script: 日本語 → nihongo (or ri-ben-yu if the table reads the characters as Chinese), Cyrillic via a GOST/BGN table, Arabic via a standard transliteration. There is no single correct answer; Japanese alone has Hepburn vs. Kunrei, and Chinese has Pinyin with or without tone marks. Libraries that ship these tables (the various slugify/unidecode families) make a choice for you, which is fine until your audience expects a different one. The honest position: full-coverage transliteration of arbitrary Unicode is a localization problem, not a string-cleaning problem, and a slugifier can only approximate it.

If transliteration produces an empty result, fall back to a generated identifier (a short hash or a counter) rather than emitting an empty slug.
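A sketch of that fallback, assuming each record has some stable unique key (a database ID, for instance). basic_slug is a compact stand-in for the pipeline with no transliteration table, and the "post" prefix and 8-character digest length are arbitrary choices:

```python
import hashlib
import re
import unicodedata

def basic_slug(text: str) -> str:
    """Lowercase, NFKD, strip Mn marks, hyphenate; no transliteration table."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")

def slug_or_fallback(title: str, unique_key: str, prefix: str = "post") -> str:
    """Return the title's slug, or prefix-<short hash of the key> if it is empty."""
    slug = basic_slug(title)
    if slug:
        return slug
    digest = hashlib.sha256(unique_key.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}-{digest}"

print(slug_or_fallback("Café René!", "41"))   # cafe-rene
print(slug_or_fallback("日本語", "42"))        # post- followed by 8 hex characters
```

Hashing the record's key rather than the title keeps the fallback stable if the title is later edited.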

Collisions

Distinct titles routinely collapse to the same slug, because slugifying is lossy by design:

slugify("C++")        → "c"
slugify("C#")         → "c"
slugify("C")          → "c"
slugify("Node.js")    → "node-js"
slugify("Node JS")    → "node-js"

You cannot prevent collisions by being cleverer about the transform — the information that distinguished the inputs is exactly the punctuation you're throwing away. Handle them at write time instead:

  • Check-and-suffix: query for the candidate slug; if taken, append -2, -3, … until one is free. Readable, but requires a uniqueness check and a retry loop.
  • Short hash suffix: append a few characters of a hash of the unique key (node-js-7f3a). Always unique in one shot, slightly less pretty.
  • Unique constraint + retry: let the database enforce uniqueness and retry on conflict. The only race-safe option under concurrency.
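The first two strategies can be sketched as pure functions. Here the set of taken slugs stands in for a database lookup, and the function names are illustrative only:

```python
import hashlib

def check_and_suffix(candidate: str, taken: set[str]) -> str:
    """Return candidate, or candidate-2, candidate-3, ... until one is free."""
    if candidate not in taken:
        return candidate
    n = 2
    while f"{candidate}-{n}" in taken:
        n += 1
    return f"{candidate}-{n}"

def hash_suffix(candidate: str, unique_key: str, length: int = 4) -> str:
    """Append the first `length` hex characters of a hash of the record's key."""
    digest = hashlib.sha256(unique_key.encode("utf-8")).hexdigest()[:length]
    return f"{candidate}-{digest}"

print(check_and_suffix("c", set()))          # c
print(check_and_suffix("c", {"c", "c-2"}))   # c-3
print(hash_suffix("node-js", "post:17"))     # node-js- followed by 4 hex characters
```

Note that check_and_suffix on its own is not safe under concurrent writes: two requests can both see the same slug as free before either one saves.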

Pick check-and-suffix for content where readability matters and writes are rare; pick the constraint-plus-retry for anything concurrent.
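For the constraint-plus-retry approach, here is a minimal sketch using SQLite from the standard library. The posts table, its columns, and the insert_with_unique_slug helper are assumptions made for the example; the point is that the UNIQUE constraint, not application code, decides who wins:

```python
import sqlite3

def insert_with_unique_slug(conn, title, base_slug, max_attempts=50):
    """Insert a post, letting the UNIQUE constraint reject taken slugs."""
    for attempt in range(1, max_attempts + 1):
        slug = base_slug if attempt == 1 else f"{base_slug}-{attempt}"
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO posts (title, slug) VALUES (?, ?)",
                    (title, slug),
                )
            return slug
        except sqlite3.IntegrityError:
            continue  # slug already taken: try the next suffix
    raise RuntimeError(f"no free slug for {base_slug!r} after {max_attempts} attempts")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, slug TEXT NOT NULL UNIQUE)"
)
first = insert_with_unique_slug(conn, "C++", "c")
second = insert_with_unique_slug(conn, "C#", "c")
print(first, second)  # c c-2
```

There is no separate "is it free?" query, so there is no window between checking and writing for another request to slip into.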

Stability: mint once, never regenerate

The most damaging slug bug is regenerating the slug whenever the title is edited. Someone fixes a typo in the title ("Caffe Rene" → "Café René"), your code recomputes the slug, the URL silently changes from caffe-rene to cafe-rene, and every external link and every accumulated SEO signal now points at a 404.

The rule: store the slug as its own column when the record is created, and treat it as immutable. Editing the title does not touch the slug. If you genuinely must change a slug, mint the new one, keep the old one, and serve a 301 redirect from old to new. A slug is part of your URL contract, not a derived view of the title.
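The mint-keep-redirect rule can be modelled with two lookups: live slugs and retired ones. This is an in-memory sketch; SlugRegistry and its methods are hypothetical names, and a real system would back them with database tables:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SlugRegistry:
    current: dict[str, int] = field(default_factory=dict)   # live slug -> post id
    retired: dict[str, str] = field(default_factory=dict)   # old slug -> live slug

    def mint(self, slug: str, post_id: int) -> None:
        """Register a slug once; retired slugs are never reused."""
        if slug in self.current or slug in self.retired:
            raise ValueError(f"slug already used: {slug}")
        self.current[slug] = post_id

    def rename(self, old: str, new: str) -> None:
        """Mint the new slug and keep the old one as a redirect source."""
        if new in self.current or new in self.retired:
            raise ValueError(f"slug already used: {new}")
        self.current[new] = self.current.pop(old)
        # Repoint earlier redirects so every old slug stays one hop from the live one.
        for source, target in list(self.retired.items()):
            if target == old:
                self.retired[source] = new
        self.retired[old] = new

    def resolve(self, slug: str) -> tuple[int, Optional[str]]:
        """Return (HTTP status, slug to serve or redirect to)."""
        if slug in self.current:
            return (200, slug)
        if slug in self.retired:
            return (301, self.retired[slug])
        return (404, None)

registry = SlugRegistry()
registry.mint("cafe-rene", 1)
registry.rename("cafe-rene", "cafe-rene-paris")
print(registry.resolve("cafe-rene"))        # (301, 'cafe-rene-paris')
print(registry.resolve("cafe-rene-paris"))  # (200, 'cafe-rene-paris')
```

Nothing in this model reads the title, which is the point: editing a title cannot change a URL.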

Length, numbers, reserved words, stop words

A handful of smaller decisions round out a real implementation:

  • Length: cap the slug (60–80 chars is typical) and trim on a hyphen boundary, then rstrip("-") so you never end on a hyphen.
  • Leading digits: "2026 review" → 2026-review is fine for URLs, but if the slug is ever used as a programming identifier (anchor IDs, generated variable names) a leading digit can be invalid, so prefix it in that case.
  • Reserved words: block slugs that collide with your own routes (new, edit, admin, api). A post titled "New" must not slug to a path your router already owns.
  • Stop words: stripping the, a, of, and shortens slugs and is common in CMSs, but it is a trade-off: it hurts readability for short titles ("The Office" → office), and any stop-word list only covers one language. Most modern slugifiers leave stop words in.
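The first three items translate into small post-processing helpers. A sketch under stated assumptions: the RESERVED set mirrors the example route names above, the 60-character cap is one point in the typical range, and the "id-" prefix is an arbitrary choice:

```python
RESERVED = {"new", "edit", "admin", "api"}  # example route names; use your router's

def truncate_on_hyphen(slug: str, maxlen: int = 60) -> str:
    """Cap the length, backing up to a hyphen so no word is cut in half."""
    if len(slug) <= maxlen:
        return slug
    cut = slug[:maxlen]
    # If the cut landed mid-word, drop the partial word.
    if slug[maxlen] != "-" and "-" in cut:
        cut = cut[:cut.rfind("-")]
    return cut.rstrip("-")

def is_reserved(slug: str) -> bool:
    """True if the slug would shadow one of the application's own routes."""
    return slug in RESERVED

def as_identifier(slug: str, prefix: str = "id-") -> str:
    """Prefix slugs that start with a digit, for use as anchor IDs and similar."""
    return prefix + slug if slug[:1].isdigit() else slug

print(truncate_on_hyphen("cafe-rene-100-done", maxlen=12))  # cafe-rene
print(is_reserved("new"))                                   # True
print(as_identifier("2026-review"))                         # id-2026-review
```

What to do with a reserved slug is a policy decision; treating it like a collision and suffixing it is one option.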

Slugs versus percent-encoding

A slug and percent-encoding solve overlapping problems differently. If you keep the original title in the path, a URL encoder escapes the unsafe bytes: Café René becomes Caf%C3%A9%20Ren%C3%A9 — correct, lossless, but unreadable and ugly in a share preview. A slug instead stays inside the unreserved set so no escaping ever happens, at the cost of being lossy. They're the two ends of the same trade: preserve fidelity and escape, or preserve readability and discard. The percent-encoding write-up covers the escaping side in detail, and if you've ever wondered why Base64 isn't encryption, it's the same theme — making bytes safe to transport is not the same as making them mean less.
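The escaping side of that trade is a one-liner with Python's standard library, and unlike a slug it round-trips. A short illustration using urllib.parse:

```python
from urllib.parse import quote, unquote

title = "Caf\u00e9 Ren\u00e9"      # "Café René", with é as a single code point
encoded = quote(title)             # UTF-8 encode, then write each unsafe byte as %XX

print(encoded)                     # Caf%C3%A9%20Ren%C3%A9
print(unquote(encoded) == title)   # True: percent-encoding is lossless
```

There is no equivalent of unquote for a slug: cafe-rene cannot tell you whether the title had accents, capitals, or an exclamation mark.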

For one-off conversions, our URL slug generator runs this whole pipeline — normalize, strip, transliterate, collapse — so you can paste a title and see exactly what it resolves to, including the cases where a non-Latin title would otherwise slug to nothing.