How to use
Paste or type into the textarea and the five counts update live: total characters (counting Unicode code points, not bytes), characters excluding whitespace, words (whitespace-delimited runs), lines (any of `\r\n`, `\r`, or `\n` as a separator), and bytes (UTF-8 encoded). All counts react instantly — no submit button, no server round-trip.
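The five counts described above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual source: the function name `countText` and the choice to report 0 words for empty or whitespace-only input are assumptions.

```javascript
// Hypothetical helper illustrating the five counts; the real tool may differ.
function countText(text) {
  // The string iterator yields Unicode code points, not UTF-16 code units.
  const codePoints = [...text];
  const characters = codePoints.length;

  // Drop every code point that matches the \s whitespace class.
  const charactersNoWhitespace = codePoints.filter((ch) => !/\s/.test(ch)).length;

  // Words are whitespace-delimited runs; empty input has none (assumption).
  const trimmed = text.trim();
  const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;

  // Any of \r\n, \r, or \n separates lines; \r\n is tried first so it counts once.
  const lines = text.split(/\r\n|\r|\n/).length;

  // TextEncoder always encodes to UTF-8.
  const bytes = new TextEncoder().encode(text).length;

  return { characters, charactersNoWhitespace, words, lines, bytes };
}

console.log(countText("hello wörld\nsecond line"));
// { characters: 23, charactersNoWhitespace: 20, words: 4, lines: 2, bytes: 24 }
```

Wiring this to a textarea's `input` event would give the live-updating behavior, with no network request involved.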
Reach for this when length matters: writing a meta description (Google truncates at ~155–160 characters), a Twitter post (280 chars with emoji-aware counting), an SMS (160 GSM-7 or 70 UCS-2 per segment), a YouTube comment (10,000 chars), or a database column with a strict `VARCHAR(255)` limit. The byte count specifically helps when a "character" limit is actually a UTF-8 byte limit in disguise, which is common in older MySQL columns and some API quotas. Korean and Japanese text expands to 3 bytes per character in UTF-8, so a 100-character Korean paragraph is ~300 bytes.
Examples
A typical meta description
Input
Open-source web utilities — JSON, Base64, UUID, regex, and more. Everything runs in your browser, no upload required, no signup.
Output
Characters: 128
Characters (no WS): 109
Words: 20
Lines: 1
Bytes (UTF-8): 130
At 128 characters this fits Google's ~160-character soft cap with room to spare. The byte count is 2 higher than the character count because the em dash (`—`, U+2014) takes 3 bytes in UTF-8; every other character is ASCII and fits in 1 byte. If the same sentence were translated into Korean, the character count would likely drop, but each Hangul syllable would cost 3 bytes instead of 1.
Korean / Japanese text — bytes ≠ characters
Input
안녕하세요, 오늘 날씨가 참 좋네요.
Output
Characters: 20
Characters (no WS): 16
Words: 5
Lines: 1
Bytes (UTF-8): 48
20 visible characters, 48 UTF-8 bytes (2.4× expansion). Each Korean syllable block (`안`, `녕`, `하`, etc.) takes 3 bytes in UTF-8; the punctuation and the spaces take 1 byte each, so 14 × 3 + 6 = 48. Whether a `VARCHAR(20)` column accepts this string depends on whether the database measures characters (PostgreSQL) or bytes (older MySQL `latin1`, GBK encodings). Always check.
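A quick way to see the difference is to test the same string against a limit measured both ways. The sketch below is illustrative only: `fitsLimit` and its `unit` parameter are hypothetical names, and a real database applies its own rules depending on column charset and version.

```javascript
// Hypothetical check: does `text` fit a limit measured in characters or bytes?
function fitsLimit(text, limit, unit) {
  const length =
    unit === "bytes"
      ? new TextEncoder().encode(text).length // UTF-8 byte length
      : [...text].length; // Unicode code points
  return length <= limit;
}

const greeting = "안녕하세요, 오늘 날씨가 참 좋네요.";
console.log(fitsLimit(greeting, 20, "characters")); // true
console.log(fitsLimit(greeting, 20, "bytes")); // false
```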
Emoji and combining marks
Output
Characters: 18
Characters (no WS): 14
Words: 3
Bytes (UTF-8): 45
The family emoji `👨‍👩‍👧` is **5 code points** (3 person emoji + 2 zero-width joiners, with skin-tone-free defaults) but renders as one glyph. The `é` in `café` is one code point if precomposed, two if written as `e` + combining acute accent. `[...text].length` counts code points; `text.length` would count UTF-16 code units (8 for the family alone). Twitter, Discord, and most modern systems use a grapheme-cluster count that treats the family as 1; neither of our counts matches that.
FAQ
Why is the byte count more than the character count?
UTF-8 uses 1 byte for ASCII (U+0000–U+007F), 2 bytes for the next ~1900 characters (Latin extended, Greek, Cyrillic, Hebrew, Arabic), 3 bytes for the rest of the BMP (Korean, Chinese, Japanese, most other scripts), and 4 bytes for supplementary planes (most emoji, ancient scripts, rare CJK). A pure-ASCII English sentence: 1 byte per character. A Korean sentence: 3 bytes per character. A sentence full of emoji: 4 bytes per visible glyph plus overhead for ZWJ sequences. The byte count is what your storage layer and many APIs measure.
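The four ranges above translate directly into a lookup by code point value. This is a sketch for illustration (the function names are hypothetical); in practice `new TextEncoder().encode(text).length` gives the same number with less code.

```javascript
// Bytes needed to encode one code point in UTF-8.
function utf8Width(codePoint) {
  if (codePoint <= 0x7f) return 1; // ASCII
  if (codePoint <= 0x7ff) return 2; // Latin extended, Greek, Cyrillic, Hebrew, Arabic
  if (codePoint <= 0xffff) return 3; // rest of the BMP, including most CJK
  return 4; // supplementary planes, including most emoji
}

// Sum the widths over every code point in the string.
function utf8Length(text) {
  let total = 0;
  for (const ch of text) {
    total += utf8Width(ch.codePointAt(0));
  }
  return total;
}

console.log(utf8Length("a")); // 1
console.log(utf8Length("é")); // 2
console.log(utf8Length("한")); // 3
console.log(utf8Length("😀")); // 4
```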
Why does my emoji count look high?
The counter measures Unicode *code points*, not visible *graphemes*. A single emoji glyph like `👨👩👧👦` (family of four) is actually 7 code points (4 person emoji + 3 zero-width joiners). The grapheme cluster — what a user sees as "one character" — needs the `Intl.Segmenter` API (or libraries like `grapheme-splitter`) to compute. Twitter, Discord, and modern social platforms count graphemes, not code points. If you need that count, paste into Twitter's draft box to see the real number, or run `Array.from(new Intl.Segmenter("en", { granularity: "grapheme" }).segment(text)).length` in DevTools.
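To compare the three measures side by side, the sketch below builds the family-of-four emoji from explicit escapes so the zero-width joiners are visible in the source. The `measure` helper is a hypothetical name, and `Intl.Segmenter` is assumed to be available (recent browsers and Node.js 16 or later).

```javascript
// Man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

function measure(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    codeUnits: text.length, // UTF-16 code units
    codePoints: [...text].length, // Unicode code points
    graphemes: [...segmenter.segment(text)].length, // user-perceived characters
  };
}

console.log(measure(family));
// { codeUnits: 11, codePoints: 7, graphemes: 1 }
```

Each person emoji sits in a supplementary plane and takes 2 code units, while each joiner takes 1, which is where 4 × 2 + 3 = 11 comes from.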
How are words counted when there are no spaces (Chinese, Japanese)?
They are not. The counter splits on whitespace, so a Japanese sentence with no spaces counts as 1 word regardless of length. True word segmentation needs a morphological analyzer (kuromoji for Japanese, jieba for Chinese, KoNLPy for Korean) and is language-specific. For CJK content, the character count is the meaningful measure of length — it matches how publishers (NHK, Asahi, Yomiuri) historically billed translation work and how Twitter weights CJK posts (each character counts as 2 toward the 280 limit, so the effective limit is 140 CJK characters).
What are the common length limits I should know?
**SEO meta description**: ~155–160 characters before Google truncates. **HTML `<title>`**: ~60 characters for desktop, ~50 for mobile. **Twitter / X post**: 280 characters (CJK weighted 2× so effective 140). **Bluesky post**: 300 characters. **Mastodon post**: 500 characters default (instance-configurable). **SMS**: 160 GSM-7 characters or 70 UCS-2 (Unicode) per segment; longer messages span multiple segments, each billed separately. **YouTube comment**: 10,000 characters. **GitHub commit message**: no hard limit but 50 / 72 conventions (subject / body wrap). **Database VARCHAR**: it depends. `VARCHAR(255)` in MySQL might be bytes (old) or characters (modern utf8mb4), so check the column charset.
How is the line count affected by trailing newlines?
A trailing `\n` adds one to the count because it creates an empty line after itself. `hello\nworld` is 2 lines; `hello\nworld\n` is also 2 lines in most editors but 3 by the split-and-count approach this tool uses: the split produces `["hello", "world", ""]`, three elements. POSIX text files end with `\n` by convention, and `wc -l` counts newline characters rather than lines, so it reports 2 for `hello\nworld\n` and 1 for `hello\nworld`. Pick the convention that matches your downstream consumer.
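The two conventions can be compared directly. In this sketch, `countLinesBySplit` mirrors the split-based approach described above, and `countNewlineChars` mimics what `wc -l` does for `\n`-terminated text; both function names are hypothetical.

```javascript
// Split on any line separator and count the pieces (trailing newline adds one).
function countLinesBySplit(text) {
  return text.split(/\r\n|\r|\n/).length;
}

// Count "\n" characters only, the way `wc -l` does.
function countNewlineChars(text) {
  return (text.match(/\n/g) || []).length;
}

console.log(countLinesBySplit("hello\nworld")); // 2
console.log(countLinesBySplit("hello\nworld\n")); // 3
console.log(countNewlineChars("hello\nworld")); // 1
console.log(countNewlineChars("hello\nworld\n")); // 2
```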
Can I count tokens (LLM context windows)?
Not here — tokenization is model-specific. OpenAI uses tiktoken (BPE), Anthropic Claude uses a different BPE variant, Google Gemini and Llama have their own. A rough rule of thumb is 1 token ≈ 4 English characters or 1 token ≈ 0.5 CJK characters, so this tool's character count divided by 4 gives a usable estimate for English. For an exact count run OpenAI's `tiktoken` library locally or use their playground; for Anthropic, use their `count_tokens` endpoint.
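The rule of thumb for English reduces to a one-line calculation. The helper below is a rough sketch with a hypothetical name; it is only an estimate and will not match any real tokenizer exactly.

```javascript
// Rough English-only estimate: about 4 characters per token, rounded up.
function estimateEnglishTokens(text) {
  return Math.ceil([...text].length / 4);
}

console.log(estimateEnglishTokens("hello world, this is a test")); // 7
```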
Related concepts
Counting "how long is this text" is deceptively layered. Five reasonable answers exist for the same string. **Bytes** is the storage measure — UTF-8 encoded length, what `wc -c` returns. **Code units** is the in-memory measure for many string types — UTF-16 in JavaScript and Java means `"hello".length` is 5 but `"𐀀".length` is 2 (one supplementary-plane character occupies two code units). **Code points** is the Unicode-abstract measure — what `[...str].length` gives in JavaScript, what `wc -m` returns. **Graphemes** is the user-facing measure — `👨👩👧👦` is one grapheme but seven code points; `Intl.Segmenter` gives this count. **Words** is whitespace-defined for Latin scripts and morpheme-defined for CJK, which is why no single tool gets it right for all languages.
The difference matters anywhere a "character limit" is enforced. A 280-character Twitter post is counted in graphemes with a 2× weighting for CJK, not in raw code points or bytes. A 255-character `VARCHAR` in MySQL is bytes if the column is `latin1`, characters if it is `utf8mb4`. SMS limits are 160 GSM-7 *septets* or 70 UCS-2 code units per segment. Search engines truncate meta descriptions visually at a pixel width, not a character count, so a run of wide glyphs such as `WWWWWW` gets cut earlier than `iiiiii`. Picking the right measure for your downstream consumer beats picking a "nice" round number.
Three adjacent concepts are worth knowing. **Normalization** (NFC vs NFD) affects code point counts: `é` is 1 code point precomposed (NFC) or 2 decomposed (NFD), so a normalize step before counting matters for comparison. **Bidi text** mixes left-to-right and right-to-left scripts; counting is unchanged but visual length is unintuitive. **Width** in monospace fonts depends on East Asian Width: a wide character such as a Hangul syllable takes 2 columns in a terminal, a narrow Latin character takes 1. The `wcwidth` family of functions encodes that.
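The effect of normalization on a code point count is easy to reproduce with the built-in `String.prototype.normalize`. The helper name `countCodePoints` is hypothetical, and this is a sketch rather than a statement of how the counter itself behaves.

```javascript
// Count code points, optionally after normalizing to the given form.
function countCodePoints(text, form) {
  const normalized = form ? text.normalize(form) : text;
  return [...normalized].length;
}

const decomposedCafe = "cafe\u0301"; // "e" followed by a combining acute accent

console.log(countCodePoints(decomposedCafe)); // 5
console.log(countCodePoints(decomposedCafe, "NFC")); // 4
console.log(countCodePoints("caf\u00e9", "NFD")); // 5
```

Normalizing both sides to the same form before comparing lengths avoids off-by-one surprises when text comes from different sources.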