Question 1

Why is the byte count more than the character count?

Accepted Answer

UTF-8 uses 1 byte for ASCII (U+0000–U+007F), 2 bytes for U+0080–U+07FF (1,920 code points covering Latin extended, Greek, Cyrillic, Hebrew, Arabic), 3 bytes for the rest of the BMP (Korean, Chinese, Japanese, most other scripts), and 4 bytes for supplementary planes (most emoji, ancient scripts, rare CJK). So a pure-ASCII English sentence takes 1 byte per character, the Hangul syllables in a Korean sentence take 3 bytes each (spaces and ASCII punctuation still take 1), and most emoji take 4 bytes per code point, plus 3 bytes for every zero-width joiner (U+200D) in a ZWJ sequence. The byte count is what your storage layer and many APIs measure.
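The ranges above can be sketched in a few lines of JavaScript. This is an illustration, not the tool's actual code; the function names `utf8BytesForCodePoint`, `countUtf8`, and `utf8ByteLength` are made up for this example. `TextEncoder` is the standard API and serves as a cross-check.

```javascript
// Encoded size of one code point in UTF-8, following the ranges described above.
function utf8BytesForCodePoint(cp) {
  if (cp <= 0x7f) return 1;   // ASCII
  if (cp <= 0x7ff) return 2;  // Latin extended, Greek, Cyrillic, Hebrew, Arabic
  if (cp <= 0xffff) return 3; // rest of the BMP (Hangul, kana, most CJK)
  return 4;                   // supplementary planes (most emoji)
}

// Hypothetical helper: counts code points and UTF-8 bytes in one pass.
function countUtf8(text) {
  let characters = 0;
  let bytes = 0;
  for (const ch of text) { // for...of iterates by code point, not UTF-16 unit
    characters += 1;
    bytes += utf8BytesForCodePoint(ch.codePointAt(0));
  }
  return { characters, bytes };
}

// Cross-check using the built-in encoder.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log(countUtf8("hello"));  // 5 characters, 5 bytes
console.log(countUtf8("한국어")); // 3 characters, 9 bytes
```

In practice `utf8ByteLength` alone is enough; the manual version only makes the per-range arithmetic visible.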

Question 2

Why does my emoji count look high?

Accepted Answer

The counter measures Unicode *code points*, not visible *graphemes*. A single emoji glyph like `👨‍👩‍👧‍👦` (family of four) is actually 7 code points (4 person emoji + 3 zero-width joiners). Counting grapheme clusters (what a user sees as "one character") needs the `Intl.Segmenter` API or a library like `grapheme-splitter`. Platforms differ in what they count: Bluesky's limit is measured in graphemes, while Twitter / X uses its own weighted count in which an emoji sequence counts as 2. To get the grapheme count, run `Array.from(new Intl.Segmenter("en", { granularity: "grapheme" }).segment(text)).length` in DevTools; to see how a specific platform counts your text, paste it into that platform's compose box.
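To make the difference concrete, here is a small sketch that reports three counts for the same string. The helper name `countUnits` is invented for this example; `Intl.Segmenter` needs a reasonably recent browser or Node.js 16+.

```javascript
// Hypothetical helper: three different answers to "how long is this string?"
function countUnits(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    utf16Units: text.length,                               // what String.length reports
    codePoints: Array.from(text).length,                   // what a code point counter reports
    graphemes: Array.from(segmenter.segment(text)).length, // what a reader sees
  };
}

// Family of four: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(countUnits(family));
```

The same gap appears with combining marks: `"e\u0301"` (e plus a combining acute accent) is 2 code points but 1 grapheme.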

Question 3

How are words counted when there are no spaces (Chinese, Japanese)?

Accepted Answer

They are not segmented. The counter splits on whitespace, so a Japanese sentence with no spaces counts as 1 word regardless of length. True word segmentation is language-specific and needs a morphological analyzer (kuromoji for Japanese, jieba for Chinese, KoNLPy for Korean), or at least a dictionary-based segmenter such as `Intl.Segmenter` with `granularity: "word"`. For CJK content, the character count is the meaningful measure of length. It matches how publishers (NHK, Asahi, Yomiuri) historically billed translation work and how Twitter weights CJK posts (each character counts as 2 toward the 280 limit, so the effective limit is 140 CJK characters).
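A minimal sketch of the two approaches, assuming the tool's whitespace split behaves roughly like the first function (its real implementation is not shown here). Both function names are hypothetical. The second function uses the built-in `Intl.Segmenter`, whose CJK results depend on the ICU dictionary shipped with the runtime and are coarser than a real morphological analyzer.

```javascript
// Whitespace-based count: unspaced CJK text collapses to a single "word".
function countWordsByWhitespace(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Dictionary-based count via Intl.Segmenter; punctuation and spaces are skipped.
function countWordsBySegmenter(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let words = 0;
  for (const segment of segmenter.segment(text)) {
    if (segment.isWordLike) words += 1;
  }
  return words;
}

const sentence = "今日は天気がいいですね";
console.log(countWordsByWhitespace(sentence));      // 1
console.log(countWordsBySegmenter(sentence, "ja")); // several; exact value depends on ICU data
```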

Question 4

What are the common length limits I should know?

Accepted Answer

- **SEO meta description**: ~155–160 characters before Google truncates.
- **HTML `<title>`**: ~60 characters for desktop, ~50 for mobile.
- **Twitter / X post**: 280 weighted characters (CJK characters count 2×, so the effective limit is 140).
- **Bluesky post**: 300 characters (counted as graphemes).
- **Mastodon post**: 500 characters by default (instance-configurable).
- **SMS**: 160 GSM-7 characters or 70 UCS-2 (Unicode) characters for a single message; longer messages are split into segments of 153 (GSM-7) or 67 (UCS-2) characters, each billed separately.
- **YouTube comment**: 10,000 characters.
- **GitHub commit message**: no hard limit, but the 50 / 72 convention applies (subject length / body wrap width).
- **Database VARCHAR**: depends. `VARCHAR(255)` in MySQL means 255 characters since version 4.1 (bytes before that), but row-size and index-key limits are still measured in bytes, so with utf8mb4 (up to 4 bytes per character) check the column charset.
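The SMS limit is the least intuitive of these, so here is a rough sketch of the segment arithmetic. It assumes the default GSM 03.38 alphabet with no national shift tables, and it relies on the fact that multi-part messages carry a concatenation header that reduces each segment to 153 septets (GSM-7) or 67 UTF-16 units (UCS-2). The function name `smsSegments` is made up for this example. It also ignores the rule that an escape pair or surrogate pair cannot be split across segments, so treat the result as an estimate.

```javascript
// GSM 03.38 basic alphabet (1 septet each) and extension table (2 septets each).
const GSM7_BASIC =
  "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ" +
  " !\"#¤%&'()*+,-./0123456789:;<=>?" +
  "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ" +
  "¿abcdefghijklmnopqrstuvwxyzäöñüà";
const GSM7_EXTENDED = "^{}\\[~]|€\f";

// Hypothetical helper: picks the encoding and estimates the segment count.
function smsSegments(text) {
  let septets = 0;
  let isGsm7 = true;
  for (const ch of text) {
    if (GSM7_BASIC.includes(ch)) septets += 1;
    else if (GSM7_EXTENDED.includes(ch)) septets += 2; // escape + character
    else { isGsm7 = false; break; }
  }
  if (isGsm7) {
    const segments = septets <= 160 ? 1 : Math.ceil(septets / 153);
    return { encoding: "GSM-7", units: septets, segments };
  }
  // Any character outside GSM-7 switches the whole message to UCS-2,
  // which is measured in UTF-16 code units (an emoji usually costs 2).
  const units = text.length;
  const segments = units <= 70 ? 1 : Math.ceil(units / 67);
  return { encoding: "UCS-2", units, segments };
}

console.log(smsSegments("a".repeat(161)));  // GSM-7, 2 segments
console.log(smsSegments("한".repeat(71)));  // UCS-2, 2 segments
```

Note how a single non-GSM character (a curly quote, an emoji) drops the single-message budget from 160 to 70.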

Question 5

How is the line count affected by trailing newlines?

Accepted Answer

A trailing `\n` adds one to the count because it creates an empty line after itself. `hello\nworld` is 2 lines; `hello\nworld\n` is still 2 lines of content, but 3 by the split-on-newline approach this tool uses: the split produces `["hello", "world", ""]`, three elements. POSIX text files end with `\n` by convention, and `wc -l` counts newline characters rather than lines, so it reports 2 for `hello\nworld\n` but only 1 for `hello\nworld`. Pick the convention that matches your downstream consumer.
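The three conventions can be compared side by side. This is a sketch with hypothetical function names; the first function assumes the tool splits on `"\n"` as the answer describes.

```javascript
// Split-based count: a trailing newline yields an extra empty element.
function countBySplit(text) {
  return text.split("\n").length;
}

// wc -l style: counts newline characters only.
function countNewlines(text) {
  return (text.match(/\n/g) || []).length;
}

// "Lines of content": a final newline terminates the last line instead of starting a new one.
function countContentLines(text) {
  if (text === "") return 0;
  return countNewlines(text) + (text.endsWith("\n") ? 0 : 1);
}

for (const sample of ["hello\nworld", "hello\nworld\n"]) {
  console.log(JSON.stringify(sample), countBySplit(sample), countNewlines(sample), countContentLines(sample));
}
```

For `"hello\nworld"` the three functions give 2, 1, 2; for `"hello\nworld\n"` they give 3, 2, 2.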

Question 6

Can I count tokens (LLM context windows)?

Accepted Answer

Not here: tokenization is model-specific. OpenAI uses tiktoken (BPE), and Anthropic Claude, Google Gemini, and Llama each have their own tokenizers. A rough rule of thumb is 1 token ≈ 4 English characters, so this tool's character count divided by 4 gives a usable estimate for English. CJK text is denser, often around 1–2 tokens per character depending on the tokenizer. For an exact count run OpenAI's `tiktoken` library locally or use their playground; for Anthropic, use their `count_tokens` endpoint.
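The divide-by-4 rule of thumb is easy to apply by hand, but as a sketch it looks like this. `estimateEnglishTokens` is a hypothetical name, and the result is only a ballpark for English prose; it is not a substitute for a real tokenizer.

```javascript
// Rough estimate only: assumes about 4 characters per token, which holds loosely for English.
function estimateEnglishTokens(text, charsPerToken = 4) {
  const characters = Array.from(text).length; // count code points, like the tool's character count
  return Math.ceil(characters / charsPerToken);
}

console.log(estimateEnglishTokens("a".repeat(400))); // 100
```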

Text Counter

