Type "안녕" on the screen and your file or memory stores six bytes: EC 95 88 EB 85 95. Why does Korean take three bytes per character while English A takes one? How does Unicode codepoint U+C548 turn into EC 95 88? This guide walks through UTF-8's encoding rules, how it relates to Unicode, the BOM, surrogate pairs, and the byte-vs-character bugs that bite teams handling Korean, Japanese, or emoji.
First — Unicode and UTF-8 are different things
- Unicode: a standard that assigns a number to each character. A = U+0041, the Korean character 안 = U+C548, the emoji 🎉 = U+1F389. Roughly 150,000 codepoints are assigned (Unicode 15).
- UTF-8 / UTF-16 / UTF-32: encodings that turn codepoints into bytes. The same character takes different numbers of bytes in different encodings.
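To make the split between "number" and "bytes" concrete, here is a minimal sketch that prints a character's codepoint and then its size under two encodings. It assumes a runtime with the standard `TextEncoder` (modern browsers, Node.js); the helper names `codepointOf`, `utf8Bytes`, and `utf16ByteLength` are made up for illustration.

```javascript
// Codepoint: the number Unicode assigns to the character.
function codepointOf(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// UTF-8 bytes as uppercase hex, produced by the standard TextEncoder.
function utf8Bytes(str) {
  return Array.from(new TextEncoder().encode(str), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

// UTF-16 size in bytes: a JavaScript string is a sequence of 16-bit code units.
function utf16ByteLength(str) {
  return str.length * 2;
}

for (const ch of ["A", "안", "🎉"]) {
  console.log(ch, codepointOf(ch), "| UTF-8:", utf8Bytes(ch), "| UTF-16 bytes:", utf16ByteLength(ch));
}
```

The codepoint is the same in every row of output; only the byte representation changes with the encoding.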
"Stored in Unicode" is imprecise; people usually mean "UTF-8 encoded Unicode text."
UTF-8's core idea — variable length
1 to 4 bytes per codepoint:
| Codepoint range | Bytes | Bit pattern |
|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx |
| U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Rules:
- The leading bits of the first byte give the total length: a leading 0 means 1 byte, while 110, 1110, and 11110 (two, three, or four leading 1s) mean 2, 3, and 4 bytes
- Continuation bytes always start with 10
- x slots get the codepoint bits, right-aligned
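The first two rules can be checked with a few bit masks. The following is a small sketch, not a full validator: the function names are hypothetical, and it does not reject lead bytes that are structurally well-formed but never valid in UTF-8 (such as 0xC0 or 0xF5).

```javascript
// Returns how many bytes a UTF-8 sequence occupies, judged from its first byte.
// Returns 0 for a continuation byte (10xxxxxx) or an unrecognized lead byte.
function utf8SequenceLength(firstByte) {
  if ((firstByte & 0b10000000) === 0b00000000) return 1; // 0xxxxxxx
  if ((firstByte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((firstByte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((firstByte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  return 0;
}

// Continuation bytes always carry the 10 prefix.
function isContinuationByte(byte) {
  return (byte & 0b11000000) === 0b10000000;
}

console.log(utf8SequenceLength(0x41)); // 1 (ASCII "A")
console.log(utf8SequenceLength(0xec)); // 3 (first byte of "안")
console.log(isContinuationByte(0x95)); // true (second byte of "안")
```

Because a continuation byte can never be mistaken for a lead byte, a decoder that lands in the middle of a stream can skip forward to the next byte that is not `10xxxxxx` and resynchronize.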
Encoding "안" step by step
"안" = U+C548
1. codepoint in binary
0xC548 = 1100 0101 0100 1000
2. UTF-8 3-byte form (16 bits fits in 16 x slots)
1110 xxxx 10 xxxxxx 10 xxxxxx
3. distribute codepoint bits, right-aligned
1100 0101 0100 1000 → 1100 / 010101 / 001000
(4 bits)(6 bits)(6 bits)
4. join
1110 1100 1001 0101 1000 1000
= 0xEC 0x95 0x88
→ "안" = EC 95 88 (3 bytes)
Try it: feed "안" into URL Encode / Decode and you'll see %EC%95%88 (each byte shown as %XX).
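The same steps can be written as shifts and masks. This is a hand-rolled sketch for illustration (the name `encodeCodepointUtf8` is hypothetical); in real code the built-in `TextEncoder` does this work.

```javascript
// Encodes one Unicode codepoint into UTF-8 bytes by hand.
function encodeCodepointUtf8(cp) {
  // Surrogates (U+D800 to U+DFFF) and values above U+10FFFF are not encodable.
  if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    throw new RangeError("not a Unicode scalar value: " + cp);
  }
  if (cp <= 0x7f) return [cp]; // 0xxxxxxx
  if (cp <= 0x7ff) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  if (cp <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx: split 16 bits into 4 + 6 + 6
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: split 21 bits into 3 + 6 + 6 + 6
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

const hex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(encodeCodepointUtf8(0xc548))); // EC 95 88
```

A quick sanity check is to compare the result against `new TextEncoder().encode("안")`, which should yield the same three bytes.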
Why is English 1 byte and Korean 3 bytes?
UTF-8's most important design decision — byte-for-byte compatibility with ASCII (1968). Codepoints U+0000 – U+007F encode identically to ASCII.
Consequences:
- Existing ASCII files are already valid UTF-8 (no conversion)
- English-only code / logs / JSON pay no cost to "adopt" UTF-8
- Non-ASCII (Korean, emoji, Arabic) pay extra bytes — you pay for what you use
The 3-byte cost for Korean looks like waste from a Korean-only view, but it is the trade-off for global compatibility and for preserving ASCII-based infrastructure. The Korean-specific EUC-KR uses 2 bytes per Hangul syllable, but its character set is limited, so documents that mix Korean with other scripts or emoji get messier.
UTF-16 — the other big encoding
JavaScript / Java / Windows APIs use UTF-16 internally. 2 bytes baseline, 4 bytes for supplementary:
- U+0000 – U+FFFF (BMP, Basic Multilingual Plane) → 2 bytes
- U+10000 – U+10FFFF → 4 bytes (surrogate pair)
Korean characters live in U+AC00 – U+D7A3 (inside the BMP), so UTF-16 encodes them in 2 bytes — smaller than UTF-8's 3. Systems with mostly Korean content might find UTF-16 more compact.
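As a rough illustration of the two size classes, the sketch below encodes a codepoint into UTF-16 code units and compares encoded sizes of the same string. The function names are hypothetical, and `encodedSizes` relies on the fact that a JavaScript string's `length` is its number of 16-bit code units.

```javascript
// Encodes one codepoint into UTF-16 code units (each unit is 16 bits = 2 bytes).
function toUtf16Units(cp) {
  if (cp <= 0xffff) return [cp];          // BMP: one unit, 2 bytes
  const offset = cp - 0x10000;            // 20 bits remain
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];                     // surrogate pair, 4 bytes
}

// Compares the encoded size of a string in UTF-8 and UTF-16.
function encodedSizes(str) {
  return {
    utf8: new TextEncoder().encode(str).length,
    utf16: str.length * 2,
  };
}

console.log(toUtf16Units(0xc548).map((u) => u.toString(16)));  // [ 'c548' ]
console.log(toUtf16Units(0x1f389).map((u) => u.toString(16))); // [ 'd83c', 'df89' ]
console.log(encodedSizes("안녕하세요")); // { utf8: 15, utf16: 10 }
console.log(encodedSizes("hello"));      // { utf8: 5, utf16: 10 }
```

The last two lines show the trade-off in both directions: UTF-16 is smaller for pure Hangul and larger for pure ASCII.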
Surrogate pairs — the emoji gotcha
"🎉" = U+1F389 (outside BMP)
UTF-16 surrogate pair:
high = 0xD83C
low = 0xDF89
JavaScript:
"🎉".length === 2 ← not 1!
"🎉".charAt(0) ← only "�" (broken)
JavaScript's str.length counts UTF-16 code units. Characters outside the BMP (most emoji, some CJK extensions) register as length 2 even though humans see one character.
Fix — newer APIs like [...str] or Array.from(str) iterate codepoints:
[..."🎉"].length === 1 ✓
[..."안녕🎉"].length === 3 ✓
BOM (Byte Order Mark) — the invisible first byte
Three bytes — EF BB BF — sometimes prepended to UTF-8 files. Originally a UTF-16 little-vs-big-endian marker. UTF-8 has no byte order, so the BOM is unnecessary — but Windows Notepad and Excel CSV add it anyway.
Problems:
- Web server returns HTML with a BOM → browser corrupts the first line
- CSV's first column should be "id" but becomes "id" with the invisible BOM in front → DB import fails
- Shell script starting with #!/bin/bash preceded by a BOM → "command not found"
Fix — use the editor's "UTF-8 without BOM" option or sed '1s/^\xEF\xBB\xBF//'.
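When the file comes from someone else and cannot be re-saved, the BOM can also be dropped in code. The following is a minimal sketch with hypothetical helper names: one works on raw bytes, the other on text that has already been decoded (where the BOM appears as the single character U+FEFF).

```javascript
// Removes a leading UTF-8 BOM (EF BB BF) from raw bytes, if present.
function stripBomBytes(bytes) {
  const hasBom =
    bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
  return hasBom ? bytes.subarray(3) : bytes;
}

// Same idea for already-decoded text: the BOM decodes to U+FEFF.
function stripBomText(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

const withBom = new Uint8Array([0xef, 0xbb, 0xbf, 0x69, 0x64]); // BOM + "id"
console.log(Array.from(stripBomBytes(withBom))); // [ 105, 100 ]
console.log(stripBomText("\uFEFFid") === "id");  // true
```

Both helpers leave input without a BOM untouched, so they are safe to apply unconditionally at the point where a file is read.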
Korean in URLs — percent-encoding
URLs allow only ASCII (RFC 3986). Korean URLs are UTF-8 encoded and each byte becomes %XX:
https://example.com/검색?q=안녕
→ https://example.com/%EA%B2%80%EC%83%89?q=%EC%95%88%EB%85%95
"검" = EA B2 80 → %EA%B2%80
"색" = EC 83 89 → %EC%83%89
"안" = EC 95 88 → %EC%95%88
"녕" = EB 85 95 → %EB%85%95
URL Encode / Decode handles the conversion automatically. The "keep host" toggle preserves scheme + host while only encoding path/query.
Punycode — Korean in domain names
Korean domains (한글.kr) can't use percent-encoding — DNS doesn't understand %. Instead, Punycode represents the characters using pure ASCII. Different algorithm, but the output always has the xn-- prefix:
한글.kr → xn--bj0bj06e.kr
한국.kr → xn--3e0b707e.kr
Punycode (IDN) converts both ways, which is useful for spotting IDN homograph attacks (Latin "a" vs Cyrillic "а" lookalikes).
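In JavaScript, one way to see the ASCII form of a hostname is the standard `URL` class, which applies the IDN conversion while parsing. This sketch assumes a runtime whose `URL` implementation supports internationalized hostnames (current browsers and Node.js builds generally do).

```javascript
// The WHATWG URL parser converts an internationalized hostname to its ASCII form.
const url = new URL("https://한글.kr/");

// If IDN conversion is supported, the label is rewritten with the xn-- prefix.
console.log(url.hostname);
console.log(url.hostname.startsWith("xn--"));
```

Comparing `url.hostname` with the hostname the user typed is a cheap way to notice that a link contains non-ASCII lookalike characters.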
HTML entities — yet another encoding
Special characters in HTML / XML (<, >, &) need entity escaping: &lt;, &gt;, &amp;. Korean and other regular characters can stay literal (handled by HTML's charset=utf-8).
HTML Entity Encode / Decode converts both ways.
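A minimal escaping helper might look like the sketch below. The name `escapeHtml` is hypothetical, and escaping the two quote characters goes beyond the three characters listed above; it is included on the assumption that the text may end up inside an attribute value.

```javascript
// Characters that must not appear literally in HTML text or attribute values.
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Replaces each special character with its entity; everything else,
// including Korean, passes through unchanged.
function escapeHtml(text) {
  return text.replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]);
}

console.log(escapeHtml('<b>안녕 & "hello"</b>'));
// &lt;b&gt;안녕 &amp; &quot;hello&quot;&lt;/b&gt;
```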
Common bugs
1. byte ≠ character
-- Does MySQL VARCHAR(10) hold 10 Korean characters?
INSERT INTO users (name) VALUES ('가나다라마바사아자차카');
-- "Data too long for column 'name'": the value is 11 characters (33 bytes in UTF-8)
Plenty of systems interpret column length in bytes. MySQL counts characters: a utf8mb4 VARCHAR(10) holds 10 characters (up to 40 bytes), so 10 Hangul syllables fit and this 11-character value is rejected for its character count. The legacy utf8 (= utf8mb3) charset covers only the BMP, so emoji cannot be stored at all.
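Before deciding on a limit, it helps to see all the candidate "lengths" of a string side by side. The sketch below is one way to do that in JavaScript; the helper name `measure` is hypothetical, and it assumes the standard `TextEncoder` is available.

```javascript
// Reports three different "lengths" of the same string.
function measure(str) {
  return {
    codepoints: [...str].length,                     // string iteration walks codepoints
    utf16Units: str.length,                          // what JavaScript's .length reports
    utf8Bytes: new TextEncoder().encode(str).length, // what a byte-limited column sees
  };
}

console.log(measure("가나다라마바사아자차카"));
// { codepoints: 11, utf16Units: 11, utf8Bytes: 33 }
console.log(measure("🎉"));
// { codepoints: 1, utf16Units: 2, utf8Bytes: 4 }
```

Whichever number the storage layer enforces is the one the application-side check should use, and the limit should be documented in those terms.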
2. Length validation that ignores encoding
if (input.length > 100) throw new Error("too long");
// "안녕하세요" — 5 chars vs JS length 5 vs UTF-8 byte 15
// which definition is the 100?
3. Mojibake in URLs
URLs with raw Korean (no encodeURIComponent) hit a server that decodes as EUC-KR, then decodeURIComponent on the next hop produces garbage.
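One defensive pattern is to percent-encode explicitly on the sending side and to treat undecodable input as an error on the receiving side instead of passing garbage along. In this sketch, `safeDecodeComponent` is a hypothetical helper, and `%BE%C8` stands in for bytes that are not valid UTF-8 (it is assumed here to be what an EUC-KR client would send for 안).

```javascript
// Sender: UTF-8 encode the value, then emit one %XX per byte.
const query = encodeURIComponent("안녕");

// encodeURI leaves URL delimiters such as ":", "/", "?" and "=" untouched.
const fullUrl = encodeURI("https://example.com/검색?q=안녕");

// Receiver: decodeURIComponent throws URIError when the escapes are not valid UTF-8.
// Returning null lets the caller reject the request explicitly.
function safeDecodeComponent(value) {
  try {
    return decodeURIComponent(value);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err;
  }
}

console.log(query);                         // %EC%95%88%EB%85%95
console.log(fullUrl);
console.log(safeDecodeComponent(query));    // 안녕
console.log(safeDecodeComponent("%BE%C8")); // null
```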
4. Excel CSV BOM
Excel writes CSV with a BOM. A parser that doesn't handle it sees the first column as "name" with an invisible U+FEFF in front, so a lookup by the plain key "name" fails.
5. JSON.stringify escaping Korean
Some libraries escape non-ASCII to \uXXXX for safety, inflating file size. JSON spec allows raw UTF-8 — most modern guidance recommends keeping it raw.
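The size difference is easy to measure. In the sketch below, JavaScript's own `JSON.stringify` is used for the raw form (it leaves Korean as-is), and `stringifyAscii` is a hypothetical helper that imitates libraries which escape everything outside ASCII.

```javascript
// JSON.stringify in JavaScript keeps non-ASCII characters literal.
const raw = JSON.stringify({ greeting: "안녕" });

// Hypothetical ASCII-only variant: escape every non-ASCII UTF-16 code unit as \uXXXX.
function stringifyAscii(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const escaped = stringifyAscii({ greeting: "안녕" });
const byteLength = (s) => new TextEncoder().encode(s).length;

console.log(raw, byteLength(raw));         // {"greeting":"안녕"} 21
console.log(escaped, byteLength(escaped)); // {"greeting":"\uc548\ub155"} 27
```

Both forms parse back to the same value; each Hangul syllable simply costs 6 bytes escaped instead of 3 bytes raw.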
Summary
- Unicode assigns numbers to characters. UTF-8 turns those numbers into bytes.
- UTF-8 is variable-length: ASCII 1 byte, Korean 3 bytes, emoji 4 bytes. The leading 1s of the first byte say how long.
- ASCII compatibility is UTF-8's killer feature — 30 years of tooling kept working.
- UTF-16 is more compact for Korean (2 bytes) and is the internal format of JavaScript / Java — but emoji become surrogate pairs.
- BOM (EF BB BF) is pointless in UTF-8 but Windows / Excel add it anyway. Common cause of "first column corrupted" bugs.
- byte ≠ character. Be explicit about which one your DB columns, length checks, and APIs use.
- Try it: URL Encode / Decode, Punycode (IDN), HTML Entity Encode / Decode, Base64 Encode / Decode.