
How UTF-8 Encodes Text

Why a Korean character takes 3 bytes, what a Unicode codepoint is, how UTF-8's variable-length scheme stays ASCII-compatible, the BOM, surrogate pairs, and the bugs that come from byte-vs-character confusion.


Type "안녕" on the screen and your file or memory stores six bytes: EC 95 88 EB 85 95. Why does Korean take three bytes per character while English A takes one? How does Unicode codepoint U+C548 turn into EC 95 88? This guide walks through UTF-8's encoding rules, how it relates to Unicode, the BOM, surrogate pairs, and the byte-vs-character bugs that bite teams handling Korean, Japanese, or emoji.

First — Unicode and UTF-8 are different things

  • Unicode: a standard that assigns a number (a codepoint) to each character. A = U+0041, the Korean character 안 = U+C548, the emoji 🎉 = U+1F389. Roughly 150,000 characters are assigned as of Unicode 15.
  • UTF-8 / UTF-16 / UTF-32: encodings that turn codepoints into bytes. The same character takes different numbers of bytes in different encodings.
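
To make the split concrete, here is a minimal sketch that reports a character's codepoint (the Unicode side) next to its encoded forms (the encoding side). The helper name describeChar is hypothetical, and it assumes the input is a single character.

```javascript
// Hypothetical helper: shows one character's codepoint and its encoded sizes.
function describeChar(ch) {
  const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");
  const codePoint = ch.codePointAt(0); // the Unicode number, independent of encoding
  const utf8 = Array.from(new TextEncoder().encode(ch), (b) => hex(b, 2)).join(" ");
  return {
    codePoint: "U+" + hex(codePoint, 4),
    utf8,                      // the UTF-8 bytes, in hex
    utf16Bytes: ch.length * 2, // JS strings are UTF-16: 2 bytes per code unit
  };
}

console.log(describeChar("A"));  // U+0041, utf8 "41", 2 bytes in UTF-16
console.log(describeChar("안")); // U+C548, utf8 "EC 95 88", 2 bytes in UTF-16
console.log(describeChar("🎉")); // U+1F389, utf8 "F0 9F 8E 89", 4 bytes in UTF-16
```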

"Stored in Unicode" is imprecise; people usually mean "UTF-8 encoded Unicode text."

UTF-8's core idea — variable length

1 to 4 bytes per codepoint:

Codepoint range        Bytes   Bit pattern
U+0000 – U+007F        1       0xxxxxxx
U+0080 – U+07FF        2       110xxxxx 10xxxxxx
U+0800 – U+FFFF        3       1110xxxx 10xxxxxx 10xxxxxx
U+10000 – U+10FFFF     4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Rules:

  • In a multi-byte sequence, the number of leading 1s on the first byte = total bytes (110 → 2 bytes, 1110 → 3 bytes, 11110 → 4 bytes); a first byte that starts with 0 is a complete 1-byte character
  • Continuation bytes always start with 10
  • x slots get the codepoint bits, right-aligned
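
As a rough sketch of how a decoder can apply these rules, the function below reads the prefix bits of a single byte and returns how long the sequence should be. The name utf8SequenceLength and the choice to return 0 for a byte that cannot start a sequence are assumptions made for illustration.

```javascript
// Hypothetical helper: classify one byte by its prefix bits.
function utf8SequenceLength(firstByte) {
  if ((firstByte & 0b10000000) === 0b00000000) return 1; // 0xxxxxxx
  if ((firstByte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((firstByte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((firstByte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  return 0; // 10xxxxxx (continuation byte) or an invalid lead byte
}

console.log(utf8SequenceLength(0x41)); // 1: ASCII "A"
console.log(utf8SequenceLength(0xEC)); // 3: first byte of "안"
console.log(utf8SequenceLength(0x95)); // 0: continuation byte, cannot start a character
```

Because continuation bytes are recognizable on their own, a reader that lands in the middle of a character can skip forward to the next byte that is not 10xxxxxx and resynchronize.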

Encoding "안" step by step

"안" = U+C548

1. codepoint in binary
   0xC548 = 1100 0101 0100 1000

2. UTF-8 3-byte form (16 bits fits in 16 x slots)
   1110 xxxx 10 xxxxxx 10 xxxxxx

3. distribute codepoint bits, right-aligned
   1100 0101 0100 1000 → 1100 / 010101 / 001000
                          (4 bits)(6 bits)(6 bits)

4. join
   1110 1100  1001 0101  1000 1000
   = 0xEC     0x95       0x88

→ "안" = EC 95 88 (3 bytes)
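
The same bit-shuffling can be written as a small function. This is a sketch for illustration rather than something to use in production (TextEncoder already does this); the name encodeCodePoint is hypothetical.

```javascript
// Hypothetical helper: encode one codepoint into UTF-8 bytes by hand.
function encodeCodePoint(cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError("not an encodable Unicode codepoint");
  }
  if (cp <= 0x7F) return [cp]; // 0xxxxxxx
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xFFFF) {
    // 1110xxxx 10xxxxxx 10xxxxxx: top 4 bits, middle 6 bits, low 6 bits
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

const hexBytes = encodeCodePoint(0xC548).map((b) => b.toString(16).toUpperCase());
console.log(hexBytes.join(" ")); // EC 95 88
```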

Try it: feed "안" into URL Encode / Decode — you'll see %EC%95%88 (each byte shown as %XX).

Why is English 1 byte and Korean 3 bytes?

UTF-8's most important design decision — byte-for-byte compatibility with ASCII (1968). Codepoints U+0000 – U+007F encode identically to ASCII.

Consequences:

  • Existing ASCII files are already valid UTF-8 (no conversion)
  • English-only code / logs / JSON pay no cost to "adopt" UTF-8
  • Non-ASCII (Korean, emoji, Arabic) pay extra bytes — you pay for what you use

The 3-byte cost for Korean looks wasteful from a Korean-only view, but it is the trade-off for global compatibility and for preserving ASCII-based infrastructure. EUC-KR, a Korean-specific encoding, uses 2 bytes per Hangul syllable, but it cannot represent emoji or most other scripts, so documents that mix languages get messier.
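
A quick way to see the compatibility claim is to compare bytes directly. In this sketch (helper names are hypothetical), ASCII text produces bytes equal to its ASCII codes, while every byte of a Korean character is 0x80 or higher, which is why a tool that scans for ASCII bytes such as "/" or a quote never trips over the middle of a multi-byte character.

```javascript
// Hypothetical helpers for inspecting UTF-8 output.
function utf8Bytes(text) {
  return new TextEncoder().encode(text);
}

// True when every UTF-8 byte equals the ASCII code of the matching character.
function matchesAscii(text) {
  const bytes = utf8Bytes(text);
  return bytes.length === text.length &&
    [...text].every((ch, i) => bytes[i] === ch.charCodeAt(0));
}

// True when no byte falls in the ASCII range (0x00 to 0x7F).
function highBytesOnly(text) {
  return utf8Bytes(text).every((b) => b >= 0x80);
}

console.log(matchesAscii("GET /index.html")); // true
console.log(matchesAscii("안녕"));            // false
console.log(highBytesOnly("안녕"));           // true
```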

UTF-16 — the other big encoding

JavaScript / Java / Windows APIs use UTF-16 internally. 2 bytes baseline, 4 bytes for supplementary:

  • U+0000 – U+FFFF (BMP, Basic Multilingual Plane) → 2 bytes
  • U+10000 – U+10FFFF → 4 bytes (surrogate pair)

Korean characters live in U+AC00 – U+D7A3 (inside the BMP), so UTF-16 encodes them in 2 bytes — smaller than UTF-8's 3. Systems with mostly Korean content might find UTF-16 more compact.
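
The size difference is easy to measure. The sketch below (encodedSizes is a hypothetical name) relies on JavaScript strings being sequences of UTF-16 code units, so the UTF-16 size is length × 2; it ignores any BOM.

```javascript
// Hypothetical helper: byte size of the same text in UTF-8 and UTF-16.
function encodedSizes(text) {
  return {
    utf8: new TextEncoder().encode(text).length,
    utf16: text.length * 2, // 2 bytes per UTF-16 code unit, no BOM counted
  };
}

console.log(encodedSizes("안녕하세요")); // { utf8: 15, utf16: 10 }
console.log(encodedSizes("Hello"));      // { utf8: 5, utf16: 10 }
console.log(encodedSizes("🎉"));         // { utf8: 4, utf16: 4 }
```

UTF-16 wins for pure Korean text and loses for pure ASCII, so markup-heavy content such as HTML or JSON often ends up smaller in UTF-8 even when the visible text is Korean.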

Surrogate pairs — the emoji gotcha

"🎉" = U+1F389 (outside BMP)

UTF-16 surrogate pair:
  high = 0xD83C
  low  = 0xDF89

JavaScript:
  "🎉".length === 2   ← not 1!
  "🎉".charAt(0)      ← "\uD83C", a lone high surrogate (renders as "�")

JavaScript's str.length counts UTF-16 code units. Characters outside the BMP (most emoji, some CJK extensions) register as length 2 even though humans see one character.
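
For reference, the two surrogate values come from simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A minimal sketch, with toSurrogatePair as a hypothetical name:

```javascript
// Hypothetical helper: split a supplementary codepoint into UTF-16 surrogates.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError("only codepoints above U+FFFF need a surrogate pair");
  }
  const offset = cp - 0x10000;             // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F389);
console.log(high.toString(16), low.toString(16)); // d83c df89
console.log("🎉".charCodeAt(0) === high, "🎉".charCodeAt(1) === low); // true true
```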

Fix — newer APIs like [...str] or Array.from(str) iterate codepoints:

[..."🎉"].length === 1   ✓
[..."안녕🎉"].length === 3   ✓

BOM (Byte Order Mark): the invisible first three bytes

Three bytes (EF BB BF) are sometimes prepended to UTF-8 files. The BOM originated as a UTF-16 little-vs-big-endian marker. UTF-8 has no byte order, so the BOM is unnecessary, but some Windows tools (older versions of Notepad, Excel's UTF-8 CSV export) add it anyway.

Problems:

  • Web server returns HTML with a BOM → browser corrupts the first line
  • CSV's first column should be "id" but becomes "\uFEFFid" (invisible BOM included) → DB import fails
  • Shell script starting with #!/bin/bash preceded by BOM → "command not found"

Fix: use the editor's "UTF-8 without BOM" option, or strip the bytes with GNU sed: sed '1s/^\xEF\xBB\xBF//'.
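
When the file can't be fixed at the source, the reader can drop the BOM defensively. A minimal sketch with hypothetical helper names: one works on raw bytes, the other on an already-decoded string, where the BOM shows up as the single character U+FEFF.

```javascript
// Hypothetical helper: drop a leading EF BB BF from raw bytes (a Uint8Array).
function stripBomFromBytes(bytes) {
  const hasBom =
    bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
  return hasBom ? bytes.subarray(3) : bytes;
}

// Hypothetical helper: drop a leading U+FEFF from a decoded string.
function stripBomFromString(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

const raw = Uint8Array.of(0xEF, 0xBB, 0xBF, 0x69, 0x64); // BOM + "id"
console.log(stripBomFromBytes(raw).length);              // 2
console.log(stripBomFromString("\uFEFFid") === "id");    // true
```

Note that TextDecoder strips a leading BOM by default when decoding UTF-8, so the string helper matters mostly for text that arrived through some other path.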

Korean in URLs — percent-encoding

URLs allow only ASCII (RFC 3986). Korean URLs are UTF-8 encoded and each byte becomes %XX:

https://example.com/검색?q=안녕

→ https://example.com/%EA%B2%80%EC%83%89?q=%EC%95%88%EB%85%95

"검" = EA B2 80 → %EA%B2%80
"색" = EC 83 89 → %EC%83%89
"안" = EC 95 88 → %EC%95%88
"녕" = EB 85 95 → %EB%85%95

URL Encode / Decode handles the conversion automatically. The "keep host" toggle preserves scheme + host while only encoding path/query.
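
In JavaScript the built-ins already do the UTF-8 step. The sketch below also includes a hand-rolled version (percentEncodeUtf8, a hypothetical name) to show there is nothing more to it than "UTF-8 bytes, each written as %XX"; unlike encodeURIComponent, it escapes every byte, ASCII included.

```javascript
// Hypothetical helper: write every UTF-8 byte of the text as %XX.
function percentEncodeUtf8(text) {
  return Array.from(
    new TextEncoder().encode(text),
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

console.log(percentEncodeUtf8("안녕"));  // %EC%95%88%EB%85%95
console.log(encodeURIComponent("안녕")); // %EC%95%88%EB%85%95

// encodeURI leaves URL structure (":", "/", "?", "=") alone.
console.log(encodeURI("https://example.com/검색?q=안녕"));
// https://example.com/%EA%B2%80%EC%83%89?q=%EC%95%88%EB%85%95

console.log(decodeURIComponent("%EC%95%88%EB%85%95")); // 안녕
```

Use encodeURIComponent for a single path segment or query value and encodeURI for a whole URL; mixing them up either double-encodes or leaves reserved characters unescaped.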

Punycode — Korean in domain names

Korean domains (한글.kr) can't use percent-encoding because DNS doesn't understand %. Instead, Punycode represents the characters using only ASCII letters, digits, and hyphens. The algorithm is unrelated to UTF-8, and every encoded label carries the xn-- prefix:

한글.kr → xn--bj0bj06e.kr
한국.kr → xn--3e0b707e.kr

Punycode (IDN) converts both ways — useful for spotting IDN homograph attacks (Latin "a" vs Cyrillic "а" lookalikes).
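
One way to get the ASCII form without a Punycode library is to let the URL parser do it, since it converts international hostnames while parsing. This is a sketch that assumes a runtime with the WHATWG URL class and IDN support (browsers, and Node.js builds with ICU); toAsciiHostname is a hypothetical name.

```javascript
// Hypothetical helper: let the URL parser convert a hostname to its ASCII form.
function toAsciiHostname(hostname) {
  return new URL("http://" + hostname).hostname;
}

console.log(toAsciiHostname("한글.kr"));     // xn--bj0bj06e.kr
console.log(toAsciiHostname("example.com")); // example.com

// A lookalike: the first letter is Cyrillic "а" (U+0430), not Latin "a".
const lookalike = toAsciiHostname("\u0430pple.com");
console.log(lookalike.startsWith("xn--"));   // true: the ASCII form exposes the trick
```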

HTML entities — yet another encoding

Special characters in HTML / XML (<, >, &) need entity escaping: &lt;, &gt;, &amp;. Korean and other regular characters can stay literal (handled by HTML's charset=utf-8).

HTML Entity Encode / Decode converts both ways.
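
A minimal escaping sketch, assuming the output goes into HTML text or a quoted attribute value. The function name escapeHtml is hypothetical, and escaping the two quote characters is an addition beyond the three characters named above. Korean passes through untouched.

```javascript
// Hypothetical helper: escape the characters that have special meaning in HTML.
function escapeHtml(text) {
  const entities = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;", // needed inside double-quoted attribute values
    "'": "&#39;",  // needed inside single-quoted attribute values
  };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

console.log(escapeHtml("<b>안녕 & hi</b>"));
// &lt;b&gt;안녕 &amp; hi&lt;/b&gt;
```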

Common bugs

1. byte ≠ character

-- Does MySQL VARCHAR(10) hold 10 Korean characters?
INSERT INTO users (name) VALUES ('가나다라마바사아자차카');
-- "Data too long for column 'name'": 11 characters exceed the limit of 10
-- (MySQL counts characters here; the string is 33 bytes in UTF-8)

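Plenty of systems interpret column length in bytes, so check which unit yours uses. In MySQL, VARCHAR(10) means 10 characters, which is up to 40 bytes with utf8mb4. The legacy utf8 charset (= utf8mb3) stores at most 3 bytes per character and only covers the BMP, so 4-byte characters such as emoji cannot be stored.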

2. Length validation that ignores encoding

if (input.length > 100) throw new Error("too long");
// "안녕하세요": 5 characters, JS length 5, 15 UTF-8 bytes
// which definition is the 100?
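
One way to remove the ambiguity is to name the unit in the check itself. A minimal sketch, with hypothetical helper names and a made-up 100-byte limit standing in for whatever the storage layer actually enforces:

```javascript
// Hypothetical helper: the three "lengths" of the same string.
function measure(text) {
  return {
    codeUnits: text.length,                          // what str.length reports
    codePoints: [...text].length,                    // Unicode codepoints
    utf8Bytes: new TextEncoder().encode(text).length, // what a byte-limited column sees
  };
}

// Hypothetical validator: the unit is spelled out in the name and the message.
function assertMaxUtf8Bytes(text, maxBytes) {
  const { utf8Bytes } = measure(text);
  if (utf8Bytes > maxBytes) {
    throw new RangeError(`too long: ${utf8Bytes} UTF-8 bytes (max ${maxBytes})`);
  }
}

console.log(measure("안녕하세요")); // { codeUnits: 5, codePoints: 5, utf8Bytes: 15 }
console.log(measure("🎉"));         // { codeUnits: 2, codePoints: 1, utf8Bytes: 4 }
assertMaxUtf8Bytes("안녕하세요", 100); // passes: 15 bytes
```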

3. Mojibake in URLs

A URL containing raw Korean (not passed through encodeURIComponent) reaches a server that decodes it as EUC-KR; when the next hop then calls decodeURIComponent, the result is garbage.

4. Excel CSV BOM

Excel writes CSV with a BOM. A parser that doesn't handle it sees the first column as "\uFEFFname" instead of "name".

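5. JSON serializers escaping Korean

Some libraries escape non-ASCII characters to \uXXXX for safety, inflating file size (a 3-byte Korean character becomes 6 bytes). JavaScript's own JSON.stringify leaves them raw. The JSON spec allows raw UTF-8, and most modern guidance recommends keeping it raw.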
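
To see the size difference, the sketch below compares raw output with an ASCII-only variant. stringifyAscii is a hypothetical helper that mimics what escaping serializers produce; it is not part of any library.

```javascript
// Hypothetical helper: JSON text with every non-ASCII code unit written as \uXXXX.
function stringifyAscii(value) {
  return JSON.stringify(value).replace(
    /[\u0080-\uffff]/g,
    (ch) => "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const byteLength = (text) => new TextEncoder().encode(text).length;

const raw = JSON.stringify({ greeting: "안녕" });   // {"greeting":"안녕"}
const escaped = stringifyAscii({ greeting: "안녕" }); // {"greeting":"\uc548\ub155"}

console.log(byteLength(raw), byteLength(escaped));  // 21 27
console.log(JSON.parse(raw).greeting === JSON.parse(escaped).greeting); // true
```

Both forms parse to the same value, so the choice only affects size and readability of the serialized text.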

Summary

  • Unicode assigns numbers to characters. UTF-8 turns those numbers into bytes.
  • UTF-8 is variable-length: ASCII 1 byte, Korean 3 bytes, most emoji 4 bytes. The leading bits of the first byte give the sequence length.
  • ASCII compatibility is UTF-8's killer feature — 30 years of tooling kept working.
  • UTF-16 is more compact for Korean (2 bytes) and is the internal format of JavaScript / Java — but emoji become surrogate pairs.
  • BOM (EF BB BF) is pointless in UTF-8 but Windows / Excel add it anyway. Common cause of "first column corrupted" bugs.
  • byte ≠ character. Be explicit about which one your DB columns, length checks, and APIs use.
  • Try it: URL Encode / Decode / Punycode (IDN) / HTML Entity Encode / Decode / Base64 Encode / Decode.