CSV looks like the simplest format alive — commas separate fields, newlines separate rows. In practice, it's one of the most-broken formats in production. What happens when a field contains a comma? A quote? A newline? When Excel saves with a BOM? When a Korean Excel writes CP949 instead of UTF-8? This guide walks through RFC 4180, the common traps, and a debugging playbook.
The CSV "standard" — RFC 4180
Published in 2005 — informal, "the way most tools already work." Before that, every tool had its own rules. The essentials:
1. Each line = one record
2. Comma (,) separates fields
3. Newline = CRLF
4. Quote a field with " when it contains , or " or a newline
5. Inside quotes, escape " as ""
Example:
name,note,city
"Lee, Sumi","She said ""hi""","Seoul"
"Park","Multi
line text","Busan"RFC 4180 is informal — Excel, Google Sheets, and library implementations vary subtly.
Edge case 1 — commas inside fields
Bad:
name,address
Kim Sumi,Seoul, Gangnam, 123 ← four fields?
RFC 4180 — quote it:
name,address
Kim Sumi,"Seoul, Gangnam, 123" ← two fields ✓A naive line.split(',') doesn't know about quoted commas. Always use an RFC 4180 parser. CSV ↔ JSON uses papaparse for that.
Edge case 2 — quotes inside quotes
Original: She said "hi"
CSV escape (RFC 4180):
name,quote
"Lee","She said ""hi"""
^^^^^^^^^^^^^
outer "..." is the field boundary
inner "" decodes to a single "Some tools use \" escapes — not standard. CSV's only escape is "".
Edge case 3 — multi-line fields
name,description
"Yu","Line 1
Line 2
Line 3"
"Lee","Single line"Two records, but the line count doesn't match. wc -l and similar line-based tools will mislead you — only a real parser gets it right.
Why multi-line fields are dangerous
- Line-based tools (grep / sed / awk) confuse record boundaries
- DB BULK INSERT operations read line-by-line and truncate records
- Common in user inputs (addresses, descriptions) that contain newlines
Workaround — pre-process to replace newlines with \n tokens, or use JSONL for line-safe data.
Edge case 4 — the BOM (Byte Order Mark)
Excel for Windows prepends a UTF-8 BOM (EF BB BF) when saving CSV:
raw bytes:
EF BB BF 6E 61 6D 65 2C 76 61 6C 75 65 0A
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
BOM "name,value\n"
Parser result:
The first column becomes "\uFEFFname" or "name"
^ the BOM is now part of the column name
→ DB import: "Column 'name' not found"
→ Every row's first column is wrongFix:
- BOM-aware parsers — papaparse strips BOM by default
- Save without BOM (Excel "UTF-8 (no BOM)" option, or LibreOffice)
- Shell —
sed '1s/^\xEF\xBB\xBF//' file.csv
Edge case 5 — Korean Excel's CP949
Korean-locale Windows Excel defaults to CP949 encoding (a Microsoft superset of EUC-KR), not UTF-8. Tools written abroad assume UTF-8, then everything Korean becomes mojibake:
Korean Excel saves:
EC A0 A4 EC 9B 90 ← UTF-8 for "직원" if UTF-8 mode
B5 60 B0 F8 ← CP949 for "직원" (totally different bytes!)
UTF-8 parser reading CP949 bytes:
"직원" → "직���" (mojibake)Fix:
- Use Excel's "Save As" → "CSV UTF-8" (Excel 2016+)
- Tell the parser the encoding — Python
encoding="cp949", etc. - A UTF-8 BOM is a strong signal — keeping it can help downstream parsers
Edge case 6 — Excel's auto-conversion
Excel "helpfully" converts values when opening CSVs:
CSV: Excel opens it as:
"0123",,, → 123 (leading zero lost — postal codes broken)
"05/22",,, → May 22 date (looked date-y, became a date)
"1.234e3",,, → 1234 (scientific notation interpreted)
"01/02/2026",,, → 2026-01-02 or 2026-02-01 (locale-dependent)
"=SUM(1+1)",,, → 2 (formula executed! → CSV injection attack)Defenses:
- Prefix with a single quote (
'0123) to force text - Switch to TSV (tab separator) — Excel auto-converts less aggressively
- CSV injection — strip or escape cells starting with
=,+,-,@when exporting user input
CSV Injection
User input field:
=HYPERLINK("http://evil.com/", "Click me")
CSV export opens in Excel → creates a clickable phishing link.
=cmd|'/C calc'!A1
→ Older Excel versions launch calc.exe (historic)OWASP guidance — prefix user-supplied cells starting with =, +, -, @ with a single quote.
Edge case 7 — header or no header?
With header:
name,age
Alice,30
Bob,25
Without header:
Alice,30
Bob,25Same data, different interpretation. RFC 4180 makes headers optional and provides no in-file signal. Conventions:
- All-string values + clearly different first row → header likely
- Document via README or a magic header line
- Force at the parser — papaparse
header: true
CSV ↔ JSON exposes the first-row-as-header toggle.
Dialects — TSV / PSV / SSV / fixed-width
- TSV (tab-separated) — tab separator. Tabs inside fields are rare, so quoting is uncommon
- PSV (pipe-separated) —
|separator. Common from SQL dumps - SSV (semicolon) —
;separator. Europe (because,is the decimal point) - Fixed-width — no separator, columns aligned by position. Legacy mainframe
German / French Excel defaults to ; — the same ".csv" file behaves differently per country.
Practical CSV hygiene
- Use a parser library — never split on
,. Pythoncsv, Nodepapaparse, Java OpenCSV, Goencoding/csv - Be explicit about encoding — UTF-8 default, UTF-8 BOM for Korean Excel compatibility
- Strip BOM — parser option or pre-process
- Defend against CSV injection — escape dangerous prefixes on export
- Consider JSON / JSONL when you need types, nesting, or per-line safety
Debugging — convert to JSON with CSV ↔ JSON to see exactly how each row parses.
Common pitfalls
1. line.split(',')
The classic anti-pattern. Ignores quoted commas, escapes, and BOMs.
2. Korean Excel's CP949 trap
Mojibake whenever a tool assumes UTF-8. Specify the encoding or switch to UTF-8 mode in Excel.
3. Excel losing leading zeros
03722 → 3722. Use ="03722" or pre-format the column as text.
4. Loading huge CSVs into memory
Loading a 100 MB CSV all at once → OOM. Use streaming parsers (papaparse step option, Python csv.reader iterator).
5. Trailing newline → phantom row
name,age
Alice,30
Bob,25
← blank line
Some parsers: 3 rows including an empty one
Others: 2 rows (trailing newline ignored)References
- RFC 4180 (the CSV reference) — datatracker
- OWASP CSV Injection — OWASP
- papaparse — Node CSV parser
- Python csv module — Python docs
Summary
- CSV's "standard" is informal RFC 4180. Tools have subtle dialect differences.
- Seven big traps — embedded commas, quoted quotes, multi-line fields, BOM, encoding (CP949 / UTF-8), Excel auto-conversion, header ambiguity.
- CSV injection — cells starting with
=,+,-,@are Excel formulas. Escape on export. - Dialects — TSV / PSV / SSV (Europe) / fixed-width. Same ".csv" can mean different separators.
- Hygiene — parser library + explicit encoding + BOM strip + injection escape + streaming for large files.
- Reach for JSON / JSONL when types, nesting, or per-line safety matter.
- Debug with CSV ↔ JSON — see how each row actually parses.