Skip to content
yutils

Why CSV Always Breaks (And How to Fix It)

CSV looks like the simplest format alive — commas and rows. In practice, quotes inside fields, embedded newlines, BOM bytes, Excel's encoding choices, and dialect drift turn it into the most-broken format in production.

~8 min read

CSV looks like the simplest format alive — commas separate fields, newlines separate rows. In practice, it's one of the most-broken formats in production. What happens when a field contains a comma? A quote? A newline? When Excel saves with a BOM? When a Korean Excel writes CP949 instead of UTF-8? This guide walks through RFC 4180, the common traps, and a debugging playbook.

The CSV "standard" — RFC 4180

Published in 2005 — informal, "the way most tools already work." Before that, every tool had its own rules. The essentials:

1. Each line = one record
2. Comma (,) separates fields
3. Newline = CRLF
4. Quote a field with " when it contains , or " or a newline
5. Inside quotes, escape " as ""

Example:
name,note,city
"Lee, Sumi","She said ""hi""","Seoul"
"Park","Multi
line text","Busan"

RFC 4180 is informal — Excel, Google Sheets, and library implementations vary subtly.

Edge case 1 — commas inside fields

Bad:
name,address
Kim Sumi,Seoul, Gangnam, 123      ← four fields?

RFC 4180 — quote it:
name,address
Kim Sumi,"Seoul, Gangnam, 123"     ← two fields ✓

A naive line.split(',') doesn't know about quoted commas. Always use an RFC 4180 parser. CSV ↔ JSON uses papaparse for that.

Edge case 2 — quotes inside quotes

Original: She said "hi"

CSV escape (RFC 4180):
name,quote
"Lee","She said ""hi"""

           ^^^^^^^^^^^^^
           outer "..." is the field boundary
           inner "" decodes to a single "

Some tools use \" escapes — not standard. CSV's only escape is "".

Edge case 3 — multi-line fields

name,description
"Yu","Line 1
Line 2
Line 3"
"Lee","Single line"

Two records, but the line count doesn't match. wc -l and similar line-based tools will mislead you — only a real parser gets it right.

Why multi-line fields are dangerous

  • Line-based tools (grep / sed / awk) confuse record boundaries
  • DB BULK INSERT operations read line-by-line and truncate records
  • Common in user inputs (addresses, descriptions) that contain newlines

Workaround — pre-process to replace newlines with \n tokens, or use JSONL for line-safe data.

Edge case 4 — the BOM (Byte Order Mark)

Excel for Windows prepends a UTF-8 BOM (EF BB BF) when saving CSV:

raw bytes:
EF BB BF 6E 61 6D 65 2C 76 61 6C 75 65 0A
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       BOM  "name,value\n"

Parser result:
The first column becomes "\uFEFFname" or "name"
                          ^ the BOM is now part of the column name

→ DB import: "Column 'name' not found"
→ Every row's first column is wrong

Fix:

  • BOM-aware parsers — papaparse strips BOM by default
  • Save without BOM (Excel "UTF-8 (no BOM)" option, or LibreOffice)
  • Shell — sed '1s/^\xEF\xBB\xBF//' file.csv

Edge case 5 — Korean Excel's CP949

Korean-locale Windows Excel defaults to CP949 encoding (a Microsoft superset of EUC-KR), not UTF-8. Tools written abroad assume UTF-8, then everything Korean becomes mojibake:

Korean Excel saves:
EC A0 A4 EC 9B 90  ← UTF-8 for "직원" if UTF-8 mode
B5 60 B0 F8        ← CP949 for "직원" (totally different bytes!)

UTF-8 parser reading CP949 bytes:
"직원" → "직���" (mojibake)

Fix:

  • Use Excel's "Save As" → "CSV UTF-8" (Excel 2016+)
  • Tell the parser the encoding — Python encoding="cp949", etc.
  • A UTF-8 BOM is a strong signal — keeping it can help downstream parsers

Edge case 6 — Excel's auto-conversion

Excel "helpfully" converts values when opening CSVs:

CSV:                Excel opens it as:
"0123",,,           → 123 (leading zero lost — postal codes broken)
"05/22",,,          → May 22 date (looked date-y, became a date)
"1.234e3",,,        → 1234 (scientific notation interpreted)
"01/02/2026",,,     → 2026-01-02 or 2026-02-01 (locale-dependent)
"=SUM(1+1)",,,      → 2 (formula executed! → CSV injection attack)

Defenses:

  • Prefix with a single quote ('0123) to force text
  • Switch to TSV (tab separator) — Excel auto-converts less aggressively
  • CSV injection — strip or escape cells starting with =, +, -, @ when exporting user input

CSV Injection

User input field:
=HYPERLINK("http://evil.com/", "Click me")

CSV export opens in Excel → creates a clickable phishing link.

=cmd|'/C calc'!A1
→ Older Excel versions launch calc.exe (historic)

OWASP guidance — prefix user-supplied cells starting with =, +, -, @ with a single quote.

Edge case 7 — header or no header?

With header:
name,age
Alice,30
Bob,25

Without header:
Alice,30
Bob,25

Same data, different interpretation. RFC 4180 makes headers optional and provides no in-file signal. Conventions:

  • All-string values + clearly different first row → header likely
  • Document via README or a magic header line
  • Force at the parser — papaparse header: true

CSV ↔ JSON exposes the first-row-as-header toggle.

Dialects — TSV / PSV / SSV / fixed-width

  • TSV (tab-separated) — tab separator. Tabs inside fields are rare, so quoting is uncommon
  • PSV (pipe-separated)| separator. Common from SQL dumps
  • SSV (semicolon); separator. Europe (because , is the decimal point)
  • Fixed-width — no separator, columns aligned by position. Legacy mainframe

German / French Excel defaults to ; — the same ".csv" file behaves differently per country.

Practical CSV hygiene

  1. Use a parser library — never split on ,. Python csv, Node papaparse, Java OpenCSV, Go encoding/csv
  2. Be explicit about encoding — UTF-8 default, UTF-8 BOM for Korean Excel compatibility
  3. Strip BOM — parser option or pre-process
  4. Defend against CSV injection — escape dangerous prefixes on export
  5. Consider JSON / JSONL when you need types, nesting, or per-line safety

Debugging — convert to JSON with CSV ↔ JSON to see exactly how each row parses.

Common pitfalls

1. line.split(',')

The classic anti-pattern. Ignores quoted commas, escapes, and BOMs.

2. Korean Excel's CP949 trap

Mojibake whenever a tool assumes UTF-8. Specify the encoding or switch to UTF-8 mode in Excel.

3. Excel losing leading zeros

037223722. Use ="03722" or pre-format the column as text.

4. Loading huge CSVs into memory

Loading a 100 MB CSV all at once → OOM. Use streaming parsers (papaparse step option, Python csv.reader iterator).

5. Trailing newline → phantom row

name,age
Alice,30
Bob,25
        ← blank line

Some parsers: 3 rows including an empty one
Others: 2 rows (trailing newline ignored)

References

Summary

  • CSV's "standard" is informal RFC 4180. Tools have subtle dialect differences.
  • Seven big traps — embedded commas, quoted quotes, multi-line fields, BOM, encoding (CP949 / UTF-8), Excel auto-conversion, header ambiguity.
  • CSV injection — cells starting with =, +, -, @ are Excel formulas. Escape on export.
  • Dialects — TSV / PSV / SSV (Europe) / fixed-width. Same ".csv" can mean different separators.
  • Hygiene — parser library + explicit encoding + BOM strip + injection escape + streaming for large files.
  • Reach for JSON / JSONL when types, nesting, or per-line safety matter.
  • Debug with CSV ↔ JSON — see how each row actually parses.
Back to guides