Skip to content
yutils

How JSON Parsing Works

Inside the JSON parser — lexer states, why JSON forbids trailing commas, why large integers lose precision (IEEE 754), how streaming JSON works, and the parser quirks that surprise senior engineers.

~9 min read

JSON's grammar fits on a single page. Yet writing a parser forces design decisions that explain its real-world quirks: why trailing commas are forbidden, why JSON.parse('{"id":9999999999999999}') silently returns 10000000000000000, why streaming parsers exist. This guide walks through the parser internals and the surprises seasoned engineers still hit.

The JSON grammar — five value types

value  := object | array | string | number | true | false | null
object := { "key" : value , "key" : value ... }
array  := [ value , value ... ]
string := " (escaped chars) "
number := -? digits ( .digits )? ( [eE] [+-]? digits )?

That's the whole spec. No functions, variables, comments, or trailing commas. Simplicity is the point — a parser fits in ~100 lines.

Two stages — lexer → parser

Most JSON parsers split into two passes:

1. Lexer (tokenizer) — characters → tokens

input: {"id": 42, "name": "Yu"}

tokens:
  LBRACE      {
  STRING      "id"
  COLON       :
  NUMBER      42
  COMMA       ,
  STRING      "name"
  COLON       :
  STRING      "Yu"
  RBRACE      }

The lexer skips whitespace and groups characters into meaningful tokens via a state machine:

  • See " → STRING state until the next "
  • See a digit or - → NUMBER state
  • See t/f/n → keyword (true/false/null) state
  • Anything else → error

2. Parser — tokens → tree

Recursive descent is the typical choice:

parseValue() {
  switch (peek()) {
    case LBRACE: return parseObject();
    case LBRACKET: return parseArray();
    case STRING: return consume().value;
    case NUMBER: return parseNumber(consume());
    case "true": consume(); return true;
    ...
  }
}

parseObject() {
  expect(LBRACE);
  while (peek() !== RBRACE) {
    const key = expect(STRING).value;
    expect(COLON);
    obj[key] = parseValue();
    if (peek() === COMMA) consume();
    else break;     // ← branch where trailing comma policy lives
  }
  expect(RBRACE);
  return obj;
}

Why trailing commas are forbidden

{"a": 1, "b": 2,}    ← JSON error
[1, 2, 3,]           ← JSON error

When Douglas Crockford standardized JSON in RFC 4627 (2006), he chose "minimal grammar." Trailing commas would:

  • Add a parser branch — parseObject() would need an extra check for RBRACE right after COMMA
  • Risk producing accidentally-empty trailing entries in some serializers (Python [1,2,] is length 2 vs length 3 confusion)
  • Trade DX for spec simplicity — and simplicity won

The cost shows up in diffs — adding a line means modifying the previous line to add a comma. JavaScript, Python, Go, and Rust all allow trailing commas. JSON's refusal is the most-frequent complaint.

Workarounds — JSON5 / JSONC allow trailing commas and comments. tsconfig.json is JSONC. Strict JSON still forbids them.

The precision bomb — IEEE 754

The spec has no upper bound on number magnitude:

{"id": 9999999999999999}   ← valid per spec

But JavaScript's JSON.parse() returns a Number — IEEE 754 double precision. Safe integers only up to ±2^53 (9,007,199,254,740,992). Beyond that, precision is lost:

JSON.parse('{"id":9999999999999999}').id
// 10000000000000000  ← off by 1

Number.MAX_SAFE_INTEGER
// 9007199254740991  (= 2^53 - 1)

Twitter's API was bitten early — its snowflake IDs are 64-bit and JavaScript clients silently lost the low digits. Fix: ship IDs as strings:

// Bad
{"id": 1234567890123456789}

// Good
{"id": "1234567890123456789"}

Other languages:

  • Pythonjson.loads() handles arbitrarily large ints. No precision loss.
  • Gojson.Unmarshal defaults to float64. Use json.Number to preserve precision.
  • Java — Jackson supports BigInteger / BigDecimal.

See it in action — feed a large integer to JSON Formatter / Validator and the tree view shows the precision loss immediately.

BigInt meets JSON

JavaScript got BigInt in 2020. But JSON.stringify(123n) throws — the spec doesn't define BigInt serialization.

Workaround — patch toJSON or use a reviver:

BigInt.prototype.toJSON = function() { return this.toString(); };

JSON.stringify({id: 1234567890123456789n});
// '{"id":"1234567890123456789"}'

String escapes — quiet traps

The escapes JSON strings allow:

\"    " (double quote)
\\    \ (backslash)
\/    / (slash, optional)
\b    backspace
\f    form feed
\n    newline
\r    carriage return
\t    tab
\uXXXX  Unicode codepoint (4 hex digits)

For codepoints > U+FFFF, JSON uses UTF-16 surrogate pairs (e.g. "🎉" → 🎉). Some parsers accept unpaired surrogates and emit invalid UTF-8 — a security risk when the output crosses trust boundaries.

Duplicate keys

{"a": 1, "a": 2}    ← valid?

RFC 8259 says key names "should" be unique but doesn't require it. Most parsers:

  • Take the last value (JavaScript / Python / Go)
  • Take the first (some older parsers)
  • Preserve all as an array (CouchDB and friends)

Security implication — if a proxy uses one parser and the API uses another with opposite duplicate-key behavior, you've got an auth-bypass primitive. Be deliberate at trust boundaries.

Streaming JSON — memory matters

JSON.parse() reads the entire string in one shot. A 1 GB JSON file needs 1 GB+ of memory. Lambda / Cloud Function limits get hit fast.

Options — streaming parsers emit tokens via callbacks:

  • SAX-style — onObjectStart / onKey / onValue callbacks. The caller builds the structure they actually need.
  • JSONPath streaming — extract only a specific path. Process items in a huge array one at a time. JSONStream (Node) / ijson (Python).
  • JSON Lines (JSONL / NDJSON) — one JSON object per line. Line-by-line streaming is natural. Standard for logs and analytics:
{"user": "alice", "ts": 1700000000}
{"user": "bob", "ts": 1700000001}
{"user": "carol", "ts": 1700000002}
// Each line is its own JSON. No need to load the whole file.

MongoDB EJSON — adding types back

BSON (MongoDB's binary format) has ObjectId, Date, Decimal128 — types JSON doesn't model. MongoDB Extended JSON wraps them in marker objects:

{
  "_id": { "$oid": "507f1f77bcf86cd799439011" },
  "created": { "$date": "2026-05-22T00:00:00Z" },
  "price": { "$numberDecimal": "19.99" }
}

MongoDB Extended JSON recognizes the 16 wrapper types. The tree view in JSON Formatter / Validator also auto- detects EJSON when the toggle is on.

Common pitfalls

1. JSON.stringify and undefined

JSON.stringify({a: undefined, b: 1})  // '{"b":1}'  ← a dropped
JSON.stringify([undefined])           // '[null]'    ← coerced to null
JSON.stringify(undefined)             // undefined   ← the function itself returns undefined

2. NaN / Infinity

JSON.stringify({x: NaN})       // '{"x":null}'
JSON.stringify({x: Infinity}) // '{"x":null}'

JSON has no representation for NaN or Infinity. They round-trip as null, silently. Preserve them as strings if you need to.

3. Circular references

const a = {};
a.self = a;
JSON.stringify(a);  // TypeError: Converting circular structure to JSON

4. Date's automatic toJSON

JSON.stringify({d: new Date()})
// '{"d":"2026-05-22T05:30:00.000Z"}'
// Date's toJSON() returns an ISO 8601 string

// But parsing doesn't restore it
JSON.parse('{"d":"2026-05-22T05:30:00.000Z"}').d
// "2026-05-22T05:30:00.000Z" (still a string!)

Restore with a reviver:

JSON.parse(str, (k, v) =>
  typeof v === "string" && /^\d{4}-\d{2}-\d{2}T/.test(v)
    ? new Date(v) : v);

5. Large-input typing freeze

JSON.parse is synchronous and blocks the main thread. For large inputs, JSON Formatter / Validator debounces past 4 KB (the same pattern from PR #77).

References

Summary

  • JSON has 5 value types + object/array. A working parser is ~100 lines.
  • Lexer (state machine) → Parser (recursive descent). Two passes.
  • No trailing commas — a deliberate spec-simplicity trade. Use JSON5/JSONC for files that need them.
  • Numbers go through IEEE 754 double. Integers past ±2^53 lose precision. Send IDs as strings.
  • Duplicate keys are undefined behavior — parser-dependent. Plan for it at trust boundaries.
  • Streaming = JSONL / NDJSON. Don't read 1 GB in one parse.
  • EJSON layers Date / ObjectId / Decimal128 on top of JSON.
  • Try it — JSON Formatter / Validator / JSON Path / MongoDB Extended JSON / JSON → TypeScript / JSON Schema Generator.
Back to guides