How HTML Parsing Works

All of the following HTML renders fine:

<p>Hello world
<img src="cat.jpg">
<br/>
<BR>
<Br />
<DIV><span>nested</p></div>

Missing closing tags, mixed-case tags, a mismatched end tag, improper nesting. Parsed as XML, this snippet would stop at the first well-formedness error. Yet browsers render it anyway. How? This guide walks through the HTML5 parser state machine, its error recovery rules, void elements, and the quirks that make HTML tolerant.

HTML5's core principle — "never throw"

The HTML5 spec (a W3C Recommendation since 2014, now maintained as the WHATWG HTML Living Standard) defines parser behavior character by character, so any input produces a deterministic DOM tree. The parser never fails. Even "malformed" input has precise recovery rules written into the spec.

The intent: 30 years of accumulated HTML on the web must keep working. One typo shouldn't break a page. That's the social contract XHTML broke whenever it was actually parsed as XML.

Two phases — tokenizer + tree construction

HTML bytes
   ↓
1. Tokenizer       — character stream → token stream
   ↓
2. Tree builder    — token stream → DOM tree
   ↓
DOM

1. Tokenizer state machine

The HTML5 tokenizer is an 80+-state state machine. Each state transitions based on the next character:

Data state           ← normal text
   sees '<' → Tag open state
   anything else → accumulate text token

Tag open state
   '/' → End tag open state
   alpha → Tag name state
   '!' → Markup declaration open state (<!DOCTYPE / <!-- etc.)
   else → back to Data state (treat < as text)

Tag name state
   alpha → accumulate
   space → Before attribute name state
   '>' → emit token, back to Data state
   '/' → Self-closing start tag state

The full spec defines 80+ states with precise transitions. The result — the same HTML produces the same token stream in every browser.
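
The transitions listed above can be sketched as a toy tokenizer. This is a minimal illustration, not the spec algorithm: the state names, the token shape ({ type, name, data, selfClosing }) and the "skipAttributes" shortcut are assumptions made for this example, and comments, DOCTYPE, attributes, and character references are left out on purpose.

```javascript
// Toy tokenizer covering only a handful of the real states.
function tokenize(input) {
  const tokens = [];
  let state = "data";
  let text = "";   // pending character data
  let tag = null;  // tag token being built

  const isAlpha = (c) => /[a-zA-Z]/.test(c);
  const flushText = () => {
    if (text !== "") tokens.push({ type: "text", data: text });
    text = "";
  };
  const emitTag = () => {
    tokens.push(tag);
    tag = null;
    state = "data";
  };

  for (let i = 0; i < input.length; i++) {
    const c = input[i];
    switch (state) {
      case "data":
        if (c === "<") state = "tagOpen";
        else text += c;
        break;
      case "tagOpen":
        if (c === "/") {
          state = "endTagOpen";
        } else if (isAlpha(c)) {
          flushText();
          tag = { type: "startTag", name: c.toLowerCase(), selfClosing: false };
          state = "tagName";
        } else {
          // Not a tag after all: keep "<" as text and reprocess c as data.
          text += "<";
          state = "data";
          i--;
        }
        break;
      case "endTagOpen":
        if (isAlpha(c)) {
          flushText();
          tag = { type: "endTag", name: c.toLowerCase(), selfClosing: false };
          state = "tagName";
        } else {
          state = "data"; // toy shortcut: malformed end tags are dropped
        }
        break;
      case "tagName":
        if (c === ">") emitTag();
        else if (c === "/") state = "selfClosing";
        else if (/\s/.test(c)) state = "skipAttributes";
        else tag.name += c.toLowerCase();
        break;
      case "selfClosing":
        if (c === ">") {
          tag.selfClosing = true;
          emitTag();
        } else {
          state = "skipAttributes";
        }
        break;
      case "skipAttributes": // attributes are ignored in this sketch
        if (c === ">") emitTag();
        break;
    }
  }
  if (state === "tagOpen") text += "<"; // a trailing "<" is plain text
  flushText();
  return tokens;
}
```

Calling tokenize("1 < 2") yields a single text token, because the "<" is followed by a space rather than a letter, which is the "treat < as text" branch of the tag open state.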

Why `<br/>` and `<br>` are equivalent

HTML5's void elements (cannot have children):

area, base, br, col, embed, hr, img, input, link, meta,
param, source, track, wbr

When the parser sees a void element's start tag, it emits the element and closes immediately. The trailing / is ignored:

<br>      ← valid (HTML5)
<br/>     ← / ignored, valid
<br />    ← space + / ignored, valid
</br>     ← parse error, treated as <br> (end tags of other void elements, e.g. </img>, are ignored)

The / was kept for XHTML 1.0 compatibility. It has no effect — void elements close anyway.

Self-closing non-void elements

<div/>     ← HTML5: / ignored, <div> stays open
<span/>    ← same

HTML5 doesn't recognize self-close on non-void elements. In React JSX, <div /> produces an empty <div></div>, but the same syntax in raw HTML is a trap: the <div> stays open and swallows whatever follows.
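
The rule for HTML elements can be reduced to a lookup in the void-element list. The sketch below uses hypothetical names (VOID_ELEMENTS, staysOpen) and accepts the self-closing flag only to show that it plays no part in the decision. It covers HTML elements only; inside SVG and MathML content the flag is honored.

```javascript
// Void element names as listed in this guide.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "param", "source", "track", "wbr",
]);

// Returns true when an HTML element is still open after its start tag.
// selfClosing is deliberately unused: "/>" changes nothing here.
function staysOpen(tagName, selfClosing) {
  return !VOID_ELEMENTS.has(tagName.toLowerCase());
}
```

With this, staysOpen("br", false) and staysOpen("br", true) are both false, while staysOpen("div", true) is true, which is the trap described above.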

Error recovery — precise rules for malformed input

The spec defines recovery for every malformed case.

1. Missing closing tag

Input:  <p>First<p>Second
DOM:    <p>First</p><p>Second</p>

A new <p> implicitly closes the previous one. The spec rule: a <p> cannot be a child of another <p>.
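
A minimal sketch of that implied close, assuming a hypothetical token shape ({ type, name, data }) and a plain stack of open element names. The real tree builder also checks scope boundaries such as table cells and buttons, which this toy ignores.

```javascript
// Toy tree builder: handles only text, start tags, and end tags,
// and serializes the result instead of building DOM nodes.
function buildHtml(tokens) {
  const open = []; // stack of open element names
  let out = "";

  // Pops and closes elements until `name` has been closed.
  // Passing null closes everything that is still open.
  const closeUntil = (name) => {
    while (open.length > 0) {
      const top = open.pop();
      out += `</${top}>`;
      if (top === name) break;
    }
  };

  for (const t of tokens) {
    if (t.type === "text") {
      out += t.data;
    } else if (t.type === "startTag") {
      // A new <p> first closes any <p> that is still open.
      if (t.name === "p" && open.includes("p")) closeUntil("p");
      open.push(t.name);
      out += `<${t.name}>`;
    } else if (t.type === "endTag" && open.includes(t.name)) {
      closeUntil(t.name);
    }
  }
  closeUntil(null); // end of input closes whatever is left
  return out;
}
```

Feeding it the tokens for <p>First<p>Second produces the two sibling paragraphs shown in the DOM line above.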

2. Wrong nesting

Input: <b>Hello <i>world</b> friend</i>
DOM:   <b>Hello <i>world</i></b><i> friend</i>

The "adoption agency algorithm" — formatting elements like b/i/em/strong get rearranged when their nesting is broken. It exists for Netscape-era compatibility.

3. Content in the wrong section

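Input:
<html>
  <head></head>
  <title>Late title</title>
  <body>
    <p>text
  </body>
</html>

DOM:
  <html>
    <head>
      <title>Late title</title>  ← moved back into head
    </head>
    <body>
      <p>text</p>
    </body>
  </html>

This applies to head-only elements that show up between </head> and <body>. Once <body> is open, a stray <title> is not moved: it stays where it appears, inside the body.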

4. Table auto-recovery

Input: <table><div>weird</div></table>

DOM:
  <div>weird</div>      ← foster-parented out of the table
  <table></table>

"Foster parenting" — non-table content inside a table gets moved out in front of the table.

Case-insensitive tags

<DIV>  → <div>
<BR>   → <br>
<Img>  → <img>

The tokenizer normalizes tag and attribute names to lowercase. Attribute values keep their case (so <a href="HTTP://EXAMPLE.COM"> keeps the original URL).

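Inside SVG / MathML, names end up mixed-case in the DOM: the tokenizer still lowercases them, but the tree builder then restores the canonical spelling for a fixed list of names, so <foreignobject> inside <svg> becomes <foreignObject>.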
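
The name handling can be sketched in two steps: lowercase everything, then, for SVG content, look the lowercase name up in a fix-up table. The function and table names below are hypothetical, and the table holds only three entries of the longer list in the spec.

```javascript
// Small excerpt of the SVG tag-name fix-ups; the real table is longer.
const SVG_TAG_FIXUPS = new Map([
  ["foreignobject", "foreignObject"],
  ["clippath", "clipPath"],
  ["lineargradient", "linearGradient"],
]);

function normalizeTagName(rawName, inSvg) {
  const lower = rawName.toLowerCase();       // tokenizer step
  if (!inSvg) return lower;
  return SVG_TAG_FIXUPS.get(lower) || lower; // tree-builder step
}
```

So normalizeTagName("DIV", false) gives "div", and normalizeTagName("FOREIGNOBJECT", true) gives "foreignObject".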

HTML entity decoding

The tokenizer decodes entities to actual characters:

&amp;     → &
&lt;      → <
&gt;      → >
&quot;    → "
&#65;     → A (decimal codepoint)
&#x41;    → A (hex codepoint)
&Aring;   → Å (named entity, 2200+ defined)

Try it — HTML Entity Encode / Decode for both directions.
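
A rough sketch of the decoding step, assuming a tiny hypothetical lookup table (NAMED) in place of the full list of named references. It handles only references that end in a semicolon and skips the spec's special cases, such as legacy names without a semicolon and invalid code points.

```javascript
// Only a handful of the named references, for illustration.
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"', Aring: "\u00C5" };

function decodeEntities(text) {
  const pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;
  return text.replace(pattern, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Toy behavior: leave out-of-range references untouched.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left as written.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}
```

The replacement runs in a single pass, so "&amp;lt;" decodes to the text "&lt;" and is not decoded a second time.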

Why escapes in attributes can still be dangerous

<a href="javascript:alert(1)">click</a>           ← raw, XSS
<a href="&#106;avascript:alert(1)">click</a>      ← entity-encoded, still XSS

Reason: the parser decodes entities before the browser looks at the URL scheme, so the encoded form still resolves to "javascript:".

Defense: validate URL schemes on the server against an allowlist, and run the check on the decoded attribute value.
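
One way to sketch such an allowlist check in server-side JavaScript uses the standard URL class. The allowed schemes, the function name isSafeHref, and the default base URL are assumptions for this example, and the href passed in is expected to be the attribute value after entity decoding.

```javascript
// Schemes this hypothetical application is willing to link to.
const ALLOWED_PROTOCOLS = new Set(["http:", "https:", "mailto:"]);

// `href` is the decoded attribute value; `base` resolves relative URLs.
function isSafeHref(href, base = "https://example.com/") {
  let url;
  try {
    url = new URL(href, base);
  } catch (err) {
    return false; // unparseable input is rejected
  }
  return ALLOWED_PROTOCOLS.has(url.protocol);
}
```

Checking the parsed protocol rather than matching the raw string with a regular expression means the scheme is compared in its normalized, lowercase form.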

Script and style special handling

Inside <script>, the tokenizer pauses normal HTML parsing. Most characters in JavaScript stay as raw text:

<script>
  const x = "<div>"; // fine — < is not a tag start here
</script>

But the literal text </script> closes the block:

<script>
  const x = "</script>"; // ← script ends here!
                          // the rest is parsed as HTML
</script>

Fix — escape:

const x = "<\/script>";  // or "<\u002fscript>"
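
When a value is generated rather than typed by hand, the same escape can be applied mechanically. The helper below is a sketch with a hypothetical name (toInlineScriptLiteral). It assumes the value is JSON-serializable and rewrites every "<" so that no "</script>" sequence can appear in the output.

```javascript
// Serializes a value as a literal that is safe to place inside an
// inline <script> block. "\u003c" is the JS/JSON escape for "<".
function toInlineScriptLiteral(value) {
  return JSON.stringify(value).replace(/</g, "\\u003c");
}
```

For the string "</script>" this returns "\u003c/script>" (with the quotes), which the JavaScript engine reads back as the original text while the HTML tokenizer never sees a closing tag.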

DOCTYPE — standards vs quirks mode

A <!DOCTYPE html> at the top selects standards mode. Omitting it triggers quirks mode — 1990s-era behavior (different box model, table layout, etc.).

Quirks-mode symptoms:

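Old Internet Explorer sized boxes the IE 5 way (width includes padding and border, like box-sizing: border-box); current browsers keep content-box even in quirks mode
The gap below images inside table cells differs
Font sizes are not inherited into tables, so em-based sizes resolve differently there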

Modern sites always start with <!DOCTYPE html>.
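
To check which mode a page ended up in, browsers expose document.compatMode, which is "BackCompat" in quirks mode and "CSS1Compat" otherwise. The small wrapper below is only a sketch: the function name is hypothetical, and it takes a document-like object so it can also run outside a browser.

```javascript
// Reports the rendering mode of a document-like object.
// In a browser, call renderingMode(document).
function renderingMode(doc) {
  return doc.compatMode === "BackCompat" ? "quirks" : "standards";
}
```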

The XHTML cautionary tale

In the early 2000s, XHTML 1.0/1.1 tried to make HTML strict XML. Any malformed input was a fatal error.

<p>open paragraph   ← XHTML: parse error, blank screen
<br>                ← XHTML error (must be <br/>)
<DIV>               ← XHTML error (must be lowercase)

Result — one typo broke the page. Developers and CMS users revolted. WHATWG (Mozilla/Apple/Opera) started HTML5 in 2004 with "compatibility + tolerant parsing." HTML5 was standardized in 2014. XHTML is history.

HTML formatters ride on top of the parser

Formatters reuse the parser:

Parse the HTML → DOM
Pretty-print the DOM with indentation rules

Even malformed input gets cleaned up because the parser already recovered into a valid DOM.

That's what HTML Formatter does. HTML → Markdown follows the same path — parse → DOM → emit markdown.
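
The pretty-print step can be sketched as a recursive walk over the recovered tree. The node shape here ({ name, children }, with plain strings as text nodes) and the function name are assumptions for illustration; a real formatter also has to handle attributes and avoid adding whitespace where it would change inline rendering.

```javascript
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "param", "source", "track", "wbr",
]);

// Serializes a simplified tree with two-space indentation.
function prettyPrint(node, depth = 0) {
  const pad = "  ".repeat(depth);
  if (typeof node === "string") return pad + node + "\n"; // text node
  if (VOID_TAGS.has(node.name)) return `${pad}<${node.name}>\n`; // no end tag
  const children = (node.children || [])
    .map((child) => prettyPrint(child, depth + 1))
    .join("");
  return `${pad}<${node.name}>\n${children}${pad}</${node.name}>\n`;
}
```

Because the input is already a tree, the output is balanced by construction, whatever the original markup looked like.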

Common pitfalls

1. The `</script>` literal inside `<script>`

Already covered — JS code with a closing tag literal ends the block early.

2. Missing attribute quotes

<a href=https://example.com>click</a>
<!-- Works -->

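<a href=https://example.com/path?q=v>click</a>
<!-- Also works: ? and = are allowed inside an unquoted value -->

<a href=https://example.com/search?q=hello world>click</a>
<!-- The value ends at the space, and "world" becomes a separate attribute. Broken. -->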

3. Forgetting `&` in text

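<p>Tom & Jerry</p>
<!-- Parses fine: & followed by a space does not start an entity -->

<p>cut &copy paste</p>
<!-- A few legacy names decode even without the semicolon: renders as "cut © paste" -->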

<p>Tom &amp; Jerry</p>  ← safe

4. `&` in URLs

<a href="page.php?a=1&b=2">  ← works only because "b" is not an entity name; text such as &lt; in the URL would be decoded

<a href="page.php?a=1&amp;b=2">  ← correct

5. innerHTML XSS

element.innerHTML = userInput;
// If userInput is "<img src=x onerror=alert(1)>", XSS.

element.textContent = userInput;  // safe (skips HTML parsing)
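
When the markup has to be built as a string (for example, in a server-side template), the usual fallback is to escape the five significant characters before concatenating. This is a minimal sketch with hypothetical names, suitable for text content and quoted attribute values only. It does not make a value safe as a URL, as script source, or inside an unquoted attribute.

```javascript
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Escapes text for use in element content or a quoted attribute value.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}
```

The same helper covers the earlier pitfalls: escapeHtml("Tom & Jerry") gives "Tom &amp; Jerry", and escapeHtml("page.php?a=1&b=2") gives the attribute-safe "page.php?a=1&amp;b=2".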

Summary

HTML5 parser = "never throw." Every malformed input has a deterministic recovery in the spec.
Two phases — tokenizer (80+ state machine) → tree builder (DOM).
Void elements (br/img/hr/input ...) cannot have children. The / in <br/> is ignored.
Error recovery handles missing close tags, wrong nesting, and misplaced content (such as a <title> between </head> and <body>) automatically.
Tag and attribute names are lowercased by the tokenizer. Inside SVG / MathML, the tree builder restores the canonical mixed-case names.
DOCTYPE chooses standards vs quirks mode. Always <!DOCTYPE html>.
XHTML's strict approach failed because the web's social contract is "don't break my page on one typo." HTML5 won by embracing tolerant parsing.
Try it: HTML Formatter / HTML Entity Encode / Decode / HTML → Markdown.