All of the following HTML renders fine:
<p>Hello world
<img src="cat.jpg">
<br/>
<BR>
<Br />
<DIV><span>nested</p></div>
Missing closing tags, mixed-case tags, stray end tags, improper nesting. In XML, most of these would be fatal well-formedness errors. Yet browsers render them anyway. How? This guide walks through the HTML5 parser state machine, its error recovery rules, void elements, and the quirks that make HTML tolerant.
HTML5's core principle — "never throw"
The HTML5 spec (a W3C Recommendation since 2014) defines parser behavior character by character: any input produces a deterministic DOM tree. The parser never fails. Even "malformed" input has precise recovery rules written into the spec.
The intent — 30 years of accumulated HTML on the web must keep working. One typo shouldn't break a page. That's the social contract XHTML 1.0 broke.
Two phases — tokenizer + tree construction
HTML bytes
↓
1. Tokenizer — character stream → token stream
↓
2. Tree builder — token stream → DOM tree
↓
DOM
1. Tokenizer state machine
The HTML5 tokenizer is a state machine with 80+ states. Each state transitions based on the next character:
Data state ← normal text
sees '<' → Tag open state
anything else → accumulate text token
Tag open state
'/' → End tag open state
alpha → Tag name state
'!' → Markup declaration open state (<!DOCTYPE / <!-- etc.)
else → back to Data state (treat < as text)
Tag name state
alpha → accumulate
space → Before attribute name state
'>' → emit token, back to Data state
'/' → Self-closing start tag state
The full spec defines 80+ states with precise transitions. The result: the same HTML produces the same token stream in every browser.
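The transitions above can be sketched in a few lines of JavaScript. This is a simplified illustration under stated assumptions, not the spec algorithm: the function name `tokenize`, the token shape, and the `skipAttributes` state are invented for this example. Attributes are skipped rather than parsed, malformed end tags are dropped, and comments, DOCTYPE, and character references are left out.

```javascript
// Minimal sketch of the Data, Tag open, End tag open, and Tag name states.
function tokenize(input) {
  const tokens = [];
  let state = "data";
  let text = ""; // pending character data
  let tag = null; // tag token under construction

  const flushText = () => {
    if (text !== "") tokens.push({ type: "text", data: text });
    text = "";
  };
  const emitTag = () => {
    flushText();
    tokens.push(tag);
    tag = null;
    state = "data";
  };

  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    switch (state) {
      case "data":
        if (ch === "<") state = "tagOpen";
        else text += ch;
        break;
      case "tagOpen":
        if (ch === "/") {
          state = "endTagOpen";
        } else if (/[A-Za-z]/.test(ch)) {
          tag = { type: "startTag", name: ch.toLowerCase(), selfClosing: false };
          state = "tagName";
        } else {
          text += "<"; // not a tag after all: treat "<" as text
          state = "data";
          i--; // reconsume the current character in the data state
        }
        break;
      case "endTagOpen":
        if (/[A-Za-z]/.test(ch)) {
          tag = { type: "endTag", name: ch.toLowerCase(), selfClosing: false };
          state = "tagName";
        } else {
          state = "data"; // simplification: drop malformed end tags
        }
        break;
      case "tagName":
        if (/\s/.test(ch)) state = "skipAttributes";
        else if (ch === "/") state = "selfClosing";
        else if (ch === ">") emitTag();
        else tag.name += ch.toLowerCase();
        break;
      case "skipAttributes": // simplification: attributes are not parsed
        if (ch === "/") state = "selfClosing";
        else if (ch === ">") emitTag();
        break;
      case "selfClosing":
        if (ch === ">") {
          tag.selfClosing = true;
          emitTag();
        } else {
          state = "skipAttributes";
          i--; // reconsume
        }
        break;
    }
  }
  if (state === "tagOpen") text += "<"; // input ended right after "<"
  flushText();
  return tokens;
}
```

Calling `tokenize("<DIV>a<Br />b</div>")` yields five tokens: a `div` start tag, the text `a`, a `br` start tag with the self-closing flag set, the text `b`, and a `div` end tag.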
Why <br/> and <br> are equivalent
HTML5's void elements (cannot have children):
area, base, br, col, embed, hr, img, input, link, meta,
param, source, track, wbr
When the parser sees a void element's start tag, it inserts the element and immediately closes it. The trailing / is ignored:
<br> ← valid (HTML5)
<br/> ← / ignored, valid
<br /> ← space + / ignored, valid
</br> ← parse error, but treated as <br> (a special case in the spec)
The / was kept for XHTML 1.0 compatibility. It has no effect, because void elements close anyway. End tags for other void elements, such as </img>, are ignored.
Self-closing non-void elements
<div/> ← HTML5: / ignored, <div> stays open
<span/> ← same
HTML5 doesn't recognize self-closing syntax on non-void elements. React JSX's <div /> renders as <div></div>, but in raw HTML the same markup leaves the element open, which is a trap.
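A small sketch of that rule, assuming a hypothetical token shape `{ name, selfClosing }` and a helper name `staysOpenAfterStartTag` invented for illustration. It covers HTML elements only; SVG and MathML elements follow different rules.

```javascript
// Void element names, as listed in this guide.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "param", "source", "track", "wbr",
]);

// Returns true when the element remains open after its start tag.
// The selfClosing flag is deliberately ignored: only the name decides.
function staysOpenAfterStartTag(token) {
  return !VOID_ELEMENTS.has(token.name);
}
```

With this helper, `<br>` and `<br/>` both return `false`, while `<div/>` returns `true` even though it carries the self-closing flag.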
Error recovery — precise rules for malformed input
The spec defines recovery for every malformed case.
1. Missing closing tag
Input: <p>First<p>Second
DOM: <p>First</p><p>Second</p>
A new <p> implicitly closes the previous one. The spec rule: a <p> cannot be a child of another <p>.
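The implied close can be sketched with a stack of open elements. This is a simplified illustration, not the spec's tree construction algorithm: `buildTree`, `serialize`, and the token and node shapes are hypothetical, and the real rule also checks scoping conditions that are omitted here.

```javascript
// Tokens use a hypothetical shape: { type: "startTag" | "endTag", name }
// or { type: "text", data }.
function buildTree(tokens) {
  const root = { name: "#root", children: [] };
  const stack = [root]; // stack of open elements
  for (const token of tokens) {
    if (token.type === "text") {
      stack[stack.length - 1].children.push(token.data);
    } else if (token.type === "startTag") {
      if (token.name === "p") {
        // Implied end tag: close any <p> that is still open.
        const open = stack.map((el) => el.name).lastIndexOf("p");
        if (open !== -1) stack.length = open;
      }
      const element = { name: token.name, children: [] };
      stack[stack.length - 1].children.push(element);
      stack.push(element);
    } else if (token.type === "endTag") {
      // Pop up to and including the matching open element; ignore stray end tags.
      const open = stack.map((el) => el.name).lastIndexOf(token.name);
      if (open !== -1) stack.length = open;
    }
  }
  return root;
}

// Serializes the sketch tree back to markup with explicit end tags.
function serialize(node) {
  if (typeof node === "string") return node;
  const inner = node.children.map(serialize).join("");
  return node.name === "#root" ? inner : `<${node.name}>${inner}</${node.name}>`;
}
```

Feeding it the tokens for `<p>First<p>Second` and serializing the result gives `<p>First</p><p>Second</p>`.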
2. Wrong nesting
Input: <b>Hello <i>world</b> friend</i>
DOM: <b>Hello <i>world</i></b><i> friend</i>
This is the "adoption agency algorithm": formatting elements like b/i/em/strong get rearranged when their nesting is broken. It exists for compatibility with Netscape-era pages.
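3. Content in the wrong section
Input:
<html>
<head></head>
<title>Late title</title>
<body>
<p>text
</body>
</html>
DOM:
<html>
<head>
<title>Late title</title> ← auto-moved into head, although </head> was already seen
</head>
<body>
<p>text</p>
</body>
</html>
This relocation only happens between </head> and <body>. A <title> that appears after <body> has started stays in the body where it was written.
4. Table auto-recovery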
Input: <table><div>weird</div></table>
DOM:
<div>weird</div> ← foster-parented out of the table
<table></table>
This is "foster parenting": non-table content inside a table gets moved out in front of the table.
Case-insensitive tags
<DIV> → <div>
<BR> → <br>
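<Img> → <img>
The tokenizer normalizes tag and attribute names to lowercase. Attribute values keep their case (so <a href="HTTP://EXAMPLE.COM"> keeps the original URL).
Inside SVG / MathML, names end up mixed-case in the DOM: the tokenizer still lowercases them, and the tree builder then restores the canonical casing of names such as foreignObject from a fixed table in the spec.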
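A rough sketch of that name handling, assuming a hypothetical helper `adjustTagName` and only four entries from the spec's SVG adjustment table. In the HTML parser, names are lowercased first, and SVG names with canonical mixed casing are then restored by lookup.

```javascript
// Partial lookup table: lowercased name → canonical SVG name.
const SVG_TAG_NAMES = new Map([
  ["foreignobject", "foreignObject"],
  ["clippath", "clipPath"],
  ["lineargradient", "linearGradient"],
  ["textpath", "textPath"],
]);

// rawName is the tag name as written in the source; inSvg says whether
// the element is being inserted as SVG content.
function adjustTagName(rawName, inSvg) {
  const lower = rawName.toLowerCase();
  return inSvg ? SVG_TAG_NAMES.get(lower) || lower : lower;
}
```

So `adjustTagName("DIV", false)` gives `div`, and `adjustTagName("FOREIGNOBJECT", true)` gives `foreignObject`.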
HTML entity decoding
The tokenizer decodes entities to actual characters:
&amp; → &
&lt; → <
&gt; → >
&quot; → "
&#65; → A (decimal codepoint)
&#x41; → A (hex codepoint)
&Aring; → Å (named entity, 2200+ defined)
Try it: HTML Entity Encode / Decode for both directions.
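Both directions can be sketched in JavaScript. The helper names `decodeEntities` and `escapeHtml` are hypothetical, and the named-entity table holds only the entries shown in this section. Legacy forms without a trailing semicolon and the spec's remapping of certain numeric references are not handled.

```javascript
// Tiny subset of the named character references.
const NAMED = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["Aring", "Å"],
]);

// Decodes &name;, &#123; and &#x7B; forms; unknown names are left as written.
function decodeEntities(text) {
  const pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;
  return text.replace(pattern, (match, body) => {
    if (body[0] !== "#") {
      return NAMED.has(body) ? NAMED.get(body) : match;
    }
    const isHex = body[1] === "x" || body[1] === "X";
    const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
    // Null, out-of-range, and surrogate code points become U+FFFD.
    const invalid = code === 0 || code > 0x10ffff || (code >= 0xd800 && code <= 0xdfff);
    return invalid ? "\uFFFD" : String.fromCodePoint(code);
  });
}

// Encodes the characters that are significant in text and attribute values.
// "&" must be replaced first so later replacements are not double-encoded.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

Escaping untrusted text with a function like `escapeHtml` before placing it into markup keeps characters such as `<` and `&` from being interpreted by the parser.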
Why escapes in attributes can still be dangerous
<a href="javascript:alert(1)">click</a> ← raw, XSS
<a href="&#106;avascript:alert(1)">click</a> ← entity-encoded, still XSS
Reason: entity decoding happens before scheme checking.
"javascript:" passes through.
Defense: validate URL schemes on the server with an allowlist for attribute values.
Script and style special handling
Inside <script>, the tokenizer pauses normal HTML parsing. Most characters in JavaScript stay as raw text:
<script>
const x = "<div>"; // fine — < is not a tag start here
</script>
But the literal text </script> closes the block:
<script>
const x = "</script>"; // ← script ends here!
// the rest is parsed as HTML
</script>
Fix: escape the slash.
const x = "<\/script>"; // or "<\u002fscript>"
DOCTYPE: standards vs quirks mode
A <!DOCTYPE html> at the top selects standards mode. Omitting it triggers quirks mode — 1990s-era behavior (different box model, table layout, etc.).
Quirks-mode symptoms:
- box-sizing effectively defaults to border-box in legacy Internet Explorer (IE 5 compatibility)
- Image bottom-margin inside table cells differs
- The base for 1em in font-size differs
Modern sites always start with <!DOCTYPE html>.
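To check which mode a page ended up in, the standard `document.compatMode` property can be read. The wrapper below is a small sketch; the function name `renderingMode` and its return strings are invented for this example.

```javascript
// document.compatMode is "BackCompat" in quirks mode and "CSS1Compat" otherwise.
function renderingMode(doc) {
  return doc.compatMode === "BackCompat" ? "quirks" : "standards";
}
```

In a browser console, `renderingMode(document)` reports the mode of the current page.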
The XHTML cautionary tale
In the early 2000s, XHTML 1.0/1.1 tried to make HTML strict XML. Any malformed input was a fatal error.
<p>open paragraph
<!-- XHTML: parse error, blank screen
<br> ← XHTML error (must be <br/>)
<DIV> ← XHTML error (must be lowercase) -->
Result: one typo broke the page. Developers and CMS users revolted. WHATWG (Mozilla/Apple/Opera) started HTML5 in 2004 with "compatibility + tolerant parsing." HTML5 was standardized in 2014. XHTML is history.
HTML formatters ride on top of the parser
Formatters reuse the parser:
- Parse the HTML → DOM
- Pretty-print the DOM with indentation rules
Even malformed input gets cleaned up because the parser already recovered into a valid DOM.
That's what HTML Formatter does. HTML → Markdown follows the same path — parse → DOM → emit markdown.
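The second step can be sketched as a recursive walk over an already-parsed tree. This is only an illustration: the node shape `{ name, children }` and the function `prettyPrint` are hypothetical, and attributes, void elements, and whitespace-sensitive elements such as pre are ignored.

```javascript
// Nodes are either strings (text) or { name, children } objects.
function prettyPrint(node, depth = 0) {
  const pad = "  ".repeat(depth);
  if (typeof node === "string") return pad + node.trim() + "\n";
  const inner = node.children.map((child) => prettyPrint(child, depth + 1)).join("");
  return `${pad}<${node.name}>\n${inner}${pad}</${node.name}>\n`;
}
```

Because the walk starts from a tree rather than from the raw source, every element it prints has a matching end tag, whatever the original markup looked like.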
Common pitfalls
1. The </script> literal inside <script>
Already covered — JS code with a closing tag literal ends the block early.
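When data is embedded in an inline script as JSON, one common precaution is to escape every `<` in the serialized output so the closing-tag text can never appear. A minimal sketch, with the hypothetical helper name `jsonForInlineScript`:

```javascript
// Serializes a JSON-compatible value for use inside an inline <script>.
// Every "<" becomes the escape sequence \u003c, which JavaScript and JSON
// both read back as "<" inside a string literal.
function jsonForInlineScript(value) {
  return JSON.stringify(value).replace(/</g, "\\u003c");
}
```

The output still parses to the original value, but the tokenizer never sees a literal `</script` inside the block.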
2. Missing attribute quotes
<a href=https://example.com>click</a>
<!-- Works -->
<a href=https://example.com/path?q=a b>click</a>
<!-- The value ends at the space; b becomes a separate attribute. Broken. -->
3. Forgetting & in text
<p>Tom & Jerry</p>
<!-- Sometimes parsed as the start of an entity, broken render -->
<p>Tom &amp; Jerry</p> ← safe
4. & in URLs
<a href="page.php?a=1&b=2"> ← &b starts to look like an entity
<a href="page.php?a=1&amp;b=2"> ← correct
5. innerHTML XSS
element.innerHTML = userInput;
// If userInput is "<img src=x onerror=alert(1)>", XSS.
element.textContent = userInput; ← safe (skips HTML parsing)
Summary
- HTML5 parser = "never throw." Every malformed input has a deterministic recovery in the spec.
- Two phases — tokenizer (80+ state machine) → tree builder (DOM).
- Void elements (br/img/hr/input ...) cannot have children. The / in <br/> is ignored.
- Error recovery handles missing close tags, wrong nesting, and misplaced content automatically.
- Tag and attribute names are case-insensitive. Inside SVG / MathML, the parser restores canonical mixed-case names.
- DOCTYPE chooses standards vs quirks mode. Always start with <!DOCTYPE html>.
- XHTML's strict approach failed because the web's social contract is "don't break my page on one typo." HTML5 won by embracing tolerant parsing.
- Try it: HTML Formatter / HTML Entity Encode / Decode / HTML → Markdown.