The red/green lines in a Git PR, the conflict markers in your IDE, and the +/- lines from the diff command all answer the same question: "what's different between these two texts, expressed as the fewest edits?" Getting that answer is surprisingly hard. This guide walks through the core idea (LCS / Myers), why moved lines show up as a separate - and +, the unified vs context patch formats, and how word-level intra-line diffs work.
The core problem — Longest Common Subsequence (LCS)
Given two sequences A and B, find the longest subsequence common to both. "Subsequence" preserves order but doesn't require contiguity.
A = A B C B D A B
B = B D C A B A
LCS = B C B A (or B D A B)
length = 4Once you have the LCS, the diff falls out:
- chars in the LCS → unchanged (context)
- chars only in A → deletions (
-) - chars only in B → insertions (
+)
Diff's goal of "minimum edit distance" and the goal of "maximum LCS" are the same problem.
Naive LCS — O(N × M) dynamic programming
The textbook DP. Fill an N × M table:
dp[i][j] = LCS length of A[0..i], B[0..j]
if A[i] == B[j]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])Time and space both O(N × M). Two 10,000-line files = 100 million cells — not practical for Linux-kernel-scale diffs.
Myers algorithm — the de-facto standard (1986)
Eugene Myers' 1986 paper. Time O(ND) — N = sum of input lengths, D = edit distance. Fewer changes = faster. Real commits usually have D ≪ N (a few dozen changed lines).
The trick — shortest path in an edit graph:
characters in B →
┌──────────────
chars │
in A │ at (i, j):
│ right = insert B[j] (+)
↓ │ down = delete A[i] (-)
│ diagonal = chars match (free)Shortest path from (0, 0) to (N, M) = minimum edit script. Diagonals are free (cost 0), so the BFS prefers them. Myers tracks only the furthest reach at each D step → O(D²) memory or better.
Git / Mercurial / GNU diffutils / VS Code all use Myers (often with heuristics on top). Text Diff uses the same family of algorithms.
Why moved lines look like a - and a +
Myers' base output doesn't model moves. There's no single op for "line 5 moved to line 50." Instead:
-line 5 (deleted)
...
+line 5 (re-added at new position)Move detection is a post-process — Git's --color-moved scans the resulting diff for matching -/+ pairs and colors them.
Unified vs context — two patch formats
Unified diff (modern standard)
--- a/file.txt
+++ b/file.txt
@@ -3,5 +3,6 @@
unchanged line 3
-removed line 4
+added line 4
+added line 5
unchanged line 6
unchanged line 7
unchanged line 8--- / +++— file paths@@ -3,5 +3,6 @@— hunk header: "from old line 3, 5 lines; from new line 3, 6 lines"-/+lines, plus 3 lines of context by default
Git diff, GitHub PRs, and emailed patches all use unified.
Context diff (1985, legacy)
*** file.txt.orig
--- file.txt.new
***************
*** 3,7 ****
unchanged line 3
! removed line 4
unchanged line 6
unchanged line 7
unchanged line 8
--- 3,8 ----
unchanged line 3
! added line 4
! added line 5
unchanged line 6
unchanged line 7
unchanged line 8Separate old/new blocks per region. Slightly more readable for humans but much longer. Rarely used today.
Applying patches
A patch file is an applicable transformation. patch or git apply use the hunk header line numbers and context lines to relocate the change. Partial context matches trigger fuzzy matching; misses go to a .rej file.
Diff Patch produces a unified patch from two texts. Apply with the system patch or git apply.
Granularity — line, word, char
The natural unit is a line. But when only a few words on a line change, a line diff shows the whole line as -/+— hard to read.
Solution — diff again inside the changed line, at word or char level:
Line diff:
- The quick brown fox jumps
+ The quick red fox runs
Intra-line word diff:
The quick [brown→red] fox [jumps→runs]That's what GitHub PR highlighting does. Text Diff supports word / char modes with intra-line highlighting.
JSON diff — structure-aware
Text diff doesn't understand JSON. {"a":1,"b":2} and {"b":2,"a":1} show as different lines but are semantically identical.
JSON diff walks the tree:
- Object key order is ignored
- Changed paths are emitted (e.g.
users[2].email) - Type changes (number → string) are flagged
JSON Diff does exactly this. Useful for comparing API responses or schema versions.
Post-processing — heuristics
Raw Myers output isn't always the most readable diff. Git layers heuristics on top:
- indent heuristic — groups lines by indent so the next function start lands at a natural boundary
- compaction — moves blank lines or repeated patterns up or down to align with semantic units
- patience diff — Bram Cohen's algorithm. Anchors on uniquely matching lines, then LCS only between anchors. Works better on code with similar function definitions
Git exposes diff.algorithm: myers (default) / patience / histogram / minimal.
Common pitfalls
1. CRLF vs LF line endings
Windows \r\n vs Unix \n — a file from the other side shows every line as changed. Configure git config core.autocrlf or .gitattributes.
2. Trailing whitespace
Tab/space mixing or trailing spaces produce false diffs. diff -w / git diff -w ignores whitespace. Text Diff has the same ignoreWhitespace option.
3. Very large files
Myers is O(ND), but if the two files are completely different D ≈ N → O(N²). Git falls back to hash-based fast diff or "binary" treatment beyond a threshold.
4. Minified code diffs
Minified JS can be one line of thousands of chars. Line diff shows the whole line changed — meaningless. Either diff the source pre-minification or use char-level diff.
5. Binary files
Diff algorithms assume text. PDF / image / DB dump byte diffs are noise. Git auto-detects (NULL bytes / magic numbers) and just reports "Binary files differ."
References
- Myers, E. (1986). "An O(ND) Difference Algorithm" — original PDF
- GNU diffutils unified format — manual
- Patience diff — Bram Cohen's post
- Git diff algorithm options — git-scm.com
Summary
- Diff's core problem is LCS — maximum LCS = minimum edit script.
- Naive DP is O(N × M) — too slow at scale. Myers O(ND), 1986, has been the standard since.
- Default unit is a line. Moves decompose into - and +; post-process heuristics rejoin them visually.
- Unified diff is the modern format. Context diff is legacy.
- Intra-line word/char diff is what makes GitHub PRs readable.
- Structured data (JSON) deserves a tree diff, not a text diff.
- Try it: Text Diff / JSON Diff / Diff Patch.