How Diff Algorithms Find Changes

The red/green lines in a Git PR, the conflict markers in your IDE, and the +/- lines from the diff command all answer the same question: "what's different between these two texts, expressed as the fewest edits?" Getting that answer is surprisingly hard. This guide walks through the core idea (LCS / Myers), why moved lines show up as a separate - and +, the unified vs context patch formats, and how word-level intra-line diffs work.

The core problem — Longest Common Subsequence (LCS)

Given two sequences A and B, find the longest subsequence common to both. "Subsequence" preserves order but doesn't require contiguity.

A = A B C B D A B
B = B D C A B A

LCS = B C B A    (or B D A B)
length = 4

Once you have the LCS, the diff falls out:

chars in the LCS → unchanged (context)
chars only in A → deletions (-)
chars only in B → insertions (+)

Diff's goal of "minimum edit distance" and the goal of "maximum LCS" are the same problem.

Naive LCS — O(N × M) dynamic programming

The textbook DP. Fill an N × M table:

dp[i][j] = LCS length of A[0..i], B[0..j]

if A[i] == B[j]:
  dp[i][j] = dp[i-1][j-1] + 1
else:
  dp[i][j] = max(dp[i-1][j], dp[i][j-1])

Time and space both O(N × M). Two 10,000-line files = 100 million cells — not practical for Linux-kernel-scale diffs.

Myers algorithm — the de-facto standard (1986)

Eugene Myers' 1986 paper. Time O(ND) — N = sum of input lengths, D = edit distance. Fewer changes = faster. Real commits usually have D ≪ N (a few dozen changed lines).

The trick — shortest path in an edit graph:

         characters in B →
        ┌──────────────
chars  │
in A   │   at (i, j):
       │     right    = insert B[j]   (+)
↓      │     down     = delete A[i]   (-)
       │     diagonal = chars match   (free)

Shortest path from (0, 0) to (N, M) = minimum edit script. Diagonals are free (cost 0), so the BFS prefers them. Myers tracks only the furthest reach at each D step → O(D²) memory or better.

Git / Mercurial / GNU diffutils / VS Code all use Myers (often with heuristics on top). Text Diff uses the same family of algorithms.

Why moved lines look like a - and a +

Myers' base output doesn't model moves. There's no single op for "line 5 moved to line 50." Instead:

-line 5 (deleted)
...
+line 5 (re-added at new position)

Move detection is a post-process — Git's --color-moved scans the resulting diff for matching -/+ pairs and colors them.

Unified vs context — two patch formats

Unified diff (modern standard)

--- a/file.txt
+++ b/file.txt
@@ -3,5 +3,6 @@
 unchanged line 3
-removed line 4
+added line 4
+added line 5
 unchanged line 6
 unchanged line 7
 unchanged line 8

--- / +++ — file paths
@@ -3,5 +3,6 @@ — hunk header: "from old line 3, 5 lines; from new line 3, 6 lines"
- / + lines, plus 3 lines of context by default

Git diff, GitHub PRs, and emailed patches all use unified.

Context diff (1985, legacy)

*** file.txt.orig
--- file.txt.new
***************
*** 3,7 ****
  unchanged line 3
! removed line 4
  unchanged line 6
  unchanged line 7
  unchanged line 8
--- 3,8 ----
  unchanged line 3
! added line 4
! added line 5
  unchanged line 6
  unchanged line 7
  unchanged line 8

Separate old/new blocks per region. Slightly more readable for humans but much longer. Rarely used today.

Applying patches

A patch file is an applicable transformation. patch or git apply use the hunk header line numbers and context lines to relocate the change. Partial context matches trigger fuzzy matching; misses go to a .rej file.

Diff Patch produces a unified patch from two texts. Apply with the system patch or git apply.

Granularity — line, word, char

The natural unit is a line. But when only a few words on a line change, a line diff shows the whole line as -/+— hard to read.

Solution — diff again inside the changed line, at word or char level:

Line diff:
- The quick brown fox jumps
+ The quick red fox runs

Intra-line word diff:
The quick [brown→red] fox [jumps→runs]

That's what GitHub PR highlighting does. Text Diff supports word / char modes with intra-line highlighting.

JSON diff — structure-aware

Text diff doesn't understand JSON. {"a":1,"b":2} and {"b":2,"a":1} show as different lines but are semantically identical.

JSON diff walks the tree:

Object key order is ignored
Changed paths are emitted (e.g. users[2].email)
Type changes (number → string) are flagged

JSON Diff does exactly this. Useful for comparing API responses or schema versions.

Post-processing — heuristics

Raw Myers output isn't always the most readable diff. Git layers heuristics on top:

indent heuristic — groups lines by indent so the next function start lands at a natural boundary
compaction — moves blank lines or repeated patterns up or down to align with semantic units
patience diff — Bram Cohen's algorithm. Anchors on uniquely matching lines, then LCS only between anchors. Works better on code with similar function definitions

Git exposes diff.algorithm: myers (default) / patience / histogram / minimal.

Common pitfalls

1. CRLF vs LF line endings

Windows \r\n vs Unix \n — a file from the other side shows every line as changed. Configure git config core.autocrlf or .gitattributes.

2. Trailing whitespace

Tab/space mixing or trailing spaces produce false diffs. diff -w / git diff -w ignores whitespace. Text Diff has the same ignoreWhitespace option.

3. Very large files

Myers is O(ND), but if the two files are completely different D ≈ N → O(N²). Git falls back to hash-based fast diff or "binary" treatment beyond a threshold.

4. Minified code diffs

Minified JS can be one line of thousands of chars. Line diff shows the whole line changed — meaningless. Either diff the source pre-minification or use char-level diff.

5. Binary files

Diff algorithms assume text. PDF / image / DB dump byte diffs are noise. Git auto-detects (NULL bytes / magic numbers) and just reports "Binary files differ."

References

Myers, E. (1986). "An O(ND) Difference Algorithm" — original PDF
GNU diffutils unified format — manual
Patience diff — Bram Cohen's post
Git diff algorithm options — git-scm.com

Summary

Diff's core problem is LCS — maximum LCS = minimum edit script.
Naive DP is O(N × M) — too slow at scale. Myers O(ND), 1986, has been the standard since.
Default unit is a line. Moves decompose into - and +; post-process heuristics rejoin them visually.
Unified diff is the modern format. Context diff is legacy.
Intra-line word/char diff is what makes GitHub PRs readable.
Structured data (JSON) deserves a tree diff, not a text diff.
Try it: Text Diff / JSON Diff / Diff Patch.