Every developer reads diffs every day, and almost every developer at some point looks at one and thinks: "that is not what changed." The cursor jumped to the wrong block, two unrelated edits got smushed into one hunk, or a moved function appears as a deletion plus a not-quite-identical insertion. The reason is rarely a bug — it is the diff algorithm doing exactly what it was designed to do, on a file where its assumptions break down.
"Diff" is not one algorithm but a small family. Myers (1986) is the default in Unix diff and
git. Patience (Cohen, around 2009) is available as git diff --patience.
Histogram (JGit) is available as git diff --histogram and is the default for several
internal git operations. Each one will produce different output for the same pair of files. Understanding
which is which makes "the diff looks wrong" a five-second fix instead of a half-hour debugging
session.
The Underlying Problem: LCS
The shared core of every diff algorithm is the longest common subsequence (LCS) problem: given two sequences A and B, find the longest sequence that appears in both, in order, but not necessarily contiguously. Once you have the LCS, the edit script falls out: every line that is in the LCS is unchanged; every line in A that is not in the LCS is a deletion; every line in B that is not in the LCS is an insertion.
For an arbitrary number of sequences, finding the LCS is NP-hard. For exactly two sequences, dynamic programming solves it in O(n*m) time and space, where n and m are the lengths. The recurrence is:
L[i][j] = 0 if i == 0 or j == 0
L[i][j] = L[i-1][j-1] + 1 if A[i] == B[j]
L[i][j] = max(L[i-1][j], L[i][j-1]) otherwise Filling that table for two 10,000-line files is 100 million operations and 100 million cells of memory. Doable, but slow enough that no real-world diff tool implements pure LCS. Worse, the LCS-minimizing edit script is not always the most human-readable one — two different edits with the same length can look very different, and "shortest" is rarely the criterion a reviewer cares about.
Myers Algorithm (1986)
Eugene Myers' 1986 paper "An O(ND) Difference Algorithm and Its Variations" gave the diff community its current default. The algorithm runs in O((n+m)*D) time, where D is the minimum number of inserts and deletes needed to transform A into B. For similar files D is tiny — a typical patch has dozens of edits across thousands of lines — and the algorithm is very fast in practice.
Conceptually, Myers walks an edit graph: nodes are (i, j) positions in A and B,
and edges either move diagonally (matching a line, cost 0) or right/down (insertion or
deletion, cost 1). Finding the shortest path from (0, 0) to (n, m) gives the minimal edit script. The cleverness is in how it expands the search frontier in diagonals,
which avoids ever materializing the full O(n*m) table.
Myers is greedy and minimal-by-edit-count. That is both its strength and its weakness. Strength: small edits stay small. Weakness: when a file contains many trivially matching lines — blank lines, lines with only a closing brace, lines with only a comma — Myers will happily align them in whatever way produces the fewest edits, even if that means stitching together two unrelated changes into one weird-looking block.
Patience Diff (Bram Cohen, ~2009)
Bram Cohen, the inventor of BitTorrent, designed patience diff explicitly to fix the readability problem above. The insight is that not all matching lines are equally useful as anchors. A blank line might match in twenty places; a function signature usually matches in exactly one. Anchor on the unambiguous matches first.
The algorithm has four steps:
- Find all lines that appear exactly once in each of A and B. These are the candidate anchors.
- Find the longest common subsequence of those candidates — the largest set of anchor lines that appear in the same relative order on both sides.
- The chosen anchors split each file into segments. Lines between anchors on A correspond to lines between the same anchors on B.
- Recurse into each segment, applying patience again, or falling back to Myers if no unique lines exist inside a segment.
The result is hunks that almost always line up with semantic units: function boundaries,
named declarations, distinct comment blocks. The cost is that patience diff is sometimes
slightly longer in raw edit count than Myers — but for code review purposes the readability
win is enormous. Enable it with git diff --patience or set diff.algorithm = patience in ~/.gitconfig.
Histogram Diff (JGit)
Histogram diff was developed by the JGit team for the Java-based git implementation. It is a refinement of patience: instead of restricting anchors to unique-on-both-sides lines, it builds a frequency table (the histogram) of every line in A and prefers the rarest matches first.
This handles two cases patience does not. First, when a line appears twice in one file but only once in the other, histogram can still use it as a low-priority anchor while patience ignores it entirely. Second, in files with no unique lines at all — generated code, boilerplate-heavy configs, JSON dumps — patience degrades to Myers immediately, whereas histogram still has a meaningful priority ordering.
In practice, git diff --histogram produces output that is at least as readable
as patience for code and noticeably better for repetitive files. It is the default for
several internal git operations (notably git merge's recursive resolution) and
many teams set it as the global default:
git config --global diff.algorithm histogram Minimal vs Readable: A Worked Example
Consider a tiny file change. The original file has six lines:
def greet(name):
print("Hello, " + name)
def farewell(name):
print("Goodbye, " + name)
After editing, the new file is:
def farewell(name):
print("Goodbye, " + name)
def greet(name):
print("Hi there, " + name)
Two functions swapped order, and one line of one function was edited. Myers — minimizing total edits — will often produce:
-def greet(name):
- print("Hello, " + name)
-
def farewell(name):
print("Goodbye, " + name)
+
+def greet(name):
+ print("Hi there, " + name) Three deletions and three insertions, six edits total. Minimal. But the change "swap two functions and edit one print" is not visible at all. The blank line jumped around for no reason.
Patience anchors on unique lines like def greet(name): and produces:
+def farewell(name):
+ print("Goodbye, " + name)
+
def greet(name):
- print("Hello, " + name)
-
-def farewell(name):
- print("Goodbye, " + name)
+ print("Hi there, " + name) The semantic structure — "we moved farewell up and edited the greet body" — is recognizable. Same files, same git, different algorithm.
Reading a Unified Diff: Hunk Headers
Once a diff is computed, it is rendered in unified format by default. The most cryptic part is the hunk header:
@@ -10,7 +10,8 @@ def greet(name): Decoded: starting at line 10 in the old file, 7 lines are shown; starting at line 10 in the
new file, 8 lines are shown. The trailing text after the second @@ is the
"function context" — git heuristically picks a nearby header line that gives reviewers a
landmark. The default 3 lines of context above and below changed lines can be tuned with -U0 (no context) or -U10 (more).
Side-by-side format is the same data laid out in two columns:
old.py | new.py
10 def greet(name): | 10 def greet(name):
11 print("Hello, "+x) | 11 print("Hi, "+x)
12 | 12 The Text Diff Viewer defaults to side-by-side for readability, and lets you toggle to unified when copying patches into a commit or PR description.
Word-Level and Character-Level Diffs
Line-based diff is wrong for prose. A single typo in the middle of a paragraph rewrites the whole paragraph as one big deletion and insertion. Word-level diff splits each line into tokens (whitespace-separated) and runs the same LCS-style algorithm over those tokens. Character-level diff goes further and runs it over individual characters.
The classical references are the Hunt-McIlroy 1976 paper (the original diff implementation
in Unix v7) and the Wu, Manber, Myers 1990 refinement. git diff --word-diff is the practical handle in git, with --word-diff-regex letting you customize what
counts as a word — useful for languages where token boundaries are not whitespace.
Word-level diff is what review tools usually use to highlight which words changed inside an otherwise unchanged-looking line.
Whitespace Modes
Three flags matter when whitespace is part of the noise:
-w/--ignore-all-space: ignore all whitespace differences. Useful when comparing reformatted code.-b/--ignore-space-change: ignore changes in whitespace amount but treat at-least-one-space as different from zero-spaces.--ignore-blank-lines: ignore lines that are entirely whitespace.
These are display options applied after the diff algorithm runs, not modifications to the
algorithm itself. A reformat that changes only whitespace will still cause every line to
differ at the byte level; -w just tells the renderer not to show those lines.
Whitespace differences are also a common deployment-breaking class of bug — whitespace bugs that break production deploys covers the cases where invisible characters silently change strings and configuration values.
Rename and Binary Detection
Two operations that look like they belong to diff but are actually separate steps:
Rename detection. The diff algorithm only sees two byte streams. When git
shows "renamed: a.py -> b.py with 12% changes," it ran the diff and then a separate pass
that compares deleted-vs-added files by similarity score. Enable with git diff -M (default threshold 50%), tune with -M70%, or set diff.renames = true globally. Copy detection (-C) is the same idea
for files copied without a delete.
Binary diffs. Plain LCS-based diff is useless on binary files — there are no meaningful lines. Tools like rsync and xdelta use rolling hashes (rabin-karp style) to find chunks that match between two binary files, producing a patch as a sequence of "copy bytes 1000-2000 from old" plus literal bytes. Git stores binary files as full snapshots with delta compression in the packfile rather than producing human-readable diffs.
Practical Recommendations
- Set
diff.algorithm = histogramin your~/.gitconfig. Most teams that have tried it never go back to Myers. - Enable rename detection:
diff.renames = true. A moved file with small edits is much more readable as a rename plus 3-line patch than as a full delete plus full add. - For reviewing reformats or whitespace-only changes, use
-wto focus on real edits. - For prose, use
--word-diff. For code, line-level with histogram is almost always right. - When two changes look stitched together in a hunk, do not blame the patch author —
re-run with
git diff --patienceand the picture usually clears up.
For comparing arbitrary text outside git — pasted JSON snippets, config exports, log excerpts — the Text Diff Viewer runs a line-level diff in the browser with side-by-side and unified views, and highlights intra-line word changes so small edits inside long lines are visible at a glance.