Why does git diff sometimes show two unrelated changes as one block?

The default Myers algorithm is greedy and aligns whatever matches first. When trivial lines like blank lines, closing braces, or single-character tokens appear many times in both files, Myers will happily match them across function boundaries and present what looks like one merged change. The fix is to switch to a patience or histogram diff (git diff --patience or git diff --histogram), which only anchor on unique lines and therefore keep semantically separate edits separate.

What is the difference between patience and histogram diff?

Patience diff (Bram Cohen, around 2009) anchors recursion only on lines that appear exactly once in each side, ignoring all other matches. Histogram diff (JGit team) is an extension that builds a frequency table of every line and prefers the rarest matches first, falling back gracefully when no unique lines exist. Histogram usually produces output that is at least as readable as patience and handles files full of repeated patterns more robustly, which is why git switched its default for many operations to --histogram.

When should I use `git diff --patience`?

Use patience or histogram when reviewing refactors that move code across the file, when comparing files with many repeated tokens (closing braces, blank lines, boilerplate imports), or when the default Myers output stitches together unrelated chunks. For day-to-day small edits the difference is invisible. You can set diff.algorithm = histogram in your git config to make it the default for all repos.

Does diff detect file renames?

The diff algorithm itself only compares two byte streams; it has no notion of files moving on disk. Rename and copy detection is a separate post-processing step in git that compares file similarity scores after the diff is generated. Enable it explicitly with git diff -M (rename detection) or -C (copy detection), or configure diff.renames = true in your git config to make it permanent. The similarity threshold is tunable (-M50% for example) to control how aggressively pairs of files are treated as a single rename.

How Diff Algorithms Work: Myers, Patience, Histogram — Why git diff Sometimes Looks Wrong

Q: Why is the diff algorithm not just LCS?

For two sequences, the longest common subsequence can be computed exactly with dynamic programming in O(n*m) time and space, where n and m are the file lengths. For large files this is too slow and uses too much memory, and the LCS-minimizing edit script is not always the most human-readable diff. Myers algorithm runs in O(N*D) where D is the edit distance — typically far smaller than n*m for similar files — and patience and histogram trade strict minimality for semantically clearer hunks.

Every developer reads diffs every day, and almost every developer at some point looks at one and thinks: "that is not what changed." The cursor jumped to the wrong block, two unrelated edits got smushed into one hunk, or a moved function appears as a deletion plus a not-quite-identical insertion. The reason is rarely a bug — it is the diff algorithm doing exactly what it was designed to do, on a file where its assumptions break down.

"Diff" is not one algorithm but a small family. Myers (1986) is the default in Unix diff and git. Patience (Cohen, around 2009) is available as git diff --patience. Histogram (JGit) is available as git diff --histogram and is the default for several internal git operations. Each one will produce different output for the same pair of files. Understanding which is which makes "the diff looks wrong" a five-second fix instead of a half-hour debugging session.

Side-by-side comparison of Myers diff (6 edits, blank lines jumping, semantic change invisible) versus Patience diff (same number of edits, semantic structure visible: farewell moved up, greet edited) — The same change — two functions swapped plus one edited line — as seen by Myers (left) and Patience/Histogram (right).

The Underlying Problem: LCS

The shared core of every diff algorithm is the longest common subsequence (LCS) problem: given two sequences A and B, find the longest sequence that appears in both, in order, but not necessarily contiguously. Once you have the LCS, the edit script falls out: every line that is in the LCS is unchanged; every line in A that is not in the LCS is a deletion; every line in B that is not in the LCS is an insertion.

For an arbitrary number of sequences, finding the LCS is NP-hard. For exactly two sequences, dynamic programming solves it in O(n*m) time and space, where n and m are the lengths. The recurrence is:

L[i][j] = 0                          if i == 0 or j == 0
L[i][j] = L[i-1][j-1] + 1            if A[i] == B[j]
L[i][j] = max(L[i-1][j], L[i][j-1]) otherwise

Filling that table for two 10,000-line files is 100 million operations and 100 million cells of memory. Doable, but slow enough that no real-world diff tool implements pure LCS. Worse, the LCS-minimizing edit script is not always the most human-readable one — two different edits with the same length can look very different, and "shortest" is rarely the criterion a reviewer cares about.

Myers Algorithm (1986)

Eugene Myers' 1986 paper "An O(ND) Difference Algorithm and Its Variations" gave the diff community its current default. The algorithm runs in O((n+m)*D) time, where D is the minimum number of inserts and deletes needed to transform A into B. For similar files D is tiny — a typical patch has dozens of edits across thousands of lines — and the algorithm is very fast in practice.

Conceptually, Myers walks an edit graph: nodes are (i, j) positions in A and B, and edges either move diagonally (matching a line, cost 0) or right/down (insertion or deletion, cost 1). Finding the shortest path from (0, 0) to (n, m) gives the minimal edit script. The cleverness is in how it expands the search frontier in diagonals, which avoids ever materializing the full O(n*m) table.

Myers is greedy and minimal-by-edit-count. That is both its strength and its weakness. Strength: small edits stay small. Weakness: when a file contains many trivially matching lines — blank lines, lines with only a closing brace, lines with only a comma — Myers will happily align them in whatever way produces the fewest edits, even if that means stitching together two unrelated changes into one weird-looking block.

Patience Diff (Bram Cohen, ~2009)

Bram Cohen, the inventor of BitTorrent, designed patience diff explicitly to fix the readability problem above. The insight is that not all matching lines are equally useful as anchors. A blank line might match in twenty places; a function signature usually matches in exactly one. Anchor on the unambiguous matches first.

The algorithm has four steps:

Find all lines that appear exactly once in each of A and B. These are the candidate anchors.
Find the longest common subsequence of those candidates — the largest set of anchor lines that appear in the same relative order on both sides.
The chosen anchors split each file into segments. Lines between anchors on A correspond to lines between the same anchors on B.
Recurse into each segment, applying patience again, or falling back to Myers if no unique lines exist inside a segment.

The result is hunks that almost always line up with semantic units: function boundaries, named declarations, distinct comment blocks. The cost is that patience diff is sometimes slightly longer in raw edit count than Myers — but for code review purposes the readability win is enormous. Enable it with git diff --patience or set diff.algorithm = patience in ~/.gitconfig.

Histogram Diff (JGit)

Histogram diff was developed by the JGit team for the Java-based git implementation. It is a refinement of patience: instead of restricting anchors to unique-on-both-sides lines, it builds a frequency table (the histogram) of every line in A and prefers the rarest matches first.

This handles two cases patience does not. First, when a line appears twice in one file but only once in the other, histogram can still use it as a low-priority anchor while patience ignores it entirely. Second, in files with no unique lines at all — generated code, boilerplate-heavy configs, JSON dumps — patience degrades to Myers immediately, whereas histogram still has a meaningful priority ordering.

In practice, git diff --histogram produces output that is at least as readable as patience for code and noticeably better for repetitive files. It is the default for several internal git operations (notably git merge's recursive resolution) and many teams set it as the global default:

git config --global diff.algorithm histogram

Minimal vs Readable: A Worked Example

Consider a tiny file change. The original file has six lines:

def greet(name):
    print("Hello, " + name)

def farewell(name):
    print("Goodbye, " + name)

After editing, the new file is:

def farewell(name):
    print("Goodbye, " + name)

def greet(name):
    print("Hi there, " + name)

Two functions swapped order, and one line of one function was edited. Myers — minimizing total edits — will often produce:

-def greet(name):
-    print("Hello, " + name)
-
 def farewell(name):
     print("Goodbye, " + name)
+
+def greet(name):
+    print("Hi there, " + name)

Three deletions and three insertions, six edits total. Minimal. But the change "swap two functions and edit one print" is not visible at all. The blank line jumped around for no reason.

Patience anchors on unique lines like def greet(name): and produces:

+def farewell(name):
+    print("Goodbye, " + name)
+
 def greet(name):
-    print("Hello, " + name)
-
-def farewell(name):
-    print("Goodbye, " + name)
+    print("Hi there, " + name)

The semantic structure — "we moved farewell up and edited the greet body" — is recognizable. Same files, same git, different algorithm.

Reading a Unified Diff: Hunk Headers

Once a diff is computed, it is rendered in unified format by default. The most cryptic part is the hunk header:

@@ -10,7 +10,8 @@ def greet(name):

Decoded: starting at line 10 in the old file, 7 lines are shown; starting at line 10 in the new file, 8 lines are shown. The trailing text after the second @@ is the "function context" — git heuristically picks a nearby header line that gives reviewers a landmark. The default 3 lines of context above and below changed lines can be tuned with -U0 (no context) or -U10 (more).

Side-by-side format is the same data laid out in two columns:

old.py                       | new.py
10  def greet(name):         | 10  def greet(name):
11      print("Hello, "+x)   | 11      print("Hi, "+x)
12                           | 12

The Text Diff Viewer defaults to side-by-side for readability, and lets you toggle to unified when copying patches into a commit or PR description.

Word-Level and Character-Level Diffs

Line-based diff is wrong for prose. A single typo in the middle of a paragraph rewrites the whole paragraph as one big deletion and insertion. Word-level diff splits each line into tokens (whitespace-separated) and runs the same LCS-style algorithm over those tokens. Character-level diff goes further and runs it over individual characters.

The classical references are the Hunt-McIlroy 1976 paper (the original diff implementation in Unix v7) and the Wu, Manber, Myers 1990 refinement. git diff --word-diff is the practical handle in git, with --word-diff-regex letting you customize what counts as a word — useful for languages where token boundaries are not whitespace.

Word-level diff is what review tools usually use to highlight which words changed inside an otherwise unchanged-looking line.

Whitespace Modes

Three flags matter when whitespace is part of the noise:

-w / --ignore-all-space: ignore all whitespace differences. Useful when comparing reformatted code.
-b / --ignore-space-change: ignore changes in whitespace amount but treat at-least-one-space as different from zero-spaces.
--ignore-blank-lines: ignore lines that are entirely whitespace.

These are display options applied after the diff algorithm runs, not modifications to the algorithm itself. A reformat that changes only whitespace will still cause every line to differ at the byte level; -w just tells the renderer not to show those lines. The same noise appears when a project does a mass rename — flipping identifiers between camelCase vs snake_case rewrites almost every line and buries the real logic change under a wall of red and green. Whitespace differences are also a common deployment-breaking class of bug — whitespace bugs that break production deploys covers the cases where invisible characters silently change strings and configuration values.

Rename and Binary Detection

Two operations that look like they belong to diff but are actually separate steps:

Rename detection. The diff algorithm only sees two byte streams. When git shows "renamed: a.py -> b.py with 12% changes," it ran the diff and then a separate pass that compares deleted-vs-added files by similarity score. Enable with git diff -M (default threshold 50%), tune with -M70%, or set diff.renames = true globally. Copy detection (-C) is the same idea for files copied without a delete.

Binary diffs. Plain LCS-based diff is useless on binary files — there are no meaningful lines. Tools like rsync and xdelta use rolling hashes (rabin-karp style) to find chunks that match between two binary files, producing a patch as a sequence of "copy bytes 1000-2000 from old" plus literal bytes. Git stores binary files as full snapshots with delta compression in the packfile rather than producing human-readable diffs.

Practical Recommendations

Set diff.algorithm = histogram in your ~/.gitconfig. Most teams that have tried it never go back to Myers.
Enable rename detection: diff.renames = true. A moved file with small edits is much more readable as a rename plus 3-line patch than as a full delete plus full add.
For reviewing reformats or whitespace-only changes, use -w to focus on real edits.
For prose, use --word-diff. For code, line-level with histogram is almost always right.
When two changes look stitched together in a hunk, do not blame the patch author — re-run with git diff --patience and the picture usually clears up.

For comparing arbitrary text outside git — pasted JSON snippets, config exports, log excerpts — the Text Diff Viewer runs a line-level diff in the browser with side-by-side and unified views, and highlights intra-line word changes so small edits inside long lines are visible at a glance.