How Diff Algorithms Work: The Engine Behind Every Code Review
You use git diff every day, but how does it actually decide
which lines were added, removed, or moved? Here's what's happening under the hood.
Every code review starts with a diff. Green lines added, red lines removed. It looks simple, but the algorithm deciding which lines to mark as changed and how to group them is solving a genuinely hard computer science problem.
Understanding how diff works helps you write better commits, review code faster, and troubleshoot confusing merge conflicts.
1. The Core Problem: Longest Common Subsequence
At its heart, a diff algorithm is finding the Longest Common Subsequence (LCS) between two sequences (files). The LCS is the longest set of lines that appear in both files in the same order, without needing to be contiguous.
Consider two versions of a file:
Version A: Version B:
1. import foo 1. import foo
2. import bar 2. import baz
3. 3. import bar
4. function x() 4.
5. return 1 5. function x()
6. } 6. return 2
7. }
The LCS here is: import foo, import bar, function x(),
}. Everything not in the LCS is either an insertion or a deletion. The diff
algorithm's job is to find this LCS as efficiently as possible.
2. The Myers Algorithm: Git's Default
Eugene Myers published his diff algorithm in 1986, and it's still the default in Git. It works by exploring an "edit graph" where moving right represents deleting a line from file A, moving down represents inserting a line from file B, and moving diagonally represents a matching line (no change).
The algorithm finds the shortest edit script: the minimum number of insertions and deletions needed to transform file A into file B. This is what makes Git diffs feel "minimal"—they show the fewest changes possible.
Why "Minimal" Matters
There are often multiple valid diffs for the same change. Imagine you add a blank line between two functions. Should the diff show the blank line added after the first function's closing brace, or before the second function's opening line? Myers picks the shortest path, which usually produces the most intuitive result.
3. Patience Diff: Better for Code
Myers is optimal for minimizing edit distance, but it can produce confusing results when large blocks of code are moved or when files have many similar lines (like closing braces in C-style languages).
Patience diff takes a different approach:
- Find all lines that appear exactly once in both files. These are "anchor" lines—they're almost certainly the same logical line.
- Match these unique lines first to establish a skeleton.
- Recursively diff the gaps between anchors using the standard LCS approach.
The result is diffs that align on semantically meaningful lines (function signatures, class definitions) rather than on braces and blank lines. You can enable it in Git:
git diff --diff-algorithm=patience
4. Unified vs. Side-by-Side: Choosing a View
The algorithm produces a list of changes. How you display them is a separate design decision.
Unified Diff
The classic format you see in git diff output and GitHub PRs. Both files are
interleaved in a single column, with - for removed lines and + for
added lines.
import foo
-import bar
+import baz
+import bar
function x()
- return 1
+ return 2
}
Best for: Small, focused changes. Quick scanning. Terminal output. Patch files.
Side-by-Side Diff
The old file and new file are displayed in parallel columns. Changed lines are aligned horizontally so you can compare them directly.
Best for: Large refactors. Comparing config files. Reviewing changes where context matters (seeing what a line was alongside what it became).
Inline Highlighting
The most useful enhancement to either view is word-level or character-level highlighting within changed lines. Instead of marking the entire line as changed, the diff highlights only the specific characters that differ. This is invaluable when a line has a small typo fix buried in a long string.
5. Writing Diff-Friendly Code
Understanding how diffs work helps you write code that produces cleaner code reviews:
- Trailing commas: In arrays and objects, trailing commas mean adding an item only shows one added line, not a modified line plus an added line.
- One import per line: Sorting imports alphabetically and putting each on its own line produces minimal diffs when adding or removing dependencies.
- Separate refactoring from logic changes: A commit that renames a variable across 50 files makes it impossible to spot the one-line bug fix buried inside it. Split them into separate commits.
- Avoid reformatting in the same commit: Running a code formatter on an entire file in the same commit as a bug fix creates noise. Format first, then fix.
6. Beyond Text: Structured Diffing
Traditional diff algorithms operate on lines of text. They don't understand that your file is JSON, or HTML, or a programming language with syntax. This leads to diffs that are technically correct but semantically confusing.
Structured diff tools parse the file into an AST (Abstract Syntax Tree) and compare the trees instead of the text. This means:
- Reordering JSON keys doesn't show as a deletion + insertion.
- Moving a function to a different location shows as a "move," not a delete + add.
- Changing indentation style doesn't generate any diff at all.
While these tools are still maturing, they represent the future of code review.
Conclusion
Diff algorithms are a fascinating intersection of theoretical computer science and practical developer tooling. Myers gives you minimal diffs, Patience gives you semantic diffs, and the display format you choose affects how quickly reviewers can understand your changes.
Need to compare two files or API responses? Use our Diff Checker to see changes highlighted side-by-side or in unified view, with character-level diffing.