How do document diff algorithms work?

Accepted answer
Score: 38

Generally speaking, diffing is usually done by solving the longest common subsequence (LCS) problem. Also see the "Algorithm" section of the Wikipedia article on diff:

The operation of diff is based on solving the longest common subsequence problem.

In this problem, you have two sequences of items:

   a b c d f g h j q z

   a b c d e f g i j k r x y z

and you want to find the longest sequence of items that is present in both original sequences in the same order. That is, you want to find a new sequence which can be obtained from the first sequence by deleting some items, and from the second sequence by deleting other items. You also want this sequence to be as long as possible. In this case it is

   a b c d f g j z
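
To make the definition concrete, here is a minimal sketch of the textbook dynamic-programming approach to LCS, run on the two sequences above. The function name `longest_common_subsequence` and the variables `old` and `new` are my own choices for illustration; this is not the code any particular diff tool uses.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the sequences a and b."""
    n, m = len(a), len(b)
    # length[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])

    # Walk the table from the start to recover the items themselves.
    result = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1  # skipping a[i] loses nothing
        else:
            j += 1  # skipping b[j] loses nothing
    return result


old = "a b c d f g h j q z".split()
new = "a b c d e f g i j k r x y z".split()
print(" ".join(longest_common_subsequence(old, new)))  # a b c d f g j z
```

The table needs time and memory proportional to the product of the two lengths, which is fine for short inputs but is one reason real tools use more refined algorithms.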

From the longest common subsequence it's only a small step to get diff-like output:

   e   h i   q   k r x y 
   +   - +   -   + + + +
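
That "small step" can be sketched as follows: walk both sequences in parallel, and whatever sits between two consecutive common items is a deletion (if it came from the first sequence) or an addition (if it came from the second). The helper name `diff_from_lcs` is hypothetical, and the LCS is passed in as a literal taken from the example above rather than computed.

```python
def diff_from_lcs(a, b, common):
    """Return (marker, item) pairs: '-' for items only in a, '+' for items only in b.

    `common` must be a common subsequence of a and b.
    """
    ops = []
    i = j = 0
    for item in common:
        # Items of a before the next common item were deleted.
        while a[i] != item:
            ops.append(("-", a[i]))
            i += 1
        # Items of b before the next common item were added.
        while b[j] != item:
            ops.append(("+", b[j]))
            j += 1
        i += 1
        j += 1
    # Anything left after the last common item is a trailing delete or add.
    ops.extend(("-", x) for x in a[i:])
    ops.extend(("+", x) for x in b[j:])
    return ops


old = "a b c d f g h j q z".split()
new = "a b c d e f g i j k r x y z".split()
common = "a b c d f g j z".split()

ops = diff_from_lcs(old, new, common)
print(" ".join(item for _, item in ops))      # e h i q k r x y
print(" ".join(marker for marker, _ in ops))  # + - + - + + + +
```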

That said, this all works fine with text-based documents. Since Word documents are effectively in a binary format and include lots of formatting information and data, this will be far more complex. Ideally, you could look into automating Word itself, as it has the ability to "diff" between documents, as detailed here:

Microsoft Word Tip: How to compare two documents for differences

Score: 15

A diff is essentially just a solution to the longest common sub-sequence problem.

The optimal solution requires knowledge of dynamic programming, so it's a fairly complex problem to solve.

However, it can also be done by constructing a suffix tree. Both algorithms are outlined here.

Score: 3

As Ben S indicated, the differencing problem can be addressed generally by solving the longest common subsequence problem. More specifically, the Hunt-McIlroy algorithm is one of the classic algorithms that have been applied to the problem (e.g., in the implementation of Unix's diff utility).

Score: 2

One of the most efficient solutions to the LCS problem is Myers' O(ND) algorithm, where N is the combined length of the two inputs and D is the size of the smallest edit script. Here is an algorithmic approach that I used to implement a diff for Office 2007 documents. Link to algorithm paper
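
For a feel of how Myers' O(ND) algorithm works, here is a rough sketch of its greedy core. It only computes D, the number of insertions and deletions in a shortest edit script, and it is not taken from the Office 2007 implementation mentioned above. The function name `shortest_edit_distance` is my own.

```python
def shortest_edit_distance(a, b):
    """Return the minimum number of insertions plus deletions turning a into b."""
    n, m = len(a), len(b)
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]      # step down: take an item from b (insertion)
            else:
                x = v[k - 1] + 1  # step right: drop an item from a (deletion)
            y = x - k
            # Follow the "snake": matching items cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m


old = "a b c d f g h j q z".split()
new = "a b c d e f g i j k r x y z".split()
print(shortest_edit_distance(old, new))  # 8
```

The LCS length follows from D as (len(a) + len(b) - D) / 2. A real diff tool would also remember the path taken so it can print the edit script, which this sketch leaves out.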
