[ACCEPTED]-Java library for free-text diff-diff

Accepted answer
Score: 17

This one might be good Diff Match Patch.

0

Score: 8

Depending on your exact requirements, the 1 StringUtils class of the Apache Commons Lang component might be helpful, e.g.:

  • StringUtils#difference: Compares two Strings, and returns the portion where they differ
  • StringUtils#getLevenshteinDistance: Find the Levenshtein distance between two Strings
Score: 1

Here's a (lightly-tested) version of code 15 that does what you asked. You can easily 14 traverse the result in parallel with the 13 inputs to locate insertions and deletions.

public class StringDiff {

    private static int   length(String s) { return s == null ? 0 : s.length(); }
    private static char[] chars(String s) { return s == null ? new char[0] : s.toCharArray(); }

    private final String left;
    private final String right;

    private final char[] lccs;
    private final String lcs;

    public StringDiff(String left, String right) {
        this.left = left;
        this.right = right;
        lccs = init();
        lcs = new String(lccs);
    }

    public String getLcs()  { return lcs; }
    public char[] getLccs() { return lccs.clone(); }

    private char[] init() {
        int lLength = length(left);
        int rLength = length(right);
        char[] lChars = chars(left);
        char[] rChars = chars(right);
        int [][] t = new int [lLength + 1][rLength + 1];
        for (int i = lLength - 1; i >= 0; --i) {
            for (int j = rLength - 1; j >= 0; --j) {
                if (lChars[i] == rChars[j]) {
                    t[i][j] = t[i + 1][j + 1] + 1;
                } else {
                    t[i][j] = Math.max(t[i + 1][j], t[i][j + 1]);
                }
            }
        }
        char[] result = new char[t[0][0]];
        int l = 0, r = 0, p = 0;
        while (l < lLength && r < rLength) {
            if (lChars[l] == rChars[r]) {
                result[p++] = lChars[l++];
                r++;
            } else {
                if (t[l + 1][r] > t[l][r + 1]) {
                    ++l;
                } else {
                    ++r;
                }
            }
        }
        return result;
    }

}

According 12 to it, the actual longest subsequence of 11 your original inputs:

The quick brown  fox jumped over the  lazy     dog.
The quick yellow fox jumped over the well-bred dog.

is:

The quick ow fox jumped over the l dog.

(because "brown" and 10 "yellow" have "ow" in common, etc.)

It's 9 relatively straightforward to modify the 8 above to split on whitespace (instead of 7 into char arrays) and substitute String#equals 6 for == to get a version that finds the longest 5 common subsequence of words instead of characters. For 4 your example above that change would produce 3 the obvious result:

found 7 words
    'The'
    'quick'
    'fox'
    'jumped'
    'over'
    'the'
    'dog.'

(Your question implied 2 character comparisons, as you matched the 1 spaces between words.)

Score: 0

If you're example is really what you want 5 to do - ie subsequences only match if they 4 start at the same index (which is different 3 from how diffs normally operate) - this 2 is all you need to do:

import java.util.*;

class StringDiff {
    public static List<int[]> from(String s1, String s2) {
        int start = -1;
        int pos = 0;
        LinkedList<int[]> list = new LinkedList<int[]>();

        for(; pos < s1.length() && pos < s2.length(); ++pos) {
            if(s1.charAt(pos) == s2.charAt(pos)) {
                if(start < 0) start = pos;
            }
            else {
                if(start >= 0) list.add(new int[] { start, pos });
                start = -1;
            }
        }

        if(start >= 0) list.add(new int[] { start, pos });

        return list;
    }

    public static void main(String[] args) {
        for(int[] idx : from(args[0], args[1]))
            System.out.println(args[0].substring(idx[0], idx[1]));
    }
}

An actual diff implementation 1 will be far more sophisticated.

More Related questions