Sign in to follow this  
soconne

writing my own text diff code

Recommended Posts

I'm in the process of writing my own diff tool for comparing text files side by side. I've managed to implement the algorithm described in this paper. http://www.xmailserver.org/diff2.pdf It basically works by using Dijakstra's shortest path algorithm to trace a path through both strings while minimizing the number of inserts and deletes. I currently have it working just fine for single lines of text, but I cannot seem to figure out how to efficiently apply it to lines of text from a file. Specifically, figuring out which lines of text in a file match the lines in the second file and vice versa. I'm sure my current method for comparing a single string would NOT be optimal for an entire file. Does anybody have any suggestions?

Share this post


Link to post
Share on other sites
If you can do it on a single line, you should be able to treat two files as sets of individual lines and match them up, right?

-- Keep an iterator i of the line you are currently looking at...
-- If A[i].matches(B[i]) then keep incrementing i
-- If you get a mismatch on line i, increase the index on one of the files (call it B[j]) until A[i].matches(B[j])... everything between B[i-1] and B[j] would be inserted.
-- Etc...

Share this post


Link to post
Share on other sites
Quote:
Original post by smitty1276
If you can do it on a single line, you should be able to treat two files as sets of individual lines and match them up, right?

-- Keep an iterator i of the line you are currently looking at...
-- If A[i].matches(B[i]) then keep incrementing i
-- If you get a mismatch on line i, increase the index on one of the files (call it B[j]) until A[i].matches(B[j])... everything between B[i-1] and B[j] would be inserted.
-- Etc...


What happens if, in file B on the second line, the user added 1 word. This word added enough length to wrap one word over onto the next line. Now this could potentially cause all subsequent lines to not match.

IDK, just something to think about in your algorithm.

Share this post


Link to post
Share on other sites
Well this 'wrapping' you're referring to does not occur when you read in the actual file contents. It only seems to go to the next line when using notepad. Unless you actually go to the next line by pressing 'Enter', there will not be a '\n' written to the file.

Share this post


Link to post
Share on other sites
I don't know why I didn't think of this before. But diffing text lines the way that was stated above will not work.

Support you had the following files:

File A (Repository version)
---------------------------
private int a;


File B (Working version)
---------------------------
private int b;
public int a;
private float a;



Using the above method does not work. What I need is a FAST and EFFICIENT way to decide which line in File B correctly corresponds to the line in File A. I could simply perform the Dijakstra search on the line in File A with each line in File B and whichever one results in the least number of inserts or deletes is the closest match. But this will not be very efficient at all if I have to do that for EVERY single line in File A and compare it with every single line in File B.

The obvious optimization would be to flag lines that have already been matched. But then what do you do if a subsequent line in File A better matches a line in File B that has already been match?

So suppose:

File A
-----------------
this is some text
this is any texts


File B
------------------
this is any text


So when going through and diffing these files, I might say that line 0 in File A and line 0 in File B match. But then when looking at the 2nd line in File A, I also see that it closely matches line 0 in File B BUT I have already flagged line 0 in File B as matched.

So taking this into account, does anybody have any other ideas?

Share this post


Link to post
Share on other sites
Quote:
Original post by DeafManNoEars
Quote:
Original post by smitty1276
If you can do it on a single line, you should be able to treat two files as sets of individual lines and match them up, right?

-- Keep an iterator i of the line you are currently looking at...
-- If A[i].matches(B[i]) then keep incrementing i
-- If you get a mismatch on line i, increase the index on one of the files (call it B[j]) until A[i].matches(B[j])... everything between B[i-1] and B[j] would be inserted.
-- Etc...


What happens if, in file B on the second line, the user added 1 word. This word added enough length to wrap one word over onto the next line. Now this could potentially cause all subsequent lines to not match.

IDK, just something to think about in your algorithm.
It isn't a new line until a newline character is encountered, regardless of how a particular text viewer/editor decides to display it.

Share this post


Link to post
Share on other sites
Quote:
Original post by smitty1276
Quote:
Original post by DeafManNoEars
Quote:
Original post by smitty1276
If you can do it on a single line, you should be able to treat two files as sets of individual lines and match them up, right?

-- Keep an iterator i of the line you are currently looking at...
-- If A[i].matches(B[i]) then keep incrementing i
-- If you get a mismatch on line i, increase the index on one of the files (call it B[j]) until A[i].matches(B[j])... everything between B[i-1] and B[j] would be inserted.
-- Etc...


What happens if, in file B on the second line, the user added 1 word. This word added enough length to wrap one word over onto the next line. Now this could potentially cause all subsequent lines to not match.

IDK, just something to think about in your algorithm.
It isn't a new line until a newline character is encountered, regardless of how a particular text viewer/editor decides to display it.



That's good to know. Thanks.

Share this post


Link to post
Share on other sites
Quote:
Original post by soconne
Support you had the following files:

File A (Repository version)
---------------------------
private int a;


File B (Working version)
---------------------------
private int b;
public int a;
private float a;

Using the above method does not work.
Diff tools typically look at lines in a file. It's the nature of software development to INSERT new lines or DELETE old lines from files, so diff tools focus on those and not on differences within lines. In your example above, diff tools will simply show two files with nothing in common. If you change a variable name in a file and then diff the two versions, that LINE will be highlighted.

Some tools will, however, highlight differences or similarities that might exist between lines that differ, but that happens AFTER you run the diff algorithm and identify the lines that are different and identify the lines that match.

EDIT: I should note that binary diff tools do, obviously, exist. I was speaking of diff tools as they are generally used for diffing source code. You can still do a binary type diff on source code if you are looking for specific strings of characters.

Share this post


Link to post
Share on other sites
You could try to combine the two approaches...

File A
-----
private int a;
private int b;

File B
-----
private float x;
private float y;
private int a;
private int c;

For each line you are basically doing a binary diff anyway, so keep up with which parts of the line match and which don't. You could use some sort of heuristic to determine whether a non-matched line is actually a "match" with a small change.

For example... B[0] and B[1] would easily be detected as added lines using my idea. B[2] matches A[0]. B[3] and A[1] do not match, but you can see that only one character changed, so maybe that is actually a match? It almost certainly is, in this case, since A[1] was never encountered in B, but it was 95% matched by B[3]. You already know what line you're looking for in B (in this case A[1]), so it presumably wouldn't be too complicated to see if a non-match meets some minimum requirements to be promoted to a partial match.

How you make that determination would require some trial-and-error, I'm sure.


Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this