Jump to content
  • Advertisement
Sign in to follow this  
soconne

writing my own text diff code

This topic is 4062 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I'm in the process of writing my own diff tool for comparing text files side by side. I've managed to implement the algorithm described in this paper. http://www.xmailserver.org/diff2.pdf It basically works by using Dijakstra's shortest path algorithm to trace a path through both strings while minimizing the number of inserts and deletes. I currently have it working just fine for single lines of text, but I cannot seem to figure out how to efficiently apply it to lines of text from a file. Specifically, figuring out which lines of text in a file match the lines in the second file and vice versa. I'm sure my current method for comparing a single string would NOT be optimal for an entire file. Does anybody have any suggestions?

Share this post


Link to post
Share on other sites
Advertisement
If you can do it on a single line, you should be able to treat two files as sets of individual lines and match them up, right?

-- Keep an iterator i of the line you are currently looking at...
-- If A.matches(B) then keep incrementing i
-- If you get a mismatch on line i, increase the index on one of the files (call it B[j]) until A.matches(B[j])... everything between B[i-1] and B[j] would be inserted.
-- Etc...

Share this post


Link to post
Share on other sites
Quote:
Original post by smitty1276
If you can do it on a single line, you should be able to treat two files as sets of individual lines and match them up, right?

-- Keep an iterator i of the line you are currently looking at...
-- If A.matches(B) then keep incrementing i
-- If you get a mismatch on line i, increase the index on one of the files (call it B[j]) until A.matches(B[j])... everything between B[i-1] and B[j] would be inserted.
-- Etc...


What happens if, in file B on the second line, the user added 1 word. This word added enough length to wrap one word over onto the next line. Now this could potentially cause all subsequent lines to not match.

IDK, just something to think about in your algorithm.

Share this post


Link to post
Share on other sites
Well this 'wrapping' you're referring to does not occur when you read in the actual file contents. It only seems to go to the next line when using notepad. Unless you actually go to the next line by pressing 'Enter', there will not be a '\n' written to the file.

Share this post


Link to post
Share on other sites
I don't know why I didn't think of this before. But diffing text lines the way that was stated above will not work.

Support you had the following files:

File A (Repository version)
---------------------------
private int a;


File B (Working version)
---------------------------
private int b;
public int a;
private float a;



Using the above method does not work. What I need is a FAST and EFFICIENT way to decide which line in File B correctly corresponds to the line in File A. I could simply perform the Dijakstra search on the line in File A with each line in File B and whichever one results in the least number of inserts or deletes is the closest match. But this will not be very efficient at all if I have to do that for EVERY single line in File A and compare it with every single line in File B.

The obvious optimization would be to flag lines that have already been matched. But then what do you do if a subsequent line in File A better matches a line in File B that has already been match?

So suppose:

File A
-----------------
this is some text
this is any texts


File B
------------------
this is any text


So when going through and diffing these files, I might say that line 0 in File A and line 0 in File B match. But then when looking at the 2nd line in File A, I also see that it closely matches line 0 in File B BUT I have already flagged line 0 in File B as matched.

So taking this into account, does anybody have any other ideas?

Share this post


Link to post
Share on other sites
Quote:
Original post by DeafManNoEars
Quote:
Original post by smitty1276
If you can do it on a single line, you should be able to treat two files as sets of individual lines and match them up, right?

-- Keep an iterator i of the line you are currently looking at...
-- If A.matches(B) then keep incrementing i
-- If you get a mismatch on line i, increase the index on one of the files (call it B[j]) until A.matches(B[j])... everything between B[i-1] and B[j] would be inserted.
-- Etc...


What happens if, in file B on the second line, the user added 1 word. This word added enough length to wrap one word over onto the next line. Now this could potentially cause all subsequent lines to not match.

IDK, just something to think about in your algorithm.
It isn't a new line until a newline character is encountered, regardless of how a particular text viewer/editor decides to display it.

Share this post


Link to post
Share on other sites
Quote:
Original post by smitty1276
Quote:
Original post by DeafManNoEars
Quote:
Original post by smitty1276
If you can do it on a single line, you should be able to treat two files as sets of individual lines and match them up, right?

-- Keep an iterator i of the line you are currently looking at...
-- If A.matches(B) then keep incrementing i
-- If you get a mismatch on line i, increase the index on one of the files (call it B[j]) until A.matches(B[j])... everything between B[i-1] and B[j] would be inserted.
-- Etc...


What happens if, in file B on the second line, the user added 1 word. This word added enough length to wrap one word over onto the next line. Now this could potentially cause all subsequent lines to not match.

IDK, just something to think about in your algorithm.
It isn't a new line until a newline character is encountered, regardless of how a particular text viewer/editor decides to display it.



That's good to know. Thanks.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!