alnite

Split by spaces


I was recently parsing huge text files that had no real formatting.  The only apparent structure was that the fields in each row were separated by spaces, and each row had a fixed number of fields.  The number of spaces between fields was not consistent throughout the file.  To add a bit more challenge to the mix, there was a date field somewhere in there in the form of "Wednesday February 10, 2000". It looked something like this:

"   John   10    20    Wednesday February 10, 2000            Dexter  19.9    "

I was reading the file one line at a time and extracting the fields. I knew not to split on a single space, because that would break the date field apart.  Then I got an idea: how about splitting on two spaces?  This would keep the date in one field as expected, and automatically swallow all those extra spaces in the middle.  The code worked like a charm.  Sometimes a field came out with an extra space attached, but that was easily removed with a simple call to trim.
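
Here is a minimal sketch of the split-by-two-spaces idea, written in Python since the post doesn't say which language was used. The name `split_fields` is hypothetical, and dropping empty pieces before trimming is an assumption about how the original code behaved.

```python
def split_fields(line):
    """Split a row on two-space separators (hypothetical helper)."""
    pieces = line.split("  ")  # exactly two spaces as the separator
    # Long runs of spaces produce empty pieces; drop those (assumed behavior),
    # then trim the stray space that an odd-length run leaves on a field.
    return [piece.strip() for piece in pieces if piece != ""]


row = "   John   10    20    Wednesday February 10, 2000            Dexter  19.9    "
print(split_fields(row))
```

The date survives as a single field because it only ever contains single spaces.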

I wrote tests that tried various numbers of spaces, and the code could even detect invalid lines and drop them so as not to corrupt the data.  This was done through a simple check on the number of fields parsed, which should be 6 for every line.
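
The field-count check might look something like the sketch below. `EXPECTED_FIELDS`, `split_fields`, and `parse_lines` are hypothetical names, and the splitter is repeated only so the block runs on its own.

```python
EXPECTED_FIELDS = 6  # every valid row should yield exactly six fields


def split_fields(line):
    # Assumed splitter: split on two spaces, drop empty pieces, trim the rest.
    return [piece.strip() for piece in line.split("  ") if piece != ""]


def parse_lines(lines):
    """Keep only rows with the expected number of fields (hypothetical helper)."""
    rows = []
    for line in lines:
        fields = split_fields(line)
        if len(fields) == EXPECTED_FIELDS:
            rows.append(fields)
        # Any other count is treated as an invalid line and dropped.
    return rows
```

Dropping rows silently keeps bad data out, but it also means a systematic problem only shows up as an empty result.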

Great. So I started parsing the real text files, and... got an empty result back, as in no lines were being parsed.  Huh, what did I do wrong?  I went back to the test data and messed with it in many different ways, including putting more spaces at the front and at the end of each line.  The tests all passed.  The code parsed the fields correctly.  So what went wrong?

An hour and a dozen printf statements later, I found out that every line from the real text files came back with 7 fields instead of the expected 6, and the last field was an empty string.  "How did that get in there?"  I copy-pasted some lines from the real files into my tests to make sure I wasn't going crazy, and sure enough those lines failed the tests even though they looked identical.  What made them different? I turned on "show whitespace" in my editor and saw no tab characters, and I started wondering whether some alternative Unicode space character was messing with me.

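Then I started counting the trailing spaces... and the count came back odd. Splitting an odd run by two spaces leaves one lone space behind, which trims down to an empty seventh field.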

facepalm.jpg
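
To see why the parity of the trailing spaces matters, here is a small reproduction using the same assumed splitter (all names are hypothetical). It is followed by one possible fix, which the post itself does not describe: trim the whole line first, then split on runs of two or more spaces with a regular expression.

```python
import re


def split_fields(line):
    # Assumed original approach: split on two spaces, drop empties, trim.
    return [piece.strip() for piece in line.split("  ") if piece != ""]


def split_fields_fixed(line):
    # Possible fix: strip the line, then treat any run of 2+ spaces as one separator.
    return re.split(r" {2,}", line.strip())


base = "John  10  20  Wednesday February 10, 2000  Dexter  19.9"
even = base + " " * 4  # even run: splits cleanly into empty pieces
odd = base + " " * 5   # odd run: one lone space survives the empty-piece filter

print(len(split_fields(even)))       # 6
print(len(split_fields(odd)))        # 7, and the last field is ""
print(len(split_fields_fixed(odd)))  # 6
```

Stripping before splitting makes the result independent of how many spaces pad either end of the line.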

Edited by alnite
