Fast floating point parser

Started by
17 comments, last by silvermace 15 years, 9 months ago
I have some really large files with, potentially 10000+ tab/space delimited floating point numbers (potentially in exponential format) per line (10000+ lines). I am currently using Boost::Spirit like thus:

std::vector<std::string> items;

parse(m_line.begin(),m_line.end(),
							//BEGIN GRAMMAR
							*space_p >> real_p[push_back_a(items)] >> *(+space_p >> real_p[push_back_a(items)]) >> *space_p
							//END GRAMMAR
							);

Before anyone starts jumping around going "zomg, profile it first!" - I have. ~1min to load the file normally ~5min to load and parse So I'm looking for something as fast as possible; it only has to parse floating point numbers. If anyone has any experience or ideas I would be much in debt.
Advertisement
Any chance of getting the file in a fixed width format in the first place?
That seems really long, even for tens of thousands of numbers. Is that release build?
According to some quick tests I ran (a basic C program, raw char array and char pointers, using atof, one big memory allocation), you should be able to parse 100,000,000 values in like 160 seconds. It would also be very easy to parellelize that. What kind of performance are you really looking for?
If this is something that will be frequently loaded but only stored once, you should consider preprocessing the data. Know your memory format. Preprocess the data into the exact format needed for your memory, and load it into ram as a single block of data so you don't need to parse it.

Otherwise...

You really should reserve the space in the array based on the expected amount of data. It is possible (and quite likely) that there is a huge delay as the array is repeatedly resized. This is something fundamental to the language's standard library.

You should consider block reads followed by parsing the in-memory data, instead of extracting the individual tiny floating point entries from the file. Most systems will perform block reads behind your back, and some systems will perform a predictive cache of the file, but you shouldn't depend on the behavior.


If the format and usage patterns are appropriate, you could consider using OS-functions to memory map portions of the file for use as needed, or otherwise dynamically load just the chunks you need at any given moment.
Quote:Original post by Telastyn
Any chance of getting the file in a fixed width format in the first place?

Nope.

Quote:Is that release build?

Yes

Quote:It would also be very easy to parallelize that

Which part? I have already parallelized the processing. I read in the whole file and then us a parallel for loop to do the processing. This gave me a speed up of ~20%, which was quite nice.

Quote:You really should reserve the space in the array based on the expected amount of data. It is possible (and quite likely) that there is a huge delay as the array is repeatedly resized.

This is embarrassing, but I had completely forgotten about this. I'm sure you're right - reallocating and copying 400mb of data must be pretty damn slow! I'll redo this tonight.

I assume
myvector.reserve(m_cols)
suffices for this?

Quote:You should consider block reads followed by parsing the in-memory data, instead of extracting the individual tiny floating point entries from the file. Most systems will perform block reads behind your back, and some systems will perform a predictive cache of the file, but you shouldn't depend on the behavior.

Sorry, can you elaborate on this?

Quote:load just the chunks you need at any given moment.

Unfortunately I need it all in memory :(
Quote:Original post by _Sigma

Before anyone starts jumping around going "zomg, profile it first!" - I have.
~1min to load the file normally
~5min to load and parse


Then why are you making us guess? Where is your time going? Oh you mean you have done some test runs. Get a profiler and profile for real.
Quote:Original post by stonemetal
Quote:Original post by _Sigma

Before anyone starts jumping around going "zomg, profile it first!" - I have.
~1min to load the file normally
~5min to load and parse


Then why are you making us guess? Where is your time going? Oh you mean you have done some test runs. Get a profiler and profile for real.

I have a profiler. However it doesn't run under vista and that's where I'm dev'ing atm. I have it on my xp laptop, however it is so slow that it doesn't really apply (slow core duo, 4200rpm hdd so I was worried this was bottlenecking).

I am not too familiar with Boost::Spirit and was worried that perhaps it wasn't meant to be used as I am using it.

You're right, I should have done a full profile, however I just assumed I was abusing a library and it was my fault. Typically problems lie between the back of the chair and the keyboard :)


Obvious things first - how often is items resized? Do you pre-allocate it? Do you re-use it?

We're talking 100 million numbers. In compact storage as floats, that's 400 megabytes of data. Do you have RAM, or are you already paging? Keep in mind that if you don't pre-allocate the 'items', you will get reallocations, which will be trashing your heap, which, given large data set, can quickly push your working memory over 1Gb, which *may* be an issue.

----
Random blurbs...

Third-party libraries are intended to save development time. Let's at this point assume that the problem lies in spirit.

How complex is grammar? Can you write a lex/yacc (or your own flavor) parser? Those result in as fast as it gets. If it's not too complex, writing your own floating point FSM parser is pretty straight-forward.

Can lines be extracted independently? Do they even matter? If so, can you look at disk access? How much can you pre-load? Fragmentation? Async/overlapped file reads?

Can you post the grammar, and perhaps what the file looks like?
Ah, so You are in an environment where you don't have a true profiler available. Have you tried rolling your own? Insert a static variable at the start of a function then collect the time spent every time the function runs. you probably wont want to do your whole code base but you should be able to drill down to the problem fairly quickly since you already know around where the problem is.

This topic is closed to new replies.

Advertisement