_Sigma

Fast floating point parser


I have some really large files with potentially 10,000+ tab/space-delimited floating point numbers (possibly in exponential format) per line, and 10,000+ lines. I am currently using Boost::Spirit like this:
std::vector<std::string> items;

parse(m_line.begin(), m_line.end(),
      //BEGIN GRAMMAR
      *space_p >> real_p[push_back_a(items)] >> *(+space_p >> real_p[push_back_a(items)]) >> *space_p
      //END GRAMMAR
      );

Before anyone starts jumping around going "zomg, profile it first!" - I have:
~1min to load the file normally
~5min to load and parse
So I'm looking for something as fast as possible; it only has to parse floating point numbers. If anyone has any experience or ideas, I would be much in your debt.

According to some quick tests I ran (a basic C program, raw char array and char pointers, using atof, one big memory allocation), you should be able to parse 100,000,000 values in like 160 seconds. It would also be very easy to parallelize that. What kind of performance are you really looking for?
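A rough sketch of that kind of test, assuming the file contents have already been slurped into one NUL-terminated buffer (parse_buffer and its arguments are just illustrative names):

#include <cstddef>
#include <cstdlib>
#include <vector>

// Walk one big in-memory buffer with strtod. Assumes a trailing '\0' was
// appended when the file was loaded, so strtod cannot run off the end.
std::vector<float> parse_buffer(const char* buf, size_t len)
{
    std::vector<float> values;
    const char* p = buf;
    const char* end = buf + len;
    while (p < end)
    {
        char* next = 0;
        double v = strtod(p, &next);   // skips leading whitespace itself
        if (next == p)                 // not a number here; step past it
            ++p;
        else
        {
            values.push_back(static_cast<float>(v));
            p = next;
        }
    }
    return values;
}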

If this is something that will be frequently loaded but only stored once, you should consider preprocessing the data. Know your memory format. Preprocess the data into the exact format needed for your memory, and load it into RAM as a single block of data so you don't need to parse it.
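For example, a minimal sketch of that kind of preprocessing (the tiny count header, the function names and the lack of error handling are all illustrative, not a fixed format):

#include <cstddef>
#include <cstdio>
#include <vector>

// One-time conversion: dump the parsed floats as a raw binary blob.
void save_binary(const char* path, const std::vector<float>& data)
{
    FILE* f = std::fopen(path, "wb");
    size_t n = data.size();
    std::fwrite(&n, sizeof(n), 1, f);             // header: element count
    std::fwrite(&data[0], sizeof(float), n, f);   // raw float payload
    std::fclose(f);
}

// Every later run: one big read, no text parsing at all.
std::vector<float> load_binary(const char* path)
{
    FILE* f = std::fopen(path, "rb");
    size_t n = 0;
    std::fread(&n, sizeof(n), 1, f);
    std::vector<float> data(n);
    std::fread(&data[0], sizeof(float), n, f);
    std::fclose(f);
    return data;
}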

Otherwise...

You really should reserve space in the vector based on the expected amount of data. It is possible (and quite likely) that there is a huge delay as the vector is repeatedly resized and its contents copied. This is fundamental behavior of the language's standard library containers.

You should consider block reads followed by parsing the in-memory data, instead of extracting the individual tiny floating point entries from the file. Most systems will perform block reads behind your back, and some systems will perform a predictive cache of the file, but you shouldn't depend on that behavior.
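Something along these lines, assuming the whole file fits in memory (error handling omitted, this is only a sketch):

#include <fstream>
#include <string>

// Slurp the whole file in one block read; parse the returned buffer afterwards
// with whatever parser ends up being used.
std::string read_whole_file(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(0, std::ios::end);
    std::string buf(static_cast<std::string::size_type>(in.tellg()), '\0');
    in.seekg(0, std::ios::beg);
    in.read(&buf[0], static_cast<std::streamsize>(buf.size()));
    return buf;
}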


If the format and usage patterns are appropriate, you could consider using OS-functions to memory map portions of the file for use as needed, or otherwise dynamically load just the chunks you need at any given moment.

Quote:
Original post by Telastyn
Any chance of getting the file in a fixed width format in the first place?

Nope.

Quote:
Is that release build?

Yes

Quote:
It would also be very easy to parallelize that

Which part? I have already parallelized the processing. I read in the whole file and then use a parallel for loop to do the processing. This gave me a speed-up of ~20%, which was quite nice.
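Roughly this shape, for reference (a sketch only; the real functor obviously does the actual parsing per line):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <string>
#include <vector>

struct LineParser
{
    const std::vector<std::string>* lines;

    void operator()(const tbb::blocked_range<unsigned int>& r) const
    {
        for (unsigned int row = r.begin(); row != r.end(); ++row)
        {
            // parse (*lines)[row] here
        }
    }
};

void parse_all_lines(const std::vector<std::string>& lines)
{
    LineParser body;
    body.lines = &lines;
    tbb::parallel_for(tbb::blocked_range<unsigned int>(0u, static_cast<unsigned int>(lines.size())), body);
}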

Quote:
You really should reserve the space in the array based on the expected amount of data. It is possible (and quite likely) that there is a huge delay as the array is repeatedly resized.

This is embarrassing, but I had completely forgotten about this. I'm sure you're right - reallocating and copying 400 MB of data must be pretty damn slow! I'll redo this tonight.

I assume
myvector.reserve(m_cols)
suffices for this?

Quote:
You should consider block reads followed by parsing the in-memory data, instead of extracting the individual tiny floating point entries from the file. Most systems will perform block reads behind your back, and some systems will perform a predictive cache of the file, but you shouldn't depend on the behavior.

Sorry, can you elaborate on this?

Quote:
load just the chunks you need at any given moment.

Unfortunately I need it all in memory :(

Quote:
Original post by _Sigma

Before anyone starts jumping around going "zomg, profile it first!" - I have.
~1min to load the file normally
~5min to load and parse


Then why are you making us guess? Where is your time going? Oh you mean you have done some test runs. Get a profiler and profile for real.

Quote:
Original post by stonemetal
Quote:
Original post by _Sigma

Before anyone starts jumping around going "zomg, profile it first!" - I have.
~1min to load the file normally
~5min to load and parse


Then why are you making us guess? Where is your time going? Oh you mean you have done some test runs. Get a profiler and profile for real.

I have a profiler. However, it doesn't run under Vista, and that's where I'm dev'ing at the moment. I have it on my XP laptop, but that machine is so slow that it doesn't really apply (slow Core Duo, 4200 RPM HDD, so I was worried that would be the bottleneck).

I am not too familiar with Boost::Spirit and was worried that perhaps it wasn't meant to be used as I am using it.

You're right, I should have done a full profile; I just assumed I was abusing the library and it was my fault. Typically problems lie between the back of the chair and the keyboard :)


Obvious things first - how often is items resized? Do you pre-allocate it? Do you re-use it?

We're talking 100 million numbers. In compact storage as floats, that's 400 megabytes of data. Do you have enough RAM, or are you already paging? Keep in mind that if you don't pre-allocate 'items', you will get reallocations, which will thrash your heap and, given a data set this large, can quickly push your working memory over 1 GB, which *may* be an issue.

----
Random blurbs...

Third-party libraries are intended to save development time, not necessarily to be the fastest possible solution. Let's assume at this point that the problem lies in Spirit.

How complex is the grammar? Can you write a lex/yacc (or your own flavor) parser? Those are about as fast as it gets. If it's not too complex, writing your own floating point FSM parser is pretty straightforward.

Can lines be extracted independently? Do they even matter? If so, can you look at disk access? How much can you pre-load? Fragmentation? Async/overlapped file reads?

Can you post the grammar, and perhaps what the file looks like?

Ah, so you are in an environment where you don't have a true profiler available. Have you tried rolling your own? Insert a static variable at the start of a function and accumulate the time spent every time the function runs. You probably won't want to do your whole code base, but you should be able to drill down to the problem fairly quickly since you already know roughly where it is.
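The crudest version of that is just a static accumulator around the suspect code (a sketch; the names are placeholders, and std::clock's resolution is coarse, but it is plenty for a minutes-long hotspot):

#include <ctime>

static double g_parse_seconds = 0.0;   // dump this once at shutdown

void suspect_function()
{
    std::clock_t start = std::clock();

    // ... the code being measured ...

    g_parse_seconds += double(std::clock() - start) / CLOCKS_PER_SEC;
}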

So I changed to a preallocated vector via
items.reserve(m_cols);
However, this had almost no impact :(
Guess it's time to break out the real profiler, eh?

Quote:
Original post by Antheus
Do you have enough RAM, or are you already paging?

I have 4 GB of RAM, so it should all be in RAM.

Quote:
Can you post the grammar, and perhaps what the file looks like?

As per my OP, grammar is:

*space_p >> real_p[push_back_a(items)] >> *(+space_p >> real_p[push_back_a(items)]) >> *space_p

The files look like
12.314 232.434 12341.3 ...
1234.532
.
.
.

No fixed tabs, spaces in between numbers.

Quote:
What compiler are you using?

VS 2005 & VS 2008

So I went to my XP box and profiled and ended up with this:


Count %w/children time(us)
void ArcASCIIRaster::threaded_parse::operator()(const tbb::blocked_range<unsigned int>& r) const
6 0.0 2.3 {
6 0.0 1.6 std::string* lines = m_lines;
6 0.0 1.6 bool* failure = m_failure;
6 0.0 1.5 std::string* msg = m_msg;
6 0.0 1.6 *failure = false;

//contains the parsed numbers
6 0.0 2.2 std::vector<float> items;
6 0.0 663.3 items.reserve(m_cols);
6 0.0 1.9 for(unsigned int row = r.begin(); row < r.end(); row++)
{


//current col
5,866 0.0 1,492.7 unsigned int col = 0;

parse(m_lines[row].begin(),m_lines[row].end(),
//BEGIN GRAMMAR
*space_p >> real_p[push_back_a(items)] >> *(+space_p >> real_p[push_back_a(items)]) >> *space_p
//END GRAMMAR
5,866 10.2 186,285,532.9 );


//report an error
5,866 0.0 3,402.6 if(items.size() != m_cols)
{
*failure = true;
*msg = "Not enough columns at line " + boost::lexical_cast<std::string>(row+1);
return;
}

//need to change to a tbb vector if we want to parallelize this
5,866 0.0 1,591.4 for(unsigned int col = 0; col < items.size(); col++)
{

try
48,435,562 0.7 12,385,339.0 {
48,435,562 88.2 1,612,247,535.9 m_array[row*m_cols+col] = boost::lexical_cast<float>(items[col]);
}
catch(boost::bad_lexical_cast e)
{
*failure = true;
*msg = "Failure to cast item to float";
return;
}
catch(...)
{
*failure = true;
*msg = "Unknown error";
return;
}

}


5,866 0.0 8,223.5 items.clear();
6 0.0 1.3 }

6 0.0 279.6 }



The third column is in microseconds, after opening a 150 MB file...
The culprit seems to be this awesomely redundant code (left over from an earlier version that was using strings):
1,612,247,535.9 m_array[row*m_cols+col] = boost::lexical_cast<float>(items[col]);
That is a full 26.8 min worth of execution time :( Remember, this is a *slow* laptop and the profiler adds a little overhead, but I think it's clear where the problem is.
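And since items is already a std::vector<float>, the fix should presumably just be to drop the cast and the per-element try/catch, e.g.

//straight copy - items already holds floats, no lexical_cast needed
for(unsigned int col = 0; col < items.size(); col++)
    m_array[row*m_cols+col] = items[col];

or even a single std::memcpy of the whole row.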

However, the Boost::Spirit call is still substantial - any ideas on how to reduce this down further?


Quote:
Original post by _Sigma
Quote:
What compiler are you using?

VS 2005 & VS 2008
It looks like you found the main problem already, but you might be able to increase speed further by defining _SECURE_SCL to 0 (or you might not, depending on the optimizer, spirit's implementation, etc).
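If you do try it, it would be something like this at the top of each translation unit (or better, a project-wide preprocessor definition), and it has to be consistent across everything you link:

#define _SECURE_SCL 0   // must come before any standard library includes
#include <vector>
#include <string>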

Quote:
Original post by _Sigma
However, the Boost::Spirit call is still substantial - any ideas on how to reduce this down further?
Well, the first thing I'd try is to kick out Boost::Spirit and replace it with memory mapped files together with strtof (use the end pointer!). But still, you're fundamentally at the mercy of the quality of the floating point parser of a foreign library for a performance-critical operation. And float parsers tend to be optimized more for precision and code size than performance.
So you should really consider hand-rolling and optimizing your own parser. The idea here is to read the digits as a normal integer, which is then adjusted by multiplying with an exponent table to get the final float. By skipping error/overflow checking and reading directly from a memory-mapped file, you should be able to keep up with the drive without too much trouble.
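Something along these lines, as a bare-bones sketch of that idea (no 'e' exponent handling, no overflow or error checking; extend as needed):

// Parse one float starting at p and return a pointer just past it.
// Digits are accumulated as an integer and the decimal point is fixed up
// with a small power-of-ten table at the end.
inline const char* parse_float(const char* p, float* out)
{
    static const double pow10[] = { 1.0, 10.0, 100.0, 1000.0, 10000.0,
                                    100000.0, 1000000.0, 10000000.0, 100000000.0 };

    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')   // skip delimiters
        ++p;

    bool negative = (*p == '-');
    if (*p == '-' || *p == '+')
        ++p;

    long long mantissa = 0;
    int frac_digits = 0;

    for (; *p >= '0' && *p <= '9'; ++p)            // integer part
        mantissa = mantissa * 10 + (*p - '0');

    if (*p == '.')
    {
        ++p;
        for (; *p >= '0' && *p <= '9'; ++p)        // fractional part
        {
            if (frac_digits < 8)                   // float only holds ~7 digits anyway
            {
                mantissa = mantissa * 10 + (*p - '0');
                ++frac_digits;
            }
        }
    }

    double value = double(mantissa) / pow10[frac_digits];
    *out = float(negative ? -value : value);
    return p;
}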

Quote:
Original post by implicit
Well, the first thing I'd try is to kick out Boost::Spirit and replace it with memory mapped files together...

Is this a good idea for 400 MB files?
Quote:
http://en.wikipedia.org/wiki/Memory-mapped_file
Depending on the circumstances, memory mapped file I/O can actually be substantially slower than standard file I/O. For example: when reading in very large files, most of the data will not be cached by the kernel, meaning many page faults will occur when reading uncached data.

Since you are doing a sequential run it should be okay. The first page miss loads the whole page, giving you a whole page of data to parse before the next page miss. If you were doing random access where the data you need is more than a page apart, then you would run into the problem mentioned. From the article:
Quote:
Since the memory-mapped file is handled internally in pages, linear file access (as seen, for example, in flat file data storage or configuration files) requires disk access only when a new page boundary is crossed, and can write larger sections of the file to disk in a single operation.

Quote:
Original post by _Sigma
Quote:
Original post by implicit
Well, the first thing I'd try is to kick out Boost::Spirit and replace it with memory mapped files together...

Is this a good idea for 400 MB files?
Yes, you're doing a simple linear scan which is precisely the kind of thing memory-mapped files are optimized for. In Win32/POSIX you can even set flags to indicate that you're doing a sequential scan.
I've heard claims that it's beatable with uncached overlapped async IO under Win32 but the difference is minimal at best, and the real benefit of memory mapped files is that they're incredibly comfortable to use.

Then again, you can't increase the size of the files much beyond this without risking running out of address space on 32-bit systems, so you might have to use windowing to map a smaller part of the file at a time.
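For reference, the POSIX flavour of that looks roughly like this (on Win32 it would be CreateFile with FILE_FLAG_SEQUENTIAL_SCAN plus CreateFileMapping/MapViewOfFile); just a sketch, with error handling omitted, and it assumes the file ends in whitespace so strtof cannot run past the mapping:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <vector>

std::vector<float> parse_mapped(const char* path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    const char* data = static_cast<const char*>(mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    madvise(const_cast<char*>(data), st.st_size, MADV_SEQUENTIAL);   // hint: linear scan

    std::vector<float> values;
    const char* p = data;
    const char* end = data + st.st_size;
    while (p < end)
    {
        char* next = 0;
        float v = strtof(p, &next);
        if (next == p) { ++p; continue; }   // skip anything that isn't a number
        values.push_back(v);
        p = next;
    }

    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
    return values;
}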

One thing I have noticed is that your code doesn't use const to full effect: your operator() is defined as a const function, but you don't use const_iterators, instead favouring the vector's indexing operator. Good use of const helps the compiler's optimizer out.
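For the inner copy that would mean something like this instead of operator[] (a minor illustration only; whether the optimizer actually cares is another matter):

std::vector<float>::const_iterator it = items.begin();
std::vector<float>::const_iterator last = items.end();
float* dst = &m_array[row * m_cols];
for (; it != last; ++it)
    *dst++ = *it;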

Just pointing out also that you are using a random access container, but I don't think you actually need random access (though realistically, I doubt a list would be faster due to dereference overhead and because you're using primitive data types, but still something to think about).

Have you tried using std::stringstream? There is a chance it could be faster than boost::lexical_cast. Also, your try/catch block is inside your loop; in C++, exception traps are relatively expensive to set up, so make sure it ends up outside the loop.
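For example, hoisting the try/catch around the loop from the profile above would look roughly like this (same behaviour on the error paths, just one exception frame instead of one per element):

try
{
    for(unsigned int col = 0; col < items.size(); col++)
        m_array[row*m_cols+col] = boost::lexical_cast<float>(items[col]);
}
catch(boost::bad_lexical_cast&)
{
    *failure = true;
    *msg = "Failure to cast item to float";
    return;
}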

Computer-architecture-wise, I would say use a memory mapped file, sized to align with your system's page size. I'm a little surprised that you "only" got a 20% speed boost from Intel's TBB; do you have a dual core?
