[C++] Holy cow are string streams slow

Started by
14 comments, last by Antheus 12 years, 12 months ago

In your next function:
stream = stringstream( line );

should you not use..

stream << line ;

I haven't read every line of the code but I'm not sure you need to be creating a new stringstream on each read.


That one change actually makes things at least one order of magnitude slower. That was the only change I made, and my timing went over 800 seconds, so I decided to stop. But thanks for trying to give me some practical advice anyway; it's appreciated.
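For what it's worth, the usual way to reuse a single stringstream between lines, instead of constructing a new one each time, is to reset both its error state and its buffer; a bare `stream << line` keeps appending to the old buffer and leaves the eof/fail bits set, which may explain the slowdown. A sketch (the `FirstFloat` helper is just for illustration, not from the loader):

```cpp
#include <sstream>
#include <string>

// Reset an existing stringstream so it can parse a new line without
// constructing a whole new stream object each iteration.
void ResetStream( std::stringstream& stream, const std::string& line )
{
    stream.clear();     // drop eofbit/failbit left over from the last line
    stream.str( line ); // replace the buffer contents
}

// Illustrative helper: extract the first float on a line.
float FirstFloat( std::stringstream& stream, const std::string& line )
{
    ResetStream( stream, line );
    float value = 0.0f;
    stream >> value;
    return value;
}
```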
I've found that if you just want to write a straightforward implementation and don't want to spend a lot of time optimising it, then the C-style stream functions are the way to go. There's nothing inherently slow about the C++ streams by design; it's just that the VC++ implementation of them does a lot of extra work (even with the full array of optimisation flags).
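For example, a vertex line can be pulled apart with `sscanf` directly (assuming an obj-style `v x y z` line; this is a sketch, not the poster's loader):

```cpp
#include <cstdio>

// Parse an obj-style "v x y z" line with sscanf instead of stream
// extraction.  Returns true only when all three coordinates were read.
bool ParseVertex( const char* line, float& x, float& y, float& z )
{
    return std::sscanf( line, "v %f %f %f", &x, &y, &z ) == 3;
}
```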
Is using the trim line function actually faster than just letting the switch statements ignore the unnecessary characters?

While going through the file twice (to find number of verts etc.) may be faster, perhaps keeping the file in memory and using that the second time would be quicker than reading everything off disk twice?
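A sketch of that idea: slurp the file into memory once, then run both passes over the in-memory copy (assumes the file fits in RAM; `ReadWholeFile` is a made-up helper name):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read the entire file into a string once; both parsing passes can then
// run over the in-memory copy instead of touching the disk twice.
std::string ReadWholeFile( const std::string& path )
{
    std::ifstream file( path.c_str(), std::ios::binary );
    std::ostringstream contents;
    contents << file.rdbuf();
    return contents.str();
}
```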

Going through the file twice isn't slowed down by the disk, as I'm sure the file gets cached by the OS / HDD anyway; it's slowed down most by that string stream constructor. And yes, it is way faster to find the right number of verts first, because for a large data set (like Lucy), if you don't reserve vector space the whole thing runs almost five times as long (184 seconds).
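A minimal sketch of the two-pass idea, assuming obj-style `v x y z` lines (the names are illustrative, not the actual loader):

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Vec3 { float x, y, z; };

// Two-pass load: count the vertex lines first so the vector can be
// reserved once, then parse.  Without the reserve, a large model forces
// repeated reallocations that copy every vertex loaded so far.
std::vector<Vec3> LoadVertices( const std::vector<std::string>& lines )
{
    size_t vertexCount = 0;
    for( size_t i = 0; i < lines.size(); i++ )
        if( lines[i].compare( 0, 2, "v " ) == 0 )
            vertexCount++;

    std::vector<Vec3> verts;
    verts.reserve( vertexCount );

    for( size_t i = 0; i < lines.size(); i++ )
    {
        Vec3 v;
        if( std::sscanf( lines[i].c_str(), "v %f %f %f", &v.x, &v.y, &v.z ) == 3 )
            verts.push_back( v );
    }
    return verts;
}
```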

No, it is a little slower. A switch statement works well for ignoring lines that start with # as the comment character (like in obj files), but if you have a comment like /* this is a comment */ that can span multiple lines, sit in the middle of a line, or even appear several times in one line, like /* blah blah blah */ full of /* blah blah blah */, then you need something better, and the parser class is used to parse a few types of text files.


So everyone can see, here is an implementation of the parser class I just created, which loads Lucy in 10 seconds, compared to the 48 it takes with the previously posted one using string streams.



Parser::Parser( wstring file )
{
input.open( file );
ignoring = -1;

if( !input.is_open() )
throw ExcFailed( L"[Parser::Parser] Could not open file " + file + L"\n" );
}

void Parser::Ignore( const std::string& start, const std::string& end )
{
excludeDelims.push_back( start );
includeDelims.push_back( end );
}

void Parser::Rewind( void )
{
input.seekg( 0, ios::beg );
input.clear();

ignoring = -1;
line.clear();
}

void Parser::Next( void )
{
getline( input, line );

if( !input.good() )
return;

if( line.empty() )
{
Next();
return;
}

TrimLine( line );
if( line.empty() )
{
Next();
return;
}
}

void Parser::GetLine( std::string& _line )
{
_line = line;
}

void Parser::GetTokens( std::vector<std::string>& tokens )
{
tokens.clear();
string buff;

size_t from = 0;
while( from < line.length() )
{
GetNextToken( buff, from );
if( !buff.empty() )
tokens.push_back( buff );
}
}

void Parser::GetHeader( std::string& header )
{
header.clear();

size_t from = 0;
GetNextToken( header, from );
}

void Parser::GetBody( std::string& body )
{
body.clear();

size_t i = 0;
// Ignore any white spaces at the beginning of the line.
while( i < line.length() && ( line[i] == ' ' || line[i] == '\r' || line[i] == '\t' ) )
i++;

// Ignore the first word
while( i < line.length() && line[i] != ' ' && line[i] != '\r' && line[i] != '\t' )
i++;

body = line.substr( i );
}

void Parser::GetBodyTokens( std::vector<std::string>& bodyTokens )
{
bodyTokens.clear();

string buff;

size_t from = 0;
GetNextToken( buff, from );
while( from < line.length() )
{
GetNextToken( buff, from );
if( !buff.empty() )
bodyTokens.push_back( buff );
}
}

bool Parser::Good( void )
{
return input.good();
}

void Parser::TrimLine( string& line )
{
if( ignoring != -1 )
{
size_t incPos = line.find( includeDelims[ignoring] );
if( incPos != string::npos )
{
// Skip past the end delimiter as well so it doesn't stay in the line.
line = line.substr( incPos + includeDelims[ignoring].length() );
ignoring = -1;
TrimLine( line );
}
else
line.clear();
}
else
{
for( size_t i = 0; i < excludeDelims.size(); i++ )
{
size_t excPos = line.find( excludeDelims[i] );
if( excPos != string::npos )
{
string tail = line.substr( excPos, line.length() );
line = line.substr( 0, excPos );

// If the includeDelim is the end of the line just return the head.
if( includeDelims[i] == "\n" )
return;

ignoring = i;
TrimLine( tail );
line += tail;
return;
}
}
}
}

void Parser::GetNextToken( string& container, size_t& from )
{
// Skip any leading whitespace before the token.
while( from < line.length() && ( line[from] == ' ' || line[from] == '\t' || line[from] == '\r' ) )
from++;

// Start to at from, not from + 1, so a whitespace-only tail yields an
// empty token instead of reading past the end of the line.
size_t to = from;
while( to < line.length() && line[to] != ' ' && line[to] != '\t' && line[to] != '\r' )
to++;

container = line.substr( from, to - from );

from = to;
}
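For reference, the whitespace-splitting logic in GetTokens / GetNextToken can be exercised as a standalone function (member state replaced by locals; this is an illustrative rewrite, not part of the class):

```cpp
#include <string>
#include <vector>

// Standalone version of the GetNextToken/GetTokens logic above: split a
// line on spaces, tabs and carriage returns, skipping empty tokens.
std::vector<std::string> Tokenize( const std::string& line )
{
    std::vector<std::string> tokens;
    size_t from = 0;
    while( from < line.length() )
    {
        // Skip whitespace up to the start of the next token.
        while( from < line.length() &&
               ( line[from] == ' ' || line[from] == '\t' || line[from] == '\r' ) )
            from++;

        size_t to = from;
        while( to < line.length() &&
               line[to] != ' ' && line[to] != '\t' && line[to] != '\r' )
            to++;

        if( to > from )
            tokens.push_back( line.substr( from, to - from ) );
        from = to;
    }
    return tokens;
}
```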



Which is a shame because I think string streams are a really elegant way of parsing and formatting data, but I don't know how to use them in a way that isn't mega mega slow.

A switch statement works well for ignoring lines that start with # as the comment character (like in obj files), but if you have a comment like /* this is a comment */ that can span multiple lines, sit in the middle of a line, or even appear several times in one line, then you need something better.


But that's not a valid comment in a .obj file.

If the stringstream is really that slow, you could try processing the line string using boost string algorithms or something instead (I'm not sure it would be faster, but it's an option to try).
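Another stream-free option in the same spirit: parse the numeric fields with `strtof`, walking its end pointer along the line (a sketch, untested against this loader):

```cpp
#include <cstdlib>
#include <vector>

// Parse every number on a line by walking strtof's end pointer forward;
// strtof itself skips leading whitespace, so no manual tokenising is
// needed.  Stops when no further number can be converted.
std::vector<float> ParseFloats( const char* line )
{
    std::vector<float> values;
    char* end = 0;
    for( ;; )
    {
        float v = std::strtof( line, &end );
        if( end == line )   // no more numbers on the line
            break;
        values.push_back( v );
        line = end;
    }
    return values;
}
```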
This is a textbook example of how costly memory allocations are.

There is no decent way around it using standard stream implementations.

One could use a custom allocator, which would go a long way, but very few third-party libraries support such strings, so if you depend on any of them, it'll be a problem.


Parsing .obj files is also best done using a standard FSM-based parser, which can work with no overhead or extra allocations beyond the extracted data.
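A minimal sketch of that FSM approach, reduced to just /* */ comment stripping with state carried across lines (illustrative, not a full .obj parser):

```cpp
#include <string>

// Tiny two-state machine: CODE copies characters through, COMMENT eats
// them until the closing delimiter.  The state persists across calls so
// a comment may span multiple lines.
enum State { CODE, COMMENT };

std::string StripComments( const std::string& line, State& state )
{
    std::string out;
    size_t i = 0;
    while( i < line.length() )
    {
        if( state == CODE )
        {
            if( line.compare( i, 2, "/*" ) == 0 ) { state = COMMENT; i += 2; }
            else out += line[i++];
        }
        else // COMMENT: discard until the end delimiter
        {
            if( line.compare( i, 2, "*/" ) == 0 ) { state = CODE; i += 2; }
            else i++;
        }
    }
    return out;
}
```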

This topic is closed to new replies.
