I learned an interesting lesson last night about over-reliance on optimizing compilers.
Deep in the core of the Epoch grammar is a rule that looks for generalized tokens from the lexer. This rule is designed to match user-defined identifiers like variable and function names - as opposed to reserved identifiers like "if" or "structure." As part of the interoperation between boost::spirit::qi (the parser generator) and boost::spirit::lex (the lexer generator), there's a way to tell the grammar to look for any token that the lexer has already matched to a certain regular expression.
So far so good. This rule is responsible for parsing a huge subset of a given Epoch program, because when it comes down to it there aren't that many reserved words in the language - even the "built in" types are just standard library functions that the compiler itself doesn't know about (i.e., it discovers them at build time, when the standard library is silently included in your program).
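To make the lexer/parser split concrete, here is a minimal standard-library sketch of the idea. It does not use boost::spirit at all and is not the Epoch grammar; the names `TokenId`, `Token`, `lex`, and `match_identifier` are hypothetical. The lexer tags each word according to a regular expression and a small reserved-word list, and the parser-side rule only ever looks at the tag.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token tags; a real lexer would have many more.
enum class TokenId { Reserved, Identifier };

struct Token {
    TokenId id;
    std::wstring text;  // copied here only to keep the sketch short
};

// Lexer side: find every word in the source and tag it.
std::vector<Token> lex(const std::wstring& source) {
    static const std::wregex word(L"[A-Za-z_][A-Za-z0-9_]*");
    static const std::set<std::wstring> reserved = {L"if", L"structure"};

    std::vector<Token> tokens;
    const std::wsregex_iterator end;
    for (std::wsregex_iterator it(source.begin(), source.end(), word); it != end; ++it) {
        std::wstring text = it->str();
        assert(!text.empty());  // the pattern requires at least one character
        const TokenId id =
            reserved.count(text) != 0 ? TokenId::Reserved : TokenId::Identifier;
        tokens.push_back(Token{id, std::move(text)});
    }
    return tokens;
}

// Parser side: the "general identifier" rule never re-runs the regular
// expression; it only checks how the lexer already tagged the token.
bool match_identifier(const std::vector<Token>& tokens, std::size_t pos) {
    return pos < tokens.size() && tokens[pos].id == TokenId::Identifier;
}
```

With only a handful of reserved words, nearly every word in a program takes the `Identifier` path, which is why the cost of that one rule matters so much.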
Now we get to the messy bit.
Without going into painful detail, qi uses boost::fusion, which relies heavily on boost::mpl to do things like variants, compile-time vectors, and so on. In a nutshell, lex produces tokens which carry a variant that can hold a value of any arbitrary type. This is nice for tagging tokens in the lexer with metadata that the parser then needs (such as "hey, this token corresponds to the boolean value 'true'" or "this one is the integer 42"). In the case of the general-identifier token, the payload was a pair of std::wstring iterators that pointed into the original code stream. The point of this is that I can grab a token from the parser, then ask the lexer (via these iterators) where it found that token, and thereby recover the original string value of the token - all without copying any memory to hold the string data itself.
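The iterator-pair trick can be sketched in a few lines of plain C++. This is only an illustration under assumed names (`IdentifierToken`, `make_token`, `text_of`, and `offset_of` are hypothetical, not the spirit token type): the token stores two iterators into the source buffer, and the string is only materialized when someone asks for it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

using SourceIter = std::wstring::const_iterator;

// Hypothetical token: holds no characters, only the [begin, end) range
// of the match inside the original source buffer.
struct IdentifierToken {
    std::pair<SourceIter, SourceIter> range;
};

// Build a token for `length` characters starting at `offset`.
// The source string must outlive the token, since the token points into it.
IdentifierToken make_token(const std::wstring& source,
                           std::size_t offset, std::size_t length) {
    assert(offset <= source.size() && length <= source.size() - offset);
    const SourceIter first = source.begin() + static_cast<std::ptrdiff_t>(offset);
    const SourceIter last = first + static_cast<std::ptrdiff_t>(length);
    return IdentifierToken{std::make_pair(first, last)};
}

// Recover the token's text; this is the only place a copy is made.
std::wstring text_of(const IdentifierToken& tok) {
    return std::wstring(tok.range.first, tok.range.second);
}

// Recover where in the source the token was found.
std::size_t offset_of(const std::wstring& source, const IdentifierToken& tok) {
    return static_cast<std::size_t>(tok.range.first - source.begin());
}
```

The catch with this design is lifetime: the iterators are only valid while the source buffer is alive and unmodified.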
As it turns out, my implementation (which was adapted from the examples given on the qi/lex web site, no less) used a variant holding a vector holding the pair of iterators to carry that information payload around.
Yes, you read that right. There was a variant. It only ever had one type of data in it, which was a vector. The vector always had exactly one element in it, which was a pair of iterators.
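For a sense of what that nesting looks like next to the bare payload, here is a rough sketch. It assumes C++17 and uses std::variant and std::vector as stand-ins for the boost types; the aliases and helper functions are hypothetical and are not the actual Epoch or spirit declarations.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>
#include <vector>

using SourceIter = std::wstring::const_iterator;
using IterPair = std::pair<SourceIter, SourceIter>;

// The shape described above: a variant with a single alternative,
// wrapping a vector that always holds exactly one pair.
using NestedPayload = std::variant<std::vector<IterPair>>;

// The same information with both wrappers removed.
using DirectPayload = IterPair;

NestedPayload make_nested(SourceIter first, SourceIter last) {
    // Constructing the one-element vector allocates storage for the pair.
    return NestedPayload{std::vector<IterPair>{IterPair{first, last}}};
}

DirectPayload make_direct(SourceIter first, SourceIter last) {
    return DirectPayload{first, last};
}

std::wstring text_of_nested(const NestedPayload& payload) {
    // Two layers to peel off before reaching the iterators.
    const std::vector<IterPair>& items = std::get<std::vector<IterPair>>(payload);
    assert(items.size() == 1);
    return std::wstring(items.front().first, items.front().second);
}

std::wstring text_of_direct(const DirectPayload& payload) {
    return std::wstring(payload.first, payload.second);
}
```

In this stand-in version, every nested token pays for a vector allocation plus a variant lookup, while the direct version is just two iterators. Whether the boost types behave the same way in every build is an assumption here, not something the sketch proves.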
When I figured this out late last night, I could have slapped myself.
I kicked off a build of EpochCompiler, waited patiently for about 30 minutes for it to finish, and then woke up about an hour ago. Oops.
Parse times of the 2MB test case now come in around 945ms.
Please read that number carefully. It is accurate.
Yes, that means that the parsing phase of the Epoch compiler is now more than one thousand times faster.
I think I can finally sleep easy tonight. Time to move on to semantic validation and code generation!