This model is fragile, ad hoc, poorly designed, and incredibly slow. So my first goal for Release 12 was to rewrite the parser to use boost::spirit::qi and generate a true Abstract Syntax Tree from the parsed source, then do a series of refinement passes over that AST for actual compilation. This will involve implementing several similar but distinct AST representations, each one capturing the work done in a separate pass of the compiler. Instead of mutating the AST in-place as decorations and improvements are performed, each AST will be immutable, and every pass will convert it into a new, parallel representation.
Most production compilers operate this way, and for good reason; it is far easier to think about how the code works in this model than when a monolithic AST structure is used and manipulated in-place during compilation. Paradoxically, it's also faster - because despite the copying of the AST in each pass, each IR can be optimized to do exactly what it needs to do and no more.
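To make the "new representation per pass" idea concrete, here is a minimal sketch in C++. Everything in it is hypothetical: the `parsed` and `typed` namespaces, the node types, and the `annotate` pass are invented for illustration and are not the actual compiler code. The pass takes its input by const reference and builds a separate output tree, so the input is never mutated.

```cpp
#include <string>
#include <vector>

// Representation produced by the parser (hypothetical node types).
namespace parsed {
struct Literal { std::string text; };
struct Program { std::vector<Literal> literals; };
}

// Representation produced by a later pass: the same shape, plus type info.
namespace typed {
enum class Type { Integer, String };
struct Literal { Type type; std::string text; };
struct Program { std::vector<Literal> literals; };
}

// One refinement pass: reads the immutable parsed tree and emits a new tree.
typed::Program annotate(const parsed::Program& input) {
    typed::Program output;
    output.literals.reserve(input.literals.size());
    for (const parsed::Literal& lit : input.literals) {
        // Toy rule for the sketch: a leading quote means a string literal.
        const bool quoted = !lit.text.empty() && lit.text.front() == '"';
        typed::Literal out;
        out.type = quoted ? typed::Type::String : typed::Type::Integer;
        out.text = lit.text;
        output.literals.push_back(out);
    }
    return output;
}
```

Because each representation carries only the fields its pass needs, a later pass never has to check whether an earlier decoration step has run yet.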
As of a few days ago, I had qi generating a rough AST; pass times dropped from ~10 seconds in the old implementation to ~4 seconds just by upgrading the parser. But 4 seconds for a 20KB input source file is still painfully slow, so I broke out the profilers and started looking for ways to improve on the AST generation pass.
Turns out that the vast majority of execution time was spent allocating and freeing memory, which is due to the default semantics of qi. Essentially, things like lists and variants are constructed every time the parser tries to match a production in the grammar - and then destructed if the match fails. This means that for nontrivial grammars, a huge amount of time is spent allocating memory that is never used.
To attack this, I wrote a simple "deferred construction" template which lazily allocates memory for AST nodes only once the production succeeds. From there on out, the node's contents are copied around and the final AST is constructed using only allocations that are absolutely necessary.
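A wrapper along these lines might look like the sketch below. It is a simplified illustration rather than the real template: the `Deferred` and `IdentifierNode` names and the `get`/`allocated` members are assumptions, and the integration with qi's attribute handling is left out entirely. The point is that default construction allocates nothing, so a production that fails to match costs only a null pointer.

```cpp
#include <string>
#include <utility>

// Hypothetical AST node type used only for this sketch.
struct IdentifierNode {
    std::string name;
};

// Lazily allocating wrapper: no heap allocation until the node is touched.
template <typename T>
class Deferred {
public:
    Deferred() = default;  // cheap: just a null pointer

    // Deep copy, matching the "contents are copied around" behavior.
    Deferred(const Deferred& other)
        : node_(other.node_ ? new T(*other.node_) : nullptr) {}

    // Copy-and-swap assignment; the by-value parameter does the copy.
    Deferred& operator=(Deferred other) {
        std::swap(node_, other.node_);
        return *this;
    }

    ~Deferred() { delete node_; }

    bool allocated() const { return node_ != nullptr; }

    // The first access allocates the node; later accesses reuse it.
    T& get() {
        if (!node_) node_ = new T();
        return *node_;
    }

private:
    T* node_ = nullptr;
};
```

A failed match constructs and destroys a `Deferred<IdentifierNode>` without ever calling `new`. Only a successful production calls `get()` to fill in the node.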
This dropped parse times on the test input from ~4 seconds to ~75 milliseconds - which is a very, very nice gain indeed.
My next step is to eliminate excessive copying of nodes once they are successfully allocated; each branch of the AST is immutable once the parser has successfully constructed it, so there's no need to store copies of every branch as parsing continues. This involves changing the deferred construction wrapper from using raw pointers to boost::shared_ptr, and just handing around references to the allocated node instead of making full deep copies.
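The shared-ownership variant can be sketched roughly as follows. To keep the example dependent only on the standard library, it uses std::shared_ptr as a stand-in for boost::shared_ptr. The `SharedDeferred` and `LiteralNode` names and the `address` helper are hypothetical. Copying the wrapper now copies a reference-counted pointer, not the node itself.

```cpp
#include <memory>
#include <string>

// Hypothetical AST node type used only for this sketch.
struct LiteralNode {
    std::string text;
};

// Lazily allocating wrapper whose copies share one underlying node.
template <typename T>
class SharedDeferred {
public:
    bool allocated() const { return static_cast<bool>(node_); }

    // The first access allocates the node; later accesses reuse it.
    T& get() {
        if (!node_) node_ = std::make_shared<T>();
        return *node_;
    }

    // Exposed so callers can check that two wrappers share a node.
    const T* address() const { return node_.get(); }

private:
    // Compiler-generated copies only bump the reference count.
    std::shared_ptr<T> node_;
};
```

Sharing like this is only safe because a branch is treated as immutable once it has been built. If code kept writing through `get()` after copies were handed out, every holder would see the change.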
Using the same input source as before, tests show that this new minimal-copy approach generates an AST in an average of 47 milliseconds. A single pass is now just over 200 times faster than the ~10 seconds it took in the old implementation.
Profiling indicates that there are still several points where I could make use of the deferred construction wrapper to further improve things, so I'm going to go ahead and try that next. I'll keep you posted!