Still Addicted to Speed (no, not that kind)

Published December 28, 2013
Advertisement
Over the past few days I've managed to hit what seems to be a local minimum in the compiler's speed: 1.1 seconds. Of course, this is slightly inflated, because the compiler is now substantially larger than it was before. If I do a pure apples-to-apples comparison, the compiler hits about 850 milliseconds, which isn't terrible, but isn't what I want either.

The C++ implementation of Epoch, by way of comparison, used to be able to build the Epoch compiler in about 250 milliseconds. Now, that took a huge amount of careful tuning and optimization, and I haven't invested nearly as much work into the new compiler yet, so I'm not totally abandoning hope. But it's definitely going to be a tough fight to get this thing another 3-4x faster.

Optimizing code at this level is an exercise in extreme patience, coupled with the discipline to be exactingly scientific about every single change, and profile the hell out of everything. It's also tricky because, like quantum mechanics, it is possible to disturb the speed of the program by observing it. Specifically, adding profiling code can cause ordinarily fast functions to become major bottlenecks if they are called frequently and exit relatively soon, and so on.

The profiling mechanism I'm using is crude and may not last much longer. I basically inject a timer into every function's entry and exit points in the JIT engine, and keep running track of how many milliseconds are spent in each function. When the program exits, this is dumped out in sorted order to the console, so I can see where things spend their time.

Analyzing these numbers is tricky because unlike many profilers this tactic gives me inclusive running times. Say we have a function Foo, which does nothing but call Bar, which takes a long time to run. You might expect Foo to show up with a low number and Bar with a high number - but in inclusive profiling, both Foo and Bar will be high-valued. So in order to find real hotspots, it's important to take into account all of the child and grandchild calls (and so on) a function might make, because all of those will contribute to the function's running time.

To supplement the timing data, I've started emitting a call count as well. This is a simple number that tells me how many times a function was entered over the lifetime of the program. Some are totally sane, like ParseCodeBlock with 2118 calls.

Others are ridiculous, like EnumerateOverloadsAndAddParameterTypes with over 3.5 million calls. There aren't 3.5 million statements in the compiler, so why is this getting called so much?!

Turns out I had a nasty problem where built-in functions would first search the entire list of programmer-defined functions looking for overloads. Every single time any overloaded function was called. Oops. Fixing the brain-dead bug dropped the call count to around 750,000. Still not great, but far better than 3.5 million.

The next major offender was the algorithm for trie searching. I use a trie data structure to hold the list of sttrings pooled by the program being compiled; this includes every identifier and string literal in the program, so naturally it gets rather large.

An iterative attempt at traversal actually turned out to be substantially slower, because of the way type decomposition is implemented. (It's complicated and I don't feel like explaining it right now, but if anyone cares, I'll be happy to dig into the details.)

My next improvement was to remove the string-length computation from the inner loop and call it once during the search setup, and cache it thereafter; this netted a tiny speedup, despite the 2-million-odd calls still occurring.

Inspired by my discovery that iterative traversals of some data structures is actually a performance liability, I tried A/B comparisons on the AVL-tree implementation next. Shockingly, a recursive implementation of that tree traversal would up almost twice as fast as the iterative version.

Not one to question good fortunate, I replaced several of the AVL-tree algorithms I'd implemented with recursive editions.


I take a moment to turn off profiling and get a raw speed reading. I'm pleasantly shocked to see the compiler set a new record of 963 milliseconds.

Of course, it only takes a few more minutes of probing through profiler results to drop that to 942.

At this point, the scheduler of the OS starts interfering with accurate measurement. If the compiler is context-switched out once too often, the compile time can jump by almost 20 milliseconds. It becomes painful to detect real speedups, since it takes dozens of runs to build an average that is trustworthy.

912 milliseconds stands as the current record.

I found a couple more places where I could take advantage of the binary search tree structure; this dropped things down to a blazing 860 milliseconds. Even at this speed, I'm craving for it to run faster. Until I'm in the ballpark of the C++ compiler, I will not consider my job complete.

Momentary panic... the compiler has stopped passing validation tests and is miscompiling code. I have to take a few minutes to revert some changes and see what caused the breakage. Eventually I pinpoint the bug and clean it up, but at the cost of regressing to 883 milliseconds.

At least, I thought I'd killed the bug... it seems to have returned with a vengeance! This turns out to be a nasty bit of fiddly semantics in the language itself. When passing a parameter of the "nothing" type, the compiler formerly assumed that the parameter would otherwise be passed by value (if it wasn't nothing). However, this is usually false, since most parameters are passed by reference. It's a dumb assumption and I'll have to revisit it and clean up the specification someday. On the plus side, I'm back to 870 milliseconds, which is nice.

It's time to go back to profiling. My first couple of attempts at locating and fixing hotspots go foul; I'm regressing instead of getting faster, which is both demoralizing and increasingly easy. When things are this close to the wire, oftentimes the cost of setting up complex data structures dwarfs the benefits they have over simpler constructs. Even though I'm using a lot of O(n) logic via linked lists, it's faster to traverse a set of simple nodes than to, say, try and maintain balance in an AVL tree.

Slowly it becomes clear that profiling is not going to glean much more speed from this thing, at least not without finding a different method of profiling. The instrumentation dramatically raises the cost of function calls and screws with inlining, meaning that the running code is not only much slower, it's slower in very different places.

None of my last several attempts at optimizing have actually netted any speed gain, primarily due to this skewing of the hotspot information. I'm going to need to devise a better way to time this code.

Out of curiosity, I do a quick comparison. I reach back through time, via the magic of version control, and retrieve the first copy of the compiler that successfully self-hosted. I then feed that through the current compiler - and it runs in 689 milliseconds.

Between the first self-hosting and now, the compiler source code has grown by 40KB. Obviously there's a lot of complexity represented there, because that 40KB translates into an extra 200 milliseconds of compile time. (For comparison, the entire compiler is about 375KB.)

Interestingly, the biggest difference in speed between compiling the old compiler and the new one is in semantic analysis - the phase of the compiler that makes sure all the data types are correct, expands templates, and so on. Parsing is barely a blip of difference, and code generation accounts for a small but appreciable chunk of the remainder. So semantic analysis seems to carry some bad behavior that slows it down disproportionately. I shall have to dig into this deeper when my brain is less fried.

Part of my obsession with speed is that I still haven't turned the garbage collector back on, and frankly I'm worried about what will happen when I do. I want the baseline to be as fast as possible so that GC doesn't make things painfully slow.


In any case, I'm too mentally tired to keep hacking on this for today.
5 likes 3 comments

Comments

3Ddreamer

Apoch,

As always, your blog is fun, inspiring, and informative.

Thanks again

December 28, 2013 06:25 AM
evolutional

Out of interest, why does getting the compiler speed any quicker motivate you? Having a sub second compile is a great achievement, especially considering your start point. Trying to optimize more seems like it'd suck up a lot of time into an endless loop with diminishing returns, detracting time from feature work.

December 29, 2013 04:46 PM
ApochPiQ
Fair question!

The main motivating factor is that the C++ implementation of this same compiler runs in about 300-350 milliseconds. I want to get at least close to that range, because "performance vs. C++" is going to be a major question of people considering adopting the language. If I have solid evidence that the core language is capable of rivaling a C++ program for speed (beyond the couple of artificial benchmarks I have which already do this) it makes the whole affair that much more appealing to someone looking to write "real" software in Epoch.
December 29, 2013 10:53 PM
You must log in to join the conversation.
Don't have a GameDev.net account? Sign up!
Profile
Author
Advertisement
Advertisement