Cleaning Up Garbage
Earlier today I decided to go ahead and turn the garbage collector back on, and see just how bad things are.
On the plus side, the compiler still self-hosts in only a few seconds, so it's not nearly as horrid as it could be.
On the down side, there's a persistent crash deep in the garbage collector that seems to be related to getting invalid stack information from LLVM. This sort of thing has been a plague in the past, and I've lost a lot of hair trying to figure out similar bugs before.
So today's adventure is going to be debugging-heavy!
The first thing I need is a repro. Thankfully, self-hosting the compiler faithfully crashes in the same spot every time. Initially, I run the compiler under the debugger with a Release runtime, hoping it will be fast enough to not drive me insane.
Unfortunately, there's a nasty gotcha to running even Release-built programs under the Visual Studio debugger: all memory allocations go through a slow "debug" heap. This slows the debugged runtime down to a crawl, and after ten minutes of waiting for the compiler to parse a single file, I say screw it and look up a way to kill that behavior.
(For the record, set the environment variable _NO_DEBUG_HEAP to 1 and restart Visual Studio so it picks up the change, and VS will quit doing this. A full reboot works too, if you want to be thorough.)
Sadly, even with a fast runtime to test against, it's a Release build, which means that there's far too much optimization going on to easily unravel why this crash is occurring. So back to Debug builds we go!
I get a debug build running soon enough, and after waiting the requisite ten years for LLVM to start up in Debug mode, I quickly discover a new problem: stack overflow!
This is an interesting side effect of debug builds. They don't inline functions, so common Standard Library containers end up using an order of magnitude more stack space for simple operations than they do in optimized builds. Epoch, being very happy to implement lots of things using recursion, finds stack space to be at a premium. A complicating factor is that the garbage collector itself is recursive, so it too wants a lot of stack space.
So it looks like my next task is to rewrite the recursive garbage collector to be iterative instead. Hurray. Thankfully, it's not terribly hard, and only takes a few minutes. Now to run this debug build again...
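For anyone curious what that kind of conversion looks like, here is a minimal sketch. It is not the actual Epoch runtime code: `Object`, `MarkRecursive`, and `MarkIterative` are hypothetical names, and a real collector would track far more per object. The idea is simply to replace the native call stack with an explicit worklist that lives on the heap.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical heap object: a mark bit plus outgoing references.
struct Object {
    bool marked = false;
    std::vector<Object*> children;
};

// Recursive marking: uses one native stack frame per object along
// the deepest reference chain, which is what overflows in Debug.
void MarkRecursive(Object* obj) {
    if (obj == nullptr || obj->marked) return;
    obj->marked = true;
    for (Object* child : obj->children) MarkRecursive(child);
}

// Iterative marking: pending objects sit in a heap-allocated worklist,
// so native stack usage stays constant no matter how deep the graph is.
// Returns the number of objects newly marked.
std::size_t MarkIterative(const std::vector<Object*>& roots) {
    std::vector<Object*> worklist(roots.begin(), roots.end());
    std::size_t markedCount = 0;
    while (!worklist.empty()) {
        Object* obj = worklist.back();
        worklist.pop_back();
        if (obj == nullptr || obj->marked) continue;  // skip empty roots and repeats
        obj->marked = true;
        ++markedCount;
        for (Object* child : obj->children) worklist.push_back(child);
    }
    return markedCount;
}
```

The worklist can still grow large on a big heap, but it grows in heap memory rather than eating into the thread's fixed-size stack.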
After what feels like an eternity or three, the debugger halts - access violation! We're making progress!
A careful examination of the crashed process reveals that the cause of the bug is probably a recent change to the way pattern matching and type decomposition are implemented. There are now locations in the code where stack roots are not initialized to zero, meaning that the garbage collector will happily treat whatever junk is sitting in those slots as pointers and walk off into random memory trying to discover reachable objects.
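To illustrate why zeroed roots matter, here is a small sketch of a root scan. The `Frame` layout, the slot count, and `ScanFrame` are all assumptions made up for this example, not the real runtime's data structures. The point is that a null check is the only way the collector can tell an empty slot from a live one.

```cpp
#include <array>
#include <cstddef>

// Hypothetical heap object with just a mark bit.
struct Object {
    bool marked = false;
};

// Hypothetical stack frame with a fixed number of root slots.
struct Frame {
    // The trailing {} value-initializes the array, so every slot starts
    // out as nullptr. Without it, the slots would hold leftover stack junk.
    std::array<Object*, 4> roots{};
};

// Marks every object referenced from the frame's root slots and
// returns how many non-null slots were visited.
std::size_t ScanFrame(Frame& frame) {
    std::size_t visited = 0;
    for (Object* slot : frame.roots) {
        if (slot == nullptr) continue;  // empty slot: nothing to trace
        slot->marked = true;            // a junk pointer here would be an access violation
        ++visited;
    }
    return visited;
}
```

If a slot is never zeroed, the null check passes on garbage and the collector dereferences it, which matches the kind of access violation described above.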
In order to confirm this theory, I need to dump out the LLVM IR for the compiled program - which weighs in at around 6MB.
Analyzing the LLVM IR seems to corroborate my suspicions. There are indeed edge cases where pattern-matched functions could wind up passing bogus stack roots to the garbage collector. There are two options for fixing this: either figure out a way to eliminate the edge cases at code-generation time, or just remove the optimization hack that created them in the first place.
After waffling for a few minutes, I choose to remove the hack. It'll cost a bit of performance, but stability is more important.
Of course, nothing can ever be that easy. Even with the optimization hack removed, the crash occurs. There must be something deeper going on here.
It takes some deep diving, but I eventually notice something that might be related to the problem. In LLVM, the meta-instruction for flagging a variable as being a stack root is required to appear at the beginning of the function's code (first basic block, to be strictly accurate). In the dumped IR, there are many locations where this is being violated. I wonder if that's stomping on the ability of the JIT to accurately locate stack roots... in any case, it merits investigation.
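With a multi-megabyte IR dump, eyeballing this is painful, so a throwaway scanner can help. The following is a rough sketch and rests on several assumptions: that the marker in question is a call to the `llvm.gcroot` intrinsic, that instructions in the textual dump are indented while block labels start in column zero, and that older printers emit unnamed labels as `; <label>:N` comments. `FindMisplacedRoots` and `IsLabelLine` are hypothetical helpers, and a text heuristic like this is no substitute for walking the IR through LLVM's own API.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Heuristic: a basic block label is a non-indented line inside a function
// body that is either "; <label>:N" or a non-comment line containing ':'.
bool IsLabelLine(const std::string& line) {
    if (line.empty() || line[0] == ' ' || line[0] == '\t' || line[0] == '}') return false;
    if (line.rfind("; <label>", 0) == 0) return true;
    return line[0] != ';' && line.find(':') != std::string::npos;
}

// Returns the 1-based line numbers of gcroot calls that appear after
// the first basic block of their enclosing function.
std::vector<std::size_t> FindMisplacedRoots(const std::string& irText) {
    std::vector<std::size_t> offenders;
    std::istringstream in(irText);
    std::string line;
    std::size_t lineNo = 0;
    bool inFunction = false;       // between "define" and the closing brace
    bool seenInstruction = false;  // entry block has at least one instruction
    bool pastEntry = false;        // a later block label has been seen
    while (std::getline(in, line)) {
        ++lineNo;
        if (line.empty()) continue;
        if (line.rfind("define ", 0) == 0) {
            inFunction = true;
            seenInstruction = false;
            pastEntry = false;
            continue;
        }
        if (!inFunction) continue;
        if (line[0] == '}') { inFunction = false; continue; }
        if (IsLabelLine(line)) {
            // A label before any instruction just names the entry block.
            if (seenInstruction) pastEntry = true;
            continue;
        }
        if (line[0] == ' ' || line[0] == '\t') {
            seenInstruction = true;
            if (pastEntry && line.find("@llvm.gcroot") != std::string::npos) {
                offenders.push_back(lineNo);
            }
        }
    }
    return offenders;
}
```

Feeding the dumped IR text to `FindMisplacedRoots` yields a list of line numbers to inspect by hand, which at least narrows down where to look.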
Examining my code reveals that the fault doesn't lie in how I emit the stack root markers, so something in LLVM must be rearranging them. Then again, disabling all of the LLVM optimization passes fails to correct the crash, so maybe this was a red herring all along.
I need a simpler repro case, or I'm going to go insane waiting for this slow code to run. Thankfully, getting a working compiler back is just a matter of turning the GC back off for the time being, so I can build a simple test-case program.
Annoyingly, the first test case I build works fine, garbage collection and all.
Eventually, the head-scratching gets old and boring, and I think it's time to take a break from this sucker for a while. I'm not making any progress and I've quit thinking clearly; mostly I'm just staring blankly at the screen, hoping for inspiration.
So yeah... we'll try this again later!