A Ball of Yarn is not just one bug. It's a large number of bugs, all in close proximity, which influence and interact with each other. If you pull on one of the threads, one of two things will happen: either you will slowly gain insight into the mess, or (more likely), you'll inadvertently tighten a knot.
Garbage collectors are a prime source of Ball of Yarn bugfests. So many subtle factors can influence them going off the rails that when a GC blows up the cause is probably highly nontrivial. Here's a summary of the threads I've yanked on this past week.
These are always scary. You have a bug, that's bad enough. But then you go to investigate it, and the closer you peer, the more the behavior changes. This is a classic recipe for disaster and requires immense patience and focus to solve.
I had one earlier that was classic bizarre-o-world: if I ran my entire unit test suite against the compiler/runtime (and, by extension, the garbage collection system), certain GC-related tests would fail. Predictably. Every time.
If I ran the failing tests individually, they would pass. Predictably. Every time.
After scratching my head for a little while it occurred to me that a previous test in the suite must be setting up some kind of condition that affected the GC behavior. This was supposed to be impossible since I've gone to great lengths to isolate each unit test from the others.
Or so I thought.
Long story short: turns out global variables are bad, mmkay? No, but seriously. It wound up being a single innocuous global that did, in fact, alter the behavior of the GC indirectly. Resetting its value between unit tests fixed the Heisenbug.
Which way is up?
Epoch is compiled to an abstract machine language by the Ahead-of-Time compiler tools. This machine language is no longer spoken by anything, since I killed the VM that ran it a few months ago. However, it does serve as a first-pass form that is mechanically translated into LLVM bitcode and then JIT-compiled to native machine code when the program is loaded.
(Eventually I want to just serialize LLVM bitcode into the program image and load/JIT it directly instead of recreating it at runtime, but that's just an optimization for later.)
LLVM has a nice feature which is integral to doing precise garbage collection, known as stack maps. A stack map is simply a data structure that describes where certain variables live on the machine stack at a given moment in time.
This is vital because stack mapping allows us to know what roots exist in the program. Roots are the variables that, via traversing pointers/references, lead us eventually to all the valid objects that are allocated and need to be kept alive instead of recycled by the garbage collector. Without roots, our option is to do conservative garbage collection (i.e. if it looks like it might be a pointer to a legitimate object, treat it as one, even if it's just a lucky-shaped int or something). Conservative collection is gross and I refuse to do it for Epoch, so stack mapping it is!
Unfortunately, LLVM's documentation around how stack maps work is... lacking, to be extremely gracious.
I spent a lot of time this week waffling on how to actually use the data in a stack map from LLVM. Turns out that if a certain number is positive, you interpret it one way; and if it's negative, you interpret it an entirely different way.
Of course none of this is written down anywhere and trying to unravel the LLVM source to understand it was less than helpful. (Bonus points: another addition to the "less than helpful" camp is the LLVM community itself, which seems to treat any work on garbage collection as some kind of plague.)
Bottom line, it was hard enough to figure out my first semi-working attempt at understanding stack map data. Second-guessing myself all week did not help, and it turns out I was probably right the first time.
Validation is important, kids
A couple other fun bugs arose from not validating data correctly. The GC formerly had a few places where, if it couldn't recognize the type of an object, would just punt and assume it was legit and had some arbitrary type. This is a terrible idea.
What would happen is the GC would start traversing memory, and come across an object it didn't recognize. It would then dutifully add that to a list of things to go back and check up on later. In some cases, the bogus data would just be ignored, and everything would continue on as if nothing had happened. In other cases, the result was silent failures to correctly mark live objects, which would then be swept by the GC and destroyed while the program was still using them. Oops.
This kind of nonsense accounted for not one but three semi-independent bugs in the GC that I fixed over the past week.
FPO is evil
I discussed this in a previous entry, but the bottom line is that some compiler optimizations aren't worth the trouble. FPO is one of them.
I managed to work around the problem by introducing the concept of a "bookmark." A bookmark is a simple list of all the times we exit an Epoch stack frame and enter a foreign stack frame, such as when calling Win32 API functions. During garbage collection, we traverse stack frames that have been bookmarked in addition to any frames we can find using standard frame-pointer walking. This covers the gaps very effectively, at a small performance penalty. I'm planning on revisiting this someday and cleaning up the cases where bookmarks are stored so I can reclaim some of that speed.
But wait, there's more!
As of now, there's still at least one bug left in the GC. It either relates to algebraic sum types, variadic functions, or both. I'm not entirely sure how to resolve this last one, aside from just hammering away at it for a few days and hoping for the best.
Despite all this, there's definite and clear forward progress being made. I hope to wrap this up soon-ish and return to the self-hosting compiler project, although I have to admit it's been an entertaining diversion.