It doesn't have any guns, explosions, or crazy futuristic technology. There is no romance, no beautiful woman to be saved at the end, and no vague but stirring innuendo. There is no deep, insightful commentary on humanity or any of the goings-on of human beings.
There is, however, an interesting program bug; a brief glimpse into the guts of a commercial 3D game engine; some only-slightly censored "hot action"; and a shocking moral that will leave you breathless with introspection and personal revelation. I may also be making some of that up... but you'll have to read on to find out which bits aren't true!
4:00 AM, give or take.
Lately, I've been burning the proverbial candle, and not just at both ends; I gave up on that a couple of days ago and just chucked the whole thing into the fire. As part of some major code changes, I've checked in some code that, unfortunately, broke the build - shame, shame, I know.
I can't fix the build because the broken code file was locked for edits by another team member. What with time zones and all, I won't be able to sort things out until about 8AM, optimistically speaking. So I'm going to bed, and getting up again at 7:30. Ordinarily, I would just sleep in and maintain a bit of my (admittedly fragile) sanity, but I broke the build... no good being out of contact for a whole day while nobody else can get work done.
8:20 AM... approximately.
We got things sorted out and the build is now fixed. I'm now thoroughly loaded with caffiene, just to stay conscious, so I figure I may as well keep working for a while.
Another team member drops me an IM, reporting weird problems in the newly updated code.
Team member: if I get an exception I can't see the error message until I minimise the game window, but when I do that the rendering code crashes out losing me my message window and context
Huh. That's odd.
We spend some time talking out the details of the scenario, and then the phone rings. Bug is on hold.
I check in what I think is a fix for what I think is the problem. Turns out I'm completely wrong, and the fix actually makes things worse by masking some of the symptoms. We talk over the scenario again for a few minutes, but something else comes up, and my teammate has to fix something else unrelated.
Back on the case. We get a better handle on the problem. Part of the game database is failing to load, which is causing a NULL pointer to be passed around in the game logic. When it is finally dereferenced, the game crashes (as it should). I set up some hack changes to try and reproduce this problem.
I've reproduced the problem easily... but it's become clear that the problem I've created is not the real problem. There are actually two exceptions floating around - one C++ exception from the failing loader code, and an SEH exception from the NULL pointer dereference.
What's bizarre is that one exception is thrown; there's a debugging mechanism that pops up a modal MessageBox() when the exception object is constructed. This lets us immediately step the debugger into the code where the exception was thrown, before the stack unwinds and any catch handlers are invoked - which is a particularly useful tool.
Oddly, when the message box for the debug option pops up, notifying of the exception, code appears to still be running. We're still building a single-threaded branch of the project at this point, so the behaviour is completely unexpected.
What's worse, this only occurs when the game is running fullscreen, and the user has to Alt+Tab away from the fullscreen window to see the message. My original fix was to just hide the DirectX window when the message is displayed, then restore it afterwards; this is technically something we want to happen anyways, but it makes the buggy exception stuff disappear. Running in windowed mode also causes the problem to go away.
I decide to try and rig up a test case where I can see the call stack when the second, NULL-pointer exception is thrown.
The dreaded word heisenbug is invoked, multiple times. Spines in the immediate area are afflicted by a sudden onset of chills.
I successfully capture a call stack.
Unfortunately, the interesting bit is total crap, because I'm missing some debug symbols for some reason. We compare thoughts on the stack, noticing some weird voodoo going on in d3d9d.dll.
1:15 PM, roughly, thereabouts, etc.
After extensive buggering around with some Visual Studio settings (apparently, installing SP1 makes your symbol server settings go away magically) and downloading a few hundred MB of symbols from Microsoft, I'm back in business.
I recapture the call stack, and hit pay dirt:
Can you spot the source of the problem? Go ahead - give it a try before you read the answer.
You didn't try, you filthy cheater... you're just being lazy and skipping down to the good stuff.
Really, seriously - give it a shot. It's good insight into just how complex programming can be.
OK... either you've tried to figure it out now, or you're incorrigibly lazy. So here's the answer:
The bottom line is that when Direct3D runs in exclusive mode, it performs a window subclass on your main drawing window, to trap certain messages. Our code, when in fullscreen, was triggering this behaviour. The problem is, MessageBox() essentially runs its own little internal message pump loop, to ensure the box is modal.
Unfortunately, the process of Alt+Tab'ing out of the game into the debugger tripped a message which got pumped, by MessageBox(), to the main game window; this is then intercepted by the DXUT framework, which calls our rendering code.
The renderer isn't ready to render, because the game loading isn't done; the game loading, if you'll remember, threw an exception, which, thanks to our debugging mechanism, popped up the message box. Kaboom - nasty re-entrancy problem, and a mysterious null pointer exception that occurred when it "shouldn't have been possible."
The fix was trivial: the exception debug mechanism which shows the message box was tweaked slightly. Now, before showing the message, it explicitly hides and deactivates the render window, which prevents the re-entrancy problem. Rendering is restored after the modal message box returns.
The Moral of the Story
There's a couple of interesting lessons to be learned here.
First of all, never trust a call stack if you don't have symbols for all of the code in the stack. When your debugger says "frames below here may be incorrect", it means it. We spent a few minutes on a wild-goose chase looking at DXUT's keyboard hooks because the debugger spit up some random gook for stack frames it couldn't understand; as it happened, it pointed us at some internal Windows routines that are normally used for invoking a keyboard hook callback. Oops! Fortunately, we didn't waste much time on this.
Second, keep your tools up to date, and know them well. It didn't take long to get the correct debugging symbols; most of the time was spent downloading files from Microsoft's inexplicably sluggish file servers. Also, when you download symbols for an SDK, make sure you get symbol files that match the SDK version you're actually using; trying to load the December 2006 SDK symbols when compiling against the August 2006 SDK doesn't work. If I'd been keeping my stuff up to date, I could have saved some time.
Third, when debugging, precise replication is essential - there is a fine line between reproducing an existing bug, and creating new ones. I spent a lot of time creating new bugs; it took a few bogus attempts before I reproduced the bug. Although thankfully this was hardly a critical, blocking issue, the time was still wasted. In a real emergency situation, getting the bug right the first time is very important.
Fourth, fixing symptoms is not enough; never stop debugging until you know the root cause. In this case, my first fix (which masked the symptoms) was actually the right fix - but that was pure luck. If I had stopped when the symptoms went away, I could have potentially left a very serious concurrency problem lurking the code, waiting to bite us at some critical point - like, say, three days before gold-master deadline. Understanding the root cause of the bug is always critical to ensuring that the fix will kill the bug dead, and keep it dead.
Finally, it is always valuable to understand the inner workings of code - even if that code isn't yours. Knowing how window procedures work and how modal dialogs are typically implemented was crucial to solving this problem. However, what was even more important was understanding window subclassing; that was the crucial element that led us to understand Direct3D's role in the bug, and discover the nature of the problem. It took bare seconds on Google to confirm that Direct3D subclasses your window in certain circumstances.
Re-entrancy problems are sneaky and devious, especially when dealing with apparently blocking operations like a modal message dialog, that you do not expect to cause re-entrancy into your code. Sometimes, they can even cause apparently modal dialogs to disappear and be replaced with null pointer errors from code that shouldn't have even been running.
After several false starts, I finally finish this exceedingly captivating story, and submit it for publication on the Intertron.
And they all lived happily ever after to the end of their days.
 Yeah, I know, exclusive-mode is bad source control practice, blah blah. It's partly a cultural choice and partly just because it's easier than reworking a bunch of our pipeline; we mostly live with it and do a large portion of code changes locally and only check out files to merge changes. It just so happened that in this case things didn't work out that way. In any case, I'm not here to defend our source control practices, so bugger off [razz]