The problem: I've been working on and fleshing out my codebase for several years now, and the vast majority of the time it works as expected. Nevertheless, there are occasional oddball situations that can easily lead to bug hunts lasting several days. These often end with me finding various (seemingly) unrelated bugs whose fixes also happen to solve the problem at hand, but in a very obscure way. In such situations I often have nothing more to do than shrug and carry on as if something so inexplicable happened that it defied the laws of physics. Some of this has to do with code complexity; however, I suspect it has more to do with gaps in my knowledge of how to handle these situations.
The setup: I'm writing a multi-threaded application that currently branches into two primary threads - the render thread and the input thread. Both have SEH exception handlers at the stem: surrounding the entire body of code in the render thread and in the WNDPROC callback in the input thread. I'm catching all standard exceptions.
Where it gets complicated: every now and then I stumble upon a first-chance exception that throws off my entire program. Now, I know what a first-chance exception is and I know why it's being thrown the way it is. However, the root cause is usually an access violation, which is always terminal in nature. Using the Disassembly window I can look up what's at the address where the exception is thrown, but generally it seems to be thrown in unmapped address space, which a) doesn't really help with pinpointing the cause and b) gives no indication as to where or when it was thrown.
Where it gets more complicated: I'm using a custom memory allocator that I wrote. Is it bug-free? Good question. The bottom line is, it's not too complicated and doesn't support fancier features like reference counting or compaction. But it's fast, multithreaded and it gets the job done. At least as far as I can tell. It also makes debugging considerably more obscure.
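To make the allocator question testable: one common way to check whether writes stray outside a block is to pad each allocation with guard bytes and verify them on free. The following is a minimal sketch of that idea, not my actual allocator. `std::malloc`/`std::free` stand in for it, and the `dbg` namespace, `Alloc`, `IsIntact`, `Free`, the guard size and the fill byte are all hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical debug wrapper. std::malloc/std::free are stand-ins for the
// custom allocator; none of these names come from the real codebase.
namespace dbg {

const unsigned char kGuardByte  = 0xFD; // arbitrary fill pattern
const std::size_t   kGuardSize  = 16;   // guard bytes on each side of the user block
const std::size_t   kHeaderSize = 16;   // stores the user size; 16 keeps malloc's alignment

// Block layout: [header: user size][front guard][user bytes][back guard]
void* Alloc(std::size_t size)
{
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(kHeaderSize + kGuardSize + size + kGuardSize));
    if (!raw) return 0;

    std::memcpy(raw, &size, sizeof(size));                    // remember the user size
    std::memset(raw + kHeaderSize, kGuardByte, kGuardSize);   // front guard
    std::memset(raw + kHeaderSize + kGuardSize + size,
                kGuardByte, kGuardSize);                      // back guard
    return raw + kHeaderSize + kGuardSize;
}

// Returns true if neither guard region has been overwritten.
bool IsIntact(const void* user)
{
    const unsigned char* p   = static_cast<const unsigned char*>(user);
    const unsigned char* raw = p - kGuardSize - kHeaderSize;

    std::size_t size = 0;
    std::memcpy(&size, raw, sizeof(size));

    const unsigned char* front = raw + kHeaderSize;
    const unsigned char* back  = p + size;
    for (std::size_t i = 0; i < kGuardSize; ++i) {
        if (front[i] != kGuardByte || back[i] != kGuardByte) return false;
    }
    return true;
}

// Checks the guards, releases the block, and reports whether it was intact.
bool Free(void* user)
{
    if (!user) return true;
    const bool ok = IsIntact(user);
    std::free(static_cast<unsigned char*>(user) - kGuardSize - kHeaderSize);
    return ok;
}

} // namespace dbg
```

A check like this only reports an overrun after the fact (at `Free` or at an explicit `IsIntact` call), and only if the stray write lands within the guard region, so a clean result narrows the search rather than proving the allocator innocent.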
The confusion: here's a very short snippet of code that exemplifies a common case of confusion. The following is a response to keyboard input and always crashes in the same way at the same moment. I've gone over the code preceding it and I can't find anything that might write into an invalid memory address (directly or remotely via another thread). I do recognize that this kind of inspection is not conclusive and doesn't guarantee that I didn't miss a bug. Nevertheless, it is still a fairly strong indicator that the access violation cannot occur in any other thread (since it's temporally locked to user input) and has a low probability of occurring sometime before the snippet below is executed. This, in turn, completely screws up any and all logic when it comes to tracking down the cause:
int Editor::HandleKeyboardInput(...)
{
    ...
    if(toolActive) { toolActive->Activate(false); }
    toolActive = newTool;
    if(toolActive) //all cool in the Watch window
        toolActive->Activate(true); //BOOM! crash, because all of a sudden the 'this' pointer is NULL!
    //EDIT: apparently the 'this' pointer is modified only occasionally; other times newTool's
    //Activate() starts pointing to unmapped space
}
Running the debugger through this with application-side exception handling disabled just gives me an infinite loop of first-chance and second-chance exceptions that point to exotic memory addresses.
The solution? I've gone back to manually commenting out code blocks, and ultimately it's not impossible to arrange the code in a way that gets rid of the exception. However, the logic behind tracking something like this down still eludes me, and I find myself resorting to trial and error, which frankly has a really poor probability of identifying and fixing the actual error that's causing this. After all, I've gone over everything ten times now, and most permutations that do get rid of the crash (at least in terms of how I rearrange my code) don't make much sense.
So, to recap - if anyone can point out glaring holes in the way I'm handling exceptions in my code, comment on or criticize the way I'm tracking them down, or provide overall suggestions, I'd appreciate it a lot. I realize the problem is likely something as silly as writing past an array boundary (even though I'm using guarded arrays in debug mode...), but experience has proven that the more innocuous the bug, the more days or weeks it takes to track down.
Oh - in case it becomes relevant, I'm on VS2010, Windows 8.1.
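For context on the guarded arrays mentioned in the recap, here is a rough sketch of the general technique, assuming a fixed-size wrapper that checks every index. `GuardedArray` and `Raw` are hypothetical names for illustration, not the actual class from my codebase, and the check is left always on here, whereas a real debug-only version would compile it out in release builds.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical bounds-checked array, for illustration only.
template <typename T, std::size_t N>
class GuardedArray
{
public:
    GuardedArray() : data_() {} // value-initialize all elements

    // Indexed access is checked: an out-of-range index throws
    // instead of silently touching neighbouring memory.
    T& operator[](std::size_t i)
    {
        if (i >= N) throw std::out_of_range("GuardedArray: index out of range");
        return data_[i];
    }

    const T& operator[](std::size_t i) const
    {
        if (i >= N) throw std::out_of_range("GuardedArray: index out of range");
        return data_[i];
    }

    std::size_t Size() const { return N; }

    // Raw access bypasses the check entirely; writes made through this
    // pointer (pointer arithmetic, memcpy, etc.) are not guarded.
    T* Raw() { return data_; }

private:
    T data_[N];
};
```

The limitation worth noting is that only accesses going through `operator[]` are checked. Anything that writes through a raw pointer into the same storage can still run past the end unnoticed, which is why having guarded arrays doesn't let me rule out an overrun.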