Sounds like that game caused the programmer a few grey hairs! That's hellish!
I've had a similar experience, which was also hell. At the end of a project, we discovered a rare memory corruption bug, where some memory was just changing its value randomly. Usually this appeared as players suddenly becoming a soup of triangles. This occurred on all our dev-kits, so we quickly ruled out hardware faults.
It happened randomly, so you could never be sure whether you'd fixed it or not; sometimes it wouldn't occur until the game had been running for hours...
It took weeks just to diagnose the problem. Normally you'd find out which parts of memory are being corrupted, then set a Memory-Access-Trap (MAT) so that when the CPU writes to that memory you get a breakpoint and can discover the buggy code. Easy! However, the MAT was never triggered, which meant that the CPU never did write to the memory... So, by divide and conquer, I'd register regions of RAM to be memcpy'ed and memcmp'ed at certain points, to manually check if they'd been modified, and eventually found patterns in which areas were being modified. Then I could just go over the code responsible for those areas with a fine-toothed comb for bugs (which took more weeks).
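For anyone curious what that memcpy/memcmp checking looks like, here's a minimal sketch in C. It is not the code from that project: `WatchedRegion` and the `watch_*` functions are names I've made up for illustration, and I'm assuming the snapshot lives in ordinary heap memory. The idea is just to copy a region at a point where you know it's good, then compare it later. Because it compares contents rather than trapping writes, it notices changes no matter what made them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the original project): a registered
   region plus a private copy of its last known-good contents. */
typedef struct {
    const void *addr;     /* start of the watched region */
    size_t      size;     /* length of the region in bytes */
    void       *snapshot; /* copy taken at the last checkpoint */
} WatchedRegion;

/* Start watching a region. Returns 0 on success, -1 if the snapshot
   buffer could not be allocated. */
int watch_begin(WatchedRegion *w, const void *addr, size_t size)
{
    assert(w != NULL && addr != NULL);
    w->addr = addr;
    w->size = size;
    w->snapshot = malloc(size);
    if (w->snapshot == NULL)
        return -1;
    memcpy(w->snapshot, addr, size);
    return 0;
}

/* Returns 1 if the region no longer matches its snapshot, 0 if it does. */
int watch_changed(const WatchedRegion *w)
{
    return memcmp(w->snapshot, w->addr, w->size) != 0;
}

/* Take a fresh snapshot, e.g. after code that legitimately writes
   to the region has finished. */
void watch_refresh(WatchedRegion *w)
{
    memcpy(w->snapshot, w->addr, w->size);
}

/* Stop watching and release the snapshot buffer. */
void watch_end(WatchedRegion *w)
{
    free(w->snapshot);
    w->snapshot = NULL;
}
```

In practice you'd call `watch_refresh` right after the code that is supposed to write to a region, and `watch_changed` at checkpoints where nothing should have touched it. Shrinking or splitting the watched regions between runs is where the divide and conquer comes in.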
The real kick in the pants here was that there were two bugs.
- One was a bug in the GPU-side memory allocator, where render-targets were being told they had x memory to use, but it actually only allocated x-n memory, which caused the GPU to overflow this buffer by n bytes, randomly corrupting whatever happened to be allocated after that render-target.
This was the real bug I had to fix. It didn't trigger the MAT because the GPU accessed RAM through a different memory controller than the CPU... and even if it had triggered a CPU breakpoint, it would've been meaningless.
- The other was faulty RAM, only in my own dev-kit, the one I was using for the diagnosis!!! Much of the memory corruption I'd diagnosed in various systems was actually caused by a hardware fault on my machine, which meant I'd wasted weeks trying to find a cause in the code for corruption that didn't actually occur on normal hardware.
This didn't trigger the MAT because the problem was inside the RAM chip itself; nothing external ever actually wrote corrupt values to RAM.