An update on this topic: our bugs were found and fixed. However, this thread was me looking in the wrong direction. The deadlock was a side effect of the real bug, not the source of my issues. The real bug was a system that was acting erratically and sending far too many messages over the network; our code could not keep up, and the messages would stack up. Over time, having a ton of very small memory blocks, one for each message, would fragment the memory, and then various systems would fail. The deadlock was the most common effect, but we also had thread initialization failing and sometimes a straight-up memory allocation failure (malloc of a big chunk returned NULL and that pointer was then used).
So, you have memory allocation failures, and most of them do not crash your program immediately? And then you're stuck dealing with other bugs that look impossible? Let me guess, you have catch (exception) or, even worse, catch (...) everywhere, with maybe a log saying "unknown exception" if your programmers are slightly less lazy than the people who just leave the catch empty?
malloc does not throw exceptions. It simply returns NULL. On some systems, there is no built-in catch for NULL dereferences, and NULL+offsetof(SomeStruct, someMember) might reasonably point to memory used by the main thread's stack, some global variable, allocation records, or even the OS itself; any of which could have very unpredictable consequences. It's entirely possible no C++ exception was ever thrown and no OS exception/signal/etc was ever triggered, and memory was silently corrupted.
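To make the point about malloc concrete, here is a minimal sketch of checking the returned pointer before it is used. The `Message` struct and the `make_message`/`free_message` helpers are hypothetical names made up for illustration; they are not from the OP's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical message type, for illustration only.
struct Message {
    std::size_t length;
    char* payload;
};

// Copies `length` bytes from `data` into a freshly malloc'd buffer.
// Returns false and leaves `out` untouched if the allocation fails,
// instead of writing through a null pointer.
bool make_message(const char* data, std::size_t length, Message& out) {
    assert(data != nullptr);
    // malloc(0) is allowed to return NULL, so always request at least one byte.
    char* buffer = static_cast<char*>(std::malloc(length != 0 ? length : 1));
    if (buffer == nullptr) {
        return false;  // malloc reports failure through its return value, not an exception
    }
    std::memcpy(buffer, data, length);
    out.length = length;
    out.payload = buffer;
    return true;
}

// Releases the buffer and resets the struct so a stale pointer cannot be reused.
void free_message(Message& m) {
    std::free(m.payload);
    m.payload = nullptr;
    m.length = 0;
}
```

The caller then has to decide what a failed allocation means (drop the message, back off, shut down cleanly); the only point here is that the failure is visible at the call site rather than showing up later as corrupted memory.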
@OP: I assume you corrected these side effects by checking the address returned by malloc, and probably corrected the system that was acting up. But you should still do something about the heap fragmentation. If one system sending too many messages causes heap fragmentation, what happens when several systems have to send that many messages just to handle their workloads? It might be something you can put off for a while, but you may find yourself needing to resolve the fragmentation issue in the future. I'd treat this as an early warning.
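One common way to attack this kind of fragmentation is a fixed-size block pool: reserve all message storage once, then hand out and recycle equal-sized blocks so the general-purpose heap never sees the per-message churn. The sketch below assumes messages fit in a known maximum size; `MessagePool` and its methods are hypothetical names, it is not thread-safe, and it may or may not fit the OP's actual design.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size block pool. All storage is reserved once, up front.
class MessagePool {
public:
    MessagePool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(block_size_ * block_count) {
        free_list_.reserve(block_count);
        for (std::size_t i = 0; i < block_count; ++i) {
            free_list_.push_back(storage_.data() + i * block_size_);
        }
    }

    // Returns nullptr when every block is in use. The caller decides
    // whether to drop the message or make the sender back off.
    void* acquire() {
        if (free_list_.empty()) {
            return nullptr;
        }
        void* block = free_list_.back();
        free_list_.pop_back();
        return block;
    }

    // Returns a block previously obtained from acquire() to the pool.
    void release(void* block) {
        assert(block != nullptr);
        free_list_.push_back(static_cast<unsigned char*>(block));
    }

    std::size_t available() const { return free_list_.size(); }
    std::size_t block_size() const { return block_size_; }

private:
    // Rounds up to a multiple of the strictest fundamental alignment so
    // every block start is suitably aligned for ordinary types.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return ((n == 0 ? 1 : n) + a - 1) / a * a;
    }

    std::size_t block_size_;                 // must be declared before storage_
    std::vector<unsigned char> storage_;     // one contiguous allocation
    std::vector<unsigned char*> free_list_;  // blocks currently not in use
};
```

A side benefit of this design is that the backlog is bounded: when `acquire()` returns nullptr you know right away that a sender is outpacing the consumer, instead of finding out much later through unrelated failures.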