Multi thread deadlock issue with a recursive mutex. Need ideas.

Chindril · 2016-04-16T00:37:50

I have a pretty big issue right now with a deadlock in multithreaded software. I know which of my threads and mutexes cause the deadlock, but I do not understand why. I have a setup that looks like this (pseudo C++ code). AutoLock is a scoped lock class, nothing fancy.

    class Foo
    {
    public:
        void start(); // Spawns the thread that will call run()
        void doStuffA() { AutoLock lock(&mutexA); <DoStuffHere()> }
        void doStuffB() { AutoLock lock(&mutexB); <DoStuffHere()> }

    private:
        void extraWork()
        {
            AutoLock lock(&mutexB);
            // Processing here
            doStuffB();
        }

        void run() // Threaded function
        {
            while (true)
            {
                AutoLock lock(&mutexB); // Lock the B mutex
                // Do a lot of work, networking stuff, etc.
                extraWork(); // This is fine since we're using a recursive mutex
                AutoLock lock2(&mutexA); // Deadlock here after a couple of hours of run time.
            }
        }

        boost::recursive_mutex mutexA;
        boost::recursive_mutex mutexB;
    };

    int main()
    {
        Foo foo;
        foo.start();
        while (true)
        {
            foo.doStuffA();
            // Do stuff
            foo.doStuffA();
            // Do stuff
            foo.doStuffA();
            // Do some stuff
            foo.doStuffB(); // Usually hangs here a bit while foo finishes a loop
        }
    }

So the code is obviously not exactly like this, but the logic is the same. We ran this code with no problems for a long time and it just recently started deadlocking. We traced the code with dumps and know this setup is causing the problem. Note that the main thread cannot lock mutex A or B directly, only by calling public functions of Foo. Since we lock only through the AutoLock class (scoped lock), the main thread should never keep a lock on the mutexes. Yet the thread sometimes hangs indefinitely when trying to lock mutex A. I know from looking at the boost code that the hanging only happens if the current thread id inside the mutex is different from the one calling ->lock(). Therefore there are only two explanations for this problem:

1. The main thread somehow keeps a lock on mutex A.
2. There's memory corruption that messes up the data of mutex A.

I'm really out of ideas, and if some multithreading guru could give me tips on what to look for, it would be greatly appreciated.
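
For anyone following along, the ownership rule mentioned above (a recursive mutex only shuts out a caller whose thread id differs from the owner's) can be seen in a small sketch. It uses std::recursive_mutex from C++11 instead of boost::recursive_mutex, on the assumption that the two behave the same way here. The names otherThreadCanLock, ownerHoldsOthersOut and demoRecursiveOwnership are made up for illustration and are not part of the code in the post.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical helper: reports whether a *different* thread can take the
// mutex right now. try_lock() never blocks, so this probe cannot hang.
bool otherThreadCanLock(std::recursive_mutex& m)
{
    bool acquired = false;
    std::thread probe([&] {
        acquired = m.try_lock();
        if (acquired) m.unlock();
    });
    probe.join();  // after join(), the probe's write to 'acquired' is visible
    return acquired;
}

// The owning thread may lock again without deadlocking; another thread is
// refused for as long as the owner still holds at least one lock.
bool ownerHoldsOthersOut()
{
    std::recursive_mutex m;
    std::lock_guard<std::recursive_mutex> outer(m);
    std::lock_guard<std::recursive_mutex> inner(m);  // same thread: fine
    return !otherThreadCanLock(m);                   // other thread: refused
}

// Usage sketch.
void demoRecursiveOwnership()
{
    assert(ownerHoldsOthersOut());
}
```

Swapping a blocking lock() for a try_lock() probe like this in a debug build is one way to log who is being refused instead of hanging, though it only shows the symptom, not who left the mutex in that state.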

Started by Chindril March 20, 2016 12:39 AM

20 comments, last by Pink Horror 8 years ago

nfries88

1,154

April 15, 2016 05:26 AM

An update on this topic: our bugs were found and fixed. However, this thread was me looking in the wrong direction. The deadlock was a side effect of the real bug and not the source of my issues. The bug came from a system that was acting erratically and sending way too many messages over the network; our code could not keep up and the messages would stack up. Over time, having a ton of very small memory blocks, one for each message, would fragment the memory, and then various systems would fail. The deadlock was the most common effect, but we also had thread initialization failing and sometimes a straight-up memory allocation failure (malloc of a big chunk returns NULL and that pointer was then used).
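
The update doesn't say how the backlog was fixed. Purely as an illustration, one common way to keep pending messages from stacking up without limit is to cap the queue and refuse (or drop) anything past the cap. The class below is a hypothetical sketch of that idea, not the poster's code; BoundedMessageQueue, tryPush and tryPop are invented names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical bounded queue: refuses new messages once 'capacity' are
// pending, so a misbehaving sender cannot grow the backlog without limit.
class BoundedMessageQueue
{
public:
    explicit BoundedMessageQueue(std::size_t capacity) : capacity_(capacity) {}

    // Returns false (message not stored) when the queue is already full.
    bool tryPush(std::string message)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_.size() >= capacity_) return false;
        pending_.push_back(std::move(message));
        return true;
    }

    // Returns false when there is nothing to process.
    bool tryPop(std::string& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_.empty()) return false;
        out = std::move(pending_.front());
        pending_.pop_front();
        return true;
    }

private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::deque<std::string> pending_;
};

// Usage sketch: the third message is refused until one is consumed.
void demoBoundedQueue()
{
    BoundedMessageQueue queue(2);
    assert(queue.tryPush("first"));
    assert(queue.tryPush("second"));
    assert(!queue.tryPush("third"));
}
```

Whether dropping, blocking the sender, or disconnecting it is the right reaction to a full queue depends on the protocol; the sketch only shows the cap itself.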

So, you have memory allocation failures, and most of them do not crash your program immediately? And then you're stuck dealing with other bugs that look impossible? Let me guess, you have catch (exception) or, even worse, catch (...) everywhere, with maybe a log saying "unknown exception" if your programmers are slightly less lazy than the people who just leave the catch empty?

malloc does not throw exceptions. It simply returns NULL. On some systems, there is no built-in catch for NULL dereferences, and NULL+offsetof(SomeStruct, someMember) might reasonably point to memory used by the main thread's stack, some global variable, allocation records, or even the OS itself; any of which could have very unpredictable consequences. It's entirely possible no C++ exception was ever thrown and no OS exception/signal/etc was ever triggered, and memory was silently corrupted.
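
To make the point concrete, here is a minimal sketch of turning malloc's silent NULL into a loud failure at the allocation site. checkedAlloc, systemAlloc and failingAlloc are hypothetical names invented for this example; the allocator is passed in as a function pointer only so that a failure can be simulated without actually exhausting memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

using AllocFn = void* (*)(std::size_t);

// Thin wrappers so the allocator can be swapped out in the demo.
inline void* systemAlloc(std::size_t bytes) { return std::malloc(bytes); }
inline void* failingAlloc(std::size_t) { return nullptr; }  // pretends memory ran out

// Hypothetical helper: throws std::bad_alloc instead of handing back NULL.
// A zero-byte request is bumped to one byte, because malloc(0) is allowed
// to return NULL even when nothing is wrong.
void* checkedAlloc(std::size_t bytes, AllocFn alloc = systemAlloc)
{
    void* p = alloc(bytes == 0 ? 1 : bytes);
    if (p == nullptr) throw std::bad_alloc();
    return p;
}

// Usage sketch: the simulated failure is reported where it happens instead
// of surfacing later as a write through a near-NULL pointer.
void demoCheckedAlloc()
{
    void* buffer = checkedAlloc(64);
    std::free(buffer);

    bool threw = false;
    try {
        checkedAlloc(64, failingAlloc);
    } catch (const std::bad_alloc&) {
        threw = true;
    }
    assert(threw);
}
```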

Sure, that's possible, but I still think that it's relatively unlikely to have memory corruption de-referencing null pointers from failed malloc calls, instead of segmentation faults, compared to the chance this code is throwing and catching bad_alloc exceptions. The code above is clearly C++. I would guess new is being used, even with malloc mentioned earlier.

I've never worked on a program that corrupted memory through an offset null pointer. I have worked on code where memory usage would spike up and cause allocation failures, because it was filled with try/catch statements.

It's pretty common for C++ projects to interface with C libraries, a great many of which perform internal allocation and deallocation, and some of which might not check for failed allocation before access. It's not unheard of for C++ programmers to use malloc for buffers; this is actually my own practice. The new operator can be overloaded, and an overloaded new might not throw std::bad_alloc. It is possible to disable exceptions in C++. It is possible that this is a non-conforming C++ implementation with no exceptions: some C++ implementations intended for embedded applications lack RTTI and exceptions, along with a large number of other C++ features; possibly some lack the new operator altogether, and malloc is the only way to allocate. The implementation of malloc might be non-conforming, or a custom allocator might be used. There are any number of perfectly reasonable reasons why an allocation might not throw. I understand that sloppy exception handling is all too common, but not handling exceptions at all is even more common (I'm guilty of this), so it just seems odd to me to assume that's why the source of this bug was never caught.

Also, I need to point out that on systems with NULL pointer protection, dereferencing memory near 0 does not generate a C++ exception. It generates a fatal signal, which few people know how to recover from, or an SEH exception on Windows, the typical handling of which is to quietly close the program. That is the first thing that indicated to me that something else was the problem.
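
As an illustration of the "overloaded new might not throw" case, here is a hypothetical class with its own operator new that hands out slots from a tiny fixed pool. Because the overload is declared noexcept, the new-expression evaluates to nullptr when the pool is exhausted and no std::bad_alloc is ever thrown. PooledMessage, the pool globals and the sizes are all invented for this sketch, and the pool is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

namespace {
constexpr std::size_t kSlotSize = 16;   // bytes reserved per object
constexpr std::size_t kSlotCount = 2;   // deliberately tiny pool
alignas(std::max_align_t) unsigned char g_pool[kSlotSize * kSlotCount];
std::size_t g_slotsUsed = 0;            // not thread-safe; sketch only
}

// Hypothetical message type whose objects come from the fixed pool above.
class PooledMessage
{
public:
    explicit PooledMessage(int id) : id_(id) {}
    int id() const { return id_; }

    // noexcept: on exhaustion this returns nullptr, and the new-expression
    // then yields nullptr without running the constructor.
    static void* operator new(std::size_t size) noexcept
    {
        if (size > kSlotSize || g_slotsUsed == kSlotCount) return nullptr;
        return g_pool + kSlotSize * g_slotsUsed++;
    }

    // Slots are never reused in this sketch, so delete has nothing to do.
    static void operator delete(void*) noexcept {}

private:
    int id_;
};

static_assert(sizeof(PooledMessage) <= kSlotSize, "slot too small");

// Usage sketch: callers that forget this check end up dereferencing nullptr.
void demoPooledNew()
{
    PooledMessage* message = new PooledMessage(7);
    if (message != nullptr) {
        assert(message->id() == 7);
        delete message;
    }
}
```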

Pink Horror

2,459

April 16, 2016 12:37 AM

I was not assuming anything. I was guessing. It was a question.

This topic is closed to new replies.
