This topic is 822 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

I have a pretty big issue right now with a dead lock in a multi threaded software. I know which of my threads and mutexes cause the deadlock but I do not understand why.

I have a setup that looks like this (Pseudo c++ code)

AutoLock is a scoped lock class, nothing fancy.

class Foo
{
public:
void start();   //Spawns the thread that will call run()

void doStuffA() {AutoLock lock(&mutexA); <DoStuffHere()> }
void doStuffB() {AutoLock lock(&mutexB); <DoStuffHere()> }

private:
void extraWork()
{
AutoLock lock(&mutexB);

//Processing here

doStuffB();
}

{
while(true)
{
AutoLock lock(&mutexB);   //Lock the B Mutex

//Do a lot of work, networking stuff, etc
extraWork();   //This is fine since we're using a recursive mutex

AutoLock lock2(&mutexA); //Dead lock here after a couple of hours of run time.
}
}

boost::recursive_mutex mutexA;
boost::recursive_mutex mutexB;
};

int main()
{
Foo foo;
foo.start();

while(true)
{
foo.doStuffA();
//Do stuff
foo.doStuffA();
//Do stuff
foo.doStuffA();

//Do some stuff

foo.doStuffB();    //Usually hangs here a bit while foo finishes a loop
}
}

So the code is obviously not exactly like this but the logic is the same. We ran this code with no problems for a long time and it just recently starting deadlocking. We traced the code with dumps and know this setup is causing the problem.

Note that the main thread cannot lock A or B mutex directly but only by calling public functions of Foo. Since we lock only using the AutoLock class (Scoped lock), the main thread should never keep a lock on the mutexes. Yet, the thread sometimes hangs indefinitely when trying to lock mutex A.

I know from looking at boost code that the hanging only happens if the current thread id inside the mutex is different from the one calling ->lock(). Therefore there's only 2 explanations to this problem.

1. The main thread somehow keeps a lock on mutex A.
2. There's memory corruption that messes the data of the mutex A.

I'm really out of ideas and if some multithreading guru could give me tips on what to look for it would be greatly appreciated.

##### Share on other sites

I wouldn't call myself a guru, but I've been touching concurrent code for over a decade, a few years ago I wised up and avoid techniques like you're employing here though.

1) I recommend you don't use recursive locks. You'll actually get better concurrent performance if you release the lock around function calls. Remember: locks prevent concurrency. You want to hold a lock as briefly as possible.
2) You have two threads that could possibly be holding this lock, right? So check the thread holding the lock. recursive_mutex is implemented in headers, just find your system's implementation and change recursive_mutex::owner's visibility to public, then do the same owner check your system's recursive_mutex implementation does. If neither thread matches, you must have memory corruption. If the main thread matches, you're not releasing the lock somewhere. If the "Foo" thread matches, I don't know what the problem is.

Edited by nfries88

##### Share on other sites

DoStuffHere doesn't touch mutexA/mutexB in any way?

Yeah it could be a memory corruption issue... Though I've had incorrect threading code actually work fine and consistently pass unit test for a year, before it decided to fail the unit tests once. An audit of the code did find a subtle logic bug...

You can see if this "fixes" the issue:

int pad1[1024];
boost::recursive_mutex mutexA;
boost::recursive_mutex mutexB;
int pad3[1024];

Why is mutexA being locked/unlocked at the end of that loop anyway?

##### Share on other sites

@nfries88: I admit I should look back at the whole class to see if recursive mutex are really needed, but it is useful when your code is split in multiple functions and each one of those needs the lock.

As for your 2nd point, the main thread does lock the mutexes throught the Foo class but never access them directly. Technically it cannot keep a lock on a mutex when the stack isn't in the Foo class. We actually updated our code to save the recusive mutex information about thread info so we will see if it is a memory corruption.

@Hodgman: DoStuffHere isn't a function, just to say there's some code in there that the mutex protect. I agree the code seems ridiculous because I removed most of the important processing. Would it make more sense if I said the Lock on MutexA at the end of a thread loop is really coming from a network event processed that needs to modify variables protected by that mutex ?

I'll try the patch of padding the mutexes with char buffers. I didn't think about it but it might point me to the correct direction.

Thank you both.

##### Share on other sites

@nfries88: I admit I should look back at the whole class to see if recursive mutex are really needed, but it is useful when your code is split in multiple functions and each one of those needs the lock.

It makes cleaner code at the cost of performance.
The only reasonable argument I could hear in favor of recursive mutex stems from a genuine need for a guarantee of non-interruption, but in this case a second function can be used to achieve the same effect, IE:
(obviously a sane implementation would do data processing in the function. This is just an example)

class Example
{
public:
private:
Data *data;
}
Edited by nfries88

##### Share on other sites

Hi.

It could be because the function extraWork you call doworkB which locks b but b is locked already in the call to extraWork.

Maybe boost::strands could help you here. or lockfree queue's. The only place for mutex is in the connection phase of the networking, where you have simultaneous clients connecting to a data base or some thing even then you could go lock free as well.

##### Share on other sites

I'm not sure how I missed it before, but Frob's comment seems to have shaken the gunk from my brain.

Based on what you've shared, the most probable location of a deadlock is in Foo::doStuffA or functions it calls. Either:
1) You have an infinite loop.
2) You're locking mutexB.

##### Share on other sites

@ankhd: That's why a recursive mutex is used. As long as you are in the same thread you can look a mutex any number of times. It is reference counted and the lock is released only when all of the unlock() functions are called.

@frob: I was burned in the past with the ordering of multiple mutex locks and that's the first thing I look for when I have a multi thread. I suppose this code is risky and should be done in a better way. I never thought before of a class to handle multiple locks to prevent ordering issues but it sounds like a good idea for the future.

I need to review a larger scope of the code since the cause of this deadlock isn't obvious at first sight. I got 3 programmer to check the code and all of them think a deadlock shouldn't be happening. Perhaps the problem is elsewhere and the deadlock is a symptom of a bigger problem. Anyhow I'll keep in mind all of your suggestions and keep you posted when we find the issue :).

Thanks.

##### Share on other sites

all of them think a deadlock shouldn't be happening.

A single lock is not a problem.

Any time you acquire more than one lock a deadlock is not only possible, it is likely going to happen unless a try-else-rollback system is in place.

1. 1
Rutin
24
2. 2
3. 3
JoeJ
18
4. 4
5. 5

• 38
• 23
• 13
• 13
• 17
• ### Forum Statistics

• Total Topics
631709
• Total Posts
3001844
×