Multi thread deadlock issue with a recursive mutex. Need ideas.

Started by
20 comments, last by Pink Horror 8 years ago

I have a pretty big issue right now with a dead lock in a multi threaded software. I know which of my threads and mutexes cause the deadlock but I do not understand why.

I have a setup that looks like this (Pseudo c++ code)

AutoLock is a scoped lock class, nothing fancy.


class Foo
{
public:
    void start();   //Spawns the thread that will call run()

    void doStuffA() {AutoLock lock(&mutexA); <DoStuffHere()> }
    void doStuffB() {AutoLock lock(&mutexB); <DoStuffHere()> }

private:
    void extraWork()
    {
        AutoLock lock(&mutexB);
        
        //Processing here

        doStuffB();
    }

    void run()   //Threaded function
    {
        while(true)
        {
            AutoLock lock(&mutexB);   //Lock the B Mutex

            //Do a lot of work, networking stuff, etc
            extraWork();   //This is fine since we're using a recursive mutex

            AutoLock lock2(&mutexA); //Dead lock here after a couple of hours of run time.
        }
    }

    boost::recursive_mutex mutexA;
    boost::recursive_mutex mutexB;
};


int main()
{
    Foo foo;
    foo.start();

    while(true)
    {
        foo.doStuffA();
        //Do stuff
        foo.doStuffA();
        //Do stuff
        foo.doStuffA();

        //Do some stuff

        foo.doStuffB();    //Usually hangs here a bit while foo finishes a loop
    }
}


So the code is obviously not exactly like this but the logic is the same. We ran this code with no problems for a long time and it just recently starting deadlocking. We traced the code with dumps and know this setup is causing the problem.

Note that the main thread cannot lock A or B mutex directly but only by calling public functions of Foo. Since we lock only using the AutoLock class (Scoped lock), the main thread should never keep a lock on the mutexes. Yet, the thread sometimes hangs indefinitely when trying to lock mutex A.

I know from looking at boost code that the hanging only happens if the current thread id inside the mutex is different from the one calling ->lock(). Therefore there's only 2 explanations to this problem.

1. The main thread somehow keeps a lock on mutex A.
2. There's memory corruption that messes the data of the mutex A.

I'm really out of ideas and if some multithreading guru could give me tips on what to look for it would be greatly appreciated.

Advertisement

I wouldn't call myself a guru, but I've been touching concurrent code for over a decade, a few years ago I wised up and avoid techniques like you're employing here though.

1) I recommend you don't use recursive locks. You'll actually get better concurrent performance if you release the lock around function calls. Remember: locks prevent concurrency. You want to hold a lock as briefly as possible.
2) You have two threads that could possibly be holding this lock, right? So check the thread holding the lock. recursive_mutex is implemented in headers, just find your system's implementation and change recursive_mutex::owner's visibility to public, then do the same owner check your system's recursive_mutex implementation does. If neither thread matches, you must have memory corruption. If the main thread matches, you're not releasing the lock somewhere. If the "Foo" thread matches, I don't know what the problem is.

DoStuffHere doesn't touch mutexA/mutexB in any way?

Yeah it could be a memory corruption issue... Though I've had incorrect threading code actually work fine and consistently pass unit test for a year, before it decided to fail the unit tests once. An audit of the code did find a subtle logic bug...

You can see if this "fixes" the issue:

int pad1[1024];
    boost::recursive_mutex mutexA;
int pad2[1024];
    boost::recursive_mutex mutexB;
int pad3[1024];

Why is mutexA being locked/unlocked at the end of that loop anyway?

@nfries88: I admit I should look back at the whole class to see if recursive mutex are really needed, but it is useful when your code is split in multiple functions and each one of those needs the lock.

As for your 2nd point, the main thread does lock the mutexes throught the Foo class but never access them directly. Technically it cannot keep a lock on a mutex when the stack isn't in the Foo class. We actually updated our code to save the recusive mutex information about thread info so we will see if it is a memory corruption.

@Hodgman: DoStuffHere isn't a function, just to say there's some code in there that the mutex protect. I agree the code seems ridiculous because I removed most of the important processing. Would it make more sense if I said the Lock on MutexA at the end of a thread loop is really coming from a network event processed that needs to modify variables protected by that mutex ?

I'll try the patch of padding the mutexes with char buffers. I didn't think about it but it might point me to the correct direction.

Thank you both.


@nfries88: I admit I should look back at the whole class to see if recursive mutex are really needed, but it is useful when your code is split in multiple functions and each one of those needs the lock.

It makes cleaner code at the cost of performance.
The only reasonable argument I could hear in favor of recursive mutex stems from a genuine need for a guarantee of non-interruption, but in this case a second function can be used to achieve the same effect, IE:
(obviously a sane implementation would do data processing in the function. This is just an example)


class Example
{
public:
    void SimpleTask(){ lock(); UnsafeSimpleTask(); unlock();
    void ComplexTask(){ lock(); data->process(); UnsafeSimpleTask(); data->processMore(); unlock(); }
private:
    void UnsafeSimpleTask(){ data->processYetAgain(); }
    Data *data;
}

Hi.

It could be because the function extraWork you call doworkB which locks b but b is locked already in the call to extraWork.

Maybe boost::strands could help you here. or lockfree queue's. The only place for mutex is in the connection phase of the networking, where you have simultaneous clients connecting to a data base or some thing even then you could go lock free as well.


            AutoLock lock(&mutexB);   //Lock the B Mutex

            //Do a lot of work, networking stuff, etc
extraWork(); //This is fine since we're using a recursive mutex

            AutoLock lock2(&mutexA); //Dead lock here after a couple of hours of run time.

Right there you acquire multiple locks. That is a common source of deadlocks. Additionally you are holding a lock for a long time to "do a lot of work" and "networking stuff" which is going to be slow. Holding locks over time is a good way to starve your system.

A common problem in multiple locks is getting ordering wrong. One locks A then B, another place you lock B then A. It may take some time to stumble over it, but you've just introduced a deadlock by doing them in a different order. One system holds A, the other holds B, and neither will release as they're waiting for each other.

When you've got multiple locks in place you need a way to attempt to get all the locks all at once, and if you cannot get all of them, back off and wait until all are available, then repeat.

Lock ordering and lock hierarchies help reduce the difficulties in this type of tasks.

Several systems I've worked with used a class do that work. Every lock is named and numbered in code as an enumeration. You can request a set of locks used by the block as a single call, or acquire a single lock if you don't have any other locks out. Internally you can only add a higher-number lock, never add a lower-number lock, which helps preserve ordering; attempting to add a lower-priority lock triggers all kinds of alerts and logging. If acquiring multiple locks the system can attempt to gain them all, and on failure release back down the tree and wait for availability. Finally, if a system holds a lock for longer than a specified time, alerts and logging are triggered.

I'm not sure how I missed it before, but Frob's comment seems to have shaken the gunk from my brain.

Based on what you've shared, the most probable location of a deadlock is in Foo::doStuffA or functions it calls. Either:
1) You have an infinite loop.
2) You're locking mutexB.
3) You're stuck on another deadlock that's causing this deadlock.

@ankhd: That's why a recursive mutex is used. As long as you are in the same thread you can look a mutex any number of times. It is reference counted and the lock is released only when all of the unlock() functions are called.

@frob: I was burned in the past with the ordering of multiple mutex locks and that's the first thing I look for when I have a multi thread. I suppose this code is risky and should be done in a better way. I never thought before of a class to handle multiple locks to prevent ordering issues but it sounds like a good idea for the future.

I need to review a larger scope of the code since the cause of this deadlock isn't obvious at first sight. I got 3 programmer to check the code and all of them think a deadlock shouldn't be happening. Perhaps the problem is elsewhere and the deadlock is a symptom of a bigger problem. Anyhow I'll keep in mind all of your suggestions and keep you posted when we find the issue :).

Thanks.


all of them think a deadlock shouldn't be happening.

A single lock is not a problem.

Any time you acquire more than one lock a deadlock is not only possible, it is likely going to happen unless a try-else-rollback system is in place.

This topic is closed to new replies.

Advertisement