
# C++ Weird Behaviors With Multi-Threading


## Recommended Posts

I'm processing video frame data via HLSL pixel shaders with DirectX 9. I want to run all operations in a row via a command buffer. Once the basic version is working (running commands one by one), executing a command will return the result of the previous frame, since DirectX 9 has a 1-frame delay: the GPU works on the previous frame while we fill in the current frame's data.

If I run 4 shaders with a single thread, I cannot get the frame for the next shader in the chain until I get the result of the first one, so the only way to get correct output is to ask DX9 to render a dummy scene, which cuts performance in half. When running on 8 threads, however, I should be able to have a command buffer containing up to 8 commands, and these can be chained one after the other. Right now I'm still processing them one by one (not chaining output yet), and it's not working the way I want.

There are all kinds of weird multi-threading issues.

First, performance is VERY slow. If ProcessFrames is initiated for each class instance, performance is MUCH higher, but then I end up with 32 DirectX devices, which take up a lot of memory. In theory, the final version should deliver twice the performance of running with 32 devices.

Second, StartWorkerThread is sometimes called twice. There seems to be a race condition I haven't been able to solve.

Third, sometimes the code works fine, sometimes it deadlocks, sometimes it freezes for a while, at random.

Fourth, I added a bunch of mutex locks and it works this way, but if I remove some of those locks, then I start getting all kinds of weird behaviors.

Fifth, even though I'm clearly adding the command to the buffer before calling SetEvent(WorkerWaiting), within the thread, WaitForSingleObject(WorkerWaiting, 10000) gets released while the buffer is still empty!? And the thread exits repeatedly.

How can I "fix" this code to make it work in an acceptable way?

```cpp
#include <list>
#include <mutex>
#include <concurrent_queue.h>
using namespace concurrency;

static concurrent_queue<CommandStruct> cmdBuffer;
static std::mutex initLock, startLock, waiterLock;
static HANDLE WorkerWaiting;
static void AddCommandToQueue(CommandStruct* cmd, IScriptEnvironment* env);

std::once_flag OnceFlag;

// [fragment: tail of AddCommandToQueue]
        cmdBuffer.push(*cmd);

        // If thread is idle or not running, make it run.
        if (WorkerWaiting != NULL)
            SetEvent(WorkerWaiting);
        startLock.lock();
        startLock.unlock();
    }
}

// [fragment: worker thread function]
    return;

    ProcessFrames Worker(env);

    // Start waiting event with a state meaning it's not waiting.
    WorkerWaiting = CreateEvent(NULL, TRUE, TRUE, NULL);

    // Process all commands in the queue.
    CommandStruct CurrentCmd, PreviousCmd;
    if (!cmdBuffer.try_pop(CurrentCmd))
        CurrentCmd.Path = NULL;
    while (CurrentCmd.Path != NULL) {
        // The result of the 1st execution will be returned on the 2nd call.
        if (FAILED(Worker.Execute(&CurrentCmd, NULL)))
            initLock.lock();
        SetEvent(CurrentCmd.Event); // Notify that processing is completed.
        initLock.unlock();

        PreviousCmd = CurrentCmd;
        if (!cmdBuffer.try_pop(CurrentCmd))
            CurrentCmd.Path = NULL;

        while (CurrentCmd.Path != NULL) {
            if (FAILED(Worker.Execute(&CurrentCmd, NULL)))
                initLock.lock();
            SetEvent(CurrentCmd.Event); // Notify that processing is completed.
            initLock.unlock();
            PreviousCmd = CurrentCmd;
            if (!cmdBuffer.try_pop(CurrentCmd))
                CurrentCmd.Path = NULL;
        }

        // Flush the device to get last frame.
        //Worker.Flush(&PreviousCmd);
        //SetEvent(PreviousCmd.WorkerEvent); // Notify that processing is completed.

        // When queue is empty, wait for event to be set by AddCommandToQueue.
        waiterLock.lock();
        WaitForSingleObject(WorkerWaiting, 10000);
        ResetEvent(WorkerWaiting);

        // If there are still no commands after timeout, stop thread.
        if (!cmdBuffer.try_pop(CurrentCmd))
            CurrentCmd.Path = NULL;
        waiterLock.unlock();
    }

    startLock.lock();
    CloseHandle(WorkerWaiting);
    WorkerWaiting = NULL;
    startLock.unlock();
}
```


##### Share on other sites

I've been reviewing the basics of multithreading theory in C++, after realizing it is a serious issue even for expert programmers. C# doesn't have these kinds of issues, but C++ compilers can really screw things up because they have no awareness of threads.

I was doing a bunch of things wrong already. Static variables should NOT be accessed by different threads unless they are wrapped within an atomic class. I fixed that, making sure nothing is shared unless it is atomic.

If I don't let the thread exit anymore, then it's no longer creating a bunch of threads. The lock around adding commands is working.

What baffles me, however, is that the line "CurrentCmd.Path = NULL;" should never be reached, because I always add an item before calling SetEvent. Yet that line still gets hit! If I didn't hack around it with a never-ending loop, the thread would exit. Why is this happening?

```cpp
WaitForSingleObject(WorkerWaiting, 10000);

while (CurrentCmd.Path == NULL) {
    Sleep(200);
    // If there are still no commands after timeout, stop thread.
    if (!cmdBuffer.try_pop(CurrentCmd))
        CurrentCmd.Path = NULL;
}
```

```cpp
static void AddCommandToQueue(CommandStruct cmd, IScriptEnvironment* env) {
    cmdBuffer.push(cmd);

    // If thread is idle or not running, make it run.
    //if (WorkerWaiting != NULL)
    SetEvent(WorkerWaiting);
}
```


The updated source code is here

Which gets called from this

After I get this basic version to work, I'm wondering about a more serious problem. I noticed that when I pass the commands into the buffer as a STRUCT instead of a pointer, it is more stable. Now it makes sense why: when passing pointers, concurrent_queue makes the pointer itself thread-safe, but not the data it points to; passing the data by value avoids that problem. Except the struct contains several pointers: char* pointers for strings, and up to 10 pointers to video frame buffers. Am I going to run into problems with those? And if so, what's the right way of ensuring thread safety for all this data?

Some people go as far as saying it's better to run separate threads in their own processes to avoid these kinds of issues, but then the only way to share data would be through a shared memory buffer backed by the paging file. Writing all the video frame data to disk back and forth would be a huge performance hit, so that would not be a good idea at all!

##### Share on other sites

Your taking out locks when you raise events is an area of concern: you are at the mercy of whatever the event listener is doing, and if it calls back into something that wants the same lock on another thread, you have the makings of a deadlock.

I would say it is better to protect only the mutable data that is shared across threads, so that access is controlled whenever it is read or updated. These are the flash points that cause race conditions. Too much locking removes the benefit of threads, as you find they are all waiting on locks.

C# suffers from locking/threading issues as much as any other language; it is the nature of threads and why they can be hard to work with.

Personally, I have moved to immutable data to deal with threading, as you never have locking issues with data that can't change. If you want to manipulate it, you build a new instance of the data. Then you have a single point that integrates the data back in, and a far easier time with locking. That said, you suffer more object churn due to the number of new objects you create.

##### Share on other sites

Thanks guys! I have reviewed the code with your suggestions.

I have removed the atomic wrappers. As for semaphores, I haven't found a way to increase the count, so I have set that aside for now. The problem I was having was that the newly inserted item could already be read before the event got handled, so I just keep looping and asking for data until I get a clear timeout, and that works.

The work gets done in a single thread, and with a simple lock, new commands are only added one at a time, which simplifies things a lot. It appears to be working stably now.

Still a few issues.

1. Performance is still ridiculously slow. With 8 threads, I'm getting less than 5fps, while the original implementation with 32 devices is running at 39fps. CPU usage is 16% and GPU usage is very low. What is causing such a bottleneck?

2. While the CommandStruct being passed to concurrent_queue is thread-safe, I believe the pointer targets are not. How do I ensure the strings and frame buffers are thread-safe? It doesn't seem to be causing any serious visible issues so far.

EDIT: I've now edited the code to chain the commands so that the 2nd execution returns the result of the 1st one. Performance is about twice as good, as expected, but now it's misbehaving again: there is tremendous flickering in the image. Furthermore, I cannot benchmark performance in a Release build because it crashes. What could be causing this?

Perhaps the unprotected video buffers could create such an issue, but then why am I not seeing it when running commands one by one? Is there a command to mark an area of memory as ready for another thread?

Edited by MysteryX

##### Share on other sites

Performance is still ridiculously slow. With 8 threads, I'm getting less than 5fps, while the original implementation with 32 devices is running at 39fps. CPU usage is 16% and GPU usage is very low. What is causing such a bottleneck?

How are we supposed to know? Run a profiler and find out.

It could be one of your algorithms, lock contention, the D3D9 driver, etc.

##### Share on other sites

The Visual Studio 2015 profiler isn't giving me anything useful.

So we can rule out my algorithms. As for the D3D9 driver, the exact same code runs much faster when initializing 32 devices, which really should only create tons of overhead. A single device should, in theory, do the job faster and keep the GPU command buffer full, if I chain my commands properly.

As for lock contention: right now there is only a single lock. I take it when adding a new command to the buffer, which means that if I'm running 6 shaders on 8 threads, that's 6 lock acquisitions per frame, with 8 frames processed at once.

I should get 39fps instead of 5fps while still running the commands one by one, and if the 2nd command's execution returns the result of the 1st command, I should get 78fps. Not 5fps.

Is there any possibility that the lock and threading in my code could interfere in some way with the AviSynth engine that is calling my code?

Edited by MysteryX