Poor STL threads performance

Started by
10 comments, last by Shannon Barber 9 years, 5 months ago

UPDATE : I've edited the source code since the original post as it made no sense at all.

Hi there,

So, I ported this ocean rendering algorithm to DirectX : http://www.keithlantz.net/2011/11/ocean-simulation-part-two-using-the-fast-fourier-transform/

It works great, but it's slow because the FFT is computed on the CPU.

I've noticed that this part of the code is very costly :


for (int m_prime = 0; m_prime < N; m_prime++) {
    fft->fft(h_tilde, h_tilde, 1, m_prime * N);
    fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
    fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
    fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
    fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

so I tried to use C++11 threads to make this faster, and I ended up with worse performance (went from 16 fps to 5 fps in Debug).

I don't have my code in front of me, but it basically looked like this:


void Ocean::Update(float tick)
{
    ...

    std::vector<std::thread> threads;
    for (int m_prime = 0; m_prime < N; m_prime++)
    {
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde, h_tilde, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_slopex, h_tilde_slopex, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_slopez, h_tilde_slopez, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_dx, h_tilde_dx, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_dz, h_tilde_dz, 1, m_prime * N));

        for (size_t i = 0; i < threads.size(); i++)
        {
            threads[i].join();
        }

        threads.clear();
    }

    ...
}

void Ocean::DoFFT(Complex* in, Complex* out, int stride, int offset)
{
    fft->fft(in, out, stride, offset);
}

With N = 64, so 64 threads. There are probably a couple of syntax errors in there as I am not fluent in C++, but you get the idea.

I also tried to create a maximum of 4 threads at a time, but it didn't help much (barely reached 12 fps).

Any idea what could be wrong here?

Ultimately I'd like to move this code to a compute shader (OpenCL), but I wanted to test this thing on the CPU first.

Thanks,

Yann


Creating and destroying threads are costly operations. You ideally want to have a small number of permanent threads (about the same number as you have hardware threads) and keep them busy over the life of your app.

(went from 16fps to 5fps in Debug)


I don't think a statement about speed is at all viable when talking about MSVC's Debug builds. Debug builds are for finding bugs, not testing the speed of anything.

Creating and destroying threads are costly operations. You ideally want to have a small number of permanent threads (about the same number as you have hardware threads) and keep them busy over the life of your app.

Hmmm, interesting. I'll try this when I get home then, thanks!

(went from 16fps to 5fps in Debug)


I don't think a statement about speed is at all viable when talking about MSVC's Debug builds. Debug builds are for finding bugs, not testing the speed of anything.

Sure, but as my reference fps was also measured in Debug, I figured I could compare apples to apples.

I'll keep that in mind though :-)

Do you actually split the work between the threads or do you simply multiply the work with every thread?

The problem is that debug builds don't slow down everything. When you profile debug builds you might find hotspots in areas that are perfectly fine in release builds.

Also keep in mind that with Visual Studio, you must launch the program without the debugger attached. Otherwise you get the extremely slow debug heap and every new/delete is a hotspot. This also holds for release builds.
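As a side note not from this thread: on older Visual C++ versions you can also disable the Windows debug heap explicitly via the documented `_NO_DEBUG_HEAP` environment variable, so allocations run at normal speed even if a debugger attaches later. A hypothetical launch from a command prompt (`MyGame.exe` is a placeholder name):

```shell
:: Disable the Windows debug heap for this process tree, then launch.
:: "MyGame.exe" is a placeholder for your own executable.
set _NO_DEBUG_HEAP=1
MyGame.exe
```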

Do you actually split the work between the threads or do you simply multiply the work with every thread?

Not sure what you mean, but I tried multiple approaches.

The one I wrote above, and this one too:

for (int m_prime = 0; m_prime < N; m_prime++)
{
    threads.push_back(std::thread(&Ocean::DoFFT, this, 1, m_prime * N));
    if (threads.size() == 4)
    {
        //.. join all 4 threads before moving to the next batch.
    }
}

But like Hogman mentioned, I am spawning 64 threads even with this approach; I'll just create some kind of thread pool of 4 permanent threads instead.

Side question though: assuming I make this thing work properly with the expected fps boost, would I get better performance with OpenMP (so the code is still executed on the CPU), or should I jump directly to an OpenCL implementation?

Thanks for your help !

for (int m_prime = 0; m_prime < N; m_prime++)
{
    threads.push_back(std::thread(&Ocean::DoFFT, this, 1, m_prime * N));
    if (threads.size() == 4)
    {
        //.. join all 4 threads before moving to the next batch.
    }
}


No, still wrong. _Do not spawn threads_ for each piece of work; that's costly. Spawn a bunch of threads (one per hardware thread, minus one, is a good first-cut number) early in your app's life, and then hand tasks over to those pre-existing threads to execute. Doing this efficiently requires some kind of task queue system, which C++ does not provide.

Use an existing library like Microsoft's PPL (Parallel Patterns Library), Intel's TBB (Threading Building Blocks), or OpenMP (Open Multi-Processing; Visual C++ supports OpenMP 2.0), which handle all of this far more efficiently than anything you're likely to build on your own. C++ still only provides the low-level primitives (threads, atomics) used to build a high-performance multithreaded task system, but none of the extremely complex high-level modules (that might change by C++17, hopefully). PPL/TBB/OpenMP handle all that for you.
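For a sense of how little code the OpenMP route takes, here is a sketch of the row loop pattern. `do_row` is a hypothetical stand-in for one row's work (in the real code it would make the five `fft->fft(...)` calls for one `m_prime`), and it assumes each row can be transformed independently, as in the original serial loop:

```cpp
#include <vector>

// Hypothetical stand-in for one row's transform; here it just doubles
// the row so the effect is easy to verify.
static void do_row(std::vector<double>& data, int offset, int n) {
    for (int i = 0; i < n; ++i)
        data[offset + i] *= 2.0;  // placeholder for fft->fft(...) on this row
}

// One '#pragma omp parallel for' replaces all the manual thread code:
// OpenMP keeps a persistent worker pool and splits the N iterations
// across it. Compile with /openmp (MSVC) or -fopenmp (GCC/Clang);
// without that flag the pragma is ignored and the loop runs serially.
static void apply_rows(std::vector<double>& grid, int N) {
    #pragma omp parallel for
    for (int m_prime = 0; m_prime < N; m_prime++)
        do_row(grid, m_prime * N, N);
}
```

Note that this only pays off if the per-row work is genuinely independent; if the FFT object keeps internal scratch buffers, each thread needs its own instance.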

Sean Middleditch – Game Systems Engineer – Join my team!

Do you actually split the work between the threads or do you simply multiply the work with every thread?

Not sure what you mean, but I tried multiple approaches.

The one I wrote above, and this one too:

for (int m_prime = 0; m_prime < N; m_prime++)
{
    threads.push_back(std::thread(&Ocean::DoFFT, this, 1, m_prime * N));
    if (threads.size() == 4)
    {
        //.. join all 4 threads before moving to the next batch.
    }
}

But like Hogman mentioned, I am spawning 64 threads even with this approach; I'll just create some kind of thread pool of 4 permanent threads instead.

Side question though: assuming I make this thing work properly with the expected fps boost, would I get better performance with OpenMP (so the code is still executed on the CPU), or should I jump directly to an OpenCL implementation?

Thanks for your help !

Take 4 threads (from a pool; don't create new ones every frame/update) and split the work between those 4 threads.

So thread 1 executes (in the Ocean::DoFFT):

for (int m_prime = 0; m_prime < N/4; m_prime++) {
    fft->fft(h_tilde, h_tilde, 1, m_prime * N);
    fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
    fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
    fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
    fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

Thread 2 executes:

for (int m_prime = N/4; m_prime < N/4+N/4; m_prime++) {
    fft->fft(h_tilde, h_tilde, 1, m_prime * N);
    fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
    fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
    fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
    fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

and so on.
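The range splitting above can be written once generically. This is an illustrative sketch, not code from the thread: `parallel_rows` and `row_fn` are made-up names, `row_fn` stands in for the five `fft->fft(...)` calls for one `m_prime`, and it assumes the per-row work is thread-safe:

```cpp
#include <thread>
#include <vector>

// Split N rows into num_threads contiguous ranges and run each range on
// its own thread. For a persistent pool you would hand each range to an
// existing worker instead of constructing std::thread here.
template <typename RowFn>
void parallel_rows(int N, int num_threads, RowFn row_fn) {
    std::vector<std::thread> threads;
    const int chunk = N / num_threads;  // e.g. 64 / 4 = 16 rows per thread
    for (int t = 0; t < num_threads; ++t) {
        int begin = t * chunk;
        // Last thread takes any leftover rows when N doesn't divide evenly.
        int end = (t == num_threads - 1) ? N : begin + chunk;
        threads.emplace_back([=] {
            for (int m_prime = begin; m_prime < end; ++m_prime)
                row_fn(m_prime);  // the five fft->fft(...) calls go here
        });
    }
    for (auto& th : threads)
        th.join();  // wait for every range before using the results
}
```

Usage would be `parallel_rows(N, 4, [&](int m_prime) { /* five fft calls */ });` — 4 threads total per update instead of 5 per row.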

This topic is closed to new replies.
