Poor STL threads performance

Started by
10 comments, last by Shannon Barber 9 years, 5 months ago

UPDATE : I've edited the source code since the original post as it made no sense at all.

Hi there,

So, I ported this ocean rendering algorithm to DirectX : http://www.keithlantz.net/2011/11/ocean-simulation-part-two-using-the-fast-fourier-transform/

It works great, but it's slow because the FFT is computed on the CPU.

I've noticed that this part of the code is very costly :


for (int m_prime = 0; m_prime < N; m_prime++) {
    fft->fft(h_tilde, h_tilde, 1, m_prime * N);
    fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
    fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
    fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
    fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

so I tried to use C++11 threads to make this faster, and I ended up with worse performance (went from 16 fps to 5 fps in Debug).

I don't have my code in front of me, but it basically looked like this:


void Ocean::Update(float tick)
{
    ...

    std::vector<std::thread> threads;
    for (int m_prime = 0; m_prime < N; m_prime++)
    {
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde, h_tilde, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_slopex, h_tilde_slopex, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_slopez, h_tilde_slopez, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_dx, h_tilde_dx, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_dz, h_tilde_dz, 1, m_prime * N));

        for (size_t i = 0; i < threads.size(); i++)
        {
            threads[i].join();
        }

        threads.clear();
    }

    ...
}

void Ocean::DoFFT(Complex* in, Complex* out, int stride, int offset)
{
    fft->fft(in, out, stride, offset);
}

With N = 64, so 64 threads. There are probably a couple of syntax errors in there as I am not fluent in C++, but you get the idea.

I also tried to create a maximum of 4 threads at a time, but it didn't help much (barely reached 12 fps).

Any idea what could be wrong here?

Ultimately I'd like to move this code to a compute shader (OpenCL), but I wanted to test this thing on the CPU first.

Thanks,

Yann


Creating and destroying threads are costly operations. You ideally want to have a small number of permanent threads (about the same number as you have hardware threads) and keep them busy over the life of your app.

(went from 16fps to 5fps in Debug)


I don't think a statement about speed is at all viable when talking about MSVC's Debug builds. Debug builds are for finding bugs, not testing the speed of anything.

Creating and destroying threads are costly operations. You ideally want to have a small number of permanent threads (about the same number as you have hardware threads) and keep them busy over the life of your app.

Hmmm, interesting. I'll try this when I get home then, thanks!

(went from 16fps to 5fps in Debug)


I don't think a statement about speed is at all viable when talking about MSVC's Debug builds. Debug builds are for finding bugs, not testing the speed of anything.

Sure, but as my reference fps was also measured in Debug, I figured I could compare apples to apples.

I'll keep that in mind though :-)

Do you actually split the work between the threads or do you simply multiply the work with every thread?

The problem is that debug builds don't slow down everything. When you profile debug builds you might find hotspots in areas that are perfectly fine in release builds.

Also keep in mind that with Visual Studio, you must launch the program without the debugger attached. Otherwise you get the extremely slow debug heap and every new/delete is a hotspot. This also holds for release builds.
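As a side note not from this thread: on older Visual C++ versions you can also disable the Windows debug heap explicitly via the documented `_NO_DEBUG_HEAP` environment variable, so allocations run at normal speed even if a debugger attaches later. A hypothetical launch from a command prompt (`MyGame.exe` is a placeholder name):

```shell
:: Disable the Windows debug heap for this process tree, then launch.
:: "MyGame.exe" is a placeholder for your own executable.
set _NO_DEBUG_HEAP=1
MyGame.exe
```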

Do you actually split the work between the threads or do you simply multiply the work with every thread?

Not sure what you mean, but I tried multiple approaches.

The one I wrote above, and this one too:

for (int m_prime = 0; m_prime < N; m_prime++)
{
    threads.push_back(std::thread(&Ocean::DoFFT, this, 1, m_prime * N));
    if (threads.size() == 4)
    {
        //.. join all 4 threads before moving to the next batch.
    }
}

But like Hogman mentioned, I am spawning 64 threads even with this approach; I'll just create some kind of thread pool of 4 permanent threads instead.

Side question though: assuming I make this thing work properly with the expected fps boost, would I get better performance with OpenMP (so the code is still executed on the CPU), or should I jump directly to an OpenCL implementation?

Thanks for your help !

for (int m_prime = 0; m_prime < N; m_prime++)
{
    threads.push_back(std::thread(&Ocean::DoFFT, this, 1, m_prime * N));
    if (threads.size() == 4)
    {
        //.. join all 4 threads before moving to the next batch.
    }
}


No, still wrong. _Do not spawn threads_ for each piece of work; that's costly. Spawn a bunch of threads (one per hardware thread, minus one, is a good first-cut number) early in your app's life, and then hand tasks over to those pre-existing threads to execute. Doing this efficiently requires some kind of task queue system, which C++ does not provide.

Use an existing library like Microsoft's PPL (Parallel Patterns Library), Intel's TBB (Threading Building Blocks), or OpenMP (Open Multi-Processing; Visual C++ supports OpenMP 2.0), which handle all of this far more efficiently than anything you're likely to build on your own. C++ still only provides the low-level primitives (threads, atomics) used to build a high-performance multithreaded task system, but none of the extremely complex high-level modules (that might change by C++17, hopefully). PPL/TBB/OpenMP handle all that for you.
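For a sense of how little code the OpenMP route takes, here is a sketch of the row loop pattern. `do_row` is a hypothetical stand-in for one row's work (in the real code it would make the five `fft->fft(...)` calls for one `m_prime`), and it assumes each row can be transformed independently, as in the original serial loop:

```cpp
#include <vector>

// Hypothetical stand-in for one row's transform; here it just doubles
// the row so the effect is easy to verify.
static void do_row(std::vector<double>& data, int offset, int n) {
    for (int i = 0; i < n; ++i)
        data[offset + i] *= 2.0;  // placeholder for fft->fft(...) on this row
}

// One '#pragma omp parallel for' replaces all the manual thread code:
// OpenMP keeps a persistent worker pool and splits the N iterations
// across it. Compile with /openmp (MSVC) or -fopenmp (GCC/Clang);
// without that flag the pragma is ignored and the loop runs serially.
static void apply_rows(std::vector<double>& grid, int N) {
    #pragma omp parallel for
    for (int m_prime = 0; m_prime < N; m_prime++)
        do_row(grid, m_prime * N, N);
}
```

Note that this only pays off if the per-row work is genuinely independent; if the FFT object keeps internal scratch buffers, each thread needs its own instance.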

Sean Middleditch – Game Systems Engineer – Join my team!

Do you actually split the work between the threads or do you simply multiply the work with every thread?

Not sure what you mean, but I tried multiple approaches.

The one I wrote above, and this one too:

for (int m_prime = 0; m_prime < N; m_prime++)
{
    threads.push_back(std::thread(&Ocean::DoFFT, this, 1, m_prime * N));
    if (threads.size() == 4)
    {
        //.. join all 4 threads before moving to the next batch.
    }
}

But like Hogman mentioned, I am spawning 64 threads even with this approach; I'll just create some kind of thread pool of 4 permanent threads instead.

Side question though: assuming I make this thing work properly with the expected fps boost, would I get better performance with OpenMP (so the code is still executed on the CPU), or should I jump directly to an OpenCL implementation?

Thanks for your help !

Take 4 threads (from a pool; don't create new ones every frame/update) and split the work between those 4 threads.

So thread 1 executes (in the Ocean::DoFFT):

for (int m_prime = 0; m_prime < N/4; m_prime++) {
    fft->fft(h_tilde, h_tilde, 1, m_prime * N);
    fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
    fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
    fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
    fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

Thread 2 executes:

for (int m_prime = N/4; m_prime < N/4+N/4; m_prime++) {
    fft->fft(h_tilde, h_tilde, 1, m_prime * N);
    fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
    fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
    fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
    fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

and so on.
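The range splitting above can be written once generically. This is an illustrative sketch, not code from the thread: `parallel_rows` and `row_fn` are made-up names, `row_fn` stands in for the five `fft->fft(...)` calls for one `m_prime`, and it assumes the per-row work is thread-safe:

```cpp
#include <thread>
#include <vector>

// Split N rows into num_threads contiguous ranges and run each range on
// its own thread. For a persistent pool you would hand each range to an
// existing worker instead of constructing std::thread here.
template <typename RowFn>
void parallel_rows(int N, int num_threads, RowFn row_fn) {
    std::vector<std::thread> threads;
    const int chunk = N / num_threads;  // e.g. 64 / 4 = 16 rows per thread
    for (int t = 0; t < num_threads; ++t) {
        int begin = t * chunk;
        // Last thread takes any leftover rows when N doesn't divide evenly.
        int end = (t == num_threads - 1) ? N : begin + chunk;
        threads.emplace_back([=] {
            for (int m_prime = begin; m_prime < end; ++m_prime)
                row_fn(m_prime);  // the five fft->fft(...) calls go here
        });
    }
    for (auto& th : threads)
        th.join();  // wait for every range before using the results
}
```

Usage would be `parallel_rows(N, 4, [&](int m_prime) { /* five fft calls */ });` — 4 threads total per update instead of 5 per row.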

This topic is closed to new replies.
