Sign in to follow this  

Poor STL threads performance

This topic is 1144 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

UPDATE : I've edited the source code since the original post as it made no sense at all.

 

Hi there,

 

So, i ported this ocean rendering algorithm to DirectX : http://www.keithlantz.net/2011/11/ocean-simulation-part-two-using-the-fast-fourier-transform/

It works great, but slow as the FFT is computed on the CPU.

 

I've noticed that this part of the code is very costly :

for (int m_prime = 0; m_prime < N; m_prime++) {
        fft->fft(h_tilde, h_tilde, 1, m_prime * N);
        fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
        fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
        fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
        fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
    }

so i tried to use c++11 threads to make this faster, and i ended up with worse performances (went from 16fps to 5fps in Debug)

 

I don't have my code in front of me, but it basically looked like this :

Ocean::Update(float tick)
{
    ...

    std::vector<std::thread> threads;
    for (int m_prime = 0; m_prime < N; m_prime++) 
    {
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde, h_tilde, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_slopex, h_tilde_slopex, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_slopez, h_tilde_slopez, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_dx, h_tilde_dx, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_dz, h_tilde_dz, 1, m_prime * N));

        for(int i = 0 ; i < threads.size();i++)
        {
            threads[i].join();
        }

        threads.clear();
    }   

    ...
}

Ocean::DoFFT(Complex* in, Complex* out, int stride, int offset)
{
    fft->fft(in, out, stride, offset);
}

With N = 64, so 64 threads. There are probably a couple of syntax errors in there as i am not fluent in C++, but you get the idea.

I also tried to create a maximum of 4 threads at the time, but it didn't help much (barelly reached 12fps)

 

Any idea what could be wrong here ?

Ultimately i'd like to move this code to a compute shader (OpenCL) but i first wanted to test this thing on CPU first

 

Thanks,

Yann

Edited by Yann ALET

Share this post


Link to post
Share on other sites

Creating and destroying threads are costly operations. You ideally want to have a small number of permanent threads (about the same number has you have hardware-threads) and keep them busy over the life of your app.

 

Hummm interesting, i'll try this when i get home then, thanks !

Share this post


Link to post
Share on other sites

 

(went from 16fps to 5fps in Debug)


I don't think a statement about speed is at all viable when talking about MSVC's Debug builds. Debug builds are for finding bugs, not testing the speed of anything.

 

Sure, but as my reference fps was also in debug, i figured i could compare apples to apples.

I'll keep that in mind though :-)

Share this post


Link to post
Share on other sites
The problem is that debug builds don't slow down everything. When you profile debug builds you might find hotspots in areas that are perfectly fine in release builds.

Also keep in mind that with Visual Studio, you must launch the program without the debugger attached. Otherwise you get the extremely slow debug heap and every new/delete is a hotspot. This also holds for release builds.

Share this post


Link to post
Share on other sites

Do you actually split the work between the threads or do you simply multiply the work with every thread?

 

Not sure what you mean, but i tried multiple approaches.

The one i wrote above, and this one too :

 

for(int m_prime = 0; i < N; i++)

{

     threads.push_back(std::thread(&Ocean::DoFFT, this, 1, m_prime * N));

     if(threads.size() == 4)

     {

           //.. join all 4 threads before moving to the next batch.

     }

}

 

But like Hogman mentioned, i am spawning 64 threads even with this approach, i'll just create some kind of threadpool of 4 permanent threads instead.

 

Side question though, assuming i make this thing work properly with expected fps boost, would i get better performance with OpenMP (so code still executed on CPU), or should i jump directly to a OpenCL implementation ?

 

Thanks for your help !

Share this post


Link to post
Share on other sites

 

Do you actually split the work between the threads or do you simply multiply the work with every thread?

 

Not sure what you mean, but i tried multiple approaches.

The one i wrote above, and this one too :

 

for(int m_prime = 0; i < N; i++)

{

     threads.push_back(std::thread(&Ocean::DoFFT, this, 1, m_prime * N));

     if(threads.size() == 4)

     {

           //.. join all 4 threads before moving to the next batch.

     }

}

 

But like Hogman mentioned, i am spawning 64 threads even with this approach, i'll just create some kind of threadpool of 4 permanent threads instead.

 

Side question though, assuming i make this thing work properly with expected fps boost, would i get better performance with OpenMP (so code still executed on CPU), or should i jump directly to a OpenCL implementation ?

 

Thanks for your help !

 

 

Take 4 threads (from a pool, don't create new ones every frame / update) and split the work between those 4 threads.

 

So thread 1 executes (in the Ocean::DoFFT):

 

for (int m_prime = 0; m_prime < N/4; m_prime++) {
fft->fft(h_tilde, h_tilde, 1, m_prime * N);
fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

 

Thread 2 executes:

 

for (int m_prime = N/4; m_prime < N/4+N/4; m_prime++) {
fft->fft(h_tilde, h_tilde, 1, m_prime * N);
fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

 

and so on.

Share this post


Link to post
Share on other sites

In case you are still interested, why not use std::async for this?

std::vector<std::future<void>> futures;
for (int m_prime = 0; m_prime < N; ++m_prime)
  {
  futures.push_back(std::async([... capture what you require in the lambda ...]
    {
    fft->fft(h_tilde, h_tilde, 1, m_prime * N);
    fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
    fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
    fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
    fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
    //do work (don't forget to capture what you need)
    }));
  }
for (auto& fut : futures)
  fut.wait(); //wait for all the work to be finished

This will leave thread creation, allocation and termination up to the runtime. In MSVC this the concurrency runtime, which dynamically allocates threads based on available resources in the system. Not sure how it works on other platforms though.

Depending what N is expected to be, you might also want to call std::async for each function call to fft, but if N is mostly > number of hardware threads, you can just as well use a block of fft calls per std::async call.

Share this post


Link to post
Share on other sites

Timing debug code is a waste of effort. Always time release builds.

You don't have 64 cores so making 64 threads will not help.

 

You have to create the threads in advance and they need to be waiting there ready to go. As mentioned above create and destroying threads is, relatively, expensive.

You should only make 2 to 8 threads depending on the capabilities of the hardware it's on.

 

 

FFT's are one of the most computationally intensive things you can do. They produce mathematically "perfect" frequency results.

This is probably not needed for a game. Look for simpler approximations. I would start with the Hartley transform.

Also, somewhat ironically, I have an old FFT article here on gamedev with an optimized routine for it.

Share this post


Link to post
Share on other sites

This topic is 1144 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this