Jump to content
  • Advertisement
Sign in to follow this  
Yann ALET

Poor STL threads performance

This topic is 1419 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

UPDATE : I've edited the source code since the original post as it made no sense at all.

 

Hi there,

 

So, i ported this ocean rendering algorithm to DirectX : http://www.keithlantz.net/2011/11/ocean-simulation-part-two-using-the-fast-fourier-transform/

It works great, but slow as the FFT is computed on the CPU.

 

I've noticed that this part of the code is very costly :

for (int m_prime = 0; m_prime < N; m_prime++) {
        fft->fft(h_tilde, h_tilde, 1, m_prime * N);
        fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
        fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
        fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
        fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
    }

so i tried to use c++11 threads to make this faster, and i ended up with worse performances (went from 16fps to 5fps in Debug)

 

I don't have my code in front of me, but it basically looked like this :

Ocean::Update(float tick)
{
    ...

    std::vector<std::thread> threads;
    for (int m_prime = 0; m_prime < N; m_prime++) 
    {
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde, h_tilde, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_slopex, h_tilde_slopex, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_slopez, h_tilde_slopez, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_dx, h_tilde_dx, 1, m_prime * N));
        threads.push_back(std::thread(&Ocean::DoFFT, this, h_tilde_dz, h_tilde_dz, 1, m_prime * N));

        for(int i = 0 ; i < threads.size();i++)
        {
            threads[i].join();
        }

        threads.clear();
    }   

    ...
}

Ocean::DoFFT(Complex* in, Complex* out, int stride, int offset)
{
    fft->fft(in, out, stride, offset);
}

With N = 64, so 64 threads. There are probably a couple of syntax errors in there as i am not fluent in C++, but you get the idea.

I also tried to create a maximum of 4 threads at the time, but it didn't help much (barelly reached 12fps)

 

Any idea what could be wrong here ?

Ultimately i'd like to move this code to a compute shader (OpenCL) but i first wanted to test this thing on CPU first

 

Thanks,

Yann

Edited by Yann ALET

Share this post


Link to post
Share on other sites
Advertisement

Creating and destroying threads are costly operations. You ideally want to have a small number of permanent threads (about the same number has you have hardware-threads) and keep them busy over the life of your app.

 

Hummm interesting, i'll try this when i get home then, thanks !

Share this post


Link to post
Share on other sites

 

(went from 16fps to 5fps in Debug)


I don't think a statement about speed is at all viable when talking about MSVC's Debug builds. Debug builds are for finding bugs, not testing the speed of anything.

 

Sure, but as my reference fps was also in debug, i figured i could compare apples to apples.

I'll keep that in mind though :-)

Share this post


Link to post
Share on other sites

Do you actually split the work between the threads or do you simply multiply the work with every thread?

Share this post


Link to post
Share on other sites
The problem is that debug builds don't slow down everything. When you profile debug builds you might find hotspots in areas that are perfectly fine in release builds.

Also keep in mind that with Visual Studio, you must launch the program without the debugger attached. Otherwise you get the extremely slow debug heap and every new/delete is a hotspot. This also holds for release builds.

Share this post


Link to post
Share on other sites

Do you actually split the work between the threads or do you simply multiply the work with every thread?

 

Not sure what you mean, but i tried multiple approaches.

The one i wrote above, and this one too :

 

for(int m_prime = 0; i < N; i++)

{

     threads.push_back(std::thread(&Ocean::DoFFT, this, 1, m_prime * N));

     if(threads.size() == 4)

     {

           //.. join all 4 threads before moving to the next batch.

     }

}

 

But like Hogman mentioned, i am spawning 64 threads even with this approach, i'll just create some kind of threadpool of 4 permanent threads instead.

 

Side question though, assuming i make this thing work properly with expected fps boost, would i get better performance with OpenMP (so code still executed on CPU), or should i jump directly to a OpenCL implementation ?

 

Thanks for your help !

Share this post


Link to post
Share on other sites

 

Do you actually split the work between the threads or do you simply multiply the work with every thread?

 

Not sure what you mean, but i tried multiple approaches.

The one i wrote above, and this one too :

 

for(int m_prime = 0; i < N; i++)

{

     threads.push_back(std::thread(&Ocean::DoFFT, this, 1, m_prime * N));

     if(threads.size() == 4)

     {

           //.. join all 4 threads before moving to the next batch.

     }

}

 

But like Hogman mentioned, i am spawning 64 threads even with this approach, i'll just create some kind of threadpool of 4 permanent threads instead.

 

Side question though, assuming i make this thing work properly with expected fps boost, would i get better performance with OpenMP (so code still executed on CPU), or should i jump directly to a OpenCL implementation ?

 

Thanks for your help !

 

 

Take 4 threads (from a pool, don't create new ones every frame / update) and split the work between those 4 threads.

 

So thread 1 executes (in the Ocean::DoFFT):

 

for (int m_prime = 0; m_prime < N/4; m_prime++) {
fft->fft(h_tilde, h_tilde, 1, m_prime * N);
fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

 

Thread 2 executes:

 

for (int m_prime = N/4; m_prime < N/4+N/4; m_prime++) {
fft->fft(h_tilde, h_tilde, 1, m_prime * N);
fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

 

and so on.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!