This topic is 1329 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

UPDATE : I've edited the source code since the original post as it made no sense at all.

Hi there,

So, i ported this ocean rendering algorithm to DirectX : http://www.keithlantz.net/2011/11/ocean-simulation-part-two-using-the-fast-fourier-transform/

It works great, but slow as the FFT is computed on the CPU.

I've noticed that this part of the code is very costly :

for (int m_prime = 0; m_prime < N; m_prime++) {
fft->fft(h_tilde, h_tilde, 1, m_prime * N);
fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}


so i tried to use c++11 threads to make this faster, and i ended up with worse performances (went from 16fps to 5fps in Debug)

I don't have my code in front of me, but it basically looked like this :

Ocean::Update(float tick)
{
...

for (int m_prime = 0; m_prime < N; m_prime++)
{

for(int i = 0 ; i < threads.size();i++)
{
}

}

...
}

Ocean::DoFFT(Complex* in, Complex* out, int stride, int offset)
{
fft->fft(in, out, stride, offset);
}


With N = 64, so 64 threads. There are probably a couple of syntax errors in there as i am not fluent in C++, but you get the idea.

I also tried to create a maximum of 4 threads at the time, but it didn't help much (barelly reached 12fps)

Any idea what could be wrong here ?

Ultimately i'd like to move this code to a compute shader (OpenCL) but i first wanted to test this thing on CPU first

Thanks,

Yann

Edited by Yann ALET

##### Share on other sites

Creating and destroying threads are costly operations. You ideally want to have a small number of permanent threads (about the same number has you have hardware-threads) and keep them busy over the life of your app.

Hummm interesting, i'll try this when i get home then, thanks !

##### Share on other sites

(went from 16fps to 5fps in Debug)

I don't think a statement about speed is at all viable when talking about MSVC's Debug builds. Debug builds are for finding bugs, not testing the speed of anything.

Sure, but as my reference fps was also in debug, i figured i could compare apples to apples.

I'll keep that in mind though :-)

##### Share on other sites

Do you actually split the work between the threads or do you simply multiply the work with every thread?

##### Share on other sites
The problem is that debug builds don't slow down everything. When you profile debug builds you might find hotspots in areas that are perfectly fine in release builds.

Also keep in mind that with Visual Studio, you must launch the program without the debugger attached. Otherwise you get the extremely slow debug heap and every new/delete is a hotspot. This also holds for release builds.

##### Share on other sites

Do you actually split the work between the threads or do you simply multiply the work with every thread?

Not sure what you mean, but i tried multiple approaches.

The one i wrote above, and this one too :

for(int m_prime = 0; i < N; i++)

{

{

//.. join all 4 threads before moving to the next batch.

}

}

But like Hogman mentioned, i am spawning 64 threads even with this approach, i'll just create some kind of threadpool of 4 permanent threads instead.

Side question though, assuming i make this thing work properly with expected fps boost, would i get better performance with OpenMP (so code still executed on CPU), or should i jump directly to a OpenCL implementation ?

##### Share on other sites

Do you actually split the work between the threads or do you simply multiply the work with every thread?

Not sure what you mean, but i tried multiple approaches.

The one i wrote above, and this one too :

for(int m_prime = 0; i < N; i++)

{

{

//.. join all 4 threads before moving to the next batch.

}

}

But like Hogman mentioned, i am spawning 64 threads even with this approach, i'll just create some kind of threadpool of 4 permanent threads instead.

Side question though, assuming i make this thing work properly with expected fps boost, would i get better performance with OpenMP (so code still executed on CPU), or should i jump directly to a OpenCL implementation ?

Take 4 threads (from a pool, don't create new ones every frame / update) and split the work between those 4 threads.

So thread 1 executes (in the Ocean::DoFFT):

for (int m_prime = 0; m_prime < N/4; m_prime++) {
fft->fft(h_tilde, h_tilde, 1, m_prime * N);
fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

for (int m_prime = N/4; m_prime < N/4+N/4; m_prime++) {
fft->fft(h_tilde, h_tilde, 1, m_prime * N);
fft->fft(h_tilde_slopex, h_tilde_slopex, 1, m_prime * N);
fft->fft(h_tilde_slopez, h_tilde_slopez, 1, m_prime * N);
fft->fft(h_tilde_dx, h_tilde_dx, 1, m_prime * N);
fft->fft(h_tilde_dz, h_tilde_dz, 1, m_prime * N);
}

and so on.

• 18
• 11
• 16
• 9
• 51
• ### Forum Statistics

• Total Topics
631396
• Total Posts
2999784
×