Martin Perry

[DX11] Compute Shader - global memory


Hi,

my compute shader is slow because of writes to an RWStructuredBuffer. So far I am running only 1 thread group with 1 thread (pretty useless, but I decompress some data, and it is still faster than decompressing on the CPU and transferring the data to the GPU).
If I split the program into multiple thread groups (each with only 1 thread) and run it, will it be faster? Can the GPU write to global memory from several threads at once, or is there a memory lock so that only 1 thread actually writes while the others must wait?

The writes go to different locations, so no data is overwritten. Each thread works within its own unique interval of the buffer.

Thanks

Hi Martin,

GPU threads can write to global memory concurrently, so splitting the work across several thread groups won’t make it slower.
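To make the "unique interval per thread" idea concrete, here is a minimal CPU-side sketch in C++ of how the output buffer could be partitioned. It assumes the decompressed size of each block is known up front, which the thread does not confirm. The names `outputOffsets` and `writeBlock` are hypothetical, and the loop over blocks merely stands in for one thread group per block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: given the decompressed size of each block, return the
// start offset of each block in the output buffer (an exclusive prefix sum).
// Block i then owns the interval [offsets[i], offsets[i] + sizes[i]).
std::vector<std::uint32_t> outputOffsets(const std::vector<std::uint32_t>& sizes) {
    std::vector<std::uint32_t> offsets(sizes.size());
    std::uint32_t running = 0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        offsets[i] = running;   // this block starts where the previous one ended
        running += sizes[i];
    }
    return offsets;
}

// Stand-in for the work of one thread group: it writes only inside its own
// interval, so no two blocks ever touch the same element.
// The value written here (the block index) is a placeholder for real output.
void writeBlock(std::vector<std::uint32_t>& out, std::uint32_t begin,
                std::uint32_t count, std::uint32_t block) {
    assert(begin + count <= out.size());
    for (std::uint32_t i = 0; i < count; ++i) {
        out[begin + i] = block;
    }
}
```

Because the intervals never overlap, the blocks need no locks or atomics between them; on the GPU the offsets table would be uploaded as a buffer and indexed by the group ID.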

How often do you have to read from global memory? Perhaps you can optimize this by caching the data you read in shared memory (groupshared in HLSL). This can help a lot.

You might get more performance out of it if you put a few more threads into each group, so that the warps are actually busy; with 1 thread per group, most lanes of a warp sit idle. There is a sweet spot for the number of threads per group. It depends on the GPU and should be a multiple of the number of threads per warp.
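The sizing arithmetic behind that advice can be sketched on the host side in C++. This is only an illustration: the warp size is passed in as a parameter because it depends on the hardware (32 on NVIDIA GPUs and 64 on AMD GCN wavefronts are common values), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Round a desired thread count up to the next multiple of the warp size,
// so that no warp in the group is only partially filled.
std::uint32_t roundUpToMultiple(std::uint32_t value, std::uint32_t multiple) {
    assert(multiple != 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Number of thread groups needed to cover `items` work items when each group
// runs `threadsPerGroup` threads (ceiling division).
std::uint32_t groupCount(std::uint32_t items, std::uint32_t threadsPerGroup) {
    assert(threadsPerGroup != 0);
    return (items + threadsPerGroup - 1) / threadsPerGroup;
}
```

In a Direct3D 11 application the result of `groupCount` would typically be passed as the first argument of `ID3D11DeviceContext::Dispatch`, with the matching thread count declared in the shader's `[numthreads(...)]` attribute. Since the last group may be only partly needed, the shader should skip thread IDs at or beyond the item count.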

Since you most definitely have to read from global memory, you’ll need enough threads to hide the latency of waiting for the memory accesses to finish. The GPU switches execution to other warps while one is stalled, so make sure that you have enough warps (threads) ready.

It might help you to look at the slides from Kayvon Fatahalian or Justin Luitjens to find out more.

Cheers!

Hi, thanks for the answer.

Reading from memory is not the problem (well, it is, but I read compressed data, so it is not as slow as writing). If I comment out the writes and keep only the reads, the speed is sufficient even with 1 thread. I read 4 values at once (packed in one uint) and parse them to "char", so a memory read happens only for every 4th value.
Actually, writing is done the same way, but many more values (the decompressed ones) have to be written. Shared memory is used for temporary buffers during decompression.
Using more threads within a group is not really possible, because the algorithm is highly serial. So I "emulate" serial execution with 1 thread per group.
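The packing of 4 "char" values into one uint mentioned above could look roughly like the following C++ sketch; the same shifts and masks carry over to HLSL. The byte order (lowest byte first) is an assumption, since the post does not state it, and `unpackBytes` and `packBytes` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Split one 32-bit word into 4 bytes.
// Assumption: the first value sits in the lowest 8 bits.
std::array<std::uint8_t, 4> unpackBytes(std::uint32_t word) {
    std::array<std::uint8_t, 4> bytes{};
    for (std::uint32_t i = 0; i < 4; ++i) {
        bytes[i] = static_cast<std::uint8_t>((word >> (8 * i)) & 0xFFu);
    }
    return bytes;
}

// Combine 4 bytes into one 32-bit word, so that a single memory write
// stores 4 decompressed values at once.
std::uint32_t packBytes(const std::array<std::uint8_t, 4>& bytes) {
    std::uint32_t word = 0;
    for (std::uint32_t i = 0; i < 4; ++i) {
        word |= static_cast<std::uint32_t>(bytes[i]) << (8 * i);
    }
    return word;
}
```

Collecting 4 output values in registers and storing them with one write keeps the number of global memory writes at a quarter of the number of decompressed values.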
