Martin Perry

How to improve this code

This topic is 2336 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Hi, I have written an LZSS decompressor as an HLSL compute shader.

CODE

Does anyone have any ideas on how to improve this code?

Current times:

Decompressed size: 16 777 216
Compressed size: 4 987 146
Decompression time (Compute Shader): 86.68 ms
Thread groups count: 512 (threads per group 1)

Thanks :)


Thread groups count: 512 (threads per group 1)


That's problematic right there. If you can increase the number of threads per group to like 16-32 then you can expect to see (at least) some slight speedup.

The number of threads per group must be 1 because of the serial nature of the algorithm. If I parallelize this part, it's slower because of the many "if" branches.

Why don't you just use a single thread group and do all the work you're doing in different thread groups in threads that run concurrently instead of thread groups?

Don't thread groups run concurrently? My impression is that thread groups act like threads: the GPU only cares about the number of running threads, and those can come from different groups. Only the total matters, or doesn't it?


The number of threads per group must be 1 because of the serial nature of the algorithm. If I parallelize this part, it's slower because of the many "if" branches.


Then, respectfully, why are you trying to do this on the GPU? They're as ludicrously fast as they are because you have literally hundreds of individual cores working in parallel. If you can't really feed them, then that power is wasted.

EDIT: I think that came out a little more harsh than intended. Put another way: I think this might be best implemented CPU-side so you can let the GPU work on problems it's more suited to solving.

Well, more threads != faster execution. The code is bound by global memory reads/writes. And again, 1 thread per group should be OK if I have more groups. Number of running threads = group count * threads per group, and it cannot be larger than the maximum number of threads allowed for the GPU (or at least that is what the docs said).

You're never going to get good performance with 1 thread per thread group. GPUs can be really fast for 2 primary reasons:

1. You can run lots of threads in parallel
2. The latency of memory accesses can be hidden by switching threads

If you don't have enough threads in a thread group to fill a warp/wavefront, then you'll waste hardware. If you don't have multiple warps/wavefronts in flight, then the hardware won't be able to hide latency. Without those things, you're not going to get very far.
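
To put rough numbers on the wasted-hardware point, here is a back-of-the-envelope sketch in Python. It assumes a warp/wavefront width of 32 or 64 lanes (common values, but the real width depends on the GPU) and that a thread group always occupies a whole number of warps/wavefronts, so lanes beyond the group size sit idle. The function name `lane_utilization` is made up for this illustration.

```python
import math

def lane_utilization(threads_per_group: int, wave_width: int) -> float:
    """Fraction of SIMD lanes doing useful work for one thread group.

    Assumption: a group is padded up to a whole number of waves, so any
    lanes beyond threads_per_group are idle.
    """
    waves = math.ceil(threads_per_group / wave_width)
    return threads_per_group / (waves * wave_width)

# Compare a few group sizes against the two assumed wave widths.
for width in (32, 64):
    for tpg in (1, 8, 64):
        util = lane_utilization(tpg, width)
        print(f"width={width:2d} threads/group={tpg:2d} utilization={util:.1%}")
```

Under these assumptions, numthreads(1,1,1) uses 1 lane out of 32 (or 64) per warp/wavefront, and numthreads(8,1,1) still leaves most lanes idle. This only models lane occupancy; it says nothing about latency hiding or branch divergence.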


The number of threads per group must be 1 because of the serial nature of the algorithm. If I parallelize this part, it's slower because of the many "if" branches.

Actually, according to this paper and what others have already said, you're not going to get very far with the serial version!

CULZSS: LZSS Lossless Data Compression on CUDA

You need to break up the work to take advantage of the task or data parallelism of the GPU.

Well, LZ algorithms are serial, and nothing can be done about that. What can be done (and is done, for example, in the parallel version of LZMA in 7-zip, in CULZSS, and in my code) is to break the data into several blocks and compress/decompress each block separately. That's what I did, but the code for each block is still serial.
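
For readers following along, the block-splitting idea can be sketched on the CPU in Python. This is not the shader from the first post: the token layout below (a flag byte per 8 tokens, 12-bit distance, 4-bit length) is an assumption chosen for illustration, and the names `lzss_decompress_block` and `decompress_blocks` are hypothetical.

```python
from typing import List, Tuple

def lzss_decompress_block(src: bytes, out_size: int) -> bytes:
    """Decode one block in an assumed LZSS layout (not the poster's format).

    Assumed layout: a flag byte precedes every 8 tokens, read LSB first.
    Flag bit 1 = literal byte. Flag bit 0 = 2-byte little-endian match:
    low 12 bits hold (distance - 1), high 4 bits hold (length - 3).
    """
    out = bytearray()
    pos = 0
    while len(out) < out_size:
        flags = src[pos]
        pos += 1
        for bit in range(8):
            if len(out) >= out_size:
                break
            if flags & (1 << bit):
                out.append(src[pos])          # literal
                pos += 1
            else:
                token = src[pos] | (src[pos + 1] << 8)
                pos += 2
                distance = (token & 0x0FFF) + 1
                length = (token >> 12) + 3
                start = len(out) - distance
                if start < 0:
                    raise ValueError("match points before start of block")
                # Byte-by-byte copy: a match may overlap its own output,
                # which is the dependency that keeps a block serial.
                for i in range(length):
                    out.append(out[start + i])
    return bytes(out)

def decompress_blocks(blocks: List[Tuple[bytes, int]]) -> bytes:
    """Blocks share no history, so each one could be handed to its own
    thread or thread group; a plain loop is used here for clarity."""
    return b"".join(lzss_decompress_block(data, size) for data, size in blocks)

# Two hand-built blocks: "abc" + match(distance 3, length 6), and
# "z" + match(distance 1, length 3).
block_a = bytes([0b00000111, ord("a"), ord("b"), ord("c"), 0x02, 0x30])
block_b = bytes([0b00000001, ord("z"), 0x00, 0x00])
print(decompress_blocks([(block_a, 9), (block_b, 4)]))  # b'abcabcabczzzz'
```

The inner copy loop shows why each block stays serial: every match reads bytes that earlier tokens in the same block produced. Only the outer loop over blocks is independent.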

I tried to make thread groups and threads within groups...

Dispatch(64,1,1) - numthreads(8,1,1)

has the same (well, even slower) performance as

Dispatch(512, 1, 1) - numthreads (1,1,1)
