How to improve this code

8 comments, last by Martin Perry 12 years ago
Hi... I have written an LZSS decompressor in an HLSL compute shader.

CODE

Does anyone have any ideas on how to improve this code?

Current times:

Decompressed size: 16 777 216
Compressed size: 4 987 146
Decompression time (Compute Shader): 86.68 ms
Thread group count: 512 (threads per group: 1)

Thanks :)

Thread group count: 512 (threads per group: 1)


That's problematic right there. If you can increase the number of threads per group to something like 16-32, then you can expect to see (at least) some slight speedup.
The number of threads per group must be 1 because of the serial nature of the algorithm... If I parallelize this part, it's slower because of lots of "if" branches.
Why don't you just use a single thread group, and move the work you're currently doing in separate thread groups into threads within that one group, which run concurrently?
Don't thread groups run concurrently? I have the feeling that thread groups act like threads... The GPU only cares about the number of running threads, and they can come from different groups. Only the total matters... or not?

The number of threads per group must be 1 because of the serial nature of the algorithm... If I parallelize this part, it's slower because of lots of "if" branches.


Then, respectfully, why are you trying to do this on the GPU? They're as ludicrously fast as they are because you have literally hundreds of individual cores working in parallel. If you can't really feed them, then that power is wasted.

EDIT: I think that came out a little more harsh than intended. Put another way: I think this might be best implemented CPU-side so you can let the GPU work on problems it's more suited to solving.
Well... more threads != faster execution... the code is bound by global memory reads/writes... And again, 1 thread per group should be OK if I have more groups. The number of running threads = group count * threads per group, and it cannot be bigger than the maximum allowed number of threads for the GPU (or at least that's what the docs say).
You're never going to get good performance with 1 thread per thread group. GPUs can be really fast for two primary reasons:

1. You can run lots of threads in parallel
2. The latency of memory accesses can be hidden by switching threads

If you don't have enough threads in a thread group to fill a warp/wavefront, then you'll waste hardware. If you don't have multiple warps/wavefronts in flight, then the hardware won't be able to hide latency. Without those things, you're not going to get very far.
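
To make that concrete, here is a minimal sketch of the launch shape being described. The output buffer and the block-to-thread mapping are placeholders rather than the actual decompressor; 64 threads per group is chosen because an NVIDIA warp is 32 threads and a GCN-era AMD wavefront is 64.

#define THREADS_PER_GROUP 64   // one full AMD wavefront, two NVIDIA warps
#define NUM_BLOCKS        512

RWStructuredBuffer<uint> gPerBlockResult : register(u0);   // placeholder output

[numthreads(THREADS_PER_GROUP, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint blockIndex = dtid.x;      // one compressed block per *thread*, not per group
    if (blockIndex >= NUM_BLOCKS)
        return;                    // guard in case NUM_BLOCKS isn't a multiple of 64

    // ...the existing serial per-block loop goes here; only the mapping of
    // blocks to threads changes, not the decompression logic...
    gPerBlockResult[blockIndex] = blockIndex;   // stand-in for the real work
}

// Host side: Dispatch(NUM_BLOCKS / THREADS_PER_GROUP, 1, 1) = Dispatch(8, 1, 1).
// The total thread count is still 512, but each wavefront is now fully populated
// and several groups can be resident at once to help hide global-memory latency.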

The number of threads per group must be 1 because of the serial nature of the algorithm... If I parallelize this part, it's slower because of lots of "if" branches.

Actually, according to this paper and what others have already said, you're not going to get very far with the serial version!

CULZSS: LZSS Lossless Data Compression on CUDA

You need to break up the work to take advantage of the task or data parallelism of the GPU.
[size="2"]Don't talk about writing games, don't write design docs, don't spend your time on web boards. Sit in your house write 20 games when you complete them you will either want to do it the rest of your life or not * Andre Lamothe
Well... LZ algorithms are serial... nothing can be done about that. What can be done (and it's done, for example, in the parallel version of LZMA in 7-Zip, in CULZSS, and in my code) is to break the data into several blocks and compress/decompress each block separately. That's what I did... but each block is serial code.
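
A sketch of that block-parallel structure is below. The buffer layout, the flag-byte token format (1 = literal, 0 = 12-bit distance / 4-bit length pair), the fixed 32 KB output per block (16 777 216 bytes / 512 blocks), the offsets table with one extra end entry, and well-formed input are all assumptions for illustration, not the actual shader:

#define THREADS_PER_GROUP 64
#define DECOMP_BLOCK_SIZE 32768        // assumed fixed decompressed size per block

ByteAddressBuffer      gCompressed   : register(t0);  // all compressed blocks, back to back
StructuredBuffer<uint> gBlockOffsets : register(t1);  // byte offset of each block, plus one end entry
RWByteAddressBuffer    gDecompressed : register(u0);

cbuffer Params : register(b0)
{
    uint gNumBlocks;
};

uint ReadSrcByte(uint addr)
{
    // ByteAddressBuffer loads are dword-aligned, so pick the byte out of the dword.
    return (gCompressed.Load(addr & ~3u) >> ((addr & 3u) * 8u)) & 0xFFu;
}

uint ReadDstByte(uint addr)
{
    return (gDecompressed.Load(addr & ~3u) >> ((addr & 3u) * 8u)) & 0xFFu;
}

void WriteDstByte(uint addr, uint value)
{
    // Unsynchronized read-modify-write is safe here only because every thread owns a
    // disjoint, dword-aligned 32 KB output range; a faster version would batch whole dwords.
    uint shift = (addr & 3u) * 8u;
    uint word  = gDecompressed.Load(addr & ~3u);
    word = (word & ~(0xFFu << shift)) | (value << shift);
    gDecompressed.Store(addr & ~3u, word);
}

[numthreads(THREADS_PER_GROUP, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint block = dtid.x;
    if (block >= gNumBlocks)
        return;

    uint src    = gBlockOffsets[block];        // this block's compressed data
    uint srcEnd = gBlockOffsets[block + 1];
    uint dst    = block * DECOMP_BLOCK_SIZE;   // this block's output range
    uint dstEnd = dst + DECOMP_BLOCK_SIZE;

    // The inner loop stays serial; only the blocks run in parallel.
    while (src < srcEnd && dst < dstEnd)
    {
        uint flags = ReadSrcByte(src++);       // 8 token flags per flag byte
        for (uint bit = 0; bit < 8 && src < srcEnd && dst < dstEnd; ++bit)
        {
            if (flags & (1u << bit))
            {
                WriteDstByte(dst++, ReadSrcByte(src++));              // literal byte
            }
            else
            {
                // Match token: 12-bit backwards distance, 4-bit length (minimum match of 3).
                uint b0 = ReadSrcByte(src++);
                uint b1 = ReadSrcByte(src++);
                uint distance = (b0 << 4) | (b1 >> 4);
                uint length   = (b1 & 0xFu) + 3u;
                for (uint i = 0; i < length && dst < dstEnd; ++i)
                {
                    WriteDstByte(dst, ReadDstByte(dst - distance));   // copy from the window
                    ++dst;
                }
            }
        }
    }
}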

I tried using both multiple thread groups and multiple threads within a group...

Dispatch(64, 1, 1) - numthreads(8, 1, 1)

has the same (well... even slower) performance as

Dispatch(512, 1, 1) - numthreads(1, 1, 1)
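
For reference, both of those launch shapes create the same 512 threads in total; the only difference is how many of them share a group. A sketch of the index math, assuming the 32-wide warps / 64-wide wavefronts mentioned above:

// Dispatch(512, 1, 1) with [numthreads(1, 1, 1)]: 512 * 1 = 512 threads,
//   dtid.x = groupID.x                               -> covers 0..511
// Dispatch(64, 1, 1)  with [numthreads(8, 1, 1)]:   64 * 8 = 512 threads,
//   dtid.x = groupID.x * 8 + groupThreadID.x         -> covers 0..511
//
// Either way each block maps to one dtid.x value, but a group of 8 threads still
// occupies a whole 32-wide warp (or 64-wide wavefront) with most lanes idle, so
// neither shape fills the hardware in the sense described earlier in the thread.

[numthreads(8, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint blockIndex = dtid.x;   // identical block mapping in both configurations
    // ...serial per-block decompression as before...
}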

