How to improve this code

8 comments, last by Martin Perry 12 years ago
Hi... I have written an LZSS decompressor in an HLSL compute shader.

CODE

Does anyone have any ideas on how to improve this code?

Current times:

Decompressed size: 16 777 216
Compressed size: 4 987 146
Decompression time (Compute Shader): 86.68 ms
Thread group count: 512 (threads per group: 1)

Thanks :)

Thread group count: 512 (threads per group: 1)


That's problematic right there. If you can increase the number of threads per group to something like 16-32, then you can expect to see (at least) some slight speedup.
The number of threads per group must be 1 because of the serial nature of the algorithm... If I parallelize this part, it's slower because of lots of "if" branches.
Why don't you just use a single thread group, and move the work you're currently doing in separate thread groups into threads within that one group, which run concurrently?
Don't thread groups run concurrently? I have the feeling that thread groups act like threads... The GPU only cares about the number of running threads, and they can come from different groups. Only the total matters... or not?

The number of threads per group must be 1 because of the serial nature of the algorithm... If I parallelize this part, it's slower because of lots of "if" branches.


Then, respectfully, why are you trying to do this on the GPU? They're as ludicrously fast as they are because you have literally hundreds of individual cores working in parallel. If you can't really feed them, then that power is wasted.

EDIT: I think that came out a little more harsh than intended. Put another way: I think this might be best implemented CPU-side so you can let the GPU work on problems it's more suited to solving.
Well... more threads != faster execution... the code is bound by global memory reads/writes... And again, 1 thread per group should be OK if I have more groups. The number of running threads = group count * threads per group, and it cannot be bigger than the maximum allowed number of threads for the GPU (or at least that's what the docs say).
You're never going to get good performance with 1 thread per thread group. GPUs can be really fast for two primary reasons:

1. You can run lots of threads in parallel
2. The latency of memory accesses can be hidden by switching threads

If you don't have enough threads in a thread group to fill a warp/wavefront, then you'll waste hardware. If you don't have multiple warps/wavefronts in flight, then the hardware won't be able to hide latency. Without those things, you're not going to get very far.
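
To make that concrete, here is a minimal sketch of the launch shape being described. The output buffer and the block-to-thread mapping are placeholders rather than the actual decompressor; 64 threads per group is chosen because an NVIDIA warp is 32 threads and a GCN-era AMD wavefront is 64.

#define THREADS_PER_GROUP 64   // one full AMD wavefront, two NVIDIA warps
#define NUM_BLOCKS        512

RWStructuredBuffer<uint> gPerBlockResult : register(u0);   // placeholder output

[numthreads(THREADS_PER_GROUP, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint blockIndex = dtid.x;      // one compressed block per *thread*, not per group
    if (blockIndex >= NUM_BLOCKS)
        return;                    // guard in case NUM_BLOCKS isn't a multiple of 64

    // ...the existing serial per-block loop goes here; only the mapping of
    // blocks to threads changes, not the decompression logic...
    gPerBlockResult[blockIndex] = blockIndex;   // stand-in for the real work
}

// Host side: Dispatch(NUM_BLOCKS / THREADS_PER_GROUP, 1, 1) = Dispatch(8, 1, 1).
// The total thread count is still 512, but each wavefront is now fully populated
// and several groups can be resident at once to help hide global-memory latency.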

The number of threads per group must be 1 because of the serial nature of the algorithm... If I parallelize this part, it's slower because of lots of "if" branches.

Actually, according to this paper and what others have already said, you're not going to get very far with the serial version!

CULZSS: LZSS Lossless Data Compression on CUDA

You need to break up the work to take advantage of the task or data parallelism of the GPU.
[size="2"]Don't talk about writing games, don't write design docs, don't spend your time on web boards. Sit in your house write 20 games when you complete them you will either want to do it the rest of your life or not * Andre Lamothe
Well... LZ algorithms are serial... nothing can be done about that. What can be done (and it's done, for example, in the parallel version of LZMA in 7-Zip, in CULZSS, and in my code) is to break the data into several blocks and compress/decompress each block separately. That's what I did... but each block is serial code.
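
A sketch of that block-parallel structure is below. The buffer layout, the flag-byte token format (1 = literal, 0 = 12-bit distance / 4-bit length pair), the fixed 32 KB output per block (16 777 216 bytes / 512 blocks), the offsets table with one extra end entry, and well-formed input are all assumptions for illustration, not the actual shader:

#define THREADS_PER_GROUP 64
#define DECOMP_BLOCK_SIZE 32768        // assumed fixed decompressed size per block

ByteAddressBuffer      gCompressed   : register(t0);  // all compressed blocks, back to back
StructuredBuffer<uint> gBlockOffsets : register(t1);  // byte offset of each block, plus one end entry
RWByteAddressBuffer    gDecompressed : register(u0);

cbuffer Params : register(b0)
{
    uint gNumBlocks;
};

uint ReadSrcByte(uint addr)
{
    // ByteAddressBuffer loads are dword-aligned, so pick the byte out of the dword.
    return (gCompressed.Load(addr & ~3u) >> ((addr & 3u) * 8u)) & 0xFFu;
}

uint ReadDstByte(uint addr)
{
    return (gDecompressed.Load(addr & ~3u) >> ((addr & 3u) * 8u)) & 0xFFu;
}

void WriteDstByte(uint addr, uint value)
{
    // Unsynchronized read-modify-write is safe here only because every thread owns a
    // disjoint, dword-aligned 32 KB output range; a faster version would batch whole dwords.
    uint shift = (addr & 3u) * 8u;
    uint word  = gDecompressed.Load(addr & ~3u);
    word = (word & ~(0xFFu << shift)) | (value << shift);
    gDecompressed.Store(addr & ~3u, word);
}

[numthreads(THREADS_PER_GROUP, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint block = dtid.x;
    if (block >= gNumBlocks)
        return;

    uint src    = gBlockOffsets[block];        // this block's compressed data
    uint srcEnd = gBlockOffsets[block + 1];
    uint dst    = block * DECOMP_BLOCK_SIZE;   // this block's output range
    uint dstEnd = dst + DECOMP_BLOCK_SIZE;

    // The inner loop stays serial; only the blocks run in parallel.
    while (src < srcEnd && dst < dstEnd)
    {
        uint flags = ReadSrcByte(src++);       // 8 token flags per flag byte
        for (uint bit = 0; bit < 8 && src < srcEnd && dst < dstEnd; ++bit)
        {
            if (flags & (1u << bit))
            {
                WriteDstByte(dst++, ReadSrcByte(src++));              // literal byte
            }
            else
            {
                // Match token: 12-bit backwards distance, 4-bit length (minimum match of 3).
                uint b0 = ReadSrcByte(src++);
                uint b1 = ReadSrcByte(src++);
                uint distance = (b0 << 4) | (b1 >> 4);
                uint length   = (b1 & 0xFu) + 3u;
                for (uint i = 0; i < length && dst < dstEnd; ++i)
                {
                    WriteDstByte(dst, ReadDstByte(dst - distance));   // copy from the window
                    ++dst;
                }
            }
        }
    }
}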

I tried using both multiple thread groups and multiple threads within a group...

Dispatch(64, 1, 1) - numthreads(8, 1, 1)

has the same (well... even slower) performance as

Dispatch(512, 1, 1) - numthreads(1, 1, 1)
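
For reference, both of those launch shapes create the same 512 threads in total; the only difference is how many of them share a group. A sketch of the index math, assuming the 32-wide warps / 64-wide wavefronts mentioned above:

// Dispatch(512, 1, 1) with [numthreads(1, 1, 1)]: 512 * 1 = 512 threads,
//   dtid.x = groupID.x                               -> covers 0..511
// Dispatch(64, 1, 1)  with [numthreads(8, 1, 1)]:   64 * 8 = 512 threads,
//   dtid.x = groupID.x * 8 + groupThreadID.x         -> covers 0..511
//
// Either way each block maps to one dtid.x value, but a group of 8 threads still
// occupies a whole 32-wide warp (or 64-wide wavefront) with most lanes idle, so
// neither shape fills the hardware in the sense described earlier in the thread.

[numthreads(8, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint blockIndex = dtid.x;   // identical block mapping in both configurations
    // ...serial per-block decompression as before...
}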

