Weird Direct Compute performance issue


Hello,

I implemented a radix sort algorithm in DirectCompute (D3D11) based on this paper: www.sci.utah.edu/~csilva/papers/cgf.pdf

I created a simple application that uses the algorithm and benchmarks its efficiency. I am, however, seeing very weird results.
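
(For anyone reproducing this: GPU times like the ones below are typically measured with D3D11 disjoint timestamp queries. A minimal sketch follows, where RunRadixSort is a stand-in for the actual sort call and error handling is omitted.)

#include <d3d11.h>
#include <cstdio>

void RunRadixSort(ID3D11DeviceContext* context); // stand-in for the sort dispatches

void TimeSort(ID3D11Device* device, ID3D11DeviceContext* context)
{
    D3D11_QUERY_DESC qd = {};
    qd.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    ID3D11Query* disjoint = nullptr;
    device->CreateQuery(&qd, &disjoint);

    qd.Query = D3D11_QUERY_TIMESTAMP;
    ID3D11Query* tsBegin = nullptr;
    ID3D11Query* tsEnd = nullptr;
    device->CreateQuery(&qd, &tsBegin);
    device->CreateQuery(&qd, &tsEnd);

    context->Begin(disjoint);
    context->End(tsBegin);      // GPU timestamp before the sort
    RunRadixSort(context);      // the dispatches being measured
    context->End(tsEnd);        // GPU timestamp after the sort
    context->End(disjoint);

    // Spin until results are available (fine for a benchmark).
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
    while (context->GetData(disjoint, &dj, sizeof(dj), 0) == S_FALSE) {}
    UINT64 t0 = 0, t1 = 0;
    context->GetData(tsBegin, &t0, sizeof(t0), 0);
    context->GetData(tsEnd, &t1, sizeof(t1), 0);

    if (!dj.Disjoint) // timestamps are only valid if the interval wasn't disjoint
        printf("%f ms\n", 1000.0 * double(t1 - t0) / double(dj.Frequency));
}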

My GPU is a GTX 770.

Sorting 100,000 values:

2-bit integers: ~0.38ms
4-bit integers: ~0.75ms
6-bit integers: ~1.12ms
8-bit integers: ~1.48ms
10-bit integers: ~1.84ms
12-bit integers: ~2.21ms
14-bit integers: ~10.46ms
16-bit integers: ~11.12ms
32-bit integers: ~12.74ms

I'm having a hard time understanding the drastic increase beyond 12-bit keys. The algorithm processes 2 bits per pass, so 12-bit keys require 6 passes and 14-bit keys require 7. Can anyone point me in the right direction in figuring out why this would happen?

Thank you.


Cache issues maybe? What happens if you sort 50,000 values of each size?

You should also take a look at the shader assembly for the 12- and 14-bit cases, in case the compiler has done something odd, such as not unrolling the larger loop.
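
If you don't want to dig through a graphics debugger, you can dump the listing programmatically (or with fxc's /Fc switch) and diff the two variants. A rough sketch, where DumpShaderAsm and its arguments are placeholders:

#include <d3dcompiler.h>   // link with d3dcompiler.lib
#include <cstdio>

// Compile one compute kernel and print its assembly listing.
void DumpShaderAsm(const wchar_t* file, const char* entryPoint)
{
    ID3DBlob* code = nullptr;
    ID3DBlob* errors = nullptr;
    if (FAILED(D3DCompileFromFile(file, nullptr, nullptr, entryPoint,
                                  "cs_5_0", 0, 0, &code, &errors)))
    {
        if (errors)
            printf("%s\n", (const char*)errors->GetBufferPointer());
        return;
    }

    ID3DBlob* listing = nullptr;
    if (SUCCEEDED(D3DDisassemble(code->GetBufferPointer(),
                                 code->GetBufferSize(), 0, nullptr, &listing)))
        printf("%s\n", (const char*)listing->GetBufferPointer());
}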

Thank you for your reply.

Each pass sorts 2 bits at a time and issues 3 dispatches: the 1st builds the block sums buffer described in the paper, the 2nd runs a prefix sum over it, and the 3rd scatters and sorts. So 12-bit keys need 6 * 3 = 18 dispatches and 14-bit keys need 21. Regardless of the number of bits in the keys, the same three kernels are used for every pass.
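
Schematically, the host side of one sort looks something like this (a simplified sketch, not my actual code; the shader and helper names are stand-ins, and resource binding is omitted):

#include <d3d11.h>
#include <utility>

// Stand-in globals for the three kernels and the ping-pong key buffers.
ID3D11ComputeShader *csBlockSums, *csPrefixSum, *csScatter;
ID3D11Buffer *keysIn, *keysOut;
void UpdateConstants(ID3D11DeviceContext* ctx, UINT shift); // uploads shift to a cbuffer

void RadixSortPasses(ID3D11DeviceContext* ctx, UINT numBits, UINT numBlocks)
{
    for (UINT pass = 0; pass < numBits / 2; ++pass)
    {
        const UINT shift = pass * 2;   // this pass sorts digit (key >> shift) & 3
        UpdateConstants(ctx, shift);

        ctx->CSSetShader(csBlockSums, nullptr, 0);   // 1) per-block histograms
        ctx->Dispatch(numBlocks, 1, 1);

        ctx->CSSetShader(csPrefixSum, nullptr, 0);   // 2) scan the block sums
        ctx->Dispatch(1, 1, 1);

        ctx->CSSetShader(csScatter, nullptr, 0);     // 3) scatter to sorted order
        ctx->Dispatch(numBlocks, 1, 1);

        std::swap(keysIn, keysOut);                  // ping-pong key buffers
    }
}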

Here are some more benchmarks.

Sorting 50,000 values:

2-bit integers: ~0.18ms
4-bit integers: ~0.37ms
6-bit integers: ~0.55ms
8-bit integers: ~0.74ms
10-bit integers: ~0.92ms
12-bit integers: ~1.09ms
14-bit integers: ~8.92ms
16-bit integers: ~10.00ms
32-bit integers: ~11.45ms

Sorting 10,000 values:

2-bit integers: ~0.10ms
4-bit integers: ~0.19ms
6-bit integers: ~0.27ms
8-bit integers: ~0.36ms
10-bit integers: ~0.45ms
12-bit integers: ~0.54ms
14-bit integers: ~8.08ms
16-bit integers: ~9.47ms
32-bit integers: ~11.46ms

If you're interested, I can provide the source code (which is already on Bitbucket) or upload the executable so that you can benchmark on your own computer.

