# Weird Direct Compute performance issue

This topic is 1356 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hello,

I implemented a radix sort algorithm in Direct Compute (Dx 11) based off this paper: www.sci.utah.edu/~csilva/papers/cgf.pdf

I created a simple application that uses the algorithm and benchmarks its efficiency. I am, however, seeing very weird results.

My GPU is a GTX 770.

Sorting 100,000 values:

2-bits integers:   ~0.38ms

4-bits integers:   ~0.75ms

6-bits integers:   ~1.12ms

8-bits integers:   ~1.48ms

10-bits integers: ~1.84ms

12-bits integers: ~2.21ms

14-bits integers: ~10.46ms

16-bits integers: ~11.12ms

32-bits integers: ~12.74ms

I'm having a hard time understanding the drastic increase when using more than 12-bits keys. The algorithm processes 2-bits per pass... so 12-bits requires 6-passes, 14 requires 7.  Can any-one point me in the right direction in figuring out why this would happen?

Thank you.

##### Share on other sites

Cache issues maybe? What happens if you sort 50,000 values of each size?

You should also take a look at the shader assembly for the 12 and 14 bit cases, in case the compiler has done something odd like say not unrolling the bigger loop.

##### Share on other sites

Each pass sorts 2-bits at a time. Each pass calls 3 dispatches: 1st creates the block sums buffer mentioned in the paper, 2nd does a prefix sum on it, 3rd scatters and sorts. So 12-bits keys need 6*3 = 18 dispatches, 14-bits keys need 21 dispatches.  Regardless the number of bits the keys have, they'll always use the same kernels for those 3 dispatches per pass.

Here are some more benchmarks.

Sorting 50,000 values:

2-bits integers:   ~0.18ms

4-bits integers:   ~0.37ms

6-bits integers:   ~0.55ms

8-bits integers:   ~0.74ms

10-bits integers: ~0.92ms

12-bits integers: ~1.09ms

14-bits integers: ~8.92ms

16-bits integers: ~10.00ms

32-bits integers: ~11.45ms

Sorting 10,000 values:

2-bits integers:   ~0.10ms

4-bits integers:   ~0.19ms

6-bits integers:   ~0.27ms

8-bits integers:   ~0.36ms

10-bits integers: ~0.45ms

12-bits integers: ~0.54ms

14-bits integers: ~8.08ms

16-bits integers: ~9.47ms

32-bits integers: ~11.46ms

If interested, I could provide source code (which is already on bitbucket), or upload the executable so that you can benchmark on your own computer.

Edited by french_hustler

1. 1
Rutin
35
2. 2
3. 3
4. 4
5. 5

• 12
• 14
• 9
• 9
• 14
• ### Forum Statistics

• Total Topics
633343
• Total Posts
3011431
• ### Who's Online (See full list)

There are no registered users currently online

×