Hello,

I implemented a radix sort algorithm in DirectCompute (DX11) based on this paper: *www.sci.utah.edu/~csilva/papers/cgf.pdf*

I created a simple application that uses the algorithm and benchmarks its performance. I am, however, seeing very strange results.

My GPU is a GTX 770.

Sorting 100,000 values:

2-bit integers: ~0.38ms

4-bit integers: ~0.75ms

6-bit integers: ~1.12ms

8-bit integers: ~1.48ms

10-bit integers: ~1.84ms

12-bit integers: ~2.21ms

14-bit integers: ~10.46ms

16-bit integers: ~11.12ms

32-bit integers: ~12.74ms

I'm having a hard time understanding the drastic increase when sorting keys wider than 12 bits. The algorithm processes 2 bits per pass, so 12-bit keys require 6 passes and 14-bit keys require 7. I would expect the cost to grow linearly with the pass count, yet one extra pass costs roughly 8ms instead of ~0.37ms. Can anyone point me in the right direction to figure out why this happens?
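To make the expected behavior concrete, here is a minimal CPU reference of the pass structure I'm assuming the GPU algorithm follows: an LSD radix sort handling 2 bits (4 buckets) per pass, where the pass count is just ceil(key_bits / 2). This is only a sketch to illustrate why I expect the per-pass cost to be constant; the function name and structure are my own, not taken from the paper.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference: LSD radix sort, 2 bits (4 buckets) per pass.
// Each pass does the same amount of work, so sorting k-bit keys should
// cost roughly ceil(k / 2) times the cost of a single pass.
std::vector<uint32_t> radix_sort_2bit(std::vector<uint32_t> keys, int key_bits) {
    int passes = (key_bits + 1) / 2;  // ceil(key_bits / 2)
    for (int p = 0; p < passes; ++p) {
        int shift = 2 * p;
        // Scatter keys into 4 buckets based on the current 2-bit digit.
        std::vector<std::vector<uint32_t>> buckets(4);
        for (uint32_t k : keys)
            buckets[(k >> shift) & 3u].push_back(k);
        // Gather buckets back in order (stable within each bucket).
        keys.clear();
        for (const auto& b : buckets)
            keys.insert(keys.end(), b.begin(), b.end());
    }
    return keys;
}
```

On the CPU this scales linearly with the number of passes, which is exactly what I see up to 12 bits, and why the jump at 14 bits surprises me.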

Thank you.