Hello,
I implemented a radix sort algorithm in DirectCompute (DX11), based on this paper: www.sci.utah.edu/~csilva/papers/cgf.pdf
I created a simple application that runs the sort and benchmarks its performance. I am, however, seeing some very strange results.
My GPU is a GTX 770.
Sorting 100,000 values:
2-bit integers: ~0.38 ms
4-bit integers: ~0.75 ms
6-bit integers: ~1.12 ms
8-bit integers: ~1.48 ms
10-bit integers: ~1.84 ms
12-bit integers: ~2.21 ms
14-bit integers: ~10.46 ms
16-bit integers: ~11.12 ms
32-bit integers: ~12.74 ms
I'm having a hard time understanding the drastic jump once keys are wider than 12 bits. The algorithm processes 2 bits per pass, so 12-bit keys need 6 passes and 14-bit keys need 7. Through 6 passes the cost scales almost perfectly linearly at roughly 0.37 ms per pass, so a 7th pass should land around 2.6 ms, not 10.5 ms. Can anyone point me in the right direction to figure out why this happens? The outer loop is essentially the sketch below.
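For context, here is a simplified C++/D3D11 sketch of the per-pass structure. This is not my exact code; the shader handles, group count, and all resource binding are placeholders, but it shows why I expect each extra 2 bits of key width to just add one more identical pass.

    #include <d3d11.h>

    // Sketch of the per-pass loop (illustrative names, binding code omitted).
    // Each pass sorts on one 2-bit digit: histogram -> prefix scan -> scatter,
    // ping-ponging the key buffers between passes.
    void RadixSortPasses(ID3D11DeviceContext* ctx,
                         ID3D11ComputeShader* countCS,
                         ID3D11ComputeShader* scanCS,
                         ID3D11ComputeShader* scatterCS,
                         UINT keyBits, UINT numGroups)
    {
        const UINT kBitsPerPass = 2;
        for (UINT shift = 0; shift < keyBits; shift += kBitsPerPass)
        {
            // ...update the constant buffer holding `shift`, bind src/dst UAVs...

            ctx->CSSetShader(countCS, nullptr, 0);    // count the 4 possible digit values per block
            ctx->Dispatch(numGroups, 1, 1);

            ctx->CSSetShader(scanCS, nullptr, 0);     // exclusive prefix sum over the block counts
            ctx->Dispatch(1, 1, 1);

            ctx->CSSetShader(scatterCS, nullptr, 0);  // write each key to its sorted position
            ctx->Dispatch(numGroups, 1, 1);

            // ...swap the src and dst buffer bindings for the next pass...
        }
    }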
Thank you.