Thank you for your reply.

Each pass sorts 2-bits at a time. Each pass calls 3 dispatches: 1st creates the block sums buffer mentioned in the paper, 2nd does a prefix sum on it, 3rd scatters and sorts. So 12-bits keys need 6*3 = 18 dispatches, 14-bits keys need 21 dispatches. Regardless the number of bits the keys have, they'll always use the same kernels for those 3 dispatches per pass.

Here are some more benchmarks.

Sorting 50,000 values:

2-bits integers: ~0.18ms

4-bits integers: ~0.37ms

6-bits integers: ~0.55ms

8-bits integers: ~0.74ms

10-bits integers: ~0.92ms

12-bits integers: ~1.09ms

14-bits integers: ~8.92ms

16-bits integers: ~10.00ms

32-bits integers: ~11.45ms

Sorting 10,000 values:

2-bits integers: ~0.10ms

4-bits integers: ~0.19ms

6-bits integers: ~0.27ms

8-bits integers: ~0.36ms

10-bits integers: ~0.45ms

12-bits integers: ~0.54ms

14-bits integers: ~8.08ms

16-bits integers: ~9.47ms

32-bits integers: ~11.46ms

If interested, I could provide source code (which is already on bitbucket), or upload the executable so that you can benchmark on your own computer.