french_hustler

Weird Direct Compute performance issue


Hello,

 

I implemented a radix sort algorithm in DirectCompute (D3D11) based on this paper: www.sci.utah.edu/~csilva/papers/cgf.pdf

 

I created a simple application that uses the algorithm and benchmarks its efficiency. I am, however, seeing very weird results.

My GPU is a GTX 770.

 

Sorting 100,000 values:

2-bit integers:  ~0.38 ms
4-bit integers:  ~0.75 ms
6-bit integers:  ~1.12 ms
8-bit integers:  ~1.48 ms
10-bit integers: ~1.84 ms
12-bit integers: ~2.21 ms
14-bit integers: ~10.46 ms
16-bit integers: ~11.12 ms
32-bit integers: ~12.74 ms

 

I'm having a hard time understanding the drastic increase once the keys are wider than 12 bits. The algorithm processes 2 bits per pass, so 12-bit keys require 6 passes and 14-bit keys require 7. Can anyone point me in the right direction for figuring out why this happens?
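For reference, the pass-count arithmetic looks like this (a minimal sketch with illustrative names, not the actual code from my repository):

```cpp
#include <cstdint>

// 2-bits-per-pass LSD radix sort: the pass count scales linearly with key width.
constexpr uint32_t BITS_PER_PASS = 2;

uint32_t NumPasses(uint32_t keyBits)
{
    // ceil(keyBits / BITS_PER_PASS): 12-bit keys -> 6 passes, 14-bit keys -> 7
    return (keyBits + BITS_PER_PASS - 1) / BITS_PER_PASS;
}

// Each pass sorts on one 2-bit digit of the key, in the range [0, 3].
uint32_t Digit(uint32_t key, uint32_t pass)
{
    return (key >> (pass * BITS_PER_PASS)) & ((1u << BITS_PER_PASS) - 1);
}
```

So I would expect the cost to grow roughly linearly with the number of passes, which is why the jump from 12 to 14 bits surprises me.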

 

Thank you.


Cache issues maybe? What happens if you sort 50,000 values of each size?

 

You should also take a look at the shader assembly for the 12-bit and 14-bit cases, in case the compiler has done something odd, such as not unrolling the larger loop.
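For example, you can dump the disassembly offline with D3DCompileFromFile + D3DDisassemble (a rough sketch; the file name and entry point are placeholders, and fxc.exe with /Fc produces the same listing):

```cpp
#include <d3dcompiler.h>
#include <cstdio>
#pragma comment(lib, "d3dcompiler.lib")

// Compile one compute-shader entry point and print its DXBC disassembly.
// "RadixSort.hlsl" and "CSScatter" are placeholder names.
void DumpShaderAsm()
{
    ID3DBlob* code = nullptr;
    ID3DBlob* errors = nullptr;
    HRESULT hr = D3DCompileFromFile(L"RadixSort.hlsl", nullptr, nullptr,
                                    "CSScatter", "cs_5_0",
                                    D3DCOMPILE_OPTIMIZATION_LEVEL3, 0,
                                    &code, &errors);
    if (FAILED(hr))
    {
        if (errors) std::printf("%s\n", (const char*)errors->GetBufferPointer());
        return;
    }

    ID3DBlob* asmText = nullptr;
    if (SUCCEEDED(D3DDisassemble(code->GetBufferPointer(), code->GetBufferSize(),
                                 0, nullptr, &asmText)))
    {
        std::printf("%s\n", (const char*)asmText->GetBufferPointer());
        asmText->Release();
    }
    if (errors) errors->Release();
    code->Release();
}
```

Comparing the 12-bit and 14-bit listings should show whether the compiler handled the loop differently.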


Thank you for your reply.

Each pass sorts on 2 bits at a time and issues 3 dispatches: the first builds the block-sums buffer described in the paper, the second runs a prefix sum over it, and the third scatters/sorts. So 12-bit keys need 6 * 3 = 18 dispatches and 14-bit keys need 21. Regardless of the key width, every pass uses the same three kernels (sketched below).
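To make that concrete, here is a rough sketch of the host-side loop (shader and count names are placeholders, not the actual code from my repository):

```cpp
#include <d3d11.h>

// Rough sketch of the per-pass dispatch structure described above.
// csBlockSums, csPrefixSum and csScatter are placeholder shader names;
// groupCount/blockCount depend on the element count and thread-group size.
void RadixSort(ID3D11DeviceContext* ctx, UINT keyBits,
               UINT groupCount, UINT blockCount,
               ID3D11ComputeShader* csBlockSums,
               ID3D11ComputeShader* csPrefixSum,
               ID3D11ComputeShader* csScatter)
{
    const UINT numPasses = (keyBits + 1) / 2;        // 2 bits per pass
    for (UINT pass = 0; pass < numPasses; ++pass)
    {
        // (bind per-pass constants and ping-pong the key buffers here)

        ctx->CSSetShader(csBlockSums, nullptr, 0);   // 1) per-block digit counts
        ctx->Dispatch(groupCount, 1, 1);

        ctx->CSSetShader(csPrefixSum, nullptr, 0);   // 2) prefix sum over the block sums
        ctx->Dispatch(blockCount, 1, 1);

        ctx->CSSetShader(csScatter, nullptr, 0);     // 3) scatter keys into sorted order
        ctx->Dispatch(groupCount, 1, 1);
    }
}
```

Nothing in that loop changes with the key width except the trip count, which is why I would expect the time to scale linearly with the number of passes.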

 

Here are some more benchmarks.

 

Sorting 50,000 values:

2-bit integers:  ~0.18 ms
4-bit integers:  ~0.37 ms
6-bit integers:  ~0.55 ms
8-bit integers:  ~0.74 ms
10-bit integers: ~0.92 ms
12-bit integers: ~1.09 ms
14-bit integers: ~8.92 ms
16-bit integers: ~10.00 ms
32-bit integers: ~11.45 ms

 

Sorting 10,000 values:

2-bit integers:  ~0.10 ms
4-bit integers:  ~0.19 ms
6-bit integers:  ~0.27 ms
8-bit integers:  ~0.36 ms
10-bit integers: ~0.45 ms
12-bit integers: ~0.54 ms
14-bit integers: ~8.08 ms
16-bit integers: ~9.47 ms
32-bit integers: ~11.46 ms

 

 

If anyone is interested, I can provide the source code (it is already on Bitbucket) or upload the executable so you can run the benchmark on your own machine.
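For anyone benchmarking on their own machine, one way to time just the GPU work is with D3D11 timestamp queries around the sort (a sketch of a possible measurement approach, not necessarily how my numbers were taken):

```cpp
#include <d3d11.h>

// Minimal sketch: time a block of GPU work with D3D11 timestamp queries.
double TimeGpuWorkMs(ID3D11Device* dev, ID3D11DeviceContext* ctx)
{
    D3D11_QUERY_DESC disjointDesc = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
    D3D11_QUERY_DESC stampDesc    = { D3D11_QUERY_TIMESTAMP, 0 };
    ID3D11Query *disjoint, *begin, *end;
    dev->CreateQuery(&disjointDesc, &disjoint);
    dev->CreateQuery(&stampDesc, &begin);
    dev->CreateQuery(&stampDesc, &end);

    ctx->Begin(disjoint);
    ctx->End(begin);                 // timestamp before the sort dispatches
    // ... issue the radix sort dispatches here ...
    ctx->End(end);                   // timestamp after the sort dispatches
    ctx->End(disjoint);

    // Spin until the query results are available (fine for a benchmark).
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
    while (ctx->GetData(disjoint, &dj, sizeof(dj), 0) != S_OK) {}
    UINT64 t0 = 0, t1 = 0;
    while (ctx->GetData(begin, &t0, sizeof(t0), 0) != S_OK) {}
    while (ctx->GetData(end,   &t1, sizeof(t1), 0) != S_OK) {}

    disjoint->Release(); begin->Release(); end->Release();
    if (dj.Disjoint) return -1.0;    // clock was unstable, discard this sample
    return double(t1 - t0) * 1000.0 / double(dj.Frequency);
}
```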
