Dingleberry

Does anyone use counter/Append/Consume buffers very much?


I've been using append buffers for a while out of convenience, with the assumption that I'll perhaps switch to other algorithms after I get other things working. Kind of like a temporary thing that works and isn't prohibitively slow for what I want. I find them kind of nice for what they are but I haven't gotten around to comparing their performance to other algorithms. I know I could do this myself but I was hoping someone else might have experience with this.

 

One thing I use them for is to generate arbitrary amounts of data from a compute thread. A thread might output 0-N entries of data and you just call buffer.Append however many times you want and it works. There are a lot of other ways you could do this, though. For example, is there an advantage to using append buffers over InterlockedIncrement'ing an index value? This is probably the simplest test, showing off how lazy I am for not trying it out, but I always wondered if append buffers did something more complex than this. The actual counter variable needs to be 4096-byte aligned for some reason (is this a bug?), unlike a regular integer, so maybe this strict alignment rule offers some sort of better perf somehow?

 

There are also algorithms like scan or HistoPyramid that could be used to do a data compaction operation. Is it assumed that these will generally be faster than using an append buffer? I understand they're rather different and that performance varies based on the data in the latter case, but it would be nice to know if append buffers are just always the slowest option for things like stream compaction or filter operations.
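For comparison, here is a minimal sketch of the scan-based compaction mentioned above, written as serial C++ rather than a shader. It flags the elements to keep, takes an exclusive prefix sum of the flags to get each survivor's output slot, then scatters. The names `exclusive_scan` and `compact_even` and the "keep even values" predicate are made up for illustration. On a GPU the scan would be a parallel pass, and this says nothing about which approach is faster.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive prefix sum: out[i] = flags[0] + ... + flags[i-1].
std::vector<std::uint32_t> exclusive_scan(const std::vector<std::uint32_t>& flags) {
    std::vector<std::uint32_t> out(flags.size());
    std::uint32_t running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        out[i] = running;
        running += flags[i];
    }
    return out;
}

// Keep only the even values (hypothetical predicate), preserving input order.
std::vector<std::uint32_t> compact_even(const std::vector<std::uint32_t>& input) {
    // Pass 1: one flag per element, 1 = keep, 0 = discard.
    std::vector<std::uint32_t> flags(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        flags[i] = (input[i] % 2 == 0) ? 1u : 0u;
    }

    // Pass 2: the scan turns flags into output offsets.
    std::vector<std::uint32_t> offsets = exclusive_scan(flags);
    std::uint32_t total = input.empty() ? 0u : offsets.back() + flags.back();

    // Pass 3: scatter survivors to their slots. No atomics are needed.
    std::vector<std::uint32_t> output(total);
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (flags[i]) {
            output[offsets[i]] = input[i];
        }
    }
    return output;
}
```

One practical difference: the scan version keeps the survivors in input order, while the order of entries in an append buffer is whatever order the threads happened to hit the counter.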

 

Edit: Also, is there an advantage to using consume buffers instead of addressing the buffer directly? If I create a buffer with N items, I can just reference the items like buffer[index], so why bother consuming? I guess it resets the counter, but that's kind of a trivial thing to do.
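One way to picture the difference, as a CPU-side sketch only: assume a consume is an atomic decrement of the hidden counter followed by a read of the slot it landed on. `ConsumeBufferModel` is a hypothetical name, not a D3D type, and the real implementation may differ per GPU.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical CPU model of a consume buffer: items plus a hidden atomic counter.
struct ConsumeBufferModel {
    std::vector<std::uint32_t> data;
    std::atomic<std::uint32_t> counter;

    explicit ConsumeBufferModel(std::vector<std::uint32_t> items)
        : data(std::move(items)),
          counter(static_cast<std::uint32_t>(data.size())) {}

    // Stand-in for buffer.Consume(): take whichever item is currently last.
    std::uint32_t consume() {
        std::uint32_t previous = counter.fetch_sub(1, std::memory_order_relaxed);
        assert(previous > 0 && "consumed from an empty buffer");
        return data[previous - 1];
    }

    // Direct addressing: the caller already knows which item it wants.
    std::uint32_t read_direct(std::uint32_t index) const {
        return data[index];
    }
};
```

Under that assumption, direct indexing is enough when every thread can compute its own index, and consuming only helps when threads pull a varying number of items and the indices can't be worked out up front.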

Edited by Dingleberry



For example, is there an advantage to using append buffers over InterlockedIncrement'ing an index value?

I don't know how append buffers work, but IIRC an InterlockedIncrement needs to access the L2, which might make your shader slower if append buffers have something special about their implementation.


Yeah I'd really hope they're faster than InterlockedIncrement since that's about as brute force as you can get. Maybe the whole thing is just like MSFT saying "dear vendors, please make this fast somehow, thank you"? In that case I imagine they could use special instructions unavailable to HLSL like CUDA's vote functions. I mean, that's what I would want to do if I were making the drivers.



There's a lot of other ways you could do this though. For example, is there an advantage to using append buffers over InterlockedIncrement'ing an index value?
A lot of GPUs don't actually implement append/consume buffers as a hardware feature. On these GPUs, the driver will be doing exactly this: pairing a buffer with an atomic (interlocked) counter. If you tested on one of those GPUs, you'd probably find no difference in performance at all.

However, some other GPUs might have special hardware that allows an optimized implementation of append/consume buffers, so the D3D abstraction allows these GPUs to do their thing.
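To make the "buffer paired with an atomic counter" case concrete, here is a rough CPU model in C++. It is not driver code. `AppendBufferModel` and `run_append_demo` are invented names, and a plain loop stands in for the compute dispatch.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU stand-in for an append buffer: storage plus a hidden atomic counter.
struct AppendBufferModel {
    std::vector<std::uint32_t> data;
    std::atomic<std::uint32_t> counter{0};

    explicit AppendBufferModel(std::size_t capacity) : data(capacity) {}

    // Stand-in for buffer.Append(value): the atomic increment hands out a unique slot.
    void append(std::uint32_t value) {
        std::uint32_t slot = counter.fetch_add(1, std::memory_order_relaxed);
        if (slot < data.size()) {
            data[slot] = value;  // writes past capacity are dropped in this model
        }
    }
};

// Each simulated thread t appends t % 3 entries (0 to 2 entries per thread).
// Returns the appended values sorted, since real append order is unspecified.
std::vector<std::uint32_t> run_append_demo(std::uint32_t thread_count) {
    AppendBufferModel buffer(static_cast<std::size_t>(thread_count) * 2);
    for (std::uint32_t t = 0; t < thread_count; ++t) {  // loop stands in for the dispatch
        for (std::uint32_t k = 0; k < t % 3; ++k) {
            buffer.append(t);
        }
    }
    std::uint32_t count = buffer.counter.load();
    std::vector<std::uint32_t> result(buffer.data.begin(), buffer.data.begin() + count);
    std::sort(result.begin(), result.end());
    return result;
}
```

If a GPU does it this way, a hand-rolled counter and an append buffer come down to the same atomic per element, which fits the "no difference in performance" expectation above.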

The only catch with an append buffer is that you can only append one element at a time. This can be wasteful if a single thread decides to append multiple elements, since a lot of hardware implements an append buffer by performing atomic increments on a "hidden" counter variable. For such cases, you can potentially get better performance (as well as better data coherency) by performing a single atomic add in order to "reserve" multiple elements in the output buffer.
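The "reserve" idea can be sketched like this, again as a CPU model in C++ with made-up names (`OutputModel`, `append_each`, `append_batch`). In a shader the equivalent would presumably be an InterlockedAdd on a counter you manage yourself, followed by plain writes into the reserved range.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical output buffer with a manually managed atomic counter.
struct OutputModel {
    std::vector<std::uint32_t> data;
    std::atomic<std::uint32_t> counter{0};

    explicit OutputModel(std::size_t capacity) : data(capacity) {}

    // Append-style: one atomic increment per element.
    void append_each(const std::vector<std::uint32_t>& values) {
        for (std::uint32_t v : values) {
            std::uint32_t slot = counter.fetch_add(1, std::memory_order_relaxed);
            if (slot < data.size()) data[slot] = v;
        }
    }

    // Reserve-style: a single atomic add claims a contiguous range for all elements.
    // Returns the first reserved index.
    std::uint32_t append_batch(const std::vector<std::uint32_t>& values) {
        std::uint32_t count = static_cast<std::uint32_t>(values.size());
        std::uint32_t base = counter.fetch_add(count, std::memory_order_relaxed);
        for (std::uint32_t i = 0; i < count; ++i) {
            if (base + i < data.size()) data[base + i] = values[i];
        }
        return base;
    }
};
```

With `append_batch`, a thread's elements end up next to each other in the output, which is where the better data coherency comes from. With `append_each`, other threads can interleave their entries between them.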
