Does anyone use counter/Append/Consume buffers very much?

3 comments, last by MJP 8 years, 3 months ago

I've been using append buffers for a while out of convenience, with the assumption that I'll perhaps switch to other approaches once I get other things working. It's kind of a temporary solution that works and isn't prohibitively slow for what I want. I find them quite nice for what they are, but I haven't gotten around to comparing their performance against the alternatives. I know I could do this myself, but I was hoping someone else might already have experience with it.

One thing I use them for is generating arbitrary amounts of data from a compute thread. A thread might output 0-N entries of data, and you just call buffer.Append however many times you want and it works. There are a lot of other ways you could do this, though. For example, is there an advantage to using append buffers over InterlockedIncrement'ing an index value yourself? It's probably the simplest thing to test, which shows how lazy I am for not having tried it, but I've always wondered whether append buffers do something more clever than that. The counter does need to be 4096-byte aligned for some reason (is this a bug?), unlike a regular integer, so maybe that strict alignment requirement enables some kind of better performance?
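For what it's worth, here's a minimal HLSL sketch of the two options being compared. All the names (Item, gAppendOut, gManualOut, gManualCount, EmitCount, MakeItem) and the workload are hypothetical placeholders, and in practice you'd pick one approach rather than doing both in the same shader:

```hlsl
struct Item { float3 pos; float value; };

AppendStructuredBuffer<Item> gAppendOut   : register(u0); // UAV created with the append flag
RWStructuredBuffer<Item>     gManualOut   : register(u1);
RWStructuredBuffer<uint>     gManualCount : register(u2); // single uint, cleared to 0 before dispatch

// Placeholder per-thread workload.
uint EmitCount(uint id) { return id % 4; }
Item MakeItem(uint id, uint i) { Item it; it.pos = float3(id, i, 0); it.value = 1.0f; return it; }

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint count = EmitCount(dtid.x); // this thread emits 0-3 items

    // (a) Append buffer: the hidden counter is bumped once per Append() call.
    for (uint i = 0; i < count; ++i)
        gAppendOut.Append(MakeItem(dtid.x, i));

    // (b) Manual counter: functionally the same thing, one atomic per element.
    for (uint j = 0; j < count; ++j)
    {
        uint slot;
        InterlockedAdd(gManualCount[0], 1, slot);
        gManualOut[slot] = MakeItem(dtid.x, j);
    }
}
```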

There are also algorithms like scan or histopyramid that can be used for data compaction; is it generally assumed that these will be faster than using an append buffer? I understand they're rather different approaches and that performance varies with the data in those cases, but it would be nice to know whether append buffers are simply always the slowest option for things like stream compaction or filter operations.
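For reference, a common scan-based compaction works roughly like the sketch below: each thread group prefix-sums its "keep" flags in shared memory, reserves a contiguous output range with a single atomic per group, and then scatters its surviving elements. Everything here (GROUP_SIZE, gInput, gOutput, gOutputCount, the Keep predicate) is a hypothetical example, and it assumes the input size is a multiple of the group size:

```hlsl
#define GROUP_SIZE 256

StructuredBuffer<float4>   gInput       : register(t0);
RWStructuredBuffer<float4> gOutput      : register(u0);
RWStructuredBuffer<uint>   gOutputCount : register(u1); // single uint, cleared to 0 before dispatch

groupshared uint sScan[GROUP_SIZE];
groupshared uint sGroupBase;

// Placeholder predicate: which elements survive the compaction.
bool Keep(float4 v) { return v.w > 0.0f; }

[numthreads(GROUP_SIZE, 1, 1)]
void CSCompact(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    float4 value = gInput[dtid.x];
    uint flag = Keep(value) ? 1u : 0u;

    // Inclusive prefix sum of the keep flags across the group (simple Hillis-Steele scan).
    sScan[gi] = flag;
    GroupMemoryBarrierWithGroupSync();
    for (uint offset = 1; offset < GROUP_SIZE; offset <<= 1)
    {
        uint partial = (gi >= offset) ? sScan[gi - offset] : 0u;
        GroupMemoryBarrierWithGroupSync();
        sScan[gi] += partial;
        GroupMemoryBarrierWithGroupSync();
    }

    // The last element of the inclusive scan is the group's survivor count:
    // reserve a contiguous output range with one atomic per group.
    if (gi == GROUP_SIZE - 1)
    {
        uint base;
        InterlockedAdd(gOutputCount[0], sScan[gi], base);
        sGroupBase = base;
    }
    GroupMemoryBarrierWithGroupSync();

    // Exclusive rank within the group = inclusive scan minus own flag.
    if (flag)
        gOutput[sGroupBase + sScan[gi] - 1] = value;
}
```

Note that with this scheme (as with append buffers) the output order across groups isn't deterministic, since groups race for their base offsets; only the order within a group is preserved.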

Edit: Also, is there an advantage to using consume buffers instead of addressing the buffer directly? If I create a buffer with N items, I can just reference them as buffer[index], so why bother consuming? I guess it decrements the counter, but that's kind of a trivial thing to do myself.
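For illustration, a rough sketch of the two access patterns (hypothetical names; it assumes the dispatch is sized to match the number of items, e.g. via an indirect dispatch driven by the counter):

```hlsl
// (a) Consume: each thread takes "the next" item; the hidden counter decides which one.
ConsumeStructuredBuffer<float4> gWorkConsume : register(u0);
// (b) Direct indexing: each thread reads a specific item.
StructuredBuffer<float4>        gWorkDirect  : register(t0);

[numthreads(64, 1, 1)]
void CSConsumeExample(uint3 dtid : SV_DispatchThreadID)
{
    // Consume() atomically decrements the hidden counter and returns the item at that slot.
    // Which thread ends up with which item is not deterministic.
    float4 a = gWorkConsume.Consume();

    // Direct indexing is deterministic: thread i always reads item i, and needs no counter.
    float4 b = gWorkDirect[dtid.x];

    // ... do work with a or b ...
}
```

Consuming mainly buys you something when the item count is only known on the GPU (e.g. the buffer was filled by a previous Append pass); if you already know N on the CPU, plain indexing is simpler.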



For example, is there an advantage to using append buffers over InterlockedIncrement'ing an index value?

I don't know how append buffers are implemented, but IIRC an InterlockedIncrement needs to go through the L2 cache, which might make your shader slower if append buffers have something special about their implementation.

-potential energy is easily made kinetic-

Yeah, I'd really hope they're faster than InterlockedIncrement, since that's about as brute-force as you can get. Maybe the whole thing is just MSFT saying "dear vendors, please make this fast somehow, thank you"? In that case I imagine they could use special instructions unavailable in HLSL, like CUDA's vote functions. I mean, that's what I would want to do if I were writing the drivers.
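For what it's worth, HLSL did eventually get wave intrinsics (Shader Model 6.0), which are roughly the equivalent of CUDA's vote/ballot functions. A hypothetical sketch of how an append could be lowered to one atomic per wave with them might look like this (gCounter and gData are made-up names; this illustrates the idea, not what any particular driver actually does):

```hlsl
RWStructuredBuffer<uint> gCounter : register(u0); // single uint, cleared to 0 before dispatch
RWStructuredBuffer<uint> gData    : register(u1);

// Append one value per calling lane, using a single atomic per wave.
void WaveAppend(uint item)
{
    uint laneRank  = WavePrefixCountBits(true); // how many earlier active lanes are also appending
    uint laneTotal = WaveActiveCountBits(true); // how many lanes in this wave are appending
    uint waveBase  = 0;
    if (WaveIsFirstLane())
        InterlockedAdd(gCounter[0], laneTotal, waveBase);
    waveBase = WaveReadLaneFirst(waveBase);     // broadcast the reserved base offset to the wave
    gData[waveBase + laneRank] = item;
}

[numthreads(64, 1, 1)]
void CSWaveAppendExample(uint3 dtid : SV_DispatchThreadID)
{
    if ((dtid.x & 1) == 0)   // arbitrary predicate: only some lanes append
        WaveAppend(dtid.x);
}
```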


There's a lot of other ways you could do this though. For example, is there an advantage to using append buffers over InterlockedIncrement'ing an index value?
A lot of GPUs don't actually implement Append/Consume buffers as a hardware feature. On these GPUs, the driver will be doing exactly this: pairing a buffer with an atomic (interlocked) counter. If you tested on one of those GPUs, you'd probably find no difference in performance at all.

However, some other GPUs might have special hardware that allows an optimized implementation of append/consume buffers, so the D3D abstraction leaves room for those GPUs to do their thing.

The only catch with an append buffer is that you can only append one element at a time. This can be wasteful if a single thread decides to append multiple elements, since a lot of hardware implements an append buffer by performing atomic increments on a "hidden" counter variable. For such cases, you can potentially get better performance (as well as better data coherency) by performing a single atomic add in order to "reserve" multiple elements in the output buffer.
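A rough sketch of that "reserve a range with one atomic" idea, with hypothetical names (gCount, gOut, ItemCount, MakeItem) and a placeholder workload:

```hlsl
struct Item { float3 pos; float value; };

RWStructuredBuffer<uint> gCount : register(u0); // single uint, cleared to 0 before dispatch
RWStructuredBuffer<Item> gOut   : register(u1);

// Placeholder per-thread workload: how many items this thread emits, and what they are.
uint ItemCount(uint id) { return id % 8; }
Item MakeItem(uint id, uint i) { Item it; it.pos = float3(id, i, 0); it.value = 1.0f; return it; }

[numthreads(64, 1, 1)]
void CSEmitMany(uint3 dtid : SV_DispatchThreadID)
{
    uint numItems = ItemCount(dtid.x);

    // One InterlockedAdd per thread reserves a contiguous range, instead of one
    // hidden-counter increment per Append() call.
    uint base = 0;
    if (numItems > 0)
        InterlockedAdd(gCount[0], numItems, base);

    // Items written by the same thread end up adjacent in the output buffer.
    for (uint i = 0; i < numItems; ++i)
        gOut[base + i] = MakeItem(dtid.x, i);
}
```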

This topic is closed to new replies.
