Direct3D 11 Programming Tip #9: Append and Consume Buffers
It has been quite some time since I wrote my last Direct3D 11 Programming Tip, and it's about time to add another one...
Many interesting and useful features have been added to Direct3D 11. As far as the available resources go, one can argue that the Append/Consume structured buffer is one of the most interesting choices due to the fact that it works very nicely with parallel reading and writing of data - and hence is perfect for use in many compute shader programs. Even though they are so useful, these buffer types are somewhat under-documented in general. Several searches don't turn up very many tutorials or samples. After several long hours spent debugging an Append/Consume based algorithm, I thought I would share some of my own findings and experiences, and give some general tips for people to get started.
What Are Append/Consume Buffers?
Before jumping into the details, we should quickly cover the basics of what the Append and Consume buffers actually are. In D3D11, the only resources available to the developer are textures (1D, 2D, and 3D) and buffers. There is actually only a single buffer type, but that type has a fairly diverse set of options for configuring the buffer exactly how you want it. In the case of append and consume buffers, we are interested in creating a structured buffer. A structured buffer is created by setting the structured buffer flag in the miscellaneous flags of the buffer description. The description would be configured like the following:
[source lang="cpp"]
D3D11_BUFFER_DESC desc;
desc.ByteWidth = count * structsize;
desc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
desc.StructureByteStride = structsize;
desc.Usage = D3D11_USAGE_DEFAULT;
desc.CPUAccessFlags = 0;
[/source]
It is fairly straightforward to create the buffer. The count variable is the number of structures we want to have in our resource, and the structsize provides the size in bytes for the structure. We add the shader resource and unordered access bind flags, since they will be required for using the append / consume functionality and subsequently reading data out of the buffer. In addition, the default usage is needed to give read/write access to the GPU.
After creating the resource, we need the appropriate resource views to connect it to the pipeline. Append and consume buffers are actually just structured buffers whose contents are manipulated through specially created unordered access views. These UAVs are also pretty easy to set up, and would utilize a description structure such as the following:
[source lang="cpp"]
D3D11_BUFFER_UAV uav;
uav.FirstElement = 0;
uav.NumElements = count;
uav.Flags = D3D11_BUFFER_UAV_FLAG_APPEND;
[/source]
That's all it takes - now when you bind our buffer resource to the pipeline with our UAV, it can be declared and used in HLSL as either an AppendStructuredBuffer<T> or a ConsumeStructuredBuffer<T>, where T is the structure type definition that will be used. The same UAV can be used for both HLSL resource object types, but of course not at the same time.
How Do They Work?
A shader program uses these two resource object types to provide special data access methods. The AppendStructuredBuffer<T> provides the .Append(T) method, and the ConsumeStructuredBuffer<T> provides the T .Consume() method. These methods allow functionality similar to a stack or a queue, although in this parallel context there is no preservation of ordering. This allows many threads to be reading from or writing to the buffer simultaneously. As long as you design your algorithm appropriately, this provides a great mechanism for processing large amounts of data when their stored order doesn't matter. One thread can consume a structure from one buffer, then process it, then append it to an output buffer.
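To make the consume, process, append pattern a little more concrete, here is a minimal CPU-side sketch. It is not Direct3D code and it is not how the hardware is specified to work. It is only a model I find helpful, in which a counter hands out slots. The names AppendConsumeModel and runPass are made up for this illustration, and the loop in runPass stands in for GPU threads that would really run in parallel.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side stand-in for an append/consume buffer (not D3D11 code).
template <typename T>
class AppendConsumeModel {
public:
    explicit AppendConsumeModel(std::size_t capacity, std::uint32_t initialCount = 0)
        : data_(capacity), count_(initialCount) {}

    // Like Append(): claim the next free slot, then write into it.
    void append(const T& value) {
        std::uint32_t slot = count_.fetch_add(1);
        data_.at(slot) = value;  // at() throws if the model overflows
    }

    // Like Consume(): give up the last occupied slot and read it.
    T consume() {
        std::uint32_t slot = count_.fetch_sub(1) - 1;
        return data_.at(slot);   // at() throws if the model underflows
    }

    std::uint32_t count() const { return count_.load(); }

private:
    std::vector<T> data_;
    std::atomic<std::uint32_t> count_;
};

// One "dispatch": each simulated thread consumes one element from the input,
// processes it, and appends the result to the output.
template <typename T, typename F>
void runPass(AppendConsumeModel<T>& input, AppendConsumeModel<T>& output,
             std::uint32_t threadCount, F process) {
    for (std::uint32_t i = 0; i < threadCount; ++i) {
        output.append(process(input.consume()));
    }
}
```

Because each call only claims a slot from the counter, nothing in this model ties a particular thread to a particular element, which is why the stored order cannot be relied upon. Unlike this model, a real shader gives no exception when the thread count and the element count disagree.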
There is a special hidden structure count that determines how many structure elements have been appended to the buffer and how many have been consumed from it. This count is actually just a number to allow the application to figure out what your shaders have done after a particular pipeline execution, since the resource itself doesn't change the number of elements it can hold - it is just a way to keep track of what number of elements are currently in the buffer according to the append/consume paradigm. This hidden structure count can be initialized to a desired value when the UAV is bound to the pipeline by supplying an initial count value. However, if you want to just use the current value of the internal count you can simply pass a -1 as the initial count argument.
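The initial count behavior can be sketched with a tiny CPU-side model. This is an illustration only, not part of the D3D11 API. HiddenCounterModel and kKeepCurrentCount are names I made up here. The one detail worth noticing is that the initial count is passed as an unsigned value, so the -1 mentioned above is the bit pattern 0xFFFFFFFF.

```cpp
#include <cstdint>

// -1 converted to an unsigned 32-bit count is 0xFFFFFFFF.
constexpr std::uint32_t kKeepCurrentCount = static_cast<std::uint32_t>(-1);

// Hypothetical CPU-side model of the hidden structure count (not D3D11 code).
struct HiddenCounterModel {
    std::uint32_t count;

    HiddenCounterModel() : count(0) {}

    // Stands in for binding the UAV with an initial count argument:
    // any normal value resets the counter, the special value keeps it.
    void bind(std::uint32_t initialCount) {
        if (initialCount != kKeepCurrentCount) {
            count = initialCount;
        }
    }

    void append()  { ++count; }  // effect of one Append() on the counter
    void consume() { --count; }  // effect of one Consume() on the counter
};
```

In this model, binding with 0 empties the buffer from the counter's point of view, while binding with kKeepCurrentCount lets a second pass pick up exactly where the previous pass left off.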
You may have noticed that I called it a hidden structure count - so what good is it if the count is hidden? Well, it isn't really hidden completely - it is just hidden from HLSL code. An application gains access to the count by utilizing the device context's copy structure count method:
[source lang="cpp"]
void CopyStructureCount(
    ID3D11Buffer *pDstBuffer,
    UINT DstAlignedByteOffset,
    ID3D11UnorderedAccessView *pSrcView );
[/source]
This method simply copies the structure count to the specified buffer resource. The target buffer can then later be used for a number of different actions: it can be read back to the CPU, used as a constant buffer, or even used to control pipeline execution with one of the indirect rendering methods. The only real requirement is that the destination buffer be properly configured for the desired use.
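As one example of the indirect rendering case, the count could be copied straight into the arguments that DrawInstancedIndirect reads, which are the same four UINT values you would pass to DrawInstanced. The sketch below only works out the layout and the byte offset on the CPU. The struct and helper names are hypothetical mirrors written for this illustration, not D3D11 types, and I am assuming a particle-style setup where each appended element is drawn as one vertex.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the four UINT arguments read by DrawInstancedIndirect.
struct DrawInstancedIndirectArgs {
    std::uint32_t vertexCountPerInstance;  // overwritten by the structure count
    std::uint32_t instanceCount;
    std::uint32_t startVertexLocation;
    std::uint32_t startInstanceLocation;
};

static_assert(sizeof(DrawInstancedIndirectArgs) == 16,
              "four tightly packed 32-bit values are expected");

// Initial contents of the argument buffer: one instance, zero vertices.
// The vertex count is a placeholder until the structure count is copied in.
inline DrawInstancedIndirectArgs makeInitialArgs() {
    DrawInstancedIndirectArgs args = {0u, 1u, 0u, 0u};
    return args;
}

// Byte offset to supply as DstAlignedByteOffset so that the copied count
// lands in the vertex count field.
constexpr std::uint32_t kVertexCountOffset = static_cast<std::uint32_t>(
    offsetof(DrawInstancedIndirectArgs, vertexCountPerInstance));
```

For this to work in a real application, the destination buffer would also need to be created with the D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS flag, which is part of the "properly configured" requirement mentioned above.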
So Where is the Programming Tip?
All of this sounds really cool, right? So what do we need a programming tip for? The truth is, this mechanism allows moving more control logic from the CPU to the GPU. This shift of responsibility can produce some potential problems while using the append and consume buffers, since the control is performed on the GPU and isn't directly visible while debugging. For example, if used in a particle system, the application may not know how many particles are in the system at any given time. I have personally run into these types of problems, where the buffers honestly seem possessed and there are some seemingly illogical results. However, there are some simple steps to help minimize this type of pain. So here they are, in no particular order:
- Pay attention to the initialization of the hidden structure counts! Since the initialization counts are provided in the binding method call, it can be somewhat inconvenient to perform individual initialization steps. However, the count is critically important to have correct, so add the extra code step to properly initialize the buffers.
- Know that the counter can over- and under-flow! The counter is a dumb counter - it will increment when append is called and decrement when consume is called, regardless of whether that fits into your algorithm. It is essential to understand completely how many threads will be using your buffers, and how the count will be moving between each pipeline execution call.
- It is very helpful to get the structure count to the CPU for debugging! This can be accomplished by copying the structure count to a buffer that was created with the staging usage type, which can then be mapped to the CPU memory space. The value can then be logged or displayed on screen.
- Tip #3 has some caveats! The mapping of the count back to the CPU is slow since the value has to be read back from the GPU. During debug this works fine, but be sure to disable these copies when you are done debugging. This means that you have to validate your algorithm first, then remove the debugging calls. You can also use a larger buffer to store the results in, and sequentially increment the storage location for each call to copy structure count. This will create somewhat of a log that can be retrieved after the fact, which should have a much lower impact on performance.
- Tip #3 has lots of caveats! The mapping only works on the immediate context. This means you either have to develop your algorithm in single threaded mode (i.e. without deferred contexts) or you could copy the count to the buffer in a deferred context, and then map the buffer on the immediate context after the corresponding command list has been executed.
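The logging idea from the tips above can be sketched on the CPU side as follows. The first helper computes the byte offset to hand to the copy structure count call for each successive copy, and the second scans the values after they have been read back. Both function names are hypothetical. The check assumes that any count larger than the number of elements in the buffer points to an overflow or to an underflow that wrapped around to a huge unsigned value. I have not verified that every driver behaves that way, so treat it as a debugging heuristic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte offset for the Nth logged copy into a buffer that holds slotCount
// 32-bit counts. The log wraps around once every slot has been used.
inline std::uint32_t logSlotOffset(std::uint32_t callIndex, std::uint32_t slotCount) {
    assert(slotCount > 0);
    return static_cast<std::uint32_t>((callIndex % slotCount) * sizeof(std::uint32_t));
}

// Returns the indices of logged counts that exceed the buffer capacity.
// Assumption: such values indicate overflow or a wrapped underflow.
inline std::vector<std::size_t> findSuspiciousCounts(
    const std::vector<std::uint32_t>& loggedCounts, std::uint32_t capacity) {
    std::vector<std::size_t> suspicious;
    for (std::size_t i = 0; i < loggedCounts.size(); ++i) {
        if (loggedCounts[i] > capacity) {
            suspicious.push_back(i);
        }
    }
    return suspicious;
}
```

With a log like this, the staging buffer only needs to be mapped once after many pipeline executions, and the index of the first suspicious entry tells you which execution to look at.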
I have found each of these to be quite useful in the course of developing some of the sample applications for our upcoming book: Practical Rendering and Computation with Direct3D 11. If you find them useful and/or would like to support this type of development work, please consider taking a look at the book when it comes out!