State vs Stateless: Designing a Modern GPU Interface

14 comments, last by Jman2 5 years, 10 months ago

Hello

State-based render architectures have many problems, such as state leakage and naively setting every state on each draw call. A lot of different sources recommend a stateless rendering architecture instead, which makes sense for DX12 as it uses a single bound object, the PSO.

Take a look at the following:

Designing a Modern GPU Interface

Stateless, Layered, Multi-Threaded Rendering, Part 2: Stateless API Design

Firaxis Lore System: CIV V

Is this not causing the same problem, though? You are passing all the state commands within a DrawCommand object, to be set during the draw call. Yes, you are hiding the state machine by not exposing the functions directly, but you are just deferring the state changes to the command queue.

You can sort by key using this method:

Real Time Collision: Draw Call Key

But that means each DrawCommand is passing in the entire PSO structure (i.e. the states you want) with each command and storing it, just for you to sort by the key and elect the first object in the group to bind its PSO for the rest to use. It seems like a lot of wasted memory to pass in all the PSOs just to use one, although it does prevent any slowdown from swapping the PSO for every single object.
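For reference, the sort-key idea from the Real Time Collision article can be sketched as packing state IDs into a single integer. This is a minimal illustration with a made-up field layout (layer, PSO ID, material ID, depth), not the article's exact key:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-bit sort key: [layer:8][psoId:16][materialId:16][depth:24].
// Sorting draws by this key groups draws that share a PSO/material together,
// so the PSO only needs to be bound once per group.
uint64_t MakeSortKey(uint8_t layer, uint16_t psoId, uint16_t materialId, uint32_t depth24)
{
    return (uint64_t(layer)      << 56) |
           (uint64_t(psoId)      << 40) |
           (uint64_t(materialId) << 24) |
           (uint64_t(depth24) & 0xFFFFFF);
}

// Recover the PSO ID so the submit loop can detect PSO-group boundaries.
uint16_t PsoIdFromKey(uint64_t key)
{
    return uint16_t(key >> 40);
}
```

Because the layer occupies the highest bits, sorting the keys numerically sorts by layer first, then by PSO, then by material, then by depth.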

How are you handling state changes? Am I missing some critical piece of information about stateless? (Note: I am aiming for the stateless method for DX12, I just want some opinions on it. :))

Thanks.

11 hours ago, Jemme said:

But that means each DrawCommand is passing in the entire PSO structure (i.e. the states you want) with each command and storing it, just for you to sort by the key and elect the first object in the group to bind its PSO for the rest to use. It seems like a lot of wasted memory to pass in all the PSOs just to use one, although it does prevent any slowdown from swapping the PSO for every single object.

I wrote the system in your "Designing a Modern GPU Interface" link :) My draw items have an 8 byte header (containing PSO data) followed by a resource binding table -- 8 bytes for the input assembler bindings (though this should really be smaller), 2 bytes per cbuffer, 1 byte per dynamic sampler state, 2 bytes per group of textures. Most draw items end up under 32 bytes. If you put the items themselves into a queue, then the queue's size is ~8 to 80 bytes * num items. If you put pointers to the items into the queue, then there's 8 bytes per item (size of a pointer).

If you have a stateful rendering API, but still use a sorting queue, you probably still put pointers to some kind of "drawable" into that queue, so again, 8 bytes per item of queue storage space is required ;) 

The other comparison to keep in mind, is that objects in a stateful graphics engine still need to store their PSO/texture/buffer pointers somewhere. When a tree draws itself, it needs to bind the "bark" and "leaves" textures to the GPU, regardless of whether you're using a stateful or stateless abstraction. The most straightforward way is for each tree model instance to contain a pointer to a material, which itself contains two texture pointers ("bark" and "leaves"). That's 8 bytes (1 pointer) per model instance for the material pointer, and then 16 bytes (2 pointers) within the shared material.
With my draw items, each tree draw-item contains a 2-byte resource-list ID, referencing a shared resource list that contains two 2-byte texture ID's ("bark" and "leaves").
In that made-up comparison, my stateless API actually uses less memory than the stateful system based on pointers :)
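To make the sizes above concrete, here is a rough sketch of what such a compact draw item might look like. The field names, bit widths, and layout are invented for illustration; the point is that packed small IDs keep the whole item well under 32 bytes:

```cpp
#include <cstdint>

#pragma pack(push, 1)
// Hypothetical 8-byte header of packed pipeline-state IDs.
struct DrawItemHeader
{
    uint64_t blendId  : 7;   // index into the device's blend-state pool
    uint64_t depthId  : 7;   // depth-stencil state pool index
    uint64_t rasterId : 7;   // raster state pool index
    uint64_t shaderId : 16;  // shader program / PSO pool index
    uint64_t drawType : 3;   // indexed, instanced, etc.
    uint64_t reserved : 24;
};

// Hypothetical draw item: header + small resource IDs instead of pointers.
struct SmallDrawItem
{
    DrawItemHeader header;   // 8 bytes
    uint64_t iaBindings;     // 8 bytes: packed vertex/index buffer IDs
    uint16_t cbuffers[2];    // 2 bytes per constant buffer
    uint16_t resourceList;   // 2 bytes: ID of a shared texture list
};
#pragma pack(pop)

static_assert(sizeof(DrawItemHeader) == 8, "header should stay at 8 bytes");
static_assert(sizeof(SmallDrawItem) <= 32, "draw items should stay small");
```

Compare that against a pointer-based drawable, where each texture/buffer pointer alone costs 8 bytes on a 64-bit platform.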

12 hours ago, Jemme said:

Is this not causing the same problem, though? You are passing all the state commands within a DrawCommand object, to be set during the draw call. Yes, you are hiding the state machine by not exposing the functions directly, but you are just deferring the state changes to the command queue.

It solves the problem of state leakage, because every draw-item has a full description of all pipeline-state/resources. There is no way for states/resources from an earlier draw to accidentally be applied to a later draw.

To avoid re-setting states on each draw, I rely heavily on XOR operations. E.g. in D3D11 you have blend, depth-stencil, and raster states. I represent these with small ID's, which are all packed into the 8-byte draw header. If I XOR the current draw item with the previous one and then check if the bits that represent these ID's are zero, then I know whether those pipeline states have changed or not. There's some similar tricks used for resources -- e.g. CBuffer resources are represented as 2-byte IDs, and using SSE intrinsics, you can XOR an array of these ID's in one CPU instruction to very quickly identify if the cbuffer bindings need to be updated or can be skipped.
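A minimal sketch of that XOR trick, assuming a made-up bit layout for the packed header (the real field positions and widths are implementation details):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical masks for where each state ID lives in the 8-byte header.
constexpr uint64_t kBlendMask  = 0x7Full << 0;   // bits 0..6
constexpr uint64_t kDepthMask  = 0x7Full << 7;   // bits 7..13
constexpr uint64_t kRasterMask = 0x7Full << 14;  // bits 14..20

// XOR the previous and current headers, then mask out the bits belonging to
// one state. A zero result means that state is unchanged and the bind can be
// skipped; a non-zero result means it must be (re)bound.
bool BlendChanged(uint64_t prevHeader, uint64_t curHeader)
{
    return ((prevHeader ^ curHeader) & kBlendMask) != 0;
}
```

The same single XOR result can be tested against each mask in turn, so one instruction answers "did anything change?" for several states at once.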

In a stateful API:
* the best case is that the rendering logic is very carefully structured by a human, in a way where the minimal amount of pipeline/resource binding updates occur thanks to careful reasoning. However, this is hard to maintain over time when changing the rendering code...
* the worst case is that, as strange bugs start occurring, the rendering programmers get paranoid and redundantly set every single state per draw anyway... and then add a redundant-call filter (e.g. if the new blend state == previous, do nothing). However, these will typically be slower than the centralized/optimized/small redundant-state-filtering code that's at the heart of a stateless API :)

Ah okay, that makes more sense. I use handles, so I guess the PSO can be stored and referenced that way. For the XOR, does that mean you're making common states like:

Microsoft Common States

So you can reference them in the drawitem like:

drawItem.blend = BLEND_ALPHA;

That would work and keep the size down, stop leakage, and save time on state swapping.

Thanks for the clarification so far :)

19 hours ago, Jemme said:

does that mean you're making common states ... so you can reference them in the drawitem like: drawItem.blend = BLEND_ALPHA;
That would work and keep the size down, stop leakage, and save time on state swapping.

If you go this way, then yeah, you can use really small state identifiers. Many games probably only need four or fewer blend states, which is just 2 bits in your state header! I have seen this technique used to great success in proprietary console game engines before.

I do something similar but a bit more generic / general purpose ( / bloated). Unique states are cached in a hash-map, and given ID's at runtime, instead of compile-time ID's like in your example. IIRC I'm currently using 7 bits for a blend ID, allowing up to 128 unique blend modes, which, frankly is overkill :) 
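A minimal sketch of that runtime caching idea, using a simplified, hypothetical BlendDesc (the real description struct would mirror the API's blend state):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified stand-in for an API blend-state description.
struct BlendDesc { uint8_t srcFactor, dstFactor, op; };

// Unique descriptions are cached in a hash map and assigned small IDs at
// runtime; identical descriptions always resolve to the same ID, so the ID
// fits in a few bits of the draw header.
class BlendStateCache
{
public:
    uint8_t GetId(const BlendDesc& d)
    {
        uint32_t key = d.srcFactor | (d.dstFactor << 8) | (d.op << 16);
        auto it = idByKey.find(key);
        if (it != idByKey.end())
            return it->second;                // already cached: reuse ID
        uint8_t id = uint8_t(states.size());  // assumes < 128 unique states
        states.push_back(d);
        idByKey.emplace(key, id);
        return id;
    }
private:
    std::unordered_map<uint32_t, uint8_t> idByKey;
    std::vector<BlendDesc> states;
};
```

The trade-off versus compile-time enums is exactly as described above: more generality (any state created at runtime gets an ID) at the cost of a wider ID field.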

Quick question Hodgman: in order to support the simple <8-bit handles, are you using some kind of aggregating resource database structure that sits alongside a set of draw items? Is that just a set of arrays (like, are the handles just indices into that local array?).

I seem to remember looking at some code you'd published a while back and it seemed like you were aggregating all the draw items / resource aggregation into a single data stream.

8 hours ago, ZachBethel said:

Quick question Hodgman, in order to support the simple <8 bit handles, are you using some kind of aggregating resource database structure that sits alongside a set of draw items? Is that just a set of arrays (like, are the handles just indices into that local array?).

Pretty much, yeah. The device itself contains an array/pool of blend/depth-stencil/raster states and an array of shader programs (or, alternatively an array of PSOs), and the draw headers index into those arrays. The device also has pools of SRVs/UAVs/CBVs, and resource ID's are just indices into those pools (kind of like ECS where the device is the "texture system", a TextureID is an entity, and an SRV/UAV are components). There's also a pool of resource-lists, which are themselves just arrays of resource-ID's.
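The pool/handle arrangement described above can be sketched like this; ApiBlendState and the single-pool device are placeholders for whatever native objects and pools the real device holds:

```cpp
#include <cstdint>
#include <vector>

// Stand-in for a native API object (e.g. an ID3D11BlendState pointer).
struct ApiBlendState { int nativeObject; };

// A handle is nothing more than an index into the matching pool.
using BlendHandle = uint16_t;

class RenderDevice
{
public:
    // Creating a state appends it to the pool and hands back its index.
    BlendHandle CreateBlendState(ApiBlendState s)
    {
        blendPool.push_back(s);
        return BlendHandle(blendPool.size() - 1);
    }

    // Resolving a handle is a single array lookup.
    const ApiBlendState& Resolve(BlendHandle h) const { return blendPool[h]; }

private:
    std::vector<ApiBlendState> blendPool;
};
```

A real device would hold one such pool per resource type (blend/depth-stencil/raster states, shaders or PSOs, SRVs/UAVs/CBVs, resource lists), with the draw header's small IDs indexing into them.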

8 hours ago, ZachBethel said:

I seem to remember looking at some code you'd published a while back and it seemed like you were aggregating all the draw items / resource aggregation into a single data stream.

One nice thing I've found with stateless is you can change a lot of how it works behind the scenes without changing the API :)

In our first implementation which shipped on one XbOne/PS4 game, the back-end was actually a stateful VM that consumed a stream of command packets. Each command had an 8 bit header identifying the type of command (set blend state, set shader, draw, etc) followed by a variable amount of data depending on the command type.
When submitting a list of draw-items, there was a layer that converted them into a stream of commands (by finding the full set of commands required for each draw-item, and then efficiently filtering out redundant commands). The nice thing about this was that it could be largely multi-core, even on old APIs like D3D9 -- just the final VM loop had to be on the actual D3D thread... After the draw-item -> command conversion stage, this did produce a single, linear, condensed stream of memory for the render thread to consume, which is nice.

However, on the next game we had to support Xb360/PS3, and also wanted much better performance, so I put a lot of work into optimization... I found that by converting commands into small ID's, I could actually get the draw-items to be small enough to get rid of the entire "command stream"/VM concept altogether -- a complete re-architecture of the back-end with very minimal changes to the API :D
 Now to submit a collection of draw-items (which themselves are variable size) to the back-end, you can either pass it a compacted stream of draw-items (each immediately after the previous one in memory), or, you can just send it an array of pointers to draw-items (which has much worse memory accesses for the back-end, but makes life a lot easier for the layers that produce lists of draw-items, as they only need to deal with lists of pointers).

10 hours ago, Hodgman said:

The device itself contains an array/pool of blend/depth-stencil/raster states and an array of shader programs (or, alternatively an array of PSOs), and the draw headers index into those arrays. The device also has pools of SRVs/UAVs/CBVs, and resource ID's are just indices into those pools

Are you storing all your data on the RenderDevice? For example, let's say I have a Mesh which needs a vertex buffer and an index buffer. Are you just storing them as handles, like a VertexBufferHandle, inside the mesh, but creating the actual buffers on the device, such that:
 


void Init(RenderDevice* device, char* data) //function in mesh?
{
    //Load data into some internal representation like MeshData

    VertexBufferDesc desc; //agnostic desc, NOT GL or DX
    //fill in desc using MeshData
    device->CreateBuffer(desc, &vertexBufferHandle);
}

Then when you submit your DrawItem, you're just passing in all the handles for the vertex, index, and constant buffers for the RenderDevice to fetch and set from its pools? You would think the fetch via a handle would be slower than just a pointer chase, but the cache usage could be better?

Do you have any suggestions on how to handle texture bindings? Should I even bother to minimize the number of swaps? I.e. in deferred rendering, I could keep the gbuffer textures bound to targets 0-3 throughout the rest of the rendering phase, but that requires making sure all shaders are bound to them correctly, and shaders that don't need them use the remaining targets.

Or I could assume that every time I switch shaders all textures have to be re-bound, which simplifies things greatly.

I'm targeting WebGL by the way, so no resource lists, and many of these bitwise optimizations are hard to apply there.

On 6/13/2018 at 8:14 PM, Jemme said:

Are you storing all your data on the RenderDevice? For example, let's say I have a Mesh which needs a vertex buffer and an index buffer. Are you just storing them as handles, like a VertexBufferHandle, inside the mesh, but creating the actual buffers on the device, such that:

Then when you submit your DrawItem, you're just passing in all the handles for the vertex, index, and constant buffers for the RenderDevice to fetch and set from its pools? You would think the fetch via a handle would be slower than just a pointer chase, but the cache usage could be better?

Yes, exactly.

Yes, the handle lookups involve pool[handle].pointer->, instead of just pointer-> which is an extra layer of indirection. This adds an extra cache-miss penalty if the pool isn't present in the cache. If the pool is present in the cache, then the cost of this extra indirection is negligible. My main priority is to keep the draw-items themselves as small as possible, which lets more of them fit into the cache. It's adding a performance problem in one area to reduce a problem elsewhere :| 

42 minutes ago, d07RiV said:

Do you have any suggestions on how to handle texture bindings? Should I even bother to minimize the number of swaps?

Reducing state changes helps a lot on the CPU-side, as GL/D3D calls can be relatively expensive, especially ones that interact with resource management. On the GPU side, changing states/resource-bindings constantly can also be a performance issue if your draws aren't big enough.

42 minutes ago, d07RiV said:

Or I could assume that every time I switch shaders all textures have to be re-bound, which simplifies things greatly.

Assuming you don't use too many different shaders, this can be a decent sacrifice. Slightly less accurate redundancy filtering, but much simpler/faster code :) 

42 minutes ago, d07RiV said:

I'm targeting WebGL by the way, so no resource lists, and many of these bitwise optimizations are hard to apply there.

Yeah, in C++ you can do 128-bit logical operations, but in JavaScript I guess you're limited to 32-bit logical operations? You should still be able to do bitwise stuff, just not on as many bits at once...
My resource-list stuff was inspired by Mantle/D3D12/Vulkan, but it's still very useful as far back as D3D9/GL too :) I allow shaders to define 8 resource lists, which is 8 x 16-bit IDs, or a single 128-bit SSE register (or 4 JavaScript 32-bit integers?). I XOR these (with a single SSE XOR) to quickly tell if any resources need to be rebound. Once a dirty/changed resource-list binding is detected, I check the actual texture bindings within that list for changes.
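A portable sketch of that 128-bit comparison (written with two 64-bit XORs so it also maps to the 32-bit-integer approach; with SSE this collapses to a single _mm_xor_si128 plus a zero test):

```cpp
#include <cstdint>
#include <cstring>

// 8 x 16-bit resource-list IDs occupy 16 bytes. XOR-comparing them as two
// 64-bit words tells us in a couple of operations whether ANY binding
// changed; only then do we inspect the individual lists.
bool ResourceListsChanged(const uint16_t prev[8], const uint16_t cur[8])
{
    uint64_t a[2], b[2];
    std::memcpy(a, prev, 16);  // memcpy avoids aliasing/alignment issues
    std::memcpy(b, cur, 16);
    return ((a[0] ^ b[0]) | (a[1] ^ b[1])) != 0;
}
```

In JavaScript/WebGL the same idea works with 4 (or 8) 32-bit integer XORs per draw, which is still far cheaper than comparing every texture binding individually.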

Thanks, I'm still not sure how much abstraction I need, since the API is always going to be the same.

Another thing: when you put all the passes in the same shader file, do you run a lexer on them, or do you just feed everything to the shader compiler and let it figure out what to optimize away? The former option would let us know which options affect which passes, so we don't have to make redundant copies (instead of having to manually specify them for every pass).

edit: I guess this is partially answered by the bonus slides.

