Jemme

State vs Stateless: Designing a Modern GPU Interface



Hello

State-based render architectures have many problems, such as state leakage and naively setting every state on each draw call. A lot of different sources recommend a stateless rendering architecture, which makes sense for DX12 since it binds most pipeline state through a single object, the PSO.

Take a look at the following:

Designing a Modern GPU Interface

stateless-layered-multi-threaded-rendering-part-2-stateless-api-design

Firaxis Lore System: CIV V

Doesn't this cause the same problem, though? You are passing all the state commands within a DrawCommand object for them to be set during the draw call. Yes, you are hiding the state machine by not exposing the functions directly, but you are just deferring the state changes to the command queue.

You can sort by key using this method:

Real Time Collision: Draw Call Key

But that means each DrawCommand is passing in the entire PSO structure (i.e. the states you want) with each command and storing it, just for you to sort by the key and elect the first object in each group to bind its PSO for the rest of the group to use. It seems like a lot of wasted memory to pass in all the PSOs just to use one, although it does prevent any slowdown from swapping PSOs for every single object.

How are you handling state changes? Am I missing some critical piece of information about stateless design? (Note: I am aiming for the stateless method for DX12, I just want some opinions on it :))

Thanks.


11 hours ago, Jemme said:

But that means each DrawCommand is passing in the entire PSO structure (i.e. the states you want) with each command and storing it, just for you to sort by the key and elect the first object in each group to bind its PSO for the rest of the group to use. It seems like a lot of wasted memory to pass in all the PSOs just to use one, although it does prevent any slowdown from swapping PSOs for every single object.

I wrote the system in your "Designing a Modern GPU Interface" link :) My draw items have an 8 byte header (containing PSO data) followed by a resource binding table -- 8 bytes for the input assembler bindings (though this should really be smaller), 2 bytes per cbuffer, 1 byte per dynamic sampler state, 2 bytes per group of textures. Most draw items end up under 32 bytes. If you put the items themselves into a queue, then the queue's size is ~8 to 80 bytes * num items. If you put pointers to the items into the queue, then there's 8 bytes per item (size of a pointer).
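To make that concrete, here's a rough sketch of what such a packed draw-item could look like (illustrative only; the field names and exact bit counts are guesses based on the description above, not the actual engine code):

#include <cstdint>

struct DrawItemHeader            // 8-byte header of packed pipeline-state IDs
{
    uint64_t blendStateId  : 7;  // index into the device's blend-state pool
    uint64_t depthStateId  : 7;  // index into the depth-stencil-state pool
    uint64_t rasterStateId : 7;  // index into the raster-state pool
    uint64_t shaderId      : 12; // shader program / PSO index
    uint64_t primitiveType : 3;
    uint64_t numCbuffers   : 4;  // number of 2-byte cbuffer IDs that follow
    uint64_t numSamplers   : 4;  // number of 1-byte sampler-state IDs that follow
    uint64_t numResLists   : 4;  // number of 2-byte resource-list IDs that follow
    uint64_t unused        : 16;
};
static_assert(sizeof(DrawItemHeader) == 8, "header must stay 8 bytes");

// A full draw-item is the header followed by a variable-length binding table:
//   [DrawItemHeader][IA bindings][cbuffer IDs][sampler IDs][resource-list IDs]
// which is how a typical item stays under 32 bytes.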

If you have a stateful rendering API, but still use a sorting queue, you probably still put pointers to some kind of "drawable" into that queue, so again, 8 bytes per item of queue storage space is required ;) 

The other comparison to keep in mind, is that objects in a stateful graphics engine still need to store their PSO/texture/buffer pointers somewhere. When a tree draws itself, it needs to bind the "bark" and "leaves" textures to the GPU, regardless of whether you're using a stateful or stateless abstraction. The most straightforward way is for each tree model instance to contain a pointer to a material, which itself contains two texture pointers ("bark" and "leaves"). That's 8 bytes (1 pointer) per model instance for the material pointer, and then 16 bytes (2 pointers) within the shared material.
With my draw items, each tree draw-item contains a 2-byte resource-list ID, referencing a shared resource list that contains two 2-byte texture ID's ("bark" and "leaves").
In that made-up comparison, my stateless API actually uses less memory than the stateful system based on pointers :)
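As a back-of-the-envelope sketch of that comparison (type names invented for the example):

#include <cstdint>

struct Texture;                      // opaque GPU texture object

// Stateful / pointer-based: 8 bytes per tree instance, 16 bytes in the shared material.
struct Material  { Texture* bark; Texture* leaves; };     // 2 x 8-byte pointers
struct TreeModel { Material* material; /* ...mesh data... */ };

// Stateless / ID-based: 2 bytes per draw-item, 4 bytes in the shared resource list.
struct ResourceList { uint16_t textureIds[2]; };          // 2 x 2-byte texture IDs
struct TreeDrawItem { uint16_t resourceListId; /* ...rest of the draw-item... */ };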

12 hours ago, Jemme said:

Doesn't this cause the same problem, though? You are passing all the state commands within a DrawCommand object for them to be set during the draw call. Yes, you are hiding the state machine by not exposing the functions directly, but you are just deferring the state changes to the command queue.

It solves the problem of state leakage, because every draw-item has a full description of all pipeline-state/resources. There is no way for states/resources from an earlier draw to accidentally be applied to a later draw.

To avoid re-setting states on each draw, I rely heavily on XOR operations. E.g. in D3D11 you have blend, depth-stencil, and raster states. I represent these with small ID's, which are all packed into the 8-byte draw header. If I XOR the current draw item with the previous one and then check if the bits that represent these ID's are zero, then I know whether those pipeline states have changed or not. There's some similar tricks used for resources -- e.g. CBuffer resources are represented as 2-byte IDs, and using SSE intrinsics, you can XOR an array of these ID's in one CPU instruction to very quickly identify if the cbuffer bindings need to be updated or can be skipped.
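A minimal sketch of that XOR test (the mask value is an assumption tied to wherever the blend/depth/raster IDs happen to sit inside the 8-byte header):

#include <cstdint>

constexpr uint64_t kPipelineStateMask = 0x1FFFFF;   // bits covering the three state IDs (assumed layout)

// Returns true if any of the blend/depth-stencil/raster IDs differ between
// the previous and current draw-item headers, i.e. those states need re-setting.
bool PipelineStatesChanged(uint64_t prevHeader, uint64_t currHeader)
{
    return ((prevHeader ^ currHeader) & kPipelineStateMask) != 0;
}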

In a stateful API:
* the best case is that the rendering logic is very carefully structured by a human, in a way where the minimal amount of pipeline/resource binding updates occur thanks to careful reasoning. However, this is hard to maintain over time when changing the rendering code...
* the worst case is that, as strange bugs start occurring, the rendering programmers start getting paranoid and redundantly setting every single state per draw anyway... and then adding a redundant call filter (e.g. if the new blend state == the previous one, do nothing). However, these will typically be slower than the centralized/optimized/small redundant state filtering code that's at the heart of a stateless API :)


Ah okay, that makes more sense. I use handles, so I guess the PSO can be stored and referenced that way. For the XOR, does that mean you're making common states like:

Microsoft Common States

So you can reference them in the drawitem like:

drawItem.blend = BLEND_ALPHA;

That would work and keep the size down, stop leakage, and save time on state swapping.
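Something like this rough sketch is what I have in mind (names and values made up for illustration):

#include <cstdint>

// A small fixed set of common blend states, referenced by a tiny enum
// stored directly in the draw item.
enum BlendStateId : uint8_t
{
    BLEND_OPAQUE = 0,
    BLEND_ALPHA,
    BLEND_ADDITIVE,
    BLEND_PREMULTIPLIED,   // 4 common modes would only need 2 bits in the header
};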

Thanks for the clarification so far :)

19 hours ago, Jemme said:

does that mean you're making common states ... so you can reference them in the drawitem like: drawItem.blend = BLEND_ALPHA;
That would work and keep the size down, stop leakage, and save time on state swapping.

If you go this way, then yeah, you can use really small state identifiers. Many games probably only need 4 or fewer blend states, which is just 2 bits in your state header! I have seen this technique used to great success in proprietary console game engines before.

I do something similar but a bit more generic / general purpose ( / bloated). Unique states are cached in a hash-map and given ID's at runtime, instead of compile-time ID's like in your example. IIRC I'm currently using 7 bits for a blend ID, allowing up to 128 unique blend modes, which, frankly, is overkill :)
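A minimal sketch of that kind of runtime cache, assuming a simplified BlendDesc (the real description and hashing would obviously be richer):

#include <cstdint>
#include <unordered_map>
#include <vector>

struct BlendDesc                     // simplified blend description (assumed fields)
{
    uint8_t srcFactor, dstFactor, op;
    bool    enabled;
    bool operator==(const BlendDesc& o) const
    {
        return srcFactor == o.srcFactor && dstFactor == o.dstFactor
            && op == o.op && enabled == o.enabled;
    }
};

struct BlendDescHash
{
    size_t operator()(const BlendDesc& d) const
    {
        return (size_t(d.srcFactor) << 16) | (size_t(d.dstFactor) << 8)
             | (size_t(d.op) << 1) | size_t(d.enabled);
    }
};

class BlendStateCache
{
public:
    // Returns the small runtime ID for this description, creating it on first use.
    uint8_t GetOrCreate(const BlendDesc& desc)
    {
        auto it = m_lookup.find(desc);
        if (it != m_lookup.end())
            return it->second;
        uint8_t id = static_cast<uint8_t>(m_states.size());   // 7 bits is plenty if < 128 states
        m_states.push_back(desc);        // creating the actual API state object would go here
        m_lookup.emplace(desc, id);
        return id;
    }
private:
    std::vector<BlendDesc> m_states;                                // ID -> description
    std::unordered_map<BlendDesc, uint8_t, BlendDescHash> m_lookup; // description -> ID
};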


Quick question Hodgman, in order to support the simple <8 bit handles, are you using some kind of aggregating resource database structure that sits alongside a set of draw items? Is that just a set of arrays (like, are the handles just indices into that local array?).

I seem to remember looking at some code you'd published a while back, and it seemed like you were aggregating all of the draw items / resource data into a single data stream.

8 hours ago, ZachBethel said:

Quick question Hodgman, in order to support the simple <8 bit handles, are you using some kind of aggregating resource database structure that sits alongside a set of draw items? Is that just a set of arrays (like, are the handles just indices into that local array?).

Pretty much, yeah. The device itself contains an array/pool of blend/depth-stencil/raster states and an array of shader programs (or, alternatively an array of PSOs), and the draw headers index into those arrays. The device also has pools of SRVs/UAVs/CBVs, and resource ID's are just indices into those pools (kind of like ECS where the device is the "texture system", a TextureID is an entity, and an SRV/UAV are components). There's also a pool of resource-lists, which are themselves just arrays of resource-ID's.
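As a rough sketch of that pool/handle arrangement (the type names here are invented, not the actual interface):

#include <cstdint>
#include <vector>

// Stub types standing in for real API objects (illustration only).
struct BlendStateDesc {};
struct ShaderProgram  {};
struct ShaderResourceView {};

using BlendStateId   = uint8_t;    // small ID packed into the draw header
using TextureId      = uint16_t;   // index into the device's SRV pool
using ResourceListId = uint16_t;   // index into the device's resource-list pool

struct ResourceList { std::vector<TextureId> textureIds; };  // shared array of resource IDs

class RenderDevice
{
public:
    const BlendStateDesc& GetBlendState(BlendStateId id) const     { return m_blendStates[id]; }
    const ResourceList&   GetResourceList(ResourceListId id) const { return m_resourceLists[id]; }
private:
    std::vector<BlendStateDesc>     m_blendStates;    // pool of unique pipeline states
    std::vector<ShaderProgram>      m_shaders;        // or an array of PSOs instead
    std::vector<ShaderResourceView> m_srvs;           // indexed by TextureId
    std::vector<ResourceList>       m_resourceLists;  // pool of shared resource lists
};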

8 hours ago, ZachBethel said:

I seem to remember looking at some code you'd published a while back, and it seemed like you were aggregating all of the draw items / resource data into a single data stream.

One nice thing I've found with stateless is you can change a lot of how it works behind the scenes without changing the API :)

In our first implementation which shipped on one XbOne/PS4 game, the back-end was actually a stateful VM that consumed a stream of command packets. Each command had an 8 bit header identifying the type of command (set blend state, set shader, draw, etc) followed by a variable amount of data depending on the command type.
When submitting a list of draw-items, there was a layer that converted them into a stream of commands (by finding the full set of commands required for each draw-item, and then efficiently filtering out redundant commands). The nice thing about this was that it could be largely multi-core, even on old APIs like D3D9 -- just the final VM loop had to be on the actual D3D thread... After the draw-item -> command conversion stage, this did produce a single, linear, condensed stream of memory for the render thread to consume, which is nice.
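A minimal sketch of that kind of command-stream VM (the opcode set and payload sizes are assumptions based on the description, not the shipped code):

#include <cstddef>
#include <cstdint>
#include <cstring>

enum class Cmd : uint8_t { SetBlendState, SetShader, SetTextures, Draw /* ... */ };

void ExecuteCommandStream(const uint8_t* stream, size_t size)
{
    const uint8_t* end = stream + size;
    while (stream < end)
    {
        Cmd cmd = static_cast<Cmd>(*stream++);            // 1-byte command header
        switch (cmd)
        {
        case Cmd::SetBlendState:                          // payload: 1-byte blend-state ID
        {
            uint8_t blendId = *stream++;
            // device->SetBlendState(blendId);            // stateful back-end call
            (void)blendId;
            break;
        }
        case Cmd::Draw:                                   // payload: 4-byte vertex count (assumed)
        {
            uint32_t vertexCount;
            std::memcpy(&vertexCount, stream, sizeof(vertexCount));
            stream += sizeof(vertexCount);
            // device->Draw(vertexCount);
            (void)vertexCount;
            break;
        }
        default:
            return;                                       // other command types elided in this sketch
        }
    }
}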

However, on the next game we had to support Xb360/PS3, and also wanted much better performance, so I put a lot of work into optimization... I found that by converting commands into small ID's, I could actually get the draw-items to be small enough to get rid of the entire "command stream"/VM concept altogether -- a complete re-architecture of the back-end with very minimal changes to the API :D
 Now to submit a collection of draw-items (which themselves are variable size) to the back-end, you can either pass it a compacted stream of draw-items (each immediately after the previous one in memory), or, you can just send it an array of pointers to draw-items (which has much worse memory accesses for the back-end, but makes life a lot easier for the layers that produce lists of draw-items, as they only need to deal with lists of pointers).

10 hours ago, Hodgman said:

The device itself contains an array/pool of blend/depth-stencil/raster states and an array of shader programs (or, alternatively an array of PSOs), and the draw headers index into those arrays. The device also has pools of SRVs/UAVs/CBVs, and resource ID's are just indices into those pools

Are you storing all your data on the RenderDevice? For example, let's say I have a Mesh which needs a vertex buffer and an index buffer. Are you just storing them as handles like VertexBufferHandle inside the mesh, but creating the actual buffers on the device, such that:
 

void Init(RenderDevice* device, char* data) // function in Mesh?
{
    // Load the raw data into some internal representation like MeshData

    VertexBufferDesc desc; // agnostic desc, NOT GL or DX
    // fill in desc using MeshData
    device->CreateBuffer(desc, &vertexBufferHandle);
}

Then when you submit your DrawItem, you're just passing in all the handles for the vertex, index, and constant buffers for the RenderDevice to fetch and set from its pools? You would think the fetch via a handle would be slower than just a pointer chase, but the cache usage could be better?


Do you have any suggestions on how to handle texture bindings? Should I even bother to minimize the number of swaps? E.g. in deferred rendering, I could keep the G-buffer textures bound to slots 0-3 throughout the rest of the rendering phase, but that requires making sure all shaders expect them in those slots, and that shaders which don't need them use the remaining slots.

Or I could assume that every time I switch shaders all textures have to be re-bound, which simplifies things greatly.

I'm targeting WebGL by the way, so no resource lists, and many of these bitwise optimizations are hard to apply there.

On 6/13/2018 at 8:14 PM, Jemme said:

Are you storing all your data on the RenderDevice? For example, let's say I have a Mesh which needs a vertex buffer and an index buffer. Are you just storing them as handles like VertexBufferHandle inside the mesh, but creating the actual buffers on the device, such that:

Then when you submit your DrawItem, you're just passing in all the handles for the vertex, index, and constant buffers for the RenderDevice to fetch and set from its pools? You would think the fetch via a handle would be slower than just a pointer chase, but the cache usage could be better?

Yes, exactly.

Yes, the handle lookups involve pool[handle].pointer->, instead of just pointer-> which is an extra layer of indirection. This adds an extra cache-miss penalty if the pool isn't present in the cache. If the pool is present in the cache, then the cost of this extra indirection is negligible. My main priority is to keep the draw-items themselves as small as possible, which lets more of them fit into the cache. It's adding a performance problem in one area to reduce a problem elsewhere :| 
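i.e. something like this extra hop (a sketch; types invented for illustration):

#include <cstdint>
#include <vector>

struct ID3D11Buffer;                              // opaque API object
struct VertexBufferHandle { uint32_t index; };    // handle = index into the pool
struct BufferSlot { ID3D11Buffer* buffer; /* plus stride, size, etc. */ };

struct VertexBufferPool
{
    std::vector<BufferSlot> slots;
    ID3D11Buffer* Get(VertexBufferHandle h) const { return slots[h.index].buffer; }  // pool[handle].pointer
};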

42 minutes ago, d07RiV said:

Do you have any suggestions on how to handle texture bindings? Should I even bother to minimize the number of swaps?

Reducing state changes helps a lot on the CPU-side, as GL/D3D calls can be relatively expensive, especially ones that interact with resource management. On the GPU side, changing states/resource-bindings constantly can also be a performance issue if your draws aren't big enough.

42 minutes ago, d07RiV said:

Or I could assume that every time I switch shaders all textures have to be re-bound, which simplifies things greatly.

Assuming you don't use too many different shaders, this can be a decent sacrifice. Slightly less accurate redundancy filtering, but much simpler/faster code :) 

42 minutes ago, d07RiV said:

I'm targeting WebGL by the way, so no resource lists, and many of these bitwise optimizations are hard to apply there.

Yeah in C++ you can do 128 bit logical operations, but in Javascript I guess you're limited to 32bit logical operations? You should still be able to do bitwise stuff, just not on as many bits at once...
My resource-list stuff was inspired by Mantle/D3D12/Vulkan, but it's still very useful as far back as D3D9/GL too :) I allow shaders to define 8 resource lists, which is 8 x 16bit IDs, or a single 128bit SSE register (or 4 javascript 32bit integers 😉). I XOR these (with a single SSE XOR) to quickly tell if any resources need to be rebound. Once a dirty/changed resource-list binding is detected, I check the actual texture bindings within that list for changes.
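For example, the 128-bit compare could be sketched like this (assuming the eight resource-list IDs sit contiguously in memory; illustrative, not the actual code):

#include <cstdint>
#include <emmintrin.h>   // SSE2 intrinsics

// Returns true if any of the 8 x 16-bit resource-list IDs differ between
// the previous and the current draw, i.e. some list needs to be re-examined.
bool ResourceListsChanged(const uint16_t prevIds[8], const uint16_t currIds[8])
{
    __m128i a    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(prevIds));
    __m128i b    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(currIds));
    __m128i diff = _mm_xor_si128(a, b);                       // one XOR covers all 8 IDs
    __m128i zero = _mm_cmpeq_epi8(diff, _mm_setzero_si128()); // 0xFF in every byte that matched
    return _mm_movemask_epi8(zero) != 0xFFFF;                 // any non-matching byte => changed
}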


Thanks. I'm still not sure how much abstraction I need, since the API is always going to be the same.

Another thing - when you put all the passes in the same shader file, do you run a lexer on them, or do you just feed everything to the shader compiler and let it figure out what to optimize away? The former option would allow us to know which options affect which passes, so we don't have to make redundant copies (instead of having to manually specify them for every pass).

Edit: I guess this is partially answered by the bonus slides.
