How much redundant state checking?

Started by
5 comments, last by Juliean 8 years, 5 months ago

Hello,

quick question about redundant state checking:

Is there actually a performance overhead if you bind resources that are already bound (via VSSetConstantBuffers, ...) when you need to make the call anyway to update at least one other resource? To give an example, say I allow 6 bound textures. At one draw call, textures 2 and 4 change. What I do now is find the first and last resource that changed, fill a static array with that range, and bind that array, which would translate to:


VSSetShaderResources(2, 3, textures); // 2 = first slot, 3 = number of textures

This reduces the number of API calls to an absolute minimum, but also requires fairly complicated state checking per draw call, per resource type and per resource:


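// NumResources is std::size_t to match std::array's size parameter; strictly conforming compilers cannot deduce an int here
template<typename ResourceType, std::size_t NumResources>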
CheckedResource<ResourceType, NumResources> checkSomething(const std::array<ResourceType*, NumResources>& resources, std::array<ResourceType*, NumResources>& lastResources, unsigned int startSlot, unsigned int endSlot)
{
    unsigned int firstSlotChanged = NumResources, lastSlotChanged = 0;
    // resources changed
    for(unsigned int i = startSlot; i <= endSlot; i++)
    {
        if(resources[i] != lastResources[i])
        {
            lastResources[i] = resources[i];

            firstSlotChanged = min(firstSlotChanged, i);
            lastSlotChanged = max(lastSlotChanged, i);
        }
    }

    CheckedResource<ResourceType, NumResources> checkedResources;

    if(firstSlotChanged != NumResources)
    {
        checkedResources.firstSlot = firstSlotChanged;
        checkedResources.numResources = lastSlotChanged - firstSlotChanged + 1;

        // todo: memcpy or std::copy
        for(unsigned int i = 0; i < checkedResources.numResources; i++)
        {
            checkedResources.buffers[i] = resources[i + firstSlotChanged];
        }
    }

    return checkedResources;
}

I need to loop through every resource to find the first and the last one that changed.

What I could do alternatively is just see if any texture changed, and then shove the whole array of bound textures onto the API:


VSSetShaderResources(0, MAX_TEXTURES, boundTextures); // also saves me constructing the array

Which would change the above method to a simple:


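// NumResources is std::size_t to match std::array's size parameter; strictly conforming compilers cannot deduce an int here
template<typename ResourceType, std::size_t NumResources>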
bool checkSomething(const std::array<ResourceType*, NumResources>& resources, std::array<ResourceType*, NumResources>& lastResources, unsigned int startSlot, unsigned int endSlot)
{
    for(unsigned int i = startSlot; i <= endSlot; i++)
    {
        if(resources[i] != lastResources[i])
        {
            lastResources = resources; // copy whole resource block, since more might have changed but we are merely checking the first

            return true;
        }
    }
    
    return false;
}

So way fewer iterations and way fewer instructions/branches, but now I'm potentially telling the API that it needs to change 6 textures when only the first one really changed. Is there any (CPU) overhead here, or do I save more time with the more minimalistic checking of my second method?
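
To make the trade-off concrete, here is a minimal sketch that counts how many slots each strategy hands to the API for the same change. `MockContext`, `bindDirtyRange` and `bindAllIfChanged` are hypothetical names, `int` stands in for a texture view, and the mock only stands in for a call like `VSSetShaderResources`. It says nothing about the real driver cost, which only profiling can show.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Texture = int; // assumption: placeholder for a real shader resource view
constexpr std::size_t kMaxTextures = 6;
using TextureArray = std::array<const Texture*, kMaxTextures>;

// Hypothetical stand-in for the device context: it only records what was submitted.
struct MockContext {
    unsigned calls = 0;          // number of bind calls made
    unsigned slotsSubmitted = 0; // total number of slots passed across all calls

    void setShaderResources(unsigned startSlot, unsigned count, const Texture* const* /*views*/) {
        assert(startSlot + count <= kMaxTextures);
        ++calls;
        slotsSubmitted += count;
    }
};

// Strategy 1: find the first and last changed slot and bind only that range.
// The range is passed straight out of 'wanted', so no temporary array is built.
void bindDirtyRange(MockContext& ctx, const TextureArray& wanted, TextureArray& bound) {
    std::size_t first = kMaxTextures, last = 0;
    for (std::size_t i = 0; i < kMaxTextures; ++i) {
        if (wanted[i] != bound[i]) {
            bound[i] = wanted[i];
            if (first == kMaxTextures) first = i;
            last = i;
        }
    }
    if (first != kMaxTextures) {
        ctx.setShaderResources(static_cast<unsigned>(first),
                               static_cast<unsigned>(last - first + 1),
                               wanted.data() + first);
    }
}

// Strategy 2: if anything differs, rebind the whole array.
void bindAllIfChanged(MockContext& ctx, const TextureArray& wanted, TextureArray& bound) {
    if (wanted == bound) return; // nothing changed, no API call
    bound = wanted;
    ctx.setShaderResources(0, static_cast<unsigned>(kMaxTextures), wanted.data());
}
```

With textures 2 and 4 changed, both strategies make one call; the first submits 3 slots and the second submits all 6. Whether those 3 extra slots cost anything measurable is exactly the open question.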

Also, since I already wrote a little more about state checking than probably required for this simple API-specific question:

How much state checking do you actually do?

Currently I have a system where I collect all states in the device, plus which states are currently bound (m_state and m_lastState), and before every draw call I check which states to update. For most things this is pretty simple: I only need to compare whether what is bound differs from the current state (shader, depth state, rasterizer state, ...). For resources (samplers, cbuffers, textures), it is far more complicated.

On the high level I only have resources that are shared between shader stages (like the effect framework does), but in the back-end renderer I obviously only want to bind resources to those stages that actually need them (i.e. I support domain/hull shaders, but only 1 shader currently uses them, so I don't want to bind every texture to the domain/hull stage unless the shader uses it).

So what I do is, I get a range of stages the current shader has (2-5), loop over those, and then for every type of resource currently supported (3) I perform the state checking I posted above, which looks like this:


for(unsigned int shader = 0; shader <= m_state.pEffect->GetMaxShader(); shader++)
{
    // CBUFFERS
    {
        const auto shaderData = m_state.pEffect->GetCbufferData(shader);

        const auto checkedResources = checkSomething(m_state.cbuffers[shader], m_lastState.cbuffers[shader], shaderData.startSlot, shaderData.endSlot);

        if(checkedResources.numResources > 0)
            BindCBuffer(shader, checkedResources.firstSlot, checkedResources.numResources, checkedResources.buffers);
    }

    // TEXTURES
    {
        const auto shaderData = m_state.pEffect->GetTextureData(shader);

        const auto checkedResources = checkSomething(m_state.textures[shader], m_lastState.textures[shader], shaderData.startSlot, shaderData.endSlot);

        if(checkedResources.numResources > 0)
            BindTextures(shader, checkedResources.firstSlot, checkedResources.numResources, checkedResources.buffers);
    }

    // SAMPLER
    {
        const auto shaderData = m_state.pEffect->GetSamplerData(shader);

        const auto checkedResources = checkSomething(m_state.sampler[shader], m_lastState.sampler[shader], shaderData.startSlot, shaderData.endSlot);

        if(checkedResources.numResources > 0) // TODO: we didn't need this branch before generalizing this in a method
            BindSampler(shader, checkedResources.firstSlot, checkedResources.numResources, checkedResources.buffers);
    }
}

Is this along the lines of the CPU work you guys invest into redundant-state filtering, or is there an even simpler method? I first wanted to store a bool for every type of resource that only gets set if I bind a resource that actually changes, to save checking every resource every draw call, but this doesn't work due to the partial-binding model I'm using (one shader might only have a vertex and a pixel shader, so it will bind the changed texture to those stages only. However, down the line another shader might get bound that also has a geometry shader and needs the same texture there).

So what does your state filtering look like? Am I somewhat on the right track, or do you use something completely different?

Thanks!

PS: I currently don't have any meaningful 3D scene to benchmark this (I'm in the process of getting the renderer to work and actually building a useful toolchain), and I'd also like some more theoretical knowledge, that's why I'm asking.


I have a "ContextStateCache" class. Every API call I make (OK, every call that can affect the API state: binding, unbinding, resource deleting) goes through that wrapper class.

The sole role of that class is to update the API state only if the newly bound resource is different from the previously bound one. This is done (in my case) mostly, as you've mentioned, with ifs.


void ContextStateCache::UpdateVertexShader(target){
  if(update_cached_value(m_current, target)) {
   d3dcon->VSSet(m_current);
  }
}
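
A minimal, self-contained sketch of that idea, with a mock context so it compiles without D3D. `CachedState`, `MockContext`, `setVertexShader` and `invalidateAll` are hypothetical names for illustration, not the actual class described above and not D3D11 API; in real code the callback would forward to something like `VSSetShader`.

```cpp
#include <cassert>
#include <utility>

// Hypothetical cache for one piece of pipeline state.
template <typename T>
class CachedState {
public:
    // Calls apply(value) only when 'value' differs from the cached one.
    // Returns true if the API was actually touched.
    template <typename Apply>
    bool set(const T& value, Apply&& apply) {
        if (m_valid && m_current == value) return false; // redundant, filtered out
        m_current = value;
        m_valid = true;
        std::forward<Apply>(apply)(m_current);
        return true;
    }

    // Forget the cached value, e.g. after the resource was deleted
    // or after code outside the cache touched the context.
    void invalidate() { m_valid = false; }

private:
    T m_current{};
    bool m_valid = false;
};

// Stand-in for the device context; it only counts calls.
struct MockContext {
    unsigned vsSetCalls = 0;
    void setVertexShader(const void* /*shader*/) { ++vsSetCalls; }
};

class ContextStateCache {
public:
    explicit ContextStateCache(MockContext& ctx) : m_ctx(ctx) {}

    void updateVertexShader(const void* shader) {
        m_vertexShader.set(shader, [this](const void* s) { m_ctx.setVertexShader(s); });
    }

    void invalidateAll() { m_vertexShader.invalidate(); }

private:
    MockContext& m_ctx;
    CachedState<const void*> m_vertexShader;
};
```

The `m_valid` flag matters: without it, a freshly created cache would wrongly treat a default-initialized value as already bound.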

A quick way to prove to yourself that state caching really matters is to open the D3D11 draw triangle example and modify it to draw, let's say, 10000 triangles with multiple draw calls.
Try setting the vertex/pixel shader on every draw call, then try setting it only once before drawing, and observe the frame rate.


The sole role of that class is to update the API state only if the newly bound resource is different from the previously bound one. This is done (in my case) mostly, as you've mentioned, with ifs.

Read the bottom section comparing between active redundant-state setting and last-minute redundant-state setting.
http://lspiroengine.com/?p=570


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid


Read the bottom section comparing between active redundant-state setting and last-minute redundant-state setting.
http://lspiroengine.com/?p=570

Thanks, that article of yours was really helpful, seems like I am right on track. Two detail-questions on that though, based on your example at the bottom of the article (about late texture checking):

- Based on your code, you are really only setting those textures that actually changed, meaning that if textures 0, 2 and 5 changed, you will make 3 API calls instead of 1 API call that sets all textures from 0-5. Is there a specific reason for this, i.e. did you measure that making more API calls is still better than making one call that also sets unchanged textures? Or is it just convention, or do you simply never expect textures to change in a pattern like the one in my example?

- You are also setting textures to all stages (VS and PS in your case) at once, without checking whether the current render setup (= shaders) even needs the texture in that stage. What's the reason behind this? I was expecting this to be slower, since in most cases you are probably not going to need the textures you use in the PS in the VS. More specifically, in my current project I have something like 1 geometry shader that uses a texture, meaning with your code it would make a lot of texture changes for the GS stage while in reality it wouldn't need to change the texture of this stage at all. Does checking the stages separately cost more than those unnecessary API calls, or is it just that you only had 2 stages (instead of all 5 that DX supports), so it didn't matter?
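
To illustrate the two extremes from the first question, here is a small sketch that works on a bitmask of changed slots (bit i set means slot i changed). `SlotRange`, `contiguousRuns` and `enclosingRange` are hypothetical helper names; each returned range would correspond to one bind call. It only shows how many calls and slots each variant produces, not which one is faster.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct SlotRange {
    unsigned first; // first slot of the range
    unsigned count; // number of consecutive slots
};

// One range per contiguous run of changed slots: only changed slots are submitted,
// at the price of one call per run.
std::vector<SlotRange> contiguousRuns(std::uint32_t dirtyMask) {
    std::vector<SlotRange> runs;
    unsigned slot = 0;
    while (dirtyMask != 0) {
        // Skip unchanged slots.
        while ((dirtyMask & 1u) == 0) { dirtyMask >>= 1; ++slot; }
        const unsigned first = slot;
        // Consume the run of changed slots.
        while ((dirtyMask & 1u) != 0) { dirtyMask >>= 1; ++slot; }
        runs.push_back({first, slot - first});
    }
    return runs;
}

// A single range from the first to the last changed slot: one call,
// but unchanged slots in between are submitted again.
SlotRange enclosingRange(std::uint32_t dirtyMask) {
    if (dirtyMask == 0) return {0u, 0u}; // nothing changed
    unsigned first = 0;
    while (((dirtyMask >> first) & 1u) == 0) ++first;
    unsigned last = 31;
    while (((dirtyMask >> last) & 1u) == 0) --last;
    assert(first <= last);
    return {first, last - first + 1};
}
```

For changed slots 0, 2 and 5 (mask `0x25`), `contiguousRuns` yields three one-slot ranges, while `enclosingRange` yields a single range covering slots 0 to 5.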

- Based on your code, you are really only setting those textures that actually changed, meaning that if textures 0, 2 and 5 changed, you will make 3 API calls instead of 1 API call that sets all textures from 0-5. Is there a specific reason for this, i.e. did you measure that making more API calls is still better than making one call that also sets unchanged textures? Or is it just convention, or do you simply never expect textures to change in a pattern like the one in my example?

That’s not final code, it just illustrates the point. It was a while ago that I benchmarked both variants, and I don’t remember if there was a clear winner either way or if my implementation is better-suited for how I use textures.

- You are also setting textures to all stages (VS and PS in your case) at once, without checking whether the current render setup (= shaders) even needs the texture in that stage. What's the reason behind this? I was expecting this to be slower, since in most cases you are probably not going to need the textures you use in the PS in the VS.

This is left over from the days when my engine was meant to support primitive APIs such as fixed-function OpenGL/Direct3D and my own API did not allow setting textures per stage.
This should really be updated so that textures can be applied to each shader separately, after which the same routine I am using now would still be best (but separated by shader) since you would work under the assumption that most of the time if a texture is set it is used by the shader.

The code is only meant to show the difference between active and last-minute checks, and should not necessarily be used as a guide on how to use the API in the most efficient way.


L. Spiro


It was a while ago that I benchmarked both variants, and I don’t remember if there was a clear winner either way

PS: I currently don't have any meaningful 3D scene to benchmark this

I would be interested in reading any benchmark data that anyone has :)

setting textures to all stages (VS and PS in your case) at once, without checking whether the current render setup (= shaders) even needs the texture in that stage

Getting off topic, but to implement this, I use a bitmask per stage, specifying which slots are in-use by that stage. If the mask is zero, you can skip performing any binding for that stage. Usually that means VS/GS/etc skip that logic completely.
My binding logic looks something like this http://pastebin.com/Gid5Bv4E (I've hacked that code apart; in reality, that function is actually 250 lines... I actually process more than one "range" of registers, whereas the pasted snippet assumes all bindings are in a single range -- aka "resource lists". I've also got a lot of non-retail-build error checking, and #ifdefs for xbox, etc...).
I made the choice to collect all the new texture pointers into an array, and then make a single call to PSSetShaderResources, but I made this decision without looking at any profiling data. I'm not sure if many small calls would perform any differently.
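
As a rough sketch of the bitmask idea (this is not the pastebin code; `Stage`, `StageState` and `Binder` are made-up names, `int` stands in for a texture, and the actual API call is replaced by a counter):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Texture = int; // assumption: placeholder for a real texture/view type
constexpr std::size_t kMaxSlots = 8;
using SlotArray = std::array<const Texture*, kMaxSlots>;

enum Stage : std::size_t { VS, HS, DS, GS, PS, StageCount };

struct StageState {
    std::uint32_t usedSlots = 0; // bit i set = the current shader reads slot i in this stage
    SlotArray bound{};           // shadow copy of what is currently bound for this stage
};

struct Binder {
    std::array<StageState, StageCount> stages{};
    unsigned apiCalls = 0; // stands in for calls such as PSSetShaderResources

    // Bind 'wanted' to every stage that uses at least one slot; skip all others.
    void apply(const SlotArray& wanted) {
        for (StageState& stage : stages) {
            if (stage.usedSlots == 0) continue; // stage reads no textures: no checking at all

            // Find the first and last used slot whose binding is stale.
            std::size_t first = kMaxSlots, last = 0;
            for (std::size_t slot = 0; slot < kMaxSlots; ++slot) {
                const bool used = ((stage.usedSlots >> slot) & 1u) != 0;
                if (used && stage.bound[slot] != wanted[slot]) {
                    if (first == kMaxSlots) first = slot;
                    last = slot;
                }
            }
            if (first == kMaxSlots) continue; // everything this stage uses is already bound

            // A single range call overwrites the slots in between as well,
            // so the shadow copy is updated for the whole range.
            for (std::size_t slot = first; slot <= last; ++slot) {
                stage.bound[slot] = wanted[slot];
            }
            ++apiCalls;
        }
    }
};
```

Because the shadow copy is kept per stage, a shader bound later that adds, say, a GS using slot 0 still gets the texture bound to that stage even though the texture itself did not change.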


This is left over from the days when my engine was meant to support primitive APIs such as fixed-function OpenGL/Direct3D and my own API did not allow setting textures per stage.
This should really be updated so that textures can be applied to each shader separately, after which the same routine I am using now would still be best (but separated by shader) since you would work under the assumption that most of the time if a texture is set it is used by the shader.

So you are now actively supporting setting textures explicitly for each stage? That's interesting, because just before writing that article I changed my API from requiring resources to be set per stage to just setting them per shader. I saw upsides to both approaches but ultimately went with the per-shader approach because it seemed to be easier on the API/user side of things (no more setting of duplicate textures if multiple stages need them, ...), and it removed some bloat from the actual code. What was your reasoning for doing it the other way around?


Getting off topic, but to implement this, I use a bitmask per stage, specifying which slots are in-use by that stage. If the mask is zero, you can skip performing any binding for that stage. Usually that means VS/GS/etc skip that logic completely.
My binding logic looks something like this http://pastebin.com/Gid5Bv4E (I've hacked that code apart; in reality, that function is actually 250 lines... I actually process more than one "range" of registers, whereas the pasted snippet assumes all bindings are in a single range -- aka "resource lists". I've also got a lot of non-retail-build error checking, and #ifdefs for xbox, etc...).

Interesting, I'm using a similar system, just instead of a bitmask per stage I am using an integer that stores the first stage and another one that stores the last used stage. In fact what you are doing in this "hacked" code seems pretty similar to what I'm doing in my state checking, only that I templated this function because I want to reuse it for sampler and cbuffers too.


I would be interested in reading any benchmark data that anyone has :)

I guess it can be agreed that state checking is faster, but I also haven't seen much benchmarking data.

