ID3DXEffect & state changes

Started by
4 comments, last by S1CA 19 years, 1 month ago
I am currently handling HLSL shaders through my own class interface, which handles validation, state changes and handle setting. The class is becoming convoluted, and I am considering switching over to the effects framework (through the ID3DXEffect interface) instead of handling the shaders and shader states myself.

However, whenever I look into the effects interface I become worried about performance, specifically how the effects framework handles its state changes. To optimize graphics performance, we must minimize state changes. The effects framework lets us declare all the states necessary for each effect, and it saves and switches states when you begin the effect.

My concern is how the effects framework handles these state changes internally. Does it protect against redundant state changes? Must it query the device to determine which states it must change, which would rule out using a pure device?

I have tried to find information about how the effects framework handles its state changes, but it does not appear to be readily available, or I am searching in the wrong places. My intuition is that the Microsoft tutorials on the effects framework would advertise efficient handling of state changes if that were the case.

My main question to all you champs here is: will using the effects framework incur any performance penalty compared with handling shaders and state changes on my own (which would emphasize minimal state changes)?

- Shaun
1) D3D itself performs trivial redundant state filtering ("if state X is already set to value Y, don't set it to value Y again")

2) the overhead of individual state setting calls is only really that of a call to the D3D DLL (the state isn't set at the device driver level until absolutely necessary for rendering). You'll get the most benefit from redundant state filtering if you do it at the highest level you possibly can with groups of states, and utilising any higher level knowledge you have of your engine/data.

3) IIRC the D3DX effect framework also provides additional filtering at a higher level.

4) Since you already have your own state management class (which can potentially have better performance due to knowing more about your usage patterns), you could look into ID3DXEffect::SetStateManager() which allows you to replace the default state manager an effect uses with your own...

5) For blocks of "static" state that don't need to change at render time, you can use ID3DXEffect::BeginParameterBlock, ID3DXEffect::EndParameterBlock, ID3DXEffect::ApplyParameterBlock to group bunches of Effect state changes together.

6) If your application is CPU bound (most games are...), you'll see bigger wins from being careful about where your ID3DXEffect::Begin, ID3DXEffect::BeginPass, and ID3DXEffect::CommitChanges calls are made (most can be moved to a lower frequency calling position) than you will with redundant state filtering.

7) Something else if you're CPU bound is to see if disabling pre-shaders helps your performance - they're a win if you're shader execution bound and sometimes a win if you're constant upload bound - but their cost is extra CPU work - which you don't want for CPU bound apps...
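The redundant state filtering described in points 1 to 4 can be sketched as a small value cache sitting in front of the device. This is only an illustration under stated assumptions: `MockDevice` and `FilteringStateManager` are hypothetical names, and the mock device merely counts calls. With real D3DX you would put this kind of logic in an implementation of ID3DXEffectStateManager and hand it to ID3DXEffect::SetStateManager().

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for the real device; it only counts forwarded calls.
struct MockDevice {
    int setRenderStateCalls = 0;
    void SetRenderState(std::uint32_t /*state*/, std::uint32_t /*value*/) {
        ++setRenderStateCalls;
    }
};

// Remembers the last value forwarded for each render state and drops repeats.
class FilteringStateManager {
public:
    explicit FilteringStateManager(MockDevice& device) : device_(device) {}

    // Returns true if the call reached the device, false if it was filtered.
    bool SetRenderState(std::uint32_t state, std::uint32_t value) {
        auto it = cache_.find(state);
        if (it != cache_.end() && it->second == value) {
            return false;  // already set to this value, so skip the call
        }
        cache_[state] = value;
        device_.SetRenderState(state, value);
        return true;
    }

    // Forget everything, e.g. after a device reset or when other code
    // has changed device state behind the cache's back.
    void Invalidate() { cache_.clear(); }

private:
    MockDevice& device_;
    std::unordered_map<std::uint32_t, std::uint32_t> cache_;
};

// Usage example: three requests, one of them redundant.
inline int DemoForwardedCalls() {
    MockDevice device;
    FilteringStateManager states(device);
    const std::uint32_t kSomeState = 27;  // arbitrary id chosen for the sketch
    states.SetRenderState(kSomeState, 1);
    states.SetRenderState(kSomeState, 1);  // filtered, never reaches the device
    states.SetRenderState(kSomeState, 0);
    assert(device.setRenderStateCalls == 2);
    return device.setRenderStateCalls;
}
```

A cache like this is only safe while every state change goes through it, which is why the sketch includes an explicit `Invalidate()`.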

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site

Thank you so much Simon for the fast and accurate response; it was exactly what I was looking for - you sir are the man. I am now comfortable porting everything over to the effect interface. I am not doing very many state changes among my shaders (PIX shows 26 state changes a frame), so I am assuming that making state change blocks for 2-3 state changes at a time would cost more in overhead than it would gain in speed.

Just to make sure I am understanding everything correctly: tip #6 means rendering pass0 for all batches of an object type, then switching over to pass1 and re-rendering the entire batch, rather than rendering pass0 then pass1 for each and every object that uses the same shader. That makes total sense, and was actually another concern I forgot to ask about: rendering multiple passes for each object would mean an incredible amount of shader changes with a large number of objects to render.

In regards to number 7, I am not sure if my understanding of pre-shaders is correct. Would the pre-shader refer to rendering all objects without a shader or lighting or color attributes in order to set up the depth buffer to maximize depth-fails on pixels?

I predict my game will not be CPU bound, since it is multiplayer only (which saves on AI computations) and offloads every possible bottleneck to the GPU, from stencil shadow silhouette generation to mesh skinning, on top of using a large amount of alpha blending and rather expensive shaders.

Thank you so much for being able to help me out

-Shaun



Quote: Just to make sure I am understanding everything correctly: tip #6 means rendering pass0 for all batches of an object type, then switching over to pass1 and re-rendering the entire batch, rather than rendering pass0 then pass1 for each and every object that uses the same shader. That makes total sense, and was actually another concern I forgot to ask about: rendering multiple passes for each object would mean an incredible amount of shader changes with a large number of objects to render.


Yes. In the case of multiple passes, whether rendering all objects using pass 0 in a single batch is beneficial really depends on where your bottlenecks are and how many state changes you need to commit (e.g. per-object matrices etc). There isn't a perfect one-size-fits-all sort order for this kind of thing unfortunately.
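To make the two orderings concrete, here is a minimal sketch that only counts calls. `MockEffect` and both render functions are hypothetical; a real renderer would call ID3DXEffect::BeginPass/EndPass and, for per-object parameters changed inside a pass, ID3DXEffect::CommitChanges before each draw.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an effect; it only counts calls.
struct MockEffect {
    int beginPassCalls = 0;
    int drawCalls = 0;
    void BeginPass(unsigned /*pass*/) { ++beginPassCalls; }
    void EndPass() {}
    void Draw(std::size_t /*object*/) { ++drawCalls; }
};

// Object-major order: each object walks through every pass on its own.
inline void RenderObjectMajor(MockEffect& fx, std::size_t objects, unsigned passes) {
    for (std::size_t obj = 0; obj < objects; ++obj) {
        for (unsigned p = 0; p < passes; ++p) {
            fx.BeginPass(p);
            fx.Draw(obj);
            fx.EndPass();
        }
    }
}

// Pass-major order: begin a pass once, draw every object, then move on.
inline void RenderPassMajor(MockEffect& fx, std::size_t objects, unsigned passes) {
    for (unsigned p = 0; p < passes; ++p) {
        fx.BeginPass(p);
        for (std::size_t obj = 0; obj < objects; ++obj) {
            // Per-object constants would be set and committed here.
            fx.Draw(obj);
        }
        fx.EndPass();
    }
}

// Usage example: 100 objects, 2 passes.
inline void DemoPassOrdering() {
    MockEffect objectMajor;
    MockEffect passMajor;
    RenderObjectMajor(objectMajor, 100, 2);
    RenderPassMajor(passMajor, 100, 2);
    assert(objectMajor.drawCalls == passMajor.drawCalls);  // same work drawn
    assert(objectMajor.beginPassCalls == 200);
    assert(passMajor.beginPassCalls == 2);
}
```

The draw count is identical in both cases; only the number of pass switches differs, and whether that matters still depends on where the bottleneck is.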


Quote:In regards to number 7, I am not sure if my understanding of pre-shaders is correct. Would the pre-shader refer to rendering all objects without a shader or lighting or color attributes in order to set up the depth buffer to maximize depth-fails on pixels?


No. What you're describing is known as "laying down Z" or "Z write prepass".

Pre-shaders are an optimisation made inside D3DX Effects and enabled by default when you use HLSL shaders. Imagine you had the following code inside your vertex shader:

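// mul() is the HLSL matrix product; operator * multiplies component-wise
float4x4 matWorldViewProj = mul(mul(matWorld, matView), matProj);
...
// "out" is a reserved word in HLSL, so the output struct is named "output" here
output.pos = mul(inpos, matWorldViewProj);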


Without pre-shaders, the above code would be executed for every vertex, so the world, view and projection matrices would be multiplied to form a single transform to use.

matWorld, matView and matProj don't change per-vertex, they only change at most per call to Draw*Primitive*(), so it's pretty wasteful of GPU shader instruction performance to re-compute that constant matrix for every vertex you render.

With pre-shaders, during compilation (including offline compilation with FXC), computations which remain constant for all vertices are detected and pulled out of your shader code and put into a separate shader known as a "preshader".

When the D3DX Effect system sets the shaders for the effect, it runs the pre-shader in software on the CPU (because code flow in an effect can be changed at runtime, preshaders use an intermediate virtual machine at runtime).
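The hoisting a preshader performs can be imitated on the CPU to see what is saved. The sketch below is not how D3DX implements it; it is a plain C++ illustration with hypothetical names (`Multiply`, `TransformPerVertex`, `TransformHoisted`) that counts 4x4 matrix products when the combined matrix is rebuilt for every vertex versus once per draw call.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;

inline Mat4 Identity() {
    Mat4 m{};  // zero-initialised
    for (std::size_t i = 0; i < 4; ++i) m[i][i] = 1.0f;
    return m;
}

// 4x4 matrix product; productCount records how often it runs.
inline Mat4 Multiply(const Mat4& a, const Mat4& b, int& productCount) {
    ++productCount;
    Mat4 r{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            for (std::size_t k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Row vector times matrix.
inline Vec4 Transform(const Vec4& v, const Mat4& m) {
    Vec4 r{};
    for (std::size_t j = 0; j < 4; ++j)
        for (std::size_t k = 0; k < 4; ++k)
            r[j] += v[k] * m[k][j];
    return r;
}

// What the shader source literally asks for: rebuild the matrix per vertex.
inline std::vector<Vec4> TransformPerVertex(const std::vector<Vec4>& verts,
                                            const Mat4& world, const Mat4& view,
                                            const Mat4& proj, int& productCount) {
    std::vector<Vec4> result;
    for (const Vec4& v : verts) {
        Mat4 wvp = Multiply(Multiply(world, view, productCount), proj, productCount);
        result.push_back(Transform(v, wvp));
    }
    return result;
}

// Preshader-style: build the matrix once, then reuse it for every vertex.
inline std::vector<Vec4> TransformHoisted(const std::vector<Vec4>& verts,
                                          const Mat4& world, const Mat4& view,
                                          const Mat4& proj, int& productCount) {
    const Mat4 wvp = Multiply(Multiply(world, view, productCount), proj, productCount);
    std::vector<Vec4> result;
    for (const Vec4& v : verts) result.push_back(Transform(v, wvp));
    return result;
}

// Usage example: three vertices, a world matrix that translates by 2 along x.
inline void DemoHoisting() {
    Mat4 world = Identity();
    world[3][0] = 2.0f;
    const std::vector<Vec4> verts = {Vec4{1.0f, 0.0f, 0.0f, 1.0f},
                                     Vec4{0.0f, 1.0f, 0.0f, 1.0f},
                                     Vec4{0.0f, 0.0f, 1.0f, 1.0f}};
    int perVertexProducts = 0;
    int hoistedProducts = 0;
    auto a = TransformPerVertex(verts, world, Identity(), Identity(), perVertexProducts);
    auto b = TransformHoisted(verts, world, Identity(), Identity(), hoistedProducts);
    assert(a == b);                  // same positions either way
    assert(perVertexProducts == 6);  // 2 products for each of 3 vertices
    assert(hoistedProducts == 2);    // 2 products in total
}
```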

If your application is shader instruction/computation bound, then as you can imagine, using pre-shaders is a good optimisation because it transfers some of the work onto the CPU. You'll want to try them if you're using software vertex processing too.

If your application is CPU bound and you have GPU shader time to spare (many commercial games fall into this category), then preshaders can actually be bad for your performance - your frame rate is always capped by your biggest bottleneck - CPU time used to evaluate and execute the preshaders is time you'd rather spend elsewhere in a CPU bound app (the goal is to balance CPU & GPU load).
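As a rough worked example of the bottleneck argument, assume a simplified model in which CPU and GPU work overlap perfectly, so the slower of the two sets the frame time. The millisecond figures below are made up for illustration, not measurements, and `FrameCost`, `FrameTimeMs` and `WithPreshader` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

struct FrameCost {
    double cpuMs;
    double gpuMs;
};

// Assumed model: perfect CPU/GPU overlap, so the larger cost is the frame time.
inline double FrameTimeMs(const FrameCost& c) {
    return std::max(c.cpuMs, c.gpuMs);
}

// A preshader adds some CPU work and removes some GPU work.
inline FrameCost WithPreshader(FrameCost c, double cpuCostMs, double gpuSavingMs) {
    c.cpuMs += cpuCostMs;
    c.gpuMs -= gpuSavingMs;
    return c;
}

// Usage example with illustrative numbers.
inline void DemoBottleneck() {
    const FrameCost gpuBound{8.0, 16.0};
    const FrameCost cpuBound{16.0, 8.0};
    // GPU bound: the frame gets shorter (16 ms down to 13 ms).
    assert(FrameTimeMs(WithPreshader(gpuBound, 1.0, 3.0)) < FrameTimeMs(gpuBound));
    // CPU bound: the frame gets longer (16 ms up to 17 ms).
    assert(FrameTimeMs(WithPreshader(cpuBound, 1.0, 3.0)) > FrameTimeMs(cpuBound));
}
```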

Preshaders can be disabled with the /Op flag in FXC and the D3DXSHADER_NO_PRESHADER flag at runtime.

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site

Thank you again Simon!

I successfully ported everything over to the effects framework, and every single thing you said has been great advice.

Just to give you a performance update:

When staring at an object with the least expensive shader (filling up the screen), my FPS dropped 3.33% compared with my own shader management (most likely due to some overhead in the effects framework burdening the CPU, since my app is probably CPU bound when looking at the simple shader).

However, when staring at the most difficult shader, my application ran 8.3% faster using the effects framework than with my own shader management. My guess is that the "preshader" found something to optimize per-object rather than per-vertex like you said. However, I can't spot anything in the shader that could be pulled out per-object rather than per-vertex - so either the pre-shader is more clever than me, or the optimization with the effects lies somewhere else such as in how it handles samplers.

Even in the worst scenario, I am willing to pay 3.33% of the frames per second to make my job easier in addition to making an application run on all hardware without an absolute pain in the neck.

Thanks again Simon.

-Shaun
Quote:My guess is that the "preshader" found something to optimize per-object rather than per-vertex like you said. However, I can't spot anything in the shader that could be pulled out per-object rather than per-vertex


If your effects are stored in *.FX files (or can be pulled out into *.FX files for debugging purposes), you can run the file through the FX Compiler (FXC.EXE) using the /Fc option to generate an assembly listing for that particular effect.

The listing file will show you what things got moved out to a preshader. An example listing is below:

//listing of all techniques and passes with embedded asm listings

technique t0
{
    pass p0
    {
        vertexshader =
            asm {
            //
            // Generated by Microsoft (R) D3DX9 Shader Compiler 5.04.00.2904
            //
            // Parameters:
            //
            //   float4x4 mtxObjectToWorld;
            //   float4x4 mtxWorldToScreen;
            //
            //
            // Registers:
            //
            //   Name            Reg   Size
            //   --------------- ----- ----
            //   s_WorldToScreen c0       4
            //   g_ObjectToWorld c4       4
            //

                preshader
                mul r0, c4.x, c0
                mul r2, c4.y, c1
                add r1, r0, r2
                mul r2, c4.z, c2
                add r0, r1, r2
                mul r1, c4.w, c3
                add c1, r0, r1
                mul r0, c5.x, c0
                mul r2, c5.y, c1
                add r1, r0, r2
                mul r2, c5.z, c2
                add r0, r1, r2
                mul r1, c5.w, c3
                add c2, r0, r1
                mul r0, c6.x, c0
                mul r2, c6.y, c1
                add r1, r0, r2
                mul r2, c6.z, c2
                add r0, r1, r2
                mul r1, c6.w, c3
                add c3, r0, r1
                mul r0, c7.x, c0
                mul r2, c7.y, c1
                add r1, r0, r2
                mul r2, c7.z, c2
                add r0, r1, r2
                mul r1, c7.w, c3
                add c4, r0, r1

            // approximately 28 instructions used
            //
            // Generated by Microsoft (R) D3DX9 Shader Compiler 5.04.00.2904
            //
            // Parameters:
            //
            //   float4x4 mtxObjectToWorld;
            //   ...
            //   float4 s_LightColour;
            //   float4 s_LightDirection;
            //
            //
            // Registers:
            //
            //   Name                        Reg   Size
            //   --------------------------- ----- ----
            //   mtxObjectToWorld            c5       4
            //   ...
            //   s_LightColour               c11      1
            //   s_LightDirection            c12      1
            //

                vs_1_1
                def c17, 1, 0, 0, 0
                dcl_position v0
                dcl_normal v1
                dcl_texcoord v2
                ...stuff...
                // the following uses the matrix computed by the preshader
                mul r0, v0.y, c2
                mad r1, c1, v0.x, r0
                mad r1, c3, v0.z, r1
                mad oPos, c4, v0.w, r1
                ...stuff...

            // approximately 29 instruction slots used
            };

        pixelshader =
            asm {
                ps_1_1
                tex t0
                mov r0, t0

            // approximately 2 instruction slots used (1 texture, 1 arithmetic)
            };
    }
}

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site
