Shader performance


Hi,

I'm interested to know what the preferred approach is regarding shader performance.

(Yes, it is a matter of profiling, but is there a preferred way?)

Option 1: have one shader that does the same work all the time, and in some cases may do "empty" work.

For example, the specular level is read from a texture. But if we don't have a unique specular level pattern, the bound texture is a 1x1 white texture.

I know that texture sampling can impact performance, and according to some resources online, on modern GPUs it's preferable to do more ALU work rather than texture sampling. But does the texture size matter?

The plus side of this approach is that I'm reducing the number of program swaps, which also helps performance.
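To make option 1 concrete, this is roughly what I have in mind on the C++/OpenGL side (just a sketch; the Material struct and names are placeholders):

```cpp
// Option 1 sketch: one shader always samples the specular slot; materials
// without a unique specular map get a 1x1 white fallback texture instead.
// Assumes an existing OpenGL context (GLEW used here for loading).
#include <GL/glew.h>
#include <cstdint>

GLuint createWhiteFallbackTexture()
{
    const uint8_t white[4] = { 255, 255, 255, 255 };
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 1, 1, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, white);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    return tex;
}

struct Material { GLuint specularMap = 0; };  // 0 = no unique specular pattern

void bindSpecular(const Material& mat, GLuint whiteFallback, GLenum textureUnit)
{
    glActiveTexture(textureUnit);  // e.g. GL_TEXTURE1
    // The shader samples this slot unconditionally; the "empty" work just reads white.
    glBindTexture(GL_TEXTURE_2D, mat.specularMap ? mat.specularMap : whiteFallback);
}
```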

Option 2: use the preprocessor and split the work into many shaders. The GPU will do less work, but we will have a lot of program changes.
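Option 2 would look roughly like this: one source file where the specular sampling is wrapped in #ifdef HAS_SPECULAR_MAP, and a variant is built by prepending defines before compiling (again just a sketch, and the define name is only an example):

```cpp
// Option 2 sketch: compile a separate program variant per feature set by
// prepending #defines to the same GLSL source before compilation.
#include <GL/glew.h>
#include <string>

GLuint compileFragmentVariant(const std::string& source, bool hasSpecularMap)
{
    std::string header = "#version 330 core\n";   // #version must come first
    if (hasSpecularMap)
        header += "#define HAS_SPECULAR_MAP\n";

    const std::string full = header + source;
    const char* src = full.c_str();

    GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(shader, 1, &src, nullptr);
    glCompileShader(shader);
    return shader;  // caller checks GL_COMPILE_STATUS and links it into a program
}
```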

Tnx,


My experience is that using boolean uniforms with if-elses is faster than a program switch. Another option, probably the best, would be to use shader subroutines, but that should only be better if you use more complex variations of your shaders.
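Roughly like this (a minimal sketch, assuming an OpenGL 3.3+ context; the uniform and variable names are made up):

```cpp
// One program, one boolean uniform; flipping the uniform replaces a glUseProgram call.
#include <GL/glew.h>

const char* fragmentSource = R"(
#version 330 core
uniform bool uUseSpecularMap;
uniform sampler2D uSpecularMap;
uniform float uSpecularLevel;
in vec2 vUV;
out vec4 fragColor;
void main()
{
    float spec = uUseSpecularMap ? texture(uSpecularMap, vUV).r : uSpecularLevel;
    fragColor = vec4(vec3(spec), 1.0);  // placeholder; real shading goes here
}
)";

void setUseSpecularMap(GLuint program, bool useMap)
{
    // The program must currently be bound with glUseProgram for glUniform1i to apply.
    glUniform1i(glGetUniformLocation(program, "uUseSpecularMap"), useMap ? 1 : 0);
}
```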

Depends.

Everybody's workload is different, and what kind of hardware you're targeting is a huge factor. Also don't forget that in many cases it can be beneficial to lose a certain amount of performance in exchange for cleaner, more maintainable code - when you come to add a feature in 6 months' time you'll be happy you did.

If it was me I'd build option 1 first. The case you cite - sampling a 1x1 texture - should be accelerated because the result will be cached. Option 1 would also allow for better draw call batching (which will give a performance boost on its own). You can then profile your program to determine whether there are any performance-critical paths that would make more sense to split out into their own shaders.


Yeah it depends. I default to option 2 though.

Extra pixel shader instructions have a cost that scales with the number of pixels/triangles drawn. When you're drawing millions of pixels, small inefficiencies add up.

The cost of switching shaders is mostly on the CPU side, as the driver has to generate a bunch of commands and rebind all your resources.
On the GPU side, switching shaders can be completely free as long as you don't do it too frequently. My rule of thumb is to try and draw at least a few hundred pixels between each shader switch to give the GPU front end enough time to pipeline the state-change overhead.
Note that this applies to other state changes too, such as blend modes or input layouts.
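In practice that mostly comes down to sorting your draw calls by shader (and other state) so each program is bound once per group, something like this rough sketch (the DrawCall struct is hypothetical):

```cpp
// Sort the frame's draw calls by program so each shader switch covers a whole
// group of draws, instead of switching every few triangles.
#include <GL/glew.h>
#include <algorithm>
#include <vector>

struct DrawCall
{
    GLuint  program;
    GLuint  vao;
    GLsizei indexCount;
};

void submit(std::vector<DrawCall>& draws)
{
    std::sort(draws.begin(), draws.end(),
              [](const DrawCall& a, const DrawCall& b) { return a.program < b.program; });

    GLuint boundProgram = 0;
    for (const DrawCall& d : draws)
    {
        if (d.program != boundProgram)   // switch only when the group changes
        {
            glUseProgram(d.program);
            boundProgram = d.program;
        }
        glBindVertexArray(d.vao);
        glDrawElements(GL_TRIANGLES, d.indexCount, GL_UNSIGNED_INT, nullptr);
    }
}
```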

Very good answers.

I actually agree with all the answers.

I'm treating the GPU as a big train which must not be stopped for a small number of passengers. Option 1 helps reduce the number of shader variants, which helps with batching (and it can always be divided into sub-shaders later on).

Level of detail will help to reduce the shading load.

However, all the shaders used in games (that I've seen) use the preprocessor option to reduce the work... so I'm still confused about that matter. I remember the OpenGL GDC lecture about zero driver overhead saying that swapping shaders was very expensive.

So I'm still trying to figure out why the majority picks option 2.

I remember the OpenGL GDC lecture about zero driver overhead saying that swapping shaders was very expensive.
So I'm still trying to figure out why the majority picks option 2.

The problem is: "very expensive compared to what?"

If you're swapping shaders to avoid a couple of uniform changes and maybe a texture change, then for sure it's going to be more expensive. If on the other hand the amount of state change exceeds the cost of the shader swap, then it goes the other way. That's why I remarked that "everybody's workload is different" in my earlier answer, and why a general guideline can't really be given for this kind of question.


I'm treating the GPU as a big train which must not be stopped for a small number of passengers. Option 1 helps reduce the number of shader variants, which helps with batching (and it can always be divided into sub-shaders later on).
...
I remember the OpenGL GDC lecture about zero driver overhead saying that swapping shaders was very expensive.

There's CPU expense and GPU expense. Different games will have a different workload balance. Ideally, you would be as balanced as possible, having a similar frame-time on both processors.

e.g.

  • If your game takes 30ms per frame on the GPU, but only 5ms per frame on the CPU, then CPU expense is irrelevant to you -- if you can save 0.5ms of GPU time by spending 8ms of CPU time, then it makes sense for that situation.
  • If it's reversed, and your game takes 5ms per frame on the GPU, but 30ms per frame on the CPU, then GPU expense is irrelevant to you -- you should optimize in a way that reduces CPU time at all costs, even if that means increasing the GPU workload.

Switching shaders is primarily a CPU cost, but allows you to save GPU time by reducing per-pixel waste.

The exception is if you switch shaders too often (e.g. with batches that only draw 10 pixels each), then switching shaders becomes a massive GPU cost as well... so you've got to be sensible.

FWIW, I just loaded up my test scene in D3D11 and profiled a frame -- it has 646 draw calls using 45 shaders -- 14 draws per shader switch on average. The CPU cost of all my D3D11 function calls is ~300µs (0.3ms!)... Sure, I could use 5 shaders instead of 45... but why bother when doing so is going to be optimizing a section of the code that's already only taking 0.3ms? :)

For comparison, my GPU takes 5.3ms to actually execute these commands that the CPU has prepared, which is how it should be :D

In my situation, I can afford to waste lots of CPU time if it means reducing the GPU frametime. However, I cannot afford to waste any GPU time whatsoever.

Note that the situation is a bit different for D3D9/OpenGL, as they've got much larger CPU overheads than D3D11.

The situation is different again for D3D12/Vulkan, as they have extremely small CPU overheads when compared to D3D11/GL.
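If you want to get numbers like these for your own scene, a rough way to measure both sides in OpenGL is a CPU timer around command submission plus a GL_TIME_ELAPSED query (sketch only, assuming a GL 3.3+ context):

```cpp
// Measure CPU submission time and GPU execution time for one frame.
#include <GL/glew.h>
#include <chrono>
#include <cstdio>

void profileFrame(void (*renderFrame)())
{
    GLuint query = 0;
    glGenQueries(1, &query);

    const auto cpuStart = std::chrono::high_resolution_clock::now();
    glBeginQuery(GL_TIME_ELAPSED, query);

    renderFrame();                       // issue all draw calls for the frame

    glEndQuery(GL_TIME_ELAPSED);
    const auto cpuEnd = std::chrono::high_resolution_clock::now();

    GLuint64 gpuNs = 0;                  // waits for the GPU; fine for a one-off measurement
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpuNs);

    const double cpuMs = std::chrono::duration<double, std::milli>(cpuEnd - cpuStart).count();
    std::printf("CPU submit: %.2f ms, GPU execute: %.2f ms\n", cpuMs, gpuNs / 1.0e6);

    glDeleteQueries(1, &query);
}
```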

Thank you very much. I will definitely use your tips once I have a running scene.

I also prefer option 2; if you precompile and just load all the shader permutations, you won't lose initial performance/startup time, and you get all the benefits later on.
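Roughly what I mean (a sketch; the feature bits and the buildProgram callback are placeholders, where buildProgram would prepend the matching #defines as in the earlier example):

```cpp
// Build every permutation once at startup and look them up by a feature bitmask at draw time.
#include <GL/glew.h>
#include <unordered_map>

enum FeatureBits : unsigned { HAS_SPECULAR_MAP = 1u, HAS_NORMAL_MAP = 2u };

std::unordered_map<unsigned, GLuint> precompileAllVariants(
    GLuint (*buildProgram)(unsigned features))   // compiles/links one variant from #defines
{
    std::unordered_map<unsigned, GLuint> programs;
    for (unsigned features = 0; features < 4; ++features)  // every combination of the two bits
        programs[features] = buildProgram(features);
    return programs;   // at draw time: glUseProgram(programs[material.features])
}
```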


