Optimizing Out Uniforms, Attributes, and Varyings


Yeah I was just sticking with the terminology already present in the thread.

Ah, that explains it. Now, speaking of which - to OP, are you sure you are not using an unnecessarily old GLSL?

GL2's API for dealing with shaders, uniforms, attributes and varyings is absolutely terrible compared to the equivalents in D3D9, or the more modern APIs of D3D10/GL3...

Quite, was a bit puzzled myself when GLSL first surfaced. OGL sure has long-lasting developmental issues (stemming from a crippling need for consensus / compatibility / support).

BTW, when using interface blocks for uniforms, the default behaviour in GL is similar to D3D in that all uniforms in the block will be "active" regardless of whether they're used or not - no optimisation to remove unused uniforms is done. It's nice that GL gives you a few options here though (assuming that every GL implementation acts the same way with these options...).
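To make those options concrete, here is a minimal sketch of the three block layout qualifiers being discussed (GLSL kept as a C string; the block and member names are made up for illustration):

const char *block_src =
    /* 'shared' is the default: every member stays active and the layout is
       implementation-defined, so offsets have to be queried at runtime. */
    "layout(shared) uniform PerObjectShared { mat4 world; vec4 tint; };\n"
    /* 'packed': the implementation is allowed to drop members the shader never
       reads, so glGetUniformIndices() may report GL_INVALID_INDEX for them. */
    "layout(packed) uniform PerObjectPacked { mat4 world; vec4 unusedTint; };\n"
    /* 'std140': nothing is removed, but the offsets are fixed by the spec, so a
       CPU-side buffer can be built without querying anything. */
    "layout(std140) uniform PerObjectStd140 { mat4 world; vec4 tint; };\n";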

Did not quite understand what you said here (not sure which parts refer to D3D and which OGL).
About uniforms with OGL: uniform buffers are not part of the program object and hence are not directly bound to any program. Plain uniforms are bound to the program object - however, it appears that every implementation compiles an internal buffer for those under the hood, and the two cases are indistinguishable at the hardware level.

So, in either case, there is no special "loading" code generated for uniforms, and the only optimization of removing unused stuff one can speak of is ... well, just do not use the parts of the uniform buffer you do not use ... duh. As the underlying hardware is the same, D3D is bound to end up doing the exact same thing here (unused uniforms are, in all regards that matter, thrown out - regardless of what is seen/reported on the API side).

Ie. only buffers are bound - the individual uniforms are just offsets in machine code.
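In C terms, that ends up looking roughly like this (a sketch; 'program' and 'ubo' are assumed to have been created and filled elsewhere, and the block name "PerObject" is just an example):

GLuint blockIndex = glGetUniformBlockIndex(program, "PerObject");
glUniformBlockBinding(program, blockIndex, 0);   /* block  -> binding point 0 */
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);     /* buffer -> binding point 0 */
/* From here on the individual uniforms are just offsets into 'ubo'; between
   draws only the buffer (or the bound range) changes, not the program. */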

Oh, and thank goodness for std140 or my head would explode in agony.

With the default GL behaviour, the layout of the block isn't guaranteed, which means you can't precompile your buffers either. The choice to allow this optimisation to take place means that you're unable to perform other optimisations.

What optimizations (makes zero difference at driver/OGL/GPU side)? You mean CPU side, ie. filling buffers with data? Yeah, it would be pretty painful not to use std140. IIRC, it was added at the same time as interface blocks - so, if you can use interface blocks then you can always use the fixed format also ... a bit late here to go digging to check it though.

Once you've created the individual D3D shader programs for each stage (vertex, pixel, etc), it's assumed that you can use them (in what you call a mix-and-match) fashion straight away, as long as you're careful to only mix-and-match shaders with interfaces that match exactly, without the runtime doing any further processing/linking.

Yep, got that when i re-read the "separate shader objects" extension ( http://www.opengl.org/registry/specs/ARB/separate_shader_objects.txt ) - it uses "mix-and-match" to describe it. It has been core since 4.1. Not using it any time soon, but nice to have the option (as most shader programs do not particularly benefit from whole-program optimization).
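For reference, a minimal sketch of that mix-and-match path (GL 4.1+ / ARB_separate_shader_objects; 'vs_src' and 'fs_src' are assumed to be complete GLSL sources with matching interfaces):

GLuint vs = glCreateShaderProgramv(GL_VERTEX_SHADER, 1, &vs_src);
GLuint fs = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fs_src);

GLuint pipeline;
glGenProgramPipelines(1, &pipeline);
glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT, vs);
glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, fs);
glBindProgramPipeline(pipeline);
/* Any other stage program with a matching interface can now be swapped in with
   a single glUseProgramStages() call - no re-link, but consequently also no
   whole-program optimization across stage boundaries. */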


Having a precompiled intermediate is one of the most recurring requests on the OGL side (even after binary blobs were already added) - with D3D brought up as an example time and time again. So, what's the holdup? If it makes sense for OGL then why has it not been added?

That's a pretty silly argument.


Ee.. you lost me here :/. I think you implied an argument where there was none. I was conveying "wonderment" as perceived by me - it is not an argument from me nor from the "wonderer".

But if i would speculate anyway then the reason it has not been added to OGL might be:
* Khronos is slow and half the time i just want to throw my shoe at them.
* Instead of one specification one has to hope is implemented correctly, there would now be two.
* Consensus lock / competition ... ie. no MS as arbitrator to break the lock.
* Insufficient demand from those that matter (learnin-ogl-complaining-a-lot persons do not matter).
* The question whether it would be worthwhile for OGL specifically has not been confidently settled.
* There are more important matters to attend to - maybe later.
* All the above.

PS. i would like to have a GLSL intermediate option - as you said, in case of shader explosion (as i call it), it gets problematic (uncached runs).
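For completeness, the binary blobs mentioned earlier are only a driver-specific cache rather than a portable intermediate, but they do address the uncached-run problem on a per-machine basis. A rough sketch (GL 4.1 / ARB_get_program_binary; 'program' is assumed to be linked already):

GLint length = 0;
glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);

void *blob = malloc(length);
GLenum format = 0;
glGetProgramBinary(program, length, NULL, &format, blob);
/* ... write 'format' and 'blob' to disc ... */

/* On a later run, try to skip GLSL compilation entirely: */
glProgramBinary(program, format, blob, length);
GLint ok = GL_FALSE;
glGetProgramiv(program, GL_LINK_STATUS, &ok);
if (!ok) {
    /* Driver or GPU changed since the blob was saved - fall back to source. */
}
free(blob);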

wait for 10 minutes the first time they load the game.

Then i would say that one is doing something wrong. One does not need thousands of shaders to show the splash screen ;)

... again, i am not against an intermediate, i would just like to point out that its absence is not as widespread and grave a problem as it is often portrayed to be.

Sure, you can trade runtime performance in order to reduce build times, but to be a bit silly again, if this is such a feasible option, why do large games not do it?

Yep, that is silly indeed. Cannot quite use what one does not have - D3D, as far as i gather from your responses, does not have the option (no intermediate / whole-program-optimization) to begin with. Asking why the option that does not exist is not used more often ... well, good question.

Another reason why runtime compilation in GL-land is a bad thing, is because the quality of the GLSL implementation varies widely between drivers.

Having an extra specification and implementation is unlikely to be less problematic than not having the extra.

To deal with this, Unity has gone as far as to build their own GLSL compiler, which parses their GLSL code and then emits clean, standardized and optimized GLSL code, to make sure that it runs the same on every implementation. Such a process is unnecessary in D3D due to there being a single, standard compiler implementation.

... continuation: It is unnecessary in OGL too - GLSL etc is well specified. If implementers fail to follow the spec then changing the spec content (intermediate spec etc) to make them somehow read the darn thing and not fuck up implementing that ... is silly.

Leaving the bad example aside, what you wanted to say, if i may, is that a third-party (Khronos) compiler would be helpful as it would leave only the intermediate for the driver.

Perhaps. However, I highly doubt it would be any less buggy. Compilers are not rocket science (having written a few myself) - the inconsistencies stem from a smaller user base, some extremely lazy driver developers, and shader writers not reading the spec either. A "Khronos compiler" would not have fixed any of those.

---------------------------
I hope we are not annoying OP with this somewhat OT tangent (assessing the merits of an intermediate language in the context of OGL and D3D). At least i can say i know more about the D3D side than before - yay, and thanks for that :). Got my answer, which turned out to be relevant to OP too, as OGL has the D3D mix-and-match option as well, which, if used, has indeed the same limitations.

need sleep.
edit: yep, definitely bedtime.

Did not quite understand what you said here (not sure which parts refer to D3D and which OGL).

Neither D3D nor GL (by default) will remove an unused variable from a uniform block / cbuffer. So, if you're putting every possible value for 100 different shaders into one big interface block and hoping that GL will remove the unused ones, this won't happen unless you use the 'packed' layout. In D3D it just won't happen.
In either case, I'd recommend that the OP take responsibility for designing sensible UBO/CBuffer layouts themselves ;) And of course to use UBOs rather than GL2's uniforms ;)
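If the OP does go the 'packed' route instead, the surviving members and their offsets have to be queried per program; a quick sketch (the member names are hypothetical):

const char *names[2] = { "diffuseColor", "unusedValue" };
GLuint indices[2];
glGetUniformIndices(program, 2, names, indices);

for (int i = 0; i < 2; ++i) {
    if (indices[i] == GL_INVALID_INDEX) {
        /* The driver removed this member from the packed block entirely. */
        continue;
    }
    GLint offset = 0;
    glGetActiveUniformsiv(program, 1, &indices[i], GL_UNIFORM_OFFSET, &offset);
    /* 'offset' is where this member lives inside the UBO for this particular
       program - it can differ between programs, which is exactly why packed
       layouts can't be filled from pre-built data. */
}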

What optimizations (makes zero difference at driver/OGL/GPU side)? You mean CPU side, ie. filling buffers with data? Yeah, it would be pretty painful not to use std140.

Yeah, on GL without std140, you can't build your UBO contents ahead of time. For example, in my engine all the materials are saved to disc in the same format that they'll be used in memory, so they can be read straight from disc into a UBO/CBuffer. With GL2 only, this optimisation isn't possible due to it rearranging your data layouts unpredictably. D3D9 doesn't support CBuffers(UBOs) but allows you to very efficiently emulate them due to it not attaching values to programs like GL2 and supporting explicit layouts.
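As a sketch of what that looks like with std140 (a hypothetical block; the GLSL side would be something like "layout(std140) uniform MaterialBlock { vec4 diffuseColor; vec4 specularColor; float power; };"):

/* std140 fixes these offsets, so the struct can be written to disc and read
   straight back into the UBO with no per-program patching. */
typedef struct MaterialBlockData {
    float diffuseColor[4];    /* offset 0  */
    float specularColor[4];   /* offset 16 */
    float power;              /* offset 32 */
    float pad[3];             /* keep the size a 16-byte multiple */
} MaterialBlockData;

/* Somewhere in the material loader, after fread()-ing one of these from disc: */
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(MaterialBlockData), &material, GL_STATIC_DRAW);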

FWIW, the GL2 model of dealing with uniforms as belonging to the "shader program" itself does actually make perfect sense on very early SM2/SM3 hardware. Many of these GPUs didn't actually have hardware registers for storing uniform values (or memory fetch hardware to get them from VRAM), so uniforms were implemented only as literal values embedded in the shader code. To update a uniform, you had to patch the actual shader code with a new literal value! By the GeForce8 though, this hardware design had disappeared, so it stopped making sense for the API to bundle together program code instances with uniform values.

Ee.. you lost me here :/. I think you implied an argument where there was none.

Yep :/ I misread your wonderment as a rhetorical question!

Cannot quite use what one does not have - D3D, as far as i gather from your responses, does not have the option (no intermediate / whole-program-optimization) to begin with. Asking why the option that does not exist is not used more often ... well, good question.

Sorry, by this I meant to imply that it is common for games to make more optimal use of the GPU via large numbers of permutations, which increases their shader compilation times. They could reduce their compilation times by using more general (or more branchy) shaders, but that would decrease their efficiency on the GPU.
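A sketch of where those compile times come from, assuming the usual trick of prepending #defines to a shared über-shader body ('uber_body' is a hypothetical GLSL source string):

const char *pieces[3] = {
    "#version 120\n",
    "#define UBER_DIFFUSE_MAPPING\n"
    "#define UBER_SPECULAR_LIGHTING\n",   /* one permutation's feature set */
    uber_body
};
GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
glShaderSource(fs, 3, pieces, NULL);      /* the driver concatenates the strings */
glCompileShader(fs);
/* Repeat with a different pieces[1] for every combination a level can use;
   with N independent flags that is up to 2^N compiles per stage. */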

Then i would say that one is doing something wrong. One does not need thousands of shaders to show the splash screen ;)
i would just like to point out that its absence is not as widespread and grave problem as it is often portrayed to be.

Yeah you don't need them all on the splash screen, but you do need to ensure that every program that could possibly be used by a level is actually compiled before starting the level, to avoid jittery framerates during gameplay.
It is a big problem in my experience too. All the current-gen games that I've worked on (around half a dozen on a few different engines) have had shader build times of 5-10 minutes. This hasn't been a problem for us simply because we haven't shipped Mac, Linux or mobile versions. Windows and consoles had the ability to pre-compile, cutting this time off of the loading screens. If we did need to port to a GL platform, we likely would have increased the GPU requirements or decreased the graphics quality, so that we could reduce the permutation count and use less efficient shaders.

It is unnecessary in OGL too - GLSL etc is well specified. If implementers fail to follow the spec

The compliance of different implementations with the spec isn't the problem here (although, it is a big problem too) -- the spec doesn't define how code should be optimized. Some drivers may aggressively optimize your code, while others may do a literal translation into assembly without even folding constant arithmetic... If it wasn't an issue, then Unity wouldn't have wasted their time solving the problem with their pre-compiler. In general it's still best to write sensible shader code assuming that the compiler will not optimize it at all, but at least with Unity's solution, they can be sure that certain optimizations will always be done on their code, regardless of which driver it's running on.
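A trivial example of what "assume it won't optimize at all" means in practice (GLSL shown as C strings purely for illustration; the numbers are invented):

/* Hopes the driver folds (1.0 / 2.2) * 0.5 into a single constant: */
const char *naive =
    "color.rgb = color.rgb * (1.0 / 2.2) * 0.5 * brightness;\n";

/* Pre-folded by hand - still correct on a driver that emits a literal
   translation, and no slower on one that optimizes: */
const char *hand_folded =
    "color.rgb = color.rgb * (0.22727 * brightness);\n";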

The quality of the optimizer makes a huge difference. On platform A, it had a good compiler, so I could write straightforward code and assume it would run at near theoretical speeds. On platform B with a bad compiler, I reduced a 720p post-processing shader from 4ms to 2ms simply by manually performing all of the optimizations that I assumed the compiler would do for me (and that platform A was doing for me). This was such a problem that me and the other graphics programmers seriously considered taking a few weeks off to build a de-compiler for platform A, so we could use it to optimize our code for platform B!

Hey all,

Yes, as Hodgman suggested, I'm still using OpenGL 2 as it is close to OpenGL ES 2.0. I try to support both, but I'm going to be adding desktop/mobile-specific functionality. Also, I can't figure out how to get OpenGL 3.0+ on Mountain Lion. It seems that the default framework supported is OpenGL 2.1. I'm not sure if it's just a framework limitation or if I have to upgrade the drivers, because it seems that OS X handles the drivers on its own (does OS X only do that with software updates?). Would GLEW be the proper way to go, like I would on Windows?

What kind of functionality am I missing out on besides uniform buffer objects (UBOs)? Those seem quite efficient from what I've read, and it sounds like OpenGL ES 3.0 will be supporting it along with MRTs!

Also, it does sound like this was covered extensively, but I just want to clarify these points:

1) Unused input, output, and uniforms within the shader will not be compiled

2) Unused functions are optimized out

3) Inputs, outputs and uniforms would also be optimized out if they're only used in functions that aren't being called

4) Unused properties will be optimized out

For point #4, I just want to clarify this. Let's say I have the following struct for my Material:


struct Material
{
	sampler2D ambientMap; // used if UBER_AMBIENT_MAPPING defined
	vec4 ambientColor; // used if UBER_AMBIENT_MAPPING is NOT defined
	
	sampler2D diffuseMap; // used if UBER_DIFFUSE_MAPPING defined
	vec4 diffuseColor; // used if UBER_DIFFUSE_MAPPING defined
	
	float power; // used if UBER_SPECULAR_LIGHTING defined
	sampler2D specularMap; // used if UBER_SPECULAR_MAPPING defined
	vec4 specularColor; // used if UBER_SPECULAR_MAPPING is NOT defined
	
	sampler2D normalMap; // used if UBER_NORMAL_MAPPING defined
	
	// these are only used if UBER_ENVIRONMENT_MAPPING is defined
	float reflectFactor;
	float refractFactor;
	float reflectIndex;
	samplerCube envMap;
};

The struct above has many properties, but not all of them will be used in a single configuration of my über shader. For example, ambientMap is only used if the UBER_AMBIENT_MAPPING preprocessor macro is defined, and if it isn't, ambientColor is the fallback. This would be useful for games that have models that don't necessarily need texturing for everything. The last few uniforms aren't even used unless environment mapping is enabled for reflection and refraction. Right now, my struct code is littered with #ifdef, #else, #endif preprocessor directives. This bloats the code and makes it messy. I like to keep my code looking as clean as possible too...
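One way to sanity-check point #4 per configuration (a sketch, assuming the struct is declared as "uniform Material material;" and the program has been linked): plain uniforms that the compiled code never touches report a location of -1.

GLint loc = glGetUniformLocation(program, "material.ambientColor");
if (loc == -1) {
    /* Optimized out in this configuration (e.g. UBER_AMBIENT_MAPPING was
       defined), so skip the glUniform4fv() call for it. */
}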

I'd also like to point out that my über shader system does recompile shaders on-the-fly if desired. For example, if no global lights, point lights, or spot lights are being used, or specular lighting is toggled, etc., everything would be recompiled. This reduces the number of shader configurations sitting around. Although I'm aware it's very expensive to recompile, I do offer the option in my Scene class' code: enabling/disabling certain states adds or removes certain scene-wide über shader flags, which in turn removes those flags from the loaded models attached to that scene, and then recompilation occurs. I look at it this way: it's better to recompile on-the-fly so shaders can drop unneeded features. Fewer features generally result in faster processing per-vertex and per-fragment.

EDIT: I re-read this again after posting:

The compliance of different implementations with the spec isn't the problem here (although, it is a big problem too) -- the spec doesn't define how code should be optimized. Some drivers may aggressively optimize your code, while others may do a literal translation into assembly without even folding constant arithmetic... If it wasn't an issue, then Unity wouldn't have wasted their time solving the problem with their pre-compiler. In general it's still best to write sensible shader code assuming that the compiler will not optimize it at all, but at least with Unity's solution, they can be sure that certain optimizations will always be done on their code, regardless of which driver it's running on.

The quality of the optimizer makes a huge difference. On platform A, it had a good compiler, so I could write straightforward code and assume it would run at near theoretical speeds. On platform B with a bad compiler, I reduced a 720p post-processing shader from 4ms to 2ms simply by manually performing all of the optimizations that I assumed the compiler would do for me (and that platform A was doing for me). This was such a problem that me and the other graphics programmers seriously considered taking a few weeks off to build a de-compiler for platform A, so we could use it to optimize our code for platform B!

This is how I've been doing everything, but then everything is littered with messy preprocessor statements and is difficult to edit later on as features are added. Should I continue using preprocessors everywhere to ensure that unused stuff isn't compiled in? If so, maybe I should continue my XML schema to do this.
