Vincent_M

Optimizing Out Uniforms, Attributes, and Varyings


Recommended Posts

I've decided to take the über-shader approach. Because of this, all of my uniforms, attributes and varyings end up declared at the top of the shader. I started using #ifdef/#endif preprocessor directives to control whether each declaration actually gets included in the shader, but that quickly gets messy and hard to read. Would it be bad practice to just remove all of these preprocessor checks and let the compiler optimize the unused declarations out? Shader compilers typically do this anyway, so I'd think it might even speed up compilation since the compiler wouldn't have to process all the extra preprocessor text and checks, right?
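For what it's worth, here is a minimal sketch (my own illustration, not code from this post) of the über-shader setup in question: feature #defines are prepended to a single source string at compile time via glShaderSource, and whatever ends up unused is left for the compiler to strip. The helper name, the feature defines, and the assumption that the shader body contains no #version line of its own are all made up for the example.

```cpp
// Sketch: build one uber-shader and toggle features by prepending #defines,
// rather than hand-maintaining #ifdef blocks around every declaration.
// USE_NORMAL_MAP / USE_SKINNING and compileUberShader are illustrative names.
#include <GLES2/gl2.h>   // or your desktop GL loader, e.g. <GL/glew.h>
#include <string>

GLuint compileUberShader(GLenum stage, const char* body,
                         bool normalMap, bool skinning)
{
    std::string defines = "#version 100\n";             // version must come first
    if (normalMap) defines += "#define USE_NORMAL_MAP\n";
    if (skinning)  defines += "#define USE_SKINNING\n";

    const char* sources[2] = { defines.c_str(), body }; // glShaderSource takes N strings
    GLuint shader = glCreateShader(stage);
    glShaderSource(shader, 2, sources, nullptr);
    glCompileShader(shader);

    GLint ok = GL_FALSE;
    glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);      // check the info log in practice
    return ok ? shader : 0;
}
```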


The GL spec talks about 'active' uniforms, and explicitly allows unused ones to be removed entirely. So you should be fine.
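As a rough illustration of what 'active' means here, the sketch below (helper name made up) lists what survived linking; anything the implementation removed as unused simply won't be reported:

```cpp
// Sketch: enumerate the uniforms GL still considers "active" after linking.
#include <GL/glew.h>
#include <cstdio>

void dumpActiveUniforms(GLuint program)
{
    GLint count = 0;
    glGetProgramiv(program, GL_ACTIVE_UNIFORMS, &count);
    for (GLint i = 0; i < count; ++i)
    {
        char   name[256];
        GLint  size = 0;
        GLenum type = 0;
        glGetActiveUniform(program, i, sizeof(name), nullptr, &size, &type, name);
        std::printf("uniform %d: %s (location %d)\n",
                    i, name, glGetUniformLocation(program, name));
    }
}
```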

D3D won't remove unused attributes/varyings, which reduces the performance of your shaders - does anyone know GL's behavior here?


Ok, so it sounds like letting the compiler do it would be the best way to go. Funnily enough, I store all my complex shaders in an XML-based schema which allows my shader class to automatically bind the attributes and then generate a collection of uniform locations for me instead of my having to track them myself, so I probably shouldn't concern myself with efficiency, lol.

 

@Hodgman: OpenGL (ES) seems to optimize out the unused declarations on all the platforms I've tested it on. That said, I'm not sure whether that will always be the case...


D3D won't remove unused attributes/varyings, which reduces the performance of your shaders...

Really? If they are unused, then any compiler worth anything should do proper dead-code elimination. I've never used D3D myself, but I find it extremely difficult to believe that both the D3D compiler and the GPU-specific recompiler (which is free of whatever restrictions exist on the D3D side) would choose not to do that.

There has to be some confusion here about terms or something.

OpenGL: unused uniforms and attributes can be removed (it's a GLSL implementation choice, but I've never seen an implementation insane enough not to do it), and their location will be reported as "-1" (the same as when querying identifiers that were not in the source to begin with).
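A small sketch of what that -1 behaviour means for generic binding code (the uniform names and helper are hypothetical): querying something that was optimised away returns -1, and the spec says glUniform* calls with a -1 location are silently ignored, so nothing needs to be special-cased.

```cpp
// Sketch: -1 locations from optimised-away uniforms are safe to use as-is.
#include <GL/glew.h>

void applyCommonUniforms(GLuint program, const float mvp[16], const float tint[4])
{
    GLint mvpLoc  = glGetUniformLocation(program, "u_ModelViewProjection"); // hypothetical name
    GLint tintLoc = glGetUniformLocation(program, "u_Tint");                // -1 if unused/removed

    glUseProgram(program);
    glUniformMatrix4fv(mvpLoc, 1, GL_FALSE, mvp);
    glUniform4fv(tintLoc, 1, tint);   // silently ignored by GL when tintLoc == -1
}
```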




D3D won't remove unused attributes/varyings, which reduces the performance of your shaders...

Really? If they are unused, then any compiler worth anything should do proper dead-code elimination. I've never used D3D myself, but I find it extremely difficult to believe that both the D3D compiler and the GPU-specific recompiler (which is free of whatever restrictions exist on the D3D side) would choose not to do that.

There has to be some confusion here about terms or something.
Perhaps your D3D driver can perform this optimization, but D3D itself can't, for correctness reasons. Vertex shader input structures ('attributes') have to match up with the 'input layout' descriptor (not sure of the OGL name - the code that binds your attributes?). D3D represents the way that data is read from buffers/streams into vertex shader attribute inputs as this descriptor object, which relies on the fact that the shader author can hard-code their "attribute locations" and then put the same hard-coded values into the descriptor without querying.
GL on the other hand requires you to reflect on the shader to discover attribute locations after compilation, allowing them to move around or disappear.

With 'varyings' (interpolated vertex outputs and pixel inputs), these are usually described as a struct in D3D, where each member is given a hard-coded location/register number by the shader author. This structure has to match exactly in both the pixel and vertex shader. D3D compiles its shaders in isolation, so when compiling the vertex shader it has no way to know whether a varying is actually used in the pixel shader, and therefore it can't remove any of them. If any are unused in the vertex shader, you'll get a big warning about returning uninitialized variables. If any are unused in the pixel shader, the compiler can't cull them either, because the interface with the vertex shader would no longer match up.
This design choice allows you to pre-compile all your shaders individually offline, and then use them in many ways at runtime with extremely little error checking or linking code inside the driver.
GL can cull variables because it requires both an expensive compilation and linking step to occur at runtime. Basically, D3D traded a small amount of shader author effort in order to greatly simplify the runtimes for CPU performance.
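To make the interface matching above concrete, here is a hedged sketch (the HLSL sits in a C++ raw string only so the sample stays one file; entry points, semantics and member names are my own illustration, not taken from the thread). Each entry point is compiled in isolation, so neither compile may drop a member of the shared struct.

```cpp
// Sketch: one varying struct shared by both stages; it must match exactly,
// and because each stage is compiled separately, unused members stay put.
#include <d3dcompiler.h>
#include <cstring>
#pragma comment(lib, "d3dcompiler.lib")

static const char* kShaderSource = R"hlsl(
struct VSOutput                       // also serves as the pixel-shader input
{
    float4 pos : SV_POSITION;
    float2 uv  : TEXCOORD0;
    float3 nrm : TEXCOORD1;           // unused below, but cannot be culled
};

VSOutput VSMain(float3 p : POSITION, float2 uv : TEXCOORD0, float3 n : NORMAL)
{
    VSOutput o;
    o.pos = float4(p, 1.0f); o.uv = uv; o.nrm = n;
    return o;
}

float4 PSMain(VSOutput i) : SV_TARGET
{
    return float4(i.uv, 0.0f, 1.0f);  // never reads i.nrm
}
)hlsl";

ID3DBlob* compileStage(const char* entry, const char* target)  // e.g. "VSMain", "vs_5_0"
{
    ID3DBlob* code = nullptr;
    ID3DBlob* errors = nullptr;
    D3DCompile(kShaderSource, std::strlen(kShaderSource), nullptr, nullptr, nullptr,
               entry, target, 0, 0, &code, &errors);
    if (errors) errors->Release();    // inspect the messages in real code
    return code;
}
```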

For reference, as your OGL information is horrifically out of date, this is how it goes in OGL land:

First, shader stage inputs and outputs, regardless of stage, are simply called input and output: "in" / "out". The (imho pointless) notion of "attributes" and "varyings" in GLSL source has been deprecated - good riddance.

Information is exchanged between shader stages per variable and/or via one or more interface blocks (I use only interface blocks, except for vertex inputs and fragment outputs, as the drivers were a bit buggy way back then and I got too used to not using interface blocks there).

Vertex shader inputs (i.e. attributes, sourced from buffers, or dangling attributes if not) do need a location number, which is usually specified in the GLSL source ("layout(location=7)") or with the rarely used alternative of querying/changing them outside GLSL.
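A quick sketch of both routes mentioned above (the attribute name is illustrative; the GLSL is shown as a string only to keep the sample in one language):

```cpp
// Sketch: pinning a vertex attribute to location 7 either in the GLSL source
// or from the API before linking.
#include <GL/glew.h>

// (a) in the source itself (GL 3.3+ / ARB_explicit_attrib_location):
static const char* kVS = R"glsl(
#version 330 core
layout(location = 7) in vec3 a_Position;
void main() { gl_Position = vec4(a_Position, 1.0); }
)glsl";

// (b) the older "outside GLSL" alternative: bind by name, then (re)link.
void pinAttribLocation(GLuint program)   // program already has shaders attached
{
    glBindAttribLocation(program, 7, "a_Position");
    glLinkProgram(program);              // the binding takes effect at link time
}
```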

Fragment shader outputs work similarly (MRT for example).

Not sure whether one can specify variable locations inside or outside interface blocks elsewhere - it would be an insanely ridiculous thing to do (common sense would leave it as an implementation detail outside the OGL specification), so I highly doubt it is even allowed. The whole location querying/setting business is just so silly that I did my best to forget all of it the moment the alternative got added to core OGL. So I cannot say for certain that it is impossible.

Variables outside interface blocks are matched by name and interface blocks by block name (not the variable name that uses the block - which is very convenient). Sounds very similar to D3D, except the mandatory register allocation stuff.

Shader stages are compiled in isolation (that is the way it has always been) and linked together into shader program(s) later. If an input is not used then it is thrown away - D3D probably does the same, no? A shader stage cannot know whether its output is used, so it is kept, of course.

OGL is usually a good specification and, as expected, what exactly "compiling" and "linking" do under the hood is an implementation detail; driver writers are free to do what they think is best for their particular hardware. Generally, though, the final steps of compilation are done at link time for better results (like "whole program optimization" in VC). I cannot see any reason for D3D not to do the same (it needs to recompile the intermediate anyway).

PS. Shader programs can be extracted as binary blobs for caching, to skip all of the compiling/linking altogether - I have never found any reason to use them myself (i.e. never suffered shader-count explosion).
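For reference, a sketch of that binary-blob path (GL 4.1 / ARB_get_program_binary); file I/O and error handling are left out, and, as noted later in the thread, the blob is only valid on the GPU/driver that produced it:

```cpp
// Sketch: cache a linked program as a driver-specific binary blob and reload it
// later to skip compile+link entirely.
#include <GL/glew.h>
#include <vector>

std::vector<char> saveProgramBinary(GLuint program, GLenum* formatOut)
{
    // For best results, set GL_PROGRAM_BINARY_RETRIEVABLE_HINT via
    // glProgramParameteri before linking the program.
    GLint length = 0;
    glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);
    std::vector<char> blob(length);
    glGetProgramBinary(program, length, nullptr, formatOut, blob.data());
    return blob;                        // persist the blob and the format together
}

void loadProgramBinary(GLuint program, GLenum format, const std::vector<char>& blob)
{
    glProgramBinary(program, format, blob.data(), (GLsizei)blob.size());
    // Check GL_LINK_STATUS afterwards; a driver update invalidates old blobs.
}
```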

"uniforms" are per shader stage and are hence easy to throw away at compile time. Uniform locations can also have their location defined in glsl source - however, they kind of "forgot" to add that ability with 3.* core, so either 4.3 core or ARB_explicit_uniform_location is needed.

This design choice allows you to pre-compile all your shaders individually offline, and then use them in many ways at runtime with extremely little error checking or linking code inside the driver.
GL can cull variables because it requires both an expensive compilation and linking step to occur at runtime. Basically, D3D traded a small amount of shader author effort in order to greatly simplify the runtimes for CPU performance.

AFAICS, both directions have their good and bad points and a lot of muddy water in between. However, I can share my observations from the OGL side of the fence (NB! just observations - I cannot say whether or how much they still hold nowadays).

Having a precompiled intermediate is one of the most recurring requests on the OGL side (even after binary blobs were added), with D3D brought up as an example time and time and time again. So, what's the holdup? If it makes sense for OGL, then why is it not added? To paraphrase the people who actually write the drivers: the D3D intermediate destroys information, information which is vital for recompiling and optimizing that intermediate (which is far from trivial) into what the hardware actually needs. I imagine the D3D intermediate has improved significantly over the years (i.e. adding more high-level constructs into it - making it a high-level language and undoing the gains the intermediate initially had), but I'm not sure (driver devs have become a rarity in public forums, to put it mildly). Either way, it cannot be better than not having the middle-muddle at all.

All it is good for is faster compilation times (at least that is often claimed, though I suspect the claims might be fairly out of date) - which is still way slower than no compiling/linking at all with OGL binary blobs (surely D3D has something similar?). Except one needs to cache those first ... dang.


Addendum: I did not quite remember what "separate program objects" brought to OGL land ... well, I reminded myself: it is a way to use the D3D mix-and-match approach of forming a shader program from different stages. Similarly, with the same pitfalls - there is no "whole program optimization" done (although, I guess, the driver might choose to do that in the background when it gets some extra time). There's a sketch of this mix-and-match route after the summary below.

 

In short:

D3D: mix-and-match.

OGL: whole-program, or mix-and-match if you want it.
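For the mix-and-match side of that summary, here is a hedged sketch of GL's separate shader objects (GL 4.1 / ARB_separate_shader_objects); the helper name is made up:

```cpp
// Sketch: each stage becomes its own single-stage program, combined through a
// pipeline object with no whole-program link across stages.
#include <GL/glew.h>

GLuint makePipeline(const char* vsSource, const char* fsSource)
{
    GLuint vs = glCreateShaderProgramv(GL_VERTEX_SHADER,   1, &vsSource);
    GLuint fs = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fsSource);

    GLuint pipeline = 0;
    glGenProgramPipelines(1, &pipeline);
    glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT,   vs);
    glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, fs);
    // Any other fragment program with a matching interface can later be swapped
    // in with another glUseProgramStages call (bind the pipeline with
    // glBindProgramPipeline to draw with it).
    return pipeline;
}
```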


For reference, as your OGL information is horrifically out of date, this is how it goes in OGL land:
First, shader stage inputs and outputs, regardless of stage, are simply called input and output: "in" / "out". The (imho pointless) notion of "attributes" and "varyings" in GLSL source has been deprecated - good riddance.

Yeah I was just sticking with the terminology already present in the thread.

The new in/out system in GLSL is much more sensible, especially once you add more stages to the pipeline between vertex and pixel.

 

Seeing that the OP is using this terminology, though, perhaps they're using GL2 instead of GL3 or GL4, which limits their options.

GL2's API for dealing with shaders, uniforms, attributes and varyings is absolutely terrible compared to the equivalents in D3D9, or the more modern APIs of D3D10/GL3...

 

 

BTW, when using interface blocks for uniforms, the default behaviour in GL is similar to D3D in that all uniforms in the block will be "active" regardless of whether they're used or not - no optimisation to remove unused uniforms is done. It's nice that GL gives you a few options here, though (assuming that every GL implementation acts the same way with these options...).

D3D's behaviour is similar to the std140 layout option, where the memory layout of the buffer is defined by the order of the variables in your block and some packing rules. No optimisation will be done on the layout of the "cbuffer" (D3D uniform interface block), due to it acting as a layout definition for your buffers.

With the default GL behaviour, the layout of the block isn't guaranteed, which means you can't precompile your buffers either. The choice to allow this optimisation to take place means that you're unable to perform other optimisations.

Again though, assuming your GL implementation is up to date, you've got the option of enabling packing rules or optimisations, or neither.
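To make the std140 discussion concrete, here is a hedged sketch (block and member names are illustrative; the GLSL is shown as a string only to keep the sample in one language): with std140 the layout follows declaration order plus the packing rules, so the buffer contents can be built up front.

```cpp
// Sketch: a std140 uniform block and the code that binds it to a buffer.
#include <GL/glew.h>

static const char* kBlock = R"glsl(
layout(std140) uniform PerFrame
{
    mat4 viewProj;
    vec4 cameraPos;     // layout stays fixed even if a given shader ignores it
};
)glsl";

void bindPerFrameBlock(GLuint program, GLuint ubo)
{
    const GLuint bindingPoint = 0;
    GLuint index = glGetUniformBlockIndex(program, "PerFrame");
    if (index != GL_INVALID_INDEX)
    {
        glUniformBlockBinding(program, index, bindingPoint);
        glBindBufferBase(GL_UNIFORM_BUFFER, bindingPoint, ubo);
    }
}
```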

 

 

Shader stages are compiled in isolation (that is the way it has always been) and linked together into shader program(s) later. If an input is not used then it is thrown away - D3D probably does the same, no?

This is the difference I was trying to point out.

D3D does not have an explicit linking step, which is the only place where it's safe to perform optimizations on the layout of the interface structures.

In D3D9 there was an implicit linking step, but it's gone in D3D10. Many GL implementations are also notorious for doing this kind of lazy linking, which is why many engines issue a "fake draw call" after binding shaders, to ensure that they're actually compiled/linked when you want them to be and to avoid CPU performance hiccups during gameplay...

Once you've created the individual D3D shader programs for each stage (vertex, pixel, etc.), it's assumed that you can use them straight away, in what you call a mix-and-match fashion, as long as you're careful to only mix and match shaders whose interfaces match exactly, without the runtime doing any further processing/linking.

In D3D9 there was some leeway in whether the interfaces matched exactly, but this required the runtime to do some checking/linking work as a regular part of draw calls (part of the reason its draw calls are more expensive than D3D10/GL), so this feature was scrapped. Now it's up to the programmer to make sure that their shaders will correctly link together as they author them, so that no validation/fix-ups need to be done at runtime.

 

PS. Shader programs can be extracted as binary blobs for caching, to skip all of the compiling/linking altogether - I have never found any reason to use them myself (i.e. never suffered shader-count explosion).

This feature isn't the same as D3D's compilation -- you can only extract blobs for the current GPU and driver. The developer can't precompile their shaders and just ship the blobs.

Having a precompiled intermediate is one of the most recurring requests on the OGL side (even after binary blobs were added), with D3D brought up as an example time and time and time again. So, what's the holdup? If it makes sense for OGL, then why is it not added?

That's a pretty silly argument. To make a similarly silly one from the other side of the fence: in Windows 8 Metro, you can't compile HLSL shaders at runtime at all, but are forced to pre-compile them into bytecode ahead of time and ship these blobs to the customer. If runtime compilation is so important, then why was it removed (by a group that is of about equal importance/expertise to Khronos/ARB)?

Yeah, the driver's internal compiler could do a better job with the high-level source rather than pre-compiled bytecode, but the fact is that compiling HLSL/GLSL is slow. D3D's option to load pre-compiled bytecode shaders is an order of magnitude faster. You may not have personally run into a problem with it, but plenty of developers have, which is why this feature is so popular. Just as large C++ games can take anywhere from minutes to hours to build, the shader code-bases in large games can take anywhere from a few seconds to half an hour... Even with the caching option, it's very poor form to require your users to wait 10 minutes the first time they load the game.

Sure, you can trade runtime performance in order to reduce build times, but to be a bit silly again, if this is such a feasible option, why do large games not do it?

Another reason why runtime compilation in GL-land is a bad thing, is because the quality of the GLSL implementation varies widely between drivers. To deal with this, Unity has gone as far as to build their own GLSL compiler, which parses their GLSL code and then emits clean, standardized and optimized GLSL code, to make sure that it runs the same on every implementation. Such a process is unnecessary in D3D due to there being a single, standard compiler implementation.

Edited by Hodgman


In all of this, it needs to be noted that D3D shader compilation is actually a two-stage process.

 

Stage 1 takes the shader source code and compiles it to a platform-independent binary blob (D3DCompile).

Stage 2 takes that platform-independent binary blob and converts it to a platform-specific shader object (device->Create*Shader).

 

Stage 1 is provided by Microsoft's HLSL compiler and can be assumed to be the slowest part because it involves preprocessing, error checking, translation, optimization, etc (the classic C/C++ style compilation model, in other words).  The binary blob produced at the end of this stage is what people are talking about when they mention shipping binary blobs with your game.

 

Stage 2 is provided by the vendor's driver and can be assumed to be much faster as it's just converting this platform-independent blob to something platform-specific and loading it to the GPU.  The driver can assume that the blob it's fed has already passed all of the more heavyweight tests at stage 1, although what drivers do or do not assume would be driver-dependent behaviour.

 

So our stage 1 can be performed offline and the platform-independent blob shipped with the game. Stage 2 - the faster stage - is all that needs to be performed at load/run time. What's interesting about this model is that the compiler used for stage 1 doesn't have to be Microsoft's; it can in theory be any compiler - so long as the blob is correct and consistent, it is acceptable input to stage 2. MS themselves proved this by switching D3D9 HLSL to the D3D10 compiler years ago. If you had the specs (presumably available in the DDK) you could even write your own; you could even, in theory, write one that takes GLSL code as input and outputs a D3D-compatible blob.
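Here is a hedged sketch of those two stages back to back (the entry point, target profile and the assumption of an existing ID3D11Device are my own; error handling omitted):

```cpp
// Sketch: stage 1 (D3DCompile -> portable bytecode blob, normally done offline
// and shipped) followed by stage 2 (driver turns the blob into a GPU shader).
#include <d3d11.h>
#include <d3dcompiler.h>

ID3D11VertexShader* createVS(ID3D11Device* device, const char* hlsl, size_t length)
{
    // Stage 1: the slow, platform-independent part. Offline you'd run fxc (or
    // D3DCompile) once and save the blob, e.g. with D3DWriteBlobToFile.
    ID3DBlob* bytecode = nullptr;
    ID3DBlob* errors   = nullptr;
    D3DCompile(hlsl, length, nullptr, nullptr, nullptr,
               "VSMain", "vs_5_0", 0, 0, &bytecode, &errors);

    // Stage 2: the fast, driver-side conversion at load/run time.
    ID3D11VertexShader* vs = nullptr;
    device->CreateVertexShader(bytecode->GetBufferPointer(),
                               bytecode->GetBufferSize(), nullptr, &vs);

    bytecode->Release();
    if (errors) errors->Release();
    return vs;
}
```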

 

There is no good reason whatsoever why GLSL can't adopt a similar compilation behaviour; this has nothing to do with OpenGL itself, it's purely an artificial software artefact of how GLSL has been specified. How can I say this with confidence? Simple - I just look at the old GL_ARB_vertex_program and GL_ARB_fragment_program extensions, and I see that they were closer to the D3D model than they were to GLSL. No linking stage, separation of vertex and fragment programs, the ability to mix and match, arbitrary inputs and outputs, both shared and local standalone uniforms; they had all of these and required much less supporting infrastructure and boilerplate to be written before you could use them. GLSL, despite offering a high-level language, was in many ways a huge step backwards from these extensions.
