
Optimizing Out Uniforms, Attributes, and Varyings



#1 Vincent_M   Members   -  Reputation: 638


Posted 15 May 2013 - 03:27 PM

I've decided to take the über shader approach. Because of this, I'm going to have all of my uniforms, attributes, and varyings floating around at the top of my shader. I started using #ifdef / #endif preprocessor guards to control whether each piece of data gets included in the shader or not, but that gets messy quickly and becomes difficult to read. Would it be bad practice to just remove all of these preprocessor guards and let the compiler optimize out whatever ends up unused? Shader compilers typically do this anyway, so I would think this might even speed up compilation a little, since I wouldn't be feeding in all the extra preprocessor text and checks, right?
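For concreteness, here's a minimal sketch of the two styles I'm comparing, with the GLSL declarations embedded as C++ string literals (the uniform/varying names are just examples):

// Style 1: declarations wrapped in preprocessor guards.
const char* guardedDecls = R"glsl(
#ifdef UBER_NORMAL_MAPPING
uniform sampler2D u_normalMap;
varying vec3 v_tangentLightDir;
#endif
)glsl";

// Style 2: declare everything unconditionally. If no enabled code path ever reads
// u_normalMap / v_tangentLightDir, the compiler/linker may treat them as inactive
// and drop them.
const char* unguardedDecls = R"glsl(
uniform sampler2D u_normalMap;
varying vec3 v_tangentLightDir;
)glsl";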




#2 Promit   Moderators   -  Reputation: 7196


Posted 15 May 2013 - 04:23 PM

GL talks about 'active' uniforms, and explicitly provides for unused stuff to be removed entirely. So you should be fine.
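If you want to sanity-check that on a given driver, here's a minimal sketch (assuming a loader such as GLEW and an already-linked program) that lists whatever is still considered active - anything the compiler removed simply won't appear:

#include <GL/glew.h>
#include <cstdio>

// List the uniforms the driver still considers "active" after linking.
void PrintActiveUniforms(GLuint program)
{
    GLint count = 0;
    glGetProgramiv(program, GL_ACTIVE_UNIFORMS, &count);

    for (GLint i = 0; i < count; ++i)
    {
        char    name[256];
        GLsizei length = 0;
        GLint   size   = 0;  // array size (1 for non-arrays)
        GLenum  type   = 0;  // e.g. GL_FLOAT_VEC4, GL_SAMPLER_2D
        glGetActiveUniform(program, (GLuint)i, (GLsizei)sizeof(name), &length, &size, &type, name);
        std::printf("active uniform %d: %s (type 0x%X, size %d)\n", i, name, type, size);
    }
}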



#3 Hodgman   Moderators   -  Reputation: 30388


Posted 15 May 2013 - 07:39 PM

D3D won't remove unused attributes/varyings, which reduces the performance of your shaders - does anyone know GL's behavior here?

#4 Vincent_M   Members   -  Reputation: 638


Posted 15 May 2013 - 10:45 PM

Ok, so it sounds like letting the compiler handle it would be the best way to go. Funnily enough, I store all my complex shaders in an XML-based schema which allows my shader class to automatically bind the attributes and then generate a collection of uniform locations for me instead of having to track them myself, so I probably shouldn't concern myself with efficiency, lol.

 

@Hodgman: OpenGL (ES) seems to optimize out the unused stuff above on all the platforms I've tested. Now, I'm not sure if this is always going to be the case or not...



#5 tanzanite7   Members   -  Reputation: 1295


Posted 17 May 2013 - 12:41 PM

D3D won't remove unused attributes/varyings, which reduces the performance of your shaders...

Really? If they are unused, then any compiler worth anything should do proper dead-code elimination. I've never used D3D myself, but I find it extremely difficult to believe that both the relevant D3D compiler and the relevant recompiler for the GPU (which is free of whatever restrictions exist on the D3D side) would choose not to do that.

There has to be some confusion here about terms or something.

OpenGL: Unused uniforms and attributes can be removed (it's a GLSL implementation choice - I have never seen an implementation daft enough not to) and their location will be reported as "-1" (the same as when querying identifiers that were not in the source to begin with).
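A minimal sketch of what that looks like from the C++ side (the uniform name is hypothetical; assumes the program is currently bound):

#include <GL/glew.h>  // or whichever GL loader/header you use

void SetFogColor(GLuint program, const float fogColor[4])
{
    // Querying a uniform that the compiler threw away returns -1, the same value
    // you'd get for a name that was never in the source. glUniform* calls with a
    // -1 location are silently ignored, so this is safe for any permutation.
    GLint loc = glGetUniformLocation(program, "u_fogColor");
    glUniform4fv(loc, 1, fogColor);  // no-op if loc == -1
}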

#6 Hodgman   Moderators   -  Reputation: 30388


Posted 17 May 2013 - 06:49 PM



D3D won't remove unused attributes/varyings, which reduces the performance of your shaders...

Really? If they are unused, then any compiler worth anything should do proper dead-code elimination. I've never used D3D myself, but I find it extremely difficult to believe that both the relevant D3D compiler and the relevant recompiler for the GPU (which is free of whatever restrictions exist on the D3D side) would choose not to do that.

There has to be some confusion here about terms or something.
Perhaps your D3D driver can perform this optimization, but D3D itself can't, for correctness reasons. Vertex shader input structures ('attributes') have to match up with the 'input layout' descriptor (not sure of the OGL name - the code that binds your attributes?). D3D represents the way that data is read from buffers/streams into vertex shader attribute-inputs as this descriptor object, which relies on the fact that the shader author can hard-code their "attribute locations" and then put the same hard-coded values into the descriptor without querying.
GL on the other hand requires you to reflect on the shader to discover attribute locations after compilation, allowing them to move around or disappear.

With 'varyings' (interpolated vertex outputs and pixel inputs), these are usually described as a struct in D3D, where each member is given a hard-coded location/register number by the shader author. This structure has to match exactly between the vertex and pixel shaders. D3D compiles its shaders in isolation, so when compiling the vertex shader, it has no way to know whether a varying is actually used in the pixel shader or not, and therefore it can't remove any of them. If any are unused in the vertex shader, you'll get a big warning about returning uninitialized variables. If any are unused in the pixel shader, the compiler can't cull them because the interface with the vertex shader would no longer match up.
This design choice allows you to pre-compile all your shaders individually offline, and then use them in many ways at runtime with extremely little error checking or linking code inside the driver.
GL can cull variables because it requires both an expensive compilation and linking step to occur at runtime. Basically, D3D traded a small amount of shader author effort in order to greatly simplify the runtimes for CPU performance.
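To make that concrete, here's a rough D3D11-flavoured sketch of such a descriptor (semantic names, formats and offsets are illustrative only) - the elements have to line up with the vertex shader's input signature, which is why the runtime can't silently drop attributes behind your back:

#include <d3d11.h>

// Matching HLSL vertex input, for reference:
//   struct VSInput { float3 pos : POSITION; float3 nrm : NORMAL; float2 uv : TEXCOORD0; };
HRESULT CreateLayout(ID3D11Device* device, ID3DBlob* vsBytecode, ID3D11InputLayout** outLayout)
{
    const D3D11_INPUT_ELEMENT_DESC elements[] =
    {
        { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0,  D3D11_INPUT_PER_VERTEX_DATA, 0 },
        { "NORMAL",   0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
        { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT,    0, 24, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    };
    // The layout is validated against the compiled VS bytecode's input signature.
    return device->CreateInputLayout(elements, 3,
                                     vsBytecode->GetBufferPointer(),
                                     vsBytecode->GetBufferSize(),
                                     outLayout);
}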

#7 tanzanite7   Members   -  Reputation: 1295


Posted 18 May 2013 - 03:49 PM

For reference, as your OGL information is horrifically out of date, this is how it goes in OGL land:

First, shader stage inputs and outputs, regardless of stage, are called ... input and output: "in" / "out". The, imho pointless, notions of "attributes" and "varyings" in glsl source have been deprecated - good riddance.

Information is exchanged between shader stages per variable and/or via one or more interface blocks (I use only interface blocks, except for vertex in and fragment out, as the drivers were a bit buggy way back then and I got just too used to not using the interface blocks there).

Vertex shader inputs (ie. attributes, sourced from buffers or as dangling attributes if not) do need a location number, which is usually specified in the glsl source ("layout(location=7)") or via the rarely used alternative of querying/changing them outside glsl.

Fragment shader outputs work similarly (MRT for example).
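Roughly what that looks like (a sketch with GLSL embedded as C++ string literals, 3.3+ assumed, made-up names), plus the older query/bind alternative:

const char* vsHeader = R"glsl(
#version 330
layout(location = 0) in vec3 a_position;
layout(location = 1) in vec3 a_normal;
)glsl";

const char* fsHeader = R"glsl(
#version 330
layout(location = 0) out vec4 o_color;  // MRT: use locations 1, 2, ... for more targets
)glsl";

// The "outside glsl" alternative mentioned above:
//   before linking:  glBindAttribLocation(program, 0, "a_position");
//   after  linking:  GLint loc = glGetAttribLocation(program, "a_position");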

Not sure whether one can specify variable locations inside or outside interface blocks elsewhere - it would be an insanely ridiculous thing to do (common sense would leave it as an implementation detail outside the OGL specification), so I highly doubt it is even allowed. The whole location querying/setting business is just so silly that I did my best to forget all of it the moment the alternative got added to core OGL. So, I cannot say for certain that it is impossible.

Variables outside interface blocks are matched by name and interface blocks by block name (not the variable name that uses the block - which is very convenient). Sounds very similar to D3D, except the mandatory register allocation stuff.

Shader stages are compiled in isolation (that is the way it has always been), and linked together into shader program(s) later. If an input is not used then it is thrown away - D3D probably does the same, no? A shader stage cannot know whether its output is used, so it will be kept, of course.

OGL is usually a good specification and, as expected, what exactly "compiling" and "linking" do under the hood is an implementation detail; driver writers are free to do what they think is best for their particular hardware. Generally, though, the final steps of compiling are done at link time for better results (like "whole program optimization" in VC). I cannot see any reason for D3D not to do the same (it needs to recompile the intermediate anyway).

PS. shader programs can be extracted as binary blobs for caching to skip all of the compiling/linking altogether - i have never found any reason to use them myself (ie. never suffered shader count explosion).
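For reference, a minimal sketch of that blob mechanism (GL 4.1 / ARB_get_program_binary; error handling omitted):

#include <GL/glew.h>
#include <vector>

// Note: call glProgramParameteri(program, GL_PROGRAM_BINARY_RETRIEVABLE_HINT, GL_TRUE)
// before linking if you intend to retrieve the binary.
std::vector<unsigned char> SaveProgramBlob(GLuint program, GLenum* formatOut)
{
    GLint length = 0;
    glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);
    std::vector<unsigned char> blob(length);
    glGetProgramBinary(program, length, nullptr, formatOut, blob.data());
    return blob;  // only valid for the same GPU/driver it was produced on
}

void LoadProgramBlob(GLuint program, GLenum format, const std::vector<unsigned char>& blob)
{
    glProgramBinary(program, format, blob.data(), (GLsizei)blob.size());
    // Check GL_LINK_STATUS afterwards: the driver may reject a stale blob, in which
    // case you fall back to compiling from source.
}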

"uniforms" are per shader stage and are hence easy to throw away at compile time. Uniform locations can also have their location defined in glsl source - however, they kind of "forgot" to add that ability with 3.* core, so either 4.3 core or ARB_explicit_uniform_location is needed.

This design choice allows you to pre-compile all your shaders individually offline, and then use them in many ways at runtime with extremely little error checking or linking code inside the driver.
GL can cull variables because it requires both an expensive compilation and linking step to occur at runtime. Basically, D3D traded a small amount of shader author effort in order to greatly simplify the runtimes for CPU performance.

AFAICS, both directions have their good and bad points and a lot of muddy water in-between. However, I can share my observations from the OGL side of the fence (NB! Just observations - I cannot say whether, or how much, they still hold water nowadays).

Having a precompiled intermediate is one of the most recurring requests on the OGL side (even after binary blobs were already added) - with D3D brought up as an example time and time and time again. So, what's the holdup? If it makes sense for OGL then why is it not added? To paraphrase what the people who actually write the drivers say: the D3D intermediate destroys information, information which is vital for recompiling and optimizing the D3D intermediate (which is far from trivial) into the stuff the hardware actually needs. I imagine the D3D intermediate has significantly improved over the years (ie. adding more high-level stuff into it - making it a high-level language and undoing the gains the intermediate initially had), but I'm not sure (driver devs have become a rarity in the public forums, to put it mildly). Either way, it cannot be better than not having the middle-muddle at all.

All it is good for is faster compilation times (at least that is often claimed, though I suspect the claims might be fairly out of date) - which is still way slower than no compiling/linking at all with OGL binary blobs (surely D3D has something similar?). Except one needs to cache those first ... dang.

#8 tanzanite7   Members   -  Reputation: 1295


Posted 18 May 2013 - 04:41 PM

Addendum: I did not quite remember what the "separate program objects" brought to OGL land ... well, I reminded myself, and: it is a way to use the D3D mix-and-match approach of forming a shader program from different stages. Similarly, with the same pitfalls - there is no "whole program optimization" done (although, I guess, the driver might choose to do that in the background when it gets some extra time).

 

In short:

D3D: mix-and-match.

OGL: whole-program, or mix-and-match if you want it.



#9 Hodgman   Moderators   -  Reputation: 30388


Posted 19 May 2013 - 02:26 AM

For reference, as your OGL information is horrifically out of date, this is how it goes in OGL land:
First, shader stage inputs and outputs, regardless of stage, are called ... input and output: "in" / "out". The, imho pointless, notions of "attributes" and "varyings" in glsl source have been deprecated - good riddance.

Yeah I was just sticking with the terminology already present in the thread.

The new in/out system in GLSL is much more sensible, especially once you add more stages to the pipeline between vertex and pixel :)

 

Seeing as the OP is using this terminology though, perhaps they're using GL2 instead of GL3 or GL4, which limits their options :(

GL2's API for dealing with shaders, uniforms, attributes and varyings is absolutely terrible compared to the equivalents in D3D9, or the more modern APIs of D3D10/GL3...

 

 

BTW when using interface blocks for uniforms, the default behaviour in GL is similar to D3D in that all uniforms in the block will be "active" regardless of whether they're used or not - no optimisation to remove unused uniforms is done. It's nice that GL gives you a few options here though (assuming that every GL implementation acts the same way with these options...).

D3D's behaviour is similar to the std140 layout option, where the memory layout of the buffer is defined by the order of the variables in your block and some packing rules. No optimisation will be done on the layout of the "cbuffer" (D3D uniform interface block), due to it acting as a layout definition for your buffers.

With the default GL behaviour, the layout of the block isn't guaranteed, which means you can't precompile your buffers either. The choice to allow this optimisation to take place means that you're unable to perform other optimisations.

Again though, assuming your GL implementation is up to date, you've got the option of enabling packing rules or optimisations, or neither.
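To make the std140 point concrete, here's a sketch of a block whose offsets are fixed by the packing rules, so a CPU-side struct (or a blob on disc) can mirror it byte for byte - the names are made up:

// GLSL side (std140 fixes the offsets: mat4 at 0, vec4 at 64, float at 80).
const char* glslBlock = R"glsl(
layout(std140) uniform PerObject
{
    mat4  u_worldViewProj;
    vec4  u_tintColor;
    float u_alphaCutoff;
};
)glsl";

// C++ mirror - can be read straight from disc and copied into the UBO.
struct PerObjectData
{
    float worldViewProj[16];  // offset 0,  64 bytes
    float tintColor[4];       // offset 64, 16 bytes
    float alphaCutoff;        // offset 80, 4 bytes
    float _pad[3];            // keep the total size a multiple of 16 bytes
};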

 

 

Shader stages are compiled in isolation (that is the way it has always been), and linked together into shader program(s) later. If an input is not used then it is thrown away - D3D does probably the same, no?

This is the difference I was trying to point out :D

D3D does not have an explicit linking step, which is the only place where it's safe to perform optimizations on the layout of the interface structures.

In D3D9 there was an implicit linking step, but it's gone in D3D10. Many GL implementations are also notorious for doing lazy linking, like this implicit step, with many engines issuing a "fake draw call" after binding shaders to ensure that they're actually compiled/linked when you wanted them to be, to avoid CPU performance hiccups during gameplay...

Once you've created the individual D3D shader programs for each stage (vertex, pixel, etc), it's assumed that you can use them (in what you call a mix-and-match fashion) straight away, as long as you're careful to only mix-and-match shaders with interfaces that match exactly, without the runtime doing any further processing/linking.

In D3D9, there was some leeway in whether the interfaces matched exactly, but this required the runtime to do some checking/linking work as a regular part of draw calls (part of the reason its draw calls are more expensive than D3D10/GL), so this feature was scrapped. Now, it's up to the programmer to make sure that their shaders will correctly link together as they author them, so that at runtime no validation/fix-ups need to be done.

 

PS. shader programs can be extracted as binary blobs for caching to skip all of the compiling/linking altogether - i have never found any reason to use them myself (ie. never suffered shader count explosion).

This feature isn't the same as D3D's compilation -- you can only extract blobs for the current GPU and driver. The developer can't precompile their shaders and just ship the blobs.

Having a precompiled intermediate is one of the most recurring requests on the OGL side (even after binary blobs were already added) - with D3D brought up as an example time and time and time again. So, what's the holdup? If it makes sense for OGL then why is it not added?

That's a pretty silly argument. To make a similarly silly one from the other side of the fence: in Windows 8 Metro, you can't compile HLSL shaders at runtime at all, but are forced to pre-compile them into bytecode ahead of time and ship these blobs to the customer. If runtime compilation is so important, then why was it removed (by a group of about equal importance/expertise to Khronos/ARB)? :P

Yeah, the driver's internal compiler could do a better job with the high-level source rather than pre-compiled bytecode, but the fact is that compiling HLSL/GLSL is slow. D3D's option to load pre-compiled bytecode shaders is an order of magnitude faster. You may not have personally run into a problem with it, but plenty of developers have, which is why this feature is so popular. Just as large C++ games can take anywhere from minutes to hours to build, the shader code-bases in large games can take anywhere from a few seconds to half an hour... Even with the caching option, it's very poor form to require your users to wait for 10 minutes the first time they load the game.

Sure, you can trade runtime performance in order to reduce build times, but to be a bit silly again, if this is such a feasible option, why do large games not do it?

Another reason why runtime compilation in GL-land is a bad thing, is because the quality of the GLSL implementation varies widely between drivers. To deal with this, Unity has gone as far as to build their own GLSL compiler, which parses their GLSL code and then emits clean, standardized and optimized GLSL code, to make sure that it runs the same on every implementation. Such a process is unnecessary in D3D due to there being a single, standard compiler implementation.


Edited by Hodgman, 19 May 2013 - 02:41 AM.


#10 mhagain   Crossbones+   -  Reputation: 7979


Posted 19 May 2013 - 05:30 AM

In all of this, it needs to be noted that in D3D shader compilation is actually a two stage process.

 

Stage 1 takes the shader source code and compiles it to a platform-independent binary blob (D3DCompile).

Stage 2 takes that platform-independent binary blob and converts it to a platform-specific shader object (device->Create*Shader).

 

Stage 1 is provided by Microsoft's HLSL compiler and can be assumed to be the slowest part because it involves preprocessing, error checking, translation, optimization, etc (the classic C/C++ style compilation model, in other words).  The binary blob produced at the end of this stage is what people are talking about when they mention shipping binary blobs with your game.

 

Stage 2 is provided by the vendor's driver and can be assumed to be much faster as it's just converting this platform-independent blob to something platform-specific and loading it to the GPU.  The driver can assume that the blob it's fed has already passed all of the more heavyweight tests at stage 1, although what drivers do or do not assume would be driver-dependent behaviour.

 

So our stage 1 can be performed offline and the platform-independent blob shipped with the game.  Stage 2 - the faster stage - is all that needs to be performed at load/run time.  What's interesting about this model is that the compiler used for stage 1 doesn't have to be Microsofts; it can in theory be any compiler; so long as the blob is correct and consistent it is acceptable input to stage 2.  MS themselves proved this by switching D3D9 HLSL to the D3D10 compiler years ago.  If you had the specs (presumably available in the DDK) you could even write your own; you could even in theory write one that takes GLSL code as input and outputs a D3D compatible blob.
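In code, the two stages look roughly like this (D3D11 flavour; the entry point and file name are placeholders, and error handling is trimmed):

#include <d3d11.h>
#include <d3dcompiler.h>  // link against d3dcompiler.lib

// Stage 1 (can run offline; the resulting blob is what you'd ship):
ID3DBlob* CompileVS(const char* source, size_t sourceLen)
{
    ID3DBlob* bytecode = nullptr;
    ID3DBlob* errors   = nullptr;
    D3DCompile(source, sourceLen, "shader.hlsl", nullptr, nullptr,
               "VSMain", "vs_5_0", D3DCOMPILE_OPTIMIZATION_LEVEL3, 0,
               &bytecode, &errors);
    if (errors) errors->Release();  // real code would log the error messages
    return bytecode;
}

// Stage 2 (at load time, fast): platform-independent bytecode -> driver-specific shader.
ID3D11VertexShader* CreateVS(ID3D11Device* device, ID3DBlob* bytecode)
{
    ID3D11VertexShader* vs = nullptr;
    device->CreateVertexShader(bytecode->GetBufferPointer(), bytecode->GetBufferSize(),
                               nullptr, &vs);
    return vs;
}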

 

There is no good reason whatsoever why GLSL can't adopt a similar compilation behaviour; this is nothing to do with OpenGL itself, it's purely an artificial software artefact of how GLSL has been specified.  How can I say this with confidence?  Simple - I just look at the old GL_ARB_vertex_program and GL_ARB_fragment_program extensions, and I see that they were closer to the D3D model than they were to GLSL.  No linking stage, separation of vertex and fragment programs, ability to mix and match, arbitrary inputs and outputs, both shared and local standalone uniforms; they had all of these and required much less supporting infrastructure and boilerplate to be written before you could use them.  GLSL, despite offering a high level language, was in many ways a huge step backwards from these extensions.


It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#11 tanzanite7   Members   -  Reputation: 1295


Posted 19 May 2013 - 06:00 PM

Yeah I was just sticking with the terminology already present in the thread.

Ah, that explains it. Now, speaking of which - to OP, are you sure you are not using an unnecessarily old GLSL?

GL2's API for dealing with shaders, uniforms, attributes and varyings is absolutely terrible compared to the equivalents in D3D9, or the more modern APIs of D3D10/GL3...

Quite, I was a bit puzzled myself when GLSL first surfaced. OGL sure has long-lasting developmental issues (stemming from: a crippling need for consensus / compatibility / support).
 

BTW when using interface blocks for uniforms, the default behaviour in GL is similar to D3D in that all uniforms in the block will be "active" regardless of whether they're used or not - no optimisation to remove unused uniforms is done. It's nice that GL gives you a few options here though (assuming that every GL implementation acts the same way with these options...).

Did not quite understand what you said here (not sure which parts refer to D3D and which OGL).
About uniforms with OGL: Uniform buffers are not part of the program object and hence are not directly bound to any program. Plain uniforms are bound to the program object - however, it appears that everyone compiles an internal buffer for those under the hood and the two cases are indistinguishable at the hardware level.

So, in either case, there is no special "loading" code generated for uniforms and the only optimization of removing unused stuff one can speak of is ... well, just do not use the parts of the uniform you do not use ... duh. As the underlying hardware is the same then D3D is bound to end up doing the exact same thing here (unused uniforms are, in all regards that matter, thrown out - regardless of what is seen/reported at API side).

Ie. only buffers are bound - the individual uniforms are just offsets in machine code.

Oh, and thank goodness for std140 or my head would explode in agony.

With the default GL behaviour, the layout of the block isn't guaranteed, which means you can't precompile your buffers either. The choice to allow this optimisation to take place means that you're unable to perform other optimisations.

What optimizations (makes zero difference at driver/OGL/GPU side)? You mean CPU side, ie. filling buffers with data? Yeah, it would be pretty painful not to use std140. IIRC, it was added at the same time as interface blocks - so, if you can use interface blocks then you can always use the fixed format also ... a bit late here to go digging to check it though.

Once you've created the individual D3D shader programs for each stage (vertex, pixel, etc), it's assumed that you can use them (in what you call a mix-and-match fashion) straight away, as long as you're careful to only mix-and-match shaders with interfaces that match exactly, without the runtime doing any further processing/linking.

Yep, got that when I re-read the "separate program objects" extension ( http://www.opengl.org/registry/specs/ARB/separate_shader_objects.txt ) - it uses "mix-and-match" to describe it. It has been core since 4.1. Not using it any time soon, but it's nice to have the option (as most shader programs do not particularly benefit from "whole-program optimization").
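For reference, the mix-and-match path in code (separate shader objects; a sketch with vsSrc/fsSrc as placeholder GLSL sources):

#include <GL/glew.h>

GLuint BuildPipeline(const char* vsSrc, const char* fsSrc)
{
    // Each stage becomes its own single-stage program...
    GLuint vs = glCreateShaderProgramv(GL_VERTEX_SHADER,   1, &vsSrc);
    GLuint fs = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fsSrc);

    // ...and a pipeline object mixes and matches them.
    GLuint pipeline = 0;
    glGenProgramPipelines(1, &pipeline);
    glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT,   vs);
    glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, fs);
    // Later: glBindProgramPipeline(pipeline) instead of glUseProgram(...).
    return pipeline;
}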


Having a precompiled intermediate is one of the most recurring requests on the OGL side (even after binary blobs were already added) - with D3D brought up as an example time and time and time again. So, what's the holdup? If it makes sense for OGL then why is it not added?

That's a pretty silly argument.


Ee.. you lost me here :/. I think you implied an argument where there was none. I was conveying "wonderment" as perceived by me - it is not an argument from me nor from the "wonderer".

But if I were to speculate anyway, then the reason it has not been added to OGL might be:
* Khronos is slow and half the time I just want to throw my shoe at them.
* Instead of one specification one has to hope is implemented correctly (:rolleyes:) - now there would be two.
* Consensus lock / competition ... ie. no MS as arbitrator to break the lock.
* Insufficient demand from those that matter (learning-OGL-and-complaining-a-lot folks do not matter).
* The question whether it would be worthwhile for OGL specifically has not been confidently settled.
* There are more important matters to attend to - maybe later.
* All the above.

PS. I would like to have a GLSL intermediate option - as you said, in case of shader explosion (as I call it), it gets problematic (uncached runs).

wait for 10 minutes the first time they load the game.

Then i would say that one is doing something wrong. One does not need thousands of shaders to show the splash screen ;)

... again, I am not against an intermediate, I would just like to point out that its absence is not as widespread and grave a problem as it is often portrayed to be.

Sure, you can trade runtime performance in order to reduce build times, but to be a bit silly again, if this is such a feasible option, why do large games not do it?

Yep, that is silly indeed. Cannot quite use what one does not have - D3D, as far as i gather from your responses, does not have the option (no intermediate / whole-program-optimization) to begin with. Asking why the option that does not exist is not used more often ... well, good question.

Another reason why runtime compilation in GL-land is a bad thing, is because the quality of the GLSL implementation varies widely between drivers.

Having an extra specification and implementation is unlikely to be less problematic than not having the extra.

To deal with this, Unity has gone as far as to build their own GLSL compiler, which parses their GLSL code and then emits clean, standardized and optimized GLSL code, to make sure that it runs the same on every implementation. Such a process is unnecessary in D3D due to there being a single, standard compiler implementation.

... continuation: It is unnecessary in OGL too - GLSL etc is well specified. If implementers fail to follow the spec then changing the spec content (intermediate spec etc) to make them somehow read the darn thing and not fuck up implementing that ... is silly.

Leaving the bad example aside, what you wanted to say, if I may, is that a third-party (Khronos) compiler would be helpful as it would leave only the intermediate for the driver.

Perhaps. However, I highly doubt it would be any less buggy. Compilers are not rocket science (having written a few myself) - the inconsistencies stem from a smaller user base and from some extremely lazy driver developers and shader writers not reading the spec either. A "Khronos compiler" would not have fixed any of those.

---------------------------
I hope we are not annoying the OP with this somewhat OT tangent (assessing the merits of an intermediate language in the context of OGL and D3D). At least I can say I know more about the D3D side than before - yay and thanks for that :). Got my answer, which turned out to be relevant to the OP too, as OGL has the D3D mix-and-match option as well, which if used has indeed the same limitations.

need sleep.
edit: yep, definitely bedtime.

Edited by tanzanite7, 19 May 2013 - 06:08 PM.


#12 Hodgman   Moderators   -  Reputation: 30388


Posted 19 May 2013 - 10:43 PM

Did not quite understand what you said here (not sure which parts refer to D3D and which OGL).

Neither D3D nor GL (by default) will remove an unused variable from a uniform block / cbuffer. So, if you're putting every possible value for 100 different shaders into one big interface block and hoping that GL will remove the unused ones, this won't happen unless you use the 'packed' layout. In D3D it just won't happen.
In either case, I'd recommend that the OP take responsibility for designing sensible UBO/CBuffer layouts themselves ;) And of course use UBOs rather than GL2's uniforms ;)
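For the OP, the UBO path looks roughly like this (GL 3.1+; the block and member names are made up, and the struct assumes an std140 block containing only vec4s):

#include <GL/glew.h>

struct MaterialParams  // CPU mirror of the std140 block
{
    float diffuseColor[4];
    float specularColor[4];
};

GLuint CreateMaterialUBO(GLuint program, const MaterialParams& params)
{
    GLuint ubo = 0;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER, sizeof(params), &params, GL_DYNAMIC_DRAW);

    const GLuint bindingPoint = 0;
    // Returns GL_INVALID_INDEX if the block isn't active in this program.
    GLuint blockIndex = glGetUniformBlockIndex(program, "MaterialParams");
    glUniformBlockBinding(program, blockIndex, bindingPoint);  // block  -> binding point
    glBindBufferBase(GL_UNIFORM_BUFFER, bindingPoint, ubo);    // buffer -> binding point
    return ubo;
}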

What optimizations (makes zero difference at driver/OGL/GPU side)? You mean CPU side, ie. filling buffers with data? Yeah, it would be pretty painful not to use std140.

Yeah, on GL without std140, you can't build your UBO contents ahead of time. For example, in my engine all the materials are saved to disc in the same format that they'll be used in memory, so they can be read straight from disc into a UBO/CBuffer. With GL2 only, this optimisation isn't possible due to it rearranging your data layouts unpredictably. D3D9 doesn't support CBuffers(UBOs) but allows you to very efficiently emulate them due to it not attaching values to programs like GL2 and supporting explicit layouts.
 
FWIW, the GL2 model of dealing with uniforms as belonging to the "shader program" itself does actually make perfect sense on very early SM2/SM3 hardware. Many of these GPUs didn't actually have hardware registers for storing uniform values (or memory fetch hardware to get them from VRAM), so uniforms were implemented only as literal values embedded in the shader code. To update a uniform, you had to patch the actual shader code with a new literal value! By the GeForce8 though, this hardware design had disappeared, so it stopped making sense for the API to bundle together program code instances with uniform values.

Ee.. you lost me here :/. I think you implied an argument where there was none.

Yep :/ I misread your wonderment as a rhetorical question!

Cannot quite use what one does not have - D3D, as far as i gather from your responses, does not have the option (no intermediate / whole-program-optimization) to begin with. Asking why the option that does not exist is not used more often ... well, good question.

Sorry, by this I meant to imply that it is common for games to make more optimal use of the GPU via large numbers of permutations, which increases their shader compilation times. They could reduce their compilation times by using more general (or more branchy) shaders, but that would decrease their efficiency on the GPU.

Then i would say that one is doing something wrong. One does not need thousands of shaders to show the splash screen ;)
I would just like to point out that its absence is not as widespread and grave a problem as it is often portrayed to be.

Yeah you don't need them all on the splash screen, but you do need to ensure that every program that could possibly be used by a level is actually compiled before starting the level, to avoid jittery framerates during gameplay.
It is a big problem in my experience too. All the current-gen games that I've worked on (around half a dozen on a few different engines) have had shader build times of 5-10 minutes. This hasn't been a problem for us simply because we haven't shipped Mac, Linux or mobile versions. Windows and consoles had the ability to pre-compile to cut this time off of the loading screens. If we did need to port to a GL platform, we likely would have increased the GPU requirements or decreased the GPU quality so that we could reduce the permutation count and used less efficient shaders.

It is unnecessary in OGL too - GLSL etc is well specified. If implementers fail to follow the spec

The compliance of different implementations with the spec isn't the problem here (although, it is a big problem too) -- the spec doesn't define how code should be optimized. Some drivers may aggressively optimize your code, while others may do a literal translation into assembly without even folding constant arithmetic... If it wasn't an issue, then Unity wouldn't have wasted their time solving the problem with their pre-compiler. In general it's still best to write sensible shader code assuming that the compiler will not optimize it at all, but at least with Unity's solution, they can be sure that certain optimizations will always be done on their code, regardless of which driver it's running on.

The quality of the optimizer makes a huge difference. On platform A, it had a good compiler, so I could write straightforward code and assume it would run at near theoretical speeds. On platform B with a bad compiler, I reduced a 720p post-processing shader from 4ms to 2ms simply by manually performing all of the optimizations that I assumed the compiler would do for me (and that platform A was doing for me). This was such a problem that me and the other graphics programmers seriously considered taking a few weeks off to build a de-compiler for platform A, so we could use it to optimize our code for platform B!

Edited by Hodgman, 19 May 2013 - 11:09 PM.


#13 Vincent_M   Members   -  Reputation: 638


Posted 21 May 2013 - 10:08 AM

Hey all,

 

Yes, as Hodgman suggested, I'm still using OpenGL 2 as it is close to OpenGL ES 2.0. I try to support both, but I'm going to be adding desktop/mobile-specific functionality. Also, I can't figure out how to get OpenGL 3.0+ on Mountain Lion. It seems that the default framework supported is OpenGL 2.1. I'm not sure if it's just a framework limitation or if I have to upgrade the drivers, because it seems that OS X handles the drivers on its own (does OS X only do that with software updates?). Would GLEW be the proper way to go, like I would on Windows?

 

What kind of functionality am I missing out on besides uniform buffer objects (UBOs)? Those seem quite efficient from what I've read, and it sounds like OpenGL ES 3.0 will be supporting them along with MRTs!

 

Also, it does sound like this was covered extensively, but I just want to clarify these points:

1) Unused inputs, outputs, and uniforms will not end up in the compiled shader

2) Unused functions are optimized out

3) Inputs, outputs, and uniforms will also be optimized out if they're only used in functions that are never called

4) Unused properties will be optimized out

 

For point #4, I just want to clarify this. Let's say I have the following struct for my Material:

struct Material
{
	sampler2D ambientMap; // used if UBER_AMBIENT_MAPPING defined
	vec4 ambientColor; // used if UBER_AMBIENT_MAPPING is NOT defined
	
	sampler2D diffuseMap; // used if UBER_DIFFUSE_MAPPING defined
	vec4 diffuseColor; // used if UBER_DIFFUSE_MAPPING defined
	
	float power; // used if UBER_SPECULAR_LIGHTING defined
	sampler2D specularMap; // used if UBER_SPECULAR_MAPPING defined
	vec4 specularColor; // used if UBER_SPECULAR_MAPPING is NOT defined
	
	sampler2D normalMap; // used if UBER_NORMAL_MAPPING defined
	
	// these are only used if UBER_ENVIRONMENT_MAPPING is defined
	float reflectFactor;
	float refractFactor;
	float reflectIndex;
	samplerCube envMap;
};

 

The struct above has many properties, but not all of them will be used in a single configuration of my über shader. For example, ambientMap is only used if the preprocessor macro UBER_AMBIENT_MAPPING is defined, and if it isn't, ambientColor is the fallback. This would be useful for games that have models that don't necessarily need texturing for everything. The last few uniforms aren't even used unless environment mapping is enabled for reflection and refraction. Right now, my struct code is littered with #ifdef, #else, #endif preprocessors. This bloats the code and makes it messy. I like to keep my code looking as clean as possible too...

 

I'd also like to point out that my über shader system does recompile shaders on the fly if desired. For example, if no global lights, point lights, or spot lights are being used, or specular lighting is toggled, etc, everything would be recompiled. This reduces the number of shader configurations sitting around. Although I'm aware it's very expensive to recompile, I do offer the option in my Scene class' code by enabling/disabling certain states, which add or remove certain scene-wide über shader flags, which in turn removes them from the loaded models attached to that scene, and then recompilation occurs. I look at it this way: it's better to recompile on the fly so shaders can shed unneeded features. Fewer features generally result in faster per-vertex and per-fragment processing.
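For illustration, one common way to inject such flags without touching the GLSL files themselves is to rely on glShaderSource concatenating the strings it is given - this is only a sketch, and the helper and flag strings are hypothetical:

#include <GL/glew.h>
#include <string>
#include <vector>

GLuint CompileUberShader(GLenum stage, const std::string& body,
                         const std::vector<std::string>& flags)
{
    // Build the permutation by prepending #defines as a separate source string.
    // Note: if the body starts with a #version directive, the defines must be
    // spliced in after that line instead, since #version has to come first.
    std::string prelude;
    for (const std::string& f : flags)            // e.g. "UBER_NORMAL_MAPPING"
        prelude += "#define " + f + "\n";

    const char* sources[2] = { prelude.c_str(), body.c_str() };
    GLuint shader = glCreateShader(stage);
    glShaderSource(shader, 2, sources, nullptr);  // strings are concatenated in order
    glCompileShader(shader);                      // real code would check GL_COMPILE_STATUS
    return shader;
}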

 

EDIT: I re-read this again after posting:

The compliance of different implementations with the spec isn't the problem here (although, it is a big problem too) -- the spec doesn't define how code should be optimized. Some drivers may aggressively optimize your code, while others may do a literal translation into assembly without even folding constant arithmetic... If it wasn't an issue, then Unity wouldn't have wasted their time solving the problem with their pre-compiler. In general it's still best to write sensible shader code assuming that the compiler will not optimize it at all, but at least with Unity's solution, they can be sure that certain optimizations will always be done on their code, regardless of which driver it's running on.

The quality of the optimizer makes a huge difference. On platform A, it had a good compiler, so I could write straightforward code and assume it would run at near theoretical speeds. On platform B with a bad compiler, I reduced a 720p post-processing shader from 4ms to 2ms simply by manually performing all of the optimizations that I assumed the compiler would do for me (and that platform A was doing for me). This was such a problem that me and the other graphics programmers seriously considered taking a few weeks off to build a de-compiler for platform A, so we could use it to optimize our code for platform B!

This is how I've been doing everything, but then everything is littered with messy preprocessor statements, and is difficult to edit later on as you add features. Should I continue using preprocessors everywhere to ensure that unused stuff isn't included? If so, maybe I should continue with my XML schema to do this.


Edited by Vincent_M, 21 May 2013 - 10:37 AM.




