Hodgman

Posted 19 May 2013 - 11:09 PM

Did not quite understand what you said here (not sure which parts refer to D3D and which to OGL).

Neither D3D nor GL (by default) will remove an unused variable from a uniform block / cbuffer. So if you're putting every possible value for 100 different shaders into one big interface block and hoping that GL will remove the unused ones, that won't happen unless you use the 'packed' layout. In D3D it just won't happen.
In either case, I'd recommend that the OP take responsibility for designing sensible UBO/CBuffer layouts themselves ;) And, of course, that they use UBOs rather than GL2-style uniforms ;)
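To make that concrete, here's a minimal sketch of what "designing the layout yourself" can look like: a hand-written std140 block mirrored by a C++ struct. The block name, fields and binding are invented for illustration, not taken from any real engine; sticking to vec4-sized members keeps the std140 offsets trivial.

// GLSL declaration this struct mirrors (std140 layout, binding chosen by us):
//   layout(std140) uniform MaterialParams {
//       vec4 diffuseTint;    // offset 0
//       vec4 specularParams; // offset 16: x = power, y = intensity
//       vec4 uvScaleOffset;  // offset 32
//   };
struct MaterialParams                 // 48 bytes, matches the std140 block exactly
{
    float diffuseTint[4];             // offset 0
    float specularParams[4];          // offset 16
    float uvScaleOffset[4];           // offset 32
};
static_assert(sizeof(MaterialParams) == 48, "must match the std140 block");

Keeping the GLSL block and the C++ mirror in sync by hand is exactly the responsibility being suggested above.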

What optimizations (it makes zero difference on the driver/OGL/GPU side)? You mean the CPU side, i.e. filling buffers with data? Yeah, it would be pretty painful not to use std140.

Yeah, on GL without std140, you can't build your UBO contents ahead of time. For example, in my engine all the materials are saved to disc in the same format in which they'll be used in memory, so they can be read straight from disc into a UBO/CBuffer. With GL2-style uniforms, this optimisation isn't possible because the driver can rearrange your data layouts unpredictably. D3D9 doesn't support CBuffers (UBOs), but it lets you emulate them very efficiently, because it doesn't attach values to programs the way GL2 does and it supports explicit layouts.
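As a rough sketch of that "straight from disc into a UBO" path, assuming a GL3+ context and the MaterialParams struct from the sketch above (the loader header, file format, binding point and function name are all hypothetical, and error handling is trimmed):

#include <cstdio>
#include <GL/glew.h>   // or whichever GL loader you use

// Load a pre-baked std140 blob from disc and hand it to GL unmodified.
// Because std140 fixes the layout at authoring time, no per-driver
// reflection or repacking is needed.
GLuint LoadMaterialUbo(const char* path)
{
    MaterialParams params;                       // struct from the sketch above
    FILE* f = std::fopen(path, "rb");
    std::fread(&params, sizeof(params), 1, f);   // bytes on disc == bytes in the UBO
    std::fclose(f);

    GLuint ubo = 0;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER, sizeof(params), &params, GL_STATIC_DRAW);
    glBindBufferBase(GL_UNIFORM_BUFFER, /*binding*/ 0, ubo);
    return ubo;
}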
 
FWIW, the GL2 model of treating uniforms as belonging to the "shader program" itself actually made perfect sense on very early SM2/SM3 hardware. Many of those GPUs didn't have hardware registers for storing uniform values (or memory-fetch hardware to get them from VRAM), so uniforms were implemented as literal values embedded in the shader code. To update a uniform, you had to patch the actual shader code with a new literal value! By the GeForce 8, though, this hardware design had disappeared, so it stopped making sense for the API to bundle program code together with uniform values.

Ee.. you lost me here :/. I think you implied an argument where there was none.

Yep, I misread your wonderment as a rhetorical question!

Cannot quite use what one does not have - D3D, as far as I gather from your responses, does not have the option (no intermediate format / whole-program optimization) to begin with. Asking why an option that does not exist is not used more often... well, good question.

Sorry, by this I meant that it's common for games to make better use of the GPU via large numbers of shader permutations, which increases their shader compilation times. They could reduce their compilation times by using more general (or more branchy) shaders, but that would decrease their efficiency on the GPU.
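To illustrate why the permutation count gets large so quickly, here's a toy sketch (the feature names are invented): each independent feature toggle doubles the number of programs you have to compile.

#include <cstdio>
#include <string>
#include <vector>

// N independent on/off features -> 2^N shader variants to compile.
// Each variant gets its own #define header prepended to the same source.
int main()
{
    const std::vector<std::string> features =
        { "SKINNING", "NORMAL_MAP", "FOG", "SHADOWS", "ALPHA_TEST" };

    const unsigned count = 1u << features.size();   // 2^5 = 32 variants
    for (unsigned mask = 0; mask < count; ++mask)
    {
        std::string defines;
        for (size_t i = 0; i < features.size(); ++i)
            if (mask & (1u << i))
                defines += "#define " + features[i] + " 1\n";
        // compileShader(defines + commonSource);   // one compile per variant
    }
    std::printf("%u variants from %zu features\n", count, features.size());
    return 0;
}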

Then I would say that one is doing something wrong. One does not need thousands of shaders to show the splash screen ;)
I would just like to point out that its absence is not as widespread and grave a problem as it is often portrayed to be.

Yeah, you don't need them all on the splash screen, but you do need to ensure that every program that could possibly be used by a level is actually compiled before starting the level, to avoid jittery framerates during gameplay.
It is a big problem in my experience too. All the current-gen games that I've worked on (around half a dozen, on a few different engines) have had shader build times of 5-10 minutes. This hasn't been a problem for us simply because we haven't shipped Mac, Linux or mobile versions; on Windows and the consoles we could pre-compile everything, which keeps that time off of the loading screens. If we did need to port to a GL platform, we likely would have increased the GPU requirements or decreased the graphical quality, so that we could reduce the permutation count by using fewer, less efficient shaders.
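On a GL platform that can't ship pre-compiled binaries, the "compile everything before the level starts" step could look something like this sketch (the BuildProgram helper and the per-level variant list are assumed to exist elsewhere; the names are hypothetical):

#include <GL/glew.h>
#include <string>
#include <utility>
#include <vector>

// Assumed helper: compiles and links one vertex/fragment source pair.
GLuint BuildProgram(const std::string& vsSrc, const std::string& fsSrc);

// Compile and link every program the level might use during the loading
// screen, so no glCompileShader/glLinkProgram stall happens mid-gameplay.
std::vector<GLuint> PrewarmLevelShaders(
    const std::vector<std::pair<std::string, std::string>>& variants)
{
    std::vector<GLuint> programs;
    programs.reserve(variants.size());
    for (const auto& v : variants)
        programs.push_back(BuildProgram(v.first, v.second)); // pay the cost now, not in-frame
    return programs;
}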

It is unnecessary in OGL too - GLSL etc. is well specified. If implementers fail to follow the spec...

The compliance of different implementations with the spec isn't the problem here (although that is a big problem too) -- the spec doesn't define how code should be optimized. Some drivers may aggressively optimize your code, while others may do a literal translation into assembly without even folding constant arithmetic... If this weren't an issue, Unity wouldn't have spent the time to solve it with their pre-compiler. In general it's still best to write sensible shader code assuming that the compiler will not optimize it at all, but with Unity's solution they can at least be sure that certain optimizations will always be done on their code, regardless of which driver it's running on.

The quality of the optimizer makes a huge difference. Platform A had a good compiler, so I could write straightforward code and assume it would run at near-theoretical speeds. On platform B, with a bad compiler, I reduced a 720p post-processing shader from 4ms to 2ms simply by manually performing all of the optimizations that I had assumed the compiler would do for me (and that platform A's compiler was doing for me). This was such a problem that the other graphics programmers and I seriously considered taking a few weeks off to build a de-compiler for platform A's shader binaries, so that we could run our code through A's good compiler and feed the optimized result to platform B!
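To give a flavour of the kind of hand-optimisation I mean (purely illustrative GLSL held in C++ string literals, not the actual shader from that project): folding constant arithmetic yourself, and replacing per-pixel divides with multiplies by a CPU-computed reciprocal.

// The kind of rewrite a good compiler does automatically and a bad one may not.
const char* kNaive = R"(
    uniform sampler2D sceneTex;
    uniform vec2 screenSize;
    out vec4 fragColour;
    void main() {
        vec2 uv = gl_FragCoord.xy / screenSize;          // per-pixel divide
        float exposure = 2.0 * 0.5 * 1.2;                // constant maths left to the compiler
        vec3 c = texture(sceneTex, uv).rgb * exposure;
        fragColour = vec4(c / (c + 1.0), 1.0);
    }
)";

const char* kHandOptimised = R"(
    uniform sampler2D sceneTex;
    uniform vec2 invScreenSize;                          // 1/screenSize uploaded once from the CPU
    out vec4 fragColour;
    void main() {
        vec2 uv = gl_FragCoord.xy * invScreenSize;       // multiply instead of divide
        vec3 c = texture(sceneTex, uv).rgb * 1.2;        // constants pre-folded
        fragColour = vec4(c / (c + 1.0), 1.0);
    }
)";

On a driver that does no optimisation at all, the second version is what actually reaches the GPU either way; writing it yourself just removes the dependence on the driver.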
