Shader Permutations

9 comments, last by Matias Goldberg 8 years, 2 months ago

Hey guys, this is a fairly simple question: I wanted to know which method you tend to find best for handling shader permutations.

Do you use bit values and an auto-generator like Unity does, create a class per permutation like Unreal Engine, or maybe even a third option that I don't know of? I know a few of the benefits of the latter: by hardcoding shader permutations on a per-class basis, any shader parameter differences can be hardcoded to circumvent parameter lookup (a needed step for any generic shader system), and getting a permutation of a shader can be as simple as GetShader<ClassName>(). One of the faults of this approach is that you have to touch engine data to add a new shader, but realistically don't you have to do that for any approach that requires engine input?

My current issue with bit masks is that they are susceptible to becoming deprecated, i.e. in the case where the bit mask values for a certain permutation change while the engine is still using an older version of the permutation bit masks. They also make permutation lookup more tedious, because you must first generate a permutation mask based on the current rendering pass and then search for the permutation, e.g.


void CSceneProxy::Render()
{
    uint64 PermutationMask = 0;

    if (IsSkinned())
        PermutationMask |= SGF_SKINNED;

    if (IsPointLightPass())
        PermutationMask |= SGF_POINT_LIGHT;

    // ... etc ...

    CShader* pShader = GetShaderMap()->GetPermutation(PermutationMask);
}

and a few other reasons. Anyways, I know that there are multiple ways to handle this, but I just wanted to get some insight from those who have more experience than I do.


My current issue with bit masks is that they are susceptible to becoming deprecated, i.e. in the case where the bit mask values for a certain permutation change while the engine is still using an older version of the permutation bit masks.

You can mitigate this by adding bit-mask info into your shader reflection system. You can then either assert that your hard-coded bits match the data:
e.g. AssertMsg( SGF_SKINNED == shader->GetBitMask("SGF_SKINNED"), "Code/Data out of date" );
Or you can fetch the right bit values from the shader:
e.g. u32 SGF_SKINNED = shader->GetBitMask("SGF_SKINNED"); AssertMsg( SGF_SKINNED != 0, "Code/Data out of date" );
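
For illustration, here's a rough sketch of the second variant (this is not Hodgman's actual API; CShader::GetBitMask is just the hypothetical reflection call from the snippets above), resolving the bits once at load time instead of trusting hard-coded constants:

struct SOptionBits
{
    uint64 Skinned    = 0;
    uint64 PointLight = 0;
};

SOptionBits ResolveOptionBits(const CShader* pShader)
{
    SOptionBits bits;
    bits.Skinned    = pShader->GetBitMask("SGF_SKINNED");
    bits.PointLight = pShader->GetBitMask("SGF_POINT_LIGHT");

    // if the shader data no longer exposes an option the code relies on, fail loudly
    AssertMsg(bits.Skinned != 0 && bits.PointLight != 0, "Code/Data out of date");
    return bits;
}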

In my system, I call the bits in the mask "Options", and an Option may cover more than one bit -- e.g. if you wanted a num_lights option, it would use multiple bits to store an integer within the mask.
I wrote a bit about my shader permutation system here, on slides 51 to 63.
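
To make the multi-bit Option idea concrete, here's a minimal sketch (the shift and width are made-up values, not taken from Hodgman's system) of packing a num_lights integer into the permutation mask:

// assumption: bits 4..6 of the mask hold the light count
constexpr uint64 NUM_LIGHTS_SHIFT = 4;
constexpr uint64 NUM_LIGHTS_BITS  = 3;
constexpr uint64 NUM_LIGHTS_MASK  = ((1ull << NUM_LIGHTS_BITS) - 1) << NUM_LIGHTS_SHIFT;

uint64 SetNumLights(uint64 permutationMask, uint32 numLights)
{
    // clear the option's bit range, then write the integer into it
    return (permutationMask & ~NUM_LIGHTS_MASK) |
           ((uint64(numLights) << NUM_LIGHTS_SHIFT) & NUM_LIGHTS_MASK);
}

uint32 GetNumLights(uint64 permutationMask)
{
    return uint32((permutationMask & NUM_LIGHTS_MASK) >> NUM_LIGHTS_SHIFT);
}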

Thank you for your reply, Hodgman! I just finished reading through your slides. I wanted to ask if you also have a method of skipping certain permutations, i.e. if a normal-mapped shader and a specular-texture shader cannot exist at the same time?

"Option" dependencies is a feature that's currently missing, but that I want to add smile.png

It shouldn't be too hard to add an extra layer of constraints -- e.g. "if normal mapping is zero, parallax mapping must also be zero", and then take that into account when iterating through permutations for compilation.
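
As a sketch of what that constraint layer might look like (the flag names and the CompilePermutation call are hypothetical, not from Hodgman's system), the offline loop can simply filter out impossible combinations:

bool IsValidPermutation(uint64 mask)
{
    // "if normal mapping is zero, parallax mapping must also be zero"
    if ((mask & SGF_PARALLAX_MAP) && !(mask & SGF_NORMAL_MAP))
        return false;

    // mutually exclusive options never get compiled together
    if ((mask & SGF_NORMAL_MAP) && (mask & SGF_SPECULAR_MAP))
        return false;

    return true;
}

void CompileAllPermutations(uint32 numOptionBits)
{
    for (uint64 mask = 0; mask < (1ull << numOptionBits); ++mask)
    {
        if (!IsValidPermutation(mask))
            continue;                 // skip combinations that can never occur at runtime
        CompilePermutation(mask);     // assumed offline-compile entry point
    }
}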

You may be interested in how we tackled it in Ogre 2.1 with the Hlms (see section 8 HLMS).

Basically, 64 bits will soon start to look like too few flags to handle all the permutations. But like Hodgman said, many of these options are mutually exclusive, or most of the combinations aren't used.

The solution we went for was, at creation time, to create a 32-bit hash to the shader based on all the options (which are stored in an array), and store this hash in the Renderable.

Then at render time we pull the right shader from the cache using the "final hash". The final hash is produced by merging the Renderable's one with the Pass hash. A pass hash contains all settings that are common to all Renderables and may change per pass (i.e. during the shadow map pass vs another receiver pass vs extra pass that doesn't use shadow mapping for performance reasons).

You only need to access the cache when the hash between the previous and next Renderable changes, which is why it is a good idea to sort your Renderables first.
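
A rough sketch of that render loop (all names here are illustrative, not Ogre's actual API) could look like this:

void RenderPass(const std::vector<Renderable*>& sortedRenderables, uint32 passHash)
{
    uint32 lastFinalHash = 0xFFFFFFFFu;
    ShaderSet* shaders = nullptr;

    for (Renderable* r : sortedRenderables)
    {
        // merge the hash baked into the Renderable with the Pass hash; how the two
        // are combined (concatenated bit ranges, hash-of-hashes, etc.) is up to you
        const uint32 finalHash = CombineHashes(r->GetBakedHash(), passHash);

        // only touch the cache when the hash actually changes, which is why
        // sorting the Renderables first pays off
        if (finalHash != lastFinalHash)
        {
            shaders = g_ShaderCache.FindOrCreate(finalHash);
            lastFinalHash = finalHash;
        }

        Draw(r, shaders);
    }
}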

Source 2's slides suggest a similar thing to map their PSOs (see slides 13-23; PPT for animated version).

While a 64-bit permutation mask works well for simple to moderately complex scenes, it will eventually fall short, especially if you need to adapt to very dynamic scenarios or have lots of content. However, implementing a 64-bit permutation mask is a good exercise to get an idea of the pros and cons of managing shaders.

Ah okay, yeah, it shouldn't be too difficult. Maybe something like this, using your Lua style as an example:

/*

.. shader permutation declarations

skip_permutation(NormalMap == 1, SpecularMap == 1), or something like that

*/

Thanks for the response and readings, Matias. I am currently working through the Ogre documentation. From what I can tell, you are using a hybrid implementation, where the C++ class determines which parts of a shader to compile, i.e. the shader templates, somewhat like shader stitching. Or am I wrong? I'm scrolling through the source code to see examples of the "hlms scripts" and figure out what they do. I'm doing something similar to Source Engine's shader cache, but what I'm doing instead is creating a temporary shader pipeline cache key. This circumvents the need for per-thread pools and synchronizing the pools. I build command lists on different threads and then send them to the render thread for execution, so I'd do something like this:



/*
  on a worker thread
*/
void CSceneProxy::Render(CCommandList* pCommandList)
{
   // this is created from the command list's internal memory allocator, not on the heap
   CTransientPipelineState* PipelineState = pCommandList->CreateAndBindTransientPipelineState();

   PipelineState->m_pVertexShader = GetVS();

   // ... etc
}

/*
  when the command list is actually being executed
*/
void CRenderThread::Execute(CCommandList* pCommandList)
{
   for (const CCommand& command : *pCommandList)
   {
      if (command.m_nID == CommandID_CreateTransientPipelineState)
      {
         // check the global cache
         auto* existingPipelineState = GetGlobalPipelineCache()->Find(command.GetPipelineHash());

         // if the pipeline state doesn't already exist, create it
         if (existingPipelineState == nullptr)
         {
            existingPipelineState = new CD3D12PipelineState(command.m_pVertexShader /*, etc. */);

            // add to the global cache
            GetGlobalPipelineCache()->Insert(existingPipelineState->GetHash(), existingPipelineState);
         }
      }
   }
}



Thanks for the response and readings, Matias. I am currently working through the Ogre documentation. From what I can tell, you are using a hybrid implementation, where the C++ class determines which parts of a shader to compile, i.e. the shader templates, somewhat like shader stitching. Or am I wrong?
I'm scrolling through the source code to see examples of the "hlms scripts" and figure out what they do.

Sort of, yes.

The C++ side loads the templates (by scanning the folder and looking for matching patterns in the name: i.e. Whatever_piece_vs.hlsl will get loaded for parsing as part of the vertex shader because it contains piece_vs and has an hlsl extension) and parses them.

The templates have a syntax very similar to the macro preprocessor e.g. @property( hasNormalMaps )@end would be similar to #ifdef hasNormalMaps #endif.
The reason for this is that it has certain features that are harder to do with the regular preprocessor, it allows us a consistent preprocessor syntax across all of the HLSL/GLSL/Metal shader languages (by the way, GLSL ES compilers are particularly bad), and I don't trust many compilers to efficiently compile a shader that is composed of functions (i.e. transformVertex(); litVertex(); outputVertex(); etc.).
The Hlms templates are here (look at Common, Pbs and Unlit).

The C++ implementation also knows the best way to update const and texture buffers in an efficient way during render time.

I'm scrolling through the source code to see examples of the "hlms scripts" and figure out what they do. I'm doing something similar to Source Engine's shader cache, but what I'm doing instead is creating a temporary shader pipeline cache key. This circumvents the need for per-thread pools and synchronizing the pools. I build command lists on different threads and then send them to the render thread for execution, so I'd do something like this:

If I understood it correctly, it's not bad, and it's simple (that's a good thing). But it requires a serialized step at the end before executing the commands, which doesn't scale as the number of commands grows. Basically your performance will be defined by O(N / thread_count + N'), where N is the number of commands and N' is also the number of commands, except the second loop takes less time because the work per command is smaller.
Source 2's method is superior because it only needs synchronization if the shader doesn't exist yet, which will only happen in the first few frames and then very occasionally. After a couple of frames the threads will not even need to synchronize.
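
For illustration, a minimal sketch of that "only synchronize on a miss" idea (this is neither Source 2's nor Ogre's actual code) is a thread-local cache sitting in front of the shared one, so the lock is only taken when a hash has never been seen by this thread:

CPipelineState* ResolvePipelineState(uint32 pipelineHash)
{
    // thread-local cache: no synchronization on the hot path
    static thread_local std::unordered_map<uint32, CPipelineState*> s_localCache;

    auto it = s_localCache.find(pipelineHash);
    if (it != s_localCache.end())
        return it->second;

    // miss: fall back to the shared cache under a lock; this happens during the
    // first few frames and then only rarely
    CPipelineState* pso = nullptr;
    {
        std::lock_guard<std::mutex> lock(g_PipelineCacheMutex);
        pso = g_GlobalPipelineCache.FindOrCreate(pipelineHash);
    }

    s_localCache[pipelineHash] = pso;
    return pso;
}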

Ah okay, I think I'm starting to understand it better. If you don't mind me asking, what would you say are the benefits of taking the approach you have with the Hlms? If a template class system with a class factory were implemented, then you would be able to circumvent the need to add a large amount of markup to your .hlsl file, parse said file, and create a second process for situations in which the preprocessor doesn't work. The C++ class's singular purpose would be setting the compile-time shader macros and knowing how to properly set shader parameters, leaving you with a clean shader file. An example would be a setup like this:



// PBS.hlsl

#if (PBS_WORKFLOW == METALLIC_WORKFLOW)
    float4 GetPBS(...)
    {
       return ....
    }
#else
   float4 GetPBS(...)
   {
     return ....
   }
#endif


void PixelShader(...)
{
   float4 color = GetPBS(...);
}


and "PBS.cpp"




enum EPBSWorkFlow
{
     ePBS_Metallic,
     ePBS_Specular
};


template<EPBSWorkFlow workFlow, ..etc>
class TPBSShader : public CPixelShader
{
  static void GetDefinitions(CCompilationOptions& options)
  {
     options.SetDefine("PBS_WORKFLOW", workFlow == ePBS_Metallic ? "METALLIC_WORKFLOW" : "SPECULAR_WORKFLOW");
  }

};





I think you're going into the wrong part of the Hlms that I wanted to highlight. You're looking at the template system it uses and how shaders get compiled, while what I wanted to highlight is that:

  • There are properties that involve Renderable information (i.e. a mesh without normals cannot use lighting, a mesh without tangents cannot use normal mapping, a material that uses UV set #4 for the diffuse texture cannot be used with a mesh that only has 2 sets of UVs, etc.)
  • There are properties that are per Material (i.e. material has a normal map, material uses parallax mapping, material uses 4 textures, uses transparency, etc)
  • There are properties that are per Pass (i.e. this is a shadow mapping pass, this is a pass with shadow mapping and 3 active lights, this is a pass without shadow mapping and one active light, this is the Early-Z pass, this is a deferred rendering pass, this is the light accumulation pass, etc)

When assigning a Material to a Renderable, the Hlms analyzes both the Material and the Renderable and creates a hash (and stores the property combination in a cache). Two different materials and two different renderables could well end up with the same hash (i.e. both meshes have the same vertex formats, both materials use the same features but with different values).

Right before rendering a pass, the Hlms analyzes the Pass and creates a hash (and stores the property combination of the pass in a cache).
While rendering the pass, both the hash stored in each Renderable (which contains Renderable+Material information) and the Pass hash are combined to form the final hash. This final hash is used to pull the actual shaders and PSO needed to render (and if it doesn't exist, one is created).

The key thing I wanted to highlight here is that you have to make a system that accounts for all 3 sources of information, and that a 64-bit key won't be enough if you use one bit for each setting.
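
As a hedged sketch of why a property list plus a hash scales better than a fixed bitfield (the Property layout and the FNV-1a choice are illustrative, not the Hlms implementation): properties are name/value pairs, so values aren't limited to single on/off bits, and the whole array collapses into the 32-bit cache key:

struct Property
{
    uint32 nameHash;   // e.g. hash of "hlms_normal_map"
    int32  value;      // e.g. 0/1 flags, or small integers like a UV set index
};

uint32 HashProperties(const std::vector<Property>& props)
{
    // FNV-1a over the property array, byte by byte
    auto mix = [](uint32 h, uint32 v) -> uint32
    {
        for (int i = 0; i < 4; ++i)
        {
            h ^= (v >> (i * 8)) & 0xFFu;
            h *= 16777619u;
        }
        return h;
    };

    uint32 h = 2166136261u;
    for (const Property& p : props)
    {
        h = mix(h, p.nameHash);
        h = mix(h, uint32(p.value));
    }
    return h;   // stored in the Renderable / Pass and used as the cache key
}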

Ah okay, I think I'm starting to understand it better. If you don't mind me asking, what would you say are the benefits of taking the approach you have with the Hlms? If a template class system with a class factory were implemented, then you would be able to circumvent the need to add a large amount of markup to your .hlsl file, parse said file, and create a second process for situations in which the preprocessor doesn't work.

The goals were the following:

  1. Speed up iteration: A C++ class that patches up whatever the standard preprocessor can't handle means that when you need to change something, you have to build the exe again and then run it, which needs to load all the assets. That's an iteration time of between 20 seconds and 3 minutes depending on complexity, and if you made a mistake you modify your code and repeat. Modifying an Hlms shader template and reloading it can be done without all that, and takes a couple of milliseconds. This is a massive improvement and a major point.
  2. Make up for broken compilers: You're developing for HLSL, but GLES-land on Android is a disaster. There is a major vendor whose for loops don't work at all because the dev misinterpreted the GLSL ES 2.0 spec. They've released a fix for Lollipop, but older versions (and there are a lot of KitKat phones out there) still run that unpatched driver. So the Hlms provides @foreach, which allows us to manually unroll the loops (btw, other vendors are really bad at unrolling loops). Some GLSL ES drivers are broken to the point where the only safe thing to do with macros is #ifdef #else #endif; forget about #define DIFFUSE material.xyz * otherValue.x and then using DIFFUSE instead of material.xyz * otherValue.x.
  3. Reuse snippets as much as possible: Multi-line macros in HLSL/GLSL need a '\' appended at the end of each line. The Hlms templates have @piece for this.
  4. The generated shader should be relatively efficient: Most approaches to uber-shaders end up with a nice modular system that results in a horribly slow shader, because they often delegate modularity to external functions and leave unused code in the file, hoping the compiler will optimize it heavily by removing dead code, inlining all functions, and avoiding redundant calculations done multiple times inside each function. Pieces allow us to fine-tune the generated output and avoid redundant calculations.

C++ templates heavily increase compilation time, so that makes them a no-no. But even then, your solution assumes these options can be resolved at compile time, while often they have to be evaluated at run time. You're basically moving the problem from an external file (our Hlms shader templates) back into C++, which is what we wanted to avoid.

Note that the Hlms doesn't actually dictate how you write an implementation. You can just avoid the whole meta-preprocessor we provide (i.e. never use @property, @foreach, @counter, @piece, etc) and do it the way you want: using the HLSL's preprocessor and stitching the leftovers from C++.
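
A minimal sketch of that alternative (the function and parameter names here are made up, not an Ogre API): build a block of #defines from the chosen options in C++ and prepend it to the shader source before handing it to the compiler:

std::string BuildShaderSource(const std::string& templateSource,
                              const std::vector<std::pair<std::string, int>>& options)
{
    // turn each (name, value) option into a #define line
    std::string defines;
    for (const auto& opt : options)
        defines += "#define " + opt.first + " " + std::to_string(opt.second) + "\n";

    // the regular HLSL/GLSL preprocessor then handles the #ifdef branches
    return defines + templateSource;
}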

--

There is no single way to achieve the same result, and the Hlms allows you to do it any way you want.

Like I said, I think you're focusing on the template-parsing side, whereas I wanted you to see that, at the design level, you have 3 sources of information (Renderable, Material, Pass), and that there is information you can bake once when assigning the material to the Renderable, information that needs to be evaluated per Pass, and a bit of work that must be done at render time per Renderable (i.e. merging the baked hash in the Renderable with the Pass hash).

Also you will need some sort of cache system with a 32-bit hash value, instead of a 64-bit bitfield.

Ah okay, I think I'm starting to understand it better now; thank you for taking your time with me. So for your Renderable, you'd have masks such as:

SGF_SKINNED,
SGF_NORMALS,
SGF_TANGENTS, etc.

your material will have:

SGF_MATERIAL_NORMALMAP,
SGF_MATERIAL_SPECULARMAP, etc.

and your passes will have something along the lines of:

SGF_PASS_POINTLIGHT,
SGF_PASS_LIGHTMAP

and then you coalesce all of these masks to generate a shader pipeline state mask. I think my confusion arose because I was thinking of the offline side of shader permutation generation via tools and so on, while you were describing the runtime side. Thank you for your help.

