Have you used GL_ARB_shader_subroutine or DX Dynamic Shader Linking?
I'm curious what performance advantages or disadvantages you saw from using this instead of a larger number of shader swaps. So far I've only tested it on a shadow map pass, where I write only to the Z buffer, with a shader that toggles skinning on and off through subroutines. The scene has only 1 skinned object and 100 unskinned ones, though. This was the quickest change I could make to observe the performance difference, and to my surprise overall performance was about 1% lower. I imagine there's some number of shader swaps beyond which subroutines become the faster option.

I'm on an AMD HD7850, and I noticed a lower GPU usage ratio when using subroutines, as if the driver is doing more work. I see something very similar with separate shader objects: a 50% drop in performance and a 50% drop in GPU usage, with driver calls like glBindProgramPipeline taking a whole lot longer than glUseProgram. So I'm also questioning driver quality for this feature.
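To make the break-even idea above concrete, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simple cost model: each draw call pays a small fixed overhead when the shader goes through a subroutine call, and each swap is cheaper with a subroutine uniform update than with a full program change. The function name, parameters, and all numbers are hypothetical placeholders, not measurements from the HD7850 or any other GPU.

```python
from typing import Optional


def break_even_swaps(draw_calls: int,
                     per_draw_overhead_us: float,
                     program_swap_us: float,
                     subroutine_swap_us: float) -> Optional[float]:
    """Number of swaps per frame above which subroutines win.

    Assumed (hypothetical) cost model, all times in microseconds:
      separate programs: swaps * program_swap_us
      subroutines:       swaps * subroutine_swap_us
                         + draw_calls * per_draw_overhead_us
    """
    saving_per_swap = program_swap_us - subroutine_swap_us
    if saving_per_swap <= 0:
        # Subroutine selection is no cheaper than a program swap,
        # so under this model subroutines never pay off.
        return None
    return draw_calls * per_draw_overhead_us / saving_per_swap


# Placeholder numbers only: 100 draws, 0.5 us extra per draw,
# 5 us per program swap, 1 us per subroutine selection.
print(break_even_swaps(100, 0.5, 5.0, 1.0))
```

Under this model, a frame with very few swaps (such as 1 skinned object among 100 others) would sit below the break-even point, which would be consistent with a small slowdown. The real per-call costs would have to come from profiling on the target driver.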