I think the most important thing in optimizing a BRDF is, that you reduce linear time to constant time. Some parts of the BRDF don't need to be calculated per light. Just take a look at your version of Schlick's fresnel:

foreach (light) { float f0Sqrt = (n1 - n2) / (n1 + n2); float f0 = f0Sqrt * f0Sqrt; float fresnel = f0 + (1 - f0) * pow(1 - dot(L, H), 5); }

If you implement it this way, you almost reduce the linear code by half of its instructions:

float f0Sqrt = (n1 - n2) / (n1 + n2); float f0 = f0Sqrt * f0Sqrt; float cf0 = 1 - f0; foreach (light) { float fresnel = f0 + cf0 * pow(1 - dot(L, H), 5); }

I've reduced my BRDF this way. And this is also the reason I'm using this reduced version of Schlick's fresnel (yes I'm using the refractive indices as well) in my BRDF, because the full fresnel equation just can't be reduced this way and is way too expensive in comparison to this one. That's also the reason why I prefer the GGX NDF over any other NDF. It's pretty damn physically accurate and can be reduced into just a few instructions. Actually it's probably even faster than Blinn-Phong.

float roughnessSqr = roughness * roughness; float numerator = roughnessSqr / PI; float roughnessSqrSub1 = roughnessSqr - 1; foreach (light) { float NDotH = dot(N, H); float NDotHSqr = NDotH * NDotH; float denominatorSqrt = NDotHSqr * roughnessSqrSub1 + 1; float denominator = denominatorSqrt * denominatorSqrt; float ggx = numerator / denominator; }