The number one problem I'm encountering now is the instruction limit (512) imposed by Unity/Cg/ShaderLab; the card itself seems to be irrelevant, unless I'm missing something. Having said that, I will happily admit I'm no GPU optimisation guru, so there may be decent gains to be had. I'll give you a few representative snippets (as tends to happen during testing, the whole thing is a huge mess), and if you have any ideas on optimisation they would be much appreciated.

**General Setup:**
[source lang="plain"]
half4 splat_control = tex2D (_Control, IN.uv_Control);
half3 col;
float3 p0, p1, p2, p3, v;
const int cone_steps = 15;
float db;
float dist;
float4 tex;
float height0;
float height1;
float height2;
float height3;
float cone_ratio;

// View ray, forced to point into the surface
v = normalize(IN.eye.xyz);
v.z = abs(v.z);

// db = 1 - (1 - v.z)^8: fades the xy offset out at grazing angles
db = 1.0-v.z;
db*=db;
db*=db;
db=1.0-db*db;
v.xy *= db;

v.xy *= parallaxDepth;
v /= v.z;              // rescale so that v.z == 1
dist = length(v.xy);

// Per-layer tiled UVs; z is the ray depth, starting at 0
p0 = float3(IN.uv_Control.x * (_TerrainX/_Tile0), IN.uv_Control.y * (_TerrainZ/_Tile0), 0);
p1 = float3(IN.uv_Control.x * (_TerrainX/_Tile1), IN.uv_Control.y * (_TerrainZ/_Tile1), 0);
p2 = float3(IN.uv_Control.x * (_TerrainX/_Tile2), IN.uv_Control.y * (_TerrainZ/_Tile2), 0);
p3 = float3(IN.uv_Control.x * (_TerrainX/_Tile3), IN.uv_Control.y * (_TerrainZ/_Tile3), 0);
[/source]

**Cone Stepping Loop (x 4)**
[source lang="plain"]
for (int i=0;i<cone_steps; i++ )
{
    tex = tex2D(_parallax0, p0.xy);
    height0 = saturate(tex.w - p0.z);
    cone_ratio = tex.z;
    p0 += v * (cone_ratio * height0 / (dist + cone_ratio));
}
[/source]

**Height Weighting The Splatting:**
[source lang="plain"]
height0 = 1 - max(p0.z, 0.0001);
height1 = 1 - max(p1.z, 0.0001);
height2 = 1 - max(p2.z, 0.0001);
height3 = 1 - max(p3.z, 0.0001);

height0 = height0 * height0;
height0 = height0 * height0;
height1 = height1 * height1;
height1 = height1 * height1;
height2 = height2 * height2;
height2 = height2 * height2;
height3 = height3 * height3;
height3 = height3 * height3;

splat_control *= float4(height0, height1, height2, height3);
float totalSplat = dot(splat_control, float4(1,1,1,1));
splat_control /= totalSplat;

float2 pAv = splat_control.r * p0.xy + splat_control.g * p1.xy + splat_control.b * p2.xy + splat_control.a * p3.xy;
[/source]

**Splatting to get Final Result:**
[source lang="plain"]
col = splat_control.r * tex2D (_Splat0, pAv).rgb;
o.Normal = splat_control.r * UnpackNormal(tex2D(_BumpMap0, pAv));
o.Gloss = splat_control.r * _Spec0;

col += splat_control.g * tex2D (_Splat1, pAv).rgb;
o.Normal += splat_control.g * UnpackNormal(tex2D(_BumpMap1, pAv));
o.Gloss += splat_control.g * _Spec1;

col += splat_control.b * tex2D (_Splat2, p2.xy).rgb;
o.Normal += splat_control.b * UnpackNormal(tex2D(_BumpMap2, pAv));
o.Gloss += splat_control.b * _Spec2;

col += splat_control.a * tex2D (_Splat3, p3.xy).rgb;
o.Normal += splat_control.a * UnpackNormal(tex2D(_BumpMap3, pAv));
o.Gloss += splat_control.a * _Spec3;

o.Specular = o.Gloss;
[/source]

As you can see, there's a lot of repetition, in multiples of 4, which I hope means good optimisation possibilities, but I can't figure out the practicalities. I don't think most of the vector operations could be generalised to matrices. I can't figure out a way to square each component of a vector more efficiently. And some of the areas actually took more instructions when I converted them to use vectors instead of floats. Frustrating.
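
To make the vector question concrete, here is roughly what the height-weighting step looks like with the four depths packed into one float4. This is only a sketch: the names `depth` and `h` are made up for illustration, it assumes `p0`..`p3` and `splat_control` are as in the snippets above, and I haven't measured whether the compiler actually emits fewer instructions for it.

[source lang="plain"]
// Hypothetical packed version of the height weighting (untested sketch).
// depth and h are illustrative names, not part of the real shader.
float4 depth = float4(p0.z, p1.z, p2.z, p3.z);

// Same as height = 1 - max(p.z, 0.0001), done on all four layers at once
float4 h = 1 - max(depth, 0.0001);

// Component-wise multiply: two of these give h^4 per layer
h *= h;
h *= h;

// Weight the splat map and renormalise so the weights sum to 1
splat_control *= h;
splat_control /= dot(splat_control, float4(1,1,1,1));
[/source]

The two `h *= h` lines square every component in a single vector multiply each, which is the part I couldn't see how to do with separate floats.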