What happens if you declare the kernel/offset arrays as uniforms and use the for loop?
I tried that too after reading some positive feedback about it, but without success.
Making a shader transpiler and a new shader language is a whole project in itself, Hodg :D When you don't have a tooling team that makes that stuff, the time might be best invested in something else.
Yeah, this is practically beyond the capacity of an indie studio.
However, I wonder if a compute shader implementation would be faster, e.g. processing 8x8 pixels per invocation.
There would be far fewer texture accesses, and maybe that beats relying on the texture cache.
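To put a rough number on "fewer texture accesses", here is a small back-of-the-envelope sketch. It assumes a 3x3 kernel (the 9 lookups mentioned elsewhere in the thread) and that one invocation loads its 8x8 tile plus a one-texel border once and reuses it for every pixel. The function names are made up for illustration.

```python
def per_pixel_fetches(tile: int, k: int) -> int:
    """Fetches if each of the tile*tile pixels samples its own k x k window."""
    return tile * tile * k * k


def shared_tile_fetches(tile: int, k: int) -> int:
    """Fetches if one invocation loads the tile plus its border once and reuses it.

    Assumes an odd kernel size k, so the border is (k - 1) // 2 texels per side.
    """
    side = tile + (k - 1)
    return side * side


if __name__ == "__main__":
    tile, k = 8, 3
    print("per-pixel fetches:", per_pixel_fetches(tile, k))      # 8 * 8 * 9
    print("shared-tile fetches:", shared_tile_fetches(tile, k))  # 10 * 10
```

That is 576 fetches versus 100 on paper. It only counts fetches, though. The texture cache already absorbs much of the overlap between neighbouring pixels, so whether this translates into a real speedup would have to be measured.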
I'd be interested to know how a compute shader would behave in that case.
Some news about the issue:
Using the AMD Shader Analyzer and analyzing both looped and unrolled versions of the shader, I saw that both were almost identical. One thing though:
The unrolled version has only 5 texture lookups instead of 9. Why? Because some of the "kernel" array values are just "0.0f" and the compiler strips the related texture lookups.
But the assembly code shows that the AMD compiler is able to unroll the loop by itself.
So, after removing those zero entries, I ran both versions to see how they behave:
- Loop version: 1.7 ms
- Unrolled version: 0.24 ms
Conclusion: no real changes on nVidia, except that both run a little bit faster.
Is there a tool similar to AMD Shader Analyzer, but for nVidia GPUs?
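For anyone following along, the 9-versus-5 lookup difference can be reproduced with a small CPU-side model. This is only a sketch, not the actual shader. The kernel values below are hypothetical (the thread doesn't list the real ones); they were simply chosen to have four zero weights. `CountingImage` is a made-up stand-in for a texture that counts how often it is sampled.

```python
# Hypothetical 3x3 kernel with zeros in the corners (not the real values from the thread).
KERNEL = [0.0, 1.0, 0.0,
          1.0, -4.0, 1.0,
          0.0, 1.0, 0.0]
OFFSETS = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


class CountingImage:
    """Tiny stand-in for a texture that counts how many times it is sampled."""

    def __init__(self, rows):
        self.rows = rows
        self.fetches = 0

    def fetch(self, x, y):
        self.fetches += 1
        # Clamp coordinates to the edge, like a clamped sampler would.
        y = min(max(y, 0), len(self.rows) - 1)
        x = min(max(x, 0), len(self.rows[0]) - 1)
        return self.rows[y][x]


def convolve_loop(img, x, y):
    """Plain loop: samples all 9 taps, including the ones with weight 0."""
    total = 0.0
    for (dx, dy), w in zip(OFFSETS, KERNEL):
        total += w * img.fetch(x + dx, y + dy)
    return total


def convolve_pruned(img, x, y):
    """Same sum, but zero-weight taps are skipped, so no fetch is issued for them."""
    total = 0.0
    for (dx, dy), w in zip(OFFSETS, KERNEL):
        if w != 0.0:
            total += w * img.fetch(x + dx, y + dy)
    return total


if __name__ == "__main__":
    rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    a, b = CountingImage(rows), CountingImage(rows)
    print(convolve_loop(a, 1, 1), "with", a.fetches, "fetches")
    print(convolve_pruned(b, 1, 1), "with", b.fetches, "fetches")
```

Both functions return the same value, but the pruned one issues 5 fetches instead of 9. A compiler can only do this pruning when the weights are compile-time constants. If the kernel is passed in as a uniform array, the values aren't known at compile time, so all 9 lookups have to stay.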