Okay, upon closer inspection, I understand why the tangent transformation matrix is being used in the pixel shader. It's because there really isn't any other way to do it.
If you did the transformation in the vertex shader, you'd be limited by the number of texcoord channels, but by moving it to the pixel shader, you can go way over the 8-light limit. You're trading that off against speed, though, because for every single pixel on the screen you're doing light_count+1 transforms into tangent space.
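To put a rough number on that cost, here's a quick back-of-the-envelope sketch. The function name, the 1024x768 resolution and the 8 lights are just assumptions for illustration, and it ignores overdraw, so treat the result as a lower bound rather than a measurement.

```python
def tangent_transforms_per_frame(width, height, light_count):
    """Estimate per-pixel tangent-space transforms for one full-screen frame.

    Uses the light_count + 1 transforms per pixel mentioned above and
    assumes every pixel is shaded exactly once (no overdraw).
    """
    pixels = width * height
    return pixels * (light_count + 1)


# Hypothetical 1024x768 render target lit by 8 lights.
print(tangent_transforms_per_frame(1024, 768, 8))  # 7077888
```

So even at a modest resolution that's around seven million matrix transforms per frame, which is why the per-pixel approach can hurt.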
My educated guess is that the slowdown has something to do with that transformation matrix.
The way I see it, you have two options:
- You could consider multipass lighting. Basically, you draw the geometry once per light, so the pixel shader is called once for every light in your scene, and you additively blend the results together. It's very speedy because no state changes are involved.
- You could optimise the crap out of the current shader, and cross your fingers in the hopes of it being a little bit faster.
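In case the multipass idea isn't clear, here's a minimal CPU-side sketch of it in Python (not shader code). It assumes simple diffuse-only directional lights and an unclamped framebuffer, and all the function names are made up for the example. The point is only that one pass per light with additive blending (source factor one, destination factor one) ends up with the same colour as a single pass that loops over all the lights.

```python
def lambert(normal, light_dir, light_color):
    """Diffuse contribution of one directional light (N.L clamped at 0)."""
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    return tuple(c * n_dot_l for c in light_color)


def shade_single_pass(normal, lights):
    """One shader invocation that loops over every light."""
    total = [0.0, 0.0, 0.0]
    for light_dir, light_color in lights:
        for i, c in enumerate(lambert(normal, light_dir, light_color)):
            total[i] += c
    return tuple(total)


def shade_multipass(normal, lights):
    """One pass per light, each added onto what is already in the framebuffer."""
    framebuffer = (0.0, 0.0, 0.0)
    for light_dir, light_color in lights:
        src = lambert(normal, light_dir, light_color)
        # Additive blend: result = src * 1 + dst * 1
        framebuffer = tuple(d + s for d, s in zip(framebuffer, src))
    return framebuffer


# Two hypothetical lights shining straight down the surface normal.
normal = (0.0, 0.0, 1.0)
lights = [
    ((0.0, 0.0, 1.0), (0.5, 0.25, 0.0)),
    ((0.0, 0.0, 1.0), (0.25, 0.25, 0.5)),
]
print(shade_multipass(normal, lights) == shade_single_pass(normal, lights))
```

A real framebuffer would usually saturate at 1.0, which this sketch leaves out.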
If you choose the first option, I need to know something: is it possible to use for-loops inside a technique?