Lights : Multi-pass vs Multi-light and abstraction

4 comments, last by Schrompf 17 years, 2 months ago
Hi,

I'm currently experimenting with Cg interfaces in order to create shader permutations easily, at run time and on demand. My main interest is abstracting the light-related calculations so that the material shader doesn't have to care about the light type. This includes light attenuation as well as shadow-map calculations and projected textures.

When I finished the basic interface for setting up shader interface parameters, I decided to write a multi-light shader to see how its performance differs from multipass. I created two versions of a simple dot3 diffuse shader: one that renders only one light at a time (accumulating lights through additive blending) and one that renders 4 lights at once. Both shaders perform the tangent-space transformation in the fragment program (the per-vertex tangent basis is passed to the fragment program as texcoords, and the light vector is transformed into tangent space using the interpolated matrix). The scene is just a box with its faces pointing inwards (something like a room), lit by 4 colored spot lights with both angular and distance attenuation.

Based on FPS measurements, both versions performed the same, with minor differences: sometimes the multipass approach is faster and sometimes the multi-light one is, depending on the window size or, more precisely, on the number of shaded pixels if the camera is in the room. This is true even if the framebuffer is a 64-bit floating-point texture (16 bits per channel, RGBA). In fact, with a floating-point render target the multi-light shader performs the same as or worse than multipass, and the FPS isn't steady.

Now to the questions:

1) What is wrong with the above comparison? I expected the multi-light shader to perform better than the multipass approach, because:
a) there are fewer texture fetches (2 for multi-light instead of 8 for multipass), and
b) there is no blending with the multi-light shader (no read-modify-write), and the difference should have been more visible with a floating-point render target.
As a note, rendering was done to a 1280x1024 render target (either directly to the window or to the float buffer).

2) When dealing with multiple lights in a single shader, where do you perform the tangent-space transformation? Inside the vertex program, passing the light vectors to the fragment program as texcoords, or inside the fragment program? The first approach limits the number of lights a single shader can process to the number of free texcoords available for passing data around, while the second performs too many calculations in the fragment program.

I know that the scene is really simple, and things would be different if I had used a more complex model (more polygons). In that case I would expect the multi-light approach to be far better than multipass, because I would be vertex bound. But now that vertex calculations are kept to a minimum, shouldn't points 1.a and 1.b make a difference between the two versions?

Thanks for reading this. If you have any comments, I'd be glad to hear them.

HellRaiZer

EDIT: I forgot to mention that my gfx card is a 7800GTX and the compiled shader profiles were NV_vertex_program3 and NV_fragment_program2.
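To make the comparison concrete, here is a minimal Python sketch (not shader code) of the two strategies for a single pixel. The function names, the plain Lambert term and the light values are assumptions for illustration only; attenuation, spot cones and tangent-space handling are left out. It only shows that N additive passes and one N-light pass produce the same sum, while the multipass version touches the framebuffer once per light.

```python
def lambert(normal, light_dir, light_color):
    """Diffuse term for one light: color * max(0, N.L)."""
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    return tuple(c * n_dot_l for c in light_color)


def shade_multipass(normal, lights):
    """One pass per light, accumulated with additive (ONE, ONE) blending."""
    framebuffer = (0.0, 0.0, 0.0)
    framebuffer_writes = 0
    for light_dir, light_color in lights:
        contribution = lambert(normal, light_dir, light_color)
        # Read-modify-write: destination + source.
        framebuffer = tuple(d + s for d, s in zip(framebuffer, contribution))
        framebuffer_writes += 1
    return framebuffer, framebuffer_writes


def shade_multilight(normal, lights):
    """All lights summed inside one shader invocation, written once."""
    total = (0.0, 0.0, 0.0)
    for light_dir, light_color in lights:
        contribution = lambert(normal, light_dir, light_color)
        total = tuple(t + c for t, c in zip(total, contribution))
    return total, 1


# Four hypothetical colored lights as (unit direction to light, RGB color).
LIGHTS = [
    ((0.0, 0.0, 1.0), (1.0, 0.0, 0.0)),
    ((0.0, 1.0, 0.0), (0.0, 1.0, 0.0)),
    ((1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
    ((0.0, 0.6, 0.8), (1.0, 1.0, 1.0)),
]
```

Calling both functions with the same normal and `LIGHTS` gives the same color; only the number of framebuffer operations differs (4 versus 1). Whether that difference shows up in the frame rate depends on where the bottleneck actually is.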
For an extensive discussion of that topic, see this thread: http://www.gamedev.net/community/forums/topic.asp?topic_id=424468

In my tests, multiple lights per shader were always faster than a single light per pass, by 20% to 50% on my test scenes. You'll find the exact test bed in the thread linked above. Vertex processing will hardly ever be a problem in any case.

We transform everything into tangent space in the vertex shader, then renormalise it in the pixel shader. This reduces a single shadow-less light to maybe 5 math instructions, which made such lights nearly free in our tests; only the shadow-casting ones are expensive. To cope with all the permutations, we generate shader source on the fly with a script, compile the shaders and cache them with some clever indexing tricks. The overhead of looking up shaders based on light setups was a problem earlier, but some profiling and said indexing tricks helped a lot. For cubemap-reflective surfaces, a tangent-to-world transform is needed, though. Luckily you are free to place the calculations wherever you want them; it just depends on what scripts you write.
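As a rough illustration of the generate-and-cache idea, here is a minimal Python sketch. It is not the actual scheme described above: `permutation_key`, `generate_source` and `ShaderCache` are hypothetical names, the light descriptor `(light_type, casts_shadow)` is an assumption, and the "compile" step is a stand-in that only builds a comment string.

```python
def permutation_key(lights):
    """Order-independent key: the same set of lights maps to one shader."""
    return tuple(sorted(lights))


def generate_source(key):
    """Stand-in for script-driven shader generation (emits comments only)."""
    lines = ["// generated shader, %d light(s)" % len(key)]
    for index, (light_type, casts_shadow) in enumerate(key):
        suffix = " + shadow map" if casts_shadow else ""
        lines.append("// light %d: %s%s" % (index, light_type, suffix))
    return "\n".join(lines)


class ShaderCache:
    """Generates a shader the first time a light setup is seen, then reuses it."""

    def __init__(self):
        self._shaders = {}
        self.compiles = 0

    def get(self, lights):
        key = permutation_key(lights)
        shader = self._shaders.get(key)
        if shader is None:
            shader = generate_source(key)  # real code would also compile here
            self._shaders[key] = shader
            self.compiles += 1
        return shader
```

With this layout, requesting `[("spot", True), ("point", False)]` and then the same lights in the opposite order triggers only one generation. A real implementation would likely replace the sorted tuple with a cheaper precomputed index, which is presumably where the lookup overhead mentioned above comes from.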

Bye, Thomas
----------
Gonna try that "Indie" stuff I keep hearing about. Let's start with Splatter.
It sounds like you are becoming arithmetic instruction bound in the fragment shaders. I don't know if Cg has something similar, but a program can be compiled into assembly with the fxc compiler in HLSL. If you can, compile both programs and see how many instructions each one is using.

I don't recall the exact architecture of the 7800 GTX (it is discussed in several presentations by NVIDIA), but I think that under certain circumstances some simple instructions can be executed two at a time per fragment processor. So the overall parallelism of the graphics pipeline may be coming out better in the multipass method.

This result actually surprises me, but the proof is in the pudding. Also, just because you are blending in the multipass method doesn't mean that it should be slower. Your card has smokin' fast memory bandwidth, plus if another segment of the pipeline is the bottleneck then you get to do your blending for 'free' since there are latencies with the rest of the pipeline.

Again, I am not sure what tools you have available, but check out the NV documents in their developer section for methods on how to profile the pipeline. There is certainly a reason for your results, you just have to pinpoint what it is!
For the multi-light non-shadowed case, you could transform the normals from tangent to world space once, then you save 4 instructions per light, at the cost of 4 instructions one time, which should speed up that case more.
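A small Python sketch of why the two orderings give the same diffuse term. The helper names and the sample basis are assumptions for illustration; the point is only that `dot(n_tangent, M * L)` equals `dot(M^T * n_tangent, L)`, so the per-light transform can be replaced by one transform of the normal.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def to_tangent_space(tbn, v_world):
    """World -> tangent: three dot products against T, B and N."""
    t, b, n = tbn
    return (dot(t, v_world), dot(b, v_world), dot(n, v_world))


def to_world_space(tbn, v_tangent):
    """Tangent -> world: weighted sum of the basis vectors."""
    t, b, n = tbn
    return tuple(
        v_tangent[0] * t[i] + v_tangent[1] * b[i] + v_tangent[2] * n[i]
        for i in range(3)
    )


def n_dot_l_per_light_transform(tbn, normal_ts, lights_ws):
    """Transforms every light vector into tangent space (one transform per light)."""
    return [dot(normal_ts, to_tangent_space(tbn, l)) for l in lights_ws]


def n_dot_l_single_transform(tbn, normal_ts, lights_ws):
    """Transforms the normal into world space once, then one dot per light."""
    normal_ws = to_world_space(tbn, normal_ts)
    return [dot(normal_ws, l) for l in lights_ws]


# Hypothetical world-space tangent basis (T, B, N) and sample data.
TBN = ((0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
NORMAL_TS = (0.6, 0.0, 0.8)
LIGHTS_WS = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.6, 0.8), (0.8, 0.0, 0.6)]
```

Both functions return the same list of N.L values for the sample data; the second one does the matrix work once instead of once per light.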

On relatively long shaders, blending is free or cheaper, due to bubbles in the ROP pipeline caused by stalling for texture fetches or math ops to finish.
Thanks for the replies.

Schrompf, thanks for the link. I was following that thread at the time, but I forgot about it. Your results are exactly what I was expecting to see, but as I said, that wasn't the case here. Thanks to SimmerD's suggestion I was able to make the multi-light shader faster than multipass. I was transforming the light vector to tangent space instead of the normal vector to world space, and this gave me a shader with 110 instructions (in contrast with the single-light shader, which was 33 instructions long), and as you suspected, the majority of them were arithmetic instructions.

If I transform the tangent-space normal into world space (in both shaders, so that the comparison is fair), then the single-light shader is 39 instructions long and the multi-light one is (still) 110 instructions long. But this time the multi-light shader performs approx. 25% faster than multipass (and approx. 5% faster than the previous version).
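A back-of-the-envelope check of those counts, as a small Python sketch. It assumes each of the 4 lights costs one full run of the single-light shader per pixel in the multipass case, and it ignores blending, texture latency and dual-issue, so it is only a rough indicator and not a performance model.

```python
NUM_LIGHTS = 4
MULTI_LIGHT_INSTRUCTIONS = 110          # one pass, from the post
SINGLE_LIGHT_INSTRUCTIONS_BEFORE = 33   # light vector -> tangent space
SINGLE_LIGHT_INSTRUCTIONS_AFTER = 39    # normal -> world space


def multipass_total(per_pass_instructions, num_lights=NUM_LIGHTS):
    """Fragment instructions per pixel summed over all passes."""
    return per_pass_instructions * num_lights


before = multipass_total(SINGLE_LIGHT_INSTRUCTIONS_BEFORE)  # 132
after = multipass_total(SINGLE_LIGHT_INSTRUCTIONS_AFTER)    # 156
```

So per pixel the multi-light shader runs 110 instructions against 132 (before) or 156 (after) for the four passes combined. Since the first setup measured about the same speed despite the lower count, instruction totals alone clearly don't explain the results.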

So I think the conclusion is that pushing as many lights as you can into a single pass can help performance (if it is done right). I should examine the application with either gDebugger or NVPerfGraph to find the real bottleneck in both cases, but this requires instrumented drivers, so I'll do it some time in the near future.

My main concern with multiple lights per pass was the need for extra texcoords for each light (if the world-to-tangent transform is done in the vertex program), but I think this is up to the shader writer to decide. What I can't figure out is how to abstract the light interface so that it can be used with both versions. For example, if I'm doing the tangent-space transformation in the vertex program, then I need the light position in object space (which needs one uniform update per object). But if I do it in the fragment program, it is more natural to work in world space, because I get away with one uniform update (the light's position) per light instead of per object. I'll have to work on it a little more to find the best combination. If anyone has comments on the light interface, I'd like to hear them.

Thanks again for the replies. Finally the results are as I expected them to be :)

HellRaiZer
You don't need light data in object space. Usually you transform the normal (and tangent and binormal) into world space in the vertex shader, then build the tangent matrix from them. This is your tangent<->world matrix, regardless of which direction you use it in. In my experience, you always feed the same data to the shaders and you can still freely decide which coordinate space you perform your lighting calculations in.
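A minimal Python sketch of that idea, with made-up data: the object-space basis is rotated into world space, stacked into a matrix, and the same matrix is then used in both directions. The helper names and sample values are assumptions, and using the transpose as the inverse is only valid as long as the basis stays orthonormal.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_vec(matrix, v):
    """Multiplies a row-major 3x3 matrix by a 3-vector."""
    return tuple(dot(row, v) for row in matrix)


def transpose(matrix):
    return tuple(zip(*matrix))


# Hypothetical model rotation: 90 degrees about the Z axis.
MODEL_ROTATION = ((0.0, -1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))

# Object-space tangent, binormal and normal of one vertex.
TANGENT_OS, BINORMAL_OS, NORMAL_OS = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

# "Vertex shader" step: bring the basis into world space, stack it as rows.
WORLD_TO_TANGENT = tuple(
    mat_vec(MODEL_ROTATION, axis) for axis in (TANGENT_OS, BINORMAL_OS, NORMAL_OS)
)
# For an orthonormal basis the inverse is just the transpose.
TANGENT_TO_WORLD = transpose(WORLD_TO_TANGENT)

light_ws = (0.3, -0.5, 0.8)
light_ts = mat_vec(WORLD_TO_TANGENT, light_ws)   # world -> tangent
round_trip = mat_vec(TANGENT_TO_WORLD, light_ts)  # tangent -> world
```

Here `round_trip` comes back equal to `light_ws`, so the lights can stay in world space and the shader decides which side of the matrix to apply.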

We use tangent-space lighting because it allows us to shift a considerable amount of calculation to the vertex shader. This requires one texture register per light, two if you do shadow mapping. If you do the lighting in world space, you don't need texture registers per light anymore; you're limited by constant registers instead. That is a lot of room, but you'll need to do attenuation, shadow-map projection and similar in the pixel shader, stuff that is better handled in the vertex shader in my opinion. I haven't done an exact comparison, though.

Bye, Thomas
