big deferred shader vs many deferred shaders?

Started by kRogue
7 comments, last by kRogue 16 years, 5 months ago
I was wondering (and those with experience please answer; anyone who says "test it yourself" is just babbling to post, as it would only test my hardware!)... Background: I have a deferred shading system; each pixel has a deferred shader ID (some people call it a material ID). Now, for each deferred shader, the first line of the GLSL fragment shader is

if(pixel.ShaderID!=currentShader)
{
   discard;
}
where currentShader is a uniform that all deferred shaders have, set to the ID of the shader when I do the deferred shading as follows:

for_each(shader)
{
   shader->Enable();
   shader->SetUniform("currentShader", shader->ID());
   
   DrawRectangle(); 
   shader->Disable();
}
where Enable() sets the current fragment program to what is in shader, and DrawRectangle() draws a full-screen rectangle. Would it be better instead to have one very large shader which basically does

if(pixel.ShaderID==0)
{
  doShader_0(pixel);
}
else if(pixel.ShaderID==1)
{
  doShader_1(pixel);
} 
else if(pixel.ShaderID==2)
{
  doShader_2(pixel);
} 
....
else
{
   discard;
}
as that would make it so that only one fragment program needs to be bound for deferred shading per frame... but the shader is big. I am assuming that the hardware supports "real" dynamic branching in the fragment shader. Also, the second method at the end of the day has fewer checks per pixel (the first method has N checks per pixel, where N is the number of shaders, while the second has k checks at a pixel, where k is the shader ID of the pixel)...
Close this Gamedev account, I have outgrown Gamedev.
Option (1) basically does a render batch per-material. It doesn't sound quite like deferred shading.

Option (2) is more deferred-shading-like. Will it run faster than (1)? It's likely. Will it run in the first place? Depends on the HW and the code.

The number of checks per pixel is irrelevant, as (1) may kick in to process 1 pixel while (hopefully) (2) will always render at least several kilopixels per batch.

Last time I played with deferred shading it never came out as a clear winner. It really depends on HW (last gen is way better), but in general I would say that if performance is your metric then you shouldn't be looking there. I believe the benefit of deferred shading lies in its increased flexibility, but that's something I don't need so I didn't go for it.

I still have to understand how this method can live with today's framebuffer limitations. Even if you use the "fat buffer", what the system basically needs is to do a C union on the data... OK, data can be spread to multiple framebuffers and re-composed back. With 128 RGBA32F samplers (or constant buffer trickery) one could say there's always enough space in practice... but it looks more like a not-so-quick hack to me than something that's supposed to be The Right Way...

Previously "Krohm"

Quote:
I still have to understand how this method can live with today's framebuffer limitations. Even if you use the "fat buffer", what the system basically needs is to do a C union on the data... OK, data can be spread to multiple framebuffers and re-composed back. With 128 RGBA32F samplers (or constant buffer trickery) one could say there's always enough space in practice... but it looks more like a not-so-quick hack to me than something that's supposed to be The Right Way...


err... I don't exactly follow. What I do is create 4-8 RGBA16F textures and bind them to my FBO; each channel of each texture holds data needed for the lighting computation: color, position, normal, etc. With the exception of position, these could in fact be RGBA8 (obviously for color, but also for the normal, since bump maps are almost always 8 bits per channel), and one could calculate the position from the depth value and the screen position (i.e. the texture coordinate during the deferred shading phase). If one did that and (only) had 4 buffers, the memory footprint is (only) 16MB; at GL_RGBA16F the footprint is 32MB. Not so bad... Moreover, one typically uses at most 16 varying parameters from vertex to fragment shader, and that is exactly 4 buffers (at 4 channels each)...
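For concreteness, here is a rough sketch of the kind of geometry-pass fragment shader I mean; the exact channel layout, texture names and number of targets are just assumptions for the example, not necessarily what I actually use:

// geometry pass: write the lighting inputs into the fat buffer (MRT outputs)
uniform sampler2D colorMap;
uniform sampler2D normalMap;     // tangent-space bump map
uniform sampler2D specularMap;
uniform float currentShader;     // deferred shader / material ID

varying vec2 texCoord;
varying vec3 eyePosition;        // position in eye space, from the vertex shader
varying mat3 tangentToEye;       // TBN basis, from the vertex shader

void main()
{
    vec3 n = texture2D(normalMap, texCoord).xyz * 2.0 - 1.0;             // unpack bump map
    gl_FragData[0] = vec4(texture2D(colorMap, texCoord).rgb, currentShader);
    gl_FragData[1] = vec4(normalize(tangentToEye * n) * 0.5 + 0.5, 0.0); // fits RGBA8 too
    gl_FragData[2] = vec4(texture2D(specularMap, texCoord).rgb, 0.0);
    gl_FragData[3] = vec4(eyePosition, 1.0);  // or drop this target and rebuild position
                                              // from depth + screen coordinate instead
}

The deferred pass then reads these targets back as ordinary textures.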
Close this Gamedev account, I have outgrown Gamedev.
I apologize, this is more a theoretical rant than something which really applies in practice.

The idea of deferred shading is to have all the needed data in a fat buffer for later retrieval per-material and per-pixel.
So, as you say, we start from position, normal, color... and other things.
Those "other things" (OT for short) are the whole point.
The first problem is that those things are all bit-packed and then unpacked with per-bit math. A bit ugly by today's standards, but we can live with it.

The other problem is that, since the "compositing" pass takes all its data from the fat buffer, we must make sure all the OT are encoded. Suppose you have a material with sub-surface scattering, so you need a per-pixel depth range with (let's make it even worse) an "outbound" normal; this must be added to the fat buffer, since there's no other way to pass it to the compositing shader, thereby "forwarding" the complexity to the next stage.
This personally gives me a bit of pain; it clashes a bit with the flexibility of the shaders themselves, in my line of thinking... and to do what? Just to boil down to a standard FB in the end. There isn't really anything wrong with it, just an idea gone wild... but if everything can be packed in the same number of attribs as a single shader could use, is there a good reason to use this instead of a lay-z pass to kill the overdraw?
In other words: if you have a well-known set of per-pixel attribs, does it really make sense to use this method? After all, the normal wouldn't go to the FB normally so there must be a break-even point in this complexity relationship.
To make the thing even worse, normal shading uses processing and on chip bandwidth. By deferring, we should save processing by trading with... memory bandwidth. I'm still not sure this makes sense.

The great thing is that changing material per-pixel sounds absolutely awesome... but with Half-Life 2 having about 12k "appearances" in the whole game, is it a real advantage?

In short, I am still unsure these deferred shading tricks are going to pay off... if you find out something, it would be great to read more about your experience.

Previously "Krohm"

Quote:
Those "other things" (OT for short) are the whole point.
The first problem is that those things are all bit-packed and then unpacked with per-bit math. A bit ugly by today's standards, but we can live with it.


My take on deferred shading is that it guarantees that each pixel only has its lighting computation done once (otherwise you have to pray that early z-cull is good enough).

Also, it allows an object to interact with the background more richly: since the OT are there for every pixel, effects like refraction and such become easier to write. These effects require that the objects behind them are correctly lit and drawn; to guarantee that in an immediate shading style, you would then need to make sure those refracting-like objects are drawn last...

As for your complaint about the encoding part: let's say you have a fancy card, like a GeForce 8 series card; it supports 8 MRTs, which translates into 32 floats. Typical vertex-to-fragment is at most 8 varying vec4s, i.e. 32 floats, so one can "encode" the data as floats, the same as you would pass them from the vertex shader; that is not really encoding. But if one does not have a very fast card, then you might need to encode if your render target is GL_RGBA8 (8-bit fixed point). That is what I had to do, and it was not that painful really: floor and fract are your friends there.
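For what it's worth, here is a hedged sketch of the sort of floor/fract trick I mean, packing a value in [0,1) into two 8-bit channels of a GL_RGBA8 target (the function names are made up for the example):

// pack a [0,1) value into two 8-bit channels for roughly 16 bits of precision
vec2 packToTwoChannels(float v)
{
    float hi = floor(v * 255.0) / 255.0;   // coarse 8 bits, lands exactly on an 8-bit step
    float lo = fract(v * 255.0);           // remainder, quantized to another 8 bits on write
    return vec2(hi, lo);
}

// recover the value in the deferred pass
float unpackFromTwoChannels(vec2 p)
{
    return p.x + p.y / 255.0;
}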

Quote:
In other words: if you have a well-known set of per-pixel attribs, does it really make sense to use this method? After all, the normal wouldn't go to the FB normally so there must be a break-even point in this complexity relationship.
To make the thing even worse, normal shading uses processing and on chip bandwidth. By deferring, we should save processing by trading with... memory bandwidth. I'm still not sure this makes sense.


If the per-pixel attributes are the same, or mostly the same, it makes more sense, as it completely avoids the issue of overdraw waste. Yes, what you save in GPU processing you pay in memory bandwidth, and depending on your situation that may or may not be a good idea; in my system, processing power is much tighter than memory bandwidth...

Quote:
The other problem is that, since the "compositing" pass takes all its data from the fat buffer, we must make sure all the OT are encoded. Suppose you have a material with sub-surface scattering, so you need a per-pixel depth range with (let's make it even worse) an "outbound" normal; this must be added to the fat buffer, since there's no other way to pass it to the compositing shader, thereby "forwarding" the complexity to the next stage.


Deferred shading does not need to be done all the way or none of the way. Let's say you have a relatively simple calculation to be performed on a fragment: you can do a portion of that calculation at the fragment and forward the result to the fat buffer. Exactly what should be forwarded and what should not be is a non-trivial issue that depends on the situation. One simple example of where you don't do all the calculation in the deferred shader is a simple reflection: you fetch the texel of the environment map in the object's fragment shader and save that texel's value to the fat buffer, so the deferred shader then looks something like gl_FragColor = blend(reflected_texture, stuff). The point is that fetching the texel from the environment map is cheap and simple, so it is okay if that gets overdrawn (but lighting calculations done on a fragment should not be overdrawn).
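To make that concrete, a rough sketch (the target index and the final blend are placeholders, just to show where the work lands):

// geometry pass: the environment lookup is cheap, so it is fine if it gets overdrawn
uniform samplerCube envMap;
varying vec3 reflectDir;              // reflection vector from the vertex shader

void main()
{
    vec3 reflected = textureCube(envMap, reflectDir).rgb;
    gl_FragData[3] = vec4(reflected, 1.0);   // park the texel in a spare fat-buffer target
                                             // (the index 3 is just an example)
    // ... the usual color / normal / position writes go to the other targets ...
}

// deferred pass: the expensive lighting is computed once, then blended with the stored texel:
//     gl_FragColor = mix(lighting, reflected_texture, reflectivity);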


Quote:
The great thing is that changing material per-pixel sounds absolutely awesome... but with Half-Life 2 having about 12k "appearances" in the whole game, is it a real advantage?


I hope that for a given level only a few are really active, since 12k shader changes per frame seems incredibly expensive... I use the per-pixel material ID to pick which fragment shader to run...

Quote:
In short, I am still unsure these deferred shading tricks are going to pay off... if you find out something, it would be great to read more about your experience.


So far, things that I can do with deferred shading that are much harder to do without:
1) Fire-and-forget drawing: I can draw the meshes in any order I choose and don't have to deal with "must draw this before that". The main gain is that I can schedule drawing meshes according to vertex shader and texture (to avoid expensive state changes). Without deferred shading I have to either do a depth-only pass or draw front to back to get early z-cull.

2) Effects that require stuff behind an object, like warped invisibility (think Predator), refraction, heat-haze... these are much, much easier. Other cool effects like weird filters (i.e. the filter chooses the shader and such) are also really easy. Doing that with in-place shading requires one to first render the scene to a texture anyway and then draw these FX objects with their funky shaders on. In my massive portalling system this is a messy issue: I have an uber portal system which allows portalling (i.e. like in Prey), but with portals in front of each other, portals within portals etc., so I have to cap the portals in the depth buffer to get the drawing correct (and use the stencil buffer to make sure I just draw inside the portal). Without deferred shading it is very messy, as the value in the depth buffer does not really indicate the depth of the pixel there but is only a hidden surface removal mechanism; to do those effects without deferred shading is a major, major headache. Deferred shading is also good with respect to stencil tests: most GPUs' stencil culling sucks and gets disabled at the drop of a hat, so in my system lots of pixels would otherwise get their lighting calculation done even though they are completely invisible.

3) Shadow mapping gets a touch better, as adding a shadowed light just becomes another pass with a deferred shader, without needing to go through the vertex shader an extra two times (with deferred you only need to go through it an extra one time, to generate the shadow map). Otherwise you are limited by the number of available texture units in how many shadowed lights can be applied to one object... though I have to admit, I am not a big shadow fan...

[Edited by - kRogue on November 5, 2007 4:44:03 AM]
Close this Gamedev account, I have outgrown Gamedev.
Quote:Original post by kRogue
My take on deferred shading is that it guarantees that each pixel only has its lighting computation done once (otherwise you have to pray that early z-cull is good enough).
Yes, that's absolutely true. This leads me to introduce what I call "depth-relevant attributes". Typically, color isn't depth-relevant while the normal sometimes is. Position is typically a depth-relevant attribute.
Now, to reduce overdraw, lay-z-only has a lot of interesting advantages: not only does it eat less BW, it also allows double-speed rendering on some cards and gives a correct depth buffer...

Have you considered lay-z-only?

Quote:Original post by kRogue
Also, it allows an object to interact with the background more richly: since the OT are there for every pixel, effects like refraction and such become easier to write. These effects require that the objects behind them are correctly lit and drawn; to guarantee that in an immediate shading style, you would then need to make sure those refracting-like objects are drawn last...
True, but deferring doesn't solve this issue by itself, since it doesn't guarantee order anyway. If you want to keep, for example, information about the "previous" surface, that sounds awesome, but frankly speaking I don't know how you can beat MIMD semantics without a predictably severe performance cost.
Depth peeling may be an idea, but then how do you accumulate the contributions from N layers?
Quote:Original post by kRogue
As for your complaint about the encoding part: let's say you have a fancy card, like a GeForce 8 series card; it supports 8 MRTs, which translates into 32 floats. Typical vertex-to-fragment is at most 8 varying vec4s, i.e. 32 floats, so one can "encode" the data as floats, the same as you would pass them from the vertex shader; that is not really encoding. But if one does not have a very fast card, then you might need to encode if your render target is GL_RGBA8 (8-bit fixed point). That is what I had to do, and it was not that painful really: floor and fract are your friends there.
I agree; even better, on the GF8 bit masking is available.
As I said, that's not really a problem, but it's nowhere near as elegant as having automatic unpacking. It's only a minor issue, mostly subjective... I would avoid bit tricks, but you can do otherwise and be happy with them.
Quote:Original post by kRogue
if the per pixel attributes are the same, or mostly the same, it makes more sense, as it completely avoids the issue of overdraw waste...
No, it doesn't.
In forward rendering you compute everything, then depth-test, then (possibly) trash.
In deferred shading (at least as I've analyzed it in the past) you compute the data you need, then the GPU does the depth test and (possibly) trashes; sure, you trash slightly less work. Later, all this stuff is resolved into a FB output in a single batch... which is typically the same as it would be before.
In lay-z-only you compute only depth-relevant attribs (typically OPOS only), depth test (often at double speed) and (possibly) trash. Then another batch sequence does the real computation giving the final result.

After observation I've personally gone for lay-z, yet there are some cases in which lay-z is slower than direct rendering...
Quote:Original post by kRogue
in my system, processing power is much tighter than memory bandwidth...
On... what system? Are you running on a FX? Are you using FSAA? What resolution? What about aniso?
Quote:Original post by kRogue
Deferred shading does not need to be done all the way or none of the way. Let's say you have a relatively simple calculation to be performed on a fragment: you can do a portion of that calculation at the fragment and forward the result to the fat buffer. Exactly what should be forwarded and what should not be is a non-trivial issue that depends on the situation. One simple example of where you don't do all the calculation in the deferred shader is a simple reflection: you fetch the texel of the environment map in the object's fragment shader and save that texel's value to the fat buffer, so the deferred shader then looks something like gl_FragColor = blend(reflected_texture, stuff). The point is that fetching the texel from the environment map is cheap and simple, so it is okay if that gets overdrawn (but lighting calculations done on a fragment should not be overdrawn).
Yes, I agree. This complexity computation is the point... it didn't sound so trivial to me so I didn't take my experiments to this degree.
Quote:
I hope that for a given level only a few are really active, since 12k shader changes per frame seems incredibly expensive... I use the per-pixel material ID to pick which fragment shader to run...
Absolutely not! The whole game has about 12k shaders (I believe I read it on Gamasutra); this doesn't mean even remotely that the average screen has this shader count.
Quote:Original post by kRogue
1) Fire-and-forget drawing: I can draw the meshes in any order I choose and don't have to deal with "must draw this before that". The main gain is that I can schedule drawing meshes according to vertex shader and texture (to avoid expensive state changes). Without deferred shading I have to either do a depth-only pass or draw front to back to get early z-cull.
Yes, I agree. It sounds absolutely awesome. I then went for a "smart-z"; after all the testing, I wish I had chosen otherwise since it tends to break rather easily.
Quote:
2) Effects that require stuff behind an object, like warped invisibility (think Predator), refraction, heat-haze... these are much, much easier...
How? I still don't get it. How can you avoid overwriting the previous layer in a SIMD-friendly way in the lay-attribute pass?
Quote:
other cool effects like weird filters ... but with portals in front of each other, portals within portals etc., so I have to cap the portals in the depth buffer to get the drawing correct
I believe I get the point but I'm not sure this saves the day. After all, most objects will usually employ "normal" shaders.
Quote:
... Without deferred shading it is very messy, as the value in the depth buffer does not really indicate the depth of the pixel there but is only a hidden surface removal mechanism; to do those effects without deferred shading is a major, major headache...
I suppose that's true, yet dynamic Prey-style portals are not going to be trivial anyway, considering visibility changes quite a lot, not to mention that entity behaviour has to be handled accordingly... and Prey didn't have portal counts big enough to justify the change.
As for static portals (as given by conventional wisdom), I'm not even sure this makes sense since they're there mostly for culling rather than for a real graphics need.
Quote:
3) Shadow mapping gets a touch better, as adding a shadowed light just becomes another pass with a deferred shader, without needing to go through the vertex shader an extra two times (with deferred you only need to go through it an extra one time, to generate the shadow map). Otherwise you are limited by the number of available texture units in how many shadowed lights can be applied to one object... though I have to admit, I am not a big shadow fan...
I agree, but lay-z also has this advantage, and rendering N lights is never going to be trivial.
Sure, you theoretically waste fewer batches with deferring, but we're assuming a complexity level that makes predicting performance far from easy.

Previously "Krohm"

Quote:
True, but deferring doesn't solve this issue by itself, since it doesn't guarantee order anyway. If you want to keep, for example, information about the "previous" surface, that sounds awesome, but frankly speaking I don't know how you can beat MIMD semantics without a predictably severe performance cost.
Depth peeling may be an idea, but then how do you accumulate the contributions from N layers?


I have to confess that in my system I have several types of texture targets:
1) Opaque normal: typical objects that are "solid"
2) Translucent objects
3) FX

Translucent objects get drawn to their own texture with multiplicative blending, so translucent objects are viewed as filters acting on the light channels. The nice part is that I don't have to do depth peeling and all that hassle; the bad part is that the opaque objects need to be drawn before the translucent stuff.
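Roughly, the translucent-pass shader just outputs the filter colour; the multiplicative part is plain blend state on the host side (something like glBlendFunc(GL_DST_COLOR, GL_ZERO)). The names here are invented for the sketch:

// translucent "filter" pass: with multiplicative blending the output scales whatever
// light is already in the buffer, so the surface acts as a filter on the light channels
uniform sampler2D tintMap;
varying vec2 texCoord;

void main()
{
    gl_FragColor = vec4(texture2D(tintMap, texCoord).rgb, 1.0);
}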

FX objects are for effects like smoke, haze, etc. These have (for now) 2 buffers they write to, for data that is forwarded to their deferred shaders.

The pipeline I have is:
1) In the room of the viewpoint:
     draw opaque objects, with depth writing, depth testing and stencil testing
     for each portal of the current room:
         open the portal (stencil write)
         recurse to the room beyond the portal
         close the portal
     draw translucent objects with depth testing but no depth writing
     draw FX objects with depth writing and testing
2) Execute the lighting deferred shaders to calculate the lighting.
3) Execute the FX shaders, which are allowed to use the results of the lighting shaders.
This system works well enough (but it does not handle the situation where there are multiple FX layers on a single pixel; only the front-most FX gets done).

So doing heat haze and refraction is very easy, as is selective filtering. For twisted fun I made a simple FX shader which simply used the raw specular texture of the pixels "behind it"; it was kind of funky to look at, as the model was a model from Doom 3 (the hell knight) being animated, but the pixels drawn where it stood came from the objects behind it, using their specular texture. Kind of silly, but it was very, very easy to do in this system. Another simple thing was to take the pixel as calculated by the lighting shader, but not use the pixel at that location, rather the one offset by (r*cos t, r*sin t), where r is some reasonably small number and t is time. A weird effect, but again very easy for me to do, something like the sketch below...
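As a sketch (all names invented for the example), the offset trick really is just a couple of lines in the FX shader:

// FX shader: sample the already-lit scene at a small offset that spins with time
uniform sampler2D litScene;      // output of the deferred lighting pass
uniform vec2 screenSize;         // viewport size in pixels
uniform float time;

void main()
{
    float r = 0.01;                                  // some reasonably small number
    vec2 offset = r * vec2(cos(time), sin(time));
    vec2 uv = gl_FragCoord.xy / screenSize + offset;
    gl_FragColor = texture2D(litScene, uv);
}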

Quote:
Now, to reduce overdraw, lay-z-only has a lot of interesting advantages: not only does it eat less BW, it also allows double-speed rendering on some cards and gives a correct depth buffer...

Have you considered lay-z-only?


I am going to sound like a bit of a ninny, but what exactly is lay-z-only? Is it the writing of data to off-screen buffers which depends on z, and then drawing stuff in a forward rendering style?

Quote:
On... what system? Are you running on a FX? Are you using FSAA? What resolution? What about aniso?

GeForce 6600GT (128MB), Athlon XP 3200+, 1GB RAM. Yes, my system has trouble running a fair number of the games that come out now... but with it I get 8 directional lights with per-pixel (bump-mapped) shading at a resolution of 1024x768 (no AA) at a framerate of 60+ FPS. Under a stress test of a room of mirrors (4 mirrors, so 4 portals) with the room holding over 40 Doom 3 models, the frame rate goes to like 20 or 40 FPS depending on the depth of the recursion (at a recursion depth of 5 levels, over 2000 meshes are not culled, bringing the frame rate down to 10 FPS). The actual number of models drawn is higher because of the mirrors, but you get the idea. Admittedly I do not do shadows, but under this system adding more models barely affected the frame rate after a while.


Quote:
No, it doesn't.
In forward rendering you compute everything, then depth-test, then (possibly) trash.
In deferred shading (at least as I've analyzed it in the past) you compute the data you need, then the GPU does the depth test and (possibly) trashes; sure, you trash slightly less work. Later, all this stuff is resolved into a FB output in a single batch... which is typically the same as it would be before.


In deferred shading with heavy lighting you trash a lot less work than with forward rendering; in a typical deferred shader one needs the following:
1) color texture
2) normal
3) specular
4) position

Position is practically free, color and specular are texture lookups, and the normal is a texture lookup together with two vec3 multiply-adds: very cheap if it gets trashed via the depth or stencil test, or overdrawn. A directional lighting calculation is much pricier (and worse if there are like 8 lights). I did an experiment where I drew a bunch of models in a line with per-pixel point lighting; if early z-cull did not save me, the framerate was really bad because of the overdraw... with deferred shading, I no longer worry about overdraw.
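For reference, a minimal sketch of what I mean by the deferred pass for one directional light; the buffer names, the layout and the Blinn-Phong exponent are assumptions for the example:

// deferred lighting pass: fetch the fat buffer once, then do the expensive part per light
uniform sampler2D colorBuffer;
uniform sampler2D normalBuffer;
uniform sampler2D specularBuffer;
uniform sampler2D positionBuffer;   // or reconstruct position from depth + screen coords
uniform vec3 lightDir;              // normalized, in the same space as the stored normals
uniform vec3 lightColor;

varying vec2 texCoord;

void main()
{
    vec3 color    = texture2D(colorBuffer,    texCoord).rgb;
    vec3 normal   = normalize(texture2D(normalBuffer, texCoord).xyz * 2.0 - 1.0);
    vec3 specular = texture2D(specularBuffer, texCoord).rgb;
    vec3 position = texture2D(positionBuffer, texCoord).xyz;

    vec3 viewDir = normalize(-position);          // eye space: the eye sits at the origin
    vec3 halfVec = normalize(lightDir + viewDir);

    float diff = max(dot(normal, lightDir), 0.0);
    float spec = pow(max(dot(normal, halfVec), 0.0), 32.0);

    gl_FragColor = vec4(lightColor * (color * diff + specular * spec), 1.0);
}

With 8 lights this either becomes a loop over light uniforms or one such pass per light, blended additively.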

Quote:
I suppose that's true, yet dynamic Prey-style portals are not going to be trivial anyway, considering visibility changes quite a lot, not to mention that entity behaviour has to be handled accordingly... and Prey didn't have portal counts big enough to justify the change.
As for static portals (as given by conventional wisdom), I'm not even sure this makes sense since they're there mostly for culling rather than for a real graphics need.


Tell me about it, I implemented a fully dynamic Prey-style portal system on steroids... it was non-trivial:
1) intelligent culling of rooms
   a) geometry based (via planes and frustum planes)
   b) occlusion based (back out of a room when the occlusion query is ready and reports not enough pixels)
2) stencil buffer/testing fun
3) depth capping of portals
4) co-ordinate system tracking, including...

[Edited by - kRogue on November 7, 2007 4:26:09 AM]
Close this Gamedev account, I have outgrown Gamedev.
Quote:Original post by kRogue
I have to confess that in my system
[...]
This system works well enough (but it does not handle the situation where there are multiple FX layers on a single pixel; only the front-most FX gets done)
Sounds very interesting. Looks like you're going to hit much more meaningful results than me. Could you please keep us updated on the conclusions?
Quote:Original post by kRogue
I am going to sound like a bit of a ninny, but what exactly is lay-z-only? Is it the writing of data to off-screen buffers which depends on z, and then drawing stuff in a forward rendering style?
Yes. Typical lay-z simply doubles the batch count: a first pass lays down depth only, then everything is re-drawn with color writes enabled. It is generally a win for high depth complexity, but not such a smart idea when that is not the case.
Its problems are mainly that (1) it doubles the batch count, which can be worked around in various cases, and (2) you don't get separated components (which wasn't a problem in my case).
Non-trivial lay-z doubles the batch count only in worst-case, typically unrealistic scenarios. Unluckily, it's a real pain to make it work!
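In GLSL terms the lay-z batch is about as cheap as a draw gets; a minimal sketch, assuming colour writes are masked off on the host side (e.g. with glColorMask) before this pass:

// lay-z / depth-only pass: the vertex shader just transforms, the fragment shader does
// nothing useful; depth is written by the fixed-function part of the pipeline
// --- vertex shader ---
void main()
{
    gl_Position = ftransform();
}
// --- fragment shader ---
void main()
{
    // no colour output needed; with colour writes masked, only depth lands in the buffer
}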
Caution: my depth complexity is typically much higher than the average application so my measurements were aiming at those cases. If you had 40 models then you're probably reaching a similar depth complexity.
Quote:Original post by kRogue
...adding more models barely affected the frame rate after a while...
That's for sure, but did you test the gain against a forward renderer? Have you got an estimate of the overdraw?
I admit my tests were quite a bit easier, but in the end I chose z-only.
Quote:Original post by kRogue
if early z-cull did not save me, the framerate was really bad because of the overdraw... with deferred shading, I no longer worry about overdraw.
That's a bit surprising. Early-z generally helped here.
Quote:
tell me about it...
You can bet you're not alone!


It sounds very promising! Try it out!

Previously "Krohm"

Quote:
Caution: my depth complexity is typically much higher than the average application so my measurements were aiming at those cases. If you had 40 models then you're probably reaching a similar depth complexity.


What I am shooting for is the capability to have a ridiculous number of models getting drawn; the more enemies to shoot at the better :) The stress case of 40 models with just one room and no portals is very fast even on my crappy hardware: over 40 FPS. As for my typical depth complexity, that is probably high due to the portalling... but alas, due to the depth capping of the portals I cannot do an early lay-z pass...

Quote:
That's for sure, but did you test the gain against a forward renderer? Have you got an estimate of the overdraw?


Not yet, as right now most of the shaders are set up for deferred shading. I want to get the forward rendering system back to do some tests, but I am not going to put in lay-z only, as that is not quite possible with my portalling system. I do know that when I first played with fragment shaders, if I deliberately created overdraw, i.e. put models in front of each other, then the frame rate went down a great deal if the models got drawn back to front (whereas front to back was okay). To get early z-cull to work requires one to either do a lay-z pass or sort the models, and I am not crazy about either approach (as I sort by shader and texture now). Since I am aiming for the "lots of monsters to shoot at" scenario, I anticipate getting lots of overdraw...

Close this Gamedev account, I have outgrown Gamedev.

