
Rendering architecture

obhi    196

I have been trying to come up with a good architecture design and, after reading some good posts on GameDev, have gone a little distance. My current design can be laid out as follows:

Scene traversal and culling collect the renderable objects into a list: VisibilitySet
The renderer determines the rendering order of the visible objects based on the object type mask and fills up a command buffer: RenderCmdBuffer

The deferred rendering algorithm is laid out as follows:

o Do animations and stream out.

o Disable color write and enable depth write.

o Depth pre-pass (only for opaque objects).

o Disable depth write.

o If occlusion query is enabled:
   o Run hardware occlusion queries based on the visible list and discard occluded objects.

o For each visible light:
   o Query visibility sets from the light's perspective (VLi), up to the max shadow depth buffer cap.
   o For each visible object, render to the light's depth buffer.

o Do a G-buffer pass with depth write disabled for all visible objects (minus the ones discarded by the occlusion query).

o Do an illumination pass for each light.

o Forward-render alpha-blended objects and particle systems.

The whole algorithm writes into a command buffer which is then dispatched to the renderer for execution. Various per-object and per-shader parameters will also be allocated within the command buffer and transferred to the render API.
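A minimal sketch of that record/dispatch split (RenderCmd, the pass constants, and the payloads below are all hypothetical names for illustration, not from a real engine):

```python
# Sketch: the passes above recorded into a flat command buffer that the
# renderer executes later. All names here are hypothetical.

DEPTH_PREPASS, SHADOW, GBUFFER, LIGHTING, FORWARD_ALPHA = range(5)

class RenderCmd:
    def __init__(self, pass_id, payload):
        self.pass_id = pass_id
        self.payload = payload  # stands in for per-object / per-shader params

class RenderCmdBuffer:
    def __init__(self):
        self.cmds = []

    def push(self, pass_id, payload):
        self.cmds.append(RenderCmd(pass_id, payload))

    def dispatch(self, execute):
        # hand every recorded command to the render API in recorded order
        for cmd in self.cmds:
            execute(cmd)

buf = RenderCmdBuffer()
for obj in ("rock", "tree"):
    buf.push(DEPTH_PREPASS, obj)  # opaque objects hit the depth pre-pass
    buf.push(GBUFFER, obj)        # the same object is recorded again later
buf.push(LIGHTING, "point_light_0")

executed = []
buf.dispatch(lambda c: executed.append((c.pass_id, c.payload)))
```

The point of the split is that recording can happen during scene traversal while execution stays entirely inside the renderer.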

Another approach that caught my attention was to automate the sorting using the command buffer and a key per command. This [url="http://realtimecollisiondetection.net/blog/?p=86"]article[/url] discusses the approach. It would allow me to pass the command buffer to each visible object, which appends its commands according to its render technique. The only problem I am facing is that a single object might register with the command buffer twice or more with different vertex formats (1. for the depth pre-pass, 2. for the G-buffer). This would incur sorting the same render objects twice. In order to avoid this I am considering building the visibility sets first and then filling up the command buffer in a preprocessing step (commands would carry no key in this case). Considering also that this detaches the object rendering code from the object and moves it to the renderer, I am guessing this will be a better approach.
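The keyed-sort idea from the linked article can be sketched roughly like this (the bit layout and field widths are purely illustrative assumptions):

```python
# Sketch of the keyed-sort idea: pack the pass id, quantized depth, and
# material id into one integer per command and sort once. The 16/32/16 bit
# layout is an illustrative assumption, not a recommendation.

def make_key(pass_id, depth_bits, material_id):
    # pass id in the top bits: every depth-pre-pass draw sorts before every
    # G-buffer draw, even when the same object registered for both
    return (pass_id << 48) | (depth_bits << 16) | material_id

PREPASS, GBUFFER = 0, 1
cmds = [
    (make_key(GBUFFER, 100, 7), "gbuffer rock"),
    (make_key(PREPASS, 100, 7), "prepass rock"),
    (make_key(GBUFFER, 50, 3), "gbuffer tree"),
    (make_key(PREPASS, 50, 3), "prepass tree"),
]
cmds.sort(key=lambda kc: kc[0])
order = [name for _, name in cmds]
```

Note that with the pass id in the most significant bits, a single sort over the combined keys already separates an object's pre-pass command from its G-buffer command, so the two registrations never need to be sorted independently.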

Anyone care to add something?

Thanks,

obhi

Ashaman73    13715
Here are some notes:

[quote name='obhi' timestamp='1310466676' post='4834220']
The deferred rendering algorithm is laid out as follows:

o Do Animations and Stream Out.

o Disable color write and enable depth write

o Depth Pre Pass (only for opaque objects)

o Disable depth write
[/quote]
A depth pre-pass is debatable in a deferred renderer. The more expensive shaders (SSAO, lighting, etc.) are often post-processing steps, so the benefit could be smaller than expected.


[quote name='obhi' timestamp='1310466676' post='4834220']
o if ( occlusion query is enabled )
o Hardware Occlusion query based on the visible list and discard occluded objects
[/quote]
IMHO occlusion queries for additional culling are b..s.. The reason is that you need to flush the command queue to receive the query result, which can be a performance killer.


[quote name='obhi' timestamp='1310466676' post='4834220']
o For each visible light:
o Query visibility sets from light's perspective VLi upto max shadow depth buffers cap
o For each visible objects render to light's depth buffer
[/quote]
Shadow maps are quite expensive. If you want to deliver shadows for every light and you plan to support many lights, you will run into performance issues really quickly. Try to restrict which lights cast shadows.

[quote name='obhi' timestamp='1310466676' post='4834220']
o Do Illumination pass for each light.
[/quote]
Group lights into single passes; it is really expensive to read all the buffers needed for the light calculation once per light. Even better, divide your screen into tiles and determine which lights affect which tile.
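The tile idea could be sketched like this (the tile size, the light screen rectangles, and all names are illustrative assumptions):

```python
# Sketch of screen-tile light binning: assign each light's screen-space
# bounding rectangle to the tiles it overlaps, so the lighting pass reads
# the G-buffer once per tile for just the lights touching that tile.

TILE = 16  # pixels per tile side (illustrative)

def bin_lights(lights, width, height):
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for light_id, (x0, y0, x1, y1) in lights.items():
        # clamp the light's screen rect, then mark every overlapped tile
        for ty in range(max(0, y0 // TILE), min(tiles_y, y1 // TILE + 1)):
            for tx in range(max(0, x0 // TILE), min(tiles_x, x1 // TILE + 1)):
                bins[(tx, ty)].append(light_id)
    return bins

# "L0" covers only the top-left tile; "L1" straddles four middle tiles
bins = bin_lights({"L0": (0, 0, 15, 15), "L1": (20, 20, 40, 40)}, 64, 64)
```

In a real renderer the rectangles would come from projecting each light's bounding volume, and the per-tile lists would drive either a stenciled pass or a compute shader.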

obhi    196
[quote name='Ashaman73' timestamp='1310473978' post='4834266']
A depth pre-pass is debatable in a deferred renderer. The more expensive shaders (SSAO, lighting, etc.) are often post-processing steps, so the benefit could be smaller than expected.
[/quote]

I am doing a pre pass mainly because of the occlusion culling.

[quote name='Ashaman73' timestamp='1310473978' post='4834266']
IMHO occlusion queries for additional culling are b..s.. The reason is that you need to flush the command queue to receive the query result, which can be a performance killer.
[/quote]
To avoid the GPU stalls, maybe something along the lines of NVIDIA's article on using the previous frame's query results could be used?

[quote name='Ashaman73' timestamp='1310473978' post='4834266']
Shadow maps are quite expensive. If you want to deliver shadows for every light and you plan to support many lights, you will run into performance issues really quickly. Try to restrict which lights cast shadows.
[/quote]
I will use a few main shadow casting lights.

[quote name='Ashaman73' timestamp='1310473978' post='4834266']
Group lights into single passes; it is really expensive to read all the buffers needed for the light calculation once per light. Even better, divide your screen into tiles and determine which lights affect which tile.
[/quote]
Thanks for this suggestion.

Thanks for your reply
obhi

smasherprog    568
[quote]To avoid the GPU stalls, maybe something along the lines of NVIDIA's article on using the previous frame's query results could be used?[/quote]
That is not where the method's performance hit comes from. The problem is that you have to draw each piece of geometry as it is found to be visible while you are still checking which objects to draw. The GPU stall isn't the problem either. Read this:
[url="http://nolimitsdesigns.com/game-design/octree-and-coherent-hierarchical-culling/"]http://nolimitsdesig...chical-culling/[/url]
I agree with what the above poster said: a z pre-pass does not give you increased performance, but rather the opposite.
Most culling and spatial partitioning algorithms perform poorly in any implementation; I still don't know why they are used. Instancing and batching are the single most important issues for performance. Video cards can execute so many instructions now that drawing an extra few thousand polygons has no noticeable impact on performance.

zacaj    667
Isn't there a function in OpenGL 3+ that lets you issue a draw call only if a previous query passed? That wouldn't require a stall, because the GPU is presumably executing commands in the order you submit them (it pretty much has to), so you can issue a query and then immediately a dependent draw call, since the query will have been executed first.
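For reference, OpenGL 3.0 exposes this as glBeginConditionalRender / glEndConditionalRender. A toy simulation of why no stall is needed: the predicate is evaluated in submission order on the GPU side, with no CPU readback (everything below is made up for illustration, not real GL):

```python
# Toy model of conditional rendering: the "GPU" executes commands strictly
# in submission order, so a draw predicated on an earlier query can be
# skipped GPU-side without the CPU ever reading the result back.

def execute(stream):
    samples_passed = {}  # query id -> sample count, resolved in order
    drawn = []
    for cmd in stream:
        if cmd[0] == "occlusion_query":
            _, qid, visible_samples = cmd
            samples_passed[qid] = visible_samples
        elif cmd[0] == "conditional_draw":
            _, qid, mesh = cmd
            if samples_passed[qid] > 0:  # predicate checked "on the GPU"
                drawn.append(mesh)
    return drawn

stream = [
    ("occlusion_query", 1, 120),        # bounding box passed 120 samples
    ("conditional_draw", 1, "statue"),  # visible, so it gets drawn
    ("occlusion_query", 2, 0),          # fully occluded
    ("conditional_draw", 2, "hidden_rock"),
]
drawn = execute(stream)
```

Real conditional rendering also offers a no-wait mode that draws anyway if the query result isn't ready yet, trading accuracy for the guarantee of never stalling.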

obhi    196
[quote name='smasherprog' timestamp='1310490344' post='4834388']
That is not where the method's performance hit comes from. The problem is that you have to draw each piece of geometry as it is found to be visible while you are still checking which objects to draw. The GPU stall isn't the problem either.
[/quote]
I was thinking more of using simple quads generated from the bounds, rather than the whole object, for the occlusion test. So for instanced and batched objects, multiple quads (batched, and all calculated on the CPU) can be drawn rather than a single one per object. It would be more like a fill test. On a side note, I separated the occlusion query from the result readback with the shadow map generation in between, so that the command queue will have been flushed by the end of it. Just before rendering the G-buffer I check on the query result.
I do agree, however, that the queries will incur extra draw calls, which is what should be avoided first.
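The bounds-to-quad idea could be sketched like this (the orthographic "projection" below is a stand-in for a real camera transform, and all names are illustrative):

```python
# Sketch: collapse a world-space AABB into one screen-space quad to use as
# a cheap occlusion-test proxy instead of drawing the full mesh.

def aabb_corners(mn, mx):
    (x0, y0, z0), (x1, y1, z1) = mn, mx
    return [(x, y, z) for x in (x0, x1) for y in (y0, y1) for z in (z0, z1)]

def screen_quad(mn, mx, project):
    pts = [project(c) for c in aabb_corners(mn, mx)]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    # 2D bounding rectangle of the projected corners = the test quad
    return (min(xs), min(ys), max(xs), max(ys))

ortho = lambda c: (c[0], c[1])  # trivial stand-in for a real projection
quad = screen_quad((-1.0, -2.0, 0.0), (3.0, 4.0, 5.0), ortho)
```

For batched objects the per-instance quads could be appended into one vertex buffer on the CPU, so the whole occlusion pass stays a handful of draw calls.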

Thanks,
obhi


Ashaman73    13715
[quote name='obhi' timestamp='1310528605' post='4834608']
[quote name='smasherprog' timestamp='1310490344' post='4834388']
That is not where the method's performance hit comes from. The problem is that you have to draw each piece of geometry as it is found to be visible while you are still checking which objects to draw. The GPU stall isn't the problem either.
[/quote]
I was thinking more of using simple quads generated from the bounds, rather than the whole object, for the occlusion test. So for instanced and batched objects, multiple quads (batched, and all calculated on the CPU) can be drawn rather than a single one per object. It would be more like a fill test. On a side note, I separated the occlusion query from the result readback with the shadow map generation in between, so that the command queue will have been flushed by the end of it. Just before rendering the G-buffer I check on the query result.
I do agree, however, that the queries will incur extra draw calls, which is what should be avoided first.

Thanks,
obhi
[/quote]
To be honest, you are optimizing too early and at the wrong end of the pipeline. Deferred rendering engines will most likely be fillrate bound (multiple, expensive, full-screen post-processing shaders). Your optimization happens at the culling level, which mostly reduces vertex processing requirements in a deferred rendering pipeline. Doing a pre-pass means you have to render the scene twice, so the vertex cost is 2x (even if early-z is lightning fast). But if that is not a problem, why use occlusion queries to additionally cull vertices which are already fast to render? It is more likely that occlusion queries will kill your performance than improve it.

I could only see this kind of optimization paying off with really expensive shaders used to build up your G-buffer (relief mapping etc.).

My suggestion: leave occlusion queries out; even if you use the query from the previous frame, you would run into either stalls or artifacts (objects popping in on fast camera turns). The pre-pass is easy and cheap enough, so make it optional and check your performance win by turning it on/off. :D

obhi    196
[quote]
To be honest, you are optimizing too early and at the wrong end of the pipeline. Deferred rendering engines will most likely be fillrate bound (multiple, expensive, full-screen post-processing shaders). Your optimization happens at the culling level, which mostly reduces vertex processing requirements in a deferred rendering pipeline. Doing a pre-pass means you have to render the scene twice, so the vertex cost is 2x (even if early-z is lightning fast). But if that is not a problem, why use occlusion queries to additionally cull vertices which are already fast to render? It is more likely that occlusion queries will kill your performance than improve it.

I could only see this kind of optimization paying off with really expensive shaders used to build up your G-buffer (relief mapping etc.).

My suggestion: leave occlusion queries out; even if you use the query from the previous frame, you would run into either stalls or artifacts (objects popping in on fast camera turns). The pre-pass is easy and cheap enough, so make it optional and check your performance win by turning it on/off. :D
[/quote]

My reasoning was along much the same lines: both the z pre-pass and the occlusion test can be turned on/off. The z pre-pass is pretty fast, but I have now figured the occlusion test is of no use. I was thinking about saving memory bandwidth when writing to the G-buffer, but then realized the occlusion test won't improve that anyway (I missed the term 'fillrate bound'... sucks that I missed such a basic thing :P). I get your point now.

Thanks for the quick response and for clearing that up.
obhi

