Rendering architecture

6 comments, last by obhi 12 years, 9 months ago

After reading some good posts on gamedev I have been trying to come up with a good architecture design, and I have gone a little distance. My current design can be laid out as follows:

Scene traversal and culling list the renderable objects in a list: VisibilitySet
The renderer determines the order in which to render the visible objects based on the object type mask and fills up a command buffer: RenderCmdBuffer
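
As a rough sketch, the two stages could look like this in code. The layouts of Renderable, VisibilitySet, and the command buffer are my own assumptions; the post does not show the real interfaces:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: a renderable carries a type mask used for ordering.
struct Renderable {
    uint32_t typeMask;   // e.g. opaque = 1, alpha = 2, particles = 4
    uint32_t objectId;
};

// Stage 1 output: scene traversal + culling fills this list.
using VisibilitySet = std::vector<Renderable>;

// Stage 2 output: one command per visible object, ordered by type mask
// (standing in for RenderCmdBuffer from the post).
struct RenderCmd {
    uint32_t objectId;
};

std::vector<RenderCmd> buildCommandBuffer(VisibilitySet set) {
    // Stable sort by type mask so opaque objects come before blended ones
    // while preserving traversal order within a type.
    std::stable_sort(set.begin(), set.end(),
                     [](const Renderable& a, const Renderable& b) {
                         return a.typeMask < b.typeMask;
                     });
    std::vector<RenderCmd> cmds;
    cmds.reserve(set.size());
    for (const Renderable& r : set)
        cmds.push_back(RenderCmd{r.objectId});
    return cmds;
}
```

The renderer would then walk the returned command list and translate each command into render-API calls.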

The deferred rendering algorithm is laid out as follows:

o Do Animations and Stream Out.

o Disable color write and enable depth write

o Depth Pre Pass (only for opaque objects)

o Disable depth write

o If occlusion queries are enabled:
   o Issue hardware occlusion queries based on the visible list and discard occluded objects

o For each visible light:
   o Query visibility sets from the light's perspective (VLi), up to the max shadow depth buffer cap
   o For each visible object, render to the light's depth buffer

o Do a G-Buffer pass with depth write disabled for all the visible objects (minus the ones discarded in the occlusion query).

o Do an illumination pass for each light.

o Forward render alpha blended objects + particle systems.

The whole algorithm writes to a command buffer, which is dispatched to the renderer for execution. Various per-object, per-shader parameters will also be allocated within the command buffer and transferred to the render API.

One more approach that caught my eye was to automate the sorting using the command buffer and a key per command. This article discusses that approach. It would allow me to pass the command buffer to each visible object, which then adds its commands as per its render technique. The only problem I am facing is that a single object might register with the command buffer twice or more with different vertex formats (1. for the depth pre-pass, 2. for the G-Buffer). This would incur double sorting of the same render objects. To avoid this I am considering building the visibility sets first and then filling up the command buffer in a preprocessing step (commands would carry no key in that case). Also, considering that this detaches the rendering code from the object and moves it into the renderer, I am guessing this will be the better approach.
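
For illustration, the keyed-command idea can be sketched like this. The field layout (pass, material, depth) and the bit widths are my own assumptions, not from the article. Putting the pass index in the most significant bits means the two registrations of one object (depth pre-pass vs. G-Buffer) sort into disjoint ranges in a single sort, so they never interleave:

```cpp
#include <cstdint>

// Hypothetical 64-bit sort key: [ pass:4 | material:24 | depth:24 | unused:12 ]
// Pass sits in the top bits so one sort orders all depth pre-pass commands
// before all G-buffer commands, then by material and depth within a pass.
constexpr uint64_t makeKey(uint64_t pass, uint64_t material, uint64_t depth) {
    return (pass << 60) | ((material & 0xFFFFFF) << 36) | ((depth & 0xFFFFFF) << 12);
}

constexpr uint64_t keyPass(uint64_t key)     { return key >> 60; }
constexpr uint64_t keyMaterial(uint64_t key) { return (key >> 36) & 0xFFFFFF; }
constexpr uint64_t keyDepth(uint64_t key)    { return (key >> 12) & 0xFFFFFF; }
```

Sorting a flat array of (key, command) pairs by key alone then yields the full submission order without any per-pass bucketing code.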

Anyone care to add something?

Thanks,

obhi

What if everyone had a restart button behind their head ;P
Here are some notes:


The deferred rendering algorithm is laid out as follows:

o Do Animations and Stream Out.

o Disable color write and enable depth write

o Depth Pre Pass (only for opaque objects)

o Disable depth write

A depth pre-pass is debatable in a deferred renderer. The most expensive shaders (SSAO, lighting, etc.) are often post-processing steps, so the benefit could be smaller than expected.



o If occlusion queries are enabled:
   o Issue hardware occlusion queries based on the visible list and discard occluded objects

IMHO occlusion queries for additional culling are b..s.. The reason is that you need to flush the command queue to receive the query result, which can be a performance killer.



o For each visible light:
   o Query visibility sets from the light's perspective (VLi), up to the max shadow depth buffer cap
   o For each visible object, render to the light's depth buffer

Shadow maps are quite expensive. If you want to deliver shadows for every light and you plan to support many lights, you will run into performance issues really quickly. Try to filter which lights cast a shadow.
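
That filtering could be as simple as scoring lights and capping the number of shadow casters per frame. A CPU-side sketch with made-up types, scoring by intensity over squared distance to the camera:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: keep only the N best-scoring lights as shadow
// casters this frame; the rest still light the scene but cast no shadows.
struct Light {
    float x, y, z;     // position
    float intensity;
    bool castsShadow;  // output: selected as a shadow caster this frame
};

void selectShadowCasters(std::vector<Light>& lights,
                         float camX, float camY, float camZ,
                         std::size_t maxCasters) {
    // Order light indices by descending score.
    std::vector<std::size_t> order(lights.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    auto score = [&](const Light& l) {
        float dx = l.x - camX, dy = l.y - camY, dz = l.z - camZ;
        float d2 = dx * dx + dy * dy + dz * dz;
        return l.intensity / (1.0f + d2);  // +1 avoids division by zero
    };
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) {
                  return score(lights[a]) > score(lights[b]);
              });
    for (std::size_t i = 0; i < order.size(); ++i)
        lights[order[i]].castsShadow = (i < maxCasters);
}
```

A real engine would likely also factor in screen coverage and hysteresis so casters do not flicker on and off between frames.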


o Do an illumination pass for each light.

Group lights into single passes; it is really expensive to read all the buffers needed for the light calculation once per light. Even better, divide your screen into tiles and determine which light affects which tile.
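
A minimal CPU-side sketch of that tiling idea, assuming each light has already been reduced to a screen-space bounding circle (the projection step is omitted):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: split the screen into fixed-size tiles and add each
// light to every tile its screen-space bounding circle overlaps. The
// illumination pass then reads only the per-tile light lists.
struct ScreenLight {
    float cx, cy;   // circle center in pixels
    float radius;   // bounding radius in pixels
};

std::vector<std::vector<int>> binLights(const std::vector<ScreenLight>& lights,
                                        int width, int height, int tileSize) {
    int tilesX = (width + tileSize - 1) / tileSize;
    int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> tiles(tilesX * tilesY);
    for (int i = 0; i < (int)lights.size(); ++i) {
        const ScreenLight& l = lights[i];
        // Tile range covered by the light's bounding box, clamped to screen.
        int x0 = std::max(0, (int)((l.cx - l.radius) / tileSize));
        int x1 = std::min(tilesX - 1, (int)((l.cx + l.radius) / tileSize));
        int y0 = std::max(0, (int)((l.cy - l.radius) / tileSize));
        int y1 = std::min(tilesY - 1, (int)((l.cy + l.radius) / tileSize));
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                tiles[y * tilesX + x].push_back(i);
    }
    return tiles;
}
```

GPU implementations usually do this binning in a compute shader with per-tile depth bounds, but the bookkeeping is the same idea.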

A depth pre-pass is debatable in a deferred renderer. The most expensive shaders (SSAO, lighting, etc.) are often post-processing steps, so the benefit could be smaller than expected.


I am doing a pre pass mainly because of the occlusion culling.


IMHO occlusion queries for additional culling are b..s.. The reason is that you need to flush the command queue to receive the query result, which can be a performance killer.

To avoid the GPU stalls, maybe something along the lines of NVIDIA's article about using the previous frame's query result can be used?
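
For reference, that pattern (consume last frame's query results instead of waiting for this frame's, accepting one frame of latency) amounts to bookkeeping like the following. The GpuQueries type here is a stand-in for real GL/D3D query objects, not an actual API:

```cpp
#include <unordered_map>
#include <vector>

// Sketch of the latency-based query pattern: the result consumed for an
// object this frame is the one issued for it last frame, so the CPU never
// waits on the GPU. Hypothetical types; not a real graphics API.
struct GpuQueries {
    std::unordered_map<int, bool> pending;  // objectId -> visibility result
};

struct OcclusionCuller {
    std::unordered_map<int, bool> lastFrame;  // results from frame N-1

    // Returns the objects to draw this frame: anything that was visible
    // last frame, or that we have no result for yet (assume visible).
    std::vector<int> cull(const std::vector<int>& visibleList,
                          GpuQueries& queries) {
        std::vector<int> toDraw;
        for (int id : visibleList) {
            auto it = lastFrame.find(id);
            bool visible = (it == lastFrame.end()) || it->second;
            if (visible) toDraw.push_back(id);
        }
        // Collect this frame's query results for use next frame.
        lastFrame = queries.pending;
        queries.pending.clear();
        return toDraw;
    }
};
```

The cost of the latency is the popping artifact mentioned later in the thread: an object that becomes visible again is drawn one frame late.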


Shadow maps are quite expensive. If you want to deliver shadows for every light and you plan to support many lights, you will run into performance issues really quickly. Try to filter which lights cast a shadow.

I will use a few main shadow casting lights.


Group lights into single passes; it is really expensive to read all the buffers needed for the light calculation once per light. Even better, divide your screen into tiles and determine which light affects which tile.

Thanks for this suggestion.

Thanks for your reply
obhi
What if everyone had a restart button behind their head ;P
To avoid the GPU stalls, maybe something along the lines of NVIDIA's article about using the previous frame's query result can be used?
This is not where the performance hit of the method comes from. The problem is that you have to draw each piece of geometry as it is found to be visible while you are checking objects to draw. The GPU stall isn't the problem either. Read this:
http://nolimitsdesig...chical-culling/
I agree with what the above poster said: a pre-pass z-buffer does not give you increased performance, but does the opposite.
Most culling and spatial partitioning algorithms perform poorly in any implementation -- I still don't know why they are used. Instancing and batching are the single most important issues for performance. Video cards can execute so many instructions now that drawing an extra thousand polygons has no noticeable impact on performance.
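
Batching as described can be sketched as grouping the draw list by mesh so each group becomes one (instanced) draw call; the types here are hypothetical:

```cpp
#include <map>
#include <vector>

// Hypothetical sketch of batching: instead of one draw call per object,
// group objects by mesh id and issue one (instanced) draw per group.
struct Instance { int meshId; float x, y, z; };

struct Batch {
    int meshId;
    std::vector<Instance> instances;  // per-instance data for one draw call
};

std::vector<Batch> batchByMesh(const std::vector<Instance>& objects) {
    std::map<int, Batch> byMesh;  // ordered map keeps output deterministic
    for (const Instance& obj : objects) {
        Batch& b = byMesh[obj.meshId];
        b.meshId = obj.meshId;
        b.instances.push_back(obj);
    }
    std::vector<Batch> out;
    for (auto& kv : byMesh) out.push_back(std::move(kv.second));
    return out;  // one draw call per element instead of one per object
}
```

Each batch's instance data would then be uploaded to an instance buffer and submitted with a single instanced draw.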
Wisdom is knowing when to shut up, so try it.
--Game Development http://nolimitsdesigns.com: Reliable UDP library, Threading library, Math Library, UI Library. Take a look, its all free.
Isn't there a function in OpenGL 3+ that lets you do a draw call ONLY if a previous query was successful? That wouldn't require a stall, because the GPU is probably doing the work in the order you tell it to (it pretty much has to), so you can issue a query and then immediately a dependent draw call, since the query would be executed first.

This is not where the performance hit of the method comes from. The problem is that you have to draw each piece of geometry as it is found to be visible while you are checking objects to draw. The GPU stall isn't the problem either.

I was thinking more of using simple quads generated from the bounds rather than the whole object for the occlusion test. So for instanced and batched objects, multiple quads (batched, and all calculated on the CPU) can be drawn rather than a single one per object. It will be more like a fill test. On a side note, I padded the occlusion query and the result query with the shadow map generation so that the command queue can be flushed by the end of it. Just before rendering the G-Buffer I will check the query result.
I do agree, however, that the queries will incur extra draw calls, which is what should be avoided first.

Thanks,
obhi


What if everyone had a restart button behind their head ;P

[quote name='smasherprog' timestamp='1310490344' post='4834388']
This is not where the performance hit of the method comes from. The problem is that you have to draw each piece of geometry as it is found to be visible while you are checking objects to draw. The GPU stall isn't the problem either.

I was thinking more of using simple quads generated from the bounds rather than the whole object for the occlusion test. So for instanced and batched objects, multiple quads (batched, and all calculated on the CPU) can be drawn rather than a single one per object. It will be more like a fill test. On a side note, I padded the occlusion query and the result query with the shadow map generation so that the command queue can be flushed by the end of it. Just before rendering the G-Buffer I will check the query result.
I do agree, however, that the queries will incur extra draw calls, which is what should be avoided first.

Thanks,
obhi
[/quote]
To be honest, you are optimizing too early and at the wrong end of the pipeline. Deferred rendering engines will most likely be fillrate bound (multiple expensive full-screen post-processing shaders). Your optimization is done at the culling level, which mostly decreases the vertex processing requirements in a deferred rendering pipeline. Doing a pre-pass means that you have to render the scene twice; the vertex performance hit is 2x (even if early-z is lightning fast). But if that is not a problem, why use occlusion queries to cull additional vertices which are already fast to render? It is more likely that you will kill performance with occlusion queries than increase it.

I could only see this kind of optimization paying off when using really expensive shaders to build up your G-buffer (relief mapping etc.).

My suggestion: leave occlusion queries out; even if you use the query from the previous frame, you would either run into stalls or artifacts (popping objects on a fast camera turn). The pre-pass is easy and cheap enough, so make it optional and check your performance win by turning it on/off. :D

To be honest, you are optimizing too early and at the wrong end of the pipeline. Deferred rendering engines will most likely be fillrate bound (multiple expensive full-screen post-processing shaders). Your optimization is done at the culling level, which mostly decreases the vertex processing requirements in a deferred rendering pipeline. Doing a pre-pass means that you have to render the scene twice; the vertex performance hit is 2x (even if early-z is lightning fast). But if that is not a problem, why use occlusion queries to cull additional vertices which are already fast to render? It is more likely that you will kill performance with occlusion queries than increase it.

I could only see this kind of optimization paying off when using really expensive shaders to build up your G-buffer (relief mapping etc.).

My suggestion: leave occlusion queries out; even if you use the query from the previous frame, you would either run into stalls or artifacts (popping objects on a fast camera turn). The pre-pass is easy and cheap enough, so make it optional and check your performance win by turning it on/off. :D

My reasoning is along much the same lines; both the z-pre-pass and the occlusion test can be turned on/off. The z-pre-pass is pretty fast, but I figured out the occlusion test is of no use. I was thinking about saving memory bandwidth when outputting to the G-Buffer, but then realized the occlusion test won't improve that anyway (I missed the term 'fillrate bound'... sucks that I missed such a basic thing :P). I get your point now.

Thanks for the quick response and for clearing that up.
obhi
What if everyone had a restart button behind their head ;P

This topic is closed to new replies.
