Member Since 19 Sep 2007

#5152252 Kinds of deferred rendering

Posted by ATEFred on 08 May 2014 - 02:45 AM

There are commercial game engines that use all of these approaches. Frostbite 3 uses the tiled compute-shader approach; the S.T.A.L.K.E.R. engine was the first I know of to use deferred shading; loads of games have used light prepass (especially on the 360, to get around EDRAM size limitations and avoid tiling); Volition came up with inferred rendering and used it in their engine; and forward+ seems to be one of the trendier approaches for upcoming games, though I'm not sure anything released uses it yet.

The main thing is for you to decide what your target platforms are and what kind of scenes you want to render. (visible entity counts, light counts, light types, whether a single lighting model is enough for all your surfaces, etc.) 

For learning purposes though, they are all similar enough that you can just pick a simpler one (deferred shading or light prepass maybe), get it working, and then adapt afterwards to a more complex approach if needed. 

As for docs / presentations, there are plenty around for all of these. I would recommend reading the GPU Pro books; there are plenty of papers on this. DICE has presentations on their website (dice.se) you can freely access about the tiled approach they used on BF3. The GDC Vault is also a great place to look.

You can also find example implementations around, like here:
(the authors are active on this forum, btw)

#5152037 Kinds of deferred rendering

Posted by ATEFred on 07 May 2014 - 08:07 AM

The best way will depend on the type of scene you have and your target hw. They are mostly pretty similar, but here is an overview of a few of the popular ones:
- deferred shading: generate a gbuffer for opaque objects with all the properties needed for both lighting and shading, followed by a light pass (typically rendering a geometric volume or quad per light and a fullscreen pass for sunlight), followed by a composite pass where lighting and surface properties are combined to produce the final shaded buffer. This is followed by alpha passes, often with forward lighting, and post-fx passes, which can use the contents of the gbuffer if needed. A full or partial Z prepass is optional. Advantages include potentially rendering your scene geometry only once and decoupling lighting cost from scene complexity.

- light prepass / deferred lighting: the same kind of steps, only with a minimal gbuffer containing only what you need for the actual lighting (often just the depth buffer plus one render target with normals and spec power), the same kind of light pass, but then another full scene rendering pass to get the final colour buffer. This means many more draw calls, but much lighter gbuffers, which can be handy on HW with limited bandwidth, limited MRT support, or limited EDRAM like the 360. It also gives more flexibility than the previous approach when it comes to object materials, since you are not limited to the information you can store in the gbuffer.

- inferred rendering: like light prepass, only with a downsampled gbuffer containing material IDs and a downsampled light pass, but a full-resolution colour pass which uses the IDs to pick the correct values from the light buffer without edge artifacts. A neat way of doing the gbuffer and light passes much faster at the cost of resolution. It can also store alpha object properties in the gbuffer with a dithered pattern, excluding the samples you don't want / that aren't for that layer during the colour pass, so there is no more need for forward lighting on alpha objects (up to a point).

- tiled deferred: instead of rendering volumes or quads for your lights, which can get pretty expensive with a lot of light overdraw (especially if your light volumes are not super tight), you divide the screen into small tiles, generate a frustum per tile, cull your lights on the GPU against each tile frustum, and then light only the fragments in the tile using the resulting list. Usually done in a CS: no overdraw issues at all and overall much faster, but it requires modern HW and can generate very large tile frusta when a tile has large depth discontinuities. The last part can be mitigated by adding a depth subdivision to your tiles (use 3D clusters instead of 2D tiles).

- forward+: similar, but with a Z prepass instead of gbuffer generation, then a pass to generate light lists per tile (same as above); instead of lighting at that point, you render your scene again and light forward-style using the list of lights intersecting the current tile. Allows for material flexibility and easy MSAA support at the cost of another full geometry pass.

There are loads more variations of course, but these are maybe a good starting point.
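The tiled culling step described above can be mimicked on the CPU to make the idea concrete. This is only an illustrative sketch (on a real GPU it runs in a compute shader, one thread group per tile, with proper frusta rather than 2D circles); all names and sizes here are made up:

```python
# CPU-side sketch of tiled light culling: split the screen into tiles and
# keep, per tile, only the lights whose bounds touch that tile.
# Lights are simplified to screen-space circles for illustration.
TILE = 16  # pixels per tile side

def cull_lights(screen_w, screen_h, lights):
    """lights: list of (cx, cy, radius) circles in screen space.
    Returns {(tile_x, tile_y): [light indices]} for non-empty tiles."""
    tiles = {}
    for ty in range(0, screen_h, TILE):
        for tx in range(0, screen_w, TILE):
            hit = []
            for i, (cx, cy, r) in enumerate(lights):
                # closest point on the tile rectangle to the light centre
                px = min(max(cx, tx), tx + TILE)
                py = min(max(cy, ty), ty + TILE)
                if (px - cx) ** 2 + (py - cy) ** 2 <= r * r:
                    hit.append(i)
            if hit:
                tiles[(tx // TILE, ty // TILE)] = hit
    return tiles

lists = cull_lights(64, 64, [(8, 8, 10), (60, 60, 4)])
```

The shading pass then only loops over `lists[tile]` for each fragment, which is what removes the light overdraw cost.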

#5124812 how to draw a moving object trace path?

Posted by ATEFred on 19 January 2014 - 04:22 AM

One approach is to have attractors at the start and end of your trace-creating object (such as the hilt and tip of a sword), and create bands of geometry every frame from these attractor positions. The simplest form is one band being 2 verts, one per attractor / end of the band; keep a dynamic vertex buffer which you fill with all active band vertices, from newest to oldest (or the other way around). You can use vertex colours to fade it out. Then render it as a tri strip, and job done :)
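A minimal sketch of that band-building step, with the per-band fade stored where the vertex colour alpha would go (all names are illustrative, not from any particular engine):

```python
# Each frame we record the two attractor positions (e.g. sword hilt and tip)
# and emit one band of two vertices; alpha fades from newest to oldest band.
def build_trail_vertices(history, max_bands):
    """history: list of (hilt_pos, tip_pos) pairs, newest last.
    Returns a flat list of (position, alpha) vertices for a tri-strip."""
    bands = history[-max_bands:]          # keep only the newest bands
    n = len(bands)
    verts = []
    for i, (hilt, tip) in enumerate(reversed(bands)):  # newest first
        alpha = 1.0 - i / max(n - 1, 1)   # 1.0 at newest band, 0.0 at oldest
        verts.append((hilt, alpha))
        verts.append((tip, alpha))
    return verts

v = build_trail_vertices([((0, 0, 0), (0, 1, 0)),
                          ((1, 0, 0), (1, 1, 0))], max_bands=8)
```

Filling a dynamic vertex buffer with exactly this layout each frame and drawing it as a triangle strip gives the trail.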

#5117780 Are GPU constants really constants?

Posted by ATEFred on 18 December 2013 - 03:22 AM

I don't believe this is a GPU limitation, but rather a limitation of APIs like OpenGL ES 2, which do not let you bind constants to registers but instead assign them to "slots" associated with your shader program. So if you don't change shader you don't need to reset them, but every time you bind a new shader you will need to reset all its constants. D3D9/11, GLES3, GL4, etc. do not have that limitation.
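A toy illustration of the difference, with everything hypothetical (this just models the per-program storage, not any real API):

```python
# In a GLES2-style model each program owns its own uniform storage, so
# binding a different program means every constant must be set again.
class Gles2StyleDevice:
    def __init__(self):
        self.programs = {}        # program -> {slot: value}
        self.current = None

    def use_program(self, prog):
        self.programs.setdefault(prog, {})
        self.current = prog

    def set_uniform(self, slot, value):
        self.programs[self.current][slot] = value

    def read_uniform(self, slot):
        # a slot never set on the current program has no value
        return self.programs[self.current].get(slot)

dev = Gles2StyleDevice()
dev.use_program("shaderA")
dev.set_uniform("mvp", "matrix1")
dev.use_program("shaderB")        # constants do NOT carry over
```

With register-style binding (D3D constant buffers, etc.) the bound data survives a shader switch instead.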

#5114054 Its all about DirectX and OpenGL?

Posted by ATEFred on 03 December 2013 - 09:20 AM

Thanks for your reply Zaoshi!!


So, it's a good idea to write a 'wrapper' layer over the graphics API if you want to write a cross-platform game engine... right?

Could someone who uses the PlayStation SDK and/or a Nintendo SDK share some knowledge?

Thanks!!


Yeah, you want to have your own wrapper around the different graphics APIs (libgcm for PS3, libgnm for PS4, DX11, DX for Xbox, OGL, etc.).
Coming up with the right level of abstraction can take some time, and you need to learn the differences between the APIs pretty well to get it right, but overall it's not too difficult.

As Zaoshi mentioned, the constructs are pretty similar between all major graphics APIs. Some things are still a touch different, such as constant management (GLES2 vs DX11/GL3+ vs consoles), and console graphics APIs usually expose a lot more than is typically available on PC through DX and GL.

#5101362 Many dispatch calls vs. higher ThreadGroupCount

Posted by ATEFred on 14 October 2013 - 01:35 PM

Thank you for the reply.


Yep, I know about the thread group size. But for example, if your shader is configured with [numthreads(64, 1, 1)] as the thread group size, you could dispatch one group with Dispatch(1,1,1) and still have a full 64 threads running. As far as I know you should always run a multiple of 64.


So, if we take my first example again:

ID3D11DeviceContext::Dispatch(10,1,1); // one call: 10 groups x 64 threads -> 640
for(int i = 0; i < 10; i++)
  ID3D11DeviceContext::Dispatch(1,1,1); // 10 calls: 1 group x 64 threads each -> 640 total

Let's say I am forced to use the second "solution" for some complicated reason. I wonder how much worse it would be, or what other disadvantages I would have.

If you only have one thread group, whatever its size, you will probably run into issues when the GPU stalls execution waiting for memory fetches or the like. Typically it would just start working on another warp/wavefront when the current group stalls, and get back to the original one once its data is ready.

If you dispatch many groups one by one, there is no guarantee that they will run in parallel and allow the GPU to jump between them. (In fact, I am pretty sure that on current PC GPUs you are guaranteed that they won't.)
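To make the arithmetic from the example above explicit: both call patterns launch the same total thread count; the difference is how many groups each submission exposes to the scheduler at once. A tiny sketch (thread group size 64, as in the [numthreads] above):

```python
# Both dispatch patterns launch 640 threads in total; the first exposes
# 10 groups to the GPU scheduler in one call, the second only 1 per call.
THREADS_PER_GROUP = 64  # matches [numthreads(64, 1, 1)]

def threads_launched(dispatches):
    """dispatches: list of (x, y, z) group counts, one tuple per Dispatch call."""
    return sum(x * y * z * THREADS_PER_GROUP for (x, y, z) in dispatches)

one_call  = threads_launched([(10, 1, 1)])      # Dispatch(10,1,1) once
ten_calls = threads_launched([(1, 1, 1)] * 10)  # Dispatch(1,1,1) ten times
```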

#5093732 Stippled Deferred Translucency

Posted by ATEFred on 13 September 2013 - 02:15 AM

That's similar to clustered shading, but instead of storing a list of lights per cell/texel in the volume, you're storing the (approximate) radiance at that location. I was thinking of using something similar for things where approximate lighting is ok, like smoke particles. Does it work well for you in these cases?

BTW, if you stored the light in each cell as SH, you could extract the dominant light direction and colour from this representation, and use it for some fake specular highlights ;)


That's pretty much it. It works really well for particles and fog with the single directionless approximated value, and it's lightning fast, once it is generated. I'll have to get a video capture done at some point.


Atm I use the HL2 basis rather than SH (simply because it was easier to prototype, and for alpha geo I only really care about camera-facing stuff). Getting the dominant direction from SH sounds like a good idea; not sure how computationally expensive it is? I'll need to look it up.
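For what it's worth, for band-1 (linear) SH a commonly used approximation takes the dominant light direction straight from the three L1 coefficients, so it's essentially just a vector normalize. A hedged sketch (coefficient values and ordering here are illustrative, not from any specific SH library):

```python
# Approximate dominant light direction from the three linear SH coefficients:
# the L1 band is proportional to (y, z, x) of the mean light direction, so
# normalizing the coefficient vector gives the direction - a few ALU ops.
import math

def dominant_direction(l1_y, l1_z, l1_x):
    x, y, z = l1_x, l1_y, l1_z
    length = math.sqrt(x * x + y * y + z * z)
    if length == 0.0:
        return (0.0, 0.0, 1.0)  # no directional content: any direction works
    return (x / length, y / length, z / length)

d = dominant_direction(0.0, 0.0, 2.0)  # all energy along +x
```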

#5093512 Stippled Deferred Translucency

Posted by ATEFred on 12 September 2013 - 03:32 AM

For alpha lighting, I generate a volume texture locked to the camera with lighting information (warped to match the frustum). Atm I fill this in a CS, similar to the usual CS light culling pass. I store both a single non-directional approximated lighting value and a separate set of directional values. This allows me to do either a simple texture fetch to get rough lighting info when applying it (for particles for example), or higher quality directional application with 3 texture fetches and a few ALU ops.

It's a pretty simple system atm, downsides are lower lighting resolution in the distance, and it's not exactly free to generate. (That might be possible to optimize by at least partially generating it on CPU though). Also, no specular atm...

Pros are cheap lighting, even for a huge number of particles, and semi cheap volumetric lighting / light shafts for any shadow casting light in the scene, as I also march through the volume when I apply my directional light volumetric shadows (simple raymarch through shadowmap).
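The frustum-warped addressing described above can be sketched like this. This is an assumption about how such a volume is addressed, not the poster's actual code: x/y come from the projected position and z is a normalized distance through the frustum slab (here linear; real implementations often warp z non-linearly). All names and constants are illustrative:

```python
# Map a view-space position (+z into the screen) into [0,1]^3 coordinates
# of a camera-locked lighting volume warped to match the frustum.
def view_pos_to_volume_uvw(vx, vy, vz, tan_half_fov_x, tan_half_fov_y,
                           near, far):
    """Returns (u, v, w) volume coords, or None if outside the frustum slab."""
    if not (near <= vz <= far):
        return None
    u = 0.5 + 0.5 * (vx / (vz * tan_half_fov_x))  # projected x -> [0,1]
    v = 0.5 + 0.5 * (vy / (vz * tan_half_fov_y))  # projected y -> [0,1]
    w = (vz - near) / (far - near)                # linear depth slice
    if 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0:
        return (u, v, w)
    return None

uvw = view_pos_to_volume_uvw(0.0, 0.0, 5.0, 1.0, 1.0, 1.0, 9.0)
```

A particle PS would compute these coordinates and do a single 3D texture fetch for the cheap path, or three fetches for the directional one.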

#5092719 ComputeShader Particle System DispatchIndirect

Posted by ATEFred on 09 September 2013 - 08:01 AM


if you plan to expand each vert into quads, it would be 10, 1, 0, 0.


I guess that is a typo and you mean 1,10,0,0?


But still it does not explain why it does not work if I set the initial value to 1,0,0,0 and use CopyStructureCount to update the count...


Hm... alright I am going to install vs 2012. (Since I also was not able to figure the buffer results out via NSight)


Could work both ways: if you wanted 10 verts which you then expand in the GS, it would be 10, 1, 0, 0 (10 verts -> 10 quads, 1 instance of the 10). Or you can use HW instancing. I have noticed performance differences when generating hundreds of thousands of quads, instancing being slightly slower than a single instance with loads of expanded verts.

If you set an initial value of 1,0,0,0, you are specifying 1 vertex and 0 instances. So as long as you update the second parameter with your structure count, it should work.

#5092705 ComputeShader Particle System DispatchIndirect

Posted by ATEFred on 09 September 2013 - 06:58 AM

Should be vertex count, instance count, 0, 0 (start vertex location and start instance location).

So let's imagine you have a quad and you want to instance it 10 times: your indirect args buffer should be 4, 10, 0, 0.

if you plan to expand each vert into quads, it would be 10, 1, 0, 0.

(At least that's what I remember off the top of my head, I can check tonight when I get home).

Inspecting the contents of the buffer is more annoying than it should be; NSight refuses to show it to me. However, the VS2012 graphics debugger displays it no problem (finally something it does well :) ), or you can copy it to a staging buffer and display it in your app.
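The argument layout above is just four 32-bit uints, which can be sketched with struct packing. The helper names are made up; patching byte offset 4 mimics what CopyStructureCount does when you point it at the instance count:

```python
# DrawInstancedIndirect args: four little-endian 32-bit uints in order
# (VertexCountPerInstance, InstanceCount, StartVertexLocation,
#  StartInstanceLocation).
import struct

def make_args(vert_count, inst_count, start_vert=0, start_inst=0):
    return struct.pack("<4I", vert_count, inst_count, start_vert, start_inst)

def patch_instance_count(buf, count):
    # InstanceCount is the second uint, i.e. byte offset 4 in the buffer
    return buf[:4] + struct.pack("<I", count) + buf[8:]

quad_x10 = make_args(4, 10)                        # 4-vert quad, 10 instances
patched  = patch_instance_count(make_args(1, 0), 640)  # count filled in later
```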

#5083488 Multiple SV_POSITION's for different RT

Posted by ATEFred on 06 August 2013 - 03:06 AM

You can do that through geometry shaders, but not through the VS afaik. (Expect some performance hit from using the GS to instance your geo n times, once for each output.)

(The GS allows you to set one SV_Position per triangle stream output.)

#5082493 ComputeShader Particle System DispatchIndirect

Posted by ATEFred on 02 August 2013 - 08:40 AM

DrawInstancedIndirect will do what you want. Copy the SB size into an indirect args buffer and pass that to the indirect draw method. (I mean copying the size into the specific location in the indirect args buffer for the argument you want to control: number of verts vs number of instances, etc.)

#5076358 C++ DX API, help me get it?

Posted by ATEFred on 09 July 2013 - 09:21 AM

None of this relates to performance at all


Exceptions have a noticeable performance impact, which is one of the reasons many game projects do not use them. From that perspective it makes sense for the API not to rely on them (in addition to the C legacy).


I don't know of any wrappers like the one you are describing, but you could take just the API abstraction layer of any openly available engine and start with that instead of interfacing directly with D3D, if you really wanted to. Though I think it would be better to use the API natively if you want to learn how it works.

#5074991 Tree and other free-standing mesh Shadows

Posted by ATEFred on 03 July 2013 - 03:59 AM

Depending on the proxy / simplified meshes you use for the shadow pass, you might run into self-shadowing issues.

#5072679 GPU particles

Posted by ATEFred on 25 June 2013 - 02:09 AM


Quite good reading about gpu particle systems using just dx9.

Just creating a lot of particles is fun but not that useful. To make particles look good and fit the scene, you usually want the shading/shadowing to match the rest of the scene. You also want to simulate some collisions and maybe even particle-to-particle forces. Some sorting is also a must for good-looking alpha blending.

For particle lighting, I generate a volume texture mapped to the camera projection with lighting information in each voxel (I use the HL2 basis), which I can then sample with a few texture fetches in my particle PS to resolve reasonably accurate lighting. When I have too much overdraw / too many particles, I revert to sampling a single approximated lighting value instead of resolving directional lighting. Works well, and is really quite fast once you pay the cost to generate the map (atm it takes 2-3 ms in a CS with 500 lights visible). Maybe it would be possible to move the generation to the CPU while the gbuffer and whatnot are being rendered?
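For reference, evaluating the HL2 (radiosity normal mapping) basis mentioned above looks roughly like this. This is a generic reconstruction of the published technique, not the exact shader from the post: three fixed tangent-space directions each store a lighting colour, blended by the squared, clamped dot product with the normal:

```python
# Evaluate lighting stored in the HL2 basis: three fixed tangent-space
# directions, blended per-pixel by squared clamped dot with the normal.
import math

S6, S2, S3 = 1 / math.sqrt(6), 1 / math.sqrt(2), 1 / math.sqrt(3)
HL2_BASIS = [(-S6, -S2, S3), (-S6, S2, S3), (math.sqrt(2 / 3), 0.0, S3)]

def eval_hl2(normal, colours):
    """normal: unit tangent-space normal; colours: 3 RGB tuples, one per basis dir."""
    weights = [max(0.0, sum(n * b for n, b in zip(normal, basis))) ** 2
               for basis in HL2_BASIS]
    total = sum(weights) or 1.0
    return tuple(sum(w * c[ch] for w, c in zip(weights, colours)) / total
                 for ch in range(3))

# a normal straight along +z (towards the camera in tangent space) weighs
# all three basis directions equally
lit = eval_hl2((0.0, 0.0, 1.0), [(1, 0, 0), (0, 1, 0), (0, 0, 1)])
```

This is why it's cheap for camera-facing alpha geo: the weights collapse to constants for a fixed normal, leaving just the three colour fetches.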