turanszkij

R&D: Tile-based particle rendering


Hi,

Tile-based renderers are quite popular nowadays: tiled deferred, Forward+, and clustered renderers. There is a presentation from AMD about GPU-based particle systems, and what particularly interests me is the tile-based rendering part. The basic idea is to leave the rasterization pipeline when rendering billboards and do it in a compute shader instead, much like Forward+: determine tile frustums, cull particles, sort them front to back, then blend them until the accumulated alpha value reaches 1. The performance results at the end of the slides seem promising. Has anyone implemented this? Was it a success, and is it worth doing? The front-to-back rendering is the most interesting part in my opinion, because overdraw can be eliminated for alpha blending.

The demo is sadly no longer available.
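For concreteness, here is a minimal CPU sketch of the front-to-back compositing with early-out described above (plain Python; the particle tuple layout and the 0.999 cutoff are illustrative assumptions, not taken from the slides):

```python
# Front-to-back "under" compositing with an early-out once the pixel
# is effectively opaque. Particles are (color, alpha) tuples sorted
# nearest-first; color is an RGB triple of floats.

def composite_front_to_back(particles, alpha_cutoff=0.999):
    """particles: list of ((r, g, b), alpha), sorted nearest-first."""
    out_color = [0.0, 0.0, 0.0]
    out_alpha = 0.0
    for color, alpha in particles:
        # "under" operator: new contribution is attenuated by what is
        # already accumulated in front of it
        weight = (1.0 - out_alpha) * alpha
        for c in range(3):
            out_color[c] += weight * color[c]
        out_alpha += weight
        if out_alpha >= alpha_cutoff:
            break  # early-out: everything behind this point is occluded
    return out_color, out_alpha
```

Because a fully opaque particle drives the accumulated alpha to 1 immediately, everything behind it is skipped, which is the overdraw saving that back-to-front blending cannot get.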


It is doable; we do it in our game, but we render back to front (no early out), and we also interleave the particles with sorted fragments from traditional geometry and unsupported particle types. The bandwidth savings plus well-optimized shaders make it a good gain (plus order-independent transparency :) ).

The challenge is DX11 on PC without bindless: you have to deal with texture atlases, and drivers have a hard time optimizing such a complex shader (judging from the DXBC, compared to consoles where we have a dedicated shader compiler). On console and DX12/Vulkan you can just provide an array of texture descriptors, so it is easier :) For practical reasons and culling storage, you may want to limit the number of particles to a few thousand; that was fine for us, but games built on heavy effects would mourn.


3 hours ago, Infinisearch said:

I have the demo if you want it, and it is available on Github here: https://github.com/GPUOpen-LibrariesAndSDKs/GPUParticles11/

edit - I attached the demo.

GPUParticles11_v1.0.zip

Thanks for that, I will check it out. I've just started implementing this myself, anyway. :)

1 hour ago, galop1n said:

For practical reasons and culling storage, you may want to limit the number of particles to a few thousand; that was fine for us, but games built on heavy effects would mourn.

Hm, that's a bit disappointing. I know most games probably don't use more than a few thousand particles anyway, but I thought this would help with sheer numbers as well, apart from the overdraw optimization.


I have managed to implement this technique on a console (PS4). I made a tech demo which renders particles with high overdraw and heavy shaders (per-pixel lighting). It can also render particles spread out in the distance with little to no overdraw. I am using an additional coarse culling step for the tile-based approach, like in the AMD demo. The coarse culling bins particles into large screen-space tiles (240x135 pixels); the fine culling then culls particles for 32x32-pixel tiles and renders them in the same shader.

With 100,000 particles filling the screen and heavy overdraw, the tile-based technique is a clear win: it stays under 30 ms, while the rasterization-based technique renders them in about 70 ms.

With 100,000 small particles on screen and little overdraw, rasterization performs clearly better, going below 10 ms easily; the tile-based approach takes around 15-20 ms.

With 1,000,000 particles and heavy overdraw, the tile-based approach cannot keep up: it runs out of LDS to store the per-tile particle lists, which results in flickering. Its performance is slow, and rasterization is much slower still, but rasterization at least renders without artifacts.

With 1,000,000 particles and little overdraw, the tile-based approach suffers from culling cost, while rasterization easily reaches 60 FPS.

It seems to me it can only be used in specific scenarios, with a small number of particles and heavy overdraw. However, I imagine most games do not use millions of particles, so it might be worth implementing.

3 minutes ago, turanszkij said:

The coarse culling bins particles into large screen-space tiles (240x135 pixels); the fine culling then culls particles for 32x32-pixel tiles and renders them in the same shader.

What's the point of coarse culling in this case... if you're going to tile, why not go straight to the fine tiles? Wouldn't there be fewer memory accesses that way? Also, since it seems you're using LDS for the tile particle lists, why not increase the fine tile size to 64x64, since the L2 should be big enough to keep the whole tile cached? I'm most likely missing something... I don't remember the presentation that well.

14 minutes ago, turanszkij said:

With 100,000 particles filling the screen and heavy overdraw, the tile-based technique is a clear win: it stays under 30 ms, while the rasterization-based technique renders them in about 70 ms.

With 100,000 small particles on screen and little overdraw, rasterization performs clearly better, going below 10 ms easily; the tile-based approach takes around 15-20 ms.

With 1,000,000 particles and heavy overdraw, the tile-based approach cannot keep up: it runs out of LDS to store the per-tile particle lists, which results in flickering. Its performance is slow, and rasterization is much slower still, but rasterization at least renders without artifacts.

With 1,000,000 particles and little overdraw, the tile-based approach suffers from culling cost, while rasterization easily reaches 60 FPS.

Aren't smoke effects typically medium to high overdraw? If so, it would most likely be a win.

1 minute ago, Infinisearch said:

What's the point of coarse culling in this case... if you're going to tile, why not go straight to the fine tiles? Wouldn't there be fewer memory accesses that way? Also, since it seems you're using LDS for the tile particle lists, why not increase the fine tile size to 64x64, since the L2 should be big enough to keep the whole tile cached? I'm most likely missing something... I don't remember the presentation that well.

Coarse culling does result in more memory accesses, but it can lighten the load on the fine culling step a lot, because you no longer fine-cull a million particles per tile, but only 10,000 for instance, or whatever number landed in the coarse tile. This generally improves speed a lot, at the cost of an additional indirection. The coarse culling also has a better thread distribution: you dispatch only over the number of particles, and each particle adds itself to the relevant tiles, as opposed to the fine culling, where you dispatch over tiles and each tile iterates through the particles.
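A minimal CPU sketch of that two-level scheme (plain Python; representing particles as screen-space circles, and the exact tile sizes, are illustrative assumptions):

```python
# Coarse pass: "dispatched per particle" -- each particle scatters its own
# index into every large tile its bounding circle touches.
def coarse_cull(particles, screen, coarse_tile):
    tw, th = coarse_tile
    cols = (screen[0] + tw - 1) // tw
    rows = (screen[1] + th - 1) // th
    bins = [[] for _ in range(cols * rows)]
    for i, (x, y, r) in enumerate(particles):  # screen-space center + radius
        x0 = max(0, int((x - r) // tw)); x1 = min(cols - 1, int((x + r) // tw))
        y0 = max(0, int((y - r) // th)); y1 = min(rows - 1, int((y + r) // th))
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[ty * cols + tx].append(i)  # particle adds itself to tiles
    return bins, cols

# Fine pass: "dispatched per tile" -- each small tile only has to test the
# particles that survived the coarse bin it belongs to, not all of them.
def fine_cull(particles, bin_indices, tile_rect):
    x0, y0, x1, y1 = tile_rect
    visible = []
    for i in bin_indices:
        x, y, r = particles[i]
        if x + r >= x0 and x - r <= x1 and y + r >= y0 and y - r <= y1:
            visible.append(i)
    return visible
```

The indirection cost is the extra bin write/read, but the fine pass now iterates a short per-bin list instead of the full particle buffer.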

A 64x64 tile size would require even more LDS storage, not less, and I can't even dispatch a thread group that big. If I cut back to 16x16 tiles, the LDS can be better utilized because fewer particles will be visible in a tile, but the culling becomes less parallel: with a 32x32 tile, each thread culls one particle at a time until all are culled, meaning 1024 particles are culled in parallel, while with a 16x16 tile only 256 particles are culled in parallel, which is slower.
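The arithmetic behind that tradeoff, written out (the 32 KiB LDS figure and 4-byte particle indices are assumed typical values; the 1024-thread group limit is D3D11's):

```python
# Why tile size is bounded by the thread-group limit, and how much
# per-tile particle list fits in LDS under the assumptions above.

MAX_GROUP_THREADS = 1024   # D3D11 compute shader thread group limit
LDS_BYTES = 32 * 1024      # assumed LDS available to one thread group
INDEX_BYTES = 4            # one stored particle index

def threads_per_tile(w, h):
    # one thread per pixel in the tile
    return w * h

def lds_index_capacity():
    # upper bound on particle indices one tile's LDS list can hold
    return LDS_BYTES // INDEX_BYTES
```

A 64x64 tile would need 4096 threads, well over the 1024 limit; 32x32 hits the limit exactly, and 16x16 fits easily but only culls 256 particles per iteration.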


By the way, I am also doing decal rendering in a Forward+ renderer, and decals also benefit from sorting top to bottom while blending them bottom to top and skipping the bottom ones once the accumulated alpha has already reached one. :)

17 hours ago, turanszkij said:

Coarse culling does result in more memory accesses, but it can lighten the load on the fine culling step a lot, because you no longer fine-cull a million particles per tile, but only 10,000 for instance, or whatever number landed in the coarse tile. This generally improves speed a lot, at the cost of an additional indirection. The coarse culling also has a better thread distribution: you dispatch only over the number of particles, and each particle adds itself to the relevant tiles, as opposed to the fine culling, where you dispatch over tiles and each tile iterates through the particles.

I guess I'll have to read through the presentation again; it's been a while.

17 hours ago, turanszkij said:

A 64x64 tile size would require even more LDS storage, not less, and I can't even dispatch a thread group that big. If I cut back to 16x16 tiles, the LDS can be better utilized because fewer particles will be visible in a tile, but the culling becomes less parallel: with a 32x32 tile, each thread culls one particle at a time until all are culled, meaning 1024 particles are culled in parallel, while with a 16x16 tile only 256 particles are culled in parallel, which is slower.

Oh, you use LDS to do blending as well? I thought you were going through the L1 and L2; that's why I suggested a larger tile size. But thinking about it more, like you said, more particles might be visible with bigger tiles... and you're using LDS for the particle list. As for your 256 vs 1024 particles being culled in parallel, that doesn't seem right to me. You're using LDS, so wouldn't the compute shader's execution be limited to one CU (on AMD hardware)? If so, it would be limited to 64 threads per clock, and 256 threads in lock step (since each 16-wide SIMD executes with a cadence of 4 clocks). @MJP @Hodgman Could you clear this up for me: if a compute shader uses LDS, is its execution limited to one CU?


Using LDS doesn't limit a compute shader to a single CU, no. The requirement is that a single thread group run all its waves on a single CU, so that they all have access to the same bit of LDS.

A 256-thread thread group is 4 waves and would typically be scheduled with one wave per SIMD. A 1024-thread thread group would have 4 waves running on each SIMD (all on the same CU). You're only wasting / not using CUs if you have fewer thread groups than you have CUs. Since even the biggest AMD parts only have 64 CUs, you'd have to be running at an extremely low resolution to be issuing fewer than (64 * 1024) threads :).
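That wave arithmetic, written out as a sketch (64-wide waves and 4 SIMDs per CU, the usual GCN figures):

```python
# How a thread group decomposes into waves on GCN-style hardware.
# The whole group stays on one CU; its waves spread over that CU's SIMDs.

WAVE_SIZE = 64     # threads per wave on GCN
SIMDS_PER_CU = 4   # SIMDs in one compute unit

def waves_in_group(threads):
    # round up: partial waves still occupy a full wave slot
    return (threads + WAVE_SIZE - 1) // WAVE_SIZE

def waves_per_simd(threads):
    # even distribution of the group's waves across the CU's SIMDs
    return waves_in_group(threads) // SIMDS_PER_CU
```

So 256 threads gives 4 waves (one per SIMD), and 1024 threads gives 16 waves (four per SIMD), matching the numbers above.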

15 minutes ago, Infinisearch said:

My mistake, I meant a compute shader invocation.

What do you mean by the term 'invocation'? To me an invocation is a single thread of execution, meaning a 1080p quad would "invoke" the pixel shader ~2M times. A single thread, of course, will run on a single CU for its lifetime.

6 minutes ago, ajmiles said:

What do you mean by the term 'invocation'? To me an invocation is a single thread of execution, meaning a 1080p Quad would "invoke" the pixel shader ~2M times. A single thread of course has to be run on a single CU for its lifetime.

In this context I meant a dispatch.

1 minute ago, Infinisearch said:

@ajmiles BTW, if a thread group doesn't use LDS, can it be spread across multiple CUs?

It won't be, no. The hardware doesn't seem to launch a replacement thread group until all waves in the thread group have retired, so I tend to steer clear of thread groups > 1 wave unless I'm using LDS.

3 minutes ago, ajmiles said:

It won't be, no.

Might that behavior change in the future? Or is that entirely in AMD's hands? Is NVIDIA any different?

5 minutes ago, ajmiles said:

The hardware doesn't seem to launch a replacement thread group until all waves in the thread group have retired, so I tend to steer clear of thread groups > 1 wave unless I'm using LDS.

I remember reading somewhere that AMD seems to like a thread group size of at least 256; am I misremembering? Doesn't this have something to do with hiding memory latency? I'm not that experienced with compute shaders yet, and my memory of what I did learn isn't that great.

Just now, Infinisearch said:

Might that behavior change in the future? Or is that entirely in AMD's hands? Is NVIDIA any different?

I remember reading somewhere that AMD seems to like a thread group size of at least 256; am I misremembering? Doesn't this have something to do with hiding memory latency? I'm not that experienced with compute shaders yet, and my memory of what I did learn isn't that great.

It seems reasonable to imagine it could change in the future (or may even have already changed?), but that's up to the hardware; it's not a D3D/HLSL thing. No idea what the behaviour is on other IHVs.

I don't think the size of a thread group has much bearing on hiding memory latency per se. Obviously you want to make sure the hardware has enough waves to switch between; 4 per SIMD (16 per CU) is a reasonable target to aim for. But whether that's 16 waves from 16 different thread groups or 16 waves from a single thread group doesn't matter too much, so long as enough of them are making forward progress and executing instructions while the others wait on memory.

You don't want to be writing a 1024-thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers; that's not going to help you hide latency at all. There's nothing inherently wrong with larger thread groups; you just need to be aware of how the waves get scheduled and ensure you don't have too many waves sitting around doing nothing.

7 minutes ago, ajmiles said:

You don't want to be writing a 1024-thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers; that's not going to help you hide latency at all.

What do you mean by barriers here?

5 minutes ago, Infinisearch said:

What do you mean by barriers here?

GroupMemoryBarrierWithGroupSync() is the one you'll see 99% of the time. It blocks all threads in the thread group from executing any further until every thread has finished accessing LDS and reached that instruction. It's essentially a cross-wave synchronisation point.

1. All threads write to LDS
2. GroupMemoryBarrierWithGroupSync()
3. All threads read LDS.

Would be a typical pattern.
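A CPU analogue of that pattern, using Python's threading.Barrier in place of GroupMemoryBarrierWithGroupSync() (the group size and the squared-index workload are made up for illustration):

```python
import threading

def run_group(group_size=8):
    """Write-shared / barrier / read-shared, mimicking the LDS pattern."""
    shared = [0] * group_size                # stands in for LDS
    barrier = threading.Barrier(group_size)  # all threads must arrive here
    results = [0] * group_size

    def thread_main(tid):
        shared[tid] = tid * tid              # 1. all threads write "LDS"
        barrier.wait()                       # 2. group-wide sync point
        results[tid] = sum(shared)           # 3. all threads read "LDS"

    threads = [threading.Thread(target=thread_main, args=(t,))
               for t in range(group_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the barrier, a thread could read `shared` before its neighbours have written their slots; the sync guarantees every read in step 3 sees all the writes from step 1.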


If you use a thread-group size of 1, I thought AMD hardware will run your code on its 64-wide vector instruction set with 63 lanes masked out / wasted?

(And likewise on NVIDIA with 31 masked and Intel with 7 masked.)

And yeah, running 128 threads on AMD instead of 64 is the same as manually unrolling your code 2x, which in some situations can help reduce observed latency.

[edit] Ahhh, I misread! I thought groups of 1 thread were mentioned, but it was groups of 1 wave. :o Oops.

