R&D Tile based particle rendering


Hi,

Tile based renderers are quite popular nowadays, like tiled deferred, Forward+ and clustered renderers. There is a presentation about GPU based particle systems from AMD. What particularly interests me is the tile based rendering part. The basic idea is to leave the rasterization pipeline when rendering billboards and do it in a compute shader instead, much like Forward+: you determine tile frustums, cull particles, sort them front to back, then blend them until the accumulated alpha value reaches 1. The performance results at the end of the slides seem promising. Has anyone ever implemented this? Was it a success, is it worth doing? The front to back rendering is the most interesting part in my opinion, because overdraw can be eliminated for alpha blending.
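For reference, the per-tile resolve loop described above could be sketched roughly like this in HLSL. This is not the AMD implementation; all buffer names and layouts here are made up for illustration, and it assumes a sorted per-tile particle list has already been built by earlier culling/sorting passes:

```hlsl
struct Particle
{
    float3 position; // view-space center
    float radius;
    float4 color;    // straight alpha
};

cbuffer TileCB : register(b0)
{
    uint numTilesX;
};

StructuredBuffer<Particle> particles : register(t0);
StructuredBuffer<uint> tileOffsets : register(t1); // per-tile start into tileIndices
StructuredBuffer<uint> tileCounts : register(t2);  // per-tile particle count
StructuredBuffer<uint> tileIndices : register(t3); // sorted front to back per tile
RWTexture2D<float4> output : register(u0);

[numthreads(32, 32, 1)]
void ResolveTile(uint3 DTid : SV_DispatchThreadID, uint3 Gid : SV_GroupID)
{
    uint tile = Gid.y * numTilesX + Gid.x;
    uint offset = tileOffsets[tile];
    uint count = tileCounts[tile];

    float4 result = 0; // premultiplied-alpha accumulator

    for (uint i = 0; i < count; ++i)
    {
        Particle p = particles[tileIndices[offset + i]];
        float alpha = p.color.a; // a real shader would also test billboard coverage here

        // Front-to-back "under" operator:
        result.rgb += (1 - result.a) * alpha * p.color.rgb;
        result.a   += (1 - result.a) * alpha;

        // The early out that makes front-to-back ordering worthwhile:
        // once accumulated alpha saturates, later particles are invisible.
        if (result.a >= 0.999)
            break;
    }

    output[DTid.xy] = result;
}
```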

The demo is sadly no longer available.


It is doable, we do it in our game, but we go back to front (no early out) and we also interleave the particles with sorted fragments from traditional geometry or unsupported particle types. The bandwidth saving plus well written shader optimizations make it a good gain (plus order independent transparency :) )

The challenge is DX11 PC without bindless: you have to deal with texture atlases, and drivers have a hard time optimising such a complex shader (judging from the DXBC, compared to consoles where we have a dedicated shader compiler). On console and DX12/Vulkan you can just provide an array of texture descriptors, so it is easier :) For practical reasons, and because of the storage needed for culling, you may want to limit the number of particles to a few thousand; that was fine for us, but games based on heavier effects would have mourned.

3 hours ago, Infinisearch said:

I have the demo if you want it, and it is available on GitHub here: https://github.com/GPUOpen-LibrariesAndSDKs/GPUParticles11/

edit - I attached the demo.

GPUParticles11_v1.0.zip

Thanks for that, I will check it out. I've just started implementing this myself, anyway. :)

1 hour ago, galop1n said:

For practical reasons, and because of the storage needed for culling, you may want to limit the number of particles to a few thousand; that was fine for us, but games based on heavier effects would have mourned.

Hm, that's a bit disappointing. I know that most games probably don't use more than a few thousand particles anyway, but I thought that this would help with sheer numbers as well, apart from the overdraw optimization.


I have managed to implement this technique on a console (PS4). I made a tech demo which renders particles with high overdraw and heavy shaders (per pixel lighting). It can also render particles spread out in the distance with little to no overdraw. I am using an additional coarse culling step before the tile based approach, like in the AMD demo. The coarse culling culls particles against large screen space tiles (240x135 pixels). The fine culling culls particles against 32x32 pixel tiles and renders them in the same shader.

With 100,000 particles filling the screen and heavy overdraw, the tile based technique is a clear win: it manages to remain under 30 ms, while the rasterization based technique renders them in about 70 ms.

With 100,000 small particles on screen with little overdraw, the rasterization performs clearly better, going below 10 ms easily; the tile based approach is around 15-20 ms this time.

With 1,000,000 particles and heavy overdraw, the tile based approach cannot keep up, because it runs out of LDS to store the per tile particle lists, which results in flickering. Its performance is slow, but the rasterization is much slower; however, rasterization renders without artifacts.

With 1,000,000 particles and little overdraw, the tile based approach suffers from culling performance, while the rasterization easily does 60 FPS.

It seems to me it can only be used for specific scenarios, with a small number of particles and heavy overdraw. However, I imagine most games do not use millions of particles, so it might be worth implementing.

3 minutes ago, turanszkij said:

The coarse culling culls particles for large screen space tiles (240x135 pixels). The fine culling culls particles for 32x32 pixel tiles and renders them in the same shader.

What's the point of coarse culling in this case... if you're gonna tile, why not just go straight for the fine tiles? Wouldn't there be fewer memory accesses this way? Also, since it seems you're using LDS for the tile particle lists, why not increase the fine tile size to 64x64, since the L2 should be big enough to keep the whole tile cached? I'm most likely missing something... I don't really remember the presentation that well.

14 minutes ago, turanszkij said:

With 100,000 particles filling the screen and heavy overdraw, the tile based technique is a clear win: it manages to remain under 30 ms, while the rasterization based technique renders them in about 70 ms.

With 100,000 small particles on screen with little overdraw, the rasterization performs clearly better, going below 10 ms easily; the tile based approach is around 15-20 ms this time.

With 1,000,000 particles and heavy overdraw, the tile based approach cannot keep up, because it runs out of LDS to store the per tile particle lists, which results in flickering. Its performance is slow, but the rasterization is much slower; however, rasterization renders without artifacts.

With 1,000,000 particles and little overdraw, the tile based approach suffers from culling performance, while the rasterization easily does 60 FPS.

Aren't smoke effects typically medium to high overdraw? If so, it would most likely be a win.

1 minute ago, Infinisearch said:

What's the point of coarse culling in this case... if you're gonna tile, why not just go straight for the fine tiles? Wouldn't there be fewer memory accesses this way? Also, since it seems you're using LDS for the tile particle lists, why not increase the fine tile size to 64x64, since the L2 should be big enough to keep the whole tile cached? I'm most likely missing something... I don't really remember the presentation that well.

Coarse culling does result in more memory accesses, but it can lighten the load on the fine culling step a lot, because now you don't do fine culling for a million particles per tile, but just 10,000 for instance, or whatever number is in the coarse tile. This generally improves speed a lot, though this way you have an additional indirection, of course. The coarse culling also has a better thread distribution: you only dispatch for the number of particles, and each particle adds itself to the relevant tiles, as opposed to the fine culling, where you dispatch for tiles, and each tile iterates through the particles and adds them.
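The particle-parallel coarse culling described above could look something like this. This is only a sketch (not the actual PS4 or AMD code); the buffer names, the fixed-size per-tile buckets and the `GetCoarseTileRange` helper are all hypothetical:

```hlsl
struct Particle { float3 position; float radius; float4 color; };

cbuffer CullCB : register(b0)
{
    uint numCoarseTilesX;
    uint maxParticlesPerCoarseTile;
    uint particleCount;
};

StructuredBuffer<Particle> particles : register(t0);
RWStructuredBuffer<uint> coarseTileCounts : register(u0); // one counter per coarse tile
RWStructuredBuffer<uint> coarseTileLists : register(u1);  // fixed-size bucket per tile

// Hypothetical helper: which coarse tiles does this particle's billboard touch?
// A real version projects the bounds to screen space and divides by the tile size.
uint4 GetCoarseTileRange(Particle p)
{
    return uint4(0, 0, 0, 0); // (minX, minY, maxX, maxY); stubbed for brevity
}

// One thread per particle: each particle scatters its own index into the
// list of every coarse tile it overlaps.
[numthreads(256, 1, 1)]
void CoarseCull(uint3 DTid : SV_DispatchThreadID)
{
    if (DTid.x >= particleCount)
        return;

    Particle p = particles[DTid.x];
    uint4 range = GetCoarseTileRange(p);

    for (uint y = range.y; y <= range.w; ++y)
    for (uint x = range.x; x <= range.z; ++x)
    {
        uint tile = y * numCoarseTilesX + x;
        uint slot;
        InterlockedAdd(coarseTileCounts[tile], 1, slot);
        if (slot < maxParticlesPerCoarseTile)
            coarseTileLists[tile * maxParticlesPerCoarseTile + slot] = DTid.x;
    }
}
```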

A 64x64 tile size would require even more LDS storage, not less. And I can't even dispatch a threadgroup that big. If I cut back to 16x16 tiles though, the LDS can be better utilized, because fewer particles will be visible in a tile, but the culling becomes less parallel: with a 32x32 tile, each thread culls a particle until all are culled, meaning 1024 particles are culled in parallel, while with a 16x16 tile only 256 particles are culled in parallel, which is slower.
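To make the parallelism point concrete, the cooperative fine culling loop might be sketched like this (illustrative only; names and the `IntersectsFineTile` test are made up, and the per-coarse-tile lookup is simplified). Each of the 32x32 = 1024 threads tests one particle from the coarse list per iteration and appends survivors to a list in LDS:

```hlsl
#define TILE_SIZE 32
#define MAX_TILE_PARTICLES 1024

groupshared uint lds_count;
groupshared uint lds_particles[MAX_TILE_PARTICLES];

StructuredBuffer<uint> coarseTileList : register(t0);  // survivors of coarse culling
StructuredBuffer<uint> coarseTileCount : register(t1); // their count (lookup simplified)

// Hypothetical sphere-vs-tile-frustum test; stubbed for brevity.
bool IntersectsFineTile(uint particleIndex, uint2 tileCoord)
{
    return true;
}

[numthreads(TILE_SIZE, TILE_SIZE, 1)]
void FineCullAndRender(uint3 Gid : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
    if (groupIndex == 0)
        lds_count = 0;
    GroupMemoryBarrierWithGroupSync();

    uint coarseCount = coarseTileCount[0];

    // 1024 particles tested per iteration here; a 16x16 group would test 256.
    for (uint i = groupIndex; i < coarseCount; i += TILE_SIZE * TILE_SIZE)
    {
        uint particleIndex = coarseTileList[i];
        if (IntersectsFineTile(particleIndex, Gid.xy))
        {
            uint slot;
            InterlockedAdd(lds_count, 1, slot);
            if (slot < MAX_TILE_PARTICLES)
                lds_particles[slot] = particleIndex;
        }
    }
    GroupMemoryBarrierWithGroupSync();

    // ...then sort lds_particles front to back and blend per pixel,
    // all within this same shader.
}
```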

17 hours ago, turanszkij said:

Coarse culling does result in more memory accesses, but it can lighten the load on the fine culling step a lot, because now you don't do fine culling for a million particles per tile, but just 10,000 for instance, or whatever number is in the coarse tile. This generally improves speed a lot, though this way you have an additional indirection, of course. The coarse culling also has a better thread distribution: you only dispatch for the number of particles, and each particle adds itself to the relevant tiles, as opposed to the fine culling, where you dispatch for tiles, and each tile iterates through the particles and adds them.

I guess I'll have to read through the presentation again, it's been a while.

17 hours ago, turanszkij said:

A 64x64 tile size would require even more LDS storage, not less. And I can't even dispatch a threadgroup that big. If I cut back to 16x16 tiles though, the LDS can be better utilized, because fewer particles will be visible in a tile, but the culling becomes less parallel: with a 32x32 tile, each thread culls a particle until all are culled, meaning 1024 particles are culled in parallel, while with a 16x16 tile only 256 particles are culled in parallel, which is slower.

Oh, you use LDS to do blending as well? I thought you were going through the L1 and L2, that's why I suggested a larger tile size. But upon thinking about it more, like you said, more particles might be visible with bigger tiles... and you're using LDS for the particle list. As for your 256 vs 1024 particles being culled in parallel, that doesn't seem right to me. You're using LDS, so wouldn't the compute shader's execution be limited to one CU (AMD hardware)? If so, it would be limited to 64 threads per clock, and 256 threads in lock step (since each 16 wide SIMD executes with a cadence of 4 clocks). @MJP @Hodgman Could you clear this up for me: if a compute shader uses LDS, is its execution limited to one CU?


Using LDS doesn't limit a compute shader to a single CU, no. The requirement is that a single thread group run all its waves on a single CU, so that they all have access to the same bit of LDS.

A 256 thread thread-group is 4 waves, and would typically be scheduled with one wave per SIMD. A 1024 thread thread-group would have 4 waves running on each SIMD (all on the same CU). You're only wasting / not using CUs if you have fewer thread groups than you have CUs. Since even the biggest AMD parts only have 64 CUs, you'd have to be running at an extremely low resolution to be issuing fewer than (64 * 1024) threads :).

15 minutes ago, Infinisearch said:

My mistake I meant compute shader invocation.

What do you mean by the term 'invocation'? To me an invocation is a single thread of execution, meaning a 1080p Quad would "invoke" the pixel shader ~2M times. A single thread of course will be run on a single CU for its lifetime.

6 minutes ago, ajmiles said:

What do you mean by the term 'invocation'? To me an invocation is a single thread of execution, meaning a 1080p Quad would "invoke" the pixel shader ~2M times. A single thread of course has to be run on a single CU for its lifetime.

In this context I meant a dispatch.

1 minute ago, Infinisearch said:

@ajmiles  BTW if a thread group doesn't use LDS can it be spread across multiple CU's?

It won't be, no. The hardware doesn't seem to launch a replacement thread group until all waves in the thread group have retired, so I tend to steer clear of thread groups > 1 wave unless I'm using LDS.

3 minutes ago, ajmiles said:

It won't be, no.

Might that behavior change in the future? Or is that entirely in AMD's hands? Is Nvidia any different?

5 minutes ago, ajmiles said:

The hardware doesn't seem to launch a replacement thread group until all waves in the thread group have retired, so I tend to steer clear of thread groups > 1 wave unless I'm using LDS.

I remember reading somewhere that AMD seems to like a thread group size of at least 256, am I misremembering? Doesn't this have something to do with hiding memory latency? I'm not that experienced with compute shaders yet, and my memory of what I did learn isn't that great.

Just now, Infinisearch said:

Might that behavior change in the future? Or is that entirely in AMD's hands? Is Nvidia any different?

I remember reading somewhere that AMD seems to like a thread group size of at least 256, am I misremembering? Doesn't this have something to do with hiding memory latency? I'm not that experienced with compute shaders yet, and my memory of what I did learn isn't that great.

It seems reasonable to imagine it /could/ change in the future (or may even have already changed?), but that's up to the hardware; it's not a D3D/HLSL thing. No idea what the behaviour is on other IHVs.

I don't think the size of a thread group has much bearing on being able to hide memory latency per se. Obviously you want to make sure the hardware has enough waves to be able to switch between them; 4 per SIMD (16 per CU) is a reasonable target to aim for. But whether that's 16 waves from 16 different thread groups or 16 waves from a single thread group doesn't matter too much, so long as enough of them are making forward progress and executing instructions while others wait on memory.

You don't want to be writing a 1024 thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers; that's not going to help you hide latency at all. There's nothing inherently wrong with larger thread groups, you just need to be aware of how the waves get scheduled and ensure that you don't have too many waves sitting around doing nothing.

7 minutes ago, ajmiles said:

You don't want to be writing a 1024 thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers, that's not going to help you hide latency at all.

What do you mean by barriers here?

5 minutes ago, Infinisearch said:

What do you mean by barriers here?

GroupMemoryBarrierWithGroupSync() is the one you'll see 99% of the time. It blocks all threads in the thread group from executing any further until all threads have finished accessing LDS and hit that instruction. It's essentially a cross-wave synchronisation point.

1. All threads write to LDS
2. GroupMemoryBarrierWithGroupSync()
3. All threads read LDS.

Would be a typical pattern.
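A minimal HLSL example following that exact write / sync / read pattern (this particular reduction is just for illustration, not something from the thread):

```hlsl
groupshared float lds_values[256];

StructuredBuffer<float> input : register(t0);
RWStructuredBuffer<float> groupSums : register(u0);

// Sums 256 input values per thread group using LDS.
[numthreads(256, 1, 1)]
void ReduceSum(uint3 DTid : SV_DispatchThreadID,
               uint groupIndex : SV_GroupIndex,
               uint3 Gid : SV_GroupID)
{
    // 1. All threads write to LDS.
    lds_values[groupIndex] = input[DTid.x];

    // 2. Cross-wave sync: no thread reads until every thread has written.
    GroupMemoryBarrierWithGroupSync();

    // 3. All threads read LDS (tree reduction, syncing between steps).
    for (uint stride = 128; stride > 0; stride >>= 1)
    {
        if (groupIndex < stride)
            lds_values[groupIndex] += lds_values[groupIndex + stride];
        GroupMemoryBarrierWithGroupSync();
    }

    if (groupIndex == 0)
        groupSums[Gid.x] = lds_values[0];
}
```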


If you use a thread-group size of 1, I thought AMD HW will run your code on its 64-wide vector instruction set with 63 lanes masked out / wasted?

(And likewise on NVidia with 31 masked and Intel with 7 masked.)

And yeah, running 128 threads on AMD instead of 64 is the same as manually unrolling your code 2x, which in some situations can help reduce observed latency.

[edit] Ahhh, I misread! I thought groups of 1 thread were mentioned, but it was groups of 1 wave. :o Oops.

