Tile-based particle rendering


Hi,

Tile-based renderers are quite popular nowadays, like tiled deferred, Forward+ and clustered renderers. There is a presentation about GPU-based particle systems from AMD. What particularly interests me is the tile-based rendering part. The basic idea is to leave the rasterization pipeline when rendering billboards and do it in a compute shader instead, much like Forward+: you determine tile frustums, cull particles, sort them front to back, then render them until the accumulated alpha value reaches 1. The performance results at the end of the slides seem promising. Has anyone ever implemented this? Was it a success, is it worth doing? The front-to-back rendering is the most interesting part in my opinion, because overdraw can be eliminated for alpha blending.
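For reference, here is how I picture the core of it in HLSL. This is my own minimal sketch based on the slides; the buffer layout, MAX_PER_TILE and EvaluateParticle are placeholders, not anything from the actual demo:

#define TILE_SIZE 32
#define MAX_PER_TILE 1024 // placeholder per-tile list capacity

cbuffer TileCB : register(b0)
{
    uint g_NumTilesX;
};

StructuredBuffer<uint> g_TileParticleLists : register(t0); // per tile, sorted front to back
StructuredBuffer<uint> g_TileCounts        : register(t1);
RWTexture2D<float4>    g_Output            : register(u0);

// Placeholder: shades one particle at one pixel, returns non-premultiplied rgba.
float4 EvaluateParticle(uint particleIndex, uint2 pixel)
{
    // A real version would project the billboard, sample its texture, light it...
    return float4(0, 0, 0, 0);
}

[numthreads(TILE_SIZE, TILE_SIZE, 1)]
void RenderParticleTilesCS(uint3 dtid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
{
    uint tile  = gid.y * g_NumTilesX + gid.x;
    uint count = g_TileCounts[tile];

    float4 result = 0; // rgb is premultiplied color, a is accumulated alpha

    // Front-to-back "under" blending: stop as soon as the pixel is opaque.
    for (uint i = 0; i < count && result.a < 0.999; ++i)
    {
        uint p = g_TileParticleLists[tile * MAX_PER_TILE + i];
        float4 src = EvaluateParticle(p, dtid.xy);
        result.rgb += (1 - result.a) * (src.a * src.rgb);
        result.a   += (1 - result.a) * src.a;
    }

    g_Output[dtid.xy] = result;
}

The early-out in the loop condition is where the overdraw saving comes from: once accumulated alpha saturates, everything behind the pixel is skipped entirely.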

Sadly, the demo is no longer available.


I have the demo if you want it, and it is available on Github here: https://github.com/GPUOpen-LibrariesAndSDKs/GPUParticles11/

edit - I attached the demo.

GPUParticles11_v1.0.zip

-potential energy is easily made kinetic-

It is doable; we do it for all particles in our game, but back to front (no early out), and we also interleave them with sorted fragments from traditional geometry and from unsupported particle types. The bandwidth saving plus well-written shader optimizations make it a good gain (plus order-independent transparency :) ).

The challenge is DX11 on PC, without bindless: you have to deal with a texture atlas, and drivers have a hard time optimizing such a complex shader (judging from the DXBC, compared to consoles where we have a dedicated shader compiler). On console and on DX12/Vulkan you can just provide an array of texture descriptors, which is easier :) For practical reasons, and for culling storage, you may want to limit the particle count to a few thousand. That was fine for us, but games built on heavy effects would mourn.
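For illustration, the descriptor-array route on DX12 / Shader Model 5.1 looks roughly like this (a simplified sketch with made-up names, not our shipped shader):

// Unbounded texture array bound through a descriptor table (SM 5.1+).
Texture2D    g_ParticleTextures[] : register(t0, space1);
SamplerState g_LinearClamp        : register(s0);

float4 SampleParticleTexture(uint textureIndex, float2 uv)
{
    // NonUniformResourceIndex is needed when the index varies across a wave.
    return g_ParticleTextures[NonUniformResourceIndex(textureIndex)]
               .SampleLevel(g_LinearClamp, uv, 0);
}

Each particle just carries an index into the array, which is what makes the atlas unnecessary on those APIs.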


3 hours ago, Infinisearch said:

I have the demo if you want it, and it is available on Github here: https://github.com/GPUOpen-LibrariesAndSDKs/GPUParticles11/

edit - I attached the demo.

GPUParticles11_v1.0.zip

Thanks for that, I will check it out. I've just started implementing this myself, anyway. :)

1 hour ago, galop1n said:

For practical reasons, and for culling storage, you may want to limit the particle count to a few thousand. That was fine for us, but games built on heavy effects would mourn.

Hm, that's a bit disappointing. I know most games probably don't use more than a few thousand particles anyway, but I thought this would help with sheer particle counts as well, not just with overdraw.

BTW - you should check https://gpuopen.com for AMD stuff.

-potential energy is easily made kinetic-

I have managed to implement this technique on a console (PS4). I made a tech demo which renders particles with high overdraw and heavy shaders (per-pixel lighting). It can also render particles spread out in the distance with little to no overdraw. I am using an additional coarse culling step for the tile-based approach, like in the AMD demo: the coarse culling culls particles against large screen-space tiles (240x135 pixels), then the fine culling culls them against 32x32-pixel tiles and renders them in the same shader.
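Roughly, the coarse pass is shaped like this (a heavily simplified sketch of the idea, not my actual code; the Particle layout, the capacities and ComputeCoarseTileRect are stand-ins):

struct Particle
{
    float3 position;
    float  radius;
};

#define MAX_PER_COARSE_TILE 10000 // stand-in capacity

cbuffer CullCB : register(b0)
{
    uint g_ParticleCount;
    uint g_NumCoarseTilesX;
};

StructuredBuffer<Particle> g_Particles        : register(t0);
RWStructuredBuffer<uint>   g_CoarseTileCounts : register(u0);
RWStructuredBuffer<uint>   g_CoarseTileLists  : register(u1);

// Stand-in: projects the particle's bounding sphere and returns the coarse
// tile rect it touches as (minX, minY, maxX, maxY).
uint4 ComputeCoarseTileRect(Particle p)
{
    return uint4(0, 0, 0, 0);
}

[numthreads(256, 1, 1)]
void CoarseCullCS(uint3 dtid : SV_DispatchThreadID)
{
    if (dtid.x >= g_ParticleCount)
        return;

    uint4 rect = ComputeCoarseTileRect(g_Particles[dtid.x]);

    // One thread per particle: the particle appends itself to every coarse
    // tile it overlaps, instead of every tile scanning every particle.
    for (uint y = rect.y; y <= rect.w; ++y)
    {
        for (uint x = rect.x; x <= rect.z; ++x)
        {
            uint tile = y * g_NumCoarseTilesX + x;
            uint slot;
            InterlockedAdd(g_CoarseTileCounts[tile], 1, slot);
            if (slot < MAX_PER_COARSE_TILE)
                g_CoarseTileLists[tile * MAX_PER_COARSE_TILE + slot] = dtid.x;
        }
    }
}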

With 100,000 particles filling the screen and heavy overdraw, the tile-based technique is a clear win: it stays under 30 ms, while the rasterization-based technique renders them in about 70 ms.

With 100,000 small particles on screen and little overdraw, rasterization performs clearly better, easily going below 10 ms; the tile-based approach is around 15-20 ms this time.

With 1,000,000 particles and heavy overdraw, the tile-based approach cannot keep up, because it runs out of LDS to store the per-tile particle lists, which results in flickering. Its performance is slow, and rasterization is much slower still, but rasterization at least renders without artifacts.

With 1,000,000 particles and little overdraw, the tile-based approach suffers from culling performance, while rasterization easily does 60 FPS.

It seems to me it can only be used in specific scenarios: a small number of particles with heavy overdraw. However, I imagine most games do not use millions of particles, so it might be worth implementing.

3 minutes ago, turanszkij said:

The coarse culling culls particles against large screen-space tiles (240x135 pixels), then the fine culling culls them against 32x32-pixel tiles and renders them in the same shader.

What's the point of coarse culling in this case... if you're going to tile, why not go straight for the fine tiles? Wouldn't there be fewer memory accesses that way? Also, since it seems you're using LDS for the tile particle lists, why not increase the fine tile size to 64x64, since the L2 should be big enough to keep the whole tile cached? I'm most likely missing something... I don't remember the presentation that well.

14 minutes ago, turanszkij said:

With 100,000 particles filling the screen and heavy overdraw, the tile-based technique is a clear win: it stays under 30 ms, while the rasterization-based technique renders them in about 70 ms.

With 100,000 small particles on screen and little overdraw, rasterization performs clearly better, easily going below 10 ms; the tile-based approach is around 15-20 ms this time.

With 1,000,000 particles and heavy overdraw, the tile-based approach cannot keep up, because it runs out of LDS to store the per-tile particle lists, which results in flickering. Its performance is slow, and rasterization is much slower still, but rasterization at least renders without artifacts.

With 1,000,000 particles and little overdraw, the tile-based approach suffers from culling performance, while rasterization easily does 60 FPS.

Aren't smoke effects typically medium to high overdraw? If so, it would most likely be a win.

-potential energy is easily made kinetic-

1 minute ago, Infinisearch said:

What's the point of coarse culling in this case... if you're going to tile, why not go straight for the fine tiles? Wouldn't there be fewer memory accesses that way? Also, since it seems you're using LDS for the tile particle lists, why not increase the fine tile size to 64x64, since the L2 should be big enough to keep the whole tile cached? I'm most likely missing something... I don't remember the presentation that well.

Coarse culling does result in more memory accesses, but it can lighten the load on the fine culling step a lot, because you no longer do fine culling against a million particles per tile, but only against 10,000 for instance, or whatever number landed in the coarse tile. This generally improves speed a lot, though of course you get an additional indirection. Coarse culling also has a better thread distribution: you only dispatch as many threads as there are particles, and each particle adds itself to the relevant tiles, as opposed to fine culling, where you dispatch for tiles and each tile iterates through the particles and adds them.
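The fine-culling side of that indirection then looks something like this (again a simplified sketch, not my actual code; the helper and capacities are stand-ins, and it assumes each fine tile maps to a single parent coarse tile):

#define FINE_TILE_SIZE 32
#define MAX_PER_FINE_TILE 1024    // bounded by LDS
#define MAX_PER_COARSE_TILE 10000 // same stand-in as the coarse pass sketch

// Same data as the coarse pass, now read-only.
cbuffer CullCB : register(b0) { uint g_ParticleCount; uint g_NumCoarseTilesX; };
StructuredBuffer<uint> g_CoarseTileCounts : register(t0);
StructuredBuffer<uint> g_CoarseTileLists  : register(t1);

groupshared uint gs_FineCount;
groupshared uint gs_FineList[MAX_PER_FINE_TILE];

// Stand-in: tests one particle against this 32x32 tile's frustum.
bool IntersectsFineTile(uint particleIndex, uint2 fineTile)
{
    return true;
}

[numthreads(FINE_TILE_SIZE, FINE_TILE_SIZE, 1)]
void FineCullCS(uint3 gid : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
    if (groupIndex == 0)
        gs_FineCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Map this 32x32 fine tile to its 240x135 parent coarse tile.
    uint2 coarseXY  = (gid.xy * FINE_TILE_SIZE) / uint2(240, 135);
    uint coarseTile = coarseXY.y * g_NumCoarseTilesX + coarseXY.x;

    // 1024 threads cooperatively test only the coarse tile's particles,
    // not the whole particle buffer.
    uint coarseCount = min(g_CoarseTileCounts[coarseTile], MAX_PER_COARSE_TILE);
    for (uint i = groupIndex; i < coarseCount; i += FINE_TILE_SIZE * FINE_TILE_SIZE)
    {
        uint p = g_CoarseTileLists[coarseTile * MAX_PER_COARSE_TILE + i];
        if (IntersectsFineTile(p, gid.xy))
        {
            uint slot;
            InterlockedAdd(gs_FineCount, 1, slot);
            if (slot < MAX_PER_FINE_TILE)
                gs_FineList[slot] = p;
        }
    }
    GroupMemoryBarrierWithGroupSync();

    // ...then sort gs_FineList front to back and run the blend loop from here.
}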

A 64x64 tile size would require even more LDS storage, not less, and I can't even dispatch a threadgroup that big. If I cut back to 16x16 tiles, the LDS can be utilized better, because fewer particles will be visible in a tile, but the parallelism of the culling gets worse: with a 32x32 tile, each thread culls one particle until all are culled, so 1024 particles are culled in parallel; with a 16x16 tile, only 256 particles are culled in parallel, which is slower.
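To put rough numbers on the LDS budget (my own back-of-the-envelope math, assuming the D3D11 limit of 32 KiB of groupshared memory per threadgroup and 4-byte particle indices): 32768 / 4 = 8192 list entries at the absolute maximum, before any other groupshared data. A 64x64 tile covers four times the area of a 32x32 tile, so it tends to collect roughly four times as many particles while the LDS cap stays the same, and at 64x64 = 4096 threads it also exceeds the API's limit of 1024 threads per group.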

By the way, I am also doing decal rendering in a Forward+ renderer, and decals likewise benefit from sorting top to bottom while blending bottom to top and skipping the bottom ones once the accumulated alpha is already one. :)

17 hours ago, turanszkij said:

Coarse culling does result in more memory accesses, but it can lighten the load on the fine culling step a lot, because you no longer do fine culling against a million particles per tile, but only against 10,000 for instance, or whatever number landed in the coarse tile. This generally improves speed a lot, though of course you get an additional indirection. Coarse culling also has a better thread distribution: you only dispatch as many threads as there are particles, and each particle adds itself to the relevant tiles, as opposed to fine culling, where you dispatch for tiles and each tile iterates through the particles and adds them.

I guess I'll have to read through the presentation again, it's been a while.

17 hours ago, turanszkij said:

A 64x64 tile size would require even more LDS storage, not less, and I can't even dispatch a threadgroup that big. If I cut back to 16x16 tiles, the LDS can be utilized better, because fewer particles will be visible in a tile, but the parallelism of the culling gets worse: with a 32x32 tile, each thread culls one particle until all are culled, so 1024 particles are culled in parallel; with a 16x16 tile, only 256 particles are culled in parallel, which is slower.

Oh, you use LDS to do the blending as well? I thought you were going through the L1 and L2, that's why I suggested a larger tile size. But upon thinking about it more, like you said, more particles might be visible with bigger tiles... and you're using LDS for the particle list. As for your 256 vs. 1024 particles being culled in parallel, that doesn't seem right to me. You're using LDS, so wouldn't the compute shader's execution be limited to one CU (on AMD hardware)? If so, it would be limited to 64 threads per clock, and 256 threads in lock-step (since each 16-wide SIMD executes with a cadence of 4 clocks). @MJP @Hodgman Could you clear this up for me: if a compute shader uses LDS, is its execution limited to one CU?

-potential energy is easily made kinetic-

