Tile based particle rendering

János Turánszki · 2017-10-13T22:27:13

Hi, tile-based renderers are quite popular nowadays, like tiled deferred, Forward+ and clustered renderers. There is a presentation about GPU-based particle systems from AMD. What particularly interests me is the tile-based rendering part. The basic idea is to leave the rasterization pipeline when rendering billboards and do it in a compute shader instead, much like Forward+. You determine tile frustums, cull particles, sort front to back, then render them until the accumulated alpha value reaches 1. The performance results at the end of the slides seem promising. Has anyone ever implemented this? Was it a success, and is it worth doing? The front-to-back rendering is the most interesting part in my opinion, because overdraw can be eliminated for alpha blending. The demo is sadly no longer available.
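To make the front-to-back idea concrete, here is a minimal CPU-side sketch in C++ of what one pixel of a tile might do. It is not taken from the AMD presentation: the `Sample` and `Result` structs, the function name and the `cutoff` threshold are all hypothetical, and a real implementation would run in a compute shader over the tile's culled particle list.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical particle sample covering one pixel.
struct Sample {
    float depth;  // view-space depth, smaller = closer to the camera
    float alpha;  // straight (non-premultiplied) opacity in [0, 1]
    float color;  // single grey channel to keep the sketch short
};

struct Result {
    float color;          // accumulated premultiplied colour
    float alpha;          // accumulated opacity
    std::size_t blended;  // number of samples actually blended
};

// Sort nearest-first, then blend "under" until the pixel is (almost) opaque.
// The cutoff value is an assumption; anything close to 1 works.
Result compositeFrontToBack(std::vector<Sample> samples, float cutoff = 0.999f) {
    std::sort(samples.begin(), samples.end(),
              [](const Sample& a, const Sample& b) { return a.depth < b.depth; });

    Result r{0.0f, 0.0f, 0};
    for (const Sample& s : samples) {
        if (r.alpha >= cutoff) {
            break;  // everything further back is hidden, so skip it
        }
        // Only the still-transparent fraction of the pixel can receive colour.
        const float weight = (1.0f - r.alpha) * s.alpha;
        r.color += weight * s.color;
        r.alpha += weight;
        ++r.blended;
    }
    return r;
}
```

With an opaque particle at the front, the loop stops after a single blend no matter how many particles are stacked behind it, which is where the overdraw saving would come from.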


Started by turanszkij October 02, 2017 10:47 AM

22 comments, last by Hodgman 6 years, 6 months ago

Infinisearch

3,058

October 13, 2017 03:57 PM

7 minutes ago, ajmiles said:

You don't want to be writing a 1024-thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers; that's not going to help you hide latency at all.

What do you mean by barriers here?

-potential energy is easily made kinetic-

Adam Miles

3,468

October 13, 2017 04:05 PM

5 minutes ago, Infinisearch said:

What do you mean by barriers here?

GroupMemoryBarrierWithGroupSync() is the one you'll see 99% of the time. It blocks all threads in the thread group from executing any further until all threads have finished accessing LDS and hit that instruction. It's essentially a cross-wave synchronisation point.

1. All threads write to LDS
2. GroupMemoryBarrierWithGroupSync()
3. All threads read LDS.

Would be a typical pattern.
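For readers without a shader at hand, the three steps can be emulated serially on the CPU. This C++ sketch is only an analogy: the group size, the `lds` array and the neighbour-exchange task are assumptions made for illustration, and the boundary between the two loops stands in for the real barrier.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kGroupSize = 64;  // hypothetical thread-group size

// Each emulated "thread" publishes one value to shared memory, then reads
// the value published by its right-hand neighbour (wrapping around).
std::array<int, kGroupSize> exchangeWithNeighbour(const std::array<int, kGroupSize>& input) {
    std::array<int, kGroupSize> lds{};     // stands in for groupshared memory (LDS)
    std::array<int, kGroupSize> output{};

    // Step 1: all threads write to LDS.
    for (std::size_t tid = 0; tid < kGroupSize; ++tid) {
        lds[tid] = input[tid];
    }

    // Step 2: the barrier. On the GPU the threads run concurrently, so a
    // GroupMemoryBarrierWithGroupSync() is needed here. In this serial
    // emulation, finishing the first loop gives the same guarantee.

    // Step 3: all threads read LDS, including slots written by other threads.
    for (std::size_t tid = 0; tid < kGroupSize; ++tid) {
        output[tid] = lds[(tid + 1) % kGroupSize];
    }
    return output;
}
```

Without step 2, a thread on the GPU could read its neighbour's slot before the neighbour had written it.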

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Hodgman

52,717

October 13, 2017 10:27 PM

If you use a thread-group size of 1, I thought AMD HW would run your code on its 64-wide vector instruction set with 63 lanes masked out / wasted?

(and likewise on NVidia with 31 masked and Intel with 7 masked).

And yeah, running 128 threads on AMD instead of 64 is the same as manually unrolling your code 2x, which in some situations can help reduce observed latency.
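The lane arithmetic can be checked with a small C++ helper. This is just a sketch: `WaveUsage` and `waveUsage` are hypothetical names, and the wave widths (64, 32, 8) are simply the ones quoted in this thread.

```cpp
#include <cstdint>

// How a thread group maps onto fixed-width hardware waves.
struct WaveUsage {
    std::uint32_t waves;      // waves needed to cover the group
    std::uint32_t idleLanes;  // lanes masked out in the last wave
};

// Assumes waveSize > 0. Rounds the group up to a whole number of waves.
constexpr WaveUsage waveUsage(std::uint32_t groupThreads, std::uint32_t waveSize) {
    const std::uint32_t waves = (groupThreads + waveSize - 1) / waveSize;
    return WaveUsage{waves, waves * waveSize - groupThreads};
}
```

`waveUsage(1, 64)` gives one wave with 63 idle lanes, `waveUsage(128, 64)` gives two full waves, and `waveUsage(1024, 64)` gives the 16 waves mentioned earlier.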

[edit] Ahhhh, I misread! I thought groups of 1 thread were mentioned, but it was groups of 1 wave. Oops.

. 22 Racing Series .

This topic is closed to new replies.
