Tile based particle rendering

22 comments, last by Hodgman 6 years, 6 months ago
7 minutes ago, ajmiles said:

You don't want to be writing a 1024-thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers; that's not going to help you hide latency at all.

What do you mean by barriers here?

-potential energy is easily made kinetic-

5 minutes ago, Infinisearch said:

What do you mean by barriers here?

GroupMemoryBarrierWithGroupSync() is the one you'll see 99% of the time. It blocks all threads in the thread group from executing any further until all threads have finished accessing LDS and hit that instruction. It's essentially a cross-wave synchronisation point.

1. All threads write to LDS
2. GroupMemoryBarrierWithGroupSync()
3. All threads read LDS.

Would be a typical pattern.
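
As a rough illustration of those three steps, here is a minimal HLSL compute shader sketch. The buffer names (`Input`, `Output`), the register slots, the entry point name `CSMain` and the group size of 256 are all assumptions made up for the example, and it assumes the input buffer holds at least one element per dispatched thread.

```hlsl
// Hypothetical resources: names and register slots are illustrative only.
StructuredBuffer<float>   Input  : register(t0);
RWStructuredBuffer<float> Output : register(u0);

// Assumed group size; 256 threads spans several waves on AMD (64 lanes each).
#define GROUP_SIZE 256

// LDS (group-shared memory): one slot per thread in the group.
groupshared float sharedData[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dispatchId : SV_DispatchThreadID,
            uint  groupIndex : SV_GroupIndex)
{
    // 1. Every thread writes its own element to LDS.
    sharedData[groupIndex] = Input[dispatchId.x];

    // 2. Block until every thread in the group has completed its LDS
    //    accesses and reached this instruction.
    GroupMemoryBarrierWithGroupSync();

    // 3. It is now safe to read a slot that a different thread
    //    (possibly in a different wave) wrote in step 1.
    uint neighbour = (groupIndex + 1) % GROUP_SIZE;
    Output[dispatchId.x] = sharedData[groupIndex] + sharedData[neighbour];
}
```

Without the barrier in step 2, the read of `sharedData[neighbour]` could happen before the wave that owns that slot has written it.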

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

If you use a thread-group size of 1, I thought AMD HW would run your code on its 64-wide vector instruction set with 63 lanes masked out / wasted?

(and likewise on NVIDIA with 31 masked and Intel with 7 masked).

And yeah, running 128 threads on AMD instead of 64 is the same as manually unrolling your code 2x, which in some situations can help reduce observed latency.

[edit] Ahhhh I misread! I thought groups of 1 thread were mentioned, but it was groups of 1 wave. :o oops

