Tile based particle rendering

22 comments, last by Hodgman 6 years, 7 months ago

Using LDS doesn't limit a Compute Shader to a single CU, no. The requirement would be that a single thread group run all its waves on a single CU in order that they all have access to the same bit of LDS.

A 256-thread thread group is 4 waves, and would typically be scheduled to have one wave per SIMD. A 1024-thread thread group would have 4 waves running on each SIMD (all on the same CU). You're only wasting / not using CUs if you have fewer thread groups than you have CUs. Since even the biggest AMD parts only have 64 CUs, you'd have to be running at an extremely low resolution to be issuing fewer than (64 * 1024) threads :).
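The arithmetic in that paragraph can be sketched as follows. This is only an illustration: the constants are read off the figures quoted in the thread (64 threads per wave, 4 SIMDs per CU, 64 CUs), the function names are made up for this example, and the 1920x1080 dispatch with one thread per pixel is an assumed workload.

```python
import math

# Assumptions taken from the figures quoted in the thread, not from any API:
WAVE_SIZE = 64      # threads per wave, implied by "256 threads is 4 waves"
SIMDS_PER_CU = 4    # implied by "4 per SIMD, 16 per CU"
NUM_CUS = 64        # "even the biggest AMD parts only have 64 CUs"


def waves_per_group(threads_per_group: int) -> int:
    """Waves needed to hold one thread group (a partial wave still costs a wave)."""
    return math.ceil(threads_per_group / WAVE_SIZE)


def waves_per_simd(threads_per_group: int) -> float:
    """Average waves per SIMD when the whole group is placed on a single CU."""
    return waves_per_group(threads_per_group) / SIMDS_PER_CU


def thread_groups(total_threads: int, threads_per_group: int) -> int:
    """Thread groups in a dispatch that covers total_threads threads."""
    return math.ceil(total_threads / threads_per_group)


if __name__ == "__main__":
    for size in (64, 256, 1024):
        print(size, "threads ->", waves_per_group(size), "waves,",
              waves_per_simd(size), "per SIMD")
    # Hypothetical workload: one thread per pixel at 1080p, 256 threads per group.
    groups = thread_groups(1920 * 1080, 256)
    print(groups, "thread groups; at least one per CU:", groups >= NUM_CUS)
```

Under these assumptions a full-screen dispatch has far more thread groups than CUs, so no CU is left without work.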

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

6 hours ago, ajmiles said:

Using LDS doesn't limit a Compute Shader to a single CU, no.

My mistake, I meant compute shader invocation.

 

 

-potential energy is easily made kinetic-

15 minutes ago, Infinisearch said:

My mistake, I meant compute shader invocation.

What do you mean by the term 'invocation'? To me an invocation is a single thread of execution, meaning a 1080p Quad would "invoke" the pixel shader ~2M times. A single thread of course will be run on a single CU for its lifetime.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

6 minutes ago, ajmiles said:

What do you mean by the term 'invocation'? To me an invocation is a single thread of execution, meaning a 1080p Quad would "invoke" the pixel shader ~2M times. A single thread of course has to be run on a single CU for its lifetime.

In this context I meant a dispatch.

-potential energy is easily made kinetic-

41 minutes ago, Infinisearch said:

In this context I meant a dispatch.

And so did what I wrote answer your question?

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Yes it did, thank you. I just wanted to clarify what I meant, even though I was still wrong.

-potential energy is easily made kinetic-

@ajmiles BTW, if a thread group doesn't use LDS, can it be spread across multiple CUs?

-potential energy is easily made kinetic-

1 minute ago, Infinisearch said:

@ajmiles BTW, if a thread group doesn't use LDS, can it be spread across multiple CUs?

It won't be, no. The hardware doesn't seem to launch a replacement thread group either until all waves in the thread group have retired, so I tend to steer clear of thread groups > 1 wave unless I'm using LDS.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

3 minutes ago, ajmiles said:

It won't be, no.

Might that behavior change in the future? Or is that entirely in AMD's hands? Is NVIDIA any different?

5 minutes ago, ajmiles said:

The hardware doesn't seem to launch a replacement thread group either until all waves in the thread group have retired, so I tend to steer clear of thread groups > 1 wave unless I'm using LDS.

I remember reading somewhere that AMD seems to like a thread group size of at least 256; am I misremembering? Doesn't this have something to do with hiding memory latency? I'm not that experienced with compute shaders yet and my memory of what I did learn isn't that great.

-potential energy is easily made kinetic-

Just now, Infinisearch said:

Might that behavior change in the future? Or is that entirely in AMD's hands? Is NVIDIA any different?

I remember reading somewhere that AMD seems to like a thread group size of at least 256; am I misremembering? Doesn't this have something to do with hiding memory latency? I'm not that experienced with compute shaders yet and my memory of what I did learn isn't that great.

It seems reasonable to imagine it /could/ change in the future (or may even have already changed?), but that's up to the hardware, it's not a D3D/HLSL thing. No idea what the behaviour is on other IHVs.

I don't think the size of a thread group has much bearing on being able to hide memory latency per se. Obviously you want to make sure the hardware has enough waves to be able to switch between them; 4 per SIMD (16 per CU) is a reasonable target to aim for. But whether that's 16 waves from 16 different thread groups or 16 waves from a single thread group doesn't matter too much, so long as enough of them are making forward progress and executing instructions while others wait on memory.

You don't want to be writing a 1024-thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers; that's not going to help you hide latency at all. There's nothing inherently wrong with larger thread groups; you just need to be aware of how the waves get scheduled and ensure that you don't have too many waves sitting around doing nothing.
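A rough way to put a number on the lopsided case is the toy model below. It is not a description of how the hardware schedules waves: it simply assumes every wave of a thread group stays resident until the slowest wave finishes, that the others wait at a barrier in the meantime, and that work is measured in arbitrary units. The function name and the example workloads are hypothetical.

```python
def group_utilization(wave_work):
    """Fraction of resident wave-time spent doing useful work.

    Toy model (an assumption, not measured hardware behaviour): all waves
    of the group occupy their slots until the slowest wave is done, so the
    group is resident for max(wave_work) units and each wave is useful for
    only its own share of that time.
    """
    if not wave_work or min(wave_work) <= 0:
        raise ValueError("expected a non-empty list of positive work amounts")
    resident_time = max(wave_work)
    return sum(wave_work) / (len(wave_work) * resident_time)


if __name__ == "__main__":
    balanced = [4] * 16            # 16 waves sharing the work evenly
    lopsided = [16] + [1] * 15     # one wave doing most of the work
    print("balanced:", group_utilization(balanced))
    print("lopsided:", group_utilization(lopsided))
```

In this model the balanced group keeps every wave busy for its whole lifetime, while the lopsided group does useful work for only 31 of its 256 wave-time units, which is the "15 waves stalled on barriers" situation described above.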

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

