Question about InterlockedOr

18 comments, last by Matias Goldberg 7 years, 4 months ago
My assumption is that all atomic writes to a UAV are immediately visible to all other threads, since within one dispatch a thread in a different threadgroup can see the atomic result immediately (or is this not true?)

Even if such assumptions hold for current hardware, no one knows what will change in the future.

Following my own advice, I've just added all the missing barriers to my project.

Before the change, the runtime averaged over 10 frames was 2.60 ms; after the change it is still 2.58 ms.

So there seems to be no price to pay for using proper barriers :)

Oops - I forgot that I already read a timestamp at all those places where the new barriers were added.

Reading timestamps has a noticeable performance cost; it's probably similar for the barriers - I'll have a look...

EDIT:

After removing the fine-grained timestamp reading I get 2.432 ms.

There are only 2 barriers left that I could remove while keeping things working; without them I get down to 2.418 ms.

So memory barriers have a cost, but it's not that bad.


Within the same dispatch yes, across different dispatches NO. That includes atomic operations.
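To make that concrete, here is a minimal two-dispatch sketch (all buffer and shader names are invented for illustration, not from this thread). Per the answer above, within BuildFlagsCS the InterlockedOr results are visible across threadgroups of the same dispatch, but ConsumeFlagsCS, issued as a second Dispatch(), only reliably sees them if a UAV barrier is recorded between the two dispatches:

RWStructuredBuffer<uint> gFlags  : register(u0);
RWStructuredBuffer<uint> gCounts : register(u1);

// Dispatch 1: threads from all threadgroups OR bits into shared words.
[numthreads(64, 1, 1)]
void BuildFlagsCS(uint3 id : SV_DispatchThreadID)
{
    InterlockedOr(gFlags[id.x >> 5], 1u << (id.x & 31));
}

// Dispatch 2: plain, non-atomic reads of the same UAV. Without a UAV
// barrier between the two Dispatch() calls these may see stale data.
[numthreads(64, 1, 1)]
void ConsumeFlagsCS(uint3 id : SV_DispatchThreadID)
{
    gCounts[id.x] = countbits(gFlags[id.x]);
}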

Thanks so much for giving a clear answer on that. It would be great if MSFT had good, detailed docs on this topic...

Also, I am not very familiar with how the L2 cache works on a GPU, so the L2 cache is not persistent across dispatches? I mean, even if a new dispatch starts and wants to read the same RAM address, it will find that the data is already in L2, and if the line is not 'dirty' it could be used right away, right?

Or, as you suggest, when the GPU starts a new dispatch, does it actually 'allocate' and initialize a region in L2 specific to that dispatch? If that were the case, the available L2 cache size would differ between dispatches? Sorry, I'm just confused.

Thanks


Yes, I just suggested that in order to use all 32 bits and get the best caching.

Thanks JoeJ. But why would 32-bit elements get the best caching? With 8-bit elements you can fit 4 times more data into the cache, so I feel that in terms of cache hit rate they should be almost the same, right?

With 8-bit elements you can fit 4 times more data into the cache, so I feel that in terms of cache hit rate they should be almost the same, right?

Probably some misunderstanding. I'm not sure what you mean, but when you talked about 2x2x2 I assumed you would use only 8 of the 32 bits.

I don't think a GPU can address individual bytes; accesses must be 32 or 64 bits. So loading 2x2x2 bits has the same cost as loading all 32.
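For illustration, a sketch of what 'use all 32 bits' could look like - assuming (my layout, not necessarily Mr_Fox's) that one uint of a structured buffer stores a 4x4x2 block at 1 bit per voxel:

RWStructuredBuffer<uint> gVoxels : register(u0); // one uint = one 4x4x2 block

// Set one voxel bit inside a block; local is in [0,3]x[0,3]x[0,1].
void SetVoxelBit(uint wordIndex, uint3 local)
{
    uint bit = local.x + local.y * 4u + local.z * 16u; // bit index 0..31
    InterlockedOr(gVoxels[wordIndex], 1u << bit);      // uses the whole 32-bit word
}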

With 8-bit elements you can fit 4 times more data into the cache, so I feel that in terms of cache hit rate they should be almost the same, right?

Probably some misunderstanding. I'm not sure what you mean, but when you talked about 2x2x2 I assumed you would use only 8 of the 32 bits.

I don't think a GPU can address individual bytes; accesses must be 32 or 64 bits. So loading 2x2x2 bits has the same cost as loading all 32.

Yup, I was talking about using a typed format like DXGI_FORMAT_R8_UINT to represent a 2x2x2 voxel block, instead of DXGI_FORMAT_R32_UINT for a 4x4x2 voxel block. So you mean they shouldn't have any perf difference, even though one cache line can fit more data with DXGI_FORMAT_R8_UINT?

Also, I was wondering how atomic ops are implemented on a texture UAV with a format smaller than 32 bits, like DXGI_FORMAT_R8_UINT?

You need to test.

But atomics work only on 32-bit types (Khronos APIs support 64 bits via extensions).

Microsoft seems to be 32-bit only: "This operation can only be performed on int or uint typed resources and shared memory variables"

https://msdn.microsoft.com/en-us/library/windows/desktop/ff471406(v=vs.85).aspx
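Since the intrinsics only accept int/uint, a sub-32-bit 'atomic' has to be emulated on the containing 32-bit word. A hedged sketch of the idea, on a raw buffer rather than an R8 typed UAV (the function and buffer names are made up):

RWByteAddressBuffer gBytes : register(u0);

// Emulate an atomic OR on a single byte by ORing into the 32-bit word
// that contains it; OR leaves the other byte lanes untouched.
void AtomicOrByte(uint byteAddress, uint value8)
{
    uint wordAddress = byteAddress & ~3u;       // align down to the containing word
    uint shift       = (byteAddress & 3u) * 8u; // byte lane within that word
    gBytes.InterlockedOr(wordAddress, (value8 & 0xFFu) << shift);
}

Note this only works for OR/AND/XOR-style operations; an atomic add could not be emulated with a single masked operation like this, because carries would spill into the neighbouring byte lane.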

Mr_Fox - regarding L2, there's no specific region for any dispatch. Perhaps Matias Goldberg can explain what he meant by that.

The writes to your UAV might, for example, bypass L1 (which is per compute unit) and go directly to L2 (one per GPU). It all depends on the actual architecture.

The writes don't go directly to main memory, though - not automatically, at least; they go to a cache instead.

The reads will look into L2, and if they see a non-invalidated cache line for the memory address in question, they won't read RAM and will return the cached value instead.

Hence the need to flush and invalidate (L2) caches between dispatches to ensure visibility for all other clients.

A flush will just transfer all the changed 64-byte (the usual size) cache lines from L2 to main memory and mark all lines in the L2 cache invalid.

However, if the only client interacting with your UAV is the compute shader, and there's only one L2 per GPU, it should NOT be necessary to flush anything. I'm not sure here...

But if the atomic accesses don't bypass L1 - which I can't tell from the MS docs (I don't know how I'd configure that) - a flush+invalidate is definitely necessary. I'd bet this is the default situation.

On one of the recent consoles with GCN, it's possible to set up a cache mode such that both L1 and L2 are completely bypassed and all reads and writes from shaders (for a given resource) go straight to RAM. That is slow, yet usable for certain scenarios involving small amounts of transferred data.

I'm not sure whether this is possible to set up on Windows/DX12; someone might hint if yes or no.

You need to test.

But atomics work only on 32-bit types (Khronos APIs support 64 bits via extensions).

Microsoft seems to be 32-bit only: "This operation can only be performed on int or uint typed resources and shared memory variables"

https://msdn.microsoft.com/en-us/library/windows/desktop/ff471406(v=vs.85).aspx

Yup, that's the sentence I've seen before, and it got me confused: does int/uint really mean just the 32-bit integer types, or integer types in general (8-bit, 16-bit)? But I should definitely test it myself.

Thanks

However, if the only client interacting with your UAV is the compute shader, and there's only one L2 per GPU, it should NOT be necessary to flush anything.

Thanks pcmaster, but from your post you seem to imply that a 'client' switch has an impact on the GPU's cache system. Please correct me if I got it wrong: given a case where I have 2 pixel shaders and 2 compute shaders that want to access the same memory atomically through a UAV (both the CS and the PS will do InterlockedAdd), there would be 2 clients? In my case there is only one L2 cache for the entire GPU, so for my two CS, since they will just read the L2 cache, we don't need to do a flush+invalidate, and the same for the 2 PS accesses. But between CS and PS we probably do need a flush+invalidate, since there is a 'client switch' which will somehow corrupt the L2 cache?

And what exactly is a 'client' on a GPU? Sorry, I'm just a little bit confused.

Thanks

Here, by a memory client I mean: CPU, shader core, colour block, depth block, command processor, DMA. Each has different caches (the CPU has L1/L2 or even L3, shaders have L1 and L2, the CB/DB have separate caches for accessing render targets (but not UAVs), etc.).

So in your case of PS/CS using atomics, it's all the same memory client (shader) using only GPU L1 and GPU L2.

What nobody here seems to know is whether interlocked atomics go via L1, via L2, or somehow else. If it works as I wrote (and it does on the consoles), we'd be safe with not flushing, or with just a partial flush (which I'm not sure we can do with PC D3D12).

After all, D3D12 probably doesn't even want us to know things like this :)

So, for now, we must assume the worst case, and that is an unknown cache hierarchy, unlike on the GCN consoles. Thus we can't know what happens between dispatches/draws, and therefore it seems completely necessary to flush the caches to ensure visibility :(

That's my worst-case assessment, and I hope somebody proves me wrong and it's possible to get away with a less heavy cache sync.

I'm very sorry, I don't know any better yet...

Mr_Fox - regarding L2, there's no specific region for any dispatch. Perhaps Matias Goldberg can explain what he meant by that.

So, for now, we must assume the worst case, and that is an unknown cache hierarchy, unlike on the GCN consoles. Thus we can't know what happens between dispatches/draws, and therefore it seems completely necessary to flush the caches to ensure visibility :(

That's my worst-case assessment, and I hope somebody proves me wrong and it's possible to get away with a less heavy cache sync.

What I meant is answered by the second part I quoted. You can't assume what the HW does (e.g. for all you know, each dispatch could be getting its own region of the L2 cache); that's why, in order to be cross-vendor, cross-platform DX12 compliant, you have to put a barrier between dispatches for dependent jobs.

This topic is closed to new replies.
