Mr_Fox

Question about InterlockedOr


Hey Guys,

 

I recently ran into a bug related to an atomic operation:

void InterlockedAddToUpdateQueue(uint3 u3BlockIdx)
{
    // Receives the value that was stored at this voxel before the OR below.
    uint uOrig = 1;
    // Atomically set the update bit and read back the original value.
    InterlockedOr(
        tex_uavBlockStateVol[u3BlockIdx], BLOCKSTATEMASK_UPDATE, uOrig);
    // Only the thread that saw the bit unset (i.e. the first one to set it)
    // adds the block to the queue.
    if ((uOrig & BLOCKSTATEMASK_UPDATE) == 0) {
        AddToUpdateQueue(u3BlockIdx);
    }
}

The above is a compute shader function. In my case, multiple threads from different (or the same) thread groups call this function and try to add their u3BlockIdx to UpdateQueue (some threads may call it with the same u3BlockIdx). I want only unique u3BlockIdx values to end up in UpdateQueue, so I wrote the function above. The basic idea is that I maintain a flag volume (I just use one bit of an existing block volume, so this doesn't add much memory pressure; tex_uavBlockStateVol is the UAV of that volume). When a thread tries to add its u3BlockIdx, it flags the corresponding bit in the volume, and since I am doing an InterlockedOr I get back the original value at that location, which I can use to check whether u3BlockIdx has already been added to the UpdateQueue.

 

The idea is very straightforward, but my UpdateQueue still contains many duplicated u3BlockIdx entries. So I guess I'm doing something terribly wrong, and I hope you guys could help me figure out what. And if someone has a better idea to achieve what I want, any comments or suggestions will be greatly appreciated.

 

Thanks



Is the result of an InterlockedXXX on a UAV immediately visible to all other threads? I was also wondering how this works for threads within the same warp/wavefront, since all threads in a warp/wavefront execute in lockstep, especially if you read back the original value (the third parameter, which makes it both a read and a write).


Are you sure there are no other writes pending to tex_uavBlockStateVol[u3BlockIdx] at this point that could accidentally clear the flag?

Your assumptions seem correct, so that's the only reason I can think of.

Maybe you need a memory barrier or something between shader invocations to ensure pending writes from the first shader are done before the second shader starts.

 

And if someone has a better idea to achieve what I want, any comments or suggestions will be greatly appreciated.

 

You could try to use a separate UAV for the flags, where each voxel uses one individual bit (e.g. 4x4x2 bits per uint32).

This increases the atomic pressure per uint32, but also increases cache hits.

It would be interesting to see how this affects performance.
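
To make that suggestion concrete, here is a minimal HLSL sketch of such a bit-packed flag UAV, assuming a RWStructuredBuffer<uint> named buf_uavBlockFlags and a constant u3BlockDim holding the block-grid resolution (both names are made up for illustration, not taken from the original code):

// Hypothetical sketch: one flag bit per block, 4x4x2 blocks packed per uint.
cbuffer cbFlagVolume : register(b0)
{
    uint3 u3BlockDim;   // block-grid resolution (assumed constant)
};
RWStructuredBuffer<uint> buf_uavBlockFlags : register(u1);

// Returns true if this thread is the first one to flag the block.
bool InterlockedFlagBlock(uint3 u3BlockIdx)
{
    // Which 4x4x2 cell of blocks the index falls into, and that cell's
    // linear address in the buffer.
    uint3 u3Cell = u3BlockIdx / uint3(4, 4, 2);
    uint3 u3CellDim = (u3BlockDim + uint3(3, 3, 1)) / uint3(4, 4, 2);
    uint uCellIdx = (u3Cell.z * u3CellDim.y + u3Cell.y) * u3CellDim.x + u3Cell.x;

    // Which of the 32 bits inside that cell belongs to this block.
    uint3 u3Local = u3BlockIdx - u3Cell * uint3(4, 4, 2);
    uint uBit = (u3Local.z * 4 + u3Local.y) * 4 + u3Local.x;

    uint uOrig;
    InterlockedOr(buf_uavBlockFlags[uCellIdx], 1u << uBit, uOrig);
    return (uOrig & (1u << uBit)) == 0;
}

InterlockedAddToUpdateQueue would then call AddToUpdateQueue(u3BlockIdx) only when InterlockedFlagBlock returns true; neighbouring blocks now contend on the same uint, but they also share the same cache line.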

Are you sure there are no other writes pending to tex_uavBlockStateVol[u3BlockIdx] at this point that could accidentally clear the flag? Your assumptions seem correct, so that's the only reason I can think of.

Thanks JoeJ for confirming that my assumptions look correct; that gave me the confidence to start looking for the bug elsewhere, and I did find it. It was a silly bug: some u3BlockIdx values are out of bounds, and tex_uavBlockStateVol[outofbound] somehow gets remapped to a valid voxel, so my flag gets corrupted. What is interesting is that I had the validation layer turned on, and there was no error, no warning, and no crash even though I was doing out-of-bounds reads and writes...

 

And thank you for the bit-packed stand-alone flag volume suggestion, but I think that in my case it may not improve performance, since my compute shaders can overlap: compute shaders from a previous pass or a later pass are likely to execute at the same time, and those passes frequently do atomic reads from and writes to the big volume, so parts of the big volume need to be in cache anyway. Thus the cache hit rate may not change that much across all running compute shaders.

 

But as always, please correct me if I got that wrong.

 

Thanks


AFAIK, for cases like this, D3D12 requires you to put a memory barrier between each dispatch. Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch as some caches may have yet to be flushed.
D3D11 enforced a memory barrier between each dispatch (which was a waste if two dispatches were completely independent); but D3D12 doesn't enforce this and you are the one who needs to do that.
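
For illustration, a minimal sketch of such a barrier with the D3D12 API, assuming pCmdList is the command list recording both dispatches and pBlockStateVolResource is the resource behind tex_uavBlockStateVol (both names and the dispatch sizes are placeholders, not from the original code):

#include <d3d12.h>

// Hypothetical sketch: a UAV barrier between two dependent dispatches so the
// second dispatch sees the (atomic) writes of the first.
void DispatchWithUavBarrier(ID3D12GraphicsCommandList* pCmdList,
                            ID3D12Resource* pBlockStateVolResource)
{
    pCmdList->Dispatch(64, 64, 64);                  // first pass writes the flag volume

    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
    barrier.UAV.pResource = pBlockStateVolResource;  // or nullptr to cover all UAV accesses
    pCmdList->ResourceBarrier(1, &barrier);

    pCmdList->Dispatch(64, 64, 64);                  // second pass reads the flag volume
}

The PSO and root signature setup for each dispatch is omitted here; the point is only the ResourceBarrier call in between.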

Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch as some caches may have yet to be flushed.

 

Even though all my writes are atomic? My assumption is that all atomic writes to a UAV are immediately visible to all other threads, since within one dispatch threads in different thread groups can see the atomic result immediately (or is that not true?). Thus, if atomic writes can be 'synced' between thread groups (on different EUs), they should also be 'synced' between different dispatches without a UAV barrier?

 

Please correct me if I got that wrong. Thanks



You could try to use a separate UAV for the flags, where each voxel uses one individual bit (e.g. 4x4x2 bits per uint32).

 

Just curious: why not suggest using a 2x2x2 8-bit typed buffer, which maps more naturally to a cube? Is a 32-bit element buffer preferred?

Thanks


Yes, I just suggested this to use all 32 bits to get the best caching.

I'd like to use 4x4x4 with a uint64 the most.

In Vulkan the 64-bit type is not guaranteed to be supported on every GPU, but I guess in DX it is.


 

Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch as some caches may have yet to be flushed.

 
Even though all my writes are atomic? My assumption is that all atomic writes to a UAV are immediately visible to all other threads, since within one dispatch threads in different thread groups can see the atomic result immediately (or is that not true?). Thus, if atomic writes can be 'synced' between thread groups (on different EUs), they should also be 'synced' between different dispatches without a UAV barrier?

 

Within the same dispatch yes, across different dispatches NO. That includes atomic operations.

 

Edit: IIRC atomic operations work at the L2 cache level, which is why all threads from any thread group within the same dispatch can see the atomic operation being performed. However, for a different dispatch you'll need a barrier to flush the L2 cache back to RAM and from RAM back to L2 (or at least have the second dispatch wait for the first dispatch to finish and reuse the same L2 region). If you don't insert a barrier, the two dispatches may end up executing in parallel using different L2 regions, thus "clearing" your ORs due to the race condition.



AFAIK, for cases like this, D3D12 requires you to put a memory barrier between each dispatch. Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch as some caches may have yet to be flushed. D3D11 enforced a memory barrier between each dispatch (which was a waste if two dispatches were completely independent); but D3D12 doesn't enforce this and you are the one who needs to do that.

 

I share this experience with Vulkan.

I have not yet looked up the specs for this, but my project has become quite large by now (about 30 CS invocations in a single queue) and often an invocation depends on the result of the previous one.

Sometimes it would work without memory barriers, but often I definitely need them.

 

I assume we have to use memory barriers whenever we need to make sure the data is ready, no matter whether the writes are done atomically or not, and even if there is an unrelated dispatch in between.

 

(Just to give some feedback related to the other thread)

My assumption is that all atomic writes to a UAV are immediately visible to all other threads, since within one dispatch threads in different thread groups can see the atomic result immediately (or is that not true?)

 

Even if such assumptions are true for current hardware, no one knows what changes in the future.

 

Following my own advice, I've just added all the missing barriers to my project.

Before the change the averaged runtime over 10 frames was 2.60 ms; after the change it is still 2.58 ms.

So there seems to be no price to pay for using proper barriers :)

 

Oops, I forgot that I already read a timestamp at all those places where the new barriers have been added.

Reading timestamps has a noticeable performance cost; probably it's the same for the barriers. I'll have a look...

 

 

EDIT:

 

After removing the fine-grained timestamp reading I get 2.432 ms.

There are only 2 barriers left that I could remove while keeping things working; removing them gets me down to 2.418 ms.

So memory barriers have a cost, but it's not that bad.



Within the same dispatch yes, across different dispatches NO. That includes atomic operations.

 

Thanks so much for putting a clear answer on that; it would be great if MSFT had good, detailed docs on these topics...

Also, I am not very familiar with how the L2 cache works on a GPU, so the L2 cache is not persistent across dispatches? I mean, even if a new dispatch starts and wants to read the same RAM address, it will find that the data is already in L2, and since the L2 line is not 'dirty' it can use it right away, right?

Or, as you suggest, when the GPU starts a new dispatch, does it actually 'allocate' and initialize a region in L2 specific to that dispatch? If that is the case, the available L2 cache size would be different for different dispatches? Sorry, I'm just confused.

 

Thanks 


Yes, I just suggested this to use all 32 bits to get the best caching.

Thanks JoeJ, but why does a 32-bit element get the best caching? With 8-bit elements you can fit 4 times more data into the cache, so I feel that in terms of cache hit rate they should be almost the same, right?


With 8-bit elements you can fit 4 times more data into the cache, so I feel that in terms of cache hit rate they should be almost the same, right?

 

Probably some misunderstanding. I'm not sure what you mean, but when you talked about 2x2x2 I assumed you would use only 8 of the 32 bits.

I don't think a GPU can address individual bytes; it must be 32 or 64 bits. So loading 2x2x2 bits has the same cost as loading all 32.


 

With 8-bit elements you can fit 4 times more data into the cache, so I feel that in terms of cache hit rate they should be almost the same, right?

Probably some misunderstanding. I'm not sure what you mean, but when you talked about 2x2x2 I assumed you would use only 8 of the 32 bits.

I don't think a GPU can address individual bytes; it must be 32 or 64 bits. So loading 2x2x2 bits has the same cost as loading all 32.

 

Yup, I was talking about using a typed format like DXGI_FORMAT_R8_UINT to represent a 2x2x2 voxel block instead of DXGI_FORMAT_R32_UINT for a 4x4x2 voxel block. So you mean they shouldn't have any perf difference, even though one cache line can fit more data with DXGI_FORMAT_R8_UINT?

 

Also, I was wondering how atomic ops are implemented on a texture UAV with a format smaller than 32 bits, like DXGI_FORMAT_R8_UINT?


You need to test.

But atomics only work on 32-bit types (Khronos APIs support 64-bit via extensions).

 

Microsoft seems to be 32-bit only: "This operation can only be performed on int or uint typed resources and shared memory variables"

https://msdn.microsoft.com/en-us/library/windows/desktop/ff471406(v=vs.85).aspx
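
So for the flag volume that means declaring the UAV with a 32-bit integer element; a minimal sketch below (the register slot and mask value are just placeholders), where the underlying resource would be created as DXGI_FORMAT_R32_UINT rather than R8_UINT:

// Hypothetical sketch: InterlockedOr needs an int/uint (32-bit) UAV element,
// so the typed UAV view behind it is expected to be R32_UINT / R32_SINT.
RWTexture3D<uint> tex_uavBlockStateVol : register(u0);

void FlagBlock(uint3 u3BlockIdx)
{
    const uint uMask = 0x1u;   // stand-in for BLOCKSTATEMASK_UPDATE
    uint uOrig;
    InterlockedOr(tex_uavBlockStateVol[u3BlockIdx], uMask, uOrig);
}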


Mr_Fox - regarding L2, there's no specific region for any dispatch. Maybe Matias Goldberg might explain what he meant by that.

 

The writes to your UAV might, for example, bypass L1 (which is per compute unit, for example) and go directly to L2 (one per GPU), all depending on the actual architecture.

The writes don't go directly to main memory, though, at least not automatically; they go to a cache instead.

The reads will look into L2, and if they see a non-invalidated cache line for the memory address in question, they won't read RAM but will return the cached value instead.

Hence the need to flush and invalidate (L2) caches between dispatches to ensure visibility for all other clients.

 

A flush will just transfer all the changed 64-byte (usual size) cache lines from L2 to the main memory and mark all lines invalid in the L2 cache.

 

However, if the only client interacting with your UAV is the compute shader, and there's only one L2 per GPU, it should NOT be necessary to flush anything. I'm not sure here...

But if the atomic accesses don't bypass L1, which I can't tell from the MS docs (I don't know how I'd configure that), a flush+invalidate is definitely necessary. I'd bet this is the default situation.

 

On one of the recent consoles with GCN, it's possible to set up a cache operation such that both L1 and L2 are completely bypassed and all reads and writes (on a given resource) from shaders go straight to RAM, which is slow, yet usable for certain scenarios involving small amounts of transferred data.

I'm not sure if this is possible to set up on Windows/DX12; someone might hint whether it is or not.



You need to test.

But atomics only work on 32-bit types (Khronos APIs support 64-bit via extensions).

 

Microsoft seems to be 32-bit only: "This operation can only be performed on int or uint typed resources and shared memory variables"

https://msdn.microsoft.com/en-us/library/windows/desktop/ff471406(v=vs.85).aspx

 

Yup, that's the sentence I've seen before, and it got me confused: does int/uint really mean just the 32-bit integer types, or integer types in general (8-bit, 16-bit)? But I should definitely test it myself.

Thanks


However, if the only client interacting with your UAV is the compute shader, and there's only one L2 per GPU, it should NOT be necessary to flush anything.
 

 

Thanks pcmaster. From your post you seem to imply that a 'client' switch has an impact on the GPU's cache system. So please correct me if I got it wrong: given a case where I have 2 pixel shaders and 2 compute shaders that want to access the same memory atomically through a UAV (both CS and PS do InterlockedAdd), there would be 2 clients? And since in my case there is only one L2 cache for the entire GPU, my two CS will just read the L2 cache and we don't need to do a flush+invalidate, and the same holds for the 2 PS accesses. But between CS and PS we probably need a flush+invalidate, since there is a 'client switch' which will somehow corrupt the L2 cache?

 

And what exactly is a 'client' on a GPU? Sorry, I'm just a little bit confused.

 

Thanks


Here, by a memory client I mean: CPU, shader core, colour block, depth block, command processor, DMA. Each has different caches (CPU has L1/L2 or even L3, shaders have L1 and L2, CB/DB have separate caches for accessing render targets (but not UAVs), etc).

So in your case of PS/CS using atomics, it's all the same memory client (shader) using only GPU L1 and GPU L2.

 

What nobody here seems to know is whether interlocked atomics go via L1, via L2, or some other way. If it is as I wrote (and it is on the consoles), we'd be safe not flushing, or doing just a partial flush (which I'm not sure we can do with PC D3D12).

 

After all, D3D12 probably doesn't even want us to know things like this :)

 

So, for now, we must assume the worst case, and that is an unknown cache hierarchy, unlike on the GCN consoles. Thus, we can't know what happens between dispatches/draws, and therefore it seems completely necessary to flush the caches to ensure visibility :(

That's my worst-case assessment, and I hope somebody proves me wrong and it's possible to do this with a less heavy cache sync.

 

I'm very sorry, I don't know any better yet...



Mr_Fox - regarding L2, there's no specific region for any dispatch. Maybe Matias Goldberg might explain what he meant by that.

 

So, for now, we must assume the worst case, and that is an unknown cache hierarchy, unlike on the GCN consoles. Thus, we can't know what happens between dispatches/draws, and therefore it seems completely necessary to flush the caches to ensure visibility :(

That's my worst-case assessment, and I hope somebody proves me wrong and it's possible to do this with a less heavy cache sync.

 

What I meant is answered by the second part I quoted. You can't assume what the HW does (e.g. as far as you know, each dispatch could be getting its own region of L2 cache); that's why, in order to be cross-vendor, cross-platform DX12 compliant, you have to put a barrier between dependent dispatches.

