Question about InterlockedOr

Started by
18 comments, last by Matias Goldberg 7 years, 4 months ago

Hey Guys,

I recently ran into a bug related to an atomic operation:


// Set the update flag for this block; only the thread that flips the
// bit from 0 to 1 appends the block to the queue.
void InterlockedAddToUpdateQueue(uint3 u3BlockIdx)
{
    uint uOrig = 1;
    // InterlockedOr writes the value that was stored before the OR into uOrig.
    InterlockedOr(
        tex_uavBlockStateVol[u3BlockIdx], BLOCKSTATEMASK_UPDATE, uOrig);
    // If the bit was not set before, this thread is the first one here.
    if ((uOrig & BLOCKSTATEMASK_UPDATE) == 0) {
        AddToUpdateQueue(u3BlockIdx);
    }
}

So the above is a compute shader function. In my case, multiple threads from different (or the same) threadgroup call this function to try to add u3BlockIdx to the UpdateQueue (some threads may call it with the same u3BlockIdx), and I only want unique u3BlockIdx values added to the UpdateQueue, so I wrote the function above. The basic idea is that I maintain a flag volume (I just use one bit of an existing block volume, so this doesn't add much memory pressure; tex_uavBlockStateVol is the UAV of that volume). When a thread tries to add its u3BlockIdx, it flags the corresponding bit in the volume, and since I am doing an InterlockedOr, I get back the original value at that location and use it to check whether u3BlockIdx has already been added to the UpdateQueue.

The idea is very straightforward, but my UpdateQueue still contains many duplicated u3BlockIdx entries, so I guess I am doing something terribly wrong and hope you guys could help me figure it out. And if someone has a better idea to achieve what I want, any comments or suggestions will be greatly appreciated.

Thanks


Is the result of an InterlockedXXX on a UAV immediately visible to all other threads? I was wondering how this works for threads within the same warp/wavefront, since all threads in a warp/wavefront execute in lockstep, especially when you read back the original value (the third parameter, which makes it both a read and a write).
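For what it's worth, the read-back of the original value is well defined even in lockstep: the hardware serializes the read-modify-write per lane, so each lane receives the value from just before its own operation. A small illustrative sketch (the buffer names are made up, not from your code):

// Illustrative only: every lane atomically bumps a shared counter and
// receives the value before its own increment, even though the lanes
// run in lockstep; the hardware serializes the read-modify-write.
RWStructuredBuffer<uint> uavCounter;   // element 0 holds the running count
RWStructuredBuffer<uint> uavOutput;

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint uSlot;
    InterlockedAdd(uavCounter[0], 1, uSlot);   // uSlot is unique per lane
    uavOutput[uSlot] = dtid.x;
}

This is the same property your InterlockedOr relies on: only one thread sees the flag bit clear in uOrig.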

Are you sure there are no other writes pending to tex_uavBlockStateVol[u3BlockIdx] at this point that could accidentally clear the flag?

Your assumptions seem correct, so that's the only reason I can think of.

Maybe you need a memory barrier or something between shader invocations to ensure pending writes from the first shader are done before the second shader starts.

And if someone has a better idea to achieve what I want, any comments or suggestions will be greatly appreciated.

You could try using a dedicated UAV for the flags where each voxel gets one individual bit (e.g. a 4x4x2 block of bits per uint32).

This increases the atomic pressure per uint32, but also increases cache hits.

It would be interesting to see how this affects performance.
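A minimal sketch of what that packing could look like, assuming a separate RWStructuredBuffer<uint> for the flags and the 4x4x2 bits-per-uint layout; the names here (uavPackedFlags, g_u3NumPackedCells, TrySetUpdateFlag) are made up for illustration, not from the original code:

// Hypothetical packed flag buffer: each uint holds a 4x4x2 block of voxel flags.
RWStructuredBuffer<uint> uavPackedFlags;            // sized to cover the whole volume
static const uint3 g_u3GroupDim = uint3(4, 4, 2);   // voxels packed per uint

cbuffer FlagVolumeCB   // hypothetical constants
{
    uint3 g_u3NumPackedCells;   // flag volume size in packed uints per axis
};

bool TrySetUpdateFlag(uint3 u3BlockIdx)
{
    // Which packed uint this voxel falls into, and which bit inside it.
    uint3 u3Cell   = u3BlockIdx / g_u3GroupDim;
    uint3 u3Local  = u3BlockIdx % g_u3GroupDim;
    uint  uCellIdx = (u3Cell.z * g_u3NumPackedCells.y + u3Cell.y) * g_u3NumPackedCells.x + u3Cell.x;
    uint  uBit     = (u3Local.z * g_u3GroupDim.y + u3Local.y) * g_u3GroupDim.x + u3Local.x;

    uint uOrig;
    InterlockedOr(uavPackedFlags[uCellIdx], 1u << uBit, uOrig);

    // True only for the first thread that set this voxel's bit.
    return (uOrig & (1u << uBit)) == 0;
}

Each uint then covers 32 neighbouring voxels from one cache line, at the cost of more threads contending on the same uint, which is the trade-off described above.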

Are you sure there are no other writes pending to tex_uavBlockStateVol[u3BlockIdx] at this point that could accidentally clear the flag? Your assumptions seem correct, so that's the only reason I can think of.

Thanks JoeJ for confirming that my assumptions look correct; that gave me the confidence to start looking for the bug elsewhere, and I did find it. It was a silly bug: some u3BlockIdx values were out of bounds, and tex_uavBlockStateVol[outofbound] somehow got remapped to a valid voxel, so my flags got corrupted. What's interesting is that I had the validation layer turned on, and there was no error, no warning, and no crash even though I was doing out-of-bounds reads and writes...
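For reference, a bounds guard like the one below would have caught this; it assumes the dimensions of tex_uavBlockStateVol are available to the shader in a constant, here given the hypothetical name g_u3BlockVolDim:

// Hypothetical early-out; g_u3BlockVolDim is assumed to hold the dimensions
// of tex_uavBlockStateVol (e.g. provided through a constant buffer).
void InterlockedAddToUpdateQueue(uint3 u3BlockIdx)
{
    // Reject out-of-bounds indices before touching the flag volume or the queue.
    if (any(u3BlockIdx >= g_u3BlockVolDim))
        return;

    uint uOrig;
    InterlockedOr(
        tex_uavBlockStateVol[u3BlockIdx], BLOCKSTATEMASK_UPDATE, uOrig);
    if ((uOrig & BLOCKSTATEMASK_UPDATE) == 0) {
        AddToUpdateQueue(u3BlockIdx);
    }
}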

And thank you for the bit-packed stand-alone flag volume suggestion, but I think in my case it may not improve performance, since my compute shaders can overlap: shaders from previous or later passes may execute together, and those passes do atomic reads and writes to the big volume frequently, so parts of the big volume need to be in cache anyway, and the cache hit rate may not change much across all the running shaders.

But as always, please correct me if I got that wrong.

Thanks

AFAIK, for cases like this, D3D12 requires you to put a memory barrier between each dispatch. Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch as some caches may have yet to be flushed.
D3D11 enforced a memory barrier between each dispatch (which was a waste if two dispatches were completely independent); but D3D12 doesn't enforce this and you are the one who needs to do that.
Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch as some caches may have yet to be flushed.

Even though all my writes are atomic? My assumption is that all atomic writes to a UAV are immediately visible to all other threads, since within one dispatch, threads in different threadgroups can see the atomic result immediately (or is that not true?). So if atomic writes can be 'synced' between threadgroups (on different EUs), shouldn't they also be 'synced' between different dispatches without a UAV barrier?

Please correct me if I got that wrong. Thanks

You could try using a dedicated UAV for the flags where each voxel gets one individual bit (e.g. a 4x4x2 block of bits per uint32).

Just curious, why not suggest a 2x2x2 8-bit typed buffer, which maps more naturally to a cube? Is a 32-bit element buffer preferred?

Thanks

Yes, I just suggested that to use all 32 bits and get the best caching.

I'd like a 4x4x4 block in a uint64 the most.

In Vulkan a 64-bit type is not guaranteed to be supported on every GPU, but I guess in DX it is.

Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch as some caches may have yet to be flushed.


Even though all my writes are atomic? My assumption is that all atomic writes to a UAV are immediately visible to all other threads, since within one dispatch, threads in different threadgroups can see the atomic result immediately (or is that not true?). So if atomic writes can be 'synced' between threadgroups (on different EUs), shouldn't they also be 'synced' between different dispatches without a UAV barrier?

Within the same dispatch yes, across different dispatches NO. That includes atomic operations.

Edit: IIRC atomic operations work at the L2 cache level, which is why all threads from any threadgroup within the same dispatch can see the atomic operation being performed. However, for a different dispatch you'll need a barrier to flush the L2 cache back to RAM and from RAM back to L2 (or at least make the second dispatch wait for the first dispatch to finish and reuse the same L2 region). If you don't insert a barrier, the two dispatches may end up executing in parallel using different L2 regions, thus "clearing" your OR'd flags due to the race condition.

AFAIK, for cases like this, D3D12 requires you to put a memory barrier between each dispatch. Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch as some caches may have yet to be flushed. D3D11 enforced a memory barrier between each dispatch (which was a waste if two dispatches were completely independent); but D3D12 doesn't enforce this and you are the one who needs to do that.

I've had the same experience with Vulkan.

I have not looked up the specs for this yet, but my project has become quite large now (about 30 CS invocations in a single queue), and often an invocation depends on the result of the previous one.

Sometimes it would work without memory barriers, but often I definitely need them.

I assume we have to use memory barriers whenever we need to make sure the data is ready, no matter whether the writes are atomic or not, and even if there is an unrelated dispatch in between.

(Just to give some feedback related to the other thread)

This topic is closed to new replies.
