Mr_Fox

Question about InterlockedOr


Hey Guys,

 

I recently ran into a bug related to an atomic operation:

void InterlockedAddToUpdateQueue(uint3 u3BlockIdx)
{
    uint uOrig = 1;
    // Atomically set the update flag and read back the original value.
    InterlockedOr(
        tex_uavBlockStateVol[u3BlockIdx], BLOCKSTATEMASK_UPDATE, uOrig);
    // Only the thread that saw the flag clear should enqueue the block.
    if ((uOrig & BLOCKSTATEMASK_UPDATE) == 0) {
        AddToUpdateQueue(u3BlockIdx);
    }
}

So the above is a compute shader function. In my case, multiple threads from different (or the same) threadgroup call this function to try to add u3BlockIdx to the UpdateQueue (some threads may call it with the same u3BlockIdx), and I only want unique u3BlockIdx values in the UpdateQueue, so I wrote the function above.

The basic idea is that I maintain a flag volume (I just use one bit of an existing block volume, so this doesn't add much memory pressure; tex_uavBlockStateVol is the UAV of that volume). When a thread tries to add its u3BlockIdx, it flags the corresponding bit in the volume, and since I am doing an InterlockedOr, I get the original value at that location and can use it to check whether u3BlockIdx was already added to the UpdateQueue.

 

The idea is very straightforward, but my UpdateQueue still contains many duplicated u3BlockIdx values. So I guess I may be doing something terribly wrong, and I hope you guys can help me figure it out. And if someone has a better idea to achieve what I want, any comments or suggestions will be greatly appreciated.

 

Thanks

Edited by Mr_Fox


Is the result of an InterlockedXXX on a UAV immediately visible to all other threads? I was wondering how this works for threads within the same warp/wavefront, since all threads in a warp/wavefront execute in lockstep, especially if you read back the original value (the third parameter, which means the operation is both a read and a write).


Are you sure there are no other writes pending to tex_uavBlockStateVol[u3BlockIdx] at this point that could accidentally clear the flag?

Your assumptions seem correct, so that's the only reason I can think of.

Maybe you need a memory barrier or something similar between shader invocations to ensure pending writes from the first shader are done before the second shader starts.

 

And if someone has a better idea to achieve what I want, any comments or suggestions will be greatly appreciated.

 

You could try using a dedicated UAV for the flags where each voxel uses one individual bit (e.g. a 4x4x2 block of bits per uint32).

This increases the atomic pressure per uint32, but also increases cache hits.

It would be interesting to see how this affects performance.

Are you sure there are no other writes pending to tex_uavBlockStateVol[u3BlockIdx] at this point that could accidentally clear the flag? Your assumptions seem correct, so that's the only reason I can think of.

Thanks JoeJ for confirming that my assumptions look correct; that gave me the confidence to look for the bug elsewhere, and I did find it. It was a silly bug: some u3BlockIdx values were out of bounds, and tex_uavBlockStateVol[outofbound] somehow got remapped to a valid voxel, which corrupted my flags. What's interesting is that I had the validation layer turned on, and there was no error, no warning, and no crash even though I was doing out-of-bounds reads and writes...

 

And thank you for the packed stand-alone flag volume suggestion, but I think in my case it may not improve performance: my compute shaders can overlap, so shaders from a previous or later pass are likely to execute at the same time, and those passes frequently do atomic reads and writes to the big volume, so parts of the big volume need to be in cache anyway. Thus the cache hit rate may not change that much across all running compute shaders.

 

But as always, please correct me if I got that wrong.

 

Thanks

Edited by Mr_Fox

AFAIK, for cases like this, D3D12 requires you to put a memory barrier between dispatches. Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch, as some caches may have yet to be flushed.

D3D11 enforced a memory barrier between dispatches (which was a waste if two dispatches were completely independent), but D3D12 doesn't enforce this, and you are the one who needs to do it.

Edited by Matias Goldberg

Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch as some caches may have yet to be flushed.

 

Even though all my writes are atomic? My assumption is that all atomic writes to a UAV are immediately visible to all other threads, since within one dispatch call, threads in different threadgroups can see the atomic result immediately (or is this not true?). So if atomic writes can be 'synced' between threadgroups (on different EUs), shouldn't they also be 'synced' between different dispatches without a UAV barrier?

 

Please correct me if I got that wrong. Thanks

Edited by Mr_Fox


You could try using a dedicated UAV for the flags where each voxel uses one individual bit (e.g. a 4x4x2 block of bits per uint32).

 

Just curious: why not suggest using a 2x2x2 8-bit typed buffer, which maps more naturally to a cube? Is a 32-bit element buffer preferred?

Thanks


Yes, I just suggested that to use all 32 bits and get the best caching.

I'd most like to use 4x4x4 bricks with uint64.

In Vulkan, the 64-bit type is not guaranteed to be supported on every GPU, but I guess in DX it is.


 

Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch as some caches may have yet to be flushed.

 
Even though all my writes are atomic? My assumption is that all atomic writes to a UAV are immediately visible to all other threads, since within one dispatch call, threads in different threadgroups can see the atomic result immediately (or is this not true?). So if atomic writes can be 'synced' between threadgroups (on different EUs), shouldn't they also be 'synced' between different dispatches without a UAV barrier?

 

Within the same dispatch, yes; across different dispatches, NO. That includes atomic operations.

 

Edit: IIRC, atomic operations work at the L2 cache level, which is why all threads from any threadgroup within the same dispatch can see the atomic operation being performed. However, for a different dispatch you'll need a barrier to flush the L2 cache back to RAM and from RAM back to L2 (or at least make the second dispatch wait for the first one to finish and reuse the same L2 region). If you don't insert a barrier, the two dispatches may end up executing in parallel using different L2 regions, thus "clearing" your ORs due to the race condition.
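In D3D12 terms, the barrier described here is a UAV barrier on the flag volume between the two dispatches. A sketch only, not compilable on its own (pBlockStateVol is a hypothetical name for the resource behind tex_uavBlockStateVol, and cmdList is an ID3D12GraphicsCommandList):

```cpp
cmdList->Dispatch(x1, y1, z1);              // first pass writes the flags

D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type          = D3D12_RESOURCE_BARRIER_TYPE_UAV;
barrier.UAV.pResource = pBlockStateVol;     // or nullptr to cover all UAV accesses
cmdList->ResourceBarrier(1, &barrier);      // make the UAV writes visible

cmdList->Dispatch(x2, y2, z2);              // second pass now sees them
```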

Edited by Matias Goldberg


AFAIK, for cases like this, D3D12 requires you to put a memory barrier between dispatches. Otherwise not all of the writes to tex_uavBlockStateVol will be seen by the next dispatch, as some caches may have yet to be flushed. D3D11 enforced a memory barrier between dispatches (which was a waste if two dispatches were completely independent), but D3D12 doesn't enforce this, and you are the one who needs to do it.

 

I share this experience with Vulkan.

I have not yet looked up the specs for this, but my project has become quite large now (about 30 CS invocations in a single queue), and often an invocation depends on the result of the previous one.

Sometimes it would work without memory barriers, but often I definitely need them.

 

I assume we have to use memory barriers whenever we need to make sure the data is ready, no matter whether the writes are atomic or not, and even if there is an unrelated dispatch in between.

 

(Just to give some feedback related to the other thread)
