Atomic functions in SM5

7 comments, last by Jason Z 10 years, 7 months ago

I'm working on a voxel renderer and I'm using a pretty basic oct-tree design to store the data. Each oct-tree node is a 2x2x2 block of either indices or colors (in the case of leaf nodes). To keep memory usage down, identical blocks are merged. 'Compressing' to this format on the CPU is quite easy: I simply use a hash table to store the blocks as well as to find identical ones. Now I'd like to try to implement the 'compression' on the GPU.

The general idea as it stands is to have a main 'block' texture and to add to it incrementally from a scratch texture. The scratch texture would hold the raw/uncompressed voxel data, and the 'block' texture would essentially be a hash table that stores the blocks. I would generate a portion of my world directly to the scratch texture, then 'compress' that to the main texture, repeat, etc.

The way I was going to set it up was to have the scratch and main textures as normal textures, use a pixel shader to output to an 'index' texture which would save the destination index for each block in the scratch texture (the blocks themselves would be copied in a 2nd pass), and to have a RWTexture to indicate which slots in the main texture were filled, empty, or to be filled in the 2nd pass. Each thread in the pixel shader would then have to read a block from the scratch texture, hash it to find its index in the main texture, then use the RWTexture to coordinate in the case where there are collisions.

Now SM5 has atomic intrinsic functions, but they aren't well documented and I can't find any examples on Google. For example, the InterlockedCompareStore() function doesn't make sense to me. I'm not sure what 'R' in the documentation should be. How do I specify both a particular UAV and its address in a single parameter? The asm instruction for it makes more sense (http://msdn.microsoft.com/en-us/library/windows/desktop/hh446821(v=vs.85).aspx), having both a dst and a dstAddress parameter, but I'm not sure how I would specify that in HLSL.

There's also the 'globallycoherent' keyword, for which the documentation states: "This storage class causes memory barriers and syncs to flush data across the entire GPU such that other groups can see writes. Without this specifier, a memory barrier or sync will flush only an unordered access view (UAV) within the current group.". Is this required when using atomic functions (i.e. the InterlockedWhatever() functions), or is it only required for normal reads/writes?

Any thoughts, links, or ideas appreciated. Thanks in advance for your time.

Unfortunately I haven't tried to use this myself, so I can't give you an unqualified answer. However, when I was doing some research on how these instructions worked way back when D3D11 first came out, I seem to recall that the resource name is used with a bracket notation to specify the address.

This is also going out on a limb, but I also seem to recall an AMD presentation about order independent transparency that creates a linked list for each pixel in a render target. The implementation was dependent on this type of updating of a resource in a synchronized way. If you search for the presentation (from GDC perhaps?) there is sample code in the slides that shows how they authored it. Sorry I can't give direct experience, but I think you will be able to find what you need there...

About the globallycoherent keyword - the atomic intrinsics will work with or without it. I think it only deals with the thread synchronization intrinsics, but again I'm not speaking directly from proof by use...

I hope that helps!

So are you thinking something like:


uint compare_value, store_value;
InterlockedCompareStore(texture[index],compare_value,store_value);

Weird, but I'll give it a try.

I'll take a look for that GDC presentation too, thank-you.

I believe for a texture you would have to supply either an int2 or an int3, depending on the dimensions of the texture. It should be the same notation that you would use if you don't sample the texture, but directly load it instead.
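For readers following along, here is a minimal sketch of what that bracket notation might look like on a 2D texture UAV. All resource and constant names are made up for illustration, and it assumes the UAVs are created with a 32-bit integer format such as R32_UINT, since atomics on typed UAVs need that. Four threads deliberately contend for each slot so that the compare-exchange has something to arbitrate.

```hlsl
// Hypothetical resources; names and register slots are assumptions.
RWTexture2D<uint> gSlotState : register(u0);  // one state value per hash slot
RWTexture2D<uint> gWonSlot   : register(u1);  // per-thread result, for inspection

static const uint SLOT_EMPTY   = 0;
static const uint SLOT_PENDING = 1;

[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // A 2D texture is addressed with a uint2 inside the brackets.
    // Dividing by 2 makes each 2x2 group of threads target the same slot.
    uint2 slot = id.xy / 2;

    // Atomically: if the slot holds SLOT_EMPTY, write SLOT_PENDING.
    // 'previous' receives whatever value was in the slot beforehand.
    uint previous;
    InterlockedCompareExchange(gSlotState[slot], SLOT_EMPTY, SLOT_PENDING, previous);

    // Exactly one of the contending threads sees SLOT_EMPTY and wins,
    // provided the slot was empty before the dispatch.
    gWonSlot[id.xy] = (previous == SLOT_EMPTY) ? 1 : 0;
}
```

InterlockedCompareStore takes the same first three arguments and simply omits the output parameter, so it fits cases where the thread does not need to know whether it won.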

I don't know enough about the intrinsics to comment on them specifically, but what you are planning to do sounds a lot like compaction based on the scan algorithm, and there's a good deal of literature about that. Perhaps you already know that, but otherwise I think that will give you some good keywords to do some more research on. If not, consider the keywords for the benefit of the thread's other readers.

Google "GPU scan algorithm" or "GPU compaction" for some good results, like this nVidia whitepaper: Efficient Parallel Scan Algorithms for GPUs. As a shameless plug for the people I work for: if you're using or open to using C++ AMP for your GPU computing needs, the C++ AMP Algorithms Library implements scan and probably some other things that would be useful to you. Unfortunately, for now C++ AMP is Visual Studio / Windows / D3D only, so it may not be an option. However, C++ AMP is an open specification and others are working on bringing support to other compilers and platforms. Notably, someone has prototyped an implementation over OpenCL using Clang, which would in theory support Mac and Linux once it's ready.

Good luck!
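As a rough illustration of the scan idea mentioned above, the sketch below computes, within a single thread group, an exclusive prefix sum over a buffer of 0/1 "keep" flags, which gives each kept element its compacted output offset. It is a plain Hillis-Steele scan rather than the work-efficient variants the literature covers, it does not combine results across groups, and every name in it is hypothetical.

```hlsl
#define GROUP_SIZE 256

// Hypothetical buffers: 1 = keep this element, 0 = discard it.
StructuredBuffer<uint>   gKeepFlags : register(t0);
RWStructuredBuffer<uint> gOffsets   : register(u0);

groupshared uint sScan[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void ScanCS(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    uint i    = gtid.x;
    uint flag = gKeepFlags[dtid.x];

    sScan[i] = flag;
    GroupMemoryBarrierWithGroupSync();

    // Hillis-Steele inclusive scan: each pass adds the value 'stride' slots back.
    for (uint stride = 1; stride < GROUP_SIZE; stride <<= 1)
    {
        uint addend = (i >= stride) ? sScan[i - stride] : 0;
        GroupMemoryBarrierWithGroupSync();  // all reads finish before any write
        sScan[i] += addend;
        GroupMemoryBarrierWithGroupSync();  // all writes finish before the next pass
    }

    // Inclusive sum minus the element's own flag gives the exclusive offset
    // of this element within its group.
    gOffsets[dtid.x] = sScan[i] - flag;
}
```

A full compaction would still need a second step that adds each group's total to the groups after it.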

Jason Z, I found the presentation you were talking about here: http://developer.amd.com/wordpress/media/2012/10/GDCE_2010_DX11.pdf

And yes you were correct, in the paper they used:


InterlockedExchange(tRWFragmentListHead[vScreenAddress], nNewHeadAddress, nOldHeadAddress );

Thanks Ravyne, I'll keep that in mind.

Another quick question. The documentation for InterlockedCompareExchange() says: "If you call InterlockedCompareExchange in a for or while compute shader loop, to properly compile, you must use the [allow_uav_condition] attribute on that loop.". But then the while statement documentation (http://msdn.microsoft.com/en-us/library/windows/desktop/bb509708(v=vs.85).aspx) states: "allow_uav_condition - Allows a compute shader loop termination condition to be based off of a UAV read. The loop must not contain synchronization intrinsics.". Doesn't InterlockedCompareExchange() count as a synchronization primitive?

I'm pretty sure that they're referring to the *MemoryBarrier/*MemoryBarrierWithGroupSync intrinsics, and not atomic operations.

I think MJP is right - the atomic intrinsics are not necessarily synchronization primitives, while the memory barriers are explicitly for synchronizing. In fact, the atomic intrinsics are usually used to interact with memory from multiple threads without synchronizing!
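To make the loop case concrete, here is a hedged sketch of a find-or-claim hash table insert that uses InterlockedCompareExchange inside a loop marked [allow_uav_condition], with no barrier intrinsics in the loop body. The buffer names, table size, and probe limit are all assumptions, and it treats a non-zero 32-bit key as if it uniquely identified a block, which a real block hash would not guarantee.

```hlsl
// Hypothetical resources; 0 in gSlotKeys means "slot is empty".
StructuredBuffer<uint>   gBlockKeys  : register(t0);  // one non-zero key per block
RWStructuredBuffer<uint> gSlotKeys   : register(u0);  // the hash table itself
RWStructuredBuffer<uint> gBlockSlots : register(u1);  // chosen slot per block

static const uint TABLE_SIZE   = 1u << 16;   // must be a power of two
static const uint MAX_PROBES   = 32;
static const uint INVALID_SLOT = 0xFFFFFFFF;

uint FindOrClaimSlot(uint key)
{
    uint slot   = key & (TABLE_SIZE - 1);
    uint result = INVALID_SLOT;

    // The exit condition depends on a value read back from a UAV.
    [allow_uav_condition]
    for (uint probe = 0; probe < MAX_PROBES; ++probe)
    {
        uint previous;
        InterlockedCompareExchange(gSlotKeys[slot], 0, key, previous);

        // Either this thread just claimed an empty slot, or an identical
        // key is already stored there; both cases resolve to this slot.
        if (previous == 0 || previous == key)
        {
            result = slot;
            break;
        }

        // Occupied by a different key: linear probe to the next slot.
        slot = (slot + 1) & (TABLE_SIZE - 1);
    }
    return result;
}

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    gBlockSlots[id.x] = FindOrClaimSlot(gBlockKeys[id.x]);
}
```

A thread that exhausts its probes gets INVALID_SLOT back, so a real implementation would need some fallback for a full or heavily clustered table.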
