I'm working on a voxel renderer and I'm using a pretty basic octree design to store the data. Each octree node is a 2x2x2 block of either indices or colors (in the case of leaf nodes). To keep memory usage down, identical blocks are merged. 'Compressing' to this format on the CPU is quite easy: I simply use a hash table to store the blocks and to find identical ones. Now I'd like to try to implement the 'compression' on the GPU.
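For reference, the CPU version boils down to something like the sketch below. It is a simplified illustration rather than my actual code: the names `Block`, `BlockHash`, `BlockPool` and `intern` are made up for this post, I'm assuming each of the eight entries is a 32-bit value, and the hash is just an FNV-1a-style mix over the eight words.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// One 2x2x2 node: eight 32-bit values (child indices, or colors in a leaf).
// The 32-bit entry size is an assumption made for this sketch.
using Block = std::array<std::uint32_t, 8>;

// FNV-1a-style mix applied per 32-bit word; any reasonable hash would do.
struct BlockHash {
    std::size_t operator()(const Block& b) const {
        std::uint64_t h = 14695981039346656037ull;
        for (std::uint32_t v : b) {
            h ^= v;
            h *= 1099511628211ull;
        }
        return static_cast<std::size_t>(h);
    }
};

// Stores each distinct block once and hands out its index.
class BlockPool {
public:
    // Returns the index of `b`, appending it only if it has not been seen.
    std::uint32_t intern(const Block& b) {
        auto it = lookup_.find(b);
        if (it != lookup_.end()) {
            return it->second;  // identical block already stored: merge
        }
        const auto index = static_cast<std::uint32_t>(blocks_.size());
        blocks_.push_back(b);
        lookup_.emplace(b, index);
        return index;
    }

    std::size_t size() const { return blocks_.size(); }

    const Block& at(std::uint32_t i) const {
        assert(i < blocks_.size());
        return blocks_[i];
    }

private:
    std::vector<Block> blocks_;                                   // block storage
    std::unordered_map<Block, std::uint32_t, BlockHash> lookup_;  // block -> index
};
```

Parent nodes then just store the indices returned by `intern` for their eight children, so identical subtrees collapse to the same index.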
The general idea as it stands is to have a main 'block' texture and to add to it incrementally from a scratch texture. The scratch texture would hold the raw/uncompressed voxel data, and the 'block' texture would essentially be a hash table that stores the blocks. I would generate a portion of my world directly to the scratch texture, 'compress' that to the main texture, and repeat.
The way I was going to set it up was to have the scratch and main textures as normal textures, and to use a pixel shader that outputs to an 'index' texture, which would save the destination index for each block in the scratch texture (the blocks themselves would be copied in a 2nd pass). There would also be an RWTexture to indicate which slots in the main texture are filled, empty, or to be filled in the 2nd pass. Each thread in the pixel shader would then have to read a block from the scratch texture, hash it to find its index in the main texture, and use the RWTexture to coordinate in the case where there are collisions.
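To make the coordination step concrete, here is a CPU-side C++ stand-in for what I have in mind, with a vector of `std::atomic` values playing the role of the RWTexture. Everything in it is hypothetical: the three slot states, the names `claim_slot` and `NO_SLOT`, and the use of linear probing are just my assumptions for this sketch. It only covers claiming a free slot with a compare-and-swap; checking whether an occupied slot already holds an identical block is left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Slot states, mirroring "empty, to be filled in the 2nd pass, filled".
enum : std::uint32_t { SLOT_EMPTY = 0, SLOT_PENDING = 1, SLOT_FILLED = 2 };

// Returned when every slot has been probed and none could be claimed.
constexpr std::uint32_t NO_SLOT = 0xFFFFFFFFu;

// Probes linearly starting at `hash` and returns the first slot that this
// caller managed to flip from EMPTY to PENDING. The compare-and-swap means
// that when several threads race for the same slot, exactly one of them wins
// and the others move on to the next slot.
std::uint32_t claim_slot(std::vector<std::atomic<std::uint32_t>>& slots,
                         std::uint32_t hash) {
    const std::size_t n = slots.size();
    assert(n > 0 && n < NO_SLOT);
    for (std::size_t probe = 0; probe < n; ++probe) {
        const std::size_t i = (hash + probe) % n;
        std::uint32_t expected = SLOT_EMPTY;
        if (slots[i].compare_exchange_strong(expected, SLOT_PENDING)) {
            return static_cast<std::uint32_t>(i);  // we own this slot now
        }
        // Slot was already PENDING or FILLED: collision, try the next one.
    }
    return NO_SLOT;  // table is full
}
```

On the GPU each pixel shader thread would do the equivalent of `claim_slot` and write the returned index to the 'index' texture, and the 2nd pass would copy the block and mark the slot as filled.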
Now SM5 has atomic intrinsic functions, but they aren't well documented and I can't find any examples on Google. For example, the InterlockedCompareStore() function doesn't make sense to me. I'm not sure what 'R' in the documentation should be. How do I specify both a particular UAV and its address in a single parameter? The asm instruction for it makes more sense (http://msdn.microsoft.com/en-us/library/windows/desktop/hh446821(v=vs.85).aspx), since it has both a dst and a dstAddress parameter, but I'm not sure how I would specify that in HLSL.
There's also the 'globallycoherent' keyword, about which the documentation states: "This storage class causes memory barriers and syncs to flush data across the entire GPU such that other groups can see writes. Without this specifier, a memory barrier or sync will flush only an unordered access view (UAV) within the current group." Is this required when using atomic functions (i.e. the InterlockedWhatever() functions), or is it only required for normal reads/writes?
Any thoughts, links, or ideas appreciated. Thanks in advance for your time.