Questions about defragmenting a GPU buffer

8 comments, last by _the_phantom_ 7 years, 4 months ago

Hey Guys,

Recently I've been having a lot of fun playing with GPU atomics, compute shaders and DispatchIndirect, and I've run into a tricky situation:

================================Background===================================

In my project I am doing volume rendering, and to speed that up I partitioned my volume into blocks (each containing 8^3 voxels) and keep a GPU buffer holding the indices of the blocks which are not empty. So I have a large buffer of non-empty block coordinates, which I call occupiedBlocksBuf.

My scene in that volume is dynamic, so during each pass I have a compute shader update the whole volume (it is fast, since it only updates voxels that actually changed). As a result, some blocks which were previously empty become non-empty, and some blocks which were previously non-empty become empty.

For newly empty blocks which were previously non-empty, I also maintain a buffer called freedSlotsBuf: when a block gets freed, the compute shader first finds its idx in occupiedBlocksBuf, writes FREED_FLAG into that location, and then appends that idx to freedSlotsBuf.

For newly non-empty blocks which were previously empty, my compute shader first tries to get an available slot from freedSlotsBuf and writes the block's coordinate into that slot in occupiedBlocksBuf. So basically it fills freed slots in occupiedBlocksBuf first, and only if there are no more freed slots does it append the block's coordinate to the end of occupiedBlocksBuf.

So that's the basic idea. As you may notice, the 'size' of occupiedBlocksBuf never decreases, and as my program runs, in some cases that buffer becomes fragmented (lots of slots get freed), which is bad....
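
For anyone following along, the slot bookkeeping described above can be sketched on the CPU roughly like this. It is only a single-threaded model of the idea, not the actual shader: BlockList, Add, Free and the FREED_FLAG value are names and values assumed for illustration, and on the GPU the push/pop on freedSlotsBuf would presumably be done with atomics on a counter.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed sentinel value; the real FREED_FLAG is not given in the post.
constexpr uint32_t FREED_FLAG = 0xFFFFFFFFu;

// Hypothetical CPU-side model of the two GPU buffers.
struct BlockList {
    std::vector<uint32_t> occupied;   // stands in for occupiedBlocksBuf (packed block coords)
    std::vector<uint32_t> freedSlots; // stands in for freedSlotsBuf (indices into occupied)

    // A block became non-empty: reuse a freed slot if there is one, else append.
    uint32_t Add(uint32_t blockCoord) {
        if (!freedSlots.empty()) {
            const uint32_t slot = freedSlots.back();
            freedSlots.pop_back();
            occupied[slot] = blockCoord;
            return slot;
        }
        occupied.push_back(blockCoord);
        return static_cast<uint32_t>(occupied.size() - 1);
    }

    // A block became empty: flag its slot and remember the slot for reuse.
    void Free(uint32_t slot) {
        assert(slot < occupied.size() && occupied[slot] != FREED_FLAG);
        occupied[slot] = FREED_FLAG;
        freedSlots.push_back(slot);
    }
};
```

Note that occupied never shrinks here either: Free only leaves a flagged hole behind, which is exactly where the fragmentation comes from.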

===============================Problem=======================================

I then wrote a defragmentation shader (freedSlotsBuf tells me how many freed slots I have and where they are, so I've got everything I need), and I use DispatchIndirect to defragment occupiedBlocksBuf. The indirect params are written by a compute shader based on the size of freedSlotsBuf, so when the size of freedSlotsBuf is smaller than a threshold, the indirect params will be (0,1,1), which results in no threads being launched.
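
To make the two steps concrete, here is a rough CPU-side sketch, under stated assumptions, of what such a pass could look like. It is not the actual shader: MakeDefragArgs, Defragment, the FREED_FLAG value and the "one thread per freed slot" mapping are all assumed for illustration. The compaction simply moves live entries from the tail of the buffer into the holes that lie below the new size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t FREED_FLAG = 0xFFFFFFFFu; // assumed sentinel value

struct DispatchArgs { uint32_t x, y, z; };

// Below the threshold the args are (0,1,1), i.e. an empty dispatch.
// Assumes one thread per freed slot and threadsPerGroup threads per group.
DispatchArgs MakeDefragArgs(uint32_t freedCount, uint32_t threshold,
                            uint32_t threadsPerGroup) {
    if (freedCount < threshold) return {0, 1, 1};
    return {(freedCount + threadsPerGroup - 1) / threadsPerGroup, 1, 1};
}

// Fills every hole below the new size with a live entry taken from the tail,
// then shrinks the list. Assumes freedSlots holds each flagged slot exactly once.
std::size_t Defragment(std::vector<uint32_t>& occupied,
                       std::vector<uint32_t>& freedSlots) {
    const std::size_t newSize = occupied.size() - freedSlots.size();
    std::size_t tail = occupied.size(); // one past the next tail candidate
    for (uint32_t hole : freedSlots) {
        if (hole >= newSize) continue;  // hole is in the part that gets cut off
        // Walk backwards to the next live entry in the tail. There are exactly
        // as many live tail entries as holes below newSize, so this terminates.
        do { --tail; } while (occupied[tail] == FREED_FLAG);
        occupied[hole] = occupied[tail];
    }
    occupied.resize(newSize);
    freedSlots.clear();
    return newSize;
}
```

The backwards walk over the tail is inherently serial, so a parallel version would need a different way to pair holes with live tail entries (for example a prefix sum over the tail); the sketch is only meant to show what the pass has to achieve.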

However, doing defragmentation the way I described, I have to call the following code every frame on the CPU side, even though I know that 99% of the time it will map to an empty GPU dispatch.


void
TSDFVolume::OnDefragment(ComputeContext& cptCtx)
{
    GPU_PROFILE(cptCtx, L"Defragment");
    cptCtx.SetPipelineState(_cptBlockQDefragment);
    cptCtx.SetRootSignature(_rootsig);
    cptCtx.TransitionResource(_occupiedBlocksBuf, UAV);
    cptCtx.TransitionResource(_freedFuseBlocksBuf, csSRV);
    cptCtx.TransitionResource(_jobParamBuf, csSRV);
    cptCtx.TransitionResource(_indirectParams, IARG);
    Bind(cptCtx, 2, 0, 1, &_occupiedBlocksBuf.GetUAV());
    Bind(cptCtx, 2, 1, 1, &_freedFuseBlocksBuf.GetCounterUAV(cptCtx));
    Bind(cptCtx, 3, 0, 1, &_freedFuseBlocksBuf.GetSRV());
    Bind(cptCtx, 3, 1, 1, &_jobParamBuf.GetSRV());
    cptCtx.DispatchIndirect(_indirectParams, 48);
}

There will be UAV transitions and PSO changes, so it seems to have a definitely non-zero CPU/GPU cost, which looks sub-optimal....

To avoid that, we could let the CPU decide whether to call OnDefragment or not. But that requires reading back the freedSlotsBuf size from the GPU at some frequency, which may have an even worse perf impact....

So any suggestions? Or are there existing, better ways to do this kind of GPU buffer defragmentation?

Thanks in advance.

P.S. When does a UAV barrier actually have non-zero GPU cost? I feel like if I don't have any reads/writes between two UAV barriers on the same resource, then the second barrier should take no GPU time, right? (Please correct me if I got that wrong.)


I have exactly the same thing to do next; I've already implemented a defragmenting version on the CPU. I have to port it to the GPU now, but first I need to understand again how my code works... :)

But I already have the dispatch zero / little work problem multiple times in my pipeline, and even with Vulkan it's still a big problem (1% of runtime on CPU turns to 10% on GPU).

If you have this problem only once, at this spot, you probably don't need to worry so much about it. (I have multiple cases of 16 dispatches with barriers in between emitting only 100 wavefronts.)

Excluding zero-work dispatches on the CPU is not an option for me, as I use a prebuilt command buffer which must contain indirect dispatches for every piece of work that can possibly happen (probably you do the same in general?).

But I do make CPU-side decisions this way on my OpenCL 1.2 code path: there's no indirect dispatch, so I need to read back to get the amount of necessary work anyway.

Here the readback is very slow. It's better to dispatch zero work than to read back to the CPU. Probably this is true for any API and almost any use case.

Nvidia has a solution: build command buffers on the GPU. But that's an experimental extension and Vulkan only.

So i think one thing we can do is using async compute to keep working while there is little or no work on the other side.

I have had little success in the real world so far, but I opened this thread with some synthetic results: http://www.gamedev.net/topic/684589-confusing-performance-with-async-compute/

So i think one thing we can do is using async compute to keep working while there is little or no work on the other side.

Thanks JoeJ for the reply. I did think about using async compute, but after a little bit of Google searching I was so disappointed to learn that async compute is not 'async' on Nvidia GPUs right now, and I have to do this project on an Nvidia GPU, so using async compute would only bring me more sync issues and no perf gain...

Also, you mentioned you use a prebuilt command buffer. Do you mean that you have more than one command executed in one ExecuteIndirect API call? In my case I need to switch PSOs between commands, so right now it seems I can't do this with a prebuilt command buffer (am I wrong?). Also, I feel like ExecuteIndirect is somewhat limited right now: all commands within one indirect command buffer need to use the same shader and the same pipeline settings.... So if you don't mind, I'd really like to know what is inside your prebuilt command buffer :)

Thanks

I have everything in one command buffer. I have never used DX, but I'm pretty sure it offers the same possibilities. It's just a sequence of all setup / dispatch commands in the order you need.

E.g. here's the code for the testing I mentioned:

vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo);

            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST2], 0, 1, &descriptorSets[tTEST2], 0, nullptr);     
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST2]]);
            int barrierBuffers[] = {bTEST2};
            //vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (0 + OFFSET_DISPATCH_CL_TO_GRAND) );
            for (int i=0; i<100; i++)
            {
                vkCmdDispatch(commandBuffer, 1,1,1);
                MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
 
// in practice i use 10-20 sections like the above to process all things i need: animate, rebuild tree, plan work, raytracing, gathering, simple things similar to building mip maps, ... does not matter what it fits all in.
 
vkEndCommandBuffer(commandBuffer);

So I pack my whole algorithm into one command buffer and I need only one 'draw call' per frame to execute it.

(You can also divide stuff into multiple command buffers, enqueue them in order and still execute the queue just once.)

The only exception is when I want to use async compute: then I need to execute more than one queue.

No further GPU<->CPU communication is necessary if the hardware can handle it all on its own.

You should try this, it probably already solves your problem. Vulkan is almost twice as fast as OpenCL for me because of this overhead reduction. (The shaders themselves have similar execution times for both APIs.)

but after a little bit of Google searching I was so disappointed to learn that async compute is not 'async' on Nvidia GPUs right now

I'd try it myself. Hardware support is probably limited in comparison to AMD, but there might be something...


Thanks for the reply, I think I got confused by 'command buffer' at first. So if I got it right, your prebuilt command buffer is a persistent command list in DX12 (which basically is a command list you create once and never call Reset on, and every frame you just submit it to the command queue. CORRECT ME IF I AM WRONG...). I haven't touched Vulkan yet, but if DX12 and Vulkan are very similar, I think there is one more thing which should be very beneficial in your case: in DX12 there is a 'bundle', which is said to have even lower overhead when executed compared to a persistent command list, see this

And I think I should definitely try bundles too. As for avoiding zero-thread dispatches, the perfect solution would be being able to build the full command list (including PSO changes, etc.) from the GPU, which as you said is only available on NV Vulkan...

Thanks for sharing those with me, they really help

Ah yes, I'm not aware of some differences in terminology. It's like you say, so prebuilt command buffer == persistent command list.

Maybe bundles correspond to primary / secondary command buffers in VK. I'll look it up, but I guess the use case is the graphics pipeline and it does not matter for compute.

Still no luck with async on a real project. It should go from 1.1 ms to 1.0 ms, but I end up at 1.2 ms. I need to do another test with indirect dispatch and queue synchronization...

There will be UAV transitions and PSO changes, so it seems to have a definitely non-zero CPU/GPU cost, which looks sub-optimal....
To avoid that, we could let the CPU decide whether to call OnDefragment or not. But that requires reading back the freedSlotsBuf size from the GPU at some frequency, which may have an even worse perf impact....

P.S. When does a UAV barrier actually have non-zero GPU cost? I feel like if I don't have any reads/writes between two UAV barriers on the same resource, then the second barrier should take no GPU time, right? (Please correct me if I got that wrong.)


On the UAV barriers; the driver might catch it BUT I would avoid it if you can just to be on the safe side.

When it comes to the transitions themselves; what are you transitioning the UAV from and to?
(Also, I assume you are grouping your transitions, not issuing them one at a time?)

When it comes to the transitions themselves; what are you transitioning the UAV from and to?

Thanks for the reply. In my case the UAV barrier just ensures all previous writes are visible, so it goes from UAV to UAV. The problem is that the barrier is only useful if the following DispatchIndirect is not empty. But in general, since this DispatchIndirect is just a defragment CS, the indirect params will be (0,1,1) 90% of the time, so no compute shader will be launched, and all the overhead of issuing UAV barriers, setting the PSO and calling DispatchIndirect for this defragment CS is just wasted 90% of the time. So I was wondering whether the driver might figure out that when a DispatchIndirect has zero CS threads, it can ignore the preceding and following UAV barriers, since the memory is not touched.... But I am not sure about that....

and if not, this overhead will make me feel really sad....

and if not, this overhead will make me feel really sad....

But you shouldn't. Assuming you do defragmentation when camera or objects move, there is no real win if you save something just because nothing's happening. It's always the average worst case that matters ;)

Thanks for the reply. In my case the UAV barrier just ensures all previous writes are visible, so it goes from UAV to UAV. The problem is that the barrier is only useful if the following DispatchIndirect is not empty. But in general, since this DispatchIndirect is just a defragment CS, the indirect params will be (0,1,1) 90% of the time, so no compute shader will be launched, and all the overhead of issuing UAV barriers, setting the PSO and calling DispatchIndirect for this defragment CS is just wasted 90% of the time. So I was wondering whether the driver might figure out that when a DispatchIndirect has zero CS threads, it can ignore the preceding and following UAV barriers, since the memory is not touched....


Ah, that kind of transition; yeah, I think of that more as a fence but the D3D docs use transition so..

The problem is you need that barrier there to ensure the operations are completed before launch; regardless of whether it is a CPU choice or a GPU choice, those values will need to be completed and, potentially, flushed.

So you are pretty much stuck with those barriers. Even if you wanted to do the work of deciding on the CPU side, you'd need those barriers in place to ensure the GPU work was done before you started the readback dance, which might itself require plenty of operations to ensure the data is in a readable format and available to the CPU (CPU-visible memory or via a copy-queue readback).

Without knowing more about your setup: you might be able to reduce the stall by using split barriers; IIRC that involves issuing the start of the barrier as soon as possible but not closing/finishing it until just before you need it, to give the GPU time to do things. (Also, as mentioned, make sure you transition more than one thing at once, rather than having 4 separate barriers as in this case.) This, however, requires that you have some work between the data build and the defrag operations to give some time for the transition to happen.

The CPU overhead for this, as you have it, shouldn't be that bad anyway; command lists are designed to be cheap to build, so I wouldn't worry about that.

GPU-wise, well, as mentioned, regardless you'll need a barrier to make sure the compute work completes. Beyond that, the dispatch of work happens at the front-end level, so it might be able to look at the source buffer, see it has a 'zero dispatch' and simply not launch any work into the CUDA/CU cores. That's about as low as you'll get overhead-wise, and it also feels sane, as the front end needs to know the dispatch size anyway to find space for the work.

You might want to confirm with profiling (or someone who knows the IHVs specifics might be able to chime in) but I think you've probably got a reasonable setup as it stands.

The only thing you might want to tweak is;
- split barriers if possible
- batch transitions together
