griffin77

DirectX GPU upload/download bandwidth


Hi all....

I'm trying to implement a complicated GPGPU operation using DirectX 11 (I've tried DirectCompute/CUDA with limited success). I'm making good progress, but the bottleneck is uploading the data to the GPU and downloading the results from the GPU. Basically my operation involves:
[list]
[*]Uploading a 128x64 float4 texture to the GPU
[*]Rendering a 64x64 full-screen quad with a trivial pixel/vertex shader
[*]Downloading the resulting 64x64 float4 frame buffer
[/list]
I've tried various high-spec cards (currently a GeForce GTX 480) and various ways of parallelizing the operation (so more than one "job" is running concurrently), but the fastest I can get this operation to happen is about 1ms or so. If I remove the upload and download steps (and just wait for the quad to render), the operation takes around 0.15ms, so 85% of my time is being spent in upload/download.

It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame :( ), given the published bandwidth numbers for this kind of card. Am I being too optimistic about how long this kind of thing should take? Or am I doing something dumb in my DirectX code?

I'm uploading my input data as a 64x128 DXGI_FORMAT_R32G32B32A32_FLOAT texture. I create the input texture with usage D3D11_USAGE_DYNAMIC and CPUAccessFlags D3D11_CPU_ACCESS_WRITE. To write my data I Map the texture (passing D3D11_MAP_WRITE_DISCARD as the map type) and do a memcpy.
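In outline, the upload path looks something like this (a simplified sketch rather than my actual code; device, context, srcData and the size constants are just illustrative names):
[code]
// Create the dynamic input texture once, up front.
D3D11_TEXTURE2D_DESC desc = {};
desc.Width = kInputWidth;                       // input texture dimensions
desc.Height = kInputHeight;
desc.MipLevels = 1;
desc.ArraySize = 1;
desc.Format = DXGI_FORMAT_R32G32B32A32_FLOAT;
desc.SampleDesc.Count = 1;
desc.Usage = D3D11_USAGE_DYNAMIC;
desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

ID3D11Texture2D* inputTex = NULL;
device->CreateTexture2D(&desc, NULL, &inputTex);

// Per job: discard the previous contents and copy the new data in row by row
// (RowPitch can be larger than the tight row size, so don't memcpy the whole level blindly).
D3D11_MAPPED_SUBRESOURCE mapped;
context->Map(inputTex, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
const UINT rowBytes = kInputWidth * 4 * sizeof(float);
for (UINT y = 0; y < kInputHeight; ++y)
    memcpy((BYTE*)mapped.pData + y * mapped.RowPitch, srcData + y * rowBytes, rowBytes);
context->Unmap(inputTex, 0);
[/code]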

To download my result I create a 64x64 render-target texture (also DXGI_FORMAT_R32G32B32A32_FLOAT) with usage D3D11_USAGE_DEFAULT and CPUAccessFlags 0. I use CopyResource to copy the RT to a second staging texture (usage D3D11_USAGE_STAGING, CPUAccessFlags D3D11_CPU_ACCESS_READ), then do a Map on the staging texture.
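And the readback path, again as a rough sketch with made-up names:
[code]
// Staging texture matching the 64x64 float4 render target.
D3D11_TEXTURE2D_DESC st = {};
st.Width = 64;
st.Height = 64;
st.MipLevels = 1;
st.ArraySize = 1;
st.Format = DXGI_FORMAT_R32G32B32A32_FLOAT;
st.SampleDesc.Count = 1;
st.Usage = D3D11_USAGE_STAGING;
st.BindFlags = 0;
st.CPUAccessFlags = D3D11_CPU_ACCESS_READ;

ID3D11Texture2D* staging = NULL;
device->CreateTexture2D(&st, NULL, &staging);

// After the draw: GPU->GPU copy into the staging texture, then map it on the CPU.
context->CopyResource(staging, renderTargetTex);

D3D11_MAPPED_SUBRESOURCE mapped;
context->Map(staging, 0, D3D11_MAP_READ, 0, &mapped);   // blocks until the copy has finished
// ...read the results from mapped.pData, honouring mapped.RowPitch...
context->Unmap(staging, 0);
[/code]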

In order to get some parallelism (and hopefully some kind of pipelining, where one operation is uploading while another is downloading) I tried several things. I have tried having several sets of input/output/RT textures, invoking the draw (etc.) commands on all of them, and only then doing the RT map/unmap. I've also tried doing the same thing except using deferred contexts, so the upload happens via a command list built on a different thread. But I can't average any faster than 1ms per job no matter how many concurrent jobs I try to run.
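For what it's worth, the deferred-context variant was structured roughly like this (a sketch only; the names are illustrative and the worker-thread synchronization is omitted):
[code]
// On a worker thread: record this job's commands into a deferred context.
ID3D11DeviceContext* deferred = NULL;
device->CreateDeferredContext(0, &deferred);

deferred->UpdateSubresource(inputTex, 0, NULL, srcData, rowPitchBytes, 0);
deferred->PSSetShaderResources(0, 1, &inputSRV);
deferred->OMSetRenderTargets(1, &rtv, NULL);
deferred->Draw(4, 0);
deferred->CopyResource(stagingTex, renderTargetTex);

ID3D11CommandList* cmdList = NULL;
deferred->FinishCommandList(FALSE, &cmdList);

// Back on the main thread: submit the recorded work on the immediate context.
immediateContext->ExecuteCommandList(cmdList, FALSE);
cmdList->Release();
deferred->Release();
[/code]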

Any ideas? I can include more source code if anyone is interested.

Thanks all

[quote name='griffin77' timestamp='1306455802' post='4816249']
It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame :( ), given the published bandwidth numbers for this kind of card. Am I being too optimistic about how long this kind of thing should take? Or am I doing something dumb in my DirectX code?
[/quote]

Are you sure? Such an assumption is completely flawed, as there's a lot more going on which may just be overhead. I suggest you [b]TRY[/b] 16 "jobs" (and many more) per frame and see whether your framerate is truly 60fps or less, rather than just assuming.

BTW, parallelism isn't going to help you if you're bandwidth limited. And depending on the data you're sending, it may be a problem of latency rather than bandwidth. If that's the problem, you won't be able to make a single job faster than 1ms, but a job that consumes more bandwidth and ALU will still take around that same 1ms.

Cheers
Dark Sylinc

There's a difference between bandwidth and latency.
It might only take a fraction of a millisecond to complete your transfer, but there might be a whole millisecond of latency before the transfer is initiated.

Ideally in these kinds of parallel/distributed systems, you want to initiate the transfer long before you actually want to use the data, then later on come back and check if it's completed.
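For example (a rough sketch with made-up names): issue the copy as early as possible, then poll with D3D11_MAP_FLAG_DO_NOT_WAIT rather than blocking in Map.
[code]
// Kick off the GPU->staging transfer as soon as the draw has been submitted...
context->CopyResource(staging, renderTargetTex);

// ...do other CPU work / submit more jobs...

// ...then later, check whether the result is ready instead of stalling.
D3D11_MAPPED_SUBRESOURCE mapped;
HRESULT hr = context->Map(staging, 0, D3D11_MAP_READ, D3D11_MAP_FLAG_DO_NOT_WAIT, &mapped);
if (hr == DXGI_ERROR_WAS_STILL_DRAWING)
{
    // Not done yet; try again next time around instead of blocking.
}
else if (SUCCEEDED(hr))
{
    // Result is ready: consume mapped.pData, then release it.
    context->Unmap(staging, 0);
}
[/code]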

[quote name='Matias Goldberg' timestamp='1306456292' post='4816254']
[quote name='griffin77' timestamp='1306455802' post='4816249']
It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame :( ), given the published bandwidth numbers for this kind of card. Am I being too optimistic about how long this kind of thing should take? Or am I doing something dumb in my DirectX code?
[/quote]

Are you sure? Such an assumption is completely flawed, as there's a lot more going on which may just be overhead. I suggest you [b]TRY[/b] 16 "jobs" (and many more) per frame and see whether your framerate is truly 60fps or less, rather than just assuming.

BTW, parallelism isn't going to help you if you're bandwidth limited. And depending on the data you're sending, it may be a problem of latency rather than bandwidth. If that's the problem, you won't be able to make a single job faster than 1ms, but a job that consumes more bandwidth and ALU will still take around that same 1ms.

Cheers
Dark Sylinc
[/quote]

Thanks for the quick reply...

Yeah, I'm pretty sure my little DX test app is accurately simulating how my code will work in my real application. And no matter how I arrange it, I cannot get more than 16 jobs to execute in 16ms. And that's with a trivial pixel shader that does nothing but sample the input texture and write it to the output render target. I actually need at least about 4x that for this to be a viable solution :(

Pipelining should be able to help me, as PCI Express can do concurrent upload/download. So I should, in theory at least, be able to get a 3x speed-up by pipelining (e.g. one job is uploading data to the GPU, one is actually drawing, and one is downloading to the CPU, all at the same time), right? At least this is how I had this operation set up in CUDA/OpenCL. Also, I'm not convinced the DirectX API isn't in fact doing some work on the CPU that is taking a lot of that time (maybe some kind of swizzled texture conversion?). If that's the case I should be able to get a big speed-up from parallelism on my 8-core box.

It just *seems* to me that these numbers are too slow. I don't have much to base that on, however, other than a back-of-the-envelope calculation based on PCIe bandwidth.

[quote name='Hodgman' timestamp='1306456765' post='4816255']
There's a difference between bandwidth and latency.
It might only take a fraction of a millisecond to complete your transfer, but there might be a whole millisecond of latency before the transfer is initiated.

Ideally in these kinds of parallel/distributed systems, you want to initiate the transfer long before you actually want to use the data, then later on come back and check if it's completed.
[/quote]

Which is why I was hoping to get some benefit from parallelism (e.g. one job has a latency of 1ms, but I'm executing 16 jobs, so that 1ms delay is amortized over all of them), but I really don't see a big speed-up from doing that.

Of course I'm suspicious that I'm screwing something up so that the jobs are in fact being serialized (e.g. wait 1ms for job A, then again for job B, then for job C, etc.).

Oh, just realized. You're using a GeForce 480. That card [url="http://area.autodesk.com/forum/autodesk-3ds-max/installation---hardware---os/asus-geforce-gtx-480-problems-in-3ds-max-2011/"]SUCKS[/url] [url="http://forums.nvidia.com/index.php?showtopic=173749"]BAD[/url] [url="http://forums.nvidia.com/index.php?showtopic=173517"]in[/url] [url="http://www.nvnews.net/vbulletin/showthread.php?t=154355"]GPU->CPU[/url] transfers.

It's a common problem in the 400 series (except the Quadro versions, which is pretty lame if you ask me).

Find another card and try again. Or find a way not to use GPU->CPU transfers that much.


[b]Edit:[/b] Also looks to me you're stalling the GPU by breaking async transfers. I strongly suggest you [url="http://msdn.microsoft.com/en-us/library/bb205132%28v=VS.85%29.aspx"]read this[/url] (performance considerations section).
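The usual pattern from that article, roughly (a sketch with made-up names): copy into a small ring of staging textures and only Map the copy that was issued a couple of frames ago, so the Map never has to wait for the GPU.
[code]
// A couple of frames of latency is usually enough; 3 is just an example.
const int NUM_LATENCY_FRAMES = 3;
ID3D11Texture2D* stagingRing[NUM_LATENCY_FRAMES];    // created up front elsewhere
int frame = 0;

// Each job/frame:
int writeSlot = frame % NUM_LATENCY_FRAMES;           // copy into this slot now
int readSlot  = (frame + 1) % NUM_LATENCY_FRAMES;     // oldest slot, copied into N-1 frames ago

context->CopyResource(stagingRing[writeSlot], renderTargetTex);

D3D11_MAPPED_SUBRESOURCE mapped;
if (frame >= NUM_LATENCY_FRAMES - 1 &&
    SUCCEEDED(context->Map(stagingRing[readSlot], 0, D3D11_MAP_READ, 0, &mapped)))
{
    // This copy was queued a couple of frames back, so the Map shouldn't stall.
    context->Unmap(stagingRing[readSlot], 0);
}
++frame;
[/code]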

[quote name='Matias Goldberg' timestamp='1306462759' post='4816281']
Oh, just realized. You're using a GeForce 480. That card [url="http://area.autodesk.com/forum/autodesk-3ds-max/installation---hardware---os/asus-geforce-gtx-480-problems-in-3ds-max-2011/"]SUCKS[/url] [url="http://forums.nvidia.com/index.php?showtopic=173749"]BAD[/url] [url="http://forums.nvidia.com/index.php?showtopic=173517"]in[/url] [url="http://www.nvnews.net/vbulletin/showthread.php?t=154355"]GPU->CPU[/url] transfers.

It's a common problem in the 400 series (except the Quadro versions, which is pretty lame if you ask me).

Find another card and try again. Or find a way not to use GPU->CPU transfers that much.
[/quote]

Ahh that is good to know.

Though I've also tried it on my GTX 580 with similar results (I'll double-check tonight). Does that card have the same problem?

I also have a Quadro 4000 somewhere I can try (though when I was attempting to do this in CUDA/OpenCL it actually got much worse performance than my GTX 480).

OK so I think I figured this out. Thanks for all the advice.

My issue appears to be that the GPU really likes to be "warmed up" before doing any real computation. Obviously something is happening behind the scenes the first time you do a Map or UpdateSubresource that makes the first run really slow. My loop looks something like this:
[code]
// Submit every job's upload, draw and GPU->GPU copy first, without reading anything back.
for (int i = 0; i < NUM_THREADS; i++)
{
    // Upload this job's input data into its input texture.
    g_pD3DContext->UpdateSubresource(g_inputTexture[i], 0, NULL, g_inputData[i],
                                     g_jobDim*16*g_qwordsPerInputItem, g_jobDim*16);

    g_pD3DContext->PSSetShaderResources(0, 1, &g_inputTextureRV[i]);
    g_pD3DContext->PSSetSamplers(0, 1, &g_pSamplerLinear);

    g_pD3DContext->OMSetRenderTargets(1, &g_renderTargetView[i], NULL);

    // Render the full-screen quad with the trivial shader.
    g_pD3DContext->Draw(4, 0);

    // Queue the copy from this job's render target into its staging texture.
    g_pD3DContext->CopyResource(g_stagingBuffer[i], g_renderTargetTexture[i]);
}

// Only then read the results back; the first Map stalls until that job has finished.
for (int i = 0; i < NUM_THREADS; i++)
{
    D3D11_MAPPED_SUBRESOURCE mappedResource;
    g_pD3DContext->Map(g_stagingBuffer[i], 0, D3D11_MAP_READ, 0, &mappedResource);

    float *outputDataGPU = (float*)(mappedResource.pData);
    memcpy(outputDataCPU[n*NUM_THREADS + i], outputDataGPU, g_outputDataSize);

    g_pD3DContext->Unmap(g_stagingBuffer[i], 0);
}
[/code]

Basically, if I just run this a couple of times (passing dummy input data and ignoring the result of the final Map) before I run it "in anger", then after that I can run my jobs in less than 0.1ms.
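So now I just do something like this at startup (a sketch; RunAllJobs is a made-up wrapper around the loop above, not my actual code):
[code]
// Warm-up: run the full upload/draw/readback loop a couple of times with throwaway
// data and ignore the results, so whatever the driver does lazily on first use
// (allocation, paging, shader/driver setup) happens before the real work starts.
const int WARMUP_PASSES = 2;
for (int pass = 0; pass < WARMUP_PASSES; ++pass)
{
    RunAllJobs(dummyInputData, scratchOutput);   // hypothetical wrapper around the loop above
}
// After this, each real job completes in roughly 0.1ms, as described above.
[/code]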

[quote name='Matias Goldberg' timestamp='1306462759' post='4816281'][b]Edit:[/b] Also looks to me you're stalling the GPU by breaking async transfers. I strongly suggest you [url="http://msdn.microsoft.com/en-us/library/bb205132%28v=VS.85%29.aspx"]read this[/url] (performance considerations section).[/quote]

This can't be emphasised enough. Bandwidth is only one consideration for this kind of use case, and is most often not the most relevant one. You can have all the bandwidth in the world, you can have minimal latency, but if you need to stall the GPU you're still going to suffer.

[quote name='mhagain' timestamp='1306498190' post='4816401']
[quote name='Matias Goldberg' timestamp='1306462759' post='4816281'][b]Edit:[/b] Also looks to me you're stalling the GPU by breaking async transfers. I strongly suggest you [url="http://msdn.microsoft.com/en-us/library/bb205132%28v=VS.85%29.aspx"]read this[/url] (performance considerations section).[/quote]

This can't be emphasised enough. Bandwidth is only one consideration for this kind of use case, and is most often not the most relevant one. You can have all the bandwidth in the world, you can have minimal latency, but if you need to stall the GPU you're still going to suffer.
[/quote]

Yeah, that's why I run multiple concurrent jobs. I basically kick off a ton of jobs at the same time, and only once they are all churning away do I start waiting for results. While I'm stalled waiting for the Map to get the results of the first job, the rest can still be running. It seems to be working pretty well.

There just seems to be some weird warm-up cost the first time you read or write a GPU buffer; once I do a dummy run to get rid of that, bandwidth/latency are not a problem. Now that I'm down to 0.1ms per job, I think GPU compute will become my bottleneck (once I implement a non-trivial pixel shader to do the work I actually want to do).

