
## Recommended Posts

Hi all....

I'm trying to implement a complicated GPGPU operation using DirectX 11 (Tried DirectCompute/CUDA with limited success). I'm making good progress, but the bottleneck is uploading the data to the GPU and downloading the results from the GPU. Basically my operation involves:
• Upload the input data to the GPU as a texture
• Render a 64x64 full-screen quad with a trivial vertex/pixel shader
• Download the resulting 64x64 float4 frame buffer

I've tried various high-spec cards (currently using a GeForce GTX 480) and various ways of parallelizing the operation (so more than one "job" runs concurrently), but the fastest I can get this operation to happen is about 1ms or so. If I remove the upload and download steps (and just wait for the quad to render), the operation takes around 0.15ms, so 85% of my time is being spent in upload/download.

It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame), given the published bandwidth numbers for this kind of card. Am I being too optimistic about how long this kind of thing should take? Or am I doing something dumb in my DirectX code?

I'm uploading my input data as a 64x128 DXGI_FORMAT_R32G32B32A32_FLOAT texture. I create the input texture with usage D3D11_USAGE_DYNAMIC and CPUAccessFlags D3D11_CPU_ACCESS_WRITE. To write my data I Map the texture (with map type D3D11_MAP_WRITE_DISCARD) and do a memcpy.

To download my result I create a 64x64 render-target texture (also DXGI_FORMAT_R32G32B32A32_FLOAT) with usage D3D11_USAGE_DEFAULT and CPUAccessFlags 0. I use CopyResource to copy the RT to a second staging texture (usage D3D11_USAGE_STAGING, CPUAccessFlags D3D11_CPU_ACCESS_READ), then Map the staging texture.

In order to get some parallelism (and hopefully some pipelining, where one operation is uploading while another is downloading) I have tried several things. I have tried having several sets of input/output/RT textures, invoking the draw etc. commands on all of them, then doing the RT map/unmap. I've also tried the same thing but using deferred contexts to do the upload via a command list built on a different thread. But I can't average any faster than 1ms per job, no matter how many concurrent jobs I run.

Any ideas? I can include more source code if anyone is interested.

Thanks all

##### Share on other sites

[quote name='griffin77']
It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame ), given the published bandwidth numbers of this kind of card. Am I being too optimistic about how long this kind of thing should take ? Or I am doing something dumb in my DirectX code ?
[/quote]

Are you sure? Such an assumption is flawed, as there's a lot more going on that may just be overhead. I suggest you TRY 16 "jobs" (and many more) per frame and see whether your framerate is truly 60fps or less, rather than just assuming.

BTW, parallelism isn't going to help you if you're bandwidth-limited. And depending on the data you're sending, it may be a problem of latency rather than bandwidth. If that's the problem, it means you won't be able to make it faster than 1ms; consuming more bandwidth and ALU is still going to take around 1ms.

Cheers
Dark Sylinc

##### Share on other sites
There's a difference between bandwidth and latency.
It might only take a fraction of a millisecond to complete your transfer, but there might be a whole millisecond of latency before the transfer is initiated.

Ideally in these kinds of parallel/distributed systems, you want to initiate the transfer long before you actually want to use the data, then later on come back and check if it's completed.

##### Share on other sites

[quote name='griffin77' timestamp='1306455802' post='4816249']
It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame ), given the published bandwidth numbers of this kind of card. Am I being too optimistic about how long this kind of thing should take ? Or I am doing something dumb in my DirectX code ?

Are you sure? Such an assumption is flawed, as there's a lot more going on that may just be overhead. I suggest you TRY 16 "jobs" (and many more) per frame and see whether your framerate is truly 60fps or less, rather than just assuming.

BTW, parallelism isn't going to help you if you're bandwidth-limited. And depending on the data you're sending, it may be a problem of latency rather than bandwidth. If that's the problem, it means you won't be able to make it faster than 1ms; consuming more bandwidth and ALU is still going to take around 1ms.

Cheers
Dark Sylinc
[/quote]

Yeah, I'm pretty sure my little DX test app accurately simulates how my code will work in my real application. And no matter how I arrange it, I cannot get more than 16 jobs to execute in 16ms. And that's with a trivial pixel shader that does nothing but sample the input texture and write it to the output render target. I actually need at least about 4x that for this to be a viable solution.

Pipelining should be able to help me, as PCI Express can do concurrent upload and download. So I should, in theory at least, be able to get a 3x speed-up by pipelining (e.g. one job uploading data to the GPU, one actually drawing, and one downloading to the CPU, all at the same time), right? At least this is how I had the operation set up in CUDA/OpenCL. Also, I suspect the DirectX API is in fact doing some work on the CPU that accounts for a lot of that time (maybe some kind of swizzled texture conversion?). If so, I should be able to get a big speed-up through parallelism on my 8-core box.

It just *seems* to me that these numbers are too slow. I don't have much to base that on, however, other than a back-of-the-envelope calculation from PCIe bandwidth.

##### Share on other sites

[quote]
There's a difference between bandwidth and latency.
It might only take a fraction of a millisecond to complete your transfer, but there might be a whole millisecond of latency before the transfer is initiated.

Ideally in these kinds of parallel/distributed systems, you want to initiate the transfer long before you actually want to use the data, then later on come back and check if it's completed.
[/quote]

Which is why I was hoping to get some benefit from parallelism (e.g. one job has a latency of 1ms, but I'm executing 16 jobs, so that 1ms delay is amortized over all of them), but I really don't see a big speed-up from doing that.

Of course, I'm suspicious that I'm screwing something up so that the jobs are in fact being serialized (e.g. wait 1ms for job A, then again for job B, then for job C, etc.).

##### Share on other sites
Oh just realized. You're using Geforce 480. That card SUCKS BAD in GPU->CPU transfers.

It's a common problem in the 400 series (except Quadro version, pretty lame if you ask me).

Find another card and try again. Or find a way not to use GPU->CPU transfers that much

Edit: Also looks to me you're stalling the GPU by breaking async transfers. I strongly suggest you read this (performance considerations section).

##### Share on other sites

[quote]
Oh just realized. You're using Geforce 480. That card SUCKS BAD in GPU->CPU transfers.

It's a common problem in the 400 series (except Quadro version, pretty lame if you ask me).

Find another card and try again. Or find a way not to use GPU->CPU transfers that much
[/quote]

Ahh that is good to know.

Though I've also tried on my GTX 580 with similar results (will double check tonight though). Does that have the same problem?

I also have a Quadro 4000 somewhere I can try (though when I was attempting this in CUDA/OpenCL, it actually got much worse performance than my GTX 480).

##### Share on other sites
OK so I think I figured this out. Thanks for all the advice.

My issue appears to be that the GPU really likes to be "warmed up" before doing any real computation. Evidently something happens behind the scenes the first time you do a Map or UpdateSubresource that makes the first run really slow. My loop looks something like this:
```cpp
for (int i = 0; i < NUM_THREADS; i++)
{
    g_pD3DContext->UpdateSubresource( g_inputTexture, 0, NULL, g_inputData,
                                      g_jobDim*16*g_qwordsPerInputItem, g_jobDim*16 );
    g_pD3DContext->PSSetShaderResources( 0, 1, &g_inputTextureRV );
    g_pD3DContext->PSSetSamplers( 0, 1, &g_pSamplerLinear );
    g_pD3DContext->OMSetRenderTargets( 1, &g_renderTargetView, NULL );
    g_pD3DContext->Draw( 4, 0 );
    g_pD3DContext->CopyResource( g_stagingBuffer, g_renderTargetTexture );
}
for (int i = 0; i < NUM_THREADS; i++)
{
    D3D11_MAPPED_SUBRESOURCE mappedResource;
    g_pD3DContext->Map( g_stagingBuffer, 0, D3D11_MAP_READ, 0, &mappedResource );
    float *outputDataGPU = (float*)mappedResource.pData;
    memcpy( outputDataCPU[n*NUM_THREADS + i], outputDataGPU, g_outputDataSize );
    g_pD3DContext->Unmap( g_stagingBuffer, 0 );
}
```

Basically, if I just run this a couple of times (passing dummy input data and ignoring the result of the final Map) before I run it "in anger", then after that each job runs in less than 0.1ms.

##### Share on other sites
You were right about the 480, BTW. That 0.1ms figure is for my 580; my 480 is half that speed (0.2ms).

##### Share on other sites
[quote]
Edit: Also looks to me you're stalling the GPU by breaking async transfers. I strongly suggest you read this (performance considerations section).
[/quote]

This can't be emphasised enough. Bandwidth is only one consideration for this kind of use case, and is most often not the most relevant one. You can have all the bandwidth in the world, you can have minimal latency, but if you need to stall the GPU you're still going to suffer.
