Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

10 replies to this topic

### #1 griffin77  Members   -  Reputation: 125

Posted 26 May 2011 - 06:23 PM

Hi all....

I'm trying to implement a complicated GPGPU operation using DirectX 11 (I tried DirectCompute/CUDA with limited success). I'm making good progress, but the bottleneck is uploading the data to the GPU and downloading the results from the GPU. Basically my operation involves:
• Render a 64x64 full screen quad with a trivial pixel/vertex shader
I've tried various high-spec cards (currently using a GeForce GTX 480) and various ways of parallelizing the operation (so more than one "job" is running concurrently), but the fastest I can get this operation to run is about 1ms. If I remove the upload and download steps (and just wait for the quad to render), the operation takes around 0.15ms, so 85% of my time is being spent in upload/download.

It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame), given the published bandwidth numbers of this kind of card. Am I being too optimistic about how long this kind of thing should take? Or am I doing something dumb in my DirectX code?

I'm uploading my input data as a 64x128 DXGI_FORMAT_R32G32B32A32_FLOAT texture. I create the input texture with usage D3D11_USAGE_DYNAMIC and CPUAccessFlags D3D11_CPU_ACCESS_WRITE. To write my data I map the texture (passing in usage as D3D11_MAP_WRITE_DISCARD) and do a memcpy.

To download my result I create a 64x64 render-target texture (also DXGI_FORMAT_R32G32B32A32_FLOAT) with usage D3D11_USAGE_DEFAULT and CPUAccessFlags 0. I use CopyResource to copy the RT to a second staging texture (Usage D3D11_USAGE_STAGING, CPUAccessFlags D3D11_CPU_ACCESS_READ), then map the staging texture.

In order to get some parallelism (and hopefully some kind of pipelining where one operation is uploading while another is downloading) I tried several things. I have tried having several sets of input/output/RT textures, invoking the draw, etc. commands on all of them, then doing the RT map/unmap. I've also tried doing the same except using deferred contexts to do the upload via a command list built in a different thread. But I can't average any faster than 1ms per job no matter how many concurrent jobs I try to run.

Any ideas? I can include more source code if anyone is interested.

Thanks all

### #2 Matias Goldberg  Crossbones+   -  Reputation: 3056

Posted 26 May 2011 - 06:31 PM

> It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame), given the published bandwidth numbers of this kind of card. Am I being too optimistic about how long this kind of thing should take? Or am I doing something dumb in my DirectX code?

Are you sure? Such an assumption is completely flawed, as there's a lot more stuff going on which may be just overhead. I suggest you TRY 16 "jobs" (and many more) per frame and see whether your framerate is truly 60fps or less, rather than just assuming.

BTW, parallelism isn't going to help you if you're bandwidth limited. And depending on the data you're sending, it may be a problem of latency rather than bandwidth. If that's the problem, it means you won't be able to make a job faster than 1ms, but a job that consumes more bandwidth and ALU is still going to take around 1ms.

Cheers
Dark Sylinc

### #3 Hodgman  Moderators   -  Reputation: 28595

Posted 26 May 2011 - 06:39 PM

There's a difference between bandwidth and latency.
It might only take a fraction of a millisecond to complete your transfer, but there might be a whole millisecond of latency before the transfer is initiated.

Ideally in these kinds of parallel/distributed systems, you want to initiate the transfer long before you actually want to use the data, then later on come back and check if it's completed.

### #4 griffin77  Members   -  Reputation: 125

Posted 26 May 2011 - 06:45 PM

>> It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame), given the published bandwidth numbers of this kind of card. Am I being too optimistic about how long this kind of thing should take? Or am I doing something dumb in my DirectX code?
>
> Are you sure? Such an assumption is completely flawed, as there's a lot more stuff going on which may be just overhead. I suggest you TRY 16 "jobs" (and many more) per frame and see whether your framerate is truly 60fps or less, rather than just assuming.
>
> BTW, parallelism isn't going to help you if you're bandwidth limited. And depending on the data you're sending, it may be a problem of latency rather than bandwidth. If that's the problem, it means you won't be able to make a job faster than 1ms, but a job that consumes more bandwidth and ALU is still going to take around 1ms.
>
> Cheers
> Dark Sylinc

Yeah, I'm pretty sure my little DX test app is accurately simulating how my code will work in my real application. And no matter how I arrange it, I cannot get more than 16 jobs to execute in 16ms. And that's with a trivial pixel shader that does nothing but sample the input texture and write it to the output render target. I actually need at least about 4x that for this to be a viable solution.

Pipelining should be able to help me, as PCI Express can do concurrent upload/download. So I should, in theory at least, be able to get a 3x speed-up by pipelining (e.g. one job is uploading data to the GPU, one is actually drawing, and one is downloading to the CPU, all at the same time), right? At least this is how I had this operation set up in CUDA/OpenCL. Also, I'm not convinced the DirectX API isn't in fact doing some work on the CPU that is actually taking a lot of that time (maybe some kind of swizzled texture conversion?). If that is the case I should be able to get a big speed-up from parallelism on my 8-core box.

It just *seems* to me that these numbers are too slow. I don't have a whole lot to base that on, however, other than a back-of-the-envelope calculation based on PCI Express bandwidth.
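
For reference, a back-of-the-envelope estimate of this kind can be sketched as below. The texture dimensions and format are the ones given in the first post; the 8 GB/s link rate is an assumption (the nominal per-direction figure for a PCIe 2.0 x16 slot), and real-world throughput will be lower. The function and constant names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of a width x height texture of DXGI_FORMAT_R32G32B32A32_FLOAT
// texels (4 channels x 4 bytes = 16 bytes per texel).
constexpr std::size_t rgba32f_bytes(std::size_t width, std::size_t height) {
    return width * height * 16;
}

// Ideal transfer time in microseconds for a payload at a given link rate.
// This ignores latency, driver overhead and synchronization entirely.
double ideal_transfer_us(std::size_t bytes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return static_cast<double>(bytes) / bytes_per_second * 1e6;
}

// ASSUMPTION: nominal PCIe 2.0 x16 rate per direction, not a measured value.
constexpr double kAssumedLinkBytesPerSec = 8.0e9;
```

Under that assumption, `ideal_transfer_us(rgba32f_bytes(64, 128), kAssumedLinkBytesPerSec)` comes to roughly 16 µs for the 128 KiB upload, and the 64 KiB download to roughly 8 µs. Both are a small fraction of the 1ms measured per job, which is consistent with the suggestion that latency or synchronization, rather than raw bandwidth, dominates here.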

### #5 griffin77  Members   -  Reputation: 125

Posted 26 May 2011 - 06:50 PM

> There's a difference between bandwidth and latency.
> It might only take a fraction of a millisecond to complete your transfer, but there might be a whole millisecond of latency before the transfer is initiated.
>
> Ideally in these kinds of parallel/distributed systems, you want to initiate the transfer long before you actually want to use the data, then later on come back and check if it's completed.

Which is why I was hoping to get some benefit from parallelism (e.g. one job has a latency of 1ms but I'm executing 16 jobs, so that 1ms delay is amortized over all of them), but I really don't see a big speed-up by doing that.

Of course I'm suspicious I'm screwing something up so that in fact the jobs are being serialized (e.g. wait 1ms for job A, then again for job B, then for job C, etc).
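
The gap between the amortized and the serialized case can be put in numbers with a toy model. It assumes each job pays a fixed latency plus a fixed amount of work, and that in the overlapped case the latency is hidden perfectly behind the other jobs. `latency_ms` and `work_ms` are hypothetical parameters for illustration, not measurements.

```cpp
#include <cassert>

// Total time when every job waits out its own latency before the next starts.
double serialized_ms(int jobs, double latency_ms, double work_ms) {
    assert(jobs >= 0);
    return jobs * (latency_ms + work_ms);
}

// Total time when all jobs are issued up front, so the latency is paid once
// and only the per-job work accumulates (idealized: perfect overlap).
double overlapped_ms(int jobs, double latency_ms, double work_ms) {
    assert(jobs >= 0);
    return jobs > 0 ? latency_ms + jobs * work_ms : 0.0;
}
```

With 16 jobs, 1ms of latency and 0.15ms of work per job, the serialized case gives 18.4ms and the overlapped case 3.4ms. A measured total close to the first figure would point towards the jobs being serialized somewhere.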

### #6 Matias Goldberg  Crossbones+   -  Reputation: 3056

Posted 26 May 2011 - 08:19 PM

Oh, just realized. You're using a GeForce GTX 480. That card SUCKS BAD at GPU->CPU transfers.

It's a common problem in the 400 series (except the Quadro version, pretty lame if you ask me).

Find another card and try again. Or find a way not to use GPU->CPU transfers that much.

Edit: It also looks to me like you're stalling the GPU by breaking async transfers. I strongly suggest you read this (performance considerations section).

### #7 griffin77  Members   -  Reputation: 125

Posted 26 May 2011 - 08:25 PM

> Oh, just realized. You're using a GeForce GTX 480. That card SUCKS BAD at GPU->CPU transfers.
>
> It's a common problem in the 400 series (except the Quadro version, pretty lame if you ask me).
>
> Find another card and try again. Or find a way not to use GPU->CPU transfers that much.

Ahh, that is good to know.

Though I've also tried on my GTX 580 with similar results (will double check tonight though). Does that have the same problem?

I also have a Quadro 4000 somewhere I can try (but when I was attempting to do this in CUDA/OpenCL, that actually got much worse performance than my GTX 480).

### #8 griffin77  Members   -  Reputation: 125

Posted 27 May 2011 - 02:01 AM

OK so I think I figured this out. Thanks for all the advice.

My issue appears to be that the GPU really likes to be "warmed up" before doing any real computation. Obviously something is happening behind the scenes the first time you do a Map or UpdateSubresource that makes the first run really slow. My loop looks something like this:

    // Pass 1: queue upload, draw and readback copy for every job without waiting.
    for (int i = 0; i < NUM_THREADS; i++)
    {
        g_pD3DContext->UpdateSubresource( g_inputTexture[i], 0, NULL, g_inputData[i], g_jobDim*16*g_qwordsPerInputItem, g_jobDim*16 );
        g_pD3DContext->PSSetSamplers( 0, 1, &g_pSamplerLinear );
        g_pD3DContext->OMSetRenderTargets( 1, &g_renderTargetView[i], NULL );
        g_pD3DContext->Draw( 4, 0 );
        g_pD3DContext->CopyResource( g_stagingBuffer[i], g_renderTargetTexture[i] );
    }

    // Pass 2: map each staging texture to read the results back.
    // Assumed second loop over the jobs; the snippet as posted omits this loop header.
    for (int i = 0; i < NUM_THREADS; i++)
    {
        D3D11_MAPPED_SUBRESOURCE mappedResource;
        g_pD3DContext->Map( g_stagingBuffer[i], 0, D3D11_MAP_READ, 0, &mappedResource );
        float *outputDataGPU = (float*)(mappedResource.pData);
        g_pD3DContext->Unmap( g_stagingBuffer[i], 0 );
    }


Basically, if I just run this a couple of times (passing dummy input data and ignoring the result of the final Map statement) before I run it "in anger", then after that I can run my jobs in less than 0.1ms.
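
A measurement that keeps the warm-up runs out of the average can be sketched in plain C++ as follows. This is a generic timing helper, not part of the test app above; `average_ms_after_warmup` and its parameters are hypothetical names, and `job` stands in for one full pass of the loop.

```cpp
#include <cassert>
#include <chrono>

// Runs `job` `warmup` times without timing it, then returns the average
// wall-clock milliseconds per call over `iterations` timed calls.
template <typename Job>
double average_ms_after_warmup(Job&& job, int warmup, int iterations) {
    assert(warmup >= 0 && iterations > 0);

    // Discarded runs: these absorb any one-off first-use cost.
    for (int i = 0; i < warmup; ++i) job();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) job();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

For GPU work the timed `job` would need to include the blocking read-back (the Map on the staging texture), since the other D3D11 calls only queue commands; otherwise the figure mostly reflects CPU submission time.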

### #9 griffin77  Members   -  Reputation: 125

Posted 27 May 2011 - 02:12 AM

Though you were right about the 480, BTW. That 0.1ms figure is for my 580. My 480 is half that speed (0.2ms).

### #10 mhagain  Crossbones+   -  Reputation: 7603

Posted 27 May 2011 - 06:09 AM

> Edit: It also looks to me like you're stalling the GPU by breaking async transfers. I strongly suggest you read this (performance considerations section).

This can't be emphasised enough. Bandwidth is only one consideration for this kind of use case, and is most often not the most relevant one. You can have all the bandwidth in the world, you can have minimal latency, but if you need to stall the GPU you're still going to suffer.

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.

### #11 griffin77  Members   -  Reputation: 125

Posted 27 May 2011 - 10:31 AM

>> Edit: It also looks to me like you're stalling the GPU by breaking async transfers. I strongly suggest you read this (performance considerations section).
>
> This can't be emphasised enough. Bandwidth is only one consideration for this kind of use case, and is most often not the most relevant one. You can have all the bandwidth in the world, you can have minimal latency, but if you need to stall the GPU you're still going to suffer.

Yeah, that's why I run multiple concurrent jobs. I basically kick off a ton of jobs at the same time, and only once they are all churning away do I start waiting for results. While I'm stalled waiting for the map to get the results of the first job, the rest can still be running. It seems to be working pretty well.

There just seems to be some weird warm-up cost the first time you do a read or write to a GPU buffer. Once I do a dummy run to get rid of that, bandwidth/latency are not a problem. Now that I'm down to 0.1ms per job, I think GPU compute will become my bottleneck (once I implement a non-trivial pixel shader to do the work I want to do).
