griffin77

DirectX GPU upload/download bandwidth


Hi all....

I'm trying to implement a complicated GPGPU operation using DirectX 11 (I've tried DirectCompute and CUDA with limited success). I'm making good progress, but the bottleneck is uploading the data to the GPU and downloading the results from the GPU. Basically my operation involves:
[list]
[*]Uploading a 128x64 float4 texture to the GPU
[*]Rendering a 64x64 full-screen quad with a trivial pixel/vertex shader
[*]Downloading the resulting 64x64 float4 frame buffer
[/list]
I've tried various high-spec cards (currently using a GeForce GTX 480) and various ways of parallelizing the operation (so more than one "job" is running concurrently), but the fastest I can get this operation to happen is about 1ms or so. If I remove the upload and download steps (and just wait for the quad to render), the operation takes around 0.15ms, so 85% of my time is being spent on upload/download.

It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame :( ), given the published bandwidth numbers for this kind of card. Am I being too optimistic about how long this kind of thing should take? Or am I doing something dumb in my DirectX code?

I'm uploading my input data as a 64x128 DXGI_FORMAT_R32G32B32A32_FLOAT texture. I create the input texture with usage D3D11_USAGE_DYNAMIC and CPUAccessFlags D3D11_CPU_ACCESS_WRITE. To write my data I map the texture (passing D3D11_MAP_WRITE_DISCARD as the map type) and do a memcpy.
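Roughly, the upload path looks like this (simplified, with placeholder names; the one subtlety is honoring the RowPitch the driver returns rather than assuming rows are tightly packed):

[code]
// Simplified upload: map the dynamic texture with WRITE_DISCARD and copy
// row by row, honoring the RowPitch the driver hands back (it may be
// larger than the tightly-packed row size).
const UINT kWidth   = 128;                         // texels per row
const UINT kHeight  = 64;                          // rows
const UINT rowBytes = kWidth * 4 * sizeof(float);  // float4 per texel

D3D11_MAPPED_SUBRESOURCE mapped;
if (SUCCEEDED(g_pD3DContext->Map(g_inputTexture, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
{
    BYTE* dst = (BYTE*)mapped.pData;
    const BYTE* src = (const BYTE*)g_inputData;
    for (UINT y = 0; y < kHeight; ++y)
        memcpy(dst + y * mapped.RowPitch, src + y * rowBytes, rowBytes);
    g_pD3DContext->Unmap(g_inputTexture, 0);
}
[/code]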

To download my result I create a 64x64 render-target texture (also DXGI_FORMAT_R32G32B32A32_FLOAT) with usage D3D11_USAGE_DEFAULT and CPUAccessFlags 0. I use CopyResource to copy the render target to a second staging texture (usage D3D11_USAGE_STAGING, CPUAccessFlags D3D11_CPU_ACCESS_READ), then Map the staging texture.
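And the readback path, again simplified with placeholder names:

[code]
// Simplified readback: copy the default-usage render target into the
// staging texture, then map the staging copy for CPU reads.
g_pD3DContext->CopyResource(g_stagingTexture, g_renderTargetTexture);

D3D11_MAPPED_SUBRESOURCE mapped;
if (SUCCEEDED(g_pD3DContext->Map(g_stagingTexture, 0, D3D11_MAP_READ, 0, &mapped)))
{
    // RowPitch applies here too when walking the 64x64 float4 results.
    const BYTE* gpuData = (const BYTE*)mapped.pData;
    // ... copy/consume the results ...
    g_pD3DContext->Unmap(g_stagingTexture, 0);
}
[/code]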

In order to get some parallelism (and hopefully some pipelining, where one operation is uploading while another is downloading), I've tried several things. I've tried having several sets of input/output/RT textures, invoking the draw and related commands on all of them, then doing the RT map/unmap. I've also tried the same thing but doing the upload via a command list built on a deferred context in a different thread. But I can't average any faster than 1ms per job, no matter how many concurrent jobs I try to run.

Any ideas? I can include more source code if anyone is interested.

Thanks all
[quote name='griffin77' timestamp='1306455802' post='4816249']
It seems like I should be able to get much faster bandwidth than this (1ms per job means I can only perform 16 "jobs" in a 16ms frame :( ), given the published bandwidth numbers for this kind of card. Am I being too optimistic about how long this kind of thing should take? Or am I doing something dumb in my DirectX code?
[/quote]

Are you sure? Such an assumption is completely flawed, as there's a lot more stuff going on which may just be overhead. I suggest you [b]TRY[/b] 16 "jobs" (and many more) per frame and see whether your framerate is truly 60fps or lower, rather than just assuming.

BTW, parallelism isn't going to help you if you're bandwidth-limited. And depending on the data you're sending, it may be a problem of latency rather than bandwidth. If that's the problem, you won't be able to make a single job faster than 1ms, but a job that consumes more bandwidth and ALU will still take around 1ms.
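If you want hard numbers rather than eyeballing the framerate, D3D11 timestamp queries are one way to measure the GPU side. A minimal sketch (device/context names assumed, error handling omitted):

[code]
// Bracket the workload with timestamp queries inside a disjoint query.
D3D11_QUERY_DESC qd = {};
qd.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
ID3D11Query *pDisjoint, *pBegin, *pEnd;
g_pd3dDevice->CreateQuery(&qd, &pDisjoint);
qd.Query = D3D11_QUERY_TIMESTAMP;
g_pd3dDevice->CreateQuery(&qd, &pBegin);
g_pd3dDevice->CreateQuery(&qd, &pEnd);

g_pD3DContext->Begin(pDisjoint);
g_pD3DContext->End(pBegin);
// ... submit the 16 (or more) jobs here ...
g_pD3DContext->End(pEnd);
g_pD3DContext->End(pDisjoint);

// Spin until results are ready (fine for a benchmark, not shipping code).
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
while (g_pD3DContext->GetData(pDisjoint, &dj, sizeof(dj), 0) != S_OK) {}
UINT64 t0 = 0, t1 = 0;
while (g_pD3DContext->GetData(pBegin, &t0, sizeof(t0), 0) != S_OK) {}
while (g_pD3DContext->GetData(pEnd,   &t1, sizeof(t1), 0) != S_OK) {}
if (!dj.Disjoint)
{
    double gpuMs = double(t1 - t0) / double(dj.Frequency) * 1000.0;
    // ... compare against the 1ms-per-job assumption ...
}
[/code]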

Cheers
Dark Sylinc
There's a difference between bandwidth and latency.
It might only take a fraction of a millisecond to complete your transfer, but there might be a whole millisecond of latency before the transfer is initiated.

Ideally, in these kinds of parallel/distributed systems, you want to initiate the transfer long before you actually need the data, then come back later and check whether it's completed.
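In D3D11 terms, you can poll instead of blocking by mapping with D3D11_MAP_FLAG_DO_NOT_WAIT. A sketch with placeholder resource names:

[code]
// Kick off the GPU->CPU copy early...
g_pD3DContext->CopyResource(g_stagingBuffer, g_renderTargetTexture);

// ... go do other useful work ...

// ...then poll for completion instead of stalling. With DO_NOT_WAIT, Map
// returns DXGI_ERROR_WAS_STILL_DRAWING if the GPU isn't finished yet.
D3D11_MAPPED_SUBRESOURCE mapped;
HRESULT hr = g_pD3DContext->Map(g_stagingBuffer, 0, D3D11_MAP_READ,
                                D3D11_MAP_FLAG_DO_NOT_WAIT, &mapped);
if (hr == DXGI_ERROR_WAS_STILL_DRAWING)
{
    // Not ready; come back later.
}
else if (SUCCEEDED(hr))
{
    // ... read the results ...
    g_pD3DContext->Unmap(g_stagingBuffer, 0);
}
[/code]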
[quote name='Matias Goldberg' timestamp='1306456292' post='4816254']
Are you sure? Such an assumption is completely flawed, as there's a lot more stuff going on which may just be overhead. I suggest you [b]TRY[/b] 16 "jobs" (and many more) per frame and see whether your framerate is truly 60fps or lower, rather than just assuming.

BTW, parallelism isn't going to help you if you're bandwidth-limited. And depending on the data you're sending, it may be a problem of latency rather than bandwidth. If that's the problem, you won't be able to make a single job faster than 1ms, but a job that consumes more bandwidth and ALU will still take around 1ms.
[/quote]

Thanks for the quick reply...

Yeah, I'm pretty sure my little DX test app accurately simulates how my code will work in my real application. And no matter how I arrange it, I cannot get more than 16 jobs to execute in 16ms. And that's with a trivial pixel shader that does nothing but sample the input texture and write it to the output render target. I actually need at least about 4x that for this to be a viable solution :(

Pipelining should be able to help me, as PCI Express can do concurrent upload and download. So I should, in theory at least, be able to get a 3x speedup by pipelining (e.g. one job is uploading data to the GPU, one is actually drawing, and one is downloading to the CPU, all at the same time), right? At least this is how I had this operation set up in CUDA/OpenCL. Also, I'm not convinced the DirectX API isn't in fact doing some work on the CPU that accounts for a lot of that time (maybe some kind of swizzled texture conversion?). If that's the case, I should be able to get a big speedup through parallelism on my 8-core box.
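Roughly the shape I have in mind, where the helper names are just placeholders for the map/draw/copy sequences described above:

[code]
// Software-pipelined submission: on iteration n, upload job n, draw job
// n-1, and read back job n-2, each stage using its own set of textures.
// Whether the driver actually overlaps the transfers is up to it.
const int kStages = 3;
for (int n = 0; n < numJobs + kStages - 1; ++n)
{
    if (n < numJobs)
        UploadInput(n % kStages, n);              // CPU->GPU for job n
    if (n >= 1 && n - 1 < numJobs)
        DrawQuad((n - 1) % kStages);              // render job n-1
    if (n >= 2 && n - 2 < numJobs)
        ReadbackResult((n - 2) % kStages, n - 2); // GPU->CPU for job n-2
}
[/code]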

It just *seems* to me that these numbers are too slow. I don't have much to base that on, however, other than a back-of-the-envelope calculation based on PCIe bandwidth.
[quote name='Hodgman' timestamp='1306456765' post='4816255']
There's a difference between bandwidth and latency.
It might only take a fraction of a millisecond to complete your transfer, but there might be a whole millisecond of latency before the transfer is initiated.

Ideally, in these kinds of parallel/distributed systems, you want to initiate the transfer long before you actually need the data, then come back later and check whether it's completed.
[/quote]

Which is why I was hoping to get some benefit from parallelism (e.g. one job has a latency of 1ms, but I'm executing 16 jobs, so that 1ms delay is amortized over all of them), but I really don't see a big speedup from doing that.

Of course, I'm suspicious that I'm screwing something up so that the jobs are in fact being serialized (e.g. wait 1ms for job A, then again for job B, then for job C, etc.).
Oh, just realized: you're using a GeForce 480. That card [url="http://area.autodesk.com/forum/autodesk-3ds-max/installation---hardware---os/asus-geforce-gtx-480-problems-in-3ds-max-2011/"]SUCKS[/url] [url="http://forums.nvidia.com/index.php?showtopic=173749"]BAD[/url] [url="http://forums.nvidia.com/index.php?showtopic=173517"]at[/url] [url="http://www.nvnews.net/vbulletin/showthread.php?t=154355"]GPU->CPU[/url] transfers.

It's a common problem in the 400 series (except the Quadro versions; pretty lame if you ask me).

Find another card and try again. Or find a way not to rely on GPU->CPU transfers so much.


[b]Edit:[/b] It also looks to me like you're stalling the GPU by breaking async transfers. I strongly suggest you [url="http://msdn.microsoft.com/en-us/library/bb205132%28v=VS.85%29.aspx"]read this[/url] (the Performance Considerations section).
[quote name='Matias Goldberg' timestamp='1306462759' post='4816281']
Oh, just realized: you're using a GeForce 480. That card [url="http://area.autodesk.com/forum/autodesk-3ds-max/installation---hardware---os/asus-geforce-gtx-480-problems-in-3ds-max-2011/"]SUCKS[/url] [url="http://forums.nvidia.com/index.php?showtopic=173749"]BAD[/url] [url="http://forums.nvidia.com/index.php?showtopic=173517"]at[/url] [url="http://www.nvnews.net/vbulletin/showthread.php?t=154355"]GPU->CPU[/url] transfers.

It's a common problem in the 400 series (except the Quadro versions; pretty lame if you ask me).

Find another card and try again. Or find a way not to rely on GPU->CPU transfers so much.
[/quote]

Ahh, that's good to know.

Though I've also tried this on my GTX 580 with similar results (I'll double-check tonight though). Does that have the same problem?

I also have a Quadro 4000 somewhere I can try (though when I was attempting to do this in CUDA/OpenCL, it actually got much worse performance than my GTX 480).
OK so I think I figured this out. Thanks for all the advice.

My issue appears to be that the GPU really likes to be "warmed up" before doing any real computation. Obviously something is happening behind the scenes the first time you do a Map or UpdateSubresource that makes the first run really slow. My loop looks something like this:
[code]
// Submit all jobs first, so the GPU can work on them concurrently...
for (int i = 0; i < NUM_THREADS; i++)
{
    // Upload this job's input data
    g_pD3DContext->UpdateSubresource(g_inputTexture[i], 0, NULL, g_inputData[i],
                                     g_jobDim*16*g_qwordsPerInputItem, g_jobDim*16);

    // Bind the input texture and sampler to the pixel shader
    g_pD3DContext->PSSetShaderResources(0, 1, &g_inputTextureRV[i]);
    g_pD3DContext->PSSetSamplers(0, 1, &g_pSamplerLinear);

    // Render the full-screen quad into this job's render target
    g_pD3DContext->OMSetRenderTargets(1, &g_renderTargetView[i], NULL);
    g_pD3DContext->Draw(4, 0);

    // Queue the GPU->CPU copy into this job's staging buffer
    g_pD3DContext->CopyResource(g_stagingBuffer[i], g_renderTargetTexture[i]);
}

// ...then read the results back, so the Map for job 0 only stalls while
// the remaining jobs are still in flight.
for (int i = 0; i < NUM_THREADS; i++)
{
    D3D11_MAPPED_SUBRESOURCE mappedResource;
    g_pD3DContext->Map(g_stagingBuffer[i], 0, D3D11_MAP_READ, 0, &mappedResource);

    float *outputDataGPU = (float*)(mappedResource.pData);

    // n is the index of an outer batch loop (not shown here)
    memcpy(outputDataCPU[n*NUM_THREADS + i], outputDataGPU, g_outputDataSize);
    g_pD3DContext->Unmap(g_stagingBuffer[i], 0);
}
[/code]

Basically, if I just run this a couple of times (passing dummy input data and ignoring the result of the final Map) before I run it "in anger", then after that I can run my jobs in less than 0.1ms.
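In other words, something like this, where RunJobs() is a hypothetical wrapper around the two loops above:

[code]
// Warm up: run the whole submit/readback sequence with throwaway data.
for (int warm = 0; warm < 2; ++warm)
    RunJobs(g_dummyInputData);   // results mapped but thrown away
RunJobs(g_realInputData);        // subsequent runs: ~0.1ms per job
[/code]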
Though you were right about the 480, BTW. That 0.1ms figure is for my 580; my 480 is half that speed (0.2ms).
[quote name='Matias Goldberg' timestamp='1306462759' post='4816281'][b]Edit:[/b] It also looks to me like you're stalling the GPU by breaking async transfers. I strongly suggest you [url="http://msdn.microsoft.com/en-us/library/bb205132%28v=VS.85%29.aspx"]read this[/url] (the Performance Considerations section).[/quote]

This can't be emphasised enough. Bandwidth is only one consideration for this kind of use case, and often not the most relevant one. You can have all the bandwidth in the world and minimal latency, but if you need to stall the GPU, you're still going to suffer.
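One common pattern to avoid the stall (a sketch with placeholder names, not the OP's exact setup) is to queue the CopyResource this frame but not Map the staging texture until a frame or two later, by which point the GPU should long since have finished the copy:

[code]
// Double-buffer the staging textures and read the older one each frame.
const int kSlots = 2;
int writeSlot = frameIndex % kSlots;
g_pD3DContext->CopyResource(g_staging[writeSlot], g_renderTarget);

if (frameIndex >= kSlots - 1)
{
    int readSlot = (frameIndex + 1) % kSlots;  // slot written last frame
    D3D11_MAPPED_SUBRESOURCE mapped;
    if (SUCCEEDED(g_pD3DContext->Map(g_staging[readSlot], 0, D3D11_MAP_READ, 0, &mapped)))
    {
        // Results are a frame old, but the Map no longer blocks on the GPU.
        g_pD3DContext->Unmap(g_staging[readSlot], 0);
    }
}
[/code]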
[quote name='mhagain' timestamp='1306498190' post='4816401']
[quote name='Matias Goldberg' timestamp='1306462759' post='4816281'][b]Edit:[/b] It also looks to me like you're stalling the GPU by breaking async transfers. I strongly suggest you [url="http://msdn.microsoft.com/en-us/library/bb205132%28v=VS.85%29.aspx"]read this[/url] (the Performance Considerations section).[/quote]

This can't be emphasised enough. Bandwidth is only one consideration for this kind of use case, and often not the most relevant one. You can have all the bandwidth in the world and minimal latency, but if you need to stall the GPU, you're still going to suffer.
[/quote]

Yeah, that's why I run multiple concurrent jobs. I basically kick off a ton of jobs at the same time, and only once they are all churning away do I start waiting for results. While I'm stalled waiting for the Map to get the results of the first job, the rest can still be running. It seems to be working pretty well.

There just seems to be some weird warm-up cost the first time you read or write a GPU buffer; once I do a dummy run to get rid of that, bandwidth/latency are not a problem. Now that I'm down to 0.1ms per job, I think GPU compute will become my bottleneck (once I implement a non-trivial pixel shader to do the work I actually want done).
