I'm trying to implement a complicated GPGPU operation using DirectX 11 (I tried DirectCompute and CUDA with limited success). I'm making good progress, but the bottleneck is uploading the data to the GPU and downloading the results from the GPU. Basically, my operation involves:
- Uploading a 128x64 float4 texture to the GPU
- Rendering a 64x64 full-screen quad with a trivial pixel/vertex shader
- Downloading the resulting 64x64 float4 frame buffer
It seems like I should be able to get much higher bandwidth than this (1 ms per job means I can only perform 16 "jobs" in a 16 ms frame), given the published bandwidth numbers for this kind of card. Am I being too optimistic about how long this kind of thing should take? Or am I doing something dumb in my DirectX code?
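For reference, here is the back-of-the-envelope arithmetic behind that expectation, as a minimal sketch. The helper names (`textureBytes`, `effectiveMBPerSecond`) are made up for illustration, and it assumes tightly packed textures with 16 bytes per float4 texel and the 1 ms per job figure above.

```cpp
#include <cassert>
#include <cstddef>

// Bytes per texel for DXGI_FORMAT_R32G32B32A32_FLOAT: four 32-bit floats.
constexpr std::size_t kBytesPerTexel = 4 * sizeof(float);

// Size in bytes of a tightly packed width x height float4 texture
// (hypothetical helper, ignores any driver-side row padding).
constexpr std::size_t textureBytes(std::size_t width, std::size_t height) {
    return width * height * kBytesPerTexel;
}

// Effective transfer rate in MB/s (1 MB = 10^6 bytes) for a job that
// moves `bytes` in `milliseconds`.
double effectiveMBPerSecond(std::size_t bytes, double milliseconds) {
    assert(milliseconds > 0.0);
    return (static_cast<double>(bytes) / 1.0e6) / (milliseconds / 1.0e3);
}

// Per-job traffic for the sizes described in this post.
constexpr std::size_t kUploadBytes   = textureBytes(128, 64); // 128 KiB input
constexpr std::size_t kDownloadBytes = textureBytes(64, 64);  // 64 KiB result
constexpr std::size_t kJobBytes      = kUploadBytes + kDownloadBytes;
```

Calling `effectiveMBPerSecond(kJobBytes, 1.0)` gives a little under 200 MB/s for 192 KiB moved per 1 ms job, which is why I suspect a fixed per-job cost rather than raw bus bandwidth.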
I'm uploading my input data as a 64x128 DXGI_FORMAT_R32G32B32A32_FLOAT texture. I create the input texture with usage D3D11_USAGE_DYNAMIC and CPUAccessFlags D3D11_CPU_ACCESS_WRITE. To write my data, I map the texture (passing D3D11_MAP_WRITE_DISCARD as the map type) and do a memcpy.
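One detail about the memcpy step: Map fills in a D3D11_MAPPED_SUBRESOURCE whose RowPitch is the distance in bytes between rows, and that can be larger than width * 16. A pitch-aware copy looks roughly like the sketch below. `copyRowsWithPitch` is a hypothetical helper, not a D3D11 function, and it is written without the D3D headers so it stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `rows` rows of `rowBytes` bytes each from a tightly packed source
// into a destination whose rows start `dstPitch` bytes apart.
// Hypothetical helper: in real code `dst` and `dstPitch` would come from the
// pData and RowPitch fields of the mapped subresource.
void copyRowsWithPitch(void* dst, std::size_t dstPitch,
                       const void* src, std::size_t rowBytes,
                       std::size_t rows) {
    assert(dstPitch >= rowBytes);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);

    // Fast path: no padding between rows, so one memcpy covers everything.
    if (dstPitch == rowBytes) {
        std::memcpy(d, s, rowBytes * rows);
        return;
    }

    // Padded rows: copy each row to its own pitch-aligned offset and leave
    // the padding bytes untouched.
    for (std::size_t y = 0; y < rows; ++y) {
        std::memcpy(d + y * dstPitch, s + y * rowBytes, rowBytes);
    }
}
```

With a mapped texture this would be called as something like `copyRowsWithPitch(mapped.pData, mapped.RowPitch, input, width * 16, height)`, where `mapped`, `input`, `width` and `height` are placeholder names.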
To download my result, I create a 64x64 render-target texture (also DXGI_FORMAT_R32G32B32A32_FLOAT) with usage D3D11_USAGE_DEFAULT and CPUAccessFlags 0. I use CopyResource to copy the RT to a second, staging texture (usage D3D11_USAGE_STAGING, CPUAccessFlags D3D11_CPU_ACCESS_READ), then map the staging texture.
In order to get some parallelism (and hopefully some kind of pipelining, where one operation is uploading while another is downloading), I tried several things. I have tried having several sets of input/output/RT textures, issuing the draw (and related) commands on all of them, then doing the RT map/unmap. I've also tried the same thing, except using deferred contexts to do the upload via a command list built in a different thread. But I can't average any faster than 1 ms per job, no matter how many concurrent jobs I try to run.
Any ideas? I can include more source code if anyone is interested.