# Is GPU-to-CPU data transfer a performance bottleneck?


## Recommended Posts

Hi Everyone,

I have a desktop application that needs to render about one million particles (small sprite-like quads with semi-transparent alpha). To render the alpha blending properly I need to draw the particles back-to-front, so I have to calculate the z-coordinates of the particles in each Render() call. I can calculate the z-coordinates on the CPU side in C# in about 22 milliseconds.

To get better performance, I tried to use a compute shader for the calculation. However, the compute shader I implemented took about 100 milliseconds in total. With some simple profiling I found that the compute shader itself did the calculation very quickly (in less than 1 millisecond), but the step that fetches the results (i.e. 1 million floats) from the GPU to the CPU took about 100 milliseconds. This is the first time I have used a compute shader, and I am wondering whether I did something seriously wrong in my implementation. Could I use a different buffer type to improve the performance, or is this just the nature of GPU-CPU communication? GPU calculation looks to me like a pizza that costs $1 but charges $100 for delivery.

The following are snippets of my implementation (in C# and SlimDX):

// SpaceMap.fx  ===================
matrix mTrans;
StructuredBuffer<float3> gpuDataIn : register (t0);
RWStructuredBuffer<float> gpuDataOut : register (u0);

void CS_ComputeDepth(uint id  : SV_DispatchThreadId) {
float4 p = mul(float4(gpuDataIn[id.x], 1), mTrans);
gpuDataOut[id.x] = -p.z/p.w;
}

// SpaceMap.cs ======================
Buffer outputBuffer;
Buffer stagingBuffer;
UnorderedAccessView outputView;

// One time initialization of the compute shader
stagingBuffer = new Buffer(device, new BufferDescription {
BindFlags = BindFlags.None,
OptionFlags = ResourceOptionFlags.StructuredBuffer,
SizeInBytes = bodies.Count * 4,
StructureByteStride = sizeof(float),
Usage = ResourceUsage.Staging,
});

outputBuffer = new Buffer(device, new BufferDescription {
OptionFlags = ResourceOptionFlags.StructuredBuffer,
SizeInBytes = 4*bodies.Count,
StructureByteStride = sizeof(float),
Usage = ResourceUsage.Default,
});

outputView = new UnorderedAccessView(device, outputBuffer,
new UnorderedAccessViewDescription {
ElementCount = bodies.Count,
Format = Format.Unknown,
Dimension = UnorderedAccessViewDimension.Buffer
});

using (DataStream vertices = new DataStream(12* bodies.Count, true, true)) {
foreach (Body body in bodies) {
vertices.Write((float)body.X);
vertices.Write((float)body.Y);
vertices.Write((float)body.Z);
}
vertices.Position = 0;
var desc = new BufferDescription{
CpuAccessFlags = CpuAccessFlags.None,
OptionFlags = ResourceOptionFlags.StructuredBuffer,
SizeInBytes = bodies.Count * 12,
StructureByteStride = sizeof(float),
Usage = ResourceUsage.Default
};

var inBuffer = new Buffer(device, vertices, desc);
}
}

// Function called in Render() for each frame.
void CalculateDepth(float[] zValues) {
var ctx = device.ImmediateContext;
effect.GetTechniqueByName("SpaceMap").GetPassByName("ComputeDepth").Apply(ctx);

// This statement takes less than one millisecond.
using (TimeCheck.Check)
ctx.Dispatch(bodies.Count, 1, 1);              //   !!!!!!!!!!!!!!

ctx.CopyResource(outputBuffer, stagingBuffer);
using (TimeCheck.Check)   // This block about 100 milliseconds ???????
{
}
ctx.UnmapSubresource(stagingBuffer, 0);

}



##### Share on other sites
Transferring data between the CPU and the GPU can be a problem in general: first because the data has to move across the PCIe bus, which is significantly narrower than the main-memory bus, and second because of some extraneous copies that occur along the way.

The data you want to move to the card sits somewhere in memory. When you queue it up for the GPU, it is silently copied to another block of memory that is suitably aligned for a DMA transfer to the GPU. The DMA writes the data to the GPU, the GPU does the calculations, a second DMA moves the results back to the aligned memory, and finally the results are copied from the aligned memory back to a main-memory buffer.

You can program your way around some of this, and the newest CPUs and GPUs take steps to share memory more efficiently, but you definitely have to consider the latency of the transfer, whether it is due to buffering, bandwidth, or extraneous copying.

Do you actually need the order back on the CPU? Do you really need to sort the entire set? Does the order change dramatically each frame, and if not, can you use a sort that runs better when the list is mostly sorted?
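As a rough illustration of the last question: if the previous frame's order is kept and only re-sorted, an insertion sort does work proportional to the number of out-of-order pairs rather than to n log n. The sketch below is a minimal version under that assumption; the function name `resortBackToFront` and the index-array layout are hypothetical and not taken from the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Re-sorts `order` (indices into `depth`) so that depth[order[i]] is
// descending, i.e. back-to-front when a larger depth means farther away.
// Insertion sort does O(n + k) work, where k is the number of out-of-order
// pairs, so it is cheap when last frame's order is still almost correct.
// Returns the number of element moves, to make that cost visible.
std::size_t resortBackToFront(const std::vector<float>& depth,
                              std::vector<unsigned>& order)
{
    assert(depth.size() == order.size());
    std::size_t moves = 0;
    for (std::size_t i = 1; i < order.size(); ++i) {
        const unsigned idx = order[i];
        const float key = depth[idx];
        std::size_t j = i;
        // Shift nearer particles one slot to the right until idx fits.
        while (j > 0 && depth[order[j - 1]] < key) {
            order[j] = order[j - 1];
            --j;
            ++moves;
        }
        order[j] = idx;
    }
    return moves;
}
```

If the view rotates quickly and the order changes a lot between frames, the move count approaches n squared, so a fallback to a general-purpose sort would be needed in that case.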

##### Share on other sites
Another option would be to P/Invoke some C++ code that performs the sort using vector intrinsics. You can probably get down to the sub-5 ms range that way.

##### Share on other sites
22 ms seems awfully slow for just one million particles. One would also expect the sorting to be the most time-consuming piece of code, not the z value computation, which is a single dot product per particle. You should double-check that code.

GPUs can also sort data rather nicely. You could consider doing the sorting on the GPU too, which removes the need to copy the data back to the CPU in the first place.
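One sorting scheme that is often used on GPUs is the bitonic sorting network, because every compare-and-swap within a step is independent of the others. The post does not say which algorithm is meant, so the following is only a CPU reference sketch of that one technique, written in C++ rather than as a compute shader; the function name `bitonicSort` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// CPU reference of a bitonic sorting network, ascending order.
// For a fixed (k, j) pair every iteration of the innermost loop touches a
// disjoint pair of elements, which is why each (k, j) step could become one
// compute-shader dispatch with one thread per element.
// Precondition: keys.size() is a power of two (pad with +infinity otherwise).
void bitonicSort(std::vector<float>& keys)
{
    const std::size_t n = keys.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t k = 2; k <= n; k <<= 1) {          // bitonic sequence size
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // compare distance
            for (std::size_t i = 0; i < n; ++i) {       // "thread id" on a GPU
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;             // handle each pair once
                const bool ascending = (i & k) == 0;
                const bool outOfOrder = ascending ? keys[i] > keys[partner]
                                                  : keys[i] < keys[partner];
                if (outOfOrder)
                    std::swap(keys[i], keys[partner]);
            }
        }
    }
}
```

For one million particles the array would be padded to 2^20 elements, and the network runs 210 (k, j) steps, each of which maps to a fully parallel pass.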

One last thing: measuring the time it takes for the Dispatch call to return will not give you the time the GPU takes to complete the work. You need to use a dedicated tool or the functionality offered by the API to measure the actual running time of the kernel.

##### Share on other sites

The standard trick for avoiding the need to sort particles altogether is to set the blend mode to additive. That works because it doesn't matter in what order you add a bunch of things up; you get the same result either way. To do that, set both SrcBlend and DestBlend to D3D11_BLEND_ONE, and use D3D11_BLEND_OP_ADD.
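The order-independence claim can be checked with a small CPU model of the two blend equations for a single colour channel. This is only an illustrative sketch: the `Fragment` struct and both function names are hypothetical, and real hardware would additionally clamp the result for normalized render-target formats.

```cpp
#include <cassert>
#include <vector>

// One colour channel of a particle, values in [0, 1].
struct Fragment {
    float color;
    float alpha;
};

// Additive blending (SrcBlend = ONE, DestBlend = ONE): dst = src + dst.
float blendAdditive(float dst, const std::vector<Fragment>& frags)
{
    for (const Fragment& f : frags)
        dst = f.color + dst;
    return dst;
}

// Conventional alpha blending (SRC_ALPHA / INV_SRC_ALPHA):
// dst = src * a + dst * (1 - a). The result depends on the draw order.
float blendAlpha(float dst, const std::vector<Fragment>& frags)
{
    for (const Fragment& f : frags) {
        assert(f.alpha >= 0.0f && f.alpha <= 1.0f);
        dst = f.color * f.alpha + dst * (1.0f - f.alpha);
    }
    return dst;
}
```

Drawing the fragments (0.5, 0.5) and (0.25, 0.5) over a black background gives 0.75 in either order with `blendAdditive`, while `blendAlpha` gives 0.25 in one order and 0.3125 in the other.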

A texture for additive blending doesn't need an alpha channel - you just make it black where you want it transparent, and coloured where you don't. A blurry white dot is a common image to use for small additive particles.

The downside is that some effects are hard to achieve with additive blending alone. However, where sorting is required, the particle count can generally be small enough to process on the CPU.

Here's some simple 2D additive particles rendered using WebGL: http://jsfiddle.net/BB37j/

##### Share on other sites
Out of curiosity I just tested the z value computation on my notebook. This is my setup:
#include <time.h>
#include <iostream>
#include <vector>

class CPUStopWatch
{
public:
CPUStopWatch();

void start();
size_t getNanoseconds();
protected:
timespec m_start;
};

CPUStopWatch::CPUStopWatch()
{
start();
}

void CPUStopWatch::start()
{
clock_gettime(CLOCK_MONOTONIC_RAW, &m_start);
}

size_t CPUStopWatch::getNanoseconds()
{
timespec stop;
clock_gettime(CLOCK_MONOTONIC_RAW, &stop);

if (stop.tv_sec > m_start.tv_sec) {
return (long int)(stop.tv_sec - m_start.tv_sec) * 1000000000l
+ ((long int) stop.tv_nsec - (long int)m_start.tv_nsec);
} else {
return stop.tv_nsec - m_start.tv_nsec;
}
}

void fillParticleData(float *xyz, unsigned num)
{
for (unsigned i = 0; i < num*3; i++)
xyz[i] = i*12.345f; // doesn't really matter what's in there
}

void computeZValues(const float *xyz, unsigned num, const float *focalPlane, float *dst)
{
for (unsigned i = 0; i < num; i++) {
dst[i] = xyz[i*3+0] * focalPlane[0] +
xyz[i*3+1] * focalPlane[1] +
xyz[i*3+2] * focalPlane[2] +
focalPlane[3];
}
}

int main(int argc, char **argv)
{
const unsigned numParticles = 1000000;
std::vector<float> particle_XYZ;
particle_XYZ.resize(numParticles*3);
fillParticleData(&particle_XYZ[0], numParticles);

float focalPlane[4];
focalPlane[0] = 1.0f;
focalPlane[1] = 2.0f;
focalPlane[2] = 3.0f;
focalPlane[3] = 4.0f;

std::vector<float> zValues;
zValues.resize(numParticles);
unsigned numIterations = 1000;
CPUStopWatch stopWatch;
for (unsigned i = 0; i < numIterations; i++) {
computeZValues(&particle_XYZ[0], numParticles, focalPlane, &zValues[0]);
}

std::cout << "Avg. time for " << numParticles << ": " << stopWatch.getNanoseconds() / (float)numIterations / 1e6f << " ms!" << std::endl;

}


Computing the z values for one million particles without any fancy hand optimizations takes about 1.5 ms on my CPU. With some SIMD you can probably at least halve that, but even if that's not possible in C#, your code is still 15 times slower than it should be. Maybe you can post your code for the z value computation?

##### Share on other sites

Hi Ohforf,

Thanks for your effort in running the test. The following is my code snippet in C#:

float minValue = float.MaxValue;
float maxValue = float.MinValue;
float minV = float.MaxValue;
float maxV = float.MinValue;
for (int i = begin; i < end; i++) {
IBody b = bodies[idxList[i]];
//
// We could rotate the current z-axis into the initial coordinate system with the inverse of mat,
// then use a single 3D dot product to project b.XYZ onto that axis. This would improve the
// performance (by 10% for 500K points), but would lose too much accuracy.
//
float v = -(float)((b.X * mat.M13 + b.Y * mat.M23 + b.Z * mat.M33 + mat.M43) / (b.X * mat.M14 + b.Y * mat.M24 + b.Z * mat.M34 + mat.M44));
zValues[i] = v;
if (v < minV) minV = v;
if (v > maxV) maxV = v;
}
lock (this) {
if (minV < minValue) minValue = minV;
if (maxV > maxValue) maxValue = maxV;
}
});


In the above code I used multiple threads to speed up the calculation (my computer has 6 cores at 2.0 GHz). The data types of b.X, b.Y and b.Z are all double. In the code I also calculate the minimum and maximum (needed later for fast sorting). Also, as mentioned in the code comment, I transform (b.X, b.Y, b.Z) directly with the current transformation matrix mat, which is more expensive than projecting (x, y, z) onto the focalPlane normal as you did in your code. To get such a focalPlane normal, I would need to apply the inverse of mat to (0, 0, 1). But it seems that the function mat.Invert() is not very accurate, so it causes strange artifacts when I rotate the view.
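The post does not say which sort consumes the minimum and maximum. One common way to exploit a known depth range is a counting (bucket) sort over quantized depths, which runs in linear time. The following C++ sketch shows that technique under this assumption; the function name `bucketSortBackToFront` and the bucket count are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns particle indices ordered back-to-front (largest depth first),
// using a counting sort over depths quantized into `numBuckets` bins between
// minDepth and maxDepth. Particles that land in the same bin keep their
// original relative order, so the result is only sorted up to one bin width.
// Precondition: every depth lies in [minDepth, maxDepth].
std::vector<unsigned> bucketSortBackToFront(const std::vector<float>& depth,
                                            float minDepth, float maxDepth,
                                            std::size_t numBuckets)
{
    assert(numBuckets > 0 && maxDepth >= minDepth);
    const float range = maxDepth - minDepth;
    const float scale =
        range > 0.0f ? static_cast<float>(numBuckets - 1) / range : 0.0f;

    // Bucket 0 holds the farthest particles.
    auto bucketOf = [&](float d) {
        return static_cast<std::size_t>((maxDepth - d) * scale);
    };

    // Pass 1: histogram. start[b + 1] counts the particles in bucket b.
    std::vector<std::size_t> start(numBuckets + 1, 0);
    for (float d : depth)
        ++start[bucketOf(d) + 1];

    // Prefix sum turns the counts into the first output slot of each bucket.
    for (std::size_t b = 0; b < numBuckets; ++b)
        start[b + 1] += start[b];

    // Pass 2: scatter the indices into their bucket's output range.
    std::vector<unsigned> order(depth.size());
    for (std::size_t i = 0; i < depth.size(); ++i)
        order[start[bucketOf(depth[i])]++] = static_cast<unsigned>(i);
    return order;
}
```

With enough buckets the residual misordering inside a bin is often invisible for small, densely packed particles, but that trade-off depends on the scene.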

Despite the differences between our implementations, my implementation seems to be way too slow. I will take a close look at this issue, especially since using the GPU seems to be a dead end. I hope this is not an issue with C#.

Anyway, as some posts already pointed out, the depth calculation is not the biggest part of the work; the sorting is (it takes 3 to 4 times longer than calculating the z-coordinates). I experimented for quite a while to avoid sorting, but I found no solution. Without sorting, semi-transparent textures won't be rendered properly!

Finally, I would like to thank you all for your replies. This community is really great!

##### Share on other sites

And do you need to use doubles? Computing your v expression with doubles is slower than with floats (the double division will be really slow), and converting the result to float also takes some time.

Maybe you could sort on GPU and keep the sorted array on GPU and use it as an indirect parameter to your rendering, perhaps as an index buffer. That way, you wouldn't have to stall at all.
