Is GPU-to-CPU data transfer a performance bottle-neck?

James X. Li · 2013-12-17T20:04:07

Hi Everyone, I have a desktop application that needs to render about one million particles (sprite alike small quads with semi-transparency alpha). In order to render alpha blending properly I need to draw the particles from back-to-front. To do this I need to calculate the z-coordinates of the particles in each Render() call. I can calculate the z-coordinates on the CPU side in C# in about 22 milliseconds. To get better performance, I tried to use compute shader to do the calculation. However, the compute shader I implemented took in total about 100 milliseconds to do the calculation. With some simple profiling I found out that the compute-shader did the calculation actually very quickly (in less than 1 millisecond), but the step to fetch the results (i.e. 1 million floats) from GPU to CPU took about 100 milliseconds. This is the first time I use compute shader, I am wandering whether I did something seriously wrong in my implementation? Could I use different buffer type to improve the performance?, or is this just the nature of GPU-CPU communication? GPU calculation looks to me like a pizza that costs $1, but charges $100 for delivery. Thank you in advance. The following are snippets of my implementation ( in C# & SlimDX): // SpaceMap.fx =================== matrix mTrans;StructuredBuffer<float3> gpuDataIn : register (t0); RWStructuredBuffer<float> gpuDataOut : register (u0); [numthreads(1, 1, 1)] void CS_ComputeDepth(uint id : SV_DispatchThreadId) { float4 p = mul(float4(gpuDataIn[id.x], 1), mTrans); gpuDataOut[id.x] = -p.z/p.w; } // SpaceMap.cs ====================== Buffer outputBuffer; Buffer stagingBuffer; UnorderedAccessView outputView; ShaderResourceView inBufferView; // One time initialization of the compute shader void SetupComputeShader() { stagingBuffer = new Buffer(device, new BufferDescription { BindFlags = BindFlags.None, CpuAccessFlags = CpuAccessFlags.Read, OptionFlags = ResourceOptionFlags.StructuredBuffer, SizeInBytes = bodies.Count * 4, StructureByteStride = sizeof(float), Usage = ResourceUsage.Staging, }); outputBuffer = new Buffer(device, new BufferDescription { BindFlags = BindFlags.UnorderedAccess | BindFlags.ShaderResource, OptionFlags = ResourceOptionFlags.StructuredBuffer, SizeInBytes = 4*bodies.Count, StructureByteStride = sizeof(float), Usage = ResourceUsage.Default, }); outputView = new UnorderedAccessView(device, outputBuffer, new UnorderedAccessViewDescription { ElementCount = bodies.Count, Format = Format.Unknown, Dimension = UnorderedAccessViewDimension.Buffer }); using (DataStream vertices = new DataStream(12* bodies.Count, true, true)) { foreach (Body body in bodies) { vertices.Write((float)body.X); vertices.Write((float)body.Y); vertices.Write((float)body.Z); } vertices.Position = 0; var desc = new BufferDescription{ BindFlags = BindFlags.ShaderResource | BindFlags.UnorderedAccess, CpuAccessFlags = CpuAccessFlags.None, OptionFlags = ResourceOptionFlags.StructuredBuffer, SizeInBytes = bodies.Count * 12, StructureByteStride = sizeof(float), Usage = ResourceUsage.Default }; var inBuffer = new Buffer(device, vertices, desc); inBufferView = new ShaderResourceView(device, inBuffer); } } // Function called in Render() for each frame. void CalculateDepth(float[] zValues) { var ctx = device.ImmediateContext; effect.GetTechniqueByName("SpaceMap").GetPassByName("ComputeDepth").Apply(ctx); ctx.ComputeShader.SetShaderResource(inBufferView, 0); ctx.ComputeShader.SetUnorderedAccessView(outputView, 0); // This statement takes less than one millisecond. using (TimeCheck.Check) ctx.Dispatch(bodies.Count, 1, 1); // !!!!!!!!!!!!!! ctx.CopyResource(outputBuffer, stagingBuffer); using (TimeCheck.Check) // This block about 100 milliseconds ??????? { ctx.MapSubresource(stagingBuffer, D3D.MapMode.Read, D3D.MapFlags.None).Data.ReadRange(zValues, 0, bodies.Count); } ctx.UnmapSubresource(stagingBuffer, 0); ctx.ComputeShader.SetShaderResource(null, 0); ctx.ComputeShader.SetUnorderedAccessView(null, 0); }

Graphics and GPU Programming Programming

Started by jamesxli December 14, 2013 05:14 PM

13 comments, last by jamesxli 10 years, 3 months ago

21st Century Moose

13,459

December 16, 2013 06:38 PM

From the look of this you're making two passes over your items - once to store them in the "bodies" list, and a second time to do the distance calculation. Can you not do the distance calculation at storage time?

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

jamesxli

303

Author

December 17, 2013 05:13 AM

@Ohforf

Hi Ohforf,

I have looked the code of our two tests above more closely. I found out that the real reason for the huge performance difference between the two implementation is that your code accesses the xyz coordinates sequentially, while my code accesses them rather randomly as dictated by the index list idxList. To verify this you can change computeZValues() as following and repeat your test:


void computeZValues(const float *xyz, unsigned num, const float *focalPlane, float *dst)
{
    for (unsigned k = 0; k < num; k++) {
        unsigned i = k * 373 % num;    // enforce random access
        // unsigned i = k;              // use sequential access.
        dst[j] = xyz[i*3+0] * focalPlane[0] +
                 xyz[i*3+1] * focalPlane[1] +
                 xyz[i*3+2] * focalPlane[2] +
                 focalPlane[3];
    }
}

I could sort the xyz coordinates table together with z-values to get sequential access, but I doubt that the gain will offset the slow-down of the sorting algorithm (that has a non-trivial implementation itself).

best regards.

Ohforf sake

2,052

December 17, 2013 10:58 AM

Ok, so plain vanilla (what I posted above): 1.5 ms - 1.6 ms

+ with min/max value extraction: 1.6 ms

+ with reciprocal z instead of linear z: 2.6 ms

+ in double: 4.4 ms

+ random access: 27 ms

Also sorting 1 million floats with std::sort: ~42ms

So bottom line:
- min/max costs nothing extra
- double nearly doubles the cost
- random access is just as bad as everyone always says it is
- it's the sort that ultimately kills you

Note however that std::sort is O(n log n). Maybe s.o. with a radix sort at hand can post some numbers, but even then I believe the sort will be prohibitively slow. So the best advice is what others already said: Can you get away without sorting them, or can you sort them on the GPU?

What exactly are you actually trying to do? To draw some benefit from 1 million particles they should cover a significant portion of the screen. At HD your screen only has ~1 million pixels after all. Anything smaller than that and you would have more particles then pixels covered by the effect. And if they are all blended, then you are in for an overdraw nightmare.

If this is some sort of point splatting effect, maybe you can sort the particles into screen tiles and perform the sorting and blending for each tile in compute+fragment programs, thus reducing the overdraw and sorting overhead.

PunCrathod

596

December 17, 2013 02:05 PM

And do you need to use doubles? Computing your v expression with doubles is slower than with floats (the double division will be really slow) and also converting it to float takes some time.

You don't really need to worry about doubles or floats. As long as your memory bandwith does not run out doubles are almost always as fast as floats on a 64 bit processor as they are both usually calculated with the same alu using the same 80bit registers. Granted with 1 million 3d points you have 3 million doubles and thats about 24 megabytes and there goes your bandwith. If all the data could fit in L3 cache(or L4 if you have it) then you propably would not get much of a difference in performance without using some fancy stuff like SSE(wich can infact process double the amount of floats than doubles in one cycle). But in this case even with SSE you wouldn't get much of an improvement as your bandwith is already holding you back.

Maybe you could sort on GPU and keep the sorted array on GPU and use it as an indirect parameter to your rendering, perhaps as an index buffer. That way, you wouldn't have to stall at all.

This is propably the best bet on getting a million particles sorted inside a reasonable amount of time. Just as a comparison a game that I'm making with a couple of friends has a particle system that has two textures containing the particle data. One for reading and one for writing and swapping them around after rendering. Granted it's only in 2d and no need for sorting but since all the data is kept on vram all the time its blazing fast and we can have 5 million particles without any noticeably decrease in performance(less than 1ms difference in frame time between 100k and 5mil particles with a nvidia gtx 660). And as op said in the first post the shader did the job in less than 1ms too so the best solution would be to figure out a way to avoid transferring all the data between cpu and gpu.

We actually did ours by having three different shaders. One that "rendered" new particles in the particle data texture, one that updated the data between frames and one that rendered them to a framebuffer to be displayed on the screen. So the only data that needed to be sent anywhere was a list of particle emitters active during a frame sent to the first shader and the rest was just a few draw calls. I believe they call this approach ping-ponging. I'm not too familiar with how it works as my two friends do most of the rendering code in our project but I hope this gives you the motivation to try something similar yourself.

However if your particles interact with the rest of the simulation then it gets complicated and I have no idea how to make that happen.

jamesxli

303

Author

December 17, 2013 08:04 PM

Thank you all for sharing your valuable experiences and insights. The application I mentioned shows data from flow cytometry, each data point represents attributes of a particle or a cell fragment. A key requirement for this application is correctly implementing the semi-transparent alpha channel, so that people can see the shape of high concentration regions by varying the alpha value and rotating the 3D view.

By replacing the old 2D sprite with geometry shader of DX11, we already increased the frame rate from less than 1 fps to about 10 fps, this is probably enough for our users. But in order to justify the switch from DX9 to DX11, we like to get as much as possible performance. If we could calculate depths on GPU, without the delay to fetch results, we can get 12 to 13 fps. If we don't sort at all, we can get up to 35 fps (but with lots of artifacts.) The sorting algorithm we implemented is fairly sophisticated, it uses the multi-cores cpu to exploit various data statistics. So, porting it to HLSL would be a challenge for us, but it could be interesting option for future improvement.

Anyway, unlike in game programming, our users will likely accept a frame rate of 10 or 5 fps. But they'll really get confused if a few dozens particles (i.e. 0.01%) rendered wrongly for just few frames from time to time, as they often hope to find needles in (data) hays. This is the reason that we always use double precision when possible, and this not a problem for modern CPUs, but an issue for most GPUs, I guess.

Is GPU-to-CPU data transfer a performance bottle-neck?

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Is GPU-to-CPU data transfer a performance bottle-neck?

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines