# Suggestion on GPU reduction

## Recommended Posts

Hey guys,

I need some advice on doing a tricky GPU-CPU task efficiently. Here is the background first:

=====================Background (optional)===========================

In my recent project I am trying to align two point clouds. Think of a depth camera taking pictures of a target from two slightly different views (two different view matrices, camera pose1 and pose2), where each pixel is a depth value (like a depth buffer, but storing real depth rather than 1/z). Reprojecting the two 'depth buffers' gives you two point clouds, and the job is to find the matrix M that aligns them (the transform from pose1 to pose2). There are algorithms for this; in my case I use FastICP (fast iterative closest point). As the name suggests, it's an iterative method, so the routine looks like the following:

=============================Detail================================

```cpp
Texture2D<float4> depth_and_normalmap1; // 512x424 pixels
Texture2D<float4> depth_and_normalmap2; // 512x424 pixels
StructuredBuffer<float4> workingBuf[7]; // 512x424 elements (float4)

float reprojection_error = FLT_MAX;
int iterations = 0;
matrix m = IdentityMatrix; // 4x4 matrix
float4 result[7] = {};

do {
    m = CPU_ICPSolver( result ); // Nothing to do with GPU inside

    GPU_PrepareWorkingBuffer(
        depth_and_normalmap1, // input as SRV
        depth_and_normalmap2, // input as SRV
        m,                    // input as CBV
        workingBuf);          // output as UAV (all 7 buffers)

    for (int i = 0; i < 7; ++i) {
        GPU_Reduction::Process(workingBuf[i]); // reduce to 1 float4 value on the GPU, but not yet copied to a readback buffer
    }
    GPU_Reduction::Readback( result ); // copy the reduction results from the default heap to the readback heap; has to wait for the GPU inside

    reprojection_error = GetReprojectionError( result );
    ++iterations;
} while (iterations < 20 && reprojection_error > threshold);
```



Above is how the workflow looks. I have tested and profiled the 1-iteration case on my GTX680; this part alone:

```cpp
for (int i = 0; i < 7; ++i) {
    GPU_Reduction::Process(workingBuf[i]); // reduce to 1 float4 value on the GPU, but not yet copied to a readback buffer
}
GPU_Reduction::Readback( result ); // copy the reduction results from the default heap to the readback heap; has to wait for the GPU inside
```


took 0.65ms (does that seem reasonable, or is it incredibly slow? please let me know, thanks). So if I add GPU_PrepareWorkingBuffer and do 20 iterations, I will probably end up around 16ms... which seems too much.

The reduction shader I wrote is a standard one guided by this post, so not a naive one (but there are some tricky things, which I will cover later...).

============================Questions==============================
So for standard GPU reduction, as this post describes, the result dimension equals the threadgroup count (for example, Dispatch(threadgroupX, threadgroupY, 1) means your reduction result will be of size (threadgroupX, threadgroupY)).
In my case I need to reduce each buffer to a single value, so ideally I should dispatch only one threadgroup. But one threadgroup can have at most 1024 threads, so even with 16 fetches per thread, one threadgroup can only handle 1024*16 elements, which is not enough for my 512x424 case.
There are other ways to reduce to 1 element even with multiple threadgroups, e.g. using atomic adds, but InterlockedAdd only works on 32-bit integers, so I can't use that method directly (mine is a sum over float4). Or is there an efficient workaround? Thanks.
To sum 512x424 values down to 1, I need at least 2 passes, but with that configuration the second pass would sum 1024*16 elements with only 14 of them actually needed. So the ideal configuration is thread_per_group * fetches_per_thread = sqrt(512*424) ~ 466, which means with fetches per thread set to 8, I should have 64 threads per threadgroup... isn't that too small compared to the 1024 maximum? And should I not worry about 'summing 1024*16 elements with only 14 actually needed' at all?

My other question is whether I should use a compute shader at all. I could change my input format to Texture2D instead of StructuredBuffer and use a fullscreen-quad pixel shader doing log2(n) passes, each sampling with a linear sampler and multiplying by 4 (a linear sampler averages 4 neighboring pixels, right?) to get the sum of 4 neighboring pixels (and I can bind 8 RTs per pass, right?). I had such a pixel-shader reduction implemented way back in DX9, but I didn't have a GPU profiler back then, so I don't know the exact cost (and I'd rather not spend time figuring it out, so if someone knows CS vs. PS reduction perf results, please let me know, thanks).

Question 3: with 512x424 elements, it seems the CPU could handle the reduction pretty well performance-wise (even though every element is a float4 and I have 7 such buffers). Has anyone compared GPU vs. CPU reduction? I haven't tried the CPU version, for fear that copying those 7 StructuredBuffer<float4> from GPU to CPU may cost much more (am I wrong?).
Question 4: as you may notice, every iteration has a CPU-GPU data dependency (the CPU waits for the reduction result to compute the matrix, then the GPU waits for the matrix to do its work...), so any suggestions about that? The dependency exists because I need to solve a small linear system Ax = b (A is 6x6), which seems GPU-solvable... does anyone know how to solve a small linear system on the GPU?

##### Share on other sites

> 64 threads per threadgroup... isn't that too small compared to the 1024 maximum? And should I not worry about 'summing 1024*16 elements with only 14 actually needed' at all?

64 threads is fine; I expect it to work better than 1024. With 1024 I assume each CU has to work too much with other CUs' LDS memory, which potentially has some cost.

You should test whether smaller thread groups perform better even for the first reduction (the sweet spot is usually 64, 128, or 256 in my experience); I'm curious...

You could probably do the reduction with registers only, without LDS at all; it may be 10 times faster.

A large workgroup would do sub-reductions in registers (so 64 values per 32-thread warp), then either store those sub-results in LDS for a final reduction (with the other threads idle), or store the sub-results to memory and process them further in the next dispatch (so more dispatches).

I have not optimized with intrinsics but expect some nice speedups. Curious here too :)

##### Share on other sites

> > 64 threads per threadgroup... isn't that too small compared to the 1024 maximum? And should I not worry about 'summing 1024*16 elements with only 14 actually needed' at all?
>
> 64 threads is fine; I expect it to work better than 1024. With 1024 I assume each CU has to work too much with other CUs' LDS memory, which potentially has some cost.
>
> You should test whether smaller thread groups perform better even for the first reduction (the sweet spot is usually 64, 128, or 256 in my experience); I'm curious...
>
> You could probably do the reduction with registers only, without LDS at all; it may be 10 times faster.
>
> A large workgroup would do sub-reductions in registers (so 64 values per 32-thread warp), then either store those sub-results in LDS for a final reduction (with the other threads idle), or store the sub-results to memory and process them further in the next dispatch (so more dispatches).
>
> I have not optimized with intrinsics but expect some nice speedups. Curious here too :)

Thanks JoeJ, the intrinsics look very promising, but I hope to run my program on any DX12 hardware, so I want to stay away from vendor-specific intrinsics (I hope SM6 will have all of these, and that it's coming out pretty soon)...

It seems that in your opinion multiple passes are not bad (with a smaller threadgroup size)... I haven't tested it yet, but multiple passes also mean multiple device memory reads/writes, while the compute shader itself has only one device read and one device write, with all intermediate results in LDS. So I guess a fatter shader may be better than more passes (please correct me if I got something wrong...).

Fine-tuning those params (thread_per_group, fetch_per_thread, etc.) is tedious :( and I also wish to use the same shader on 1080p one-channel data, which seems sub-optimal... Have you ever used the PS for reduction? I remember being told that a texel read is a 2x2 quad, and with a linear sampler you can get the average of 4 pixels in one sample and multiply by 4, which should be faster than a compute shader's 4 samples and adds... But you will have log2(max(width, height)) passes...

##### Share on other sites

After testing on a different machine with a GTX1080, I got a weird result:

I mentioned above that this part alone:

```cpp
for (int i = 0; i < 7; ++i) {
    GPU_Reduction::Process(workingBuf[i]); // reduce to 1 float4 value on the GPU, but not yet copied to a readback buffer
}
GPU_Reduction::Readback( result ); // copy the reduction results from the default heap to the readback heap; has to wait for the GPU inside
```

took 0.65ms.

So 0.65ms on the GTX680m, but I got 1.46ms on the GTX1080 (though I was using Google Remote Desktop to get that number; not sure how that affects the perf...).

Any idea what happened there?

Here is the compute shader I wrote:

```hlsl
#define TYPE float4

StructuredBuffer<TYPE> buf_srvInput : register(t0);
RWStructuredBuffer<TYPE> buf_uavOutput : register(u0);

//==============================================================================
// Note: in my case there are only 2 cases for reduction: depth, which is
// 512x424, and color, which is 960x512 (downsampled). Since in both cases I
// can't get the final result in 1 pass, I have to do 2 passes, and to avoid
// too much padding I need to figure out thread_per_group and
// fetches_per_thread. Thus I got
//    thread_per_group * fetches_per_thread * group_count >= pixel_count
//    thread_per_group * fetches_per_thread * 1 >= group_count
// ==> thread_per_group * fetches_per_thread ~ sqrt(pixel_count), which is
//     ~702 for the color case
// ==> thread_per_group set to 128, fetches_per_thread set to 8
//==============================================================================
#define THREAD_COUNT 128
#define FETCH_COUNT 8

cbuffer CBdata : register(b0)
{
    uint uResultOffset;
    uint uFinalResult;
}

groupshared TYPE sharedMem[THREAD_COUNT];

[numthreads(THREAD_COUNT, 1, 1)]
void main(uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID,
          uint GI : SV_GroupIndex)
{
    uint uIdx = DTid.x * FETCH_COUNT;
    sharedMem[GI] = buf_srvInput[uIdx]
        + buf_srvInput[uIdx + 1]
        + buf_srvInput[uIdx + 2]
        + buf_srvInput[uIdx + 3]
        + buf_srvInput[uIdx + 4]
        + buf_srvInput[uIdx + 5]
        + buf_srvInput[uIdx + 6]
        + buf_srvInput[uIdx + 7];
    GroupMemoryBarrierWithGroupSync();

    if (GI < 64) sharedMem[GI] += sharedMem[GI + 64];
    GroupMemoryBarrierWithGroupSync();

    // Note: the steps below rely on warp-synchronous execution within a
    // 32-thread wave; strictly speaking HLSL gives no such guarantee, so a
    // barrier after each step would be the safe (if slower) choice.
    if (GI < 32) sharedMem[GI] += sharedMem[GI + 32];
    if (GI < 16) sharedMem[GI] += sharedMem[GI + 16];
    if (GI < 8) sharedMem[GI] += sharedMem[GI + 8];
    if (GI < 4) sharedMem[GI] += sharedMem[GI + 4];
    if (GI < 2) sharedMem[GI] += sharedMem[GI + 2];
    if (GI < 1) sharedMem[GI] += sharedMem[GI + 1];

    if (GI == 0) {
        buf_uavOutput[uFinalResult ? uResultOffset : Gid.x] = sharedMem[0];
    }
}
```


Thanks

##### Share on other sites

Your FETCH_COUNT of 8 seems quite large. Not sure if enough other work can be done while waiting on all those reads.

For the reduction you could try a different approach (it needs no branches, but does a lot of redundant additions and LDS accesses):

```cpp
for (int i = WG_WIDTH>>1; i > 0; i>>=1) // WG_WIDTH = thread group size
{
    float temp = lds[lID] + lds[lID ^ i]; // lID = thread index
    BARRIER_LOCAL
    lds[lID] = temp;
    BARRIER_LOCAL
}
```


Intrinsics are there for AMD too: https://www.khronos.org/registry/spir-v/extensions/AMD/

But AMD does not yet expose their equivalent of warp shuffle, which would be the most important one.

Intel has only 8 threads per wave, so nothing to expect there.

SM 6.0 will not have warp shuffle (only the quad shuffle, which every GPU supports). Because of hardware differences (64-wide waves on AMD, 32 on NV) I do not expect a future generalization here, and vendor extensions will remain the way to go.

But I also decided to finish my project without extensions first and care about that later.

> Fine-tuning those params (thread_per_group, fetch_per_thread, etc.) is tedious :( and I also wish to use the same shader on 1080p one-channel data, which seems sub-optimal... Have you ever used the PS for reduction? I remember being told that a texel read is a 2x2 quad, and with a linear sampler you can get the average of 4 pixels in one sample and multiply by 4, which should be faster than a compute shader's 4 samples and adds... But you will have log2(max(width, height)) passes...

Keep those tuning params configurable. That's some additional work, but it's worth it; you can take care of proper settings per chip later. (I've had different settings for Fermi, Kepler and GCN in the past, with noticeable perf differences.)

I have not used PS, but you can also sample images from a compute shader; that may be worth it.

PS alone, without the advantage of reductions in LDS? I'm almost sure it's not worth trying.

Other than that I can't really narrow down your options; it's easier to ADD options to your todo list :)

> So 0.65ms on the GTX680m, but I got 1.46ms on the GTX1080 (though I was using Google Remote Desktop to get that number; not sure how that affects the perf...).

In general Fermi was faster than Kepler at compute. I got a tie between a 480 and a 670 with my work. An AMD R9 280X was twice as fast as a Kepler Titan (!)

No experience yet with Pascal, but I would not expect it to have become worse again. I assume it's a GPU <-> CPU sync issue or some additional transition work going on. Can you port the matrix solve to the GPU as well?

What overall speedup from the 680 to the 1080 do you get?

##### Share on other sites

> Keep those tuning params configurable. That's some additional work, but it's worth it; you can take care of proper settings per chip later. (I've had different settings for Fermi, Kepler and GCN in the past, with noticeable perf differences.)

I changed to configurable reduction kernel settings and got a shader explosion (7 kernel sizes * 6 fetch sizes, with 2 different data types, so now I have 84 compute shaders, but it seems ok). And you are right: the good configuration is actually 4 fetches and 128 threads/threadgroup, with which I got around 0.4ms instead of 0.6ms. The weird GTX1080 perf was caused by GPU idle time (I incorrectly placed the end timestamp on the next commandlist, so it included GPU idle time), and the GTX1080 now runs best with 256 threads/threadgroup and 2 fetches, which takes only 0.14ms.

With correct timestamps, I was astonished to find that my frame time on the GTX1080 is 8ms, and of that the GPU was idle for 5.5ms!! At the same time, Task Manager reports that every CPU core's usage is less than 20%; what does that mean? (Sadly, my crappy engine is single-threaded right now...) Any suggestions or a checklist for me? Thanks.

##### Share on other sites
Posted (edited)
> I was astonished to find that my frame time on the GTX1080 is 8ms, and of that the GPU was idle for 5.5ms!!

How do you get those numbers? By subtracting the differences between the previous command list's end and the next command list's start from the total runtime?

If you have a readback after each batch, it should work to dispatch something else in between that can execute while the CPU waits on the result.

Edit: Also make sure you use only 2 timestamps per frame to get the total runtime. Per-command-list timestamps can be very expensive and are only good for relating runtimes of individual tasks.

> Task Manager reports that every CPU core's usage is less than 20%; what does that mean?

Probably too little to do for the CPU while it waits on the GPU.

Edited by JoeJ

##### Share on other sites

> > I was astonished to find that my frame time on the GTX1080 is 8ms, and of that the GPU was idle for 5.5ms!!
>
> How do you get those numbers? By subtracting the differences between the previous command list's end and the next command list's start from the total runtime?
>
> If you have a readback after each batch, it should work to dispatch something else in between that can execute while the CPU waits on the result.
>
> Edit: Also make sure you use only 2 timestamps per frame to get the total runtime. Per-command-list timestamps can be very expensive and are only good for relating runtimes of individual tasks.
>
> > Task Manager reports that every CPU core's usage is less than 20%; what does that mean?
>
> Probably too little to do for the CPU while it waits on the GPU.

Yes, I got that number from a timestamp pair split across two cmdlists, before and after the present call. In my current research project I have fewer than 50 draws/dispatches per frame, so I put them all in one cmdlist (is that a bad idea?). For GPU time measurement I have a timestamp pair around every major draw/dispatch call (probably 30 pairs in one commandlist... is it really bad to have more than 2 timestamps per commandlist?)

Thanks

##### Share on other sites

Yes, 30 pairs of timestamps is very bad, at least for me. You might save one ms by removing them. It also makes it hard to profile and to be sure about things, never knowing how the time measurement itself affects performance.

I have various settings for my profiler: off, coarse, and fine-grained modes for various tasks. I turn these on/off with #ifdefs.

I have little experience with draws, but 50 compute dispatches per list sounds good to me. Spreading across multiple command lists has so far never been a win for me, but always a noticeable loss.

As always, it may depend on hardware, API, and what you do. It's very good to discuss these things, but I guess even after GPU coding for a whole lifetime we would need to keep testing again and again... :)

##### Share on other sites

> Yes, 30 pairs of timestamps is very bad, at least for me. You might save one ms by removing them. It also makes it hard to profile and to be sure about things, never knowing how the time measurement itself affects performance.
>
> I have various settings for my profiler: off, coarse, and fine-grained modes for various tasks. I turn these on/off with #ifdefs.
>
> I have little experience with draws, but 50 compute dispatches per list sounds good to me. Spreading across multiple command lists has so far never been a win for me, but always a noticeable loss.
>
> As always, it may depend on hardware, API, and what you do. It's very good to discuss these things, but I guess even after GPU coding for a whole lifetime we would need to keep testing again and again... :)

Thanks JoeJ. I also have different GPU timer granularity levels available at runtime, but I feel the fine-grained timer is very helpful: you can easily fine-tune params for a specific shader. So I abuse GPU timestamps as shown in the following image:

[attachment=34413:Capture.PNG]

But you said that having many of them makes profiling hard and makes it hard to be sure about things; could you explain why that is? I feel I need the detail to be sure about my time measurements. Thanks~~

##### Share on other sites

Taking a timestamp acts like a memory barrier for me. Actually, I would not need to use any memory barriers at all with timestamps after each dispatch; this indicates timestamps affect execution somehow.

A very bad case was testing async compute: 1.1ms total runtime, and to hide 0.1ms I needed to split the command list into 3 lists to sync properly, so with 6 timestamps async appeared as a loss, with a total of 1.15ms.

After using just 2 timestamps I got 1.05ms, so a small win. Problems: I can't be totally sure the first timestamp is in the list that starts first, and I can't be sure whether even 2 timestamps still hurt async; maybe the true runtime is 1.0ms, but there's no way to find out.

So this is another indication.

I also have about 30 pairs max (about 50 dispatches). If I enable them all, 1.1ms rises to about 1.7ms.

If you remove all profiling except the very start and end, how much do you win? Maybe it's very different on NV.

##### Share on other sites

I can't say anything for NVIDIA, as they don't disclose information about that kind of stuff, but on AMD GCN timestamps do not behave as barriers. They just use a write-at-end-of-pipe mechanism, waiting for the ROPs to complete for draw calls, or for all CS work to be done, prior to the timestamp write, but it is perfectly fine for the GPU to start other draws and compute before the write. You should be able to use hundreds of timestamps without noticing any perf difference, unless you do something wrong in the readback.

##### Share on other sites

> unless you do something wrong in the readback

Awesome!

My mistake was (although this is Vulkan):

```cpp
void Profiler::Start (int id, VkCommandBuffer cmd, VkPipelineStageFlagBits pipelineStage)
{
    id = (id*2) & (MAX_TIMERS*2-1);
    SetUsed (id);
    //vkCmdResetQueryPool (cmd, queryPool, id, 2);
    vkCmdWriteTimestamp (cmd, pipelineStage, queryPool, id);
}
```


I was doing the vkCmdResetQueryPool for each timestamp separately while recording; I totally missed that.

After changing this to one reset call for all stamps before starting the real work, there is no more difference between 30 and just 2 stamps. :)

This will make things a lot easier!

Sorry for spreading misinformation, and thanks for the great help :D