Hey Guys,
I need some advice and suggestions on doing a tricky GPU-CPU task efficiently; here comes the background first:
=====================Background (optional)===========================
In my recent project I am trying to align two point clouds. Think of a depth camera taking pictures of a target (each pixel stores real depth, like a depth buffer but with actual depth rather than 1/z) from two slightly different views (so with two different view matrices, camera pose1 and pose2). Reprojecting the two 'depth buffers' gives you two point clouds. Now the job is to find the matrix M that aligns those two point clouds (the matrix transforming pose1 to pose2). There are algorithms for this; in my case I use FastICP (fast iterative closest point). As the name suggests, it's an iterative method, so the routine looks like the following:
=============================Detail================================
Texture2D<float4> depth_and_normalmap1; // 512x424 pixels
Texture2D<float4> depth_and_normalmap2; // 512x424 pixels
StructuredBuffer<float4> workingBuf[7]; // 512x424 elements (float4) each
float reprojection_error = FLT_MAX;
int iterations = 0;
matrix m = IdentityMatrix; // 4x4 matrix
float4 result[7] = {};
do {
    m = CPU_ICPSolver( result );      // nothing to do with the GPU inside
    GPU_PrepareWorkingBuffer(
        depth_and_normalmap1,         // input as SRV
        depth_and_normalmap2,         // input as SRV
        m,                            // input as CBV
        workingBuf );                 // output as UAV (all 7 buffers)
    for (int i = 0; i < 7; ++i) {
        GPU_Reduction::Process( workingBuf[i] ); // reduce to 1 float4 value on the GPU, but not yet copied to the readback buffer
    }
    GPU_Reduction::Readback( result ); // read the reduction result: copy from default heap to readback heap, has to wait on the GPU inside
    reprojection_error = GetReprojectionError( result );
    ++iterations;
} while (iterations < 20 && reprojection_error > threshold);
That is what the workflow looks like. I have tested and profiled the 1-iteration case on my GTX 680:
this part alone:
for (int i = 0; i < 7; ++i) {
    GPU_Reduction::Process( workingBuf[i] ); // reduce to 1 float4 value on the GPU, but not yet copied to the readback buffer
}
GPU_Reduction::Readback( result ); // copy from default heap to readback heap, has to wait on the GPU inside
took 0.65 ms (does that seem reasonable, or is it incredibly slow? please let me know, thanks). So once I add GPU_PrepareWorkingBuffer and run 20 iterations, I will probably end up with around 16 ms... which seems like too much...
The reduction shader I wrote is a very standard one guided by this post, so not a naive one (but there are some tricky parts, which I will cover later...).