Hi,
I hope this post is useful to you.
1. Introduction
Config: DX9, HLSL, VS.NET, GeForce 8800.
Problem:
I use render-to-texture on a quad to output GPGPU results.
Each input fragment scans a specific array corresponding to it, and outputs a number (0-255) of results during the scan.
Not every element in the array triggers an output, so the number of results is unknown in advance and data-dependent.
Each output unit is a float3.
The imaginary PS pseudocode looks like this:
float3 out[n] = scan(array(inPos)); // n is in the range 0-255
If n=0 or n=1, we are done immediately:
Color[0] = scan(array(inPos));
The cases n=0 and n=1 can be distinguished by setting Color[0].alpha to 0 or 1.
But if n>1, a problem occurs: a texel in the render target can hold only 1 result, not more.
2. Design
We use multiple passes. Each pass uses multiple render targets (MRT), so in each pass each pixel can output at most nRT results, where nRT is the number of output textures. One issue remains: in the 2nd pass, how does a pixel know that its scan of array(inPos) should resume where it stopped at the end of the 1st pass? In other words, it should start from the (nRT+1)th result rather than starting over from the 1st result, so that each pass moves a little further through the array. That requires keeping a record of where the pixel stopped in its array in the previous pass.
We call the position where a pixel stopped in its array in the previous pass its "lastprint". Since each pixel has a different and unknown "lastprint" in each pass, we have to keep a texture for it, which is written in this pass and read in the next pass in order to resume the scan.
So we prepare a pair of ping-pong textures, 1 for reading and 1 for writing, and at the end of each pass we swap the 2 textures.
Since the card supports at most 4 simultaneous render targets, and we have to reserve 1 RT for the "lastprint" ping-pong texture, we are left with 3 RTs for output results.
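To make the resume logic concrete, here is a minimal CPU-side sketch in C++ of what one pixel does in one pass. It is not the actual shader. The names Texel, triggers and scanPass are made up for illustration, and the trigger test (v > 0) is purely an assumption, since the real condition depends on the data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One result slot, mirroring an RGBA texel: xyz is the payload,
// alpha = 1 marks a valid result and alpha = 0 an empty slot.
struct Texel { float x, y, z, a; };

// Hypothetical trigger test (an assumption, not the real condition).
inline bool triggers(float v) { return v > 0.0f; }

// Emulates one pixel in one pass: resume at `lastprint`, emit at most nRT
// results into `out`, and return the position to resume from next pass.
inline std::size_t scanPass(const std::vector<float>& array, std::size_t lastprint,
                            std::size_t nRT, std::vector<Texel>& out) {
    assert(nRT > 0);
    out.assign(nRT, Texel{0.0f, 0.0f, 0.0f, 0.0f});  // all slots start empty
    std::size_t written = 0;
    std::size_t i = lastprint;
    while (i < array.size() && written < nRT) {
        if (triggers(array[i])) {
            out[written++] = Texel{array[i], 0.0f, 0.0f, 1.0f};
        }
        ++i;
    }
    return i;  // the new "lastprint", written to the ping-pong texture
}
```

With nRT = 3, the returned value plays the role of the texel written to the "lastprint" RT, and `out` plays the role of the 3 result RTs for that pixel.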
At the end of each pass, we use GetRenderTargetData to copy the 3 RTs back to system memory, and on the CPU we check color.alpha: 0 means an invalid output, i.e. the fragment at this position has already finished its scan and will no longer output anything.
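The CPU-side alpha check could look roughly like the sketch below. It works on a plain float buffer instead of a real locked surface, and it assumes 4 floats per texel in R, G, B, A order with no row padding (a real locked surface has a pitch that must be respected). Result and collectValid are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A valid result together with the index of the pixel that produced it.
struct Result { std::size_t pixel; float x, y, z; };

// Collects the valid results from one copied-back render target.
// Assumption: tightly packed texels, 4 floats each, alpha stored last.
inline std::vector<Result> collectValid(const std::vector<float>& rgba) {
    assert(rgba.size() % 4 == 0);
    std::vector<Result> valid;
    for (std::size_t t = 0; t < rgba.size() / 4; ++t) {
        const float* c = &rgba[4 * t];
        if (c[3] != 0.0f) {  // alpha == 0 means "no result in this slot"
            valid.push_back(Result{t, c[0], c[1], c[2]});
        }
    }
    return valid;
}
```

This would be run once per result RT after each pass, so its cost grows with the full texture size, not with the number of valid results.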
How do we stop the multipass? That is, how do we know that EVERY pixel has reached the end of its array, so that no more output will be generated and we can exit?
A naive way would be to scan the copied-back RTs on the CPU and check whether ALL color.alpha == 0. That would be slow.
As an optimization, we use an occlusion query to check whether any output is still being generated. We need to cull the fragments which have already finished scanning their corresponding arrays. We use depth culling: at the end of the PS, we check whether lastprint == arrayLength. If so, we set the depth of this fragment to a big value (e.g. 1.0), so that it fails the depth test. (Does this make sense? I still don't know.) Another naive way is to give all output colors negative components and use clip(color) to kill the pixel. The performance of the two is about the same.
When a pass finishes, we use the occlusion query to get the number of pixels drawn; if it is zero, we can exit the multipass.
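The whole loop, including the exit test, can be simulated on the CPU as in the following C++ sketch. The counter `drawn` stands in for the occlusion query result, and the two position vectors stand in for the "lastprint" ping-pong textures. Two assumptions here: a pixel is culled based on the position it reads at the start of a pass (so the results of its last productive pass are kept), and the trigger test `emits` is invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical trigger test (an assumption, not the real condition).
inline bool emits(float v) { return v > 0.0f; }

// Runs passes until no pixel is drawn. Returns the number of passes issued,
// including the final pass whose "query" reports zero pixels.
inline int runMultipass(const std::vector<std::vector<float>>& arrays, std::size_t nRT,
                        std::vector<std::vector<float>>& results) {
    assert(nRT > 0);
    const std::size_t n = arrays.size();
    results.assign(n, std::vector<float>());
    std::vector<std::size_t> readPos(n, 0), writePos(n, 0);  // "lastprint" pair
    int passes = 0;
    for (;;) {
        std::size_t drawn = 0;  // stand-in for the occlusion query result
        for (std::size_t p = 0; p < n; ++p) {
            std::size_t i = readPos[p];
            writePos[p] = i;
            if (i == arrays[p].size()) continue;  // culled: already finished
            ++drawn;
            std::size_t written = 0;
            while (i < arrays[p].size() && written < nRT) {
                if (emits(arrays[p][i])) {
                    results[p].push_back(arrays[p][i]);
                    ++written;
                }
                ++i;
            }
            writePos[p] = i;
        }
        ++passes;
        if (drawn == 0) break;          // nothing drawn: every pixel is done
        std::swap(readPos, writePos);   // ping-pong swap for the next pass
    }
    return passes;
}
```

In this model the last pass draws nothing and exists only to make the query return zero, so the pass count is one more than the number of passes that actually produce results.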
3. Results
We tested the performance. If the data is nearly uniform, we need only 1 pass, and the timings of this scheme are:
<texture size (i.e., data size) : time on the CPU counterpart : time on the GPU>
<1M : 280ms : 230ms>
<8M : 2800ms : 1100ms>
<16M : GPU crashes for lack of memory, since we have to use a lot of A32B32G32R32F textures>
We can see that the GPGPU speedup increases as the texture size grows: it is about 1.2x at 1M (280ms vs 230ms) and about 2.5x at 8M (2800ms vs 1100ms).
4. Todos
We know the method above is far from efficient, but we don't know of any better ideas on DX9 (in CUDA or DX10, however, we would feel more at home). Todos include:
# How can we write back only the valid results in the RTs to system memory? There are many black (zero) results in every RT texture, and we copy all of them back. In fact the majority of each RT is zero, so this wastes bandwidth.
# How can we make the PS and the culling more efficient? We don't know.
Many thanks to the warm-hearted people who have been helping us all these days, such as ET3D and jollyjeffers the Moderator.
Thanks for any suggestions!