yk_cadcg

[gpgpu] report: output more than 1 result at each pixel


yk_cadcg    100
Hi, I hope this post is useful to you.

1. Introduction

Config: DX9, HLSL, VS.NET, GeForce 8800.

Problem: I use render-to-texture to output GPGPU results into a quad. Each input fragment scans the array that corresponds to it and outputs between 0 and 255 results during the scan. Not every element in the array triggers an output, so the number of results is unknown and data-dependent. Each output unit is a float3. The imaginary PS pseudocode looks like:
 float3 out[n] = scan(array(inPos)); //n: (0-255)
If n is 0 or 1, we are done immediately:
Color[0] = scan(array(inPos)); 
The cases n=0 and n=1 can be told apart by setting Color[0].alpha to 0 or 1. But if n>1, a problem occurs: a texel in the render target can hold only 1 result, not more.

2. Design

We use multiple passes. Each pass uses multiple render targets (MRT), so in each pass each pixel can output at most nRT results, where nRT is the number of output textures.

One issue: in the 2nd pass, how does a pixel know that its scan of array(inPos) should resume from where it stopped at the end of the 1st pass? In other words, it should start from the (nRT+1)th result rather than starting over from the 1st, so that each pass moves a little further along the array. That requires recording where in the array each pixel stopped in the previous pass. We call that position the "lastprint". Since each pixel has a different and unknown lastprint in each pass, we have to keep a texture for it, written in this pass and read in the next pass in order to resume the scan. So we prepare a pair of ping-pong textures, 1 for reading and 1 for writing, and swap the 2 textures at the end of each pass.

The card supports at most 4 simultaneous render targets, and we have to spare 1 RT for the lastprint ping-pong texture, so we are left with 3 RTs for output results.

At the end of each pass, we use GetRenderTargetData to copy the 3 RTs back to system memory, and on the CPU we check color.alpha: 0 means an invalid output, which means the fragment at this position has already finished its scan and will no longer output anything.

How do we stop the multipass loop, i.e., how do we know that EVERY pixel has reached the end of its array, so that no more output will be generated and we can exit? A naive way would be to scan the copied-back RTs on the CPU and check whether ALL color.alpha == 0. That would be slow. As an optimization, we use an occlusion query to check whether any output is still being generated.
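The resume-from-lastprint loop described above can be sketched on the CPU. This is only a rough model under stated assumptions, not the actual shader: `is_hit` is a hypothetical stand-in for whatever makes an element trigger an output, plain Python lists replace the textures, and the `pixels_drawn` counter plays the role of the occlusion query result.

```python
N_RT = 3  # output render targets per pass (4 MRTs minus 1 for lastprint)


def is_hit(value):
    # Hypothetical predicate: stands in for the real "this element
    # triggers an output" test, which the post does not specify.
    return value % 2 == 0


def run_pass(arrays, lastprint_read):
    """Simulate one pass: every unfinished pixel emits at most N_RT results."""
    outputs, lastprint_write, pixels_drawn = [], [], 0
    for array, start in zip(arrays, lastprint_read):
        if start == len(array):
            # This pixel finished in an earlier pass: it draws nothing.
            outputs.append([])
            lastprint_write.append(start)
            continue
        pixels_drawn += 1
        found, pos = [], start
        while pos < len(array) and len(found) < N_RT:
            if is_hit(array[pos]):
                found.append(array[pos])
            pos += 1
        outputs.append(found)
        lastprint_write.append(pos)  # where the next pass must resume
    return outputs, lastprint_write, pixels_drawn


def scan_all(arrays):
    """Repeat passes until a pass draws no pixels (the occlusion-query test)."""
    results = [[] for _ in arrays]
    lastprint = [0] * len(arrays)
    passes = 0
    while True:
        outputs, lastprint_next, drawn = run_pass(arrays, lastprint)
        if drawn == 0:
            break
        passes += 1
        for collected, found in zip(results, outputs):
            collected.extend(found)
        lastprint = lastprint_next  # ping-pong swap: written becomes read
    return results, passes
```

Note that a pixel may need one extra pass after its last output just to reach the end of its array, since it only learns that it is finished when lastprint equals the array length.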
We need to cull the fragments that have already finished scanning their arrays. We use depth culling: at the end of the PS, we check whether lastprint == arrayLength. If so, we set the depth of this fragment to a large value (e.g. 1.0) so that it fails the depth test. (Does this make sense? I still don't know.) Another naive way is to give all output colors negative components and use clip(color) to kill the pixel. The performance of the two is about the same. After finishing a pass, we use the occlusion query to get the number of pixels drawn; if it is zero, we can exit the multipass loop.

3. Results

We tested the performance. If the data is nearly uniform, we need only 1 pass, and the time cost of this scheme is:

<texture size (i.e., data size) : time cost of the CPU counterpart : time cost on the GPU>
<1M : 280 ms : 230 ms>
<8M : 2800 ms : 1100 ms>
<16M : the GPU crashes because there is not enough memory; we have to use a lot of A32B32G32R32F textures>

We can see that the GPGPU speedup increases as the texture size grows: about 1.2x at 1M and about 2.5x at 8M.

4. Todo's

We know the method above is far from efficient, but we don't know of better ideas on DX9. (In CUDA or DX10, however, we would be more at home.) Todo's include:

# How to write back only the valid results in the RTs to system memory? There are many black (zero) results in every RT texture, and we copy all of them back. In fact the majority of each RT is zero. That's a waste of bandwidth.
# How to make the PS and the culling more efficient. We don't know.

A mountain of thanks to the warmhearted people who have been helping us all these days, such as ET3D and jollyjeffers the Moderator. Thanks for any suggestions!
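As a rough check on the numbers in section 3, the arithmetic can be written out. The sketch below assumes that "1M" means 2^20 texels, that every render target (the 3 outputs plus the 2 ping-pong lastprint textures) uses A32B32G32R32F, and that the input data is not counted; none of these is confirmed by the post.

```python
BYTES_PER_TEXEL = 16  # A32B32G32R32F: four 32-bit floats per texel
MEGA = 1024 * 1024    # assumption: "1M" in the results means 2**20 texels


def texture_megabytes(texels):
    """Size of one A32B32G32R32F texture in MB."""
    return texels * BYTES_PER_TEXEL / MEGA


def working_set_megabytes(texels, output_rts=3, lastprint_textures=2):
    """Render-target memory only: 3 output RTs plus the ping-pong pair."""
    return (output_rts + lastprint_textures) * texture_megabytes(texels)


def speedup(cpu_ms, gpu_ms):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_ms / gpu_ms


for size_m, cpu_ms, gpu_ms in [(1, 280, 230), (8, 2800, 1100)]:
    print(f"{size_m}M: {working_set_megabytes(size_m * MEGA):.0f} MB, "
          f"speedup {speedup(cpu_ms, gpu_ms):.2f}x")
print(f"16M: {working_set_megabytes(16 * MEGA):.0f} MB")
```

Under these assumptions the render targets alone take about 640 MB at 8M and about 1280 MB at 16M, which would plausibly exceed the card's video memory and is consistent with the crash reported above.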

ET3D    810
Here are a couple of rough ideas for speeding up this scheme.

First, you can try to render in parts, into smaller render targets. I assume that the processing on the next pass is only dependent on the same position, and not other pixels, so it doesn't matter if you process 1/64 of your data, then another, then another. With this method, different areas may have different numbers of passes, and you won't have to read a full size render target for the maximal number of passes. You will also need much smaller render targets, so you won't run out of memory as quickly. You'll have some overhead, but I imagine that it won't be significant.
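To see why rendering in parts can cut the readback cost, here is a small model. It assumes each pixel's required pass count is known up front (`passes_needed`, a made-up input; in the real scheme it is only discovered as the passes run), that every pass copies back all 3 output RTs of the region being processed, and it ignores the final empty pass used to detect termination.

```python
def readback_full(passes_needed, n_rt=3):
    """Texels copied back when all the data lives in one render target."""
    total_passes = max(passes_needed, default=0)
    return total_passes * len(passes_needed) * n_rt


def readback_tiled(passes_needed, tile_size, n_rt=3):
    """Texels copied back when each tile runs only as many passes as it needs."""
    total = 0
    for start in range(0, len(passes_needed), tile_size):
        tile = passes_needed[start:start + tile_size]
        total += max(tile) * len(tile) * n_rt
    return total


# One slow pixel forces 8 passes over everything in the full-size case,
# but only over its own tile in the tiled case.
needed = [1, 1, 1, 1, 1, 1, 8, 1]
print(readback_full(needed))      # 192
print(readback_tiled(needed, 2))  # 66
```

The saving depends on how clustered the slow pixels are: if every tile contains one slow pixel, tiling reads back just as much as the full-size target.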

Another idea, which would work in conjunction with sections, is to have one part of the data take over another part. Assuming each pixel has a value telling it what data to read, and a value saying if it's no longer used, you can run two sections like this:

First section shader:
If this section's pixel has stopped processing, and the second section's pixel (at the same position) is still processing, take the second section's data and process it.

Second section shader:
If this section's pixel is still processing, and the first section's pixel has stopped processing, stop processing. (The first section has taken over.)

This way some pixel processing will move from the second to the first section, making it more likely that the second section will finish after fewer passes, so the total amount of processing should be lessened.
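A toy model of the take-over idea, under simplifying assumptions: `work_a` and `work_b` are hypothetical lists giving how many more passes each pixel position needs in the first and second section, both sections are stepped in lockstep, and the cost of the handover itself (the first section's shader reading the second section's state) is ignored.

```python
def passes_without_takeover(work_a, work_b):
    """Each section runs on its own until its slowest pixel is done."""
    return max(work_a, default=0) + max(work_b, default=0)


def passes_with_takeover(work_a, work_b):
    """Count section passes when idle pixels in A adopt B's unfinished work."""
    a, b = list(work_a), list(work_b)
    passes = 0
    while any(a) or any(b):
        # Handover: a finished pixel in A takes the remaining work of the
        # pixel at the same position in B, and B's pixel stops.
        for i in range(len(a)):
            if a[i] == 0 and b[i] > 0:
                a[i], b[i] = b[i], 0
        # Each section that still has active pixels runs one pass.
        for section in (a, b):
            if any(section):
                passes += 1
                for i, remaining in enumerate(section):
                    if remaining > 0:
                        section[i] = remaining - 1
    return passes


print(passes_without_takeover([1, 5], [5, 1]))  # 10
print(passes_with_takeover([1, 5], [5, 1]))     # 6
```

In this example the second section finishes after a single pass because its slow pixel is handed to the first section. When both sections are uniformly slow, nothing is handed over and the pass count is unchanged.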

yk_cadcg    100
Thank you very much! The "smaller render target" idea is GREAT!!! It'll surely boost the performance a lot.
For the 2-sections idea, I'm not clear about "work in conjunction with sections":
Does "2 sections" mean 2 pixel shaders (PS)? 2 passes?
Does "work in conjunction" mean working sequentially, or at the same time (simultaneously)?
Yes, we do have that "each pixel has a value telling it what data to read, and a value saying if it's no longer used"; they are stored in the ping-pong render target textures.
Anyway, we could try the multipass & smaller RT idea first :)
Thank you.

ET3D    810
Quote:
Original post by yk_cadcg
For the 2-sections idea, I'm not clear about "work in conjunction with sections"

I meant that it could be used with the "small render targets". Instead of working on each of them separately, work on two at a time, with one being able to take work from the other.
