HLSL SM3: Looping over a 1280x720 texture sampler

15 comments, last by Styves 8 years, 2 months ago

Hi everyone!

I am trying to loop over a 1280x720 texture sampler in a pixel shader. This is the code I am using:


static const float2 rtFLFTexel = float2( 1.0f/vRenderTarget.x, 1.0f/vRenderTarget.y );
static const float2 rtFLHTexel = float2( 0.5f/vRenderTarget.x, 0.5f/vRenderTarget.y );

for( int x=0; x<vRenderTarget.x; x++ )
{
    for( int y=0; y<vRenderTarget.y; y++ )
    {
        //Get the current texture coordinates
        float2 uvScreen = float2(x,y)*rtFLFTexel;
        float2 uvTexture = uvScreen+rtFLHTexel;

        //Get main values from light info
        float4 myData = tex2Dlod(mySampler, float4(uvTexture,0,0));
        //.....do stuff
    }
}
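As a side check on the UV math in the loop above, here is a small CPU sketch (Python, with hypothetical helper names) that mirrors `x * rtFLFTexel + rtFLHTexel`. It assumes standard point-sampled texture addressing, where texel x covers [x/width, (x+1)/width) in UV space, and confirms that every computed UV lands on the centre of the intended texel. In other words, the addressing itself is not why only part of the texture gets read.

```python
# CPU mirror of the HLSL UV math: uv = x * (1/width) + 0.5/width.
# Assumption: texel x covers [x/width, (x+1)/width) in UV space, so
# (x + 0.5)/width is its centre. Helper names are hypothetical.

def texel_center_uv(x, y, width, height):
    """UV of the centre of texel (x, y), computed the way the shader does."""
    full_texel = (1.0 / width, 1.0 / height)   # rtFLFTexel
    half_texel = (0.5 / width, 0.5 / height)   # rtFLHTexel
    return (x * full_texel[0] + half_texel[0],
            y * full_texel[1] + half_texel[1])

def uv_to_texel(u, v, width, height):
    """Point-sample lookup: which texel does a UV fall into?"""
    return int(u * width), int(v * height)

W, H = 1280, 720
for x, y in [(0, 0), (639, 359), (1279, 719)]:
    u, v = texel_center_uv(x, y, W, H)
    print((x, y), "->", (round(u, 6), round(v, 6)), "->", uv_to_texel(u, v, W, H))
```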

The problem is that it only samples the first part of the texture, even though the variable vRenderTarget is set to (1280, 720). I have tried other functions such as tex2D with no success. I have also sampled it using the values (256, 256), but then I lose sample positions... How can I do it?

Thanks so much in advance :)


You can't.

1280x720 is 921,600, almost 1,000,000 texture samples, so the best case is that you're going to execute over 2,000,000 pixel shader instructions in your shader, for every single pixel.

No hardware is capable of that.
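To put rough numbers on the estimate above, the sketch below multiplies the sample count by a few instructions per loop iteration. The per-iteration counts (3 and 5) are illustrative assumptions standing in for a texture fetch plus a little address math and loop bookkeeping, not measured compiler output.

```python
# Back-of-the-envelope cost of the loop above for ONE pixel.
# Assumption: instrs_per_iteration values are illustrative guesses,
# not measured fxc output.

def loop_cost(width, height, instrs_per_iteration):
    samples = width * height                    # one tex2Dlod per iteration
    return samples, samples * instrs_per_iteration

for per_iter in (3, 5):
    samples, instrs = loop_cost(1280, 720, per_iter)
    print(f"{samples:,} samples x {per_iter} instr = {instrs:,} instructions per pixel")
```

Even at three instructions per iteration the total is well past two million, and that cost is paid again by every pixel the shader runs on.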

Perhaps you should talk some about what you're trying to do, rather than how you're trying to do it, and a better solution may present itself.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Thanks so much for your answer, mhagain :)

So what should it be? 256x256 = 65,536, so why does that work fine?

Sorry, I can't quite understand it.

Can't a for loop be turned into a simple 4-5 instructions that repeat continuously? :D Time cost doesn't matter here.

256x256 does fit within your hardware's capabilities. Still, 65536 samples per pixel is a very large number.

Depending on what you're trying to do, you could use mip-mapping to downsample your source texture so that you'd drastically reduce the amount of samples needed. Of course, this sacrifices a little bit of precision, but you would gain a lot of performance in return.
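To show how much the mip-mapping suggestion can save, here is a sketch that lists a mip chain for 1280x720 and the number of single-tap samples needed to read each level in full. It assumes each level halves both dimensions, rounding down with a minimum of 1, as in a standard full mip chain; `mip_chain` is a hypothetical helper.

```python
# Sketch: sample count per mip level of a 1280x720 texture.
# Assumption: each level halves width and height (rounded down, min 1).

def mip_chain(width, height):
    levels = [(width, height)]
    while width > 1 or height > 1:
        width, height = max(1, width // 2), max(1, height // 2)
        levels.append((width, height))
    return levels

for level, (w, h) in enumerate(mip_chain(1280, 720)):
    print(f"mip {level}: {w}x{h} = {w * h:,} samples")
```

Level 2 (320x180 = 57,600 samples) is the first whose total is below the 65,536 samples the 256x256 version already handles, although its 320-texel rows still exceed a 256-iteration loop; level 3 (160x90) fits both.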

Niko Suni

You should never write code like that for a GPU. GPUs work very, very differently to CPUs.
GPUs work well when you have thousands of threads (pixels), each doing a simple task (e.g. 1 to 10 texture reads).
Instead you've got what looks like one pixel doing an incredibly complex task: a million texture reads!
This will literally run your GPU at 0.1% efficiency, if it worked at all. You really need to break this task up into parallel units of work so that you can utilize the GPU properly with a large number of threads.
Tell us what problem you're actually trying to solve and we'll tell you how to parallelise it, so that it actually makes proper use of your GPU.

Another reason this is bad is that GPUs generally have to remain responsive at all times. If you stall the GPU, Windows can't render the desktop any more, so Windows will pull the rug out from under you and reset the GPU (its Timeout Detection and Recovery mechanism, TDR), causing a "lost device" error condition in your app. Writing a pixel shader that literally performs a million texture samples is a great way to make Windows think you've stalled the GPU and have it force-reset.

As for why it's not working at all... SM3's loops are very hacky when it comes to dynamic conditions. Assuming the compiler hasn't unrolled the loop, if you look at the asm, you'll see an abomination along these lines:


for( int x=0; x!=256; ++x )
{
  if( x>=vRenderTarget.x )
    break;
  // ...original loop body...
}

Looking at this generated code, it's obvious that your loop runs into SM3's maximum iteration limit before it ever has a chance to reach your own loop condition, which is why only the first part of the texture gets sampled.

Upgrading to SM4 removes this stupid loop limit... however, as explained above, this is not your biggest problem. Doing a million units of work per pixel is the real problem that needs addressing.
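The loop-limit behaviour described above can be emulated on the CPU. In this sketch the hardware counter is modelled as the 256-iteration loop from the pseudo-asm in the post; that value comes from the illustration, not from a spec, and `emulate_sm3_loop` is a hypothetical helper.

```python
# Sketch of why only "the first bits" get sampled.
# Assumption: HW_LOOP_LIMIT models the 256-iteration loop from the
# pseudo-asm above; the real limit depends on compiler and profile.
HW_LOOP_LIMIT = 256

def emulate_sm3_loop(n):
    """Indices actually visited by `for (x = 0; x < n; x++)` compiled this way."""
    visited = []
    for x in range(HW_LOOP_LIMIT):   # hardware-counted loop: x != 256
        if x >= n:                   # the dynamic condition becomes a break
            break
        visited.append(x)
    return visited

cols = emulate_sm3_loop(1280)
rows = emulate_sm3_loop(720)
print(f"columns visited: {len(cols)} of 1280, rows visited: {len(rows)} of 720")
print(f"fraction of texture sampled: {len(cols) * len(rows) / (1280 * 720):.1%}")
```

Under this model the nested loops stop after the first 256 columns and 256 rows, so roughly 7% of the 1280x720 texture is ever read, while a loop bound of 256 or less runs to completion.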

Your x and y should be supplied through a UV shader input (e.g. TEXCOORD or similar), and you render two triangles covering the render target area, which invokes your pixel shader once for every pixel of the surface :)
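A CPU model of that full-screen pass is sketched below. Instead of one pixel looping over the whole texture, the "rasterizer" calls a per-pixel function once per output pixel and hands it its own UV. `pixel_shader` and `draw_fullscreen` are hypothetical stand-ins; on the GPU the invocations run in parallel rather than in a Python loop, and the doubling is a placeholder for the real per-texel work.

```python
# CPU model of a full-screen pass: one pixel-shader invocation per pixel,
# each doing a single texture read at its own interpolated UV.
# Assumption: UVs sit exactly on pixel centres. In Direct3D 9 the quad is
# usually shifted by half a pixel to line pixel centres up with texel centres.

def pixel_shader(uv, texture, width, height):
    x, y = int(uv[0] * width), int(uv[1] * height)  # point sample
    return texture[y][x] * 2.0                       # placeholder "do stuff"

def draw_fullscreen(texture, width, height):
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):                          # parallel on a GPU
        for x in range(width):
            uv = ((x + 0.5) / width, (y + 0.5) / height)  # interpolated TEXCOORD
            out[y][x] = pixel_shader(uv, texture, width, height)
    return out
```

Each invocation does a constant, small amount of work, which is the shape of workload GPUs are built for.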

.:vinterberg:.


How could I parallelise this? I have a medium-to-large texture full of data that will make up another texture's data.

And I don't only have the problem of the number of samples the texture can have, but also an unavoidable loop limit imposed by SM3. I'm doomed! :D

One solution that comes to mind is to split the texture into smaller ones, but this will result in even more texture reads. Should it improve cache usage, though?...

I could also try unrolling the loop manually myself...

I really don't know how to approach the problem any other way....
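The tiling idea above can at least sidestep the loop limit. The sketch processes the texture in tiles so that no single loop exceeds 256 iterations, one "pass" per tile, and accumulates the partial results, which on the GPU might be done with something like additive blending. Assumptions: `TILE` matches the 256 limit discussed earlier, and summing texel values is a hypothetical stand-in for the real per-texel work. Note that this keeps the total work per output pixel the same; it only avoids the iteration cap, not the cost concerns raised earlier in the thread.

```python
# Sketch: split a large texture into tiles so each loop stays under a limit,
# accumulating partial results across passes.
# Assumptions: TILE mirrors the 256 loop limit; summing texels is a
# hypothetical stand-in for the real per-texel work.
TILE = 256

def process_tile(texture, x0, y0, x1, y1):
    total = 0.0
    for y in range(y0, y1):          # at most TILE iterations
        for x in range(x0, x1):      # at most TILE iterations
            total += texture[y][x]
    return total

def process_in_tiles(texture, width, height, tile=TILE):
    accum, passes = 0.0, 0
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            x1, y1 = min(x0 + tile, width), min(y0 + tile, height)
            accum += process_tile(texture, x0, y0, x1, y1)  # one "pass"
            passes += 1
    return accum, passes
```

For a 1280x720 texture this works out to 5 x 3 = 15 passes of at most 256x256 iterations each.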


I have no idea what your algorithm is, besides the fact that it uses a large texture as input... How can I answer? :P


But what exactly are you trying to do? All you're telling us is that you're trying to do lots of texture reads, but that's useless information. We already know that. Are you trying to do a bloom? A weighted average? Luminance calculations? Deferred rendering? Something else?


Oh, sorry, I am doing luminance calculations.

I am translating that texture into multiple lights for probes placed at specific locations.

I have it working with a 256x256 texture, but I want more samples. Is that possible? How can I do it in a performant, parallel way?
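For luminance work like the average of a whole image, a common GPU-friendly pattern is a parallel reduction: repeated full-screen passes into half-size render targets, where each output pixel combines at most a 2x2 block. This is a standard technique swapped in for illustration; the probe-light mapping described above may need something different. The sketch assumes an image given as rows of (r, g, b) floats and Rec. 709 luminance weights; all names are hypothetical.

```python
# CPU sketch of a parallel luminance reduction.
# Assumptions: image = list of rows of (r, g, b) floats; Rec. 709 weights;
# each reduce_pass models one draw into a half-size render target where
# every output pixel (one "thread") reads at most 4 texels.

def luminance(rgb):
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def reduce_pass(src):
    """Output pixel (x, y) sums the up-to-2x2 block starting at (2x, 2y)."""
    h, w = len(src), len(src[0])
    oh, ow = (h + 1) // 2, (w + 1) // 2
    dst = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):                  # these iterations are parallel on a GPU
        for x in range(ow):
            for sy in (2 * y, 2 * y + 1):
                for sx in (2 * x, 2 * x + 1):
                    if sy < h and sx < w:
                        dst[y][x] += src[sy][sx]
    return dst

def average_luminance(image):
    lum = [[luminance(p) for p in row] for row in image]
    count = len(lum) * len(lum[0])
    passes = 0
    while len(lum) > 1 or len(lum[0]) > 1:
        lum = reduce_pass(lum)
        passes += 1
    return lum[0][0] / count, passes
```

The passes carry sums rather than averages so that odd sizes (such as 45 or 5 texels) don't skew the result; the single division happens at the end. For 1280x720 this takes 11 passes, each reading every texel of its input exactly once, instead of one pixel performing 921,600 reads.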

