Is sampling textures slow?

6 comments, last by _the_phantom_ 10 years, 11 months ago

I have a pixel shader for simple sprites like this:


Texture2D    tex2D       : register(t0);
SamplerState samplerMode : register(s0);

float4 ps_Sprite( vs_out input_p ) : SV_TARGET{

	return tex2D.Sample( samplerMode, input_p.uv ) * input_p.color;
}

Should I check for alpha == 0 and return float4(0.0, 0.0, 0.0, 0.0) ? Should I also check for color (0.0, 0.0, 0.0) and return the color without sampling?

Conditional statements are likely far worse for the GPU than just crunching through, because it has to prepare to do both of the possible execution paths.


Sampling textures is something that GPUs are very specifically designed to be good at. It's cheaper than computation in a lot of cases. It's WAY better than conditionals on SIMD architectures; you're better off doing samples and multiplying by zeros at the end than making jumps which cause a bunch of pipelines to be parked and run later. Basically at each decision, one path is taken. All the pixels in the current batch which would go the other way are parked and will be restarted when the chosen path has finished. You can see how decisions are expensive.
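The "multiply by zeros at the end" idea above can be sketched like this (a hypothetical variant of the original shader; the `step()` threshold and semantics are assumptions, not from the original code):

```hlsl
Texture2D    tex2D       : register(t0);
SamplerState samplerMode : register(s0);

float4 ps_SpriteMasked(float2 uv : TEXCOORD0, float4 color : COLOR0) : SV_TARGET
{
    float4 texel = tex2D.Sample(samplerMode, uv);
    // step() yields 0.0 when alpha is (near) zero and 1.0 otherwise.
    // Every lane runs the exact same instructions, so no pixels get
    // parked the way they would around an 'if'.
    float mask = step(0.0001, texel.a);
    return texel * color * mask;
}
```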

On MIMD GPUs, there are usually massively specialised pipelines for doing texture fetches. The individual latency isn't great on these, but you don't care: you've got millions of pixels to paint. You don't care about the latency of each paint, you care about the total pixel throughput, and that can be insane. It's not unusual for 100 or 200 pixels at a time to be doing texture fetches... each one takes 200 cycles to complete, but one completes every cycle if you can keep the pipeline loaded.

Seriously: making the fetches efficient is what keeps GPU designers awake at night, so you don't have to.

I'm not sure where you got the idea that sampling textures might be slow - this is something that 3D accelerators have been doing since before 1996 (and we're talking D3D11, so we're talking reasonably modern hardware here, which makes the whole notion even odder). The whole question smells heavily of pre-emptive optimization to me.


I'm not sure where you got the idea that sampling textures might be slow - this is something that 3D accelerators have been doing since before 1996 (and we're talking D3D11, so we're talking reasonably modern hardware here, which makes the whole notion even odder). The whole question smells heavily of pre-emptive optimization to me.

Stop assuming, I asked a very simple question, I didn't get any ideas from anywhere.

"The whole question smells heavily of preemptive optimization to me"

Really? Explain please, because your post smells of wannabe all over it.. Can't I ask a simple question on a subject I don't know about without someone using the opportunity to say "premature ..root of evil"?

Your post is so contradictory in every sense. If it's so obvious ("since 1996..blablabla"), how can this be premature optimization? If it's obvious and I didn't know about it, I should have asked much sooner, since I should know the obvious information. Katie pretty much posted everything I wanted to know and more, and then there's your useless follow-up there just for the sake of promoting yourself over my lack of knowledge on the subject.. seriously, it's unbelievable.


You're not being persecuted by mhagain. Were you really that offended by his post?

Taking a simple, low level operation (ordinary texture sampling) and looking for a way to speed it up despite not having found it to be an actual bottleneck in a specific instance is pretty much the definition of premature optimization. I should know, I've been guilty of it far more times than I can count.

Sampling textures is something that GPUs are very specifically designed to be good at. It's cheaper than computation in a lot of cases.

Actually it is the contrary. Yes, sampling textures is very fast; however, computations are even faster. I can't find a good reference here, but the ALU:Tex ratio is getting higher, i.e. shaders should do more calculations for each texture fetch. Nowadays even look-up textures can be replaced with the actual calculations, at least in the simple cases.
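As a hypothetical sketch of that trade-off: a classic 1D look-up texture for gamma correction can be replaced by a few ALU instructions (names and the 2.2 approximation are assumptions, not from this thread):

```hlsl
// Before: one fetch into a 256-entry look-up texture.
// Texture1D gammaLUT : register(t1);
// float corrected = gammaLUT.Sample(pointClamp, c).r;

// After: a handful of ALU ops, no bandwidth or texture-cache pressure.
float3 applyGamma(float3 c)
{
    return pow(abs(c), 1.0 / 2.2);  // approximate sRGB encode
}
```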

It is true that conditionals may be something to avoid at some point.

Should I check for alpha == 0 and return float4(0.0, 0.0, 0.0, 0.0) ?

Only if you wish to output black pixels for texels with 0 alpha (although it would be faster to preprocess such a texture).

Should I also check for color (0.0, 0.0, 0.0) and return the color without sampling?

You mean that if your input_p.color is black you'd just return the color directly. The hardware would most likely still issue the texture sample anyway.

For such a short shader it is pretty useless to think about optimizations. Whatever you do, the speed increase is very likely zero.

Cheers!

Sampling textures is something that GPUs are very specifically designed to be good at. It's cheaper than computation in a lot of cases.


Actually it is the contrary. Yes, sampling textures is very fast, however, computations are even faster. I can't find good reference here but ALU : Tex ratio is getting higher ie. shaders should do more calculations for each texture fetch. Nowadays even look up textures can be replaced with the actual calculations, at least in the simple cases.
Well, to be correct about this, sampling textures may be fast depending on what is going on.

If you are lucky and your texture data sits in cache then yay! fast data return and all is good.
If you are unlucky and we have to make a request out to VRAM then noooo! slow data return as while GDDR has high bandwidth the latency is pretty damned poor - plus you are now queued with other requests.

The good news is that if the GPU has enough work to do then you might never notice these stalls as your thread group will get swapped out and someone else will get a run at the resources covering the latency of your data fetch. If you don't have enough work or other threads can't be swapped in however then the GPU will stall while it waits for data to come back so it can resume a thread group.

Note: thread group.

When thinking about branching you have to consider how it will affect all the threads running at a time in that group. The common group size on DX11 hardware (which I'll focus on, as the topic was tagged DX11) is 32 or 64 threads to a wavefront, which is all the threads working in lock-step together.

When it comes to branching the rules are simple.

if(conditionA)
{
     /* code block A */
}
else
{
    /* code block B */
}
/* code block C */
If the GPU is executing this code then ALL threads will evaluate 'conditionA'.
If all threads evaluate 'conditionA' to 'true' then they will all run 'code block A' followed by 'code block C'.
If all threads evaluate 'conditionA' to 'false' then they will all run 'code block B' followed by 'code block C'.

The fun starts when some threads evaluate to 'true' and some to 'false', at which point the GPU does the following;
- mask out all lanes which evaluated 'false'
- run 'code block A'
- mask out all lanes which evaluated 'true'
- run 'code block B'
- mask all lanes back in
- run 'code block C'

As you can see, the GPU has to execute both sides of the 'if' condition which, depending on the code in blocks A and B, could cause a major performance hit as it has to do all the work in both branches (texture fetches might not happen, depending on the architecture and other issues, so you can potentially save some bandwidth/texture cache).
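The masking steps above amount to predicated execution; a minimal sketch (function and variable names are hypothetical) of what a divergent wavefront effectively pays for:

```hlsl
// When lanes disagree on 'p', the hardware runs BOTH bodies under masks,
// so every lane pays roughly the cost of block A plus block B.
float4 shade(bool p, float4 texel, float4 tint)
{
    float4 a = texel * tint;        // code block A: p==false lanes masked out
    float4 b = float4(0, 0, 0, 0);  // code block B: p==true lanes masked out
    return p ? a : b;               // per-lane select; block C then runs for all
}
```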

That said, if you know your data and you know how things are going to branch on a group basis, then branching can help in certain cases (even on DX9 hardware, where I've had wins doing it).

The best example of this was on a game I worked on where our roads were drawn as a mesh overlaid onto the world geometry and blended in. The road texture consisted of a large section in the middle which was alpha=1, a few transition texels where alpha tended towards 0, and a few pixels where alpha=0 along the sides.

When applied to the terrain, large portions of the road mesh were covered with the alpha=0 part and others were alpha=1; by placing an 'if' statement on the initial alpha value (sampled from a texture), large amounts of work could be skipped (the road had other textures applied to it) and the pixels discarded (saving blending).

This worked well because inside a pixel quad the threads mostly had either everyone at alpha=0 or everyone at alpha=1, with only the small border section varying; and as the 'else' case was a simple 'discard' statement, the cost of running both code paths in that instance was small, resulting in a large performance increase on the target hardware.
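A minimal sketch of that road shader's early-out (all names are hypothetical; the real shader sampled several more textures for the alpha=1 path):

```hlsl
Texture2D    roadAlpha  : register(t0);
SamplerState linearWrap : register(s0);

float4 ps_Road(float2 uv : TEXCOORD0) : SV_TARGET
{
    float alpha = roadAlpha.Sample(linearWrap, uv).a;
    if (alpha <= 0.0)
    {
        // Cheap 'else' path: whole quads usually agree on this branch, so
        // divergence is rare, and discarded pixels also skip the blend.
        discard;
    }
    // ...the expensive multi-texture road shading would go here...
    return float4(0.5, 0.5, 0.5, alpha);  // placeholder result
}
```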

The key point to get across here is that branching CAN help but you have to be aware of the data you are feeding it and the amount of work to do.

In this specific case I probably wouldn't branch, as you don't really have enough work to do anyway; if the texturing statement were a bit more complex AND the colour == (0,0,0,0) case occurred with a frequency such that there was a high chance all the threads in a group would take one path or the other, then I'd be tempted to use it.

Branching has its uses; we are beyond the days of it totally destroying your performance if used, but you still have to use it sensibly.

And yes, burning ALU time is often better than burning bandwidth because, as mentioned, the ALU:tex ratio has long been shifting in that direction; modern GPUs are ALU monsters frankly.

This topic is closed to new replies.
