etothex

NVIDIA Cg tex2D function is limiting fps

I've determined that my fp30 pixel shader is dropping my fps by 30, from 62 to 32. The problem seems to be I do a bunch of lookups using tex2D. I'm just looking up adjacent pixels. I have "The Cg Tutorial" (purchased through GameDev) but it doesn't go into how expensive the lookups are. Is there any way to speed it up?

Quote:
Original post by Sneftel
Are you doing any dependent texture lookups? How many separate textures do you have bound, and how many tex2Ds do you perform?


Well, the program only shades one textured (non-multitextured) quad and performs 14 tex2Ds, I think. It's definitely a bottleneck, it seems, because if I replace the tex2D function with half3(1.0,1.0,1.0) it speeds up dramatically.

Maybe I should be clearer: this texture is the result of a scene rendered to a texture, and I want to do some postprocessing on it before displaying it to the screen. But I don't think that has any effect on it, because although I'm using multi-texturing on the actual scene, there's only one texture bound when drawing this quad.

I have a GeForce FX 5700 Ultra. So far I haven't had any problems with slowness of any function (though I could have just not noticed tex2D's slowness before). I have the latest NVIDIA drivers. Oh, and this is running on Linux/freeglut. But I don't think that has anything to do with it either.

Alright, so you're taking (I assume) multiple samples, each with a different offset from the "base" texture position. Now, are you computing these offsets in the pixel shader or the vertex shader?

Quote:
Original post by Sneftel
Alright, so you're taking (I assume) multiple samples, each with a different offset from the "base" texture position. Now, are you computing these offsets in the pixel shader or the vertex shader?


Well, here's a snippet from the shader:

const half offset=1.0/512.0;
for(int j = -3;j<=3;j++) {
res += tex2D(texture, coords + half2(offset*j, 0.0));
}

(or)

for(int j = -3;j<=3;j++) {
res += half3(0.1f, 0.1f, 0.1f);
}



The second one is much faster. I've tried manually unrolling the loop, replacing ints with floats/halfs, etc.

All this is happening in the pixel shader.

Aight, bingo. Consider what's happening: Every tex2D also involves a floating point multiplication and addition. Moreover, the driver has no hints as to what texture location you're going to be grabbing until those computations are done. Add that to the fact that a good optimizer will turn your half3 code into a single op, and it's no wonder you're seeing such a speedup without the tex2D.



Try this: offload the offset generation to your vertex shader and pack the coordinates into TEXCOORD0 through TEXCOORD6. That way the FP multiplication+addition is no longer computed seven times per pixel, but merely seven times per vertex plus interpolation (which your card is already really badass at doing).
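
For anyone who wants to convince themselves that moving the offsets to the vertex stage gives the same sample positions, here is a rough CPU-side sketch in Python (not shader code). It assumes a screen-aligned quad with u running from 0 to 1 across 512 pixels and the same 1/512 offset as the snippet earlier in the thread; `lerp`, `per_pixel_coords`, `per_vertex_coords` and `max_difference` are made-up names for illustration, and perspective correction is ignored.

```python
TEXEL = 1.0 / 512.0      # same constant as the shader snippet in the thread
OFFSETS = range(-3, 4)   # j = -3 .. 3, i.e. seven taps


def lerp(a, b, t):
    """Linear interpolation, standing in for the rasterizer's texcoord interpolation."""
    return a + (b - a) * t


def per_pixel_coords(u):
    """Original approach: seven multiply-adds evaluated for every pixel."""
    return [u + TEXEL * j for j in OFFSETS]


def per_vertex_coords(u_left, u_right, t):
    """Suggested approach: add the offsets once per vertex, then interpolate."""
    left = [u_left + TEXEL * j for j in OFFSETS]
    right = [u_right + TEXEL * j for j in OFFSETS]
    return [lerp(l, r, t) for l, r in zip(left, right)]


def max_difference(width=512):
    """Largest disagreement between the two approaches across one scanline."""
    worst = 0.0
    for x in range(width):
        t = (x + 0.5) / width          # pixel centre as a 0..1 fraction
        u = lerp(0.0, 1.0, t)          # base texcoord at this pixel
        a = per_pixel_coords(u)
        b = per_vertex_coords(0.0, 1.0, t)
        worst = max(worst, max(abs(p - q) for p, q in zip(a, b)))
    return worst
```

Because adding a constant commutes with linear interpolation, the two coordinate sets should agree up to floating-point rounding, so the only thing that changes is where the arithmetic runs.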

Thanks! I didn't think that it would cause trouble, since offset*j is a constant, but I guess the fp add shot the pipeline to hell anyway. :) ratings++

Though, hmm, it seems to just pass through the texture coords with no interpolation (so the fragment program receives the same texcoords for every pixel).

I'm sure it's something stupid I'm doing. :) Is there an enable I need to set or something in order to use the other semantics TEXCOORD1-6?

Guest Anonymous Poster
Many texture lookups are expensive.
The best approach is to compute as many texcoords as possible in the vertex shader (that way, the card's pixel shader has a better view of how it should sample the texture).
Then, sample at pixel corners rather than centers; that way you profit from bilinear interpolation, which already averages 4 samples per lookup.
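
To illustrate the corner-sampling trick, here is a small Python sketch (again not shader code) of a linearly filtered lookup in one dimension. The helper `bilinear_1d`, the clamp-to-edge behaviour and the sample data are assumptions made up for this example. In 1D a tap placed on the boundary between two texels averages 2 of them; the same idea at a corner in 2D averages 4.

```python
import math


def bilinear_1d(texels, u):
    """Linearly filtered lookup in a 1D row, clamped at the edges.

    `u` is in texel units, with texel i centred at i + 0.5.
    """
    x = u - 0.5
    i0 = math.floor(x)
    frac = x - i0
    last = len(texels) - 1
    a = texels[min(max(i0, 0), last)]
    b = texels[min(max(i0 + 1, 0), last)]
    return a * (1.0 - frac) + b * frac


row = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]

# A tap at a texel centre returns just that texel.
center_tap = bilinear_1d(row, 2.5)

# A tap on the boundary between texels 2 and 3 returns their average.
edge_tap = bilinear_1d(row, 3.0)


def box_filter_lerped(texels):
    """Average 8 texels using only 4 filtered taps placed on texel boundaries."""
    taps = [bilinear_1d(texels, u) for u in (1.0, 3.0, 5.0, 7.0)]
    return sum(taps) / len(taps)
```

With this data, the four boundary taps give the same result as averaging all eight texels directly, so in this simplified model 4 filtered lookups do the work of 8 unfiltered ones.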

If you can, separate horizontal sampling from vertical sampling by doing two passes. For an N x N kernel, that cuts the lookups per pixel from N*N to 2*N, which is on the order of the square root of the original count.
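
A quick way to check the two-pass idea is to run it on the CPU. The Python sketch below (not shader code) assumes a plain box filter with clamp-to-edge addressing; all function names are invented for the example. It compares a single-pass 2D blur against a horizontal pass followed by a vertical pass.

```python
def clamp(i, n):
    """Clamp an index into the range 0 .. n-1 (clamp-to-edge addressing)."""
    return min(max(i, 0), n - 1)


def samples_single_pass(n):
    """Lookups per pixel for an n x n kernel done in one pass."""
    return n * n


def samples_two_pass(n):
    """Lookups per pixel when the kernel is split into two 1D passes."""
    return 2 * n


def blur_2d(img, r):
    """Single-pass box blur with radius r: (2r+1)^2 lookups per pixel."""
    h, w = len(img), len(img[0])
    n = 2 * r + 1
    return [[sum(img[clamp(y + dy, h)][clamp(x + dx, w)]
                 for dy in range(-r, r + 1)
                 for dx in range(-r, r + 1)) / (n * n)
             for x in range(w)]
            for y in range(h)]


def blur_rows(img, r):
    """1D box blur along each row: 2r+1 lookups per pixel."""
    w = len(img[0])
    n = 2 * r + 1
    return [[sum(row[clamp(x + dx, w)] for dx in range(-r, r + 1)) / n
             for x in range(w)]
            for row in img]


def transpose(img):
    return [list(col) for col in zip(*img)]


def blur_two_pass(img, r):
    """Horizontal pass, then a vertical pass done as a row blur on the transpose."""
    horizontal = blur_rows(img, r)
    return transpose(blur_rows(transpose(horizontal), r))
```

For a 7 x 7 kernel this is 49 lookups per pixel in one pass versus 14 over two passes. The price is an extra render target and an extra full-screen pass, so whether it wins in practice depends on the kernel size.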

Make sure that you use a texture with minimal bpp. 16 bits should be OK, especially since you intend to filter it anyway.

Finally, if you want really large filter kernels, do more passes that use only 4 samples each (to balance texture reads, frame buffer writes and pixel shader overhead).
For instance, if you want a (rather coarse) 64-texel kernel width,
perform two 4-sample passes (the 4 lerped samples really perform as 8 unfiltered samples).

In any case, texture lookups can be expensive, but there are many ways to reduce the number of lookups.

(can we have screenies?)
