NVIDIA Cg tex2D function is limiting fps

6 comments, last by GameDev.net 19 years, 7 months ago
I've determined that my fp30 pixel shader is dropping my fps by 30, from 62 to 32. The problem seems to be I do a bunch of lookups using tex2D. I'm just looking up adjacent pixels. I have "The Cg Tutorial" (purchased through GameDev) but it doesn't go into how expensive the lookups are. Is there any way to speed it up?
Are you doing any dependent texture lookups? How many separate textures do you have bound, and how many tex2Ds do you perform?
Quote:Original post by Sneftel
Are you doing any dependent texture lookups? How many separate textures do you have bound, and how many tex2Ds do you perform?


Well, the program only shades one textured (non-multitextured) quad and performs 14 tex2Ds, I think. Definitely a bottleneck, it seems, because if I replace the tex2D function with half3(1.0,1.0,1.0) it speeds up dramatically.

Maybe I should be clearer: this texture is the result of a scene rendered to a texture, and I want to do some postprocessing on it before displaying it on the screen. But I don't think that has any effect on it, because although I'm using multi-texturing on the actual scene, there's only one texture bound when drawing this quad.

I have a GeForce FX 5700 Ultra, and so far I haven't had problems with any function being slow (though I could have just not noticed tex2D's slowness before). I have the latest NVIDIA drivers. Oh, and this is running on Linux/freeglut, but I don't think that has anything to do with it either.
Alright, so you're taking (I assume) multiple samples, each with a different offset from the "base" texture position. Now, are you computing these offsets in the pixel shader or the vertex shader?
Quote:Original post by Sneftel
Alright, so you're taking (I assume) multiple samples, each with a different offset from the "base" texture position. Now, are you computing these offsets in the pixel shader or the vertex shader?


Well, here's a snippet from the shader:
const half offset = 1.0/512.0;
for (int j = -3; j <= 3; j++) {
    res += tex2D(texture, coords + half2(offset*j, 0.0));
}

(or)

for (int j = -3; j <= 3; j++) {
    res += half3(0.1f, 0.1f, 0.1f);
}


The second one is much faster. I've tried manually unrolling the loop, replacing ints with floats/halfs, etc.

All this is happening in the pixel shader.
Aight, bingo. Consider what's happening: Every tex2D also involves a floating point multiplication and addition. Moreover, the driver has no hints as to what texture location you're going to be grabbing until those computations are done. Add that to the fact that a good optimizer will turn your half3 code into a single op, and it's no wonder you're seeing such a speedup without the tex2D.



Try this: offload the offset generation to your vertex shader and pack the coordinates into TEXCOORD0 through TEXCOORD6. That makes the FP multiplication+addition not an operation to be computed seven times per pixel, but merely seven times per vertex plus interpolation (which your card is already really badass at doing).
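A rough sketch of what that could look like, assuming the 512-texel-wide texture from the snippet above. The names blurVP, blurFP, BlurVerts, modelViewProj and scene are made up for illustration, and the final divide by 7 assumes a plain box filter, which may not be what the original shader does with res.

```cg
// Hypothetical vertex program: computes the seven horizontal tap
// positions once per vertex instead of once per pixel.
struct BlurVerts {
    float4 pos : POSITION;
    float2 tc0 : TEXCOORD0;
    float2 tc1 : TEXCOORD1;
    float2 tc2 : TEXCOORD2;
    float2 tc3 : TEXCOORD3;
    float2 tc4 : TEXCOORD4;
    float2 tc5 : TEXCOORD5;
    float2 tc6 : TEXCOORD6;
};

BlurVerts blurVP(float4 position : POSITION,
                 float2 coords   : TEXCOORD0,
                 uniform float4x4 modelViewProj)
{
    const float offset = 1.0 / 512.0;  // one texel, assuming a 512-wide texture
    BlurVerts OUT;
    OUT.pos = mul(modelViewProj, position);
    OUT.tc0 = coords + float2(-3.0 * offset, 0.0);
    OUT.tc1 = coords + float2(-2.0 * offset, 0.0);
    OUT.tc2 = coords + float2(-1.0 * offset, 0.0);
    OUT.tc3 = coords;
    OUT.tc4 = coords + float2( 1.0 * offset, 0.0);
    OUT.tc5 = coords + float2( 2.0 * offset, 0.0);
    OUT.tc6 = coords + float2( 3.0 * offset, 0.0);
    return OUT;
}

// Hypothetical fragment program: every tex2D now reads an interpolated
// coordinate directly, with no per-pixel arithmetic in front of it.
half4 blurFP(float2 tc0 : TEXCOORD0,
             float2 tc1 : TEXCOORD1,
             float2 tc2 : TEXCOORD2,
             float2 tc3 : TEXCOORD3,
             float2 tc4 : TEXCOORD4,
             float2 tc5 : TEXCOORD5,
             float2 tc6 : TEXCOORD6,
             uniform sampler2D scene) : COLOR
{
    half3 res = tex2D(scene, tc0).rgb;
    res += tex2D(scene, tc1).rgb;
    res += tex2D(scene, tc2).rgb;
    res += tex2D(scene, tc3).rgb;
    res += tex2D(scene, tc4).rgb;
    res += tex2D(scene, tc5).rgb;
    res += tex2D(scene, tc6).rgb;
    return half4(res / 7.0, 1.0);  // box-filter average (an assumption)
}
```

Both entry points can live in one file and be compiled separately with the matching vertex and fragment profiles.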
Thanks! I didn't think that it would cause trouble, since offset*j is a constant, but I guess the fp add shot the pipeline to hell anyway. :) ratings++

Though, hmm, it seems to just pass through the texture coords with no interpolation (so the fragment program receives the same texcoords for every pixel).

I'm sure it's something stupid I'm doing. :) Is there an enable I need to set or something in order to use the other semantics TEXCOORD1-6?
Many texture lookups are expensive.
The best approach is to compute as many texcoords as possible in the vertex shader (that way, the card's pixel shader has a better view of how it should sample the texture).
Then, sample at pixel corners rather than centers; that way you profit from bilinear interpolation, which averages 4 samples per lookup in the first place.
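A one-dimensional sketch of the bilinear trick, under a few assumptions: the texture is 512 texels wide, it is bound with linear filtering (e.g. GL_LINEAR for both min and mag), and coords lands on a texel center. The names boxFP and scene are hypothetical. Each fetch sits halfway between two texels, so the hardware returns their average and four fetches cover eight texels.

```cg
// Hypothetical fragment program: 4 bilinear fetches standing in for
// 8 unfiltered ones. Only valid if `scene` uses linear filtering.
half4 boxFP(float2 coords : TEXCOORD0,
            uniform sampler2D scene) : COLOR
{
    const float texel = 1.0 / 512.0;  // assumes a 512-wide texture
    half3 res = tex2D(scene, coords + float2(-2.5 * texel, 0.0)).rgb; // texels -3, -2
    res += tex2D(scene, coords + float2(-0.5 * texel, 0.0)).rgb;      // texels -1,  0
    res += tex2D(scene, coords + float2( 1.5 * texel, 0.0)).rgb;      // texels +1, +2
    res += tex2D(scene, coords + float2( 3.5 * texel, 0.0)).rgb;      // texels +3, +4
    return half4(res * 0.25, 1.0);    // average of the four fetches
}
```

Offsetting by half a texel in both axes instead puts each fetch on a texel corner, where one lookup averages a 2x2 block. The offsets here are still computed per pixel for brevity; they could equally be passed in as interpolated texcoords.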

If you can, separate horizontal sampling from vertical sampling by doing two passes. That roughly square-roots the number of lookups needed: an N x N kernel takes N*N samples in a single pass, but only 2N across two passes.
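A minimal sketch of a shader that could serve both passes, assuming a box filter and a hypothetical uniform texelStep that the application sets per pass (blur1D and source are likewise made-up names):

```cg
// Hypothetical 1D blur, run twice:
//   pass 1: source = scene texture,      texelStep = (1/512, 0)
//   pass 2: source = result of pass 1,   texelStep = (0, 1/512)
half4 blur1D(float2 coords : TEXCOORD0,
             uniform sampler2D source,
             uniform float2 texelStep) : COLOR
{
    half3 res = half3(0.0, 0.0, 0.0);
    for (int j = -3; j <= 3; j++) {
        res += tex2D(source, coords + texelStep * j).rgb;
    }
    return half4(res / 7.0, 1.0);  // box-filter average (an assumption)
}
```

For a 7x7 box this is 7 + 7 = 14 lookups per pixel over two passes instead of 49 in one, at the cost of an extra render-to-texture step.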

Make sure that you use a texture with minimal bpp. 16 bits should be OK, especially since you intend to filter it anyway.

Finally, if you want really large filter kernels, do more passes that use only 4 samples each (to balance texture reads, frame buffer writes and pixel shader overhead).
For instance, if you want a (rather coarse) 64-pixel kernel width,
perform two 4-sample passes (the 4 lerped samples really perform as 8 unfiltered samples).

In any case, texture lookups can be expensive, but there are many ways to reduce the number of lookups.

(can we have screenies?)

This topic is closed to new replies.
