
Trilinear texture filtering


I was wondering if anyone knew where I could find some information on how trilinear texture filtering is implemented on GPUs. I remember that in the past GPU vendors would claim that their cards could perform a trilinear filtered texture sample per cycle. It would be interesting to know the architectural details of how that was accomplished and how things may have changed now that GPU architectures have become more general purpose. Information on how it might be efficiently implemented in software using SIMD would also be welcome.


The calculation of memory addresses for texture sampling is somewhat specialized logic. I wouldn't be surprised if even the latest GPU hardware had dedicated units for this specific purpose.

 

Conceptually, the operation finds the texels nearest to the sample coordinates, loads said texels, blends them based on their distance from the sample point, and stores the result to shader-accessible registers. In practice, there is a bit more complexity than this. :)

 

For selecting the mip levels to sample from, the sampler observes the partial derivatives ddx and ddy of the sample coordinate across the screen, and uses the length of the derivative vectors to determine the effective detail level of the current output pixel. Depending on the "sampling quality level" (as specified in the control panel of some drivers), the minimum, maximum or an arbitrarily biased value of [ddx, ddy] may also be used.

 

Additionally, when anisotropic filtering is enabled, the hardware considers the direction and maximum dimension of the derivative vector and takes n samples along it. This is done to gather enough data for the sample even though one of the sample coordinate derivatives might be considerably smaller than the other (when the geometry slopes steeply with respect to screen space), and it effectively improves the detail level of the texture when the texture is not viewed exactly straight on.

Edited by Nik02


I think I'd be more interested in how they manage to hide the memory latency considering that in the case of a monochrome texture they have to read 8 bytes, maybe from 8 different cache lines.


Because of texture-coordinate swizzling (Morton order) it will hopefully be two cache lines, one per mip level (not counting anisotropic filtering), and the fetches will probably be done in parallel. Caches etc. are certainly optimized to match common usage as well.
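The swizzling mentioned above can be sketched with the textbook bit-interleave below, which keeps 2D-adjacent texels adjacent in memory so a 2x2 bilinear footprint usually lands in one cache line (actual GPU tiling layouts vary and are generally proprietary):

```c
#include <stdint.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t v) {
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Morton (Z-order) index: interleave the bits of x and y, so texels
 * that are close in 2D get addresses that are close in memory. */
uint32_t morton2d(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}
```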

For GPUs, latency is also very well hidden by multiple pixels being shaded at once (similar to hyperthreading). As a simplified example, say the shader is run as 25 instances per core, and there are 4 cycles of calculations setting up texture coordinates and starting a texture read, followed by 100 cycles of latency waiting for the texture data. That would allow the first instance to do its 4 cycles and start a texture fetch, at which point the scheduler would suspend that instance and tell a second instance to run and do its 4 cycles followed by its fetch, and so on, until all 25 instances are done with their first 4 cycles and 100 cycles have passed.

At that point the results from the texture-fetches have started coming in, and the first instance is allowed to resume, now with its texture data readily available in a register, and each instance is run in sequence again doing what they need to do once the texture color is available. So the entire 100 cycle latency is hidden in this case.

Most of the 25 instances will also probably fetch texture-data from shared cache-lines so not all 25 fetches have to result in actual texture reads (which may matter more or less in real scenarios).

There is often a very large number of registers in total per core, and each of the 25 instances may have, for example, 10 registers each (if there are 250 per core), which makes switching between them possible without wasting any cycles.

If each shader instance required only 5 registers, for example, then with 250 available, 50 instances could run on one core instead, which would allow 200 cycles of latency to be completely hidden instead of just 100, for the same 4 cycles of calculations per texture fetch. Exactly how this works probably differs by GPU.
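The occupancy arithmetic from the example above can be written down directly (this encodes the post's deliberately simplified model, with its hypothetical numbers; real GPUs schedule in warps/wavefronts and partition register files with more constraints):

```c
/* Simplified model: each instance computes for a few ALU cycles, issues
 * a texture fetch, and yields; the register file caps how many
 * instances can be resident at once. */
int instances_per_core(int regs_per_core, int regs_per_instance) {
    return regs_per_core / regs_per_instance;
}

/* Total fetch latency that the other instances' ALU work can cover. */
int latency_covered(int instances, int alu_cycles_per_instance) {
    return instances * alu_cycles_per_instance;
}
```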


Edited by Erik Rufelt


In addition to the tiled texturing and the hardware hiding memory latency across multiple threads, it should also be mentioned that both AMD and Nvidia have texture filtering quality options in their respective driver control panels. These let you choose between quality, speed, and a mix of the two. As to how they work... I don't remember.
