Texture memory access patterns

How is the texture cache constructed? (I'd assume different hardware has different implementations, but wouldn't there be some similarities?) From what I've read, texture memory is just global memory with a dedicated texture cache that is designed for better memory access when threads in the same warp read data that is near each other in 2D space. What counts as "near" in 2D space? If a thread requests data from (5,5), what data ultimately gets sent to the cache along with it? Does it depend on the data type as well? If your warp size is 32, what kind of grid pattern would you use to most efficiently read each texel (2x16, 4x8, 8x4, etc.)?
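To make the question concrete, here's the sort of access pattern I'm asking about (a minimal CUDA sketch; readTile, texObj, and out are made-up names, not from any documentation):

// With blockDim = (8, 4), each block is exactly one 32-thread warp
// covering an 8x4 texel tile, so neighboring lanes fetch texels that
// are adjacent in both x and y.
__global__ void readTile(cudaTextureObject_t texObj, float* out,
                         int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D<float>(texObj, x + 0.5f, y + 0.5f);
}

// Launched with, e.g.:
//   dim3 block(8, 4);
//   dim3 grid((width + 7) / 8, (height + 3) / 4);
//   readTile<<<grid, block>>>(texObj, out, width, height);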

The documentation on global memory access is quite detailed, but I can't seem to find much about texture memory access (maybe because the implementation varies too much from hardware to hardware).

I don't think the GPU manufacturers release detailed info on their implementations, but Fabian Giesen has a series of articles, "A trip through the Graphics Pipeline", that provides some info.

Specifically this entry talks about the texture cache:

http://fgiesen.wordpress.com/2011/07/04/a-trip-through-the-graphics-pipeline-2011-part-4/

You already mentioned the main reason you can't find standard info on the topic - it is vendor-specific and isn't exposed to the APIs for muddling with, so details are generally scarce. However, there is quite a bit of information floating around in the CUDA and OpenCL documentation about memory accesses and the like. You will probably gain some insight into how texture caches work from reading up on those APIs.

In general (from what I have read over the years), the texture cache is a 2D cache that evolved with the GPU and is based on the rasterization group size. The old way to produce pixels was to grab a group of them from a rasterized primitive and process them as a block. The texture cache is probably sized so as to make the best use of those blocks' spatial coherency.

Of course, that was all before the generalized stream processors, so it might all be different now... Sorry, but I don't know much more about it than that...

Thanks for the responses. In regard to the CUDA and OpenCL documentation, I've read most of it, and while it gives lots of details and guidelines on global memory access, it doesn't say much about texture memory. The only guideline it provides is to maintain 2D spatial coherency when using textures (although it never explicitly defines what spatial coherency means). The CUDA documentation is extremely detailed about how to get coalesced global memory access within a warp, how to avoid memory bank conflicts, and many other optimizations, so it's surprising there is next to nothing about how to minimize texture cache misses. I would think there would be at least a guideline for which texels each thread in a warp should access to achieve the greatest memory throughput. Wouldn't the performance be different if each warp read texels in a 32x1 pattern compared to a 16x2 or an 8x4?
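To make the comparison concrete, here's a sketch of what I mean (hypothetical names throughout; TILE_W selects the warp's footprint):

// Remap each warp's flat lane index onto a TILE_W x (32 / TILE_W) texel
// tile: TILE_W = 32 gives 32x1, 16 gives 16x2, 8 gives 8x4. The same
// texels are read either way; only the 2D footprint per warp changes.
template <int TILE_W>
__global__ void readShaped(cudaTextureObject_t texObj, float* out,
                           int width, int height)
{
    const int TILE_H = 32 / TILE_W;
    int lane = threadIdx.x & 31;                     // lane within the warp (blockDim.x a multiple of 32)
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;  // global warp index

    int warpsPerRow = (width + TILE_W - 1) / TILE_W; // tiles across one row of the image
    int x = (warp % warpsPerRow) * TILE_W + lane % TILE_W;
    int y = (warp / warpsPerRow) * TILE_H + lane / TILE_W;

    if (x < width && y < height)
        out[y * width + x] = tex2D<float>(texObj, x + 0.5f, y + 0.5f);
}

Timing the three instantiations (TILE_W = 32, 16, 8) with cudaEvent timers on the actual hardware would presumably answer the question better than the documentation does.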

The article that phil_t linked to was very helpful and provided lots of insight into how texture memory works in the graphics pipeline. One section mentions how the L1 texture cache can take advantage of compressed texture formats. These formats are compressed in blocks of 4x4 pixels; when a texel is requested, its block is decompressed and stored in the L1 texture cache. If the threads in the same warp use some of those 16 pixels, you can get multiple pixels' worth of data from one memory fetch and decompression (well, if I understood the article correctly). So I suppose I'll stick to reading texels in a 4x4 pattern within a warp, unless someone tells me otherwise.
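For what it's worth, my mental model of the block arithmetic is something like this (the helper name is made up):

// Each 4x4 texel block of a block-compressed format is fetched and
// decompressed as a unit, so texels that map to the same block index
// come along "for free" after the first access.
__host__ __device__ inline int blockIndex(int x, int y, int texWidth)
{
    int blocksPerRow = (texWidth + 3) / 4;   // blocks are 4 texels wide
    return (y / 4) * blocksPerRow + (x / 4);
}

// E.g. texels (5,5), (6,5), and (5,6) all land in block (1,1), and a
// warp whose 32 lanes cover an aligned 8x4 region touches only two
// 4x4 blocks in total.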
