How is the texture cache constructed (I'd suppose different hardware would have different implementations, but wouldn't there be some similarities)? From what I've read, texture memory is just global memory with a dedicated special texture cache that is designed for better memory access when threads in the same warp read data that is near each other in 2D space. What constitutes as being "near" in 2D space? If a thread requests data from (5,5), what data ultimately gets sent to the cache along with it? Does it depend on the data type as well? If your warp size is 32, what type of grid pattern would you use to most efficiently read/write to each texel (2x16, 4x8, 8x4, etc.)?
The documentation on global memory access is quite detailed, but I can't seem to find much about texture memory access (maybe because the implementation varies too much from hardware to hardware).