Thanks for the responses. Regarding the CUDA and OpenCL documentation, I've read most of it, and while it gives plenty of detail and guidelines on global memory access, it says very little about texture memory. The only guideline it offers is to have 2D spatial coherency when using textures (although it never explicitly defines what is meant by spatial coherency). The CUDA documentation is extremely detailed about how to get coalesced global memory access within a warp, how to avoid shared memory bank conflicts, and many other optimizations, so it's surprising there is next to nothing about minimizing texture cache misses. I would expect at least a guideline for which texels each thread in a warp should access to achieve the greatest memory throughput. Wouldn't performance differ if each warp read its texels in a 32x1 pattern compared to a 16x2 or an 8x4?
The article that phil_t linked to was very helpful and provided a lot of insight into how texture memory works in the graphics pipeline. One section described how the L1 texture cache can take advantage of compressed texture formats: these formats are compressed in blocks of 4x4 pixels, and when requested, a block is decompressed and stored in the L1 texture cache. If the threads in the same warp use several of those 16 pixels, you get multiple pixels' worth of data for a single memory fetch and decompression (well, assuming I understood the article correctly). So I suppose I'll stick with reading texels in a 4x4 pattern within a warp, unless someone tells me otherwise.
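As a sanity check on that reasoning, here's a toy model (my own, and it assumes the warp's footprint is aligned to the 4x4 block grid) that counts how many distinct compressed blocks each footprint shape would have to fetch and decompress:

```c
#include <string.h>

/* Count how many distinct 4x4 compressed blocks a warp touches when its
 * 32 lanes read a (wTile x 32/wTile) rectangle of texels, assuming the
 * rectangle starts on a block boundary. A toy model of the decompression
 * granularity, not anything measured on real hardware. */
static int blocks_touched(int wTile)
{
    int seen[8][8];              /* block grid; 8 columns covers 32x1 */
    memset(seen, 0, sizeof seen);
    int count = 0;
    for (int lane = 0; lane < 32; ++lane) {
        int bx = (lane % wTile) / 4;   /* 4x4 block column */
        int by = (lane / wTile) / 4;   /* 4x4 block row    */
        if (!seen[by][bx]) { seen[by][bx] = 1; ++count; }
    }
    return count;
}
```

By this count a 32x1 read touches 8 blocks, 16x2 touches 4, and 8x4 (i.e. two adjacent 4x4 blocks) touches only 2, which is at least consistent with preferring square-ish footprints, though real alignment and cache behavior could change the numbers.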