Based on NVIDIA's guidelines for CUDA and OpenCL (the DirectCompute documentation is quite lacking), the largest memory transaction size for compute capability 2.0 is 128 bytes, while the largest word a single thread can access is 16 bytes. Global memory accesses can be coalesced when the data accessed by the threads in a warp falls into the same 128-byte segment. With this in mind, wouldn't structured buffers be detrimental to memory coalescing if the structure is larger than 16 bytes?
Suppose you have a structure of two float4's, call them A and B. For an instruction issued in a non-divergent warp, each thread can access either A or B in a single memory transaction, but not both. The memory layout would look like ABABABAB. If you're trying to read consecutive structures into shared memory, wouldn't memory bandwidth be wasted by storing the data this way? For example, the instruction only reads the A elements, but the hardware coalesces the memory transaction so that it reads 128 bytes of consecutive data, half of which is B elements. Essentially, you're wasting half of your memory bandwidth. Wouldn't it be better to store the data like AAAABBBB, which is a structure of buffers instead of a buffer of structures? Or is this handled by the L1 cache, where the B elements are cached so you can access them faster when the next instruction reads the B elements? The only other solution I can see would be to have even-numbered threads access the A elements while odd-numbered threads access the B elements.
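To make the ABABABAB versus AAAABBBB comparison concrete, here is a minimal sketch that models only the address arithmetic. It assumes that 16-byte accesses are serviced 8 threads at a time (8 x 16 = 128 bytes) and it ignores the L1/L2 caches entirely, so treat it as a back-of-the-envelope model rather than a description of what the hardware actually does. The names `segments_touched` and `read_efficiency` are made up for this illustration.

```python
SEGMENT_BYTES = 128        # largest transaction size cited above for compute capability 2.0
FLOAT4_BYTES = 16          # largest word a single thread can access
THREADS_PER_REQUEST = 8    # assumption: 16-byte accesses are serviced 8 threads at a time


def segments_touched(addresses, access_bytes=FLOAT4_BYTES):
    """Return the set of 128-byte segment indices covered by the given byte addresses."""
    touched = set()
    for addr in addresses:
        first = addr // SEGMENT_BYTES
        last = (addr + access_bytes - 1) // SEGMENT_BYTES
        touched.update(range(first, last + 1))
    return touched


def read_efficiency(addresses):
    """Fraction of the fetched bytes that the threads actually asked for."""
    fetched = len(segments_touched(addresses)) * SEGMENT_BYTES
    useful = len(addresses) * FLOAT4_BYTES
    return useful / fetched


# Buffer of structures (ABABABAB): thread i reads the A field of element i, stride 32 bytes.
aos_a = [i * 2 * FLOAT4_BYTES for i in range(THREADS_PER_REQUEST)]
# Structure of buffers (AAAA...BBBB...): thread i reads A[i], stride 16 bytes.
soa_a = [i * FLOAT4_BYTES for i in range(THREADS_PER_REQUEST)]

print("AoS read of A:", read_efficiency(aos_a))  # 0.5
print("SoA read of A:", read_efficiency(soa_a))  # 1.0
```

Under these assumptions the interleaved layout touches two segments to deliver one segment's worth of A data, which is the "half the bandwidth wasted" case described above. Whether the cache later recovers the B half is exactly the open question, and this model does not answer it.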
Hopefully I explained this well enough for someone to understand. I would ask this on the NVIDIA developer forums, but I think they're still down. Visual Studio keeps crashing when I try to run the NVIDIA Nsight frame profiler, so it's difficult to see how the memory bandwidth is affected by changes in how the data is stored. P.S., has anyone been able to successfully run the NVIDIA Nsight frame profiler? Thanks!
I did some testing and found that I could get up to a 2x speedup by arranging the data in an AAAABBBB fashion (for a structure of 2 float4's). If you have a structure of one float4, you can get at most 8 float4 accesses in one memory transaction (the full 128-byte segment). If you have a structure of 2 float4's, you can get at most 4 float4 accesses. If you have 3 float4's, you only get 2 or 3 float4 accesses per transaction, because 48-byte structures do not line up on the 128-byte boundaries. Many advanced shaders are limited by memory bandwidth, so doubling the usable memory bandwidth could come close to doubling the frame rate for those shaders. If you can split your structures so that values of the same type (max size of 16 bytes) sit next to each other (at least 128 bytes' worth of data), you will most likely see a significant increase in frame rate. If someone else would like to try this on their shader and report back, I'd like to see the results.
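The per-transaction counts can be checked with a bit of arithmetic. The sketch below assumes a tightly packed buffer of structures starting on a 128-byte boundary and counts how many first-field float4's land in each 128-byte segment; `useful_float4s_per_segment` is a hypothetical helper, and caching is again ignored.

```python
SEGMENT_BYTES = 128   # transaction size cited above
FLOAT4_BYTES = 16     # size of one float4


def useful_float4s_per_segment(float4s_per_struct, num_segments=12):
    """Count the first-field float4's that start in each 128-byte segment
    of a tightly packed buffer of structures."""
    stride = float4s_per_struct * FLOAT4_BYTES
    counts = [0] * num_segments
    addr = 0
    while addr < num_segments * SEGMENT_BYTES:
        counts[addr // SEGMENT_BYTES] += 1
        addr += stride
    return counts


for n in (1, 2, 3):
    counts = useful_float4s_per_segment(n)
    print(n, "float4(s) per structure:", min(counts), "to", max(counts),
          "useful float4's per segment")
```

With this model, one float4 per structure gives 8 useful float4's per segment and two give 4. With three float4's the count varies between 2 and 3 from segment to segment, since the 48-byte stride drifts relative to the 128-byte boundaries.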
Also, I updated my FFT shader yesterday. I'm now using an SoA layout instead of an AoS layout for the groupshared memory. The time for my bloom technique, which uses the FFT shader 4 times, dropped from 2.9 ms to 2.1 ms.