# OpenGL One Buffer Vs. Multiple Buffers...

## Recommended Posts

Hello,

Consider this a consultation of opinion question:

Is it better to use one interleaved buffer or multiple separate buffers for vertex/normal/UV data?

Was reading this question and response here: http://stackoverflow.com/questions/12245687/does-using-one-buffer-for-vertices-uvs-and-normals-in-opengl-perform-better-tha

Just seeing what is better; I have implemented both, but my framerate was high enough in both cases that I didn't notice a big difference.

##### Share on other sites

Is it better to use one interleaved buffer or multiple separate buffers for vertex/normal/UV data?

Yes to interleaved buffer - the AoS layout is preferable.

Think about what is actually happening internally. You've got a warp of 32 elements being processed. Each one has a vertex input structure, which is going to be a single block of memory containing all of the vertex attributes. What the hardware wants to do is load the entire vertex into the registers of the processing cores. Most likely it's capable of doing this for an entire warp at once. Interleaved attributes allow it to do a single block memory copy to set up all of the vertices to run a warp.

When I was working at NV (2006), it was often the case that non-interleaved streams would be soft interleaved by the driver as part of draw call setup before being rendered. I'm not sure if that's still required on modern hardware or not. But it's best to assume that the underlying hardware is in most cases only able to work with interleaved data.

Current official advice is to de-interleave only when the update frequencies of the buffer are different.

This advice may be further distorted by situations involving lots of data transfer, i.e. dynamic/streaming buffers. See L. Spiro's comments here:

http://lspiroengine.com/?p=96

Personally I've never seen a reason to worry about this particular bandwidth issue but he probably has more 'in the field' experience than I do on PC platforms and may be able to expand on those comments.

Edited by Promit

##### Share on other sites
Excellent information, thank you.

Very helpful. Other replies always welcome.

##### Share on other sites

Making separate buffers for attributes can be wise when it lets you trade X vertex-buffer changes of batched interleaved buffers per frame for a single multi-buffer binding over a group of frames (or the whole application). Yet, in my eyes, the driver either does a sad, cache-coherency-unfriendly delivery of the vertex attributes, or interleaves the separate buffers into one interleaved buffer in the end, resulting in the same logic as rebinding 6 batched interleaved vertex buffers per frame. Who knows.

##### Share on other sites

Making separate buffers for attributes can be wise when it lets you trade X vertex-buffer changes of batched interleaved buffers per frame for a single multi-buffer binding over a group of frames

In this situation, it's better to build all the interleaved variations manually before uploading them to the driver, then just pick the most appropriate vertex buffer. I have seen engines that know what shaders are used for what mesh and create the correct custom buffer for that situation.

Always seemed like a hassle to me though.

##### Share on other sites

This is very interesting, because I've got very different results (GTX 670 and 480).

My use case is a compute-shader tree traversal; the nodes have vec4 data for position, color, and direction, plus integer data packed in a uvec4 (tree indices etc.).

First I used a single shader storage buffer the AoS way. That was too slow, so I tried putting each vec4 in its own texture, i.e. SoA.

I can't remember exactly, but the speed-up was 10-30 times, I think.

Does this make sense? I assumed SoA is faster because multiple texture units can be used to grab the data.

The other factor is that I do not need to read all the data for every node I visit, which is a difference from the fixed vertex-pipeline example above.

I do not need to read the node direction if the position is already too far away, etc., and with the AoS method I read the full struct in every case before any test.

But I doubt this alone explains the huge speed-up.

Please let me know what you think; I'm new to GPUs and it's still hard to predict performance.

And there are crazy things happening, e.g. sometimes it's faster to reserve shared memory without using it.

Sounds stupid, but it's true, especially for simple shaders. Someone discovered this before and posted on the NV forum, but there was no official response.

I assume the reason is that reserving shared memory prevents the thread scheduler from doing too much task switching.

I don't know if this happens in other APIs too (CUDA, OpenCL, DirectX).

Other crazy things are:

It's faster to do a blur on the tree using all 4 children, 4 neighbours, and the parent than to do a simple color averaging from children to parent on the same tree. ???

It's >2x faster to do a stackless but very divergent tree traversal with one thread per node than to do a perfectly coherent (data/code/runtime) parallel traversal using a stack (the stack is too big for shared memory).

I really have the feeling that drivers are not well polished for compute-shader performance, but it's a little too much work to port to CUDA to see the difference...

Any thoughts welcome :)

##### Share on other sites

Please let me know what you think; I'm new to GPUs and it's still hard to predict performance.

Experts can't predict performance either. We're just driven by common guidelines AND THEN WE TRY OUR OPTIONS AND PROFILE.

Computing in general (not just GPUs; also CPUs, RAM, caches, etc.) has become so complex that it's virtually impossible to accurately predict which approach is going to be faster (although we can make educated guesses).
For example, I've seen a case where adding extra instructions to a tight CPU loop caused the loop to execute faster on Haswell chips.

The reason was a store-to-load forwarding stall: adding an instruction allowed the CPU to avoid a full pipeline stall on every iteration.

It is anecdotal and was a rather synthetic benchmark (not real-world code), but the point is that it is completely unintuitive that adding an instruction would help the code run faster, which is a great example of how modern architectures are so complex we can't grasp it all.

Stop asking, just try, profile, and share the results.

##### Share on other sites

Stop asking, just try, profile, and share the results.

I agree; in this case I did try both and the results were initially the same for me, BUT I keep reading everywhere that interleaved is better, so for now I will work with interleaved buffers.

##### Share on other sites

And there are crazy things happening, f. ex. sometimes it's faster to reserve shared memory without using it.

This is actually not that uncommon. The problem is that the cores have only a limited amount of register space (64 K registers per SMX), which gets divided up among however many threads are running in parallel. So if you are running 1024 threads per SMX, every thread can use up to 64 registers. If you are running the maximum of 2048 threads, every thread only gets 32 registers. If more local variables are needed than registers are available, some registers are spilled onto the stack, similarly to how it's done on the CPU. But contrary to the CPU, GPU memory latencies are incredibly high, so spilling often-needed values onto the stack can increase the runtime.

Now, shared memory is also a restricted resource (64 KB per SMX on Kepler), but one that can't be spilled. So, if every block needs less than 2 KB, you can get the maximum of 32 resident blocks per SMX. But if you increase the amount of reserved shared memory, let's say to just below 4 KB, then you can only have 16 resident blocks. Halving the number of resident blocks also halves the total number of resident threads, so each thread has twice as many registers at its disposal.

So, increasing the amount of reserved shared memory can decrease the number of resident blocks/threads, which increases the number of registers each thread can use, which can reduce register spilling and costly loads from the stack. I don't know about compute shaders, but for CUDA I believe the profiler can check for this.

##### Share on other sites

This helps a lot (i copied this post to my source to read again later).

It also explains why performance is a matter of number of threads vs. needed memory, which is what I've recognized over the last months.

There is one thing I got very wrong all this time: I thought a single SMX could only run 32 threads, and that if you wanted more they would be spread across multiple SMX units.

I really understand much better now :)

Many thanks!

##### Share on other sites

Experts can't predict performance either. We're just driven by common guidelines AND THEN WE TRY OUR OPTIONS AND PROFILE.

But let's be fair - most of us can't afford test labs with multiple generations of GPUs, multiple versions of drivers, and so forth. Never mind the time consumed in testing the different permutations. A different driver (or version) has the potential to randomly upend the whole thing, and it's not like NV or AMD are about to help us out with real application profiles. So most of us are stuck with the handful of hardware/software configurations at hand, making educated guesses about the rest. I'm probably better off than most here - for my part I have access to a 6770, a 7970, a GTX 480, a GTX 670, a Titan, and a couple of MacBooks' worth of mobile GPUs. If this were strictly hobby work, I'd probably be lucky to have even one sample each of NV and AMD available.
