Hey!
So I decided to try some fun little compute shader experiments and see if I could make a simple box filter faster by using shared memory. I now have 4 short compute shaders that do the same thing using different techniques, but the performance I'm getting is quite baffling. The visual result of all 4 methods is exactly the same. Varying the work group and filter sizes affects performance of course, but the relative performance between the different techniques is pretty much constant.
1. The first implementation simply gathers all samples using imageLoad() and was intended to be some kind of reference implementation.
#version 430
layout (binding = 0, rgba16f) uniform image2D inputImg;
layout (binding = 1, rgba16f) uniform image2D outputImg;
layout (local_size_x = WORK_GROUP_SIZE) in;

void main(){
    ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
    vec3 total = vec3(0);
    for(int i = -FILTER_RADIUS; i <= FILTER_RADIUS; i++){
        total += imageLoad(inputImg, pixelPos + ivec2(i, 0)).rgb;
    }
    total /= FILTER_RADIUS*2 + 1;
    imageStore(outputImg, pixelPos, vec4(total, 1));
}
2. The second one is identical, except it reads from a texture using texelFetch() instead of using imageLoad() to take advantage of the texture cache.
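For reference, the texelFetch() variant is just the first shader with the input bound as a sampler instead of an image; only the input declaration and the load line change (the sampler name here is a placeholder, but the idea is the same):

```glsl
#version 430
layout (binding = 0) uniform sampler2D inputTex; // input bound as a texture
layout (binding = 1, rgba16f) uniform image2D outputImg;
layout (local_size_x = WORK_GROUP_SIZE) in;

void main(){
    ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
    vec3 total = vec3(0);
    for(int i = -FILTER_RADIUS; i <= FILTER_RADIUS; i++){
        // texelFetch reads through the texture cache; lod 0, no filtering
        total += texelFetch(inputTex, pixelPos + ivec2(i, 0), 0).rgb;
    }
    total /= FILTER_RADIUS*2 + 1;
    imageStore(outputImg, pixelPos, vec4(total, 1));
}
```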
3. After that I implemented a more advanced version based on http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Efficient%20Compute%20Shader%20Programming.pps (AMD's "Efficient Compute Shader Programming" slides), which caches the sampled values in shared memory.
#version 430
layout (binding = 0, rgba16f) uniform image2D inputImg;
layout (binding = 1, rgba16f) uniform image2D outputImg;
layout (local_size_x = WORK_GROUP_SIZE) in;

// The cache covers the work group plus the filter's apron on both sides.
#define CACHE_SIZE (WORK_GROUP_SIZE + FILTER_RADIUS*2)
shared vec3 cache[CACHE_SIZE];

void main(){
    ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
    int localIndex = int(gl_LocalInvocationID.x);

    // Cooperatively fill the cache: each thread loads every
    // WORK_GROUP_SIZE-th element, starting at its own local index.
    for(int i = localIndex; i < CACHE_SIZE; i += WORK_GROUP_SIZE){
        cache[i] = imageLoad(inputImg, pixelPos + ivec2(i - localIndex - FILTER_RADIUS, 0)).rgb;
    }
    barrier();

    vec3 total = vec3(0);
    for(int i = 0; i <= FILTER_RADIUS*2; i++){
        total += cache[localIndex + i];
    }
    total /= FILTER_RADIUS*2 + 1;
    imageStore(outputImg, pixelPos, vec4(total, 1));
}
4. The last one is exactly the same as above but, like the 2nd one, uses texelFetch() instead of imageLoad() to fill the cache.
The performance of the four techniques, using a 256x1x1 work group and a filter radius of 64 on my GTX 770 at 1920x1080, is:
1) 38 FPS.
2) 414 FPS (!)
3) 223 FPS
4) 234 FPS
As you can see, manually caching values in shared memory isn't helping at all. Changing the cache array to a vec4[] to improve the memory layout only marginally improved performance (roughly 230 --> 240 FPS). Frankly, I'm at a loss. Is texture memory simply so fast and so well cached that using shared memory for performance has become redundant? Am I doing something clearly wrong?
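For completeness, the vec4[] variant I mention above only changes the cache declaration and pads the unused alpha channel on load; it looks roughly like this:

```glsl
#version 430
layout (binding = 0, rgba16f) uniform image2D inputImg;
layout (binding = 1, rgba16f) uniform image2D outputImg;
layout (local_size_x = WORK_GROUP_SIZE) in;

#define CACHE_SIZE (WORK_GROUP_SIZE + FILTER_RADIUS*2)
// vec4 instead of vec3, so each shared element is a 16-byte aligned slot
shared vec4 cache[CACHE_SIZE];

void main(){
    ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
    int localIndex = int(gl_LocalInvocationID.x);
    for(int i = localIndex; i < CACHE_SIZE; i += WORK_GROUP_SIZE){
        // pad the unused alpha channel with 0
        cache[i] = vec4(imageLoad(inputImg, pixelPos + ivec2(i - localIndex - FILTER_RADIUS, 0)).rgb, 0);
    }
    barrier();
    vec3 total = vec3(0);
    for(int i = 0; i <= FILTER_RADIUS*2; i++){
        total += cache[localIndex + i].rgb;
    }
    total /= FILTER_RADIUS*2 + 1;
    imageStore(outputImg, pixelPos, vec4(total, 1));
}
```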