Smart compute shader box blur is slower?!

Started by
2 comments, last by theagentd 9 years, 2 months ago

Hey!

So I decided to try some fun little compute shader experiments and see if I could make a simple box filter faster by using shared memory. I now have 4 short compute shaders that do the same thing using different techniques, but the performance I'm getting is quite baffling. The visual result of all 4 methods is exactly the same. Varying the work group and filter sizes affects performance of course, but the relative performance between the different techniques is pretty much constant.

1. The first implementation simply gathers all samples using imageLoad() and was intended to be some kind of reference implementation.


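#version 430

// WORK_GROUP_SIZE and FILTER_RADIUS are not defined in this listing; they must
// be supplied externally (e.g. as #defines added by the host application).

layout (binding = 0, rgba16f) uniform image2D inputImg;
layout (binding = 1, rgba16f) uniform image2D outputImg;

// One-dimensional work group: each invocation produces one output pixel
layout (local_size_x = WORK_GROUP_SIZE) in;

void main(){
	
	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
	
	// Sum the 2*FILTER_RADIUS+1 horizontal neighbors straight from the image
	vec3 total = vec3(0);
	for(int i = -FILTER_RADIUS; i <= FILTER_RADIUS; i++){
		total += imageLoad(inputImg, pixelPos + ivec2(i, 0)).rgb;
	}
	
	// Normalize by the number of taps
	total /= float(FILTER_RADIUS*2+1);
	
	imageStore(outputImg, pixelPos, vec4(total, 1));
}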

2. The second one is identical, except it reads from a texture using texelFetch() instead of using imageLoad() to take advantage of the texture cache.
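The listing for this second variant isn't included in the post. A minimal sketch of what it might look like, assuming the input is bound as a sampler2D (the name inputTex and its binding point are placeholders, not taken from the original code):

```glsl
#version 430

// WORK_GROUP_SIZE and FILTER_RADIUS are expected to be #defined externally,
// as in the first listing.

// Assumption: the input is bound as a regular texture. The name inputTex and
// the binding point are hypothetical.
layout (binding = 0) uniform sampler2D inputTex;
layout (binding = 1, rgba16f) uniform image2D outputImg;

layout (local_size_x = WORK_GROUP_SIZE) in;

void main(){
	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);

	vec3 total = vec3(0);
	for(int i = -FILTER_RADIUS; i <= FILTER_RADIUS; i++){
		// texelFetch() reads one unfiltered texel at integer coordinates (mip level 0)
		total += texelFetch(inputTex, pixelPos + ivec2(i, 0), 0).rgb;
	}

	total /= float(FILTER_RADIUS*2+1);

	imageStore(outputImg, pixelPos, vec4(total, 1));
}
```

One difference worth keeping in mind: an out-of-bounds imageLoad() returns zero, while texelFetch() results are undefined for coordinates outside the texture (unless robust access is enabled), so the two variants are only guaranteed to match away from the image edges.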

3. After that I implemented a more advanced version that caches the values in shared memory, based on http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Efficient%20Compute%20Shader%20Programming.pps


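#version 430

// WORK_GROUP_SIZE and FILTER_RADIUS must be supplied externally, as above.

layout (binding = 0, rgba16f) uniform image2D inputImg;
layout (binding = 1, rgba16f) uniform image2D outputImg;

layout (local_size_x = WORK_GROUP_SIZE) in;

// The work group's own pixels plus FILTER_RADIUS extra pixels on each side
#define CACHE_SIZE (WORK_GROUP_SIZE+FILTER_RADIUS*2)
shared vec3[CACHE_SIZE] cache;

void main(){
	
	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
	int localIndex = int(gl_LocalInvocationID.x);
	
	// Cooperative load: cache[i] holds the pixel at (group start + i - FILTER_RADIUS)
	for(int i = localIndex; i < CACHE_SIZE; i += WORK_GROUP_SIZE){
		cache[i] = imageLoad(inputImg, pixelPos + ivec2(i-localIndex - FILTER_RADIUS, 0)).rgb;
	}
	
	// Wait until the whole work group has filled the cache
	barrier();
	
	// This invocation's window covers cache[localIndex .. localIndex + 2*FILTER_RADIUS]
	vec3 total = vec3(0);
	for(int i = 0; i <= FILTER_RADIUS*2; i++){
		total += cache[localIndex + i];
	}
	total /= float(FILTER_RADIUS*2+1);
	imageStore(outputImg, pixelPos, vec4(total, 1));
}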

4. The last one is exactly the same as above, but just like the 2nd one uses texelFetch() instead of imageLoad().
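This fourth variant isn't listed in the post either. A sketch under the same assumption as for the second variant (a sampler2D input, here given the placeholder name inputTex), with only the cache fill changed relative to the third listing:

```glsl
#version 430

// WORK_GROUP_SIZE and FILTER_RADIUS are expected to be #defined externally.

// Assumption: input bound as a texture; inputTex is a hypothetical name.
layout (binding = 0) uniform sampler2D inputTex;
layout (binding = 1, rgba16f) uniform image2D outputImg;

layout (local_size_x = WORK_GROUP_SIZE) in;

#define CACHE_SIZE (WORK_GROUP_SIZE+FILTER_RADIUS*2)
shared vec3 cache[CACHE_SIZE];

void main(){
	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
	int localIndex = int(gl_LocalInvocationID.x);

	// Fill the shared cache through the texture path instead of imageLoad().
	// Coordinates outside the texture are not clamped here; texelFetch()
	// results are undefined for them.
	for(int i = localIndex; i < CACHE_SIZE; i += WORK_GROUP_SIZE){
		cache[i] = texelFetch(inputTex, pixelPos + ivec2(i - localIndex - FILTER_RADIUS, 0), 0).rgb;
	}

	// Make sure every invocation has finished writing before anyone reads
	barrier();

	vec3 total = vec3(0);
	for(int i = 0; i <= FILTER_RADIUS*2; i++){
		total += cache[localIndex + i];
	}
	total /= float(FILTER_RADIUS*2+1);
	imageStore(outputImg, pixelPos, vec4(total, 1));
}
```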

The performance of the four techniques with a 256x1x1 work group and a filter radius of 64 on my GTX 770 at 1920x1080 is:

1) 38 FPS

2) 414 FPS (!)

3) 223 FPS

4) 234 FPS

As you can see, manually caching values in shared memory helps a lot compared to plain imageLoad(), but it is still far slower than the brute-force texelFetch() version. Changing the cache array to a vec4[] to improve the memory layout only marginally improved performance (from roughly 230 to 240 FPS). Frankly, I'm at a loss. Is texture memory simply so fast and cached so well that using shared memory for performance has become redundant? Am I doing something clearly wrong?
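For reference, a rough count of memory operations per work group for the configuration above (work group size 256, filter radius R = 64). This is only bookkeeping derived from the listings; it says nothing about the relative latency or bandwidth of texture fetches and shared-memory reads:

```latex
\begin{align*}
\text{taps per pixel} &= 2R + 1 = 2 \cdot 64 + 1 = 129 \\
\text{image/texture fetches per group, variants 1 and 2} &= 256 \cdot 129 = 33\,024 \\
\text{image/texture fetches per group, variants 3 and 4} &= 256 + 2 \cdot 64 = 384 \\
\text{shared-memory reads per group, variants 3 and 4} &= 256 \cdot 129 = 33\,024
\end{align*}
```

So the cached variants issue 86 times fewer fetches (33,024 / 384), but they perform exactly as many shared-memory reads as the uncached variants perform fetches. They can only win if a shared-memory read is cheaper than a fetch that hits the texture cache.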


This might give some pointers. http://diaryofagraphicsprogrammer.blogspot.fi/2015/01/reloaded-compute-shader-optimizations.html

Thanks, kalle_h, but that one isn't really applicable in my case I think.

I did some experiments trying to reduce shared memory bank conflicts. I tried splitting the cache into 3 separate float[] arrays for R, G and B. This gave a noticeable improvement in FPS, but the cached shader is still only around 75% as fast as the brute-force texelFetch() version...
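The split-channel shader isn't shown in the post. A minimal sketch of the idea, assuming it is built on the imageLoad() version of the cached shader (which of the two cached variants was actually used is not stated) and using the placeholder array names cacheR, cacheG and cacheB:

```glsl
#version 430

// WORK_GROUP_SIZE and FILTER_RADIUS are expected to be #defined externally.

layout (binding = 0, rgba16f) uniform image2D inputImg;
layout (binding = 1, rgba16f) uniform image2D outputImg;

layout (local_size_x = WORK_GROUP_SIZE) in;

#define CACHE_SIZE (WORK_GROUP_SIZE+FILTER_RADIUS*2)

// One scalar array per channel, so that neighboring invocations touch
// neighboring 32-bit words of each array. The names are hypothetical.
shared float cacheR[CACHE_SIZE];
shared float cacheG[CACHE_SIZE];
shared float cacheB[CACHE_SIZE];

void main(){
	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
	int localIndex = int(gl_LocalInvocationID.x);

	// Cooperative load, scattering each pixel into the three channel arrays
	for(int i = localIndex; i < CACHE_SIZE; i += WORK_GROUP_SIZE){
		vec3 c = imageLoad(inputImg, pixelPos + ivec2(i - localIndex - FILTER_RADIUS, 0)).rgb;
		cacheR[i] = c.r;
		cacheG[i] = c.g;
		cacheB[i] = c.b;
	}

	barrier();

	// Gather the three channels back into a vec3 while summing the window
	vec3 total = vec3(0);
	for(int i = 0; i <= FILTER_RADIUS*2; i++){
		int j = localIndex + i;
		total += vec3(cacheR[j], cacheG[j], cacheB[j]);
	}
	total /= float(FILTER_RADIUS*2+1);
	imageStore(outputImg, pixelPos, vec4(total, 1));
}
```

Whether this layout actually avoids bank conflicts depends on how the driver lays out vec3 arrays in shared memory, which the post doesn't establish; the sketch only shows the structure of the change.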

The values above were measured with a GL_RGBA16F texture. Switching to 32-bit values puts the shared memory version (with texelFetch()) ahead of the brute-force one:

1) 18 FPS

2) 233 FPS

3) 213 FPS

4) 283 FPS

With a 16-bit texture, the texture cache only has to store and move half as much data per texel as the shared memory cache, which holds 32-bit floats. That also means the texture cache effectively fits twice as many texels. With 32-bit values the texture cache is less effective, so the shared memory version actually wins. Sadly, blurring 32-bit float textures is rare, to say the least...

