
theagentd

Member Since 15 Nov 2010

Topics I've Started

Smart compute shader box blur is slower?!

19 January 2015 - 02:05 PM

Hey!

So I decided to try some fun little compute shader experiments and see if I could make a simple box filter faster by using shared memory. I now have 4 short compute shaders that do the same thing using different techniques, but the performance I'm getting is quite baffling. The visual result of all 4 methods is exactly the same. Varying the work group and filter sizes affects performance of course, but the relative performance between the different techniques is pretty much constant.

1. The first implementation simply gathers all samples using imageLoad() and was intended to be some kind of reference implementation.

#version 430

layout (binding = 0, rgba16f) uniform image2D inputImg;
layout (binding = 1, rgba16f) uniform image2D outputImg;

layout (local_size_x = WORK_GROUP_SIZE) in;

void main(){
	
	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
	
	vec3 total = vec3(0);
	for(int i = -FILTER_RADIUS; i <= FILTER_RADIUS; i++){
		total += imageLoad(inputImg, pixelPos + ivec2(i, 0)).rgb;
	}
	
	total /= FILTER_RADIUS*2+1;
	
	imageStore(outputImg, pixelPos, vec4(total, 1));
}

2. The second one is identical, except it reads from a texture using texelFetch() instead of using imageLoad() to take advantage of the texture cache.
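The post doesn't include this variant's source; a minimal sketch of what it would look like, assuming the input is bound as a sampler2D and fetched at mip level 0 (otherwise identical to #1):

```glsl
#version 430

layout (binding = 0) uniform sampler2D inputTex;
layout (binding = 1, rgba16f) uniform image2D outputImg;

layout (local_size_x = WORK_GROUP_SIZE) in;

void main(){

	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);

	vec3 total = vec3(0);
	for(int i = -FILTER_RADIUS; i <= FILTER_RADIUS; i++){
		// texelFetch() goes through the texture cache, unlike imageLoad()
		total += texelFetch(inputTex, pixelPos + ivec2(i, 0), 0).rgb;
	}

	total /= FILTER_RADIUS*2+1;

	imageStore(outputImg, pixelPos, vec4(total, 1));
}
```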

 

3. After that, I implemented a more advanced version based on http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Efficient%20Compute%20Shader%20Programming.pps, which caches the values in shared memory.

#version 430

layout (binding = 0, rgba16f) uniform image2D inputImg;
layout (binding = 1, rgba16f) uniform image2D outputImg;

layout (local_size_x = WORK_GROUP_SIZE) in;

#define CACHE_SIZE (WORK_GROUP_SIZE+FILTER_RADIUS*2)
shared vec3[CACHE_SIZE] cache;

void main(){
	
	ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
	int localIndex = int(gl_LocalInvocationID.x);
	
	for(int i = localIndex; i < CACHE_SIZE; i += WORK_GROUP_SIZE){
		cache[i] = imageLoad(inputImg, pixelPos + ivec2(i-localIndex - FILTER_RADIUS, 0)).rgb;
	}
	
	barrier();
	
	vec3 total = vec3(0);
	for(int i = 0; i <= FILTER_RADIUS*2; i++){
		total += cache[localIndex + i];
	}
	total /= FILTER_RADIUS*2+1;
	imageStore(outputImg, pixelPos, vec4(total, 1));
}

4. The last one is exactly the same as #3, but, like the 2nd one, uses texelFetch() instead of imageLoad().

The performance of the four techniques, using a 256x1x1 work group and a filter radius of 64 on my GTX 770 at 1920x1080, is:

 

1) 38 FPS.

2) 414 FPS (!)

3) 223 FPS

4) 234 FPS

 

As you can see, manually caching values isn't helping at all. Changing the cache array to a vec4[] to improve the memory layout only improved performance marginally (230 --> 240 FPS or so). Frankly, I'm at a loss. Is texture memory simply so fast and so well cached that using shared memory for performance has become redundant? Am I doing something clearly wrong?


Compute shader runs more than once?!

22 September 2014 - 07:45 AM

Hello.

 

I'm having problems with a compute shader. It seems to be randomly run more than once, which screws up my test shader. Note that I'm using Java, so the syntax of some commands (glMapBufferRange(), for example) is slightly different.

I have a persistently mapped coherent buffer which I use for uploads and downloads:

		buffer = glGenBuffers();
		glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
		glBufferStorage(GL_SHADER_STORAGE_BUFFER, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
		mappedBuffer = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT, null);

The buffer length is 16 and the attribute size is 4 bytes, so the buffer holds 16 integers.

 

 

Each frame, the buffer is initialized to all 0:

		//Reset persistent buffer to 0
		int total = 0;
		for(int i = 0; i < BUFFER_LENGTH; i++){
			int v = 0;
			mappedBuffer.putInt(v);
			total += v;
		}
		System.out.println("Before: " + total); //prints 0
		mappedBuffer.clear(); //Resets the Java ByteBuffer wrapper around the pointer

I then run my compute shader:

		//Add 1 to first 8 values in buffer.
		computeShader.bind();
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer);
		glDispatchCompute(1, 1, 1);

I wait for the GPU to finish running the shader.

		//Wait for the GPU to finish running the compute shader
		GLSync syncObject = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
		glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
		
		glFinish(); //Should not be needed, but there just in case for now.

And finally I read back the data:

		//Read back result from persistent buffer
		System.out.println("Result:");
		total = 0;
		for(int i = 0; i < BUFFER_LENGTH; i++){
			int v = mappedBuffer.getInt();
			total += v;
			System.out.println(v); //Print value
		}
		System.out.println("After: " + total);
		mappedBuffer.clear(); //Reset Java wrapper around pointer

And here's my compute shader:

#version 430

layout (binding = 0, rgba16f) uniform image2D img;

layout(std430, binding = 0) buffer Data{

	int data[];
	
} dataBuffer;

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	//int offset = int(gl_WorkGroupSize.x * gl_WorkGroupID.x + gl_LocalInvocationID.x);
	
	if(offset < 8){
		//dataBuffer.data[offset]++;
		atomicAdd(dataBuffer.data[offset], 1);
	}
}

 
 
Summary:
 - I have a persistently mapped coherent buffer which I try to update using a compute shader.
 - I initialize this 16-int buffer to all zeroes.
 - I dispatch the compute shader with 1x1x1 work groups = 1 work group, and each work group has a local size of 16x1x1, i.e. a single line of 16 invocations.
 - The shader increments the first 8 elements of the buffer by 1.
 - I wait for the GPU to finish with a fence and glFinish(), but the compute shader randomly seems to run twice.
 - I read back and print the result from the buffer.

 

 

 

The result is a buffer whose first 8 elements contain the value 2 instead of 1 about 99% of the time!!!

 

 

Before: 0

Result:
2
2
2
2
2
2
2
2
0
0
0
0
0
0
0
0
After: 16
 
Before: 0
Result:
2
2
2
2
2
2
2
2
0
0
0
0
0
0
0
0
After: 16
 
Before: 0
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8
 
Before: 0
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8

 

This randomly occurs regardless of whether the shader uses atomicAdd() or not. It seems like the compute shader actually runs twice for each element in the buffer instead of once, but I see no possible way that could happen. What is going on?!


Distance to cone light / cone light culling

31 August 2014 - 06:09 PM

Hello!

I am trying to figure out a good way of determining

1) if a cone light with a certain position, direction, field of view and range is visible on the screen (frustum culling) and

2) how far away the camera is from the light so I can pick a fitting shadow map resolution.

 

I am currently treating the cone light like a point light, with the sphere centered at the cone light's origin and its radius equal to the light's range, but this creates a lot of false positives and overestimates the shadow map resolution. I want to do more precise frustum culling, and I also want to calculate a more accurate distance from the camera to the closest point inside the cone light's volume to better estimate the shadow map resolution.

 

For point lights I simply use the highest shadow map resolution available when the camera is inside the point light's volume, and then gradually reduce the resolution proportionally to

    1 / (distance from center of point light - radius of point light)

if the camera is outside the point light's volume.
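As a concrete illustration of that falloff, a hypothetical resolution picker might look like this sketch (the class, method name, and resolution constants are illustrative assumptions, not from the original post):

```java
// Hypothetical shadow-map resolution picker implementing the
// 1 / (distance - radius) falloff described above.
public class ShadowRes {
    static final int MAX_RES = 2048; // resolution used inside the volume (illustrative)
    static final int MIN_RES = 64;   // never drop below this (illustrative)

    static int pickResolution(float distToCenter, float lightRadius) {
        if (distToCenter <= lightRadius) {
            return MAX_RES; // camera is inside the point light's volume
        }
        // Proportional to 1 / (distance - radius), clamped so we never
        // exceed the maximum resolution close to the volume's surface.
        float falloff = 1.0f / (distToCenter - lightRadius);
        int res = (int) (MAX_RES * Math.min(1.0f, falloff));
        return Math.max(MIN_RES, res);
    }
}
```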

 

For cone lights I wish to do something similar. I want to determine if the camera is inside the cone light. If it's not, then I want to calculate exactly how far away it is from the cone light's volume, and pick a shadow map resolution based on that.
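For the inside test specifically, a minimal sketch under assumed conventions: the cone is given by its apex, a normalized direction, the cosine of its half-angle, and a range measured as distance from the apex (i.e. a sphere-capped cone). All names are illustrative.

```java
// Hypothetical inside-test for a cone light volume.
public class ConeLight {
    static boolean insideCone(float[] p, float[] apex, float[] dir,
                              float cosHalfAngle, float range) {
        // Vector from the apex to the test point
        float vx = p[0] - apex[0];
        float vy = p[1] - apex[1];
        float vz = p[2] - apex[2];
        float len = (float) Math.sqrt(vx * vx + vy * vy + vz * vz);
        if (len > range) return false; // beyond the light's range
        if (len == 0.0f) return true;  // exactly at the apex
        // Cosine of the angle between (p - apex) and the cone's axis;
        // inside if that angle is at most the half-angle.
        float cosA = (vx * dir[0] + vy * dir[1] + vz * dir[2]) / len;
        return cosA >= cosHalfAngle;
    }
}
```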

 

Thanks for reading!

 


SOLVED: Compute shader atomic shared variable problem

26 June 2014 - 09:05 AM

SOLUTION: Ugh. Don't do 

    minDepth = atomicMin(minDepth, depth);

Just do

    atomicMin(minDepth, depth);

The assignment breaks the atomicity. 
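Putting the fix together, the corrected reduction might look like this sketch (the depth texture, its binding, and the local size are illustrative; note also that GLSL doesn't allow initializers on shared variables, so one invocation initializes them before a barrier):

```glsl
#version 430

layout (binding = 0) uniform sampler2D depthTex; // illustrative binding

layout (local_size_x = 16, local_size_y = 16) in;

shared uint minDepth;
shared uint maxDepth;

void main(){
	// One invocation initializes the shared values...
	if(gl_LocalInvocationIndex == 0u){
		minDepth = 0xFFFFFFFFu;
		maxDepth = 0u;
	}
	barrier(); // ...and everyone waits for the initialization.

	float d = texelFetch(depthTex, ivec2(gl_GlobalInvocationID.xy), 0).r;
	uint depth = floatBitsToUint(d); // order-preserving for non-negative floats

	// Don't assign the return value; it is the value *before* the update.
	atomicMin(minDepth, depth);
	atomicMax(maxDepth, depth);
	barrier();

	// All invocations now agree on the tile's min/max depth.
}
```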

 

 

 

 

I was doing some experiments on tiled deferred shading and ended up with a very strange problem. I tried to compute the minimum and maximum depth of each tile using two shared uints.

shared uint minDepth = 0xFFFFFFFF;
shared uint maxDepth = 0;

The idea was to compute the maximum and minimum depth using atomicMin() and atomicMax() like this:

uint depth = ...;
minDepth = atomicMin(minDepth, depth); //DON'T DO THIS, SEE ABOVE
maxDepth = atomicMax(maxDepth, depth); //DON'T DO THIS, SEE ABOVE
barrier();

However, this is not working correctly. I seem to be getting synchronization problems, since the result flickers a lot despite the camera not moving, so the depth buffer is identical between frames. barrier() has no effect at all on the result.

 

[Screenshot: flickering per-tile min/max depth output]

 

I fail to see how this could possibly happen. The atomic min and max functions do not seem to work correctly, with some pixels being randomly ignored. For testing, I decided to output the following:

if(minDepth > d){
    result = vec3(5, 0, 0);
}else{
    result = vec3(0);
}

In other words, if the depth of the current pixel is less than the calculated minimum depth of the tile, make it bright red. Here's the horrifying flickering result:

[Screenshot: flickering red pixels where the pixel's depth is below the computed tile minimum]

 

 

What am I doing wrong? How can this possibly happen when I'm using barrier()? Why are neither barrier() nor memoryBarrierShared() working as they should?

 

SOLVED, see above.

Temporal Subpixel Reconstruction Antialiasing

09 June 2014 - 05:25 AM

Hello!

 

I came up with a way of extending Subpixel Reconstruction Antialiasing with a temporal component to improve subpixel accuracy and achieve antialiasing of shading as well, while also completely preventing ghosting.

 

Here's a comparison of no antialiasing, SRAA and TSRAA to help catch your interest. =P

 

[Comparison screenshots: no antialiasing vs. 8x SRAA vs. 8x TSRAA, zoomed]

 

Here's a link to the article I wrote about it.

