Compute shader runs more than once?!

Started by
24 comments, last by theagentd 9 years, 7 months ago

Hello.

I'm having problems with a compute shader. It seems like the compute shader is randomly run more than once, which screws up my test shader. Note that I'm using Java, so the syntax of some commands (glMapBufferRange() for example) is slightly different.

I have a persistently mapped coherent buffer which I use for uploads and downloads:


		buffer = glGenBuffers();
		glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
		glBufferStorage(GL_SHADER_STORAGE_BUFFER, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
		mappedBuffer = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT, null);

The buffer length is 16 and attribute size is 4, to fit 16 integers.

Each frame, the buffer is initialized to all 0:


		//Reset persistent buffer to 0
		int total = 0;
		for(int i = 0; i < BUFFER_LENGTH; i++){
			int v = 0;
			mappedBuffer.putInt(v);
			total += v;
		}
		System.out.println("Before: " + total); //prints 0
		mappedBuffer.clear(); //Resets the Java ByteBuffer wrapper around the pointer
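
For readers less familiar with java.nio, the reset loop above relies on the relative putInt() advancing the buffer's position and on clear() rewinding it. Here is a minimal sketch of that behaviour, using an ordinary direct ByteBuffer as a stand-in for the mapped pointer. The class and method names are hypothetical and not part of the original program, and native byte order is an assumption about how the mapped buffer is configured.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for the persistently mapped buffer: a plain direct
// ByteBuffer, so the position/clear() behaviour can be checked without a GL context.
final class MappedBufferSketch {
    static final int BUFFER_LENGTH = 16;  // number of ints, as in the post
    static final int ATTRIBUTE_SIZE = 4;  // bytes per int, as in the post

    static ByteBuffer createStandIn() {
        // Assumption: native byte order, so ints written here would match
        // what the GPU reads from the same memory.
        return ByteBuffer.allocateDirect(BUFFER_LENGTH * ATTRIBUTE_SIZE)
                         .order(ByteOrder.nativeOrder());
    }

    // Writes `value` into every int slot, then rewinds the buffer.
    static void fill(ByteBuffer buf, int value) {
        for (int i = 0; i < BUFFER_LENGTH; i++) {
            buf.putInt(value); // relative put: advances position by 4 bytes
        }
        buf.clear(); // position = 0, limit = capacity; the contents are NOT erased
    }

    // Sums every int slot, then rewinds the buffer.
    static int sum(ByteBuffer buf) {
        int total = 0;
        for (int i = 0; i < BUFFER_LENGTH; i++) {
            total += buf.getInt(); // relative get: advances position by 4 bytes
        }
        buf.clear();
        return total;
    }
}
```

Forgetting the clear() after either loop would make the next relative access start at the end of the buffer and throw, so it is a quick thing to rule out when the readback looks wrong.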

I then run my compute shader:


		//Add 1 to first 8 values in buffer.
		computeShader.bind();
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer);
		glDispatchCompute(1, 1, 1);

I wait for the GPU to finish running the shader.


		//Wait for the GPU to finish running the compute shader
		GLSync syncObject = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
		glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
		
		glFinish(); //Should not be needed, but there just in case for now.

And finally I read back the data:


		//Read back result from persistent buffer
		System.out.println("Result:");
		total = 0;
		for(int i = 0; i < BUFFER_LENGTH; i++){
			int v = mappedBuffer.getInt();
			total += v;
			System.out.println(v); //Print value
		}
		System.out.println("After: " + total);
		mappedBuffer.clear(); //Reset Java wrapper around pointer

And here's my compute shader:


#version 430

layout (binding = 0, rgba16f) uniform image2D img;

layout(std430, binding = 0) buffer Data{

	int data[];
	
} dataBuffer;

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	//int offset = int(gl_WorkGroupSize.x * gl_WorkGroupID.x + gl_LocalInvocationID.x);
	
	if(offset < 8){
		//dataBuffer.data[offset]++;
		atomicAdd(dataBuffer.data[offset], 1);
	}
}

Summary:
- I have a persistently mapped coherent buffer which I try to update using a compute shader.
- I initialize this 16-int buffer to all zeroes.
- I call the compute shader with 1x1x1 work groups = 1 workgroup, and each workgroup has a work group size of 16x1x1, i.e. a single line of 16 invocations.
- The shader increments the first 8 elements of the buffer by 1.
- I correctly wait for the results and everything, but the compute shader randomly seems to be running twice.
- I read back and print the result from the buffer.

The result is a buffer which 99% of the time contains the value 2 instead of 1!!!

Before: 0

Result:
2
2
2
2
2
2
2
2
0
0
0
0
0
0
0
0
After: 16
Before: 0
Result:
2
2
2
2
2
2
2
2
0
0
0
0
0
0
0
0
After: 16
Before: 0
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8
Before: 0
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8

This occurs randomly, regardless of whether the shader uses atomicAdd() or not. It seems like the compute shader is actually run twice for each element in the buffer instead of once, but I see no way this could happen. What is going on?!
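
To pin down what a single dispatch should produce, it can help to model the shader on the CPU. The sketch below is only such a model, with illustrative names that do not come from the original program. It walks over the invocations of a (numGroups x 1 x 1) dispatch and applies the body of the first shader to each one.

```java
// CPU-side model of the dispatch described above; names are illustrative only.
final class DispatchModel {
    // Applies the first shader's body once per invocation of a
    // (numGroups x 1 x 1) dispatch with a local size of (localSize x 1 x 1).
    static int[] runOnce(int[] data, int numGroups, int localSize) {
        int[] out = data.clone();
        for (int group = 0; group < numGroups; group++) {
            for (int local = 0; local < localSize; local++) {
                // Mirrors gl_GlobalInvocationID.x =
                //   gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x
                int offset = group * localSize + local;
                if (offset < 8) {
                    out[offset] += 1; // same net effect as atomicAdd(data[offset], 1)
                }
            }
        }
        return out;
    }

    static int sum(int[] data) {
        int total = 0;
        for (int v : data) {
            total += v;
        }
        return total;
    }
}
```

With a zeroed 16-int array, one call to runOnce(data, 1, 16) gives a sum of 8, matching the "After: 8" runs. Feeding that result through runOnce a second time gives 16, matching the "After: 16" runs. The model says nothing about why a second pass would happen on the GPU.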


This is pretty cool. If I remove the sync that waits for the GPU to finish computing, I get output similar to this:

Before: 0

Result:
0
0
0
0
0
0
0
0
0
0
0
1
1
2
2
2
After: 8

This almost certainly proves that it runs the compute shader twice!!!

No, it doesn't prove anything - for starters unless the wrapper you are using is broken it won't magically do things without you telling it to, that is not how computers work and if you go in thinking 'magic is happening' then you stand no hope.

You are getting the output you told the computer to do.

As to what the problem could be; do you check the returned value from 'glClientWaitSync' to make sure the operation DID complete as expected?
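
As a rough illustration of that check, glClientWaitSync() can return four values, and only two of them mean the fence was actually reached. The sketch below redeclares the standard OpenGL enum values locally so it runs without a GL binding. The class and method names are made up for this example.

```java
// Interprets the value returned by glClientWaitSync(). The constants are the
// standard OpenGL enum values, redeclared here so the sketch needs no GL binding.
final class WaitResult {
    static final int GL_ALREADY_SIGNALED    = 0x911A;
    static final int GL_TIMEOUT_EXPIRED     = 0x911B;
    static final int GL_CONDITION_SATISFIED = 0x911C;
    static final int GL_WAIT_FAILED         = 0x911D;

    // True only if the fence was signaled, either before or during the wait.
    static boolean completed(int result) {
        return result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED;
    }

    // Human-readable description for logging.
    static String describe(int result) {
        switch (result) {
            case GL_ALREADY_SIGNALED:    return "already signaled";
            case GL_CONDITION_SATISFIED: return "signaled during wait";
            case GL_TIMEOUT_EXPIRED:     return "timeout expired";
            case GL_WAIT_FAILED:         return "wait failed";
            default:                     return "unexpected value " + result;
        }
    }
}
```

In the program above, this would amount to storing the int returned by glClientWaitSync() and reading the buffer back only if completed(result) is true, printing describe(result) otherwise.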

At least this problem helped me to solve mine :)

I've not read much about sync objects yet, but adding one to my code fixed my synchronisation issues:

void GPUwait ()
{
	GLsync syncObject = glFenceSync (GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
	GLenum ret = glClientWaitSync (syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
	if (ret == GL_WAIT_FAILED || ret == GL_TIMEOUT_EXPIRED)
		SystemTools::Log ("glClientWaitSync failed.\n");
	glMemoryBarrier (GL_ALL_BARRIER_BITS);
	glDeleteSync (syncObject);
}

Maybe you should try adding the glMemoryBarrier (I assumed this alone would block until the shader is finished, but it seems I was wrong).

And in my paranoia I've made all my data coherent in the shader:

layout(std430, binding = 0) coherent buffer

Just try it, I don't know what I'm talking about :)

My confusion got too high last time and I went to OpenCL... it seems much easier for learning GPGPU.

No, it doesn't prove anything - for starters unless the wrapper you are using is broken it won't magically do things without you telling it to, that is not how computers work and if you go in thinking 'magic is happening' then you stand no hope.

You are getting the output you told the computer to do.

As to what the problem could be; do you check the returned value from 'glClientWaitSync' to make sure the operation DID complete as expected?

I'm not calling magic anywhere. I simply have no idea how this could happen, and I know that it's most likely a problem with my code.

- I have added calls to glMemoryBarrier(GL_ALL_BARRIER_BITS); after uploading data to the persistent buffer and after dispatching the compute shader.

- I have made all variables used coherent in the compute shader.

- I have updated the code to check the return value of glClientWaitSync(), and it never fails.

- I have modified my compute shader to the following:


#version 430

layout(std430, binding = 0) buffer Data{

	coherent int data[];
	
} dataBuffer;

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	
	int v = dataBuffer.data[offset];
	
	if(v == 0){
		//Okay, first invocation. Add 1 to the current value.
		v += 1;
	}else if(v == 1){
		//Wut, second invocation?! Set it to a weird value!
		v = -1000;
	}
	
	
	
	dataBuffer.data[offset] = v;
	
}

With glClientWaitSync(), I mostly get this output (the expected value is shown in parentheses, and the wait result is printed):


Before: 0
Wait successful!
Result:
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
After: -16000

Very rarely (1 / 1000 runs or so) I get the expected value:


Before: 0
Wait successful!
Result:
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
After: 16

Now, WITHOUT glClientWaitSync() I get really interesting results (with 2 comments added by me):


Before: 0
Result:
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
//----------- first invocation completes
1 (1)
1 (1)
1 (1)
//----------- magic second invocation completes
-1000 (1)
-1000 (1)
-1000 (1)
After: -2997

How does this NOT prove that the shader is executed twice? It doesn't say anything about WHY it's executed twice, but it is.

Changing the compute shader to this:


#version 430

layout(std430, binding = 0) buffer Data{

	coherent int data[];
	
} dataBuffer;

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

coherent shared int invocation = 0;

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	
	int v = dataBuffer.data[offset];
	dataBuffer.data[offset] = v += atomicAdd(invocation, 1);
	
}

Should fill the buffer with values going from 0 to 15 (technically, they could very well be in random order, but they never seem to be). The result without syncing:


Before: 0
Result:
0 (0)
0 (1)
0 (2)
0 (3)
0 (4)
0 (5)
0 (6)
0 (7)
0 (8)
0 (9)
0 (10)
0 (11)
//-----------First invocation completes
12 (12)
13 (13)
//-----------Magic second invocation completes
28 (14)
30 (15)
After: 83
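
The 28 and 30 in the last two slots are what two passes of this shader would leave behind (14 + 14 and 15 + 15). A small CPU model makes the arithmetic explicit. It assumes, as observed above, that invocations receive counter values in index order, and the class name is invented for illustration.

```java
// CPU model of the counter shader; assumes invocations run in index order.
final class CounterShaderModel {
    // One dispatch of a single work group covering every slot: the shared
    // counter starts at 0 and each invocation adds the value it obtained
    // from the counter to its own slot.
    static int[] pass(int[] data) {
        int[] out = data.clone();
        int invocation = 0; // models the shared counter
        for (int offset = 0; offset < out.length; offset++) {
            int previous = invocation++; // atomicAdd returns the value before the add
            out[offset] += previous;
        }
        return out;
    }
}
```

One pass over a zeroed 16-int array leaves 14 and 15 in the last two slots; a second pass over that result leaves 28 and 30. This only reproduces the numbers and does not explain where a second pass would come from.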

I have also tried outputting the values of gl_NumWorkGroups, gl_WorkGroupID, gl_LocalInvocationID and the rest, and those values are always correct, implying that the work group size is correct, but that there's an extra work group being executed.

I tried switching to a compute shader which simply adds 1 to the value and dispatching that shader in a loop:


		for(int i = 0; i < 10; i++){
			glDispatchCompute(1, 1, 1);
			glMemoryBarrier(GL_ALL_BARRIER_BITS);
		}

This produces a buffer filled with a single random value between 10 and 20.

Okay, here's my favorite so far. I reduced the work group size to 8x1x1 and I dispatch the compute shader once with a 16-int buffer.


#version 430

layout(std430, binding = 0) buffer Data{

	coherent int data[];
	
} dataBuffer;

layout (local_size_x = 8, local_size_y = 1, local_size_z = 1) in;

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	
	int v = dataBuffer.data[offset];
	
	if(v == 0){
		//Okay, first invocation. Add 1 to the current value.
		v = 1;
	}else{
		//Wut, second invocation?! Set write to a new place!
		offset += 8;
		v = 2;
	}
	
	dataBuffer.data[offset] = v;
	
}

Output with syncing:


Before: 0
Wait successful!
Result:
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
After: 24

and very rarely I get the correct result again:


Before: 0
Wait successful!
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8
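
Because each invocation of this 8-wide shader only touches its own slot and the slot 8 places further along, the two outputs above can be reproduced on the CPU without worrying about ordering. The following is a sketch of that check, with a hypothetical class name.

```java
// CPU model of the 8-wide shader that relocates its write on a second run.
final class RelocatingShaderModel {
    static final int LOCAL_SIZE = 8; // local_size_x in the shader

    // One dispatch of a single work group over a buffer of at least 16 ints.
    static int[] pass(int[] data) {
        int[] out = data.clone();
        for (int id = 0; id < LOCAL_SIZE; id++) {
            int offset = id;
            int v = out[offset];
            if (v == 0) {
                v = 1;       // first time this slot is seen
            } else {
                offset += 8; // slot already written: write to the upper half instead
                v = 2;
            }
            out[offset] = v;
        }
        return out;
    }

    static int sum(int[] data) {
        int total = 0;
        for (int v : data) {
            total += v;
        }
        return total;
    }
}
```

A single pass over a zeroed buffer sums to 8, and a second pass over that result sums to 24, which are the two "After:" values shown.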

Some things I'd try:

1. Remove GL_MAP_COHERENT_BIT (?) and GL_MAP_PERSISTENT_BIT, and instead map the buffer only for the readback (making sure the shader is no longer running, using both a sync object and glMemoryBarrier).

I read with code like this:

glBindBuffer (GL_SHADER_STORAGE_BUFFER, gpuData.ssbDbgOut);
int* result = (int*) glMapBuffer (GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY);
int count = 128;
for (int i=0; i<count; i++)
{
	int value = result[i]; // read the i-th int from the mapped buffer
	base_debug::logF->Print ("dbg: ", float(value));
}
glUnmapBuffer (GL_SHADER_STORAGE_BUFFER);

2. Instead of running it each frame or inside a loop without sync, really run it only once.

(To eliminate 100% any doubt that the shader is called again while it is still running)

If that works as expected, add the syncing and a second call...

3. Do syncing also inside the shader:

void main(){

	int offset = int(gl_GlobalInvocationID.x);

	barrier(); // block all threads until they're all there
	memoryBarrier(); // block until all data has been written... both may be necessary in some cases, but not in this example

	int v = dataBuffer.data[offset];

	if(v == 0){
		//Okay, first invocation. Add 1 to the current value.
		v = 1;
	}else{
		//Wut, second invocation?! Set write to a new place!
		offset += 8;
		v = 2;
	}

	dataBuffer.data[offset] = v;

	barrier(); // yay, it's paranoia
	memoryBarrier(); // too

}

Thanks for your answer, JoeJ.

I have tried adding barrier() + memoryBarrier() in my shader, but there's no difference. I also disabled the loop so that the shader is executed exactly once. The result is unchanged. This is the complete output of the program:


[LWJGL] ARB_debug_output message
	ID: 131185
	Source: API
	Type: OTHER
	Severity: Unknown (0x826B)
	Message: Buffer detailed info: Buffer object 1 (bound to GL_SHADER_STORAGE_BUFFER, usage hint is GL_DYNAMIC_DRAW) will use DMA CACHED memory as the source for buffer object operations.
	Stack trace:
java.lang.Exception: Stack trace
	at java.lang.Thread.dumpStack(Thread.java:1329)
	at drone.test.LWJGLTest$1.handleMessage(LWJGLTest.java:116)
	at org.lwjgl.opengl.GL30.nglMapBufferRange(Native Method)
	at org.lwjgl.opengl.GL30.glMapBufferRange(GL30.java:1001)
	at drone.test.GPUSyncTest.<init>(GPUSyncTest.java:36)
	at drone.test.GPUSyncTest.main(GPUSyncTest.java:105)
[LWJGL] ARB_debug_output message
	ID: 131185
	Source: API
	Type: OTHER
	Severity: Unknown (0x826B)
	Message: Buffer detailed info: Buffer object 1 (bound to GL_SHADER_STORAGE_BUFFER, usage hint is GL_DYNAMIC_DRAW) has been mapped in DMA CACHED memory.
	Stack trace:
java.lang.Exception: Stack trace
	at java.lang.Thread.dumpStack(Thread.java:1329)
	at drone.test.LWJGLTest$1.handleMessage(LWJGLTest.java:116)
	at org.lwjgl.opengl.GL30.nglMapBufferRange(Native Method)
	at org.lwjgl.opengl.GL30.glMapBufferRange(GL30.java:1001)
	at drone.test.GPUSyncTest.<init>(GPUSyncTest.java:36)
	at drone.test.GPUSyncTest.main(GPUSyncTest.java:105)
Before: 0
Wait successful!
Result:
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
After: 24

Exiting.

Note that the buffer is not actually mapped twice. The driver simply prints that message twice (confirmed with debugging; stepping over glMapBufferRange() immediately produced the message twice).

I'm gonna try running this on an AMD GPU and an Intel GPU (wait, does Intel support compute shaders? >___>) and see if I can reproduce it there.

How does this NOT prove that the shader is executed twice? It doesn't say anything about WHY it's executed twice, but it is.


No; all you have proven is that observed results != expected results.
The API will not do things you do not tell it.
The shader runs as many times as you tell it.

The error is likely in the buffer management code.
