Compute shader runs more than once?!


Hello.

 

I'm having problems with a compute shader. It seems like the compute shader is randomly run more than once, which screws up my test shader. Note that I'm using Java (LWJGL), so the syntax of some commands (glMapBufferRange(), for example) is slightly different.

I have a persistently mapped coherent buffer which I use for uploads and downloads:

		buffer = glGenBuffers();
		glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
		glBufferStorage(GL_SHADER_STORAGE_BUFFER, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
		mappedBuffer = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT, null);

The buffer length is 16 and attribute size is 4, to fit 16 integers.

Each frame, the buffer is initialized to all 0:

		//Reset persistent buffer to 0
		int total = 0;
		for(int i = 0; i < BUFFER_LENGTH; i++){
			int v = 0;
			mappedBuffer.putInt(v);
			total += v;
		}
		System.out.println("Before: " + total); //prints 0
		mappedBuffer.clear(); //Resets the Java ByteBuffer wrapper around the pointer

I then run my compute shader:

		//Add 1 to first 8 values in buffer.
		computeShader.bind();
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer);
		glDispatchCompute(1, 1, 1);

I wait for the GPU to finish running the shader.

		//Wait for the GPU to finish running the compute shader
		GLSync syncObject = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
		glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
		
		glFinish(); //Should not be needed, but there just in case for now.

And finally I read back the data:

		//Read back result from persistent buffer
		System.out.println("Result:");
		total = 0;
		for(int i = 0; i < BUFFER_LENGTH; i++){
			int v = mappedBuffer.getInt();
			total += v;
			System.out.println(v); //Print value
		}
		System.out.println("After: " + total);
		mappedBuffer.clear(); //Reset Java wrapper around pointer

And here's my compute shader:

#version 430

layout (binding = 0, rgba16f) uniform image2D img;

layout(std430, binding = 0) buffer Data{

	int data[];
	
} dataBuffer;

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	//int offset = int(gl_WorkGroupSize.x * gl_WorkGroupID.x + gl_LocalInvocationID.x);
	
	if(offset < 8){
		//dataBuffer.data[offset]++;
		atomicAdd(dataBuffer.data[offset], 1);
	}
}

 
 
Summary:
 - I have a persistently mapped coherent buffer which I try to update using a compute shader.
 - I initialize this 16-int buffer to all zeroes.
 - I dispatch the compute shader with 1x1x1 work groups = 1 work group, and each work group has a local size of 16x1x1, i.e. a single line of 16 invocations.
 - The shader increments the first 8 elements of the buffer by 1.
 - I correctly wait for the results, but the compute shader randomly seems to run twice.
 - I read back and print the result from the buffer.

The result is a buffer which 99% of the time contains the value 2 instead of 1!!!

Before: 0

Result:
2
2
2
2
2
2
2
2
0
0
0
0
0
0
0
0
After: 16
 
Before: 0
Result:
2
2
2
2
2
2
2
2
0
0
0
0
0
0
0
0
After: 16
 
Before: 0
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8
 
Before: 0
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8

 

This occurs randomly regardless of whether the shader uses an atomicAdd() or not. It seems like the compute shader is actually run twice for each element in the buffer instead of once, but I see no possible way this could happen. What is going on?!


This is pretty cool. If I remove the sync that waits for the GPU to finish computing, I get output similar to this:

 

Before: 0

Result:
0
0
0
0
0
0
0
0
0
0
0
1
1
2
2
2
After: 8

This almost certainly proves that it runs the compute shader twice!!!

No, it doesn't prove anything. For starters, unless the wrapper you are using is broken, it won't magically do things without being told to; that is not how computers work, and if you go in thinking 'magic is happening' then you stand no hope.

You are getting the output you told the computer to produce.

As to what the problem could be: do you check the value returned by glClientWaitSync() to make sure the operation DID complete as expected?
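A checked wait in the LWJGL style used above might look like this (a sketch only; the glDeleteSync call is added so the sync object isn't leaked):

		//Wait for the GPU, then verify that the wait actually succeeded
		GLSync syncObject = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
		int waitResult = glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
		if(waitResult == GL_WAIT_FAILED || waitResult == GL_TIMEOUT_EXPIRED){
			System.out.println("glClientWaitSync failed or timed out: " + waitResult);
		}
		glDeleteSync(syncObject); //Clean up the sync object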


At least this problem helped me to solve mine :)

 

I haven't read much about sync objects yet, but adding one to my code fixed my synchronisation issues:

 

void GPUwait ()
{
    GLsync syncObject = glFenceSync (GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    GLenum ret = glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
    if (ret == GL_WAIT_FAILED || ret == GL_TIMEOUT_EXPIRED)
        SystemTools::Log ("glClientWaitSync failed.\n");
    glMemoryBarrier (GL_ALL_BARRIER_BITS);
    glDeleteSync (syncObject);
}

 

Maybe you should try adding the glMemoryBarrier (I assumed this alone would block until the shader finished, but it seems I was wrong).
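In the Java code above, that would mean something like this after the dispatch (a sketch; GL_ALL_BARRIER_BITS is the catch-all choice while debugging):

		glDispatchCompute(1, 1, 1);
		glMemoryBarrier(GL_ALL_BARRIER_BITS); //Orders the shader's memory writes before whatever follows; it does not block the CPU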

And in my paranoia I've made all my data coherent in the shader:

 

layout(std430, binding = 0) coherent buffer

 

Just try it; I don't know what I'm talking about :)

My confusion rose too high last time and I went over to OpenCL... it seems much easier for learning GPGPU.


No, it doesn't prove anything. For starters, unless the wrapper you are using is broken, it won't magically do things without being told to; that is not how computers work, and if you go in thinking 'magic is happening' then you stand no hope.

You are getting the output you told the computer to produce.

As to what the problem could be: do you check the value returned by glClientWaitSync() to make sure the operation DID complete as expected?

I'm not claiming magic anywhere. I simply have no idea how this could happen, and I know that it's most likely a problem with my code.

 

 - I have added calls to glMemoryBarrier(GL_ALL_BARRIER_BITS); after uploading data to the persistent buffer and after dispatching the compute shader.

 - I have made all variables used coherent in the compute shader.

 - I have updated the code to check the return value of glClientWaitSync(), and it never fails.

 - I have modified my compute shader to the following:

#version 430

layout(std430, binding = 0) buffer Data{

	coherent int data[];
	
} dataBuffer;

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	
	int v = dataBuffer.data[offset];
	
	if(v == 0){
		//Okay, first invocation. Add 1 to the current value.
		v += 1;
	}else if(v == 1){
		//Wut, second invocation?! Set it to a weird value!
		v = -1000;
	}
	
	
	
	dataBuffer.data[offset] = v;
	
}

With glClientWaitSync(), I mostly get this output (parentheses show the expected value; the wait result is also printed):

Before: 0
Wait successful!
Result:
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
-1000 (1)
After: -16000

Very rarely (1 / 1000 runs or so) I get the expected value:

Before: 0
Wait successful!
Result:
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
1 (1)
After: 16

Now, WITHOUT glClientWaitSync() I get really interesting results (with 2 comments added by me):

Before: 0
Result:
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
0 (1)
//----------- first invocation completes
1 (1)
1 (1)
1 (1)
//----------- magic second invocation completes
-1000 (1)
-1000 (1)
-1000 (1)
After: -2997

How does this NOT prove that the shader is executed twice? It doesn't say anything about WHY it's executed twice, but it is.


Changing the compute shader to this:

#version 430

layout(std430, binding = 0) buffer Data{

	coherent int data[];
	
} dataBuffer;

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

coherent shared int invocation = 0;
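//Note: the GLSL spec does not allow initializers on shared variables, so this 0 is not guaranteed by a conforming compiler.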

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	
	int v = dataBuffer.data[offset];
	dataBuffer.data[offset] = v += atomicAdd(invocation, 1);
	
}

This should fill the buffer with the values 0 to 15 (technically they could appear in any order, but they never seem to). The result without syncing:

Before: 0
Result:
0 (0)
0 (1)
0 (2)
0 (3)
0 (4)
0 (5)
0 (6)
0 (7)
0 (8)
0 (9)
0 (10)
0 (11)
//-----------First invocation completes
12 (12)
13 (13)
//-----------Magic second invocation completes
28 (14)
30 (15)
After: 83

I have also tried outputting the values of gl_NumWorkGroups, gl_WorkGroupID, gl_LocalInvocationID and the rest, and those values are always correct, implying that the work group size is correct, but that there's an extra work group being executed.

 

I tried switching to a compute shader which simply adds 1 to the value and dispatching that shader in a loop:

		for(int i = 0; i < 10; i++){
			glDispatchCompute(1, 1, 1);
			glMemoryBarrier(GL_ALL_BARRIER_BITS);
		}

This produces a buffer filled with a single random value between 10 and 20.


Okay, here's my favorite so far. I reduced the work group size to 8x1x1 and I dispatch the compute shader once with a 16-int buffer.

#version 430

layout(std430, binding = 0) buffer Data{

	coherent int data[];
	
} dataBuffer;

layout (local_size_x = 8, local_size_y = 1, local_size_z = 1) in;

void main(){

	int offset = int(gl_GlobalInvocationID.x);
	
	int v = dataBuffer.data[offset];
	
	if(v == 0){
		//Okay, first invocation. Add 1 to the current value.
		v = 1;
	}else{
		//Wut, second invocation?! Write to a new place!
		offset += 8;
		v = 2;
	}
	
	dataBuffer.data[offset] = v;
	
}

Output with syncing:

Before: 0
Wait successful!
Result:
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
After: 24

and very rarely I get the correct result again:

Before: 0
Wait successful!
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8


Some things I'd try:

 

1. Remove GL_MAP_COHERENT_BIT and GL_MAP_PERSISTENT_BIT,

and instead map the buffer on demand (ensuring the shader is not still running, with both a sync object and glMemoryBarrier).

I read with code like this:

 

glBindBuffer (GL_SHADER_STORAGE_BUFFER, gpuData.ssbDbgOut);
int* result = (int*) glMapBuffer (GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY);
int count = 128;
for (int i=0; i<count; i++)
{
    int value = result[i];
    base_debug::logF->Print ("dbg: ", float(value));
}
glUnmapBuffer (GL_SHADER_STORAGE_BUFFER);

 

2. Instead of running it each frame or inside a loop without syncing, run it exactly once.

(To eliminate 100% of the doubt that the shader is dispatched again while still running.)

If that works as expected, add the syncing and a second call...

 

3. Do syncing also inside the shader:

void main(){

    int offset = int(gl_GlobalInvocationID.x);

    barrier(); // block all threads until they're all there
    memoryBarrier(); // block until all data has been written... both may be necessary in some cases, but not in this example

    int v = dataBuffer.data[offset];

    if(v == 0){
        //Okay, first invocation. Add 1 to the current value.
        v = 1;
    }else{
        //Wut, second invocation?! Write to a new place!
        offset += 8;
        v = 2;
    }

    dataBuffer.data[offset] = v;

    barrier(); // yay, it's paranoia
    memoryBarrier(); // too
}


Thanks for your answer, JoeJ.

 

I have tried adding barrier() + memoryBarrier() in my shader, but there's no difference. I also disabled the loop so that the shader is executed exactly once. The result is unchanged. This is the complete output of the program:

[LWJGL] ARB_debug_output message
	ID: 131185
	Source: API
	Type: OTHER
	Severity: Unknown (0x826B)
	Message: Buffer detailed info: Buffer object 1 (bound to GL_SHADER_STORAGE_BUFFER, usage hint is GL_DYNAMIC_DRAW) will use DMA CACHED memory as the source for buffer object operations.
	Stack trace:
java.lang.Exception: Stack trace
	at java.lang.Thread.dumpStack(Thread.java:1329)
	at drone.test.LWJGLTest$1.handleMessage(LWJGLTest.java:116)
	at org.lwjgl.opengl.GL30.nglMapBufferRange(Native Method)
	at org.lwjgl.opengl.GL30.glMapBufferRange(GL30.java:1001)
	at drone.test.GPUSyncTest.<init>(GPUSyncTest.java:36)
	at drone.test.GPUSyncTest.main(GPUSyncTest.java:105)
[LWJGL] ARB_debug_output message
	ID: 131185
	Source: API
	Type: OTHER
	Severity: Unknown (0x826B)
	Message: Buffer detailed info: Buffer object 1 (bound to GL_SHADER_STORAGE_BUFFER, usage hint is GL_DYNAMIC_DRAW) has been mapped in DMA CACHED memory.
	Stack trace:
java.lang.Exception: Stack trace
	at java.lang.Thread.dumpStack(Thread.java:1329)
	at drone.test.LWJGLTest$1.handleMessage(LWJGLTest.java:116)
	at org.lwjgl.opengl.GL30.nglMapBufferRange(Native Method)
	at org.lwjgl.opengl.GL30.glMapBufferRange(GL30.java:1001)
	at drone.test.GPUSyncTest.<init>(GPUSyncTest.java:36)
	at drone.test.GPUSyncTest.main(GPUSyncTest.java:105)
Before: 0
Wait successful!
Result:
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
After: 24

Exiting.

Note that the buffer is not actually mapped twice. The driver simply prints that message twice (confirmed with debugging; stepping over glMapBufferRange() immediately produced the message twice).

 

I'm gonna try running this on an AMD GPU and an Intel GPU (wait, does Intel support compute shaders? >___>) and see if I can reproduce it there.


How does this NOT prove that the shader is executed twice? It doesn't say anything about WHY it's executed twice, but it is.


No; all you have proven is that observed results != expected results.
The API will not do things you do not tell it.
The shader runs as many times as you tell it.

The error is likely in the buffer management code.


Can you post the full source code, or at least a minimal working program that still reproduces the problem? It's hard to tell from the snippets you've provided when they're called and how they're used.

 

Also, have you tried using a non-persistently mapped buffer as JoeJ suggested? i.e. glBufferData to allocate the memory, and then glMapBuffer or glMapBufferRange and glUnmapBuffer only when you read the data back.
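In LWJGL terms that would look roughly like this (a sketch, assuming the glBufferData(target, size, usage) and glMapBuffer(target, access, old_buffer) overloads):

		//Allocate the buffer without persistent mapping
		glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
		glBufferData(GL_SHADER_STORAGE_BUFFER, BUFFER_LENGTH * ATTRIBUTE_SIZE, GL_DYNAMIC_COPY);
		
		//...dispatch the compute shader and sync as before...
		
		//Map the buffer only for the readback, then unmap it again
		ByteBuffer mapped = glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY, null);
		for(int i = 0; i < BUFFER_LENGTH; i++){
			System.out.println(mapped.getInt());
		}
		glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);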


I ran the exact same code, unchanged, on an AMD HD 7790 card. I always get the correct result.

Before: 0
Wait successful!
Result:
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
After: 8

Exiting.

This is an Nvidia driver bug. It has to be. I'm getting wrong, nondeterministic behaviour while following the spec exactly.

 

This is what the spec says:

 - No glMemoryBarrier()s should be required since I do not have multiple compute shaders running after each other, which those barriers handle.

 - No barriers should be required in the shader since those barriers only work inside the same work group.

 - No barriers should be needed after the upload to the persistent coherent buffer, as the changes are always visible to the GPU after the CPU has finished writing to it.

 - Only a glClientWaitSync() should be required; once the wait completes, the data should be visible in the persistent buffer (see the sketch below).
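Condensed from the code earlier in the thread, the minimal sequence this reading of the spec implies would be something like:

		//CPU write; the coherent mapping makes it visible to the GPU
		mappedBuffer.putInt(0, 0);
		
		computeShader.bind();
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer);
		glDispatchCompute(1, 1, 1);
		
		//Fence after the dispatch, then make the CPU wait for the GPU
		GLSync sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
		glClientWaitSync(sync, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
		glDeleteSync(sync);
		
		//The GPU's write should now be visible to the CPU
		int result = mappedBuffer.getInt(0);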

 

Thank you everyone for helping me narrow down the error, especially JoeJ, who came up with a lot of good ideas and at least made me look up the spec to check that I was doing everything correctly.

 

 

How does this NOT prove that the shader is executed twice? It doesn't say anything about WHY it's executed twice, but it is.


No; all you have proven is that observed results != expected results.
The API will not do things you do not tell it.
The shader runs as many times as you tell it.

The error is likely in the buffer management code.

All I've proven is that the observed results match what I'd expect if the shader ran twice. It's a symptom, not a diagnosis. Stop nitpicking words. If you don't want to help then you have no reason to post, especially if your only reason for posting is some imaginary tone you think my post has.

 

EDIT: I'll be making a single-file standalone test tomorrow, since I need to focus on studying for a test right now. I code in Java using LWJGL though, so if possible it'd be nice if someone could port it to some more mainstream programming language (e.g. C/C++). It'd most likely be a 5-minute job.

OK, let's return to your first post:
 

It seems like the compute shader is randomly run more than once, which screws up my test shader.


The 'magic' implication in my first reply came from the 'it is randomly doing this' framing, which implies it is acting without any instruction, i.e. not doing what you told it, and is thus 'magically running more than once'.

As to your second point: no, you did not and still have not proven that the shader runs more times than you told it to. What you have proven is that there is an unexpected issue updating the memory you expect to hold the result, such that between issuing two runs of the compute shader the buffer is written to twice before your code reads it back once. Ergo the compute shader runs as many times as you tell it, but you are not seeing the client-side update correctly; a buffer-management problem.

The big warning for me is the bit which says 'mapped in DMA CACHED memory', which implies that someone is having to pull and push the data and that things aren't being written quite as expected. Could this be a driver bug? Maybe. OpenGL can be a bit woolly with requirements, and unfortunately running it on another driver doesn't always prove there is a driver bug.

My point in all this, however, was that you have gone in thinking 'the shader is being run twice randomly' and thus your results match your preconception of the problem.

My point was that the driver isn't doing it more than once, but there is something else going on with the persistently mapped memory which could be down to your wrapper, to the flags, to the driver trying to be clever, sync problems or indeed a bug with the buffer management in the driver.

Not seeing a complete run of results makes it difficult to diagnose completely; however, your second post heavily implies that 'something' is going on with the memory buffer, as removing the sync objects shows data appearing further down the buffer than expected.

I copied the code to my project, but I had to change the buffer setup; it seems my GLEW is outdated.
I get correct results on a GTX 480, but I got a shader compiler error on your shader (nonsensical messages about missing uvec4 functionality).
 
To fix that I had to move the local size declaration to the top:
 
 
[source]
#version 430

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in; // <-- changed this

layout(std430, binding = 15) buffer Data
{
    int data[];    
} dataBuffer;

void main()
{
    int offset = int(gl_GlobalInvocationID.x);
    //int offset = int(gl_WorkGroupSize.x * gl_WorkGroupID.x + gl_LocalInvocationID.x);
    
    if(offset < 8)
    {
        //dataBuffer.data[offset]++;
        atomicAdd(dataBuffer.data[offset], 1);
    }
}
[/source] 
 
 
The rest of my code:
 
 
[source]
int BUFFER_LENGTH = 16;
int size = sizeof(int) * BUFFER_LENGTH;
int* data = (int*)_aligned_malloc (size, 16);
for (int i=0; i<BUFFER_LENGTH; i++) data[i] = 0;

GLuint buf;
glGenBuffers (1, &buf);
glBindBuffer (GL_SHADER_STORAGE_BUFFER, buf);
glBufferData (GL_SHADER_STORAGE_BUFFER, size, data, GL_DYNAMIC_COPY);
glBindBufferBase (GL_SHADER_STORAGE_BUFFER, 15, buf);

// verify
int* mappedBuf = (int*) glMapBuffer (GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY);
for (int i=0; i<BUFFER_LENGTH; i++) SystemTools::Log ("before: %i\n", mappedBuf[i]);
glUnmapBuffer (GL_SHADER_STORAGE_BUFFER);

// run
glUseProgram (gpuData.computeProgramHandleTestATI);
glDispatchCompute (1, 1, 1);

// block
GLsync syncObject = glFenceSync (GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
GLenum ret = glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
if (ret == GL_WAIT_FAILED || ret == GL_TIMEOUT_EXPIRED)
    SystemTools::Log ("glClientWaitSync failed.\n");

// output
mappedBuf = (int*) glMapBuffer (GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY);
for (int i=0; i<BUFFER_LENGTH; i++) SystemTools::Log ("after: %i\n", mappedBuf[i]);
glUnmapBuffer (GL_SHADER_STORAGE_BUFFER);
[/source]
 
 
 
And here's the correct output:
 
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
before: 0
after: 1
after: 1
after: 1
after: 1
after: 1
after: 1
after: 1
after: 1
after: 0
after: 0
after: 0
after: 0
after: 0
after: 0
after: 0
after: 0
 
 
 
I'll update to the newest driver and post again if the bug is still there after that.
You do check for compiler errors?
And how do you guys format code on this forum? I can't find that functionality.


And how do you guys format code in this forum? I can't find that functionality.


If you are in BBCode mode then [ source ][ /source ] or [ code ][ /code ] (without spaces) gives you formatted blocks; the former scrolls, the latter doesn't.
(If you hit 'edit' on your post you'll see the tags I added.)
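For example:

[source]
int x = 0;
[/source]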


Here's a test program. Unrar and run the bat-file. It may complain that Java is missing. If you're sure you have Java installed, you can manually enter the full path to java.exe (Inside your Java installation folder \bin) into the .bat file.

https://drive.google.com/file/d/0B0dJlB1tP0QZVUdBZFNZZnZKQlU/edit?usp=sharing

 

Complete source code:


import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;
import static org.lwjgl.opengl.GL20.*;
import static org.lwjgl.opengl.GL30.*;
import static org.lwjgl.opengl.GL32.*;
import static org.lwjgl.opengl.GL43.*;
import static org.lwjgl.opengl.GL44.*;

import java.nio.ByteBuffer;

import org.lwjgl.opengl.Display;
import org.lwjgl.opengl.DisplayMode;
import org.lwjgl.opengl.GLSync;

public class ComputeShaderBugReproducer {

	private static String SHADER_SOURCE = 
			"#version 430\n"
					+ "\n"
					+ "layout(std430, binding = 0) buffer Data{\n"
					+ "    coherent int data[];\n"
					+ "} dataBuffer;\n"
					+ "\n"
					+ "layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;\n"
					+ "\n"
					+ "void main(){\n"
					+ "    int offset = int(gl_GlobalInvocationID.x);\n"
					+ "    dataBuffer.data[offset]++;\n"
					+ "}\n"
					;

	public static void main(String[] args) {


		//Set up a window + an OpenGL context.
		try {
			Display.setDisplayMode(new DisplayMode(640, 480));
			Display.create();
		} catch (Exception e) {
			e.printStackTrace();
			System.exit(0);
		}

		int buffer = glGenBuffers(); //Different in Java.

		glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
		glBufferStorage(GL_SHADER_STORAGE_BUFFER, 4, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT); //Create 4-byte buffer.

		//Map 4-byte buffer. Different in Java. The returned pointer is wrapped in a ByteBuffer.
		ByteBuffer mappedBuffer = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, 4, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT, null);


		int computeShaderProgram = glCreateProgram();

		int computeShader = glCreateShader(GL_COMPUTE_SHADER);
		glShaderSource(computeShader, SHADER_SOURCE);
		glCompileShader(computeShader);
		glAttachShader(computeShaderProgram, computeShader);
		glLinkProgram(computeShaderProgram);


		//Statistics
		int correctResult = 0;
		int wrongResult = 0;

		//Loop until window is closed.
		while(!Display.isCloseRequested()){

			//Reset buffer value to 0
			mappedBuffer.putInt(0, 0); //write the int 0 to index 0

			//Execute compute shader
			glUseProgram(computeShaderProgram);
			glBindBufferRange(GL_SHADER_STORAGE_BUFFER, 0, buffer, 0, 4);
			glDispatchCompute(1, 1, 1);

			//Wait for the compute shader to finish. Warn if an error occurs.
			GLSync syncObject = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
			int waitResult = glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
			if(waitResult == GL_WAIT_FAILED || waitResult == GL_TIMEOUT_EXPIRED){
				System.out.println("WAIT FAILED!!!");
			}
			glDeleteSync(syncObject);

			//Read back result from the buffer.
			int result = mappedBuffer.getInt(0); //Get int from index 0

			//Update statistics
			if(result == 1){
				correctResult++;
			}else{
				wrongResult++;
			}

			//Print out the statistics every 1 000 runs
			if((correctResult + wrongResult) % 1000 == 0){

				System.out.println("Correct result: " + correctResult);
				System.out.println("Wrong result:   " + wrongResult);
				System.out.println("Correct percentage: " + 100 * (double)correctResult/(correctResult + wrongResult) + "%");
				System.out.println(); //new line

				glClear(GL_COLOR_BUFFER_BIT);
				Display.update();
			}
		}
	}
}

More readable compute shader:

#version 430

layout(std430, binding = 0) buffer Data{
    coherent int data[];
} dataBuffer;

layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

void main(){
    int offset = int(gl_GlobalInvocationID.x);
    dataBuffer.data[offset]++;
}

The program prints out how many times the expected result (1) is calculated and how many times a wrong result is calculated (the only observed wrong result has been 2 instead of 1). I also fixed a memory leak where I wasn't correctly deleting the GLSync objects I was creating.

 

 

Test results on all computers I've tested this on:

AMD HD7790
Catalyst 14.7 (OpenGL version 6.14.10.12967)

Correct result: 1007000
Wrong result:   0
Correct percentage: 100.0%




Nvidia Geforce GTX 770 4GB
Driver version 344.11 (latest)

Correct result: 172160
Wrong result:   846840
Correct percentage: 16.894995093228655%
(Percentage varies between 0% and around 30%.)




Nvidia Geforce GTX 460M 1536MB (Laptop GPU)
Driver version 335.23

Correct result: 1000000
Wrong result:   0
Correct percentage: 100.0%




Nvidia Geforce GTX 460M 1536MB (Laptop GPU)
Driver version 340.52

Correct result: 89000
Wrong result:   0
Correct percentage: 100.0%

So far the problem only occurs on Kepler GPUs. Unsure whether the same problem happens on older drivers, or whether it happens on the new Maxwell GPUs.

As to your second point: no, you did not and still have not proven that the shader runs more times than you told it to. What you have proven is that there is an unexpected issue updating the memory you expect to hold the result, such that between issuing two runs of the compute shader the buffer is written to twice before your code reads it back once. Ergo the compute shader runs as many times as you tell it, but you are not seeing the client-side update correctly; a buffer-management problem.

The big warning for me is the bit which says 'mapped in DMA CACHED memory', which implies that someone is having to pull and push the data and that things aren't being written quite as expected. Could this be a driver bug? Maybe. OpenGL can be a bit woolly with requirements, and unfortunately running it on another driver doesn't always prove there is a driver bug.

My point in all this, however, was that you have gone in thinking 'the shader is being run twice randomly' and thus your results match your preconception of the problem.

My point was that the driver isn't doing it more than once, but there is something else going on with the persistently mapped memory which could be down to your wrapper, to the flags, to the driver trying to be clever, sync problems or indeed a bug with the buffer management in the driver.

Not seeing a complete run of results makes it difficult to diagnose completely; however, your second post heavily implies that 'something' is going on with the memory buffer, as removing the sync objects shows data appearing further down the buffer than expected.
 

By "random", I did not mean that the driver did this "on a whim". I meant that it calculates an incorrect result "on some runs".

 

I have proven that it runs twice. In one of the tests I call glDispatchCompute() exactly once, exit after a single run of the program, and the result is still 2 most of the time. I can prove it even further by having the shader do the following:

 

if value == 0 ---> set value to 1

else if value == 1 ---> set value to -1000

 

I.e. there is no way this can be a problem with the buffer mapping or with synchronization, because with only a single call to glDispatchCompute with a single work group of size 1x1x1 it should be impossible for the compute shader to set the value to -1000, but that's what I'm getting. It's not a preconception. It's an observation, and I can prove that it behaves that way.

 

Here's some reading to counter your "big warning": http://en.wikipedia.org/wiki/Direct_memory_access. The driver uses the DMA engines on your GPU to keep a persistently mapped buffer coherent with the data on the GPU. And here's a quote from the spec of ARB_buffer_storage:

 

 

- If MAP_COHERENT_BIT is set and the server does a write, the app must
call FenceSync with SYNC_GPU_COMMANDS_COMPLETE (or Finish). Then the
CPU will see the writes after the sync is complete.

I am not mismanaging my buffers, and there's no sane way that this is a bug in the buffer management of the driver, as I can control exactly what is written by the second pass by checking if something has already been written (see the above pseudocode for getting it to output -1000).


I see you do not check if the shader compilation was successful.

I believe the compiler produces a garbage shader, because I got a garbage output message.

What happens if you move the local_size line, as I suggested above?


Sigh. I found the problem. It's Nvidia Surround.

 

I have two Geforce GTX 770 4GB cards running three 2560x1440 monitors in Nvidia Surround. With Surround enabled, compute shaders and SLI get messed up. Despite SLI being disabled for the application, I still get 100% load on both cards in OpenGL programs. Even worse, enabling SLI for the application does nothing. Turning off Surround solves both the SLI problem and the compute-shader-running-twice problem. Still investigating, but this explains why I can't reproduce it on other computers with the same graphics card, since almost nobody uses Nvidia Surround.


I've converted your Java code to C++, and oddly enough I get even worse results than you :P

 

The editor seems to have eaten the tabs, but it compiles/runs.

 

[source=c++]
#include <GL/glew.h>
#include <GLFW/glfw3.h>
#include <cstdio>

static const char *shaderSource = "#version 430\n"
"\n"
"layout(std430, binding = 0) buffer Data{\n"
"    int data[];\n"
"} dataBuffer;\n"
"\n"
"layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;\n"
"\n"
"void main(){\n"
"    int offset = int(gl_GlobalInvocationID.x);\n"
"    dataBuffer.data[offset]++;\n"
"}\n"
;

#define CHECK() \
{ \
    GLenum err; \
    while ((err = glGetError()) != GL_NO_ERROR) { \
        printf("GL ERROR: 0x%x\n", err); \
    } \
}

int main()
{
    if ( !glfwInit() )
    {
        return -1;
    }

    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_COMPAT_PROFILE);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 4);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 4);

    GLFWwindow *window = glfwCreateWindow(640, 480, "Compute", nullptr, nullptr);
    if ( window == nullptr )
    {
        return -1;
    }

    glfwMakeContextCurrent(window);

    if ( glewInit() != GLEW_OK )
    {
        return -1;
    }

    GLuint buffer;
    glGenBuffers(1, &buffer); //Different in Java.

    glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
    glBufferStorage(GL_SHADER_STORAGE_BUFFER, 4, nullptr, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT); //Create 4-byte buffer.

    //Map 4-byte buffer.
    int *mappedBuffer = (int *)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, 4, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);

    int computeShader = glCreateShader(GL_COMPUTE_SHADER);
    glShaderSource(computeShader, 1, &shaderSource, nullptr);
    glCompileShader(computeShader);

    GLint status;
    glGetShaderiv(computeShader, GL_COMPILE_STATUS, &status);
    if ( status != GL_TRUE )
    {
        printf("COMPILE FAILED\n");
        return -1;
    }

    CHECK();

    int computeShaderProgram = glCreateProgram();
    glAttachShader(computeShaderProgram, computeShader);
    glLinkProgram(computeShaderProgram);

    CHECK();

    int passes = 0;
    int fails = 0;
    while ( !glfwWindowShouldClose(window) )
    {
        glfwPollEvents();

        *mappedBuffer = 0;

        //Execute compute shader
        glUseProgram(computeShaderProgram);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer);
        glDispatchCompute(1, 1, 1);

        //Wait for the compute shader to finish. Warn if an error occurs.
        GLsync syncObject = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        int waitResult = glClientWaitSync(syncObject, 0, 1000*1000*1000);
        if(waitResult == GL_WAIT_FAILED || waitResult == GL_TIMEOUT_EXPIRED){
            printf("WAIT FAILED!!!\n");
        }
        glDeleteSync(syncObject);

        //Read back result from the buffer.
        int result = *mappedBuffer; //Get int from index 0

        //Update statistics
        if(result == 1){
            passes++;
        }else{
            fails++;
        }

        //Print out the statistics every 1 000 runs
        if((passes + fails) % 1000 == 0){
            printf("Correct result: %d\n", passes);
            printf("Wrong result: %d\n", fails);
            printf("Correct percentage: %.6f%%", 100 * (double)passes/(passes + fails));
            printf("\n"); //new line

            glClear(GL_COLOR_BUFFER_BIT);
            glfwSwapBuffers(window);
        }
    }

    glDeleteShader(computeShader);
    glDeleteProgram(computeShaderProgram);

    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
    glDeleteBuffers(1, &buffer);

    glfwDestroyWindow(window);
    glfwTerminate();

    return 0;
}
[/source]
 
I'm pretty sure I've converted it properly. Anyhow, what I'm seeing is that it works perfectly up until I swap buffers. After that, the value in the mapped buffer is always 0. I have an AMD Radeon HD 6870 with the latest Catalyst 14.7 beta drivers.


Does the Java program I uploaded work on your AMD card? That'd rule out a lot of things. I could imagine the C++ compiler optimizing your loop on the assumption that the value of *mappedBuffer doesn't change inside it (reading through a volatile-qualified pointer would rule that out), or something like that. Anyway, it'd be nice to know if my Java version works, as that program ran perfectly fine on my HD 7790.


Same problem with your Java program, but I'm tempted to put this down to a driver bug. I can't see anything wrong with the code, and it doesn't make sense that it stops working after a buffer swap.
