Compute Shader Invocations

Started by StanLee October 04, 2012 03:58 PM

4 comments, last by Yours3!f 11 years, 6 months ago

Hello,

I am new to OpenGL and currently working on a particle system that uses compute shaders. I have two questions. The first is about the compute shader itself. I create the particles and store them in a shader storage buffer so that I can access their positions in the compute shader. Now I want one thread per particle, which computes that particle's new position. So I dispatch one-dimensional work groups.



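#define WORK_GROUP_SIZE 128

_shaderManager->useProgram("computeProg");
// The arguments are work group counts; integer division drops any remainder.
glDispatchCompute((_numParticles/WORK_GROUP_SIZE), 1, 1);
// Make the shader storage writes visible to later shader storage accesses.
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);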

Compute shader:



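#version 430

// One particle: current and previous position (32 bytes under std430).
struct particle{
	vec4 currentPos;
	vec4 oldPos;
};

// Particle array in the shader storage buffer bound at binding point 0.
layout(std430, binding=0) buffer particles{
	particle p[];
};

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

void main(){
	// One invocation per particle, indexed by the global x ID.
	uint gid = gl_GlobalInvocationID.x;
	p[gid].currentPos.x += 100;
}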

But somehow not all particles are affected. I am doing this the same way it was done in this example but it doesn't work.
http://education.sig...eShader_6pp.pdf

When I want to render 128,000 particles, the code above dispatches 128,000/128 = 1,000 one-dimensional work groups, each of size 128. Doesn't that create 128 * 1,000 = 128,000 threads which execute the compute shader above, so that all particles are affected? Each thread would have a different ID in gl_GlobalInvocationID.x because all work groups are one-dimensional. Am I missing something?
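The arithmetic in the paragraph above can be checked on the CPU. The sketch below uses two hypothetical helpers (`workGroupCount` and `totalInvocations` are not names from the original code). It rounds the group count up, which is an assumption about what is wanted: the plain integer division in `_numParticles/WORK_GROUP_SIZE` gives the same result for 128,000 particles, but would drop the remainder for counts that are not a multiple of 128.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of work groups needed to cover every particle.
// Rounds up so that a trailing partial group is still dispatched.
inline std::uint32_t workGroupCount(std::uint32_t numParticles,
                                    std::uint32_t localSize) {
    assert(localSize > 0);
    return numParticles / localSize + (numParticles % localSize != 0 ? 1u : 0u);
}

// Hypothetical helper: invocations launched along x, i.e. groups * local_size_x.
inline std::uint64_t totalInvocations(std::uint32_t groups,
                                      std::uint32_t localSize) {
    return static_cast<std::uint64_t>(groups) * localSize;
}
```

For 128,000 particles and a local size of 128 this gives 1,000 groups and 128,000 invocations, which agrees with the reasoning above. When rounding up produces more invocations than particles, the shader would need a guard that compares gl_GlobalInvocationID.x against the particle count (passed in as a uniform, for example) before writing.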

My other question relates to glDrawArrays().
The vertex shader receives all the vertices from the shader storage buffer and passes them through to the geometry shader, where I emit 4 vertices to create a quad onto which I map my texture in the fragment shader. The structure stored in the shader storage buffer for every particle looks like this:



struct Particle{
	glm::vec4 _currPosition;
	glm::vec4 _prevPosition;
};

When I draw the scene I do the following:



glBindBuffer(GL_ARRAY_BUFFER, BufferID);
glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, sizeof(glm::vec4), 0);
glEnableVertexAttribArray(0);
glEnableClientState(GL_VERTEX_ARRAY);
glDrawArrays(GL_POINTS, 0, _numParticles*2);
glDisableClientState(GL_VERTEX_ARRAY);
glBindBuffer(GL_ARRAY_BUFFER, 0);

Somehow, when I just call glDrawArrays(GL_POINTS, 0, _numParticles), not all particles are rendered. Why does this happen?
I suspect the two vec4 members in the particle struct are the reason, but I am not sure. Could somebody please explain it?
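One way to test that suspicion is to work out where each vertex fetch lands. The stride argument of glVertexAttribPointer is the byte distance between consecutive vertices, and the call above passes sizeof(glm::vec4) (16 bytes) while each particle occupies 32 bytes. The sketch below is only an illustration: `Vec4` is a stand-in for glm::vec4, and `fetchOffset`, `particleOf`, and `readsPrevPosition` are hypothetical helpers, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for glm::vec4 and the Particle struct above (assumed 4-byte floats).
struct Vec4 { float x, y, z, w; };
struct Particle { Vec4 currPosition; Vec4 prevPosition; };

// Byte offset at which vertex `index` is fetched for a given stride.
inline std::size_t fetchOffset(std::size_t index, std::size_t stride) {
    return index * stride;
}

// Index of the particle that the fetch for vertex `index` lands in.
inline std::size_t particleOf(std::size_t index, std::size_t stride) {
    return fetchOffset(index, stride) / sizeof(Particle);
}

// True if the fetch reads prevPosition instead of currPosition.
inline bool readsPrevPosition(std::size_t index, std::size_t stride) {
    return fetchOffset(index, stride) % sizeof(Particle)
           == offsetof(Particle, prevPosition);
}
```

With a 16-byte stride, even vertices read a particle's current position and odd vertices read its previous position, so drawing _numParticles vertices only reaches the first half of the particles. With a stride of sizeof(Particle), each vertex would correspond to exactly one particle. If this reading is right, it might also explain why only some points seem to move: the compute shader changes currentPos only, while half of the drawn points would come from oldPos.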

Regards,
StanLee

Yours3!f

October 05, 2012 07:05 AM

StanLee wrote: "When I want to render 128,000 particles, the code above dispatches 128,000/128 = 1,000 one-dimensional work groups, each of size 128. Doesn't that create 128 * 1,000 = 128,000 threads which execute the compute shader above, so that all particles are affected? Each thread would have a different ID in gl_GlobalInvocationID.x because all work groups are one-dimensional. Am I missing something?"
Well, the way you'd usually want to do this is to exploit the GPU's local data share (LDS), which lets the invocations within a work group cooperate efficiently. On GPUs, when I used OpenCL, the best work group size (when working with images) was about 16x16x1. This is because each work group has a limited amount of local memory (about 4 KB I think, but this may vary). The maximum work group size was 256 in every direction (x, y, z), and you had to make sure that your global work size was divisible by the local work group size.
So to answer your question: the GPU does not run that many threads at once (because it doesn't have that many compute cores); it runs some of them and then works its way gradually through the data.
In this case your local work group size should IMO be 256 to take maximum advantage.
Your global work size should be a multiple of it, i.e. 500 x 256 = 128000.
This way your local invocation ID would be 0...255, 0, 0
and your global invocation ID would be 0...127999, 0, 0.
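The ID ranges listed above follow from how the global ID is built from the group ID and the local ID: gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID. A minimal sketch of the x component, with a hypothetical function name chosen only for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the x component of
// gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.
inline std::uint32_t globalInvocationId(std::uint32_t workGroupId,
                                        std::uint32_t workGroupSize,
                                        std::uint32_t localInvocationId) {
    assert(localInvocationId < workGroupSize);
    return workGroupId * workGroupSize + localInvocationId;
}
```

With 500 groups of 256 invocations, the first invocation of group 0 gets ID 0 and the last invocation of group 499 gets ID 127999.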

so I believe changing this:
glDispatchCompute((_numParticles/WORK_GROUP_SIZE), 1, 1);
to this would solve the issue:
glDispatchCompute(_numParticles, 1, 1);
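Before trying this, it may help to count what the call would request. The arguments of glDispatchCompute are numbers of work groups, not numbers of invocations, so the total depends on the local size declared in the shader. The sketch below assumes local_size_x stays at 128 as in the shader from the first post; `invocationsPastEnd` is a hypothetical helper, not an OpenGL function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: how many invocations along x would get a global ID
// that is >= numParticles, i.e. would index past the end of the particle array.
inline std::uint64_t invocationsPastEnd(std::uint64_t numGroupsX,
                                        std::uint64_t localSizeX,
                                        std::uint64_t numParticles) {
    const std::uint64_t requested = numGroupsX * localSizeX;
    return requested > numParticles ? requested - numParticles : 0;
}
```

Under that assumption, glDispatchCompute(_numParticles, 1, 1) with 128,000 particles would launch 128,000 groups of 128 invocations, and all but the first 128,000 invocations would fall outside the particle array unless the shader checks its ID against the particle count.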

I'm not sure about the glDrawArrays issue.

Blog:

http://extremeistan.wordpress.com/

Stuff I wrote:

https://github.com/Yours3lf/libmymath

https://github.com/Yours3lf/linux_gl_fps

https://github.com/Yours3lf/instanced_font_rendering

http://youtu.be/k8PYkihyGXA

https://github.com/scrawl/smaa-opengl

https://github.com/Yours3lf/gl_browser_gui

Follow me on twitter:

https://twitter.com/0martint

StanLee

October 05, 2012 07:55 AM

Thanks for your reply!

I've tested your suggestion but unfortunately nothing moves at all. :/

So I decided to update all positions in a single thread, just for testing purposes. I changed the compute shader:



layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

void main(){
	uint gid = gl_GlobalInvocationID.x;

	// Only the first invocation runs, and it updates every particle in turn.
	if(gid == 0){
		for(int i = 0; i < maxParticles; ++i){
			p[i].currentPos.x += 100;
		}
	}
}

And called:



glDispatchCompute(1, 1, 1);

But somehow the same problem occurs and not all particles are moving. I just don't get it. Is there good documentation somewhere about the new compute shader in OpenGL?

Regards Stan

japro

October 05, 2012 11:45 AM

There are the specs. It's such a new feature that the only drivers I know about (NVIDIA) are still "very beta". I have two examples here: https://github.com/p...er/experimental which are also particle systems. They worked on the first version of the 4.3 driver; I haven't tested them on the current one.


Yours3!f

October 05, 2012 12:29 PM

Oh well.

I can't really help you beyond this point as I don't have an NVIDIA video card. But as japro said, try the specs; they're usually very helpful.
You can also try to implement the whole thing on the CPU side in C++ and then move it gradually to the GPU. That is, try moving 1 particle, then 2, then all of them.
If nothing works, you still have the option to use OpenCL until stable drivers, programming examples, tutorials, etc. come out.
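A CPU-side reference for the "1 particle, then 2, then all of them" approach could look like the sketch below. It is only an illustration: `Vec4` stands in for glm::vec4, and `moveParticles` and `countMoved` are hypothetical names. It applies the same update as the compute shader (adding 100 to currentPos.x) to the first `count` particles, so the result can be compared with what the GPU path produces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for the types used in the thread (Vec4 replaces glm::vec4).
struct Vec4 { float x, y, z, w; };
struct Particle { Vec4 currentPos; Vec4 oldPos; };

// Applies the compute shader's update to the first `count` particles only.
inline void moveParticles(std::vector<Particle>& particles, std::size_t count,
                          float dx = 100.0f) {
    assert(count <= particles.size());
    for (std::size_t i = 0; i < count; ++i) {
        particles[i].currentPos.x += dx;
    }
}

// Counts particles whose current x position differs from the starting value.
inline std::size_t countMoved(const std::vector<Particle>& particles,
                              float originalX) {
    std::size_t moved = 0;
    for (const Particle& p : particles) {
        if (p.currentPos.x != originalX) {
            ++moved;
        }
    }
    return moved;
}
```

Running it with count = 1, then 2, then the full size gives the expected number of moved particles at each step, which can be checked against what is actually rendered.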
