amtri

shader use of glUniform1fv very slow


Hello,

 

I need to draw hundreds of thousands of cubes, each centered at a different location and of different size.

 

I first drew each cube as 12 triangles (2 per cube face), using glDrawArrays. To draw this I need to pass 12*9 = 108 floating point coordinates per cube to the graphics card.

 

I then thought of speeding this up using the following algorithm:

 

1) Use glVertexPointer on a standard unit cube centered at zero

 

2) Then, for each cube, I pass its center and a scaling parameter with glUniform1fv. I use a vertex shader that applies these 4 numbers by scaling and translating each coordinate component. The glDrawArrays command always draws the same unit cube, but the shader program takes care of positioning each point in its proper location.
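A minimal CPU-side sketch of the per-vertex arithmetic described in step 2, assuming the four numbers are laid out as (center x, center y, center z, scale). The name `place_vertex` and the parameter layout are illustrative and not taken from the original shader, which would do the equivalent work in GLSL.

```c
#include <stddef.h>

/* Hypothetical helper: maps one vertex of the unit cube (centered at zero)
 * to its final position, given params = { cx, cy, cz, scale }. */
void place_vertex(const float unit[3], const float params[4], float out[3])
{
    for (size_t i = 0; i < 3; ++i) {
        /* Scale the unit-cube coordinate, then translate it to the center. */
        out[i] = params[i] + params[3] * unit[i];
    }
}
```

With this layout the same 36 unit-cube vertices can be reused for every cube, and only the 4 parameters change from cube to cube.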

 

The algorithm works: I do get my cubes with the right size and at the location I want them. And I am passing only 4 floating point numbers per cube, rather than 108. Yet, this process is probably about 100 times slower than before.

 

I narrowed the problem down to the call to glUniform1fv. If I call it just once, rather than once per cube, I get my performance back. Of course, the cubes are not in the right location and are not of the right size, but at least I know the culprit.

 

Can anyone shed some light on why there is such a loss of performance in this function? And, better yet, a suggestion on how to really improve my performance with an algorithm like this to the point that it's better than my original triangulation?

 

I'm puzzled by this loss of performance when I'm sending fewer points to the graphics card.

 

Thanks.


First, why are you calling glUniform1fv? Assuming uniform scaling, position and scale would fit perfectly into a single vec4. That's just one uniform location used, and you can set all four values with a single call to glUniform4f/glUniform4fv. Of course you could get rid of the uniforms entirely and instead put (x, y, z, scale) into a single vec4 vertex attribute, make an appropriate call to glVertexAttribDivisor and use glDrawArraysInstanced.
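As a rough sketch of the (x, y, z, scale) packing suggested here, the following C helper interleaves separate center and size arrays into one array of 4-float records. The `CubeInstance` type, the `pack_instances` name and the input layout (3 floats per center, 1 float per size) are assumptions made for illustration only.

```c
#include <stddef.h>

/* One per-cube record: position in x, y, z and a uniform scale. */
typedef struct {
    float x, y, z, scale;
} CubeInstance;

/* Hypothetical helper: builds `count` packed records from a centers array
 * (3 floats per cube) and a sizes array (1 float per cube). */
void pack_instances(const float *centers, const float *sizes,
                    size_t count, CubeInstance *out)
{
    for (size_t i = 0; i < count; ++i) {
        out[i].x = centers[3 * i + 0];
        out[i].y = centers[3 * i + 1];
        out[i].z = centers[3 * i + 2];
        out[i].scale = sizes[i];
    }
}
```

An array built this way has one record per cube, which is the shape a per-instance vec4 attribute would expect.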

 

The question of 'why' is difficult to answer with the little information that is given. To start with, what kind of graphics card do you have? How old are the drivers? What kind of OpenGL context are you creating? What OpenGL version are you targeting? How are you setting your uniforms (hopefully not four calls to glUniform1fv for every four values)?

 

In the past I observed some old drivers which literally recompiled the entire shader for every glUniform* call.

Edited by BitMaster


BitMaster,

 

You are right: position and scale fit perfectly into a single float[4] array, and this is exactly what I'm using, with a single call to glUniform1fv which allows me to pass the array to the shader. I don't think this is much different from a vec4, is it?

 

I didn't know about glVertexAttribDivisor, so I just quickly read up on it. Let me see whether I can get this right (the details are still fuzzy...):

 

1) My unit cube has its coordinates set once, for a single cube. The reason I had avoided using an array of attributes is that since I have 36 vertices in my unit cube I was under the impression that if I were to use an array of attributes I would need to copy the data for each cube 36 times to make sure each vertex has its own data. I felt this would defeat my purpose of reducing the data being sent down to the card.

 

2) Now suppose that rather than drawing one cube at a time I create, say, 64 copies of the cube with glVertexPointer. This way I can buffer the data and reduce the number of calls associated with the attribute. In particular, if I understand correctly, I only need to set attribute values 64 times, and NOT 64*36 times, if I use glVertexAttribDivisor. Then the number of calls to pass down the attribute values would be reduced by a factor of 64 (or whatever multiple I choose).

 

3) From what I understand, when working with the vertex shader after a call to glDrawArrays I don't really have any control over the order the vertices are coming in through. But it appears that glDrawArraysInstanced gives me some control over this. Is this correct?

 

Thanks.


You are right: position and scale fit perfectly into a single float[4] array, and this is exactly what I'm using, with a single call to glUniform1fv which allows me to pass the array to the shader. I don't think this is much different from a vec4, is it?

I'm not an expert on this and it might even be implementation dependent, but I believe there will be differences. A single uniform location can store four float values. I fear an array of single floats will waste three fourths of that space.
Even assuming the graphics card can pack the array tightly and doesn't waste extra time on every glUniform* call, I would worry whether the compiled shader code would be optimal (I'm not sure how well the swizzle operators in a vec4 would transfer to array indexing).
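To put a number on the "three fourths" concern, here is a small sketch based on the std140 layout rules for uniform blocks, where the stride of an array of scalars or vectors is rounded up to 16 bytes. This is an analogy only: plain uniforms set with glUniform* are not required to follow std140, so the driver in question may behave differently. The function names are hypothetical.

```c
#include <stddef.h>

/* std140 rule for arrays of scalars or vectors: the per-element stride is
 * the element size rounded up to a multiple of 16 bytes (the size of a vec4). */
size_t std140_array_stride(size_t element_size)
{
    return (element_size + 15) / 16 * 16;
}

/* Total bytes occupied by an array of `count` such elements under std140. */
size_t std140_array_size(size_t element_size, size_t count)
{
    return std140_array_stride(element_size) * count;
}
```

Under these rules a float[4] array occupies 64 bytes while a single vec4 occupies 16, so three fourths of the array's storage is padding.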
 

1) My unit cube has its coordinates set once, for a single cube. The reason I had avoided using an array of attributes is that since I have 36 vertices in my unit cube I was under the impression that if I were to use an array of attributes I would need to copy the data for each cube 36 times to make sure each vertex has its own data. I felt this would defeat my purpose of reducing the data being sent down to the card.

First, reducing the data being sent is not always the best solution. Transferring data from the system to the graphics card is expensive, yes. But it is not as expensive as it used to be, and significantly increasing the number of GL calls instead will probably turn it into a fool's errand. GL calls are expensive, and some even more so than others.
That said, I'm not even sure what you mean by not using an array. If you need individual normals for each cube face, then the data has to be duplicated. It doesn't matter if you use the deprecated API (glVertex*, glNormal*) or glDrawArrays (with or without a VBO), the same data has to be sent to the card behind the scenes.
In any case, either with uniforms or glVertexAttribDivisor you just need one unit cube for the whole lifetime of the program. Upload the data once to a VBO. The only question is how to transfer and store the (position, scale) information. That would depend heavily on how much change is expected from frame to frame, how many cubes there actually are, and whether some are more static than others.
 

2) Now suppose that rather than drawing one cube at a time I create, say, 64 copies of the cube with glVertexPointer. This way I can buffer the data and reduce the number of calls associated with the attribute. In particular, if I understand correctly, I only need to set attribute values 64 times, and NOT 64*36 times, if I use glVertexAttribDivisor. Then the number of calls to pass down the attribute values would be reduced by a factor of 64 (or whatever multiple I choose).

As said above, the only sensible thing is putting the cube data into a VBO. It is completely immutable for the whole run time. If you then need to render N cubes at different positions and sizes you set a vertex attribute which contains a vec4 and set glVertexAttribDivisor(theAttributeIndex, 1). So the array backing it must contain N vec4s and you call glDrawArraysInstanced with an instance count of N.
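To illustrate what a divisor of 1 means for the array of N vec4s, here is a CPU-side sketch of how the element fetched for an attribute is selected. It assumes a base instance of 0 and is only a model of the addressing rule, not driver code; the function name is made up for this example.

```c
/* Returns which array element an attribute reads for a given vertex and
 * instance (base instance assumed to be 0).
 * divisor == 0: the attribute advances once per vertex.
 * divisor  > 0: the attribute advances once every `divisor` instances. */
unsigned attribute_element(unsigned vertex_id, unsigned instance_id,
                           unsigned divisor)
{
    if (divisor == 0) {
        return vertex_id;
    }
    return instance_id / divisor;
}
```

With a divisor of 1, all 36 vertices of instance 7 read element 7 of the per-instance array, so the array needs N entries rather than N*36.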
 

3) From what I understand, when working with the vertex shader after a call to glDrawArrays I don't really have any control over the order the vertices are coming in through. But it appears that glDrawArraysInstanced gives me some control over this. Is this correct?

The graphics card executes dozens or even hundreds of vertex/fragment shaders in parallel. Speaking of order has absolutely no point here. You can access some information like the index of the current vertex (gl_VertexID) or the index of the current instance (gl_InstanceID) in the individual shader instance (see this page for more details) but the nature of the massive parallelism limits both the information available and what you can do with it.

Edited by BitMaster


BitMaster,

 

First of all, thanks for all the help. I have almost everything going, except for the fact that glVertexAttribDivisor crashes on x64 with GLEW version 1.10.0. Here's a quick rundown of what I have, if you don't mind checking it over for me:

 

1) In the vertex shader:

 

...

in vec3 center;

...

... center.x ...

... center.y ...

... center.z ...

 

2) In the C code:

 

GLint centerloc = glGetAttribLocation (program,"center"); /* resulting centerloc = 1 */

 

glEnableVertexAttribArray (centerloc);
glVertexAttribPointer (centerloc,3,GL_FLOAT,GL_FALSE,0,x);
glVertexAttribDivisor(centerloc,1); /*** crashes here with a 0x00000 pointer! ***/

 

I haven't even tried to draw anything because the program crashes during the setup.

 

Any thoughts?

 

Thanks.


What kind of OpenGL context are you targeting? glVertexAttribDivisor is core in 3.3. Have you tried updating the graphics card driver? Have you checked what your hardware supports? Have you tried setting 'glewExperimental = GL_TRUE;' before calling glewInit()?


Hmm... I'm not sure I'm in a position to answer all your questions, but here's my best shot:

 

1) OpenGL context: I call

 

ctx = wglCreateContext (GetDC((HWND)window));

 

2) My graphics driver is the latest, and the documentation states that it has support for OpenGL higher than 4.

 

3) I set glewExperimental = GL_TRUE;

 

This did make a difference: I no longer get a crash in glVertexAttribDivisor, but I now get a crash in

 

glDrawArraysInstanced (GL_TRIANGLES,0,36,npts);

 

where npts is an integer, the number of cubes I want to draw.

 

With glDrawArraysInstanced I get the crash even if npts=1. If I comment out the glVertexAttribDivisor call and replace glDrawArraysInstanced with glDrawArrays(GL_TRIANGLES,0,36), then I do get my cube drawn.

In this case there is probably something going wrong with
glVertexAttribPointer (centerloc,3,GL_FLOAT,GL_FALSE,0,x);
 
If you moved your cubes into a VBO, remember that you need to unbind the vertex buffer (bind 0 to GL_ARRAY_BUFFER) before calling glVertexAttribPointer for the center location, otherwise x is interpreted as a byte offset into the VBO (which will in almost all scenarios just blow up). You also need to make sure that the memory pointed to by x is not freed until the glDraw* call has happened.
 
There could be a lot of other minor issues but these come to mind first.

Edit: Just for completeness, you did check that glDrawArraysInstanced is not nullptr? It has been core since 3.1, so that really should not be the case, but it still should be checked.

Edited by BitMaster
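The null check suggested above could be sketched like this. The `EntryPoint` table and `count_missing_entry_points` are hypothetical helpers written for this example; they are not part of GLEW or OpenGL.

```c
#include <stddef.h>
#include <stdio.h>

/* Generic function-pointer type used only to compare against NULL. */
typedef void (*GenericProc)(void);

typedef struct {
    const char *name;   /* name to report if the pointer is missing */
    GenericProc proc;   /* loaded entry point, or NULL if unavailable */
} EntryPoint;

/* Returns how many entries have a NULL pointer and reports each one. */
size_t count_missing_entry_points(const EntryPoint *table, size_t count)
{
    size_t missing = 0;
    for (size_t i = 0; i < count; ++i) {
        if (table[i].proc == NULL) {
            fprintf(stderr, "missing entry point: %s\n", table[i].name);
            ++missing;
        }
    }
    return missing;
}
```

With a loader such as GLEW, where names like glVertexAttribDivisor and glDrawArraysInstanced resolve to function pointers filled in by glewInit(), a table entry could presumably be written as { "glDrawArraysInstanced", (GenericProc)glDrawArraysInstanced } and checked once after initialization.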


I narrowed the problem down to the call to glUniform1fv. If I call it just once, rather than once per cube, I get my performance back. Of course, the cubes are not in the right location and are not of the right size, but at least I know the culprit.

Instanced rendering is always the solution to this type of rendering so I am not going to derail the current direction of the thread, but just so you know, you did not necessarily find the culprit.

What you have found is that either ::glUniform1fv() is a problem, fill-rate is a problem, or something you are doing on the CPU to create the matrices for each instance is a problem.

 

If you aren’t rebuilding the matrix data you can eliminate that possibility, but otherwise you have 99,999 cubes out of 100,000 being early-Z-culled (assuming depth compare is GL_LESS as it should be).  Meaning you could have a fill-rate problem/pixel-shader problem.

 

 

L. Spiro


Well, I narrowed this down: a lot of the performance problem came from the fact that I was NOT setting my cube coordinates in a vbo. I did that now and the performance is very, very different - whether I use an attribute array or not.

 

Now the problem I'm running into is that the center of the cubes - stored in an array as a vertex attribute - is not moving forward with the InstanceID. Result: all cubes are being drawn in the same location.

 

This brings up the question: can I have some vertex data in a vbo, and some in a pointer array? Or do I HAVE to put everything in the vbo if I'm going to use a vbo for anything? Although my cube is all a single color, I hacked the code to have one color per vertex, set the 36 colors in an array, then called

 

glColorPointer (3,GL_FLOAT,0,chex);
glEnableClientState (GL_COLOR_ARRAY);

 

But the cube still came out in a single color, ignoring all the colors in the "chex" array. Also, the cube coordinates and normals are interleaved in the vbo - i.e., x1,c1,x2,c2, etc.
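For reference, an interleaved layout like the one described can be written down as a C struct, from which the stride and byte offsets follow directly. This is a sketch that assumes 3 floats of position followed by 3 floats of normal per vertex; the type and constant names are invented for illustration and do not come from the original code.

```c
#include <stddef.h>

/* Assumed interleaved layout: position, then normal, for each vertex. */
typedef struct {
    float position[3];
    float normal[3];
} InterleavedVertex;

enum {
    /* Bytes from the start of one vertex to the start of the next. */
    VERTEX_STRIDE   = sizeof(InterleavedVertex),
    /* Byte offsets of each field from the start of a vertex. */
    POSITION_OFFSET = offsetof(InterleavedVertex, position),
    NORMAL_OFFSET   = offsetof(InterleavedVertex, normal)
};
```

While a buffer is bound to GL_ARRAY_BUFFER, the last argument of the gl*Pointer calls is interpreted as a byte offset like these rather than as a client-memory address.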

 

Can anybody shed any light on this combination of ColorPointer outside of a vbo with both normals and coordinates in a vbo?

 

Thanks.
