shader use of glUniform1fv very slow


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

12 replies to this topic

#1 amtri   Members   -  Reputation: 175


Posted 13 January 2014 - 02:03 PM

Hello,

 

I need to draw hundreds of thousands of cubes, each centered at a different location and of different size.

 

I first drew the cubes as 12 triangles (2 per cube face), using glDrawArrays. To draw this I need to pass 12*9 floating point coordinates to the graphics card.

 

I then thought of speeding this up using the following algorithm:

 

1) Use glVertexPointer on a standard unit cube centered at zero

 

2) Then, for each cube, I pass its center and a scaling parameter with glUniform1fv. I use a vertex shader that parses these 4 numbers by scaling and translating each coordinate component. The glDrawArrays command always draws the same unit cube, but the shader program will take care of positioning each point in its proper location.

 

The algorithm works: I do get my cubes with the right size and at the location I want them. And I am passing only 4 floating point numbers per cube, rather than 108. Yet, this process is probably about 100 times slower than before.
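To make the arithmetic in this post concrete, here is a minimal CPU-side sketch in C of the per-vertex work the shader is described as doing: scale each unit-cube vertex, then translate it by the cube's center. The names `CubeParams` and `place_vertex` are hypothetical and only illustrate the idea; the actual shader was not posted.

```c
#include <stddef.h>

/* Hypothetical per-cube parameters: center (cx, cy, cz) and a uniform scale. */
typedef struct {
    float cx, cy, cz;
    float scale;
} CubeParams;

/* Mirrors the per-vertex work described for the shader:
   world = center + scale * unit_vertex. */
static void place_vertex(const CubeParams *cube, const float unit[3], float out[3])
{
    out[0] = cube->cx + cube->scale * unit[0];
    out[1] = cube->cy + cube->scale * unit[1];
    out[2] = cube->cz + cube->scale * unit[2];
}

/* Floats per cube when every triangle is expanded:
   12 triangles * 3 vertices * 3 coordinates. */
static size_t floats_per_expanded_cube(void)
{
    return 12u * 3u * 3u;
}

/* Floats per cube when only center and scale are sent. */
static size_t floats_per_cube_params(void)
{
    return sizeof(CubeParams) / sizeof(float);
}
```

The two helper functions reproduce the 108-versus-4 float count from the post. As the rest of the thread shows, the number of floats sent is only one factor; the number of GL calls made per cube matters as well.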

 

I narrowed the problem down to the call to glUniform1fv. If I call it just once, rather than once per cube, I get my performance back. Of course, the cubes are not in the right location and are not of the right size, but at least I know the culprit.

 

Can anyone shed some light on why there is such a loss of performance in this function? And, better yet, a suggestion on how to really improve my performance with an algorithm like this to the point that it's better than my original triangulation?

 

I'm puzzled by this loss of performance when I'm sending fewer points to the graphics card.

 

Thanks.




#2 BitMaster   Crossbones+   -  Reputation: 4088


Posted 13 January 2014 - 04:02 PM

First, why are you calling glUniform1fv? Assuming uniform scaling, position and scale would fit perfectly into a single vec4. That's just one uniform location used and you can set them all with a single call to glUniform4f/glUniform4fv. Of course you could get rid of the uniforms entirely and instead put (x, y, z, scale) into a single vec4 vertex attribute, do an appropriate call to glVertexAttribDivisor and use glDrawArraysInstanced.

 

The question of 'why' is difficult to answer with the little information that is given. To start with, what kind of graphics card do you have? How old are the drivers? What kind of OpenGL context are you creating? What OpenGL version are you targeting? How are you setting your uniforms (hopefully not four calls to glUniform1fv for every four values)?

 

In the past I observed some old drivers which literally recompiled the entire shader for every glUniform* call.


Edited by BitMaster, 13 January 2014 - 04:02 PM.


#3 amtri   Members   -  Reputation: 175


Posted 13 January 2014 - 04:53 PM

BitMaster,

 

You are right: position and scale fit perfectly into a single float[4] array, and this is exactly what I'm using, with a single call to glUniform1fv which allows me to pass the array to the shader. I don't think this is much different than a vec4, is it?

 

I didn't know about glVertexAttribDivisor, so I just quickly read up on it. Let me see whether I can get this right (the details are still fuzzy...):

 

1) My unit cube has its coordinates set once, for a single cube. The reason I had avoided using an array of attributes is that since I have 36 vertices in my unit cube I was under the impression that if I were to use an array of attributes I would need to copy the data for each cube 36 times to make sure each vertex has its own data. I felt this would defeat my purpose of reducing the data being sent down to the card.

 

2) Now suppose that rather than drawing one cube at a time I create, say, 64 copies of the cube with glVertexPointer. This way I can buffer the data and reduce the number of calls associated with the attribute - especially, if I understand correctly, I only need to send attribute values 64 times, and NOT 64*36 times, if I use glVertexAttribDivisor. Then the number of calls to pass down the attribute values would be reduced by a factor of 64 (or whatever multiple I choose).

 

3) From what I understand, when working with the vertex shader after a call to glDrawArrays I don't really have any control over the order the vertices are coming in through. But it appears that glDrawArraysInstanced gives me some control over this. Is this correct?

 

Thanks.



#4 BitMaster   Crossbones+   -  Reputation: 4088


Posted 14 January 2014 - 02:10 AM

You are right: position and scale fit perfectly into a single float[4] array, and this is exactly what I'm using, with a single call to glUniform1fv which allows me to pass the array to the shader. I don't think this is much different than a vec4, is it?

I'm not an expert on this and it might even be implementation dependent, but I believe there will be differences. A single uniform location can store four float values. I fear an array of single floats will waste three fourths of that space.
Even assuming the graphics card can pack the array tightly and doesn't waste extra time on every glUniform* call, I would worry whether the compiled shader code would be optimal (I'm not sure how well the swizzle operators in a vec4 would transfer to array indexing).
 

1) My unit cube has its coordinates set once, for a single cube. The reason I had avoided using an array of attributes is that since I have 36 vertices in my unit cube I was under the impression that if I were to use an array of attributes I would need to copy the data for each cube 36 times to make sure each vertex has its own data. I felt this would defeat my purpose of reducing the data being sent down to the card.

First, reducing the data being sent is not always the best solution. Transferring data from the system to the graphics card is expensive, yes. But it is not as expensive as it used to be, and significantly increasing the number of GL calls instead will probably turn it into a fool's errand. GL calls are expensive, some even more than others.
That said, I'm not even sure what you mean by not using an array. If you need individual normals for each cube face, then the data has to be duplicated. It doesn't matter if you use the deprecated API (glVertex*, glNormal*) or glDrawArrays (with or without a VBO), the same data has to be sent to the card behind the scenes.
That said, either with uniforms or glVertexAttribDivisor you just need one unit cube for the whole lifetime of the program. Upload the data once to a VBO. The only question is how to transfer and store the (position, scale) information. That would depend heavily on how much change is expected from frame to frame, how many cubes there actually are, and whether some are more static than others.
 

2) Now suppose that rather than drawing one cube at a time I create, say, 64 copies of the cube with glVertexPointer. This way I can buffer the data and reduce the number of calls associated with the attribute - especially, if I understand correctly, I only need to send attribute values 64 times, and NOT 64*36 times, if I use glVertexAttribDivisor. Then the number of calls to pass down the attribute values would be reduced by a factor of 64 (or whatever multiple I choose).

As said above, the only sensible thing is putting the cube data into a VBO. It is completely immutable for the whole run time. If you then need to render N cubes at different positions and sizes you set a vertex attribute which contains a vec4 and set glVertexAttribDivisor(theAttributeIndex, 1). So the array backing it must contain N vec4s and you call glDrawArraysInstanced with an instance count of N.
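The setup described in this paragraph could look roughly like the following C sketch. It is not code from the thread: the `Instance` type, the function names, the attribute locations 0 and 1, and the `USE_GLEW` switch are all assumptions made for illustration. The part inside `#ifdef USE_GLEW` only compiles when GLEW is available and must run with a current context that actually provides OpenGL 3.3 (and, in a core profile, with a vertex array object bound).

```c
#include <stddef.h>

#ifdef USE_GLEW   /* hypothetical switch: define only when building against GLEW */
#include <GL/glew.h>
#endif

/* Hypothetical input type: one cube's center and uniform scale. */
typedef struct {
    float x, y, z, scale;
} Instance;

/* Pack `count` instances into a tight float array of N vec4s (x, y, z, scale).
   `out` must have room for 4 * count floats. Returns the number of bytes written. */
static size_t pack_instances(const Instance *in, size_t count, float *out)
{
    for (size_t i = 0; i < count; ++i) {
        out[4 * i + 0] = in[i].x;
        out[4 * i + 1] = in[i].y;
        out[4 * i + 2] = in[i].z;
        out[4 * i + 3] = in[i].scale;
    }
    return count * 4 * sizeof(float);
}

#ifdef USE_GLEW
/* Assumes cube_vbo already holds 36 tightly packed vec3 positions and that
   attribute 0 is the cube vertex and attribute 1 the per-instance vec4. */
static void draw_cubes(GLuint cube_vbo, GLuint instance_vbo,
                       const float *packed, size_t bytes, GLsizei count)
{
    /* Per-vertex data: the immutable unit cube. */
    glBindBuffer(GL_ARRAY_BUFFER, cube_vbo);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (const void *)0);

    /* Per-instance data: one vec4 per cube. */
    glBindBuffer(GL_ARRAY_BUFFER, instance_vbo);
    glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)bytes, packed, GL_DYNAMIC_DRAW);
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, 0, (const void *)0);
    glVertexAttribDivisor(1, 1);   /* advance once per instance, not per vertex */

    /* One call draws all cubes: 36 vertices each, `count` instances. */
    glDrawArraysInstanced(GL_TRIANGLES, 0, 36, count);
}
#endif
```

`GL_DYNAMIC_DRAW` is only a guess at the usage pattern; if the cubes never move, the instance buffer could be uploaded once instead of on every draw.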
 

3) From what I understand, when working with the vertex shader after a call to glDrawArrays I don't really have any control over the order the vertices are coming in through. But it appears that glDrawArraysInstanced gives me some control over this. Is this correct?

The graphics card executes dozens or even hundreds of vertex/fragment shaders in parallel. Speaking of order has absolutely no point here. You can access some information like the index of the current vertex (gl_VertexID) or the index of the current instance (gl_InstanceID) in the individual shader instance (see this page for more details) but the nature of the massive parallelism limits both the information available and what you can do with it.
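For illustration, a vertex shader for per-instance (position, scale) data might look like the GLSL below, stored here as a C string so it can be handed to glShaderSource. This is a sketch under assumptions: the attribute names, the explicit locations, and the `mvp` uniform are invented for the example. Note that when the per-instance vec4 is fed through an attribute with a divisor of 1, the shader does not need to read gl_InstanceID at all; each shader invocation simply receives the value belonging to its instance.

```c
/* GLSL 3.30 vertex shader source kept as a C string.
   Names, locations, and the mvp uniform are assumptions for this sketch. */
static const char *const kInstancedCubeVS =
    "#version 330 core\n"
    "layout(location = 0) in vec3 position;  // unit-cube vertex, per vertex\n"
    "layout(location = 1) in vec4 instance;  // xyz = center, w = scale, per instance\n"
    "uniform mat4 mvp;                       // combined model-view-projection\n"
    "void main()\n"
    "{\n"
    "    vec3 world = instance.xyz + instance.w * position;\n"
    "    gl_Position = mvp * vec4(world, 1.0);\n"
    "}\n";
```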

Edited by BitMaster, 14 January 2014 - 02:11 AM.


#5 amtri   Members   -  Reputation: 175


Posted 14 January 2014 - 12:52 PM

BitMaster,

 

First of all, thanks for all the help. I have almost everything going, except for the fact that glVertexAttribDivisor crashes on x64 with GLEW version 1.10.0. Here's a quick rundown of what I have, if you don't mind checking this for me:

 

1) In the vertex shader:

 

...

in vec3 center;

...

... center.x ...

... center.y ...

... center.z ...

 

2) In the C code:

 

GLint centerloc = glGetAttribLocation (program,"center"); /* resulting centerloc = 1 */

 

glEnableVertexAttribArray (centerloc);
glVertexAttribPointer (centerloc,3,GL_FLOAT,GL_FALSE,0,x);
glVertexAttribDivisor(centerloc,1); /*** crashes here with a 0x00000 pointer! ***/

 

I haven't even tried to draw anything because the program crashes during the setup.

 

Any thoughts?

 

Thanks.



#6 BitMaster   Crossbones+   -  Reputation: 4088


Posted 14 January 2014 - 01:15 PM

What kind of OpenGL context are you targeting? glVertexAttribDivisor is core in 3.3. Have you tried updating the graphics card driver? Have you checked what your hardware supports? Have you tried setting 'glewExperimental = GL_TRUE;' before calling glewInit()?



#7 amtri   Members   -  Reputation: 175


Posted 14 January 2014 - 02:05 PM

Hmm... I'm not sure I'm in a position to answer all your questions, but here's my best shot:

 

1) OpenGL context: I call

 

ctx = wglCreateContext (GetDC((HWND)window));

 

2) My graphics driver is the latest, and the documentation states that it has support for OpenGL higher than 4.

 

3) I set glewExperimental = GL_TRUE;

 

This did make a difference: I no longer get a crash in glVertexAttribDivisor, but I now get a crash in

 

glDrawArraysInstanced (GL_TRIANGLES,0,36,npts);

 

where npts is an integer, the number of cubes I want to draw.

 

If I comment out the glVertexAttribDivisor call, and replace glDrawArraysInstanced with glDrawArrays(GL_TRIANGLES,0,36), then I do get my cube drawn. With glDrawArraysInstanced I get a crash even if npts=1.



#8 BitMaster   Crossbones+   -  Reputation: 4088


Posted 14 January 2014 - 02:12 PM

In this case there is probably something going wrong with
glVertexAttribPointer (centerloc,3,GL_FLOAT,GL_FALSE,0,x);
 
If you moved your cubes into a VBO, remember that you need to unbind the vertex buffer (bind 0 to GL_ARRAY_BUFFER) before calling glVertexAttribPointer for the center location, otherwise x is interpreted as a byte offset into the VBO (which will in almost all scenarios just blow up). You also need to make sure that the memory pointed to by x is not freed until the glDraw* call has happened.
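To illustrate what the last argument of glVertexAttribPointer means while a VBO is bound, here is a small C sketch that computes the stride and byte offsets for an interleaved position/normal layout. The `CubeVertex` struct is a hypothetical layout, not the one used in the thread, and the sketch assumes the compiler adds no padding between the two float arrays (which holds on common platforms).

```c
#include <stddef.h>

/* Hypothetical interleaved layout: position followed by normal. */
typedef struct {
    float position[3];
    float normal[3];
} CubeVertex;

/* Distance in bytes between consecutive vertices (the stride argument). */
static size_t vertex_stride(void)
{
    return sizeof(CubeVertex);
}

/* Byte offset of the position within each vertex. */
static size_t position_offset(void)
{
    return offsetof(CubeVertex, position);
}

/* Byte offset of the normal within each vertex. With a buffer bound to
   GL_ARRAY_BUFFER, this value cast to (const void *) is what the pointer
   argument stands for; with no buffer bound, the same argument is read
   as a real client-memory address instead. */
static size_t normal_offset(void)
{
    return offsetof(CubeVertex, normal);
}
```

Passing a real client-side pointer such as `x` while a buffer is still bound therefore makes the driver treat the address as a very large offset, which matches the crash described above.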
 
There could be a lot of other minor issues but these come to mind first.

Edit: Just for completeness, you did check that glDrawArraysInstanced is not nullptr? It's core since 3.1 so that really should not be the case but it still should be checked.
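A guard of the kind suggested here can be sketched in plain C. With GLEW, GL entry points are function pointers filled in by glewInit(), so comparing one against NULL before the first call is possible. The typedef and the stub below are simplified stand-ins invented for this example so that it compiles without the GL headers; the real types come from the loader's header.

```c
#include <stddef.h>

/* Simplified stand-in for a loader-provided entry point such as
   glDrawArraysInstanced (the real typedef lives in the GL headers). */
typedef void (*DrawInstancedFn)(unsigned mode, int first, int count, int instances);

/* Returns 1 if the call was made, 0 if the entry point was missing. */
static int draw_if_available(DrawInstancedFn fn, unsigned mode,
                             int first, int count, int instances)
{
    if (fn == NULL) {
        return 0;   /* caller should fall back to a non-instanced path */
    }
    fn(mode, first, count, instances);
    return 1;
}

/* Stub used only to exercise the sketch without a GL context. */
static int g_stub_calls = 0;

static void stub_draw(unsigned mode, int first, int count, int instances)
{
    (void)mode; (void)first; (void)count; (void)instances;
    ++g_stub_calls;
}
```

A non-NULL pointer shows that the loader found the symbol; it does not by itself prove that the driver implements the feature correctly.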

Edited by BitMaster, 14 January 2014 - 02:22 PM.


#9 L. Spiro   Crossbones+   -  Reputation: 13595


Posted 14 January 2014 - 08:58 PM

I narrowed the problem down to the call to glUniform1fv. If I call it just once, rather than once per cube, I get my performance back. Of course, the cubes are not in the right location and are not of the right size, but at least I know the culprit.

Instanced rendering is always the solution to this type of rendering so I am not going to derail the current direction of the thread, but just so you know, you did not necessarily find the culprit.

What you have found is that either ::glUniform1fv() is a problem, fill-rate is a problem, or something you are doing on the CPU to create the matrices for each instance is a problem.

 

If you aren’t rebuilding the matrix data you can eliminate that possibility, but otherwise, with all cubes drawn in the same place, you have 99,999 cubes out of 100,000 being early-Z-culled (assuming depth compare is GL_LESS as it should be). That means you could have a fill-rate/pixel-shader problem.

 

 

L. Spiro



#10 amtri   Members   -  Reputation: 175


Posted 15 January 2014 - 04:28 PM

Well, I narrowed this down: a lot of the performance problem came from the fact that I was NOT setting my cube coordinates in a vbo. I did that now and the performance is very, very different - whether I use an attribute array or not.

 

Now the problem I'm running into is that the center of the cubes - stored in an array as a vertex attribute - is not moving forward with the InstanceID. Result: all cubes are being drawn in the same location.

 

This brings up the question: can I have some vertex data in a vbo, and some in a pointer array? Or do I HAVE to put everything in the vbo if I'm going to use a vbo for anything? Although my cube is all a single color, I hacked the code to have one color per vertex, set the 36 colors in an array, then called

 

glColorPointer (3,GL_FLOAT,0,chex);
glEnableClientState (GL_COLOR_ARRAY);

 

But the cube still came out in a single color, ignoring all the colors in the "chex" array. Also, the cube coordinates and normals are interleaved in the vbo - i.e., x1,c1,x2,c2, etc.

 

Can anybody shed any light on this combination of ColorPointer outside of a vbo with both normals and coordinates in a vbo?

 

Thanks.



#11 BitMaster   Crossbones+   -  Reputation: 4088


Posted 16 January 2014 - 02:10 AM

First, your choice of words suggests a potential misunderstanding ('put everything in the vbo'). All vertex attributes can come from one VBO. All vertex attributes can come from different VBOs. Some vertex attributes can share a VBO. The only point where the state of GL_ARRAY_BUFFER matters is when you call glVertexAttrib*Pointer. To be clear: the state of GL_ARRAY_BUFFER is completely irrelevant during the actual call to glDraw*.
There would be no problem to have the unit cube in one VBO and the instance data (position, scale, color) in different VBOs. Considering all these attributes have different usage patterns (from static over dynamic to streaming) this would make a lot of sense too.

I can't really help you with glColorPointer. In the OpenGL versions I work with on a regular basis, this function is completely deprecated. For me vertex color (whether per-vertex or per-instance) is just another vec3/vec4 vertex attribute. I have my doubts, though, that you can combine glVertexAttribDivisor and glColorPointer.

Some time in the past I did something very close to what you are doing now (instanced unit cubes with position and color as per-instance attributes). It's definitely doable. Maybe you should post more code so people can have a look over it.

Edited by BitMaster, 16 January 2014 - 02:42 AM.


#12 amtri   Members   -  Reputation: 175


Posted 16 January 2014 - 12:42 PM

Spiro and BitMaster,

 

First of all, thanks again.

 

I solved all problems now. The performance difference was an issue of putting all data in a vbo or not. Having the data in arrays in the client side made everything very slow. Now I have attribute arrays, I'm using the Divisor methods, and all my data is first stored in vbos. The performance now is great!

 

Thanks!



#13 amtri   Members   -  Reputation: 175


Posted 17 January 2014 - 05:35 PM

As I mentioned above, I got everything working... almost.

 

Many of the problems I was having before came from the fact that - although GL_VERSION_3_3 was defined - some functionality on one of our computers was not really there.

 

Which brings up the question: although the header file for GLEW does have all the functions I need, some of these functions are not supported on some platforms. In the past, I always checked for GL_VERSION... to make sure everything would work properly. But now I see this is no guarantee.

 

Does anybody know how I can know for certain whether a function is available - both in Linux and in Windows? If GL_VERSION_... is defined, what else do I need to do to make sure the code won't crash?
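One way to approach this question, offered as a sketch and not as an answer from the thread: GL_VERSION_3_3 in the header is a compile-time macro, so it only says that the header knows about OpenGL 3.3. What the driver actually provides has to be queried at run time, for example by reading the string returned by glGetString(GL_VERSION) on the current context, which on desktop OpenGL starts with "major.minor". The helper names below are hypothetical, and the functions take the string as a parameter so the sketch compiles without any GL headers.

```c
#include <stdio.h>

/* Parse the leading "major.minor" of a desktop GL_VERSION string.
   In a real program the string would come from
   (const char *)glGetString(GL_VERSION) with a context current.
   Returns 1 on success, 0 if the string is missing or malformed. */
static int parse_gl_version(const char *version, int *major, int *minor)
{
    if (version == NULL) {
        return 0;   /* glGetString returns NULL on error, e.g. without a context */
    }
    return sscanf(version, "%d.%d", major, minor) == 2;
}

/* True if the reported version is at least want_major.want_minor. */
static int gl_version_at_least(const char *version, int want_major, int want_minor)
{
    int major = 0;
    int minor = 0;
    if (!parse_gl_version(version, &major, &minor)) {
        return 0;
    }
    return major > want_major || (major == want_major && minor >= want_minor);
}
```

The same approach works on Linux and Windows because it depends only on the context, not on the windowing API. GLEW also exposes run-time checks of its own, such as the GLEW_VERSION_3_3 variable and glewIsSupported("GL_VERSION_3_3"), which are valid after glewInit(). Combining a version check with a NULL test on each function pointer before first use is a reasonable belt-and-braces strategy.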

 

Thanks.





