RegularKid

glPushMatrix performance

Hi! I have a bunch of 2D objects to draw, say a bunch of circles and a bunch of rectangles. Currently each shape is its own class. I blow through the list of all my shapes and call "draw()" on each one. The draw does something like this:
glPushMatrix();
glTranslatef( x, y, 0.0f );
glScalef( xScale, yScale, 1.0f );
glRotatef( rot, 0.0f, 0.0f, 1.0f );

// Draw all the verts in my shape using local coordinates about the origin
// Draw all my child shapes too!

glPopMatrix();

The reason for the push / pop transformations is that all of my shapes live under a parent that has its own translation / scale / rotation, which lives under yet another parent with its own transformation info ( just a tree of parent -> child relationships ). So, this was the easiest way to get things to render properly. I'm wondering though if it might be more efficient to get rid of the pushing and popping for the transformation and simply calculate each shape's vertex positions based on the parent's transformation ( and of course its parent's transformation ). So, basically doing the math for each vertex rather than using push / pop and specifying my vertices using local coordinates. Which way would give better performance? Thanks!

Yes, the push, pop, translate, scale, and rotate commands are performance killers. Calculate and keep track of the matrices in your program. Call glLoadMatrixf() to change the modelview matrix just prior to rendering your shape.

If the child shape is static relative to the parent, you should set the child's vertex positions relative to the parent as you suggested. In this case, you don't even have to call glLoadMatrixf() for the child. If everything in your tree were static relative to each other, you would only need one glLoadMatrixf() call at the root.

If a child shape is moving relative to the parent, you would likely be better off calling glLoadMatrixf() rather than recomputing the vertices of the child.

While I tell you all this, always do performance testing. That is the only way you can be sure which way is better. There are so many CPU/GPU trade-offs that what is good in one OpenGL program might be bad in another.

Quote:
Original post by RegularKid
I'm wondering though if it might be more efficient to get rid of the pushing and popping for the transformation and simply calculate each shape's vertex positions based on the parent's transformation ( and of course its parent's transformation ). So, basically doing the math for each vertex rather than using push / pop and specifying my vertices using local coordinates.
It will never be faster to recalculate vertices on the CPU than the GPU. Remember, the GPU has been specially developed to perform vertex transformation, and it is incredibly fast at it.

Quote:
Original post by jsderon
Yes, the push, pop, translate, scale, and rotate commands are performance killers.
Huh? While they may not be the most incredibly thought out functions in the world, and they are soon to be deprecated, they are highly unlikely to be a deciding factor in performance. Push and Pop effectively do a single memcpy each, while translate, rotate and scale each build a small matrix (rotate needs a couple of trig functions) and perform a single matrix multiplication. None of that is going to be a problem, unless you are calling them many thousands of times per frame.
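
To make the cost argument concrete, here is a toy model of a matrix stack on the CPU. It is not how any particular driver implements glPushMatrix / glPopMatrix; the MatrixStack type, the stack_* names and the depth of 32 are all assumptions made for this sketch.

```c
#include <assert.h>
#include <string.h>

#define STACK_DEPTH 32  /* arbitrary size chosen for this sketch */

typedef struct {
    float m[STACK_DEPTH][16];  /* column-major 4x4 matrices */
    int top;                   /* index of the current matrix */
} MatrixStack;

void stack_init(MatrixStack *s)
{
    static const float identity[16] = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
    s->top = 0;
    memcpy(s->m[0], identity, sizeof identity);
}

float *stack_top(MatrixStack *s) { return s->m[s->top]; }

/* Like glPushMatrix(): duplicate the current matrix with one memcpy. */
void stack_push(MatrixStack *s)
{
    assert(s->top + 1 < STACK_DEPTH);
    memcpy(s->m[s->top + 1], s->m[s->top], sizeof s->m[0]);
    ++s->top;
}

/* Like glPopMatrix(): the saved matrix is still in place, so just step back. */
void stack_pop(MatrixStack *s)
{
    assert(s->top > 0);
    --s->top;
}

/* Like glTranslatef(): current = current * T, which only touches column 3. */
void stack_translate(MatrixStack *s, float x, float y, float z)
{
    float *m = stack_top(s);
    for (int i = 0; i < 4; ++i)
        m[12 + i] += x * m[i] + y * m[4 + i] + z * m[8 + i];
}
```

Each call is a few dozen floating-point operations at most, which is why the per-call cost only starts to matter when the call count per frame gets very large.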

A quick tip on optimisation: unless you can prove that a given function is a performance bottleneck (i.e. with a profiler), it isn't worth worrying about.

Thanks, swiftcoder...

That's interesting that there are two differing views on the transformation performance. What I think I'm going to do is this:

1. Any shape that is static, pre-calculate its vert positions at start up and store them all in a large array that I can use for glDrawArrays. That way I'm not re-doing the math ( either myself or through the push / pop logic ) each frame and can just make a single call each frame to render them.

2. Any shape that is dynamic, I'll run some profiles on both methods ( recalculating vert positions myself and using push / pop ) and see which one works better for my particular situation.

Thanks again for the help, guys!

I agree with swiftcoder. Those matrices just represent linear transformations, so the math is simply a few numbers being multiplied and added together. Rotate would throw in a call to cos and sin, but that isn't going to make much of a difference to performance. Especially if these things are moving, I really doubt you can make it any faster without using these functions.

true, but...

given that updating all the vertices individually on the CPU then allows you to draw the whole lot in one draw call instead of having the GPU wait around between each bit of vertex data arriving, it's possible that it'll be faster, depending on the number of vertices and the frequency with which the transforms you apply to them change. If they don't change every frame, why do the math every frame?

Quote:
Original post by RegularKid
1. Any shape that is static, pre-calculate its vert positions at start up and store them all in a large array that I can use for glDrawArrays. That way I'm not re-doing the math ( either myself or through the push / pop logic ) each frame and can just make a single call each frame to render them.
Fair enough, you are reducing both the number of draw calls and the number of state changes, so you should benefit from this change.

Quote:
2. Any shape that is dynamic, I'll run some profiles on both methods ( recalculating vert positions myself and using push / pop ) and see which one works better for my particular situation.
You can do this, but I can already tell you what the results will be:

Let's say you have a model with 5,000 vertices (not a particularly large model). Now, using your method I must recalculate the position of 5,000 vertices (and their normals, if you need lighting). Otherwise, I have a single call to each of push, translate, rotate, scale and pop. That is 5 operations vs. 5,000 operations - 5 clearly wins. The GPU has to transform the vertices whether or not you transform them (since it still has to apply the view and projection transformations, etc.), so in terms of big-Oh notation, my method is O(1), and yours is O(n), where n is the number of vertices (constant time vs linear time).

Quote:
Original post by mrbastard
given that updating all the vertices individually on the CPU then allows you to draw the whole lot in one draw call instead of having the GPU wait around between each bit of vertex data arriving
If you are worried about performance, then you are already using VBOs (Vertex Buffer Objects), right? And with a VBO, the GPU gets the data all in one chunk.

Quote:
If they don't change every frame, why do the math every frame?
The GPU still does the math every frame, regardless. If you pre-transform vertices, then they get multiplied by an identity model matrix, but it still costs the same amount as any other vertex*matrix multiplication (don't forget that the view and perspective transformations must still be applied).

Edit: I realise that you might be suggesting baking the entire scene, and rendering it in a single draw call. Unfortunately, even on a crap intel integrated GPU, I can push 500,000 triangles per-frame at 60fps, and there is no way that a CPU could transform that many, that fast. GPUs were developed for a very good reason.

Quote:
Original post by swiftcoder
Quote:
Original post by RegularKid
I'm wondering though if it might be more efficient to get rid of the pushing and popping for the transformation and simply calculate each shape's vertex positions based on the parent's transformation ( and of course its parent's transformation ). So, basically doing the math for each vertex rather than using push / pop and specifying my vertices using local coordinates.
It will never be faster to recalculate vertices on the CPU than the GPU. Remember, the GPU has been specially developed to perform vertex transformation, and it is incredibly fast at it.

Quote:
Original post by jsderon
Yes, the push, pop, translate, scale, and rotate commands are performance killers.
Huh? While they may not be the most incredibly thought out functions in the world, and they are soon to be deprecated, they are highly unlikely to be a deciding factor in performance. Push and Pop effectively do a single memcpy each, while translate, rotate and scale each build a small matrix (rotate needs a couple of trig functions) and perform a single matrix multiplication. None of that is going to be a problem, unless you are calling them many thousands of times per frame.

A quick tip on optimisation: unless you can prove that a given function is a performance bottleneck (i.e. with a profiler), it isn't worth worrying about.


While one might think that those commands are not performance killers, I have had experience with this. We have code that draws billboards. (I won't go into the history of the hows and whys of the implementation.) A few years ago, we changed the code from calling the various OpenGL matrix commands to computing the vertices on the CPU. Since these vertices are dynamic, they are computed each frame. The frame rate more than doubled in a test using a GeForce 6.

We have had a similar experience with our COLLADA model renderer. In this case, we did not modify the vertex buffers. We merely replaced the OpenGL pushes, pops, translates and rotates: we kept track of the matrix on the CPU and only called the OpenGL set matrix function when necessary. Again, performance improved measurably. This test was using a GeForce 8.
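
The thread doesn't show the renderer's code, so the following is only one possible reading of "only called the OpenGL set matrix function when necessary": remember the last matrix that was loaded and skip the call when nothing changed. MatrixCache, LoadMatrixFn and matrix_cache_set are invented names for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Callback that actually hands the matrix to the API,
 * e.g. a small wrapper around glLoadMatrixf(). */
typedef void (*LoadMatrixFn)(const float *m);

typedef struct {
    float last[16];     /* last matrix that was loaded */
    int valid;          /* zero until the first load */
    LoadMatrixFn load;  /* may be NULL, in which case nothing is uploaded */
} MatrixCache;

/* Loads m only if it differs bit-for-bit from the last one.
 * Returns 1 if a load happened, 0 if it was skipped. */
int matrix_cache_set(MatrixCache *c, const float m[16])
{
    assert(c != NULL && m != NULL);
    if (c->valid && memcmp(c->last, m, sizeof c->last) == 0)
        return 0;
    memcpy(c->last, m, sizeof c->last);
    c->valid = 1;
    if (c->load != NULL)
        c->load(m);
    return 1;
}
```

The bitwise compare is deliberately crude: matrices that differ only by 0.0 vs -0.0 get reloaded, which costs an extra call but never draws with a stale matrix.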

At the end of my previous post, I mentioned to always performance test. It is the only way you can be sure for your particular use case. I've seen over and over again that what I thought would be the result wasn't - there was always another complication. RegularKid might have a different result than my experience.

Quote:
Original post by swiftcoder
And with a VBO, the GPU gets the data all in one chunk.

With a single VBO, sure. But the OP's code suggests he's either not using VBOs or is using more than one VBO. Otherwise there'd be no need for the push/pop. TBH I was assuming a glBegin/End pair, which may have been an incorrect assumption.

The point I was making is that he may be increasing the number of draw calls required because of the granularity at which he has to have his data in order to do gl state changes for every chunk of geometry. As an example, it's possible to do basic skeletal animation using the gl matrix stack. But nobody does, because it would mean many many small batches.

Quote:
The GPU still does the math every frame, regardless. If you pre-transform vertices, then they get multiplied by an identity model matrix, but it still costs the same amount as any other vertex*matrix multiplication (don't forget that the view and perspective transformations must still be applied).

True, but I was talking about the matrix concatenation done by the driver on the calls to glTranslatef, glRotatef, glScalef. I'm fairly sure these are done on the CPU and uploaded as a uniform. Granted, I didn't say that.

Quote:
Edit: I realise that you might be suggesting baking the entire scene, and rendering it in a single draw call. Unfortunately, even on a crap intel integrated GPU, I can push 500,000 triangles per-frame at 60fps, and there is no way that a CPU could transform that many, that fast. GPUs were developed for a very good reason.


True, but that's a straw-man. The OP is working in 2d and doesn't need to update all his objects every frame. He also doesn't need to do the projection stuff on the CPU. So a more realistic possibility is 2 adds, 2 mults, a sin call and a cos call. That's still a lot to do per-vertex, granted. But it's far less than what you're comparing it to.

I'm not saying manipulating the matrix stack is inherently worse than updating each vertex on the CPU. I'm pointing out that it's possible it'll be slower.

In fact (returning to the skeletal animation example) it's very likely that doing some of the work on the CPU (say 10 times a second instead of 60), and then uploading the results for use by the vert shader is the best solution. Without knowing more about the OP's dataset, and how often it changes, it's impossible to know.

Quote:
Original post by jsderon
While one might think that those commands are not performance killers, I have had experience with this. We have code that draws billboards. (I won't go into the history of the hows and whys of the implementation.) A few years ago, we changed the code from calling the various OpenGL matrix commands to computing the vertices on the CPU. Since these vertices are dynamic, they are computed each frame. The frame rate more than doubled in a test using a GeForce 6.
So you were calling the matrix functions for each billboard (i.e. every 4 vertices)? Of course it was faster to generate the vertices on the CPU, you are in an extreme case.

As I mentioned in my first post, if you are calling these functions many thousand times per frame, then they will be a problem. However, this is also a likely sign that you should be offloading this operation to the GPU instead - in your case, the billboard vertices can be calculated on the GPU relatively inexpensively.

Quote:
Original post by mrbastard
In fact (returning to the skeletal animation example) it's very likely that doing some of the work on the CPU (say 10 times a second instead of 60), and then uploading the results for use by the vert shader is the best solution. Without knowing more about the OP's dataset, and how often it changes, it's impossible to know.
Fair enough, but for most of these examples (and certainly for skeletal animation), you could gain much more performance by offloading it to the GPU.

As you say, without knowing the OP's dataset, it is hard to say, but one can say in general that it is a bad idea to replace OpenGL matrix manipulations with manual vertex transformation on the CPU. Certain instances can be used to argue against this, but they don't hold in the typical use-case.

I am not disagreeing with either one of you, but I find it a little odd that when faced with a general question, (with no specifics), everyone chooses to answer with information that only applies in very limited cases...

Quote:
Original post by swiftcoder
Fair enough, but for most of these examples (and certainly for skeletal animation), you could gain much more performance by offloading it to the GPU.
True, but not if the method of 'offloading to the GPU' is manipulating the matrix stack for every few vertices.

Quote:

I am not disagreeing with either one of you, but I find it a little odd that when faced with a general question, (with no specifics), everyone chooses to answer with information that only applies in very limited cases...

TBH my original reply was in response to yours, intended as "yes all that's true, but also bear in mind that..." [smile]

Wow! Thanks for all the information, guys! Definitely gives me lots of ways of looking at how I'm rendering my stuff...

By the way:

No, I'm not using VBOs yet ( although I have started looking into them ... just haven't implemented anything yet ). Originally I was doing immediate mode which of course is not good...but am now using vertex arrays with glDrawArrays ( however I'm sure I'm not gaining too much as I'm doing this for each circle / rectangle I draw ). I bet I'll get a big boost once I stick ALL the static geometry into a big vertex list and simply call glDrawArrays on that one each frame.

Again, thanks for all the insight!

Quote:
Original post by mrbastard
True, but I was talking about the matrix concatenation done by the driver on the calls to glTranslatef, glRotatef, glScalef. I'm fairly sure these are done on the CPU and uploaded as a uniform. Granted, I didn't say that.


That's correct. You can also add glLoadIdentity, glPushMatrix, glPopMatrix and a bunch of others to the list. All these functions are deprecated.
The only supported function is glLoadMatrixf but even this is now deprecated since everything must be done with shaders.
