VAO, VBO speed

Started by
6 comments, last by Kaptein 11 years, 6 months ago
Hi,

I'm currently having some speed problems when drawing with VAO and VBOs. This is my drawing method for every shape:

[source lang="cpp"]void Shape::draw() {
double posX = drawingState_->getPositionX();
double posY = drawingState_->getPositionY();
double posZ = drawingState_->getPositionZ();
double rotX = drawingState_->getRotationX();
double rotY = drawingState_->getRotationY();
double rotZ = drawingState_->getRotationZ();

glm::mat4 model_matrix = glm::translate(glm::mat4(1.0), glm::vec3(posX, posY, posZ));
model_matrix = glm::rotate(model_matrix, (float)rotX, glm::vec3(1.0,0.0,0.0));
model_matrix = glm::rotate(model_matrix, (float)rotY, glm::vec3(0.0,1.0,0.0));
model_matrix = glm::rotate(model_matrix, (float)rotZ, glm::vec3(0.0,0.0,1.0));
glm::mat4 tempMVP = Camera::projectionViewMatrix() * model_matrix;

shader_.bind();
glBindVertexArray(vaoID_);

// glGetUniformLocation returns a GLint (-1 if the uniform is not found).
GLint MVP_ID = glGetUniformLocation(shader_.getProgramHandle(), "MVP_matrix");
glUniformMatrix4fv(MVP_ID, 1, GL_FALSE, glm::value_ptr(tempMVP));

//glDrawArrays(GL_QUADS, 0, 24);
//glBindBuffer(GL_ARRAY_BUFFER, vertices_vbo_);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebo_);
glDrawElements(GL_TRIANGLE_FAN, nIndices_, GL_UNSIGNED_SHORT, (void*)0);
glBindVertexArray(0); // Unbind our Vertex Array Object
shader_.unbind();
}[/source]

Every shape is just a normal circle with 21 vertices. If I remove the drawing method above and only update the behavior of the shapes, I get 4000 shapes at 60 fps (the max fps). But if I both update and draw with the method above, I get 20 fps. Is drawing with a VAO supposed to be that slow?

I'm stuck. Any help appreciated.

/F
1. 4000 x (bindvertexarray(0) + gluseprogram(0)) = un-needed
remove them at the end of the function

2. you are creating matrices left and right
a translation operation is 3-4 multiplications or LESS depending on circumstances

ie. translateXZ(lastv.x - v.x, lastv.z - v.z); lastv = v;

3. why are you creating 6 doubles at the start of the function?
they're kinda slow compared to regular float, and unless you can justify their use, replace them with float for the entire program
also, remove them from the function, and use the position values directly
either in the form of vectors, or matrices

here's the thing: the CPU and GPU like to use their caches
I've been told that the cache is 800 times faster than reading from RAM; in that respect, creating temp variables (the doubles) is fast once they've been created
you are however creating 2 matrices for 4000 objects, that's SLOW
and... on top of that, you are rendering ___4000___ objects, without justification!

the gpu likes to render lots of stuff in _one_ call :) so if there's any way you can remodel a little, and make rendering more composite
ie. gather some objects near each other and make them into 1 draw call, so that you perhaps go from 4000 to 1000, or even better 400 calls
there's lots of options here!
somewhat regretfully, this is the state of the modern day rasterizer
it's powerful in parallel, but each core is a little slow, and it's very, very bandwidth limited (talk less, render more!)
by talking i mean sending commands to the GPU, or uploading data (if you upload lots of data in one go, that's usually very fast :))
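
A minimal sketch of the "many objects, one draw call" idea, under some assumptions: the circles are baked into shared CPU-side arrays in world space, and `Vec2`, `Batch` and `appendCircle` are hypothetical names, not part of the original code. Triangle fans cannot simply be concatenated, so each fan is expanded into plain triangles. Note also that 4000 circles with 21 vertices each is 84000 vertices, which no longer fits 16-bit indices, so the sketch uses 32-bit ones.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };

// Hypothetical CPU-side batch: all circles share one vertex array and one index array.
struct Batch {
    std::vector<Vec2> vertices;
    std::vector<std::uint32_t> indices;  // 32-bit: 4000 * 21 vertices exceeds the 16-bit range
};

// Appends one circle as plain triangles (center + `segments` rim points).
void appendCircle(Batch& batch, Vec2 center, float radius, int segments) {
    assert(segments >= 3);
    const auto base = static_cast<std::uint32_t>(batch.vertices.size());
    batch.vertices.push_back(center);

    const float twoPi = 6.28318530718f;
    for (int i = 0; i < segments; ++i) {
        const float a = twoPi * static_cast<float>(i) / static_cast<float>(segments);
        batch.vertices.push_back({center.x + radius * std::cos(a),
                                  center.y + radius * std::sin(a)});
    }

    // One triangle per rim segment: (center, rim point, next rim point).
    for (int i = 0; i < segments; ++i) {
        const std::uint32_t rim = base + 1 + static_cast<std::uint32_t>(i);
        const std::uint32_t next = base + 1 + static_cast<std::uint32_t>((i + 1) % segments);
        batch.indices.push_back(base);
        batch.indices.push_back(rim);
        batch.indices.push_back(next);
    }
}
```

The resulting arrays would be uploaded once and drawn with a single `glDrawElements(GL_TRIANGLES, ..., GL_UNSIGNED_INT, ...)` call using one view-projection matrix. The trade-off is that positions live in the vertex data, so shapes that move every frame need their part of the buffer re-uploaded.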
...also, the uniform location for MVP_matrix should be gotten once only (sometime shortly after the program object is created), cached, and reused. There's no need to call glGetUniformLocation every time, and this is potentially a slow call.
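
A rough sketch of the caching idea, assuming a hypothetical `UniformCache` helper. The `lookup` callback stands in for `glGetUniformLocation(program, name)` so the example runs without a GL context; in real code it would wrap that call for one program object.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: remembers uniform locations so the (potentially slow)
// driver query runs once per name instead of once per draw call.
class UniformCache {
public:
    // `lookup` stands in for glGetUniformLocation(program, name.c_str());
    // like the GL call, it should return -1 for unknown names.
    explicit UniformCache(std::function<int(const std::string&)> lookup)
        : lookup_(std::move(lookup)) {
        assert(lookup_ && "a lookup function is required");
    }

    int location(const std::string& name) {
        auto it = cache_.find(name);
        if (it != cache_.end()) return it->second;  // cached: no driver call
        const int loc = lookup_(name);              // first request: ask once
        cache_.emplace(name, loc);
        return loc;
    }

private:
    std::function<int(const std::string&)> lookup_;
    std::unordered_map<std::string, int> cache_;
};
```

With one such cache per shader program, `draw()` would ask the cache for `"MVP_matrix"` instead of querying the driver 4000 times per frame. Storing the location in a member right after linking works just as well when there are only a few uniforms.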

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

GL_ELEMENT_ARRAY_BUFFER binding point is part of the vertex array state, so that line can be removed as long as the index buffer has been bound before. Also, what Kaptein said about the "unbinding" at the end. Maybe also try caching and checking if the shader you're about to bind is already in place.

New C/C++ Build Tool 'Stir' (doesn't just generate Makefiles, it does the build): https://github.com/space222/stir

Everything said is correct. Binding the shader program and VAO should be outside of the loop. Binding the element array buffer is meaningless, since it is part of the VAO state (as beans222 said).
BUT, a performance boost is not assured after all these changes, since drivers optimize most of the operations. I could firmly claim that NV drivers will not update uniforms if the values have not changed since the previous call. So dirty flags and optimized updates are nice but do not necessarily lead to higher performance. On the other hand, the number of driver calls certainly can be a bottleneck.

1. 4000 x (bindvertexarray(0) + gluseprogram(0)) = un-needed
remove them at the end of the function


Why is this un-needed? I thought you were supposed to do this to clean things up for the next object, that might not have the same setup and shaders?


3. why are you creating 6 doubles at the start of the function?
they're kinda slow compared to regular float, and unless you can justify their use, replace them with float for the entire program
also, remove them from the function, and use the position values directly
either in the form of vectors, or matrices

here's the thing: the CPU and GPU like to use their caches
I've been told that the cache is 800 times faster than reading from RAM; in that respect, creating temp variables (the doubles) is fast once they've been created
you are however creating 2 matrices for 4000 objects, that's SLOW
and... on top of that, you are rendering ___4000___ objects, without justification!


Yeah, I agree about the floats, and I should probably not update the MVP unless something has changed. However, I tried disabling all code up to the shader bind, and that did not increase performance. Still 20 fps, so the bottleneck is not in the matrix math. =/ 4000 objects might seem like a lot without justification... yes. What I am trying to do is convert my engine to OpenGL 3.2. Before this change the fps was 60, with drawing and updating. The drawing was not a bottleneck at all.


the gpu likes to render lots of stuff in _one_ call :) so if there's any way you can remodel a little, and make rendering more composite
ie. gather some objects near each other and make them into 1 draw call, so that you perhaps go from 4000 to 1000, or even better 400 calls
there's lots of options here!
somewhat regretfully, this is the state of the modern day rasterizer
it's powerful in parallel, but each core is a little slow, and it's very, very bandwidth limited (talk less, render more!)
by talking i mean sending commands to the GPU, or uploading data (if you upload lots of data in one go, that's usually very fast :))


One thing that might be an optimization is to draw everything in one grid cell in one draw. If lucky, the objects are scattered over multiple cells. Thanks for that.

Everything said is correct. Binding the shader program and VAO should be outside of the loop. Binding the element array buffer is meaningless, since it is part of the VAO state (as beans222 said).
BUT, a performance boost is not assured after all these changes, since drivers optimize most of the operations. I could firmly claim that NV drivers will not update uniforms if the values have not changed since the previous call. So dirty flags and optimized updates are nice but do not necessarily lead to higher performance. On the other hand, the number of driver calls certainly can be a bottleneck.


Uhm... I think I have misunderstood how things work. If you have 10 objects that require 10 different shaders... should I only bind the shader once outside the drawing loop?
Yes, the bottleneck is 4000 render calls, unless you have some numbers that disprove that theory :)
usually, if you can, sort by shaders
To be brutally honest, I don't think anyone will tell you "4000 render calls with shader changes, bindings and matrix operations is just fine"
it's not, it's not even "not fine" :) it's captain ultrabad

so, clean your code up a little, figure out what goes where, and try sort by shader
then, start rendering larger objects in less calls where possible
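
One way the sort-by-shader advice could look, as a sketch only: `DrawItem`, `sortByState` and `countProgramSwitches` are hypothetical names, and the comment marks where a real renderer would call `glUseProgram`. The point is that a shader is bound only when it differs from the one already bound, so nothing has to be unbound after each object.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical draw record: which program and which VAO a shape needs.
struct DrawItem {
    unsigned program;
    unsigned vao;
};

// Counts how many program switches a given draw order would cause
// (one per run of equal program IDs).
int countProgramSwitches(const std::vector<DrawItem>& items) {
    int switches = 0;
    bool first = true;
    unsigned current = 0;
    for (const DrawItem& item : items) {
        if (first || item.program != current) {
            ++switches;  // real code would call glUseProgram(item.program) here
            current = item.program;
            first = false;
        }
    }
    return switches;
}

// Sorts by program first, then by VAO, so items with equal state end up adjacent.
void sortByState(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(), [](const DrawItem& a, const DrawItem& b) {
        if (a.program != b.program) return a.program < b.program;
        return a.vao < b.vao;
    });
    // Sanity check: program IDs are now in non-decreasing order.
    assert(std::is_sorted(items.begin(), items.end(),
                          [](const DrawItem& a, const DrawItem& b) {
                              return a.program < b.program;
                          }));
}
```

With 10 objects that each use a different shader there are still 10 binds after sorting; the gain only appears when several objects share a shader, which is the case for 4000 identical circles.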

also, if you are going for the large-amount-of-objects route, extend your matrix library to do a lot fewer multiplications where they're not needed
this isn't hard to do, and if you know how they work already it should take you 5 mins to make translateXZ/XY/ZY, since translations are the most common call
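
A possible shape for such a `translateXZ`, as a sketch rather than anyone's actual library code: `Mat4`, `at` and `multiply` are hypothetical helpers, and the matrix is assumed to be column-major like in OpenGL and glm. Post-multiplying by a translation only changes the fourth column, so the full 4x4 product can be skipped. The exact multiplication count depends on the case: 8 in this general form, fewer if the bottom row is known to be (0, 0, 0, 1).

```cpp
#include <array>
#include <cassert>

// Column-major 4x4 matrix: element (row r, column c) is stored at index c * 4 + r.
using Mat4 = std::array<float, 16>;

Mat4 identity() {
    Mat4 m{};
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    return m;
}

float& at(Mat4& m, int row, int col) {
    assert(row >= 0 && row < 4 && col >= 0 && col < 4);
    return m[col * 4 + row];
}

// Hypothetical translateXZ: same result as M * T(tx, 0, tz),
// but only the fourth column of M is touched.
void translateXZ(Mat4& m, float tx, float tz) {
    for (int r = 0; r < 4; ++r) {
        at(m, r, 3) += at(m, r, 0) * tx + at(m, r, 2) * tz;
    }
}

// Reference: full matrix product a * b, used only to check the shortcut.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 out{};
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            for (int k = 0; k < 4; ++k)
                out[c * 4 + r] += a[k * 4 + r] * b[c * 4 + k];
    return out;
}
```

Whether this matters in practice is worth measuring first; the earlier test of disabling the matrix code suggests the draw calls dominate in this thread's case.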

also, if you start profiling and find out that the bottleneck is elsewhere, you can post again with the problem area and maybe we can figure something out there as well
you never know :)

This topic is closed to new replies.
