dingojohn

Q3 BSP Rendering and Vertex Buffers


I'm currently working on a Q3 BSP renderer and I've managed to get everything drawing nicely, including the PVS tests and frustum culling. However, I'm now trying to refactor things while improving the rendering performance. The BSP format stores a huge array of vertices and, along with that, a huge array of indices. With a non-VBO approach I can plow through all surfaces and draw everything with nice performance; with VBOs my performance drops dramatically.

First I tried giving every surface its own vertex buffer, but that creates huge overhead because a lot of the surfaces in the BSP format have only two triangles. I also tried storing all vertex and index data in two large buffers (they're HUGE!), but performance is still not on par with the non-VBO approach. As rendering without VBOs is getting deprecated, I'd like to use VBOs without suffering a performance loss.

How would you go about storing a huge level in vertex buffers to get performance at least equivalent to a non-VBO approach? It should be noted that I program on OS X 10.6, on a 13'' MacBook Pro without dedicated VRAM, so I guess the large vertex buffers could be a problem because of this. I don't expect BETTER performance from using VBOs, but hopefully not worse either.

VBOs really might never give better performance than drawing everything the non-VBO way. But it's hard to tell what the problem is without seeing the VBO code. Maybe you're just using them wrong.

If you need the drawing code, here is some test code that renders all surfaces in the map, where each surface has its own small pair of buffers. RenderSurface simply holds shared pointers to the buffers; the usage mode is GL_STATIC_DRAW. It also contains count, the number of indices, which are all of type short.

std::vector<boost::shared_ptr<RenderSurface> >::const_iterator rsurf = map_surfaces.begin();

for (; rsurf != map_surfaces.end(); ++rsurf) {

    boost::shared_ptr<RenderSurface> surface = (*rsurf);
    static const int stride = sizeof(vertex);

    // byte offsets of each attribute within the interleaved vertex
    const char* vertex_offset = (char*) 0;
    const char* tex_offset    = vertex_offset + sizeof(cml::vector3f);
    // Lightmap should go here -> 2x sizeof(float)
    const char* normal_offset = tex_offset + 4 * sizeof(float);
    const char* color_offset  = normal_offset + sizeof(cml::vector3f);

    glBindBuffer(GL_ARRAY_BUFFER, surface->vertex_buffer->Id());
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, surface->index_buffer->Id());

    glVertexPointer(3, GL_FLOAT, stride, vertex_offset);
    glTexCoordPointer(2, GL_FLOAT, stride, tex_offset);
    glNormalPointer(GL_FLOAT, stride, normal_offset);
    glColorPointer(3, GL_UNSIGNED_BYTE, stride, color_offset);

    glDrawRangeElements(GL_TRIANGLES, 0, surface->count, surface->count,
                        GL_UNSIGNED_SHORT, 0);
}


In case you need to see how I generate the buffers, here is that code:

BufferObject::BufferObject(GLenum target, const GLsizei size, const void* data)
    : target(target)
{
    glGenBuffers(1, &id);

    glBindBuffer(target, Id());

    glBufferData(target, size, data, GL_STATIC_DRAW);

    // unbind after data copy
    glBindBuffer(target, 0);
}


I just pass the data from the loaded .bsp file directly to the buffer constructors. It draws correctly, so the data is correct.
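For the two-big-buffers variant the construction is roughly this (a sketch, not the exact code; the container names just mirror the other snippets in this thread):

// sketch: the raw lumps go straight into the buffer objects
boost::shared_ptr<BufferObject> vertex_buffer(
    new BufferObject(GL_ARRAY_BUFFER,
                     (GLsizei) (mapfile->vertices.size() * sizeof(vertex)),
                     &mapfile->vertices[0]));

boost::shared_ptr<BufferObject> index_buffer(
    new BufferObject(GL_ELEMENT_ARRAY_BUFFER,
                     (GLsizei) (mapfile->sindices.size() * sizeof(unsigned short)),
                     &mapfile->sindices[0]));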

You can get a performance drop if you used to run through your scene tree and render a face whenever you encountered one, and refactored this into collecting all faces and rendering them at the end. The latter has the GPU waiting for the CPU to collect the faces and then the CPU waiting for the GPU to finish rendering, which degrades performance no matter how fast VBOs are. Are you doing anything like that?

Apart from that, as snake already pointed out, it's hard to tell without seeing the code.

Quote:
How would you go about storing a huge level in vertex buffers to gain performance at least equivalent to a non-VBO approach?


Can you try the intermediate approach of storing all vertices in a huge VBO but rendering with your old code, replacing all glVertex3fv(&vertices[indices[i][j]]) calls (I'm assuming you used glBegin()/glEnd()) with glArrayElement(indices[i][j])? That would give a clue whether it is actually the VBO that's causing the slowdown.
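Roughly like this (just a sketch with made-up names; it assumes glEnableClientState(GL_VERTEX_ARRAY) is on and all positions sit in one big VBO):

// sketch: bind the one big vertex VBO and set the pointer once
glBindBuffer(GL_ARRAY_BUFFER, big_vertex_vbo);                // made-up buffer id
glVertexPointer(3, GL_FLOAT, sizeof(vertex), (const char*) 0);

// then the old immediate-mode loop, with glVertex3fv swapped for glArrayElement
glBegin(GL_TRIANGLES);
for (size_t i = 0; i < num_faces; ++i)                        // made-up counts
    for (size_t j = 0; j < num_indices_in_face[i]; ++j)
        glArrayElement(indices[i][j]);                        // pulls the vertex from the VBO
glEnd();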

Edit: Just saw your code. Am I right in assuming that one vertex in your VBO is 43 bytes? Try padding it up to the size of one cache line (64 bytes).

Quake levels don't have that many polys; the PVS is mainly for reducing overdraw. So you could fill your VBO just when you move (changing visibility sets) and in all other frames just push the same VBO data.
That's probably the fastest way on today's hardware.
Like "Ohforf sake" mentioned, CPU<->GPU syncs cost you performance; some brute-force vertex pushing is way faster.

Quote:
Original post by Ohforf sake
You can get a performance drop if you used to run through your scene tree and render a face whenever you encountered one, and refactored this into collecting all faces and rendering them at the end. The latter has the GPU waiting for the CPU to collect the faces and then the CPU waiting for the GPU to finish rendering, which degrades performance no matter how fast VBOs are. Are you doing anything like that?


I did that earlier, but my current tests do nothing like that as I'm testing the vertex buffers. I just loop through all the surfaces and render them as I go, which you can see in the code above. I'll make sure to keep this point in mind for later.


Quote:

Can you try the intermediate approach of storing all vertices in a huge VBO but rendering with your old code, replacing all glVertex3fv(&vertices[indices[i][j]]) calls (I'm assuming you used glBegin()/glEnd()) with glArrayElement(indices[i][j])? That would give a clue whether it is actually the VBO that's causing the slowdown.


That was actually what I did. I had everything in two large vectors, one for vertices and one for indices, and drew it with glDrawElements - and the current code that does just that outperforms the two huge VBOs with offsets into the buffer by a factor of 3. The code that currently works, using only the raw data chunks from the .bsp file, looks like this:

std::vector<q3bsp::surface>::const_iterator surf = mapfile->surfaces.begin();
for (; surf != mapfile->surfaces.end(); ++surf) {

    static const int stride = sizeof(vertex);

    glVertexPointer(3, GL_FLOAT, stride, &mapfile->vertices[surf->vertex].position);
    glTexCoordPointer(2, GL_FLOAT, stride, &mapfile->vertices[surf->vertex].st);
    glNormalPointer(GL_FLOAT, stride, &mapfile->vertices[surf->vertex].normal);
    glColorPointer(3, GL_UNSIGNED_BYTE, stride, &mapfile->vertices[surf->vertex].color);

    glDrawElements(GL_TRIANGLES, surf->num_mesh_vertices, GL_UNSIGNED_SHORT,
                   &mapfile->sindices[surf->mesh_verts]);
}



Quote:
Edit: Just saw your code. Am I right in assuming that one vertex in your VBO is 43 bytes? Try padding it up to the size of one cache line (64 bytes).


sizeof(vertex) yields 44, and this is the structure:

struct vertex
{
    cml::vector3f position;
    cml::vector2f st;        // texcoord
    cml::vector2f lightmap;  // lightmap texcoord
    cml::vector3f normal;
    cml::vector<unsigned char, cml::fixed<4> > color;
};



And if I add float padding[5];, sizeof(vertex) yields 64, and after altering my vertex loading to account for this, I still get the same framerate.
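For reference, the padded layout works out like this (the byte counts assume the cml types carry no internal padding):

struct vertex
{
    cml::vector3f position;                             // 12 bytes
    cml::vector2f st;                                   //  8 bytes, texcoord
    cml::vector2f lightmap;                             //  8 bytes, lightmap texcoord
    cml::vector3f normal;                               // 12 bytes
    cml::vector<unsigned char, cml::fixed<4> > color;   //  4 bytes
    float padding[5];                                   // 20 bytes -> 64 in total
};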

Quote:
Original post by Krypt0n
Quake levels don't have that many polys; the PVS is mainly for reducing overdraw. So you could fill your VBO just when you move (changing visibility sets) and in all other frames just push the same VBO data.
That's probably the fastest way on today's hardware.
Like "Ohforf sake" mentioned, CPU<->GPU syncs cost you performance; some brute-force vertex pushing is way faster.


So you'd fill a VBO every time I update my position or move the view (because of frustum culling and a changed frustum)? What I did earlier was just mark whether or not something was potentially visible, and I only updated those markings if the camera had moved, which indeed gave me a performance increase. However, that was still using glDrawArrays without VBOs. The VBOs are slowing my program down, with or without culling.

When drawing with the two large VBOs, do you also make a call to glVertexPointer etc. for every face? If so, can you change your indices so that the glVertexPointer calls can be moved/merged to a single call?

Another thing:
glDrawRangeElements(GL_TRIANGLES, 0, surface->count, surface->count, GL_UNSIGNED_SHORT, 0 );


Not sure about this one, I'll have to look that up, but shouldn't that be:
glDrawRangeElements(GL_TRIANGLES, 0, mapfile->vertices.size()-1, surface->count, GL_UNSIGNED_SHORT, 0 );

Currently, with one large vertex buffer and one large index buffer, I have to set up glVertexPointer, glNormalPointer etc. for every different object, using an offset. I have thought about making it one large draw call that renders everything, but that would require me to modify the index buffer every time I update the frame. Do you suggest I do that, keeping the index buffer read/write and modifying it and its size whenever the origin or frustum changes?

glDrawRangeElements performs the same as glDrawElements for me; I used it here for testing purposes.

If you have an old/low-spec graphics card, it may be that VBOs are not supported properly. I ran into this same problem a while back on an old laptop, and I believe I eventually discovered the driver was emulating VBOs in software. At the time someone here pointed me to this demo: http://www.songho.ca/opengl/files/vbo.zip - you should try running it, toggling VBOs on and off, and see if you get the results you expect.

EDIT: Actually I just ran that demo myself and it runs slower with VBOs on than off, even on my Radeon HD 4800. So maybe forget that idea...

Quote:
Original post by dingojohn
Quote:
Original post by Krypt0n
Quake levels don't have that many polys; the PVS is mainly for reducing overdraw. So you could fill your VBO just when you move (changing visibility sets) and in all other frames just push the same VBO data.
That's probably the fastest way on today's hardware.
Like "Ohforf sake" mentioned, CPU<->GPU syncs cost you performance; some brute-force vertex pushing is way faster.


So you'd fill a VBO every time I update my position
yes
Quote:
or move the view (because of frustum culling and a changed frustum)?

No, don't use frustum culling.
I know everybody would suggest the opposite, but it's not of much use when rendering Quake levels nowadays. An ATI 9700 (yeah, damn old) shades a million pixels per frame and probably runs a Quake level at 300fps at least; transforming those additional vertices isn't much overhead.
So you'd change the VBO just when the visibility set changes (which happens far less often than you move).


Quote:

What I did earlier was just mark whether or not something was potentially visible, and I only updated those markings if the camera had moved, which indeed gave me a performance increase. However, that was still using glDrawArrays without VBOs. The VBOs are slowing my program down, with or without culling.

Try it with some minimal vertex format.
Btw, if you update it every frame, try using GL_STREAM_DRAW or GL_DYNAMIC_DRAW.
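For example, the BufferObject constructor you posted could take the usage hint as a parameter instead of hardcoding GL_STATIC_DRAW (just a sketch):

BufferObject::BufferObject(GLenum target, const GLsizei size, const void* data, GLenum usage)
    : target(target)
{
    glGenBuffers(1, &id);
    glBindBuffer(target, id);
    glBufferData(target, size, data, usage);   // GL_STATIC_DRAW, GL_STREAM_DRAW or GL_DYNAMIC_DRAW
    glBindBuffer(target, 0);
}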

I now have much better performance with my VBOs; better performance than without. I'm storing one large index buffer and one large vertex buffer, but the index buffer is modified from the raw data of the .bsp file: I altered it so the indices now refer to the start of the whole vertex buffer instead of to the first vertex of each surface, meaning that I do not have to rebind anything or call glXXXPointer more than once. The offset into the index buffer is provided at the glDrawElements call, which I still make per surface.
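In outline it works like this (a rough sketch rather than my exact code, with names loosely based on the earlier snippets):

// one-time rebase: make every index absolute into the global vertex array,
// so the gl*Pointer calls only have to be made once
// (with GL_UNSIGNED_SHORT this only works while the map has fewer than 65536 vertices)
std::vector<unsigned short> absolute_indices(mapfile->sindices.size());
std::vector<q3bsp::surface>::const_iterator surf = mapfile->surfaces.begin();
for (; surf != mapfile->surfaces.end(); ++surf)
    for (int i = 0; i < surf->num_mesh_vertices; ++i)
        absolute_indices[surf->mesh_verts + i] =
            (unsigned short) (surf->vertex + mapfile->sindices[surf->mesh_verts + i]);

// per frame: bind and set up the pointers once ...
glBindBuffer(GL_ARRAY_BUFFER, vertex_buffer->Id());
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, index_buffer->Id());
glVertexPointer(3, GL_FLOAT, sizeof(vertex), (const char*) 0);
// ... texcoord / normal / color pointers with their offsets, as before ...

// ... then one glDrawElements per surface, with a byte offset into the index buffer
for (surf = mapfile->surfaces.begin(); surf != mapfile->surfaces.end(); ++surf)
    glDrawElements(GL_TRIANGLES, surf->num_mesh_vertices, GL_UNSIGNED_SHORT,
                   (const char*) 0 + surf->mesh_verts * sizeof(unsigned short));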

I can now look into the proposed idea of altering my index buffer whenever I move to a new cluster, knowing that the VBOs perform well enough.

Thanks to everyone who helped me arrive at this, rating++.
