Q3 BSP Rendering and Vertex Buffers

10 comments, last by dingojohn 14 years, 1 month ago
I'm currently working on a Q3 BSP renderer and I've managed to get everything drawing nicely, including the PVS tests and frustum culling. However, I'm now trying to refactor things while improving rendering performance. The BSP format stores a huge array of vertices along with a huge array of indices. With a non-VBO approach, I can quickly plow through all surfaces and draw everything with good performance; with VBOs my performance drops dramatically.

First, I tried giving every surface its own vertex buffer, but that creates a huge overhead because many of the surfaces in the BSP format have only two triangles. I also tried storing all vertex and index data in two large buffers (they're HUGE!), but performance is still not on par with the non-VBO approach. As rendering without VBOs is being deprecated, I'd like to use VBOs without suffering a performance loss. How would you go about storing a huge level in vertex buffers to gain performance at least equivalent to a non-VBO approach?

It should be noted that I program on OS X 10.6, on a 13-inch MacBook Pro without dedicated VRAM, so I guess the large vertex buffers could be a problem because of this. I should not expect BETTER performance from using VBOs, but hopefully not worse either.
VBOs might never give better performance here than drawing everything the non-VBO way. But it's hard to tell what the problem is without seeing the VBO code. Maybe you're just using them incorrectly.
If you need the drawing code, here is some test code that renders all surfaces in the map, with one small pair of buffers per surface. RenderSurface simply holds shared pointers to its buffers, and the usage mode is STATIC_DRAW. It also contains count, which is the number of indices; all indices are of type short.

std::vector<boost::shared_ptr<RenderSurface> >::const_iterator rsurf = map_surfaces.begin();
for (; rsurf != map_surfaces.end(); ++rsurf) {
    boost::shared_ptr<RenderSurface> surface = (*rsurf);
    static const int stride = sizeof(vertex);

    const char* vertex_offset = (char*) 0;
    const char* tex_offset = vertex_offset + sizeof(cml::vector3f);
    // Lightmap should go here -> 2x sizeof(float)
    const char* normal_offset = tex_offset + 4 * sizeof(float);
    const char* color_offset = normal_offset + sizeof(cml::vector3f);

    glBindBuffer(GL_ARRAY_BUFFER, surface->vertex_buffer->Id());
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, surface->index_buffer->Id());

    glVertexPointer(3, GL_FLOAT, stride, 0);
    glTexCoordPointer(2, GL_FLOAT, stride, tex_offset);
    glNormalPointer(GL_FLOAT, stride, normal_offset);
    glColorPointer(3, GL_UNSIGNED_BYTE, stride, color_offset);

    glDrawRangeElements(GL_TRIANGLES, 0, surface->count, surface->count,
                        GL_UNSIGNED_SHORT, 0);
}

In case you need to see how I generate the buffers, then here is that code:

BufferObject::BufferObject(GLenum target, const GLsizei size, const void* data)
    : target(target)
{
    glGenBuffers(1, &id);
    glBindBuffer(target, Id());
    glBufferData(target, size, data, GL_STATIC_DRAW);
    // unbind after data copy
    glBindBuffer(target, 0);
}

I just pass the data from the loaded BSP file directly to the buffer constructors. It draws correctly, so the data is correct.
You can get a performance drop if you used to run through your scene tree and render a face whenever you encountered one, and then refactored this into collecting all faces and rendering them at the end. The latter has the GPU waiting for the CPU to collect the faces and then the CPU waiting for the GPU to finish rendering, which degrades performance no matter how fast VBOs are. Are you doing anything like that?

Apart from that, as snake already pointed out, it's hard to tell without seeing the code.

Quote:How would you go about storing a huge level in vertex buffers to gain performance at least equivalent to a non-VBO approach?

Can you try the intermediate way of storing all vertices in a huge VBO but rendering with your old code, replacing all glVertex3fv(&vertices[indices[j]]) calls (I'm assuming you used glBegin()/glEnd()) with glArrayElement(indices[j])? It would give a clue as to whether it is actually the VBO that's causing the slowdown.

Edit: Just saw your code. Am I right in assuming that one vertex in your VBO is 43 bytes? Try padding it up to the size of one cache line (64 bytes).
Quake levels don't have that many polys; the PVS is mainly for reducing overdraw. So you could fill your VBO only when you move (when the visibility set changes) and in all other frames just push the same VBO data.
That's probably the fastest way on today's hardware.
As "Ohforf sake" mentioned, CPU<->GPU syncs cost you performance; some brute-force vertex pushing is way faster.
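A minimal sketch of the "refill only when the visibility set changes" idea, assuming the set is keyed on the camera's PVS cluster alone. `VisibleIndexCache` and its builder callback are hypothetical names, not part of the poster's code, and frustum culling is deliberately left out (it would either have to be part of the key or be skipped in favour of brute-force drawing):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: caches the index list built for the last camera cluster.
class VisibleIndexCache {
public:
    // Callback that collects the indices of all surfaces visible from a cluster.
    using Builder = std::function<std::vector<std::uint32_t>(int cluster)>;

    explicit VisibleIndexCache(Builder build) : build_(std::move(build)) {
        assert(build_ && "a builder callback is required");
    }

    // Returns true only when the list was rebuilt, i.e. when the GPU-side
    // buffer would need to be re-uploaded. Otherwise last frame's data is reused.
    bool update(int camera_cluster) {
        if (valid_ && camera_cluster == cached_cluster_) {
            return false;
        }
        indices_ = build_(camera_cluster);
        cached_cluster_ = camera_cluster;
        valid_ = true;
        return true;
    }

    const std::vector<std::uint32_t>& indices() const { return indices_; }

private:
    Builder build_;
    bool valid_ = false;
    int cached_cluster_ = 0;
    std::vector<std::uint32_t> indices_;
};
```

When `update` returns true, the caller would re-upload `indices()` to the index buffer; on all other frames it would simply draw with the buffer as it is.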
Quote:Original post by Ohforf sake
You can get a performance drop if you used to run through your scene tree and render a face whenever you encountered one, and then refactored this into collecting all faces and rendering them at the end. The latter has the GPU waiting for the CPU to collect the faces and then the CPU waiting for the GPU to finish rendering, which degrades performance no matter how fast VBOs are. Are you doing anything like that?

I did that earlier, but my current tests do nothing like that as I'm testing the vertex buffers. I just loop through all the surfaces and render them as I go, which you can see in the code above. I'll make sure to keep this point in mind for later.

Quote:Can you try the intermediate way of storing all vertices in a huge VBO but rendering with your old code, replacing all glVertex3fv(&vertices[indices[j]]) calls (I'm assuming you used glBegin()/glEnd()) with glArrayElement(indices[j])? It would give a clue as to whether it is actually the VBO that's causing the slowdown.

That was actually what I did. I had everything in two large vectors, one for vertices and one for indices, and drew it with glDrawElements. The current code that does just that outperforms the two huge VBOs with offsets into the buffer by a factor of 3. The code that currently works and only uses the raw data chunks from the .bsp file looks like this:

std::vector<q3bsp::surface>::const_iterator surf = mapfile->surfaces.begin();
for (; surf != mapfile->surfaces.end(); ++surf) {
    static const int stride = sizeof(vertex);

    glVertexPointer(3, GL_FLOAT, stride, &mapfile->vertices[surf->vertex].position);
    glTexCoordPointer(2, GL_FLOAT, stride, &mapfile->vertices[surf->vertex].st);
    glNormalPointer(GL_FLOAT, stride, &mapfile->vertices[surf->vertex].normal);
    glColorPointer(3, GL_UNSIGNED_BYTE, stride, &mapfile->vertices[surf->vertex].color);

    glDrawElements(GL_TRIANGLES, surf->num_mesh_vertices, GL_UNSIGNED_SHORT,
                   &mapfile->sindices[surf->mesh_verts]);
}

Quote:Edit: Just saw your code. Am I right in assuming that one vertex in your VBO is 43 bytes? Try padding it up to the size of one cache line (64 bytes).

sizeof(vertex) yields 44, and this is the structure:

struct vertex {
    cml::vector3f position;
    cml::vector2f st;       // texcoord
    cml::vector2f lightmap; // lightmap texcoord
    cml::vector3f normal;
    cml::vector<unsigned char, cml::fixed<4> > color;
};

If I add float padding[5];, sizeof(vertex) yields 64, but after altering my vertex loading to account for this, I still get the same frame rate.
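The sizes and byte offsets discussed above can be checked at compile time. This is a sketch that swaps the cml types for plain arrays, assuming cml's fixed-size vectors store nothing but their elements and that float is 4 bytes; `Vertex`, `PaddedVertex` and `buffer_offset` are illustrative names, not the poster's:

```cpp
#include <cstddef>
#include <cstdint>

// Plain-array stand-in for the thread's vertex struct (assumption: the cml
// fixed vectors have the same layout as raw arrays).
struct Vertex {
    float position[3];
    float st[2];            // texcoord
    float lightmap[2];      // lightmap texcoord
    float normal[3];
    unsigned char color[4];
};

// 10 floats (40 bytes) + 4 colour bytes = 44, matching the reported sizeof.
static_assert(sizeof(Vertex) == 44, "unexpected vertex size");
static_assert(offsetof(Vertex, st) == 12, "texcoord follows position");
static_assert(offsetof(Vertex, normal) == 28, "normal follows both texcoord pairs");
static_assert(offsetof(Vertex, color) == 40, "colour follows normal");

// Same fields plus five floats of padding: 44 + 20 = 64 bytes.
struct PaddedVertex {
    float position[3];
    float st[2];
    float lightmap[2];
    float normal[3];
    unsigned char color[4];
    float padding[5];
};
static_assert(sizeof(PaddedVertex) == 64, "padded vertex should fill 64 bytes");

// With a VBO bound, the gl*Pointer calls take a byte offset disguised as a
// pointer; this hypothetical helper performs that conversion.
inline const void* buffer_offset(std::size_t bytes) {
    return reinterpret_cast<const void*>(static_cast<std::uintptr_t>(bytes));
}
```

Using `offsetof` rather than hand-summed sizes keeps the offsets passed to the pointer calls correct if a field is added or reordered.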
Quote:Original post by Krypt0n
Quake levels don't have that many polys; the PVS is mainly for reducing overdraw. So you could fill your VBO only when you move (when the visibility set changes) and in all other frames just push the same VBO data.
That's probably the fastest way on today's hardware.
As "Ohforf sake" mentioned, CPU<->GPU syncs cost you performance; some brute-force vertex pushing is way faster.

So you'd fill a VBO every time I update the position or move the view (because of frustum culling and the changed frustum)? What I did earlier was just to mark whether or not something was potentially visible, and I only updated those markings if the camera had moved, which indeed gave me a performance increase. However, that was still using glDrawArrays without VBOs. The VBOs are slowing my program down, with or without culling.
When drawing with the two large VBOs, do you also make a call to glVertexPointer etc. for every face? If so, can you change your indices so that the glVertexPointer calls can be moved/merged to a single call?
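One way to read the suggestion above: if each surface's indices are rebased so they address the shared vertex array directly, the pointer setup no longer depends on the surface and can be done once. A sketch under that assumption follows; `Surface` mirrors only the three q3bsp::surface fields used in the thread, while `DrawRange`, `RebasedIndices` and `rebase_indices` are hypothetical names. The output uses 32-bit indices because rebased values can exceed what an unsigned short holds:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Subset of the surface record used in the thread.
struct Surface {
    std::uint32_t vertex;             // first vertex in the shared vertex array
    std::uint32_t mesh_verts;         // first entry in the local index array
    std::uint32_t num_mesh_vertices;  // number of indices for this surface
};

// Where a surface's indices live inside the rebased index array.
struct DrawRange {
    std::uint32_t first;
    std::uint32_t count;
};

struct RebasedIndices {
    std::vector<std::uint32_t> indices;  // indices into the shared vertex array
    std::vector<DrawRange> ranges;       // one entry per surface, same order
};

// Converts surface-relative indices into indices relative to vertex 0.
RebasedIndices rebase_indices(const std::vector<Surface>& surfaces,
                              const std::vector<std::uint16_t>& local_indices) {
    RebasedIndices out;
    out.ranges.reserve(surfaces.size());
    for (const Surface& s : surfaces) {
        assert(static_cast<std::size_t>(s.mesh_verts) + s.num_mesh_vertices
               <= local_indices.size());
        DrawRange range;
        range.first = static_cast<std::uint32_t>(out.indices.size());
        range.count = s.num_mesh_vertices;
        for (std::uint32_t i = 0; i < s.num_mesh_vertices; ++i) {
            out.indices.push_back(s.vertex + local_indices[s.mesh_verts + i]);
        }
        out.ranges.push_back(range);
    }
    return out;
}
```

With indices in this form, each surface would be drawn from its `DrawRange` using GL_UNSIGNED_INT and a byte offset of `first * sizeof(std::uint32_t)` into the index buffer, with the vertex, texcoord, normal and colour pointers set once per frame.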

Another thing:
glDrawRangeElements(GL_TRIANGLES, 0, surface->count, surface->count, GL_UNSIGNED_SHORT, 0 );

Not sure about this one, I'll have to look that up, but shouldn't that be:
glDrawRangeElements(GL_TRIANGLES, 0, mapfile->vertices.size()-1, surface->count, GL_UNSIGNED_SHORT, 0 );
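For reference, the `start` and `end` arguments of glDrawRangeElements are the smallest and largest index values referenced by the draw, not the number of indices, so any valid upper bound works but a tight one gives the driver more to go on. A sketch of computing the tight bounds on the CPU at load time (`IndexRange` and `index_range` are hypothetical names):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest and largest index value used by one draw call.
struct IndexRange {
    std::uint16_t start;
    std::uint16_t end;
};

// Scans `count` indices beginning at position `first` and returns the
// minimum and maximum values found (both inclusive).
IndexRange index_range(const std::vector<std::uint16_t>& indices,
                       std::size_t first, std::size_t count) {
    assert(count > 0 && first + count <= indices.size());
    const auto begin = indices.begin() + static_cast<std::ptrdiff_t>(first);
    const auto end = begin + static_cast<std::ptrdiff_t>(count);
    const auto mm = std::minmax_element(begin, end);
    return IndexRange{*mm.first, *mm.second};
}
```

The result would be computed once per surface and stored next to its index count, then passed as the second and third arguments of the draw call.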
Currently, when I have one large vertex buffer and one large index buffer, I have to set up glVertexPointer, glNormalPointer, etc. for every object, each with an offset. I have thought about making one large draw call that renders everything, but that would require me to modify the index buffer every frame. Do you suggest I do that, keeping the index buffer read/write and modifying it and its size whenever the origin and frustum change?

glDrawRangeElements performs the same as glDrawElements; I only used it here for testing purposes.
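The "one large draw call" variant described here could be sketched as follows, assuming the indices already address the shared vertex array directly and each surface knows its range within the full index array. `DrawRange` and `gather_visible_indices` are hypothetical names; the visibility flags stand in for whatever the PVS and frustum tests produce:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Range of one surface inside the full index array.
struct DrawRange {
    std::uint32_t first;
    std::uint32_t count;
};

// Copies the indices of every visible surface into `out`, back to back, and
// returns how many indices were written. `out` is cleared but keeps its
// capacity, so repeated calls do not reallocate once it has grown.
std::size_t gather_visible_indices(const std::vector<std::uint32_t>& all_indices,
                                   const std::vector<DrawRange>& ranges,
                                   const std::vector<bool>& visible,
                                   std::vector<std::uint32_t>& out) {
    assert(ranges.size() == visible.size());
    out.clear();
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        if (!visible[i]) {
            continue;
        }
        const DrawRange& r = ranges[i];
        assert(static_cast<std::size_t>(r.first) + r.count <= all_indices.size());
        const auto begin = all_indices.begin() + static_cast<std::ptrdiff_t>(r.first);
        out.insert(out.end(), begin, begin + static_cast<std::ptrdiff_t>(r.count));
    }
    return out.size();
}
```

The gathered list would then be uploaded to an index buffer created with a dynamic usage hint and drawn with a single glDrawElements call. Whether that beats many small draws on this hardware is something only a measurement can tell.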
If you have an old or low-spec graphics card, it may be that VBOs are not supported properly. I ran into this same problem a while back on an old laptop, and I believe I eventually discovered the driver was emulating VBOs in software. At the time someone here pointed me to this demo: http://www.songho.ca/opengl/files/vbo.zip. You should try running it and toggling VBOs on and off to see if you get the results you expect.

EDIT: Actually I just ran that demo myself and it runs slower with VBOs on than off, even on my Radeon HD 4800. So maybe forget that idea...

