glBufferSubDataARB performance issues

2 comments, last by digitalgibs 14 years, 8 months ago
Can anyone tell me if there is a better way? Is it the number of calls to glBufferSubDataARB() that's killing me, or the number of VBOs?

I'm working on an OpenGL project that has potentially hundreds of objects (around 300-400) animated on the CPU. Each one is roughly 300-900 triangles. Unfortunately I can't batch them into a single draw call, as each object may contain unique shader parameters.

Right now I'm pre-allocating VBO memory for each object using glBufferDataARB( GL_ARRAY_BUFFER_ARB, numBytes, verts, GL_DYNAMIC_DRAW ) and then updating all VBOs using glBufferSubDataARB( GL_ARRAY_BUFFER_ARB, 0, numBytes, verts ) before rendering the objects. I'm told that this is the fastest way to update, but for some reason it is killing my performance. If I comment out this update call and just loop through all the other code (CPU animations, visible object determination, etc.), it runs at crazy frame rates (120+ fps), but including the update drops it down to the teens.

NOTE: To keep fill rate from affecting my tests, I disabled the actual draw calls. Only CPU code and VBO updates were being processed here.
Try mapping buffers, try to have fewer buffers with more than one triangle group in each (and use offset for drawing), and try to have several sets of them or use glBufferDataARB(..., 0, ...).

glBufferSubDataARB may, but is not required to, transfer the data to the card asynchronously. What it cannot do asynchronously is the memory copy into the driver's own buffers, because you could in theory delete the pointed-to memory the microsecond after the function call returns. Thus it will always incur some additional delay compared to mapping the buffer.

Your second issue is binding buffers, which is not as trivial on the driver side as you may think. Doing that many hundreds of times per frame can be a problem. Indexing into fewer buffers is cheaper.

The third issue is stalls. Drawing cannot happen before all data has been transferred, and transfers to the same memory can't happen before all draws that use it have finished. The driver will schedule asynchronously as much as it can, but it can't do much in such situations if it doesn't know what's safe to discard and when.
This can be solved by explicitly having 2-3 sets of buffers (so it will draw from one while you upload the other) or simply by calling glBufferDataARB with zero size before uploading the next buffer, which tells the driver "I won't be using the old contents any more, so throw them away once you're done, and store my new data elsewhere in the meantime".
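For anyone wanting to try the "2-3 sets of buffers" idea, here is a minimal sketch of the bookkeeping only, in plain C. The names NUM_BUFFER_SETS, buffer_set_for_frame and the vbo[set][object] array mentioned in the comments are hypothetical and not from this thread; the GL calls appear only as comments, since they need a live context.

```c
#include <assert.h>

/* Assumption: three sets of VBOs, one of the "2-3 sets" suggested above. */
enum { NUM_BUFFER_SETS = 3 };

/* Returns which buffer set to upload into (and then draw from) this frame.
 * Consecutive frames get different sets, so an upload in frame N never
 * touches the set that frame N-1's draw calls may still be reading. */
static unsigned buffer_set_for_frame(unsigned long frame)
{
    return (unsigned)(frame % NUM_BUFFER_SETS);
}

/* Per-frame usage (hypothetical vbo[set][object] array of buffer names):
 *
 *   unsigned s = buffer_set_for_frame(frame);
 *   glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo[s][obj]);
 *   glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, 0, numBytes, verts);
 *   ... set attribute pointers and draw from vbo[s][obj] ...
 */
```

The cost is NUM_BUFFER_SETS times the VBO memory, in exchange for giving the driver a buffer it is less likely to be waiting on.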
Thanks samoth. Maybe I'll try to create one big VBO and reserve partitions for each object, then render using offsets. I wonder if bandwidth is an issue here as well, since I'm using an interleaved vertex structure and not separate arrays.

// vertex structure (68 bytes)
vec3f pos;
vec2f st[2];
vec3f tangents[3];
unsigned char color[4];

At an average of 600 verts x 300 objects x 68 bytes, I'm looking at roughly 12MB of uploads per frame... =( That sounds like a lot, but I don't know what is considered reasonable for an OGL 1.5+ compatible video card.
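The 68-byte figure and the per-frame total can be checked with a small C sketch. The vec3f and vec2f definitions below are assumptions (plain structs of 32-bit floats, which is what the stated sizes imply); Vertex and upload_bytes_per_frame are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definitions: tightly packed 32-bit floats. */
typedef struct { float x, y, z; } vec3f;   /* 12 bytes */
typedef struct { float s, t; } vec2f;      /*  8 bytes */

/* The interleaved vertex from the post above. */
typedef struct {
    vec3f pos;               /* offset  0, 12 bytes */
    vec2f st[2];             /* offset 12, 16 bytes */
    vec3f tangents[3];       /* offset 28, 36 bytes */
    unsigned char color[4];  /* offset 64,  4 bytes */
} Vertex;

/* Bytes uploaded per frame if every vertex of every object is rewritten. */
static unsigned long upload_bytes_per_frame(unsigned long verts_per_object,
                                            unsigned long objects)
{
    return verts_per_object * objects * (unsigned long)sizeof(Vertex);
}
```

With 600 verts and 300 objects this comes to 12,240,000 bytes, which matches the roughly 12MB estimate.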
Question: if I group many objects into one VBO, how would I assign the offset?

// object[0]
glVertexAttribPointerARB(..., 0); // vertex.pos
glVertexAttribPointerARB(..., 12); // vertex.texcoord

// object[1] ?? is the following valid ??
glVertexAttribPointerARB(..., 0 + firstVertexOffset); // vertex.pos
glVertexAttribPointerARB(..., 12 + firstVertexOffset); // vertex.texcoord

or would I always bind (0 and 12) and use offsets in my index buffer?

object[0].indexes = { 0, 1, 2, 3 }
object[1].indexes = { 4, 5, 6, 7 }
etc...
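As far as I know both variants are valid: with a VBO bound, the last argument of glVertexAttribPointerARB is interpreted as a byte offset into the buffer, so it can include the object's starting position, or the attribute offsets can stay at 0 and 12 and the indices can be shifted instead. A minimal C sketch of the arithmetic for each; attrib_byte_offset and rebase_indices are hypothetical helper names, and the 68-byte stride comes from the vertex structure earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

enum { VERTEX_SIZE = 68, TEXCOORD_OFFSET = 12 };  /* from the 68-byte vertex */

/* Variant 1: fold the object's first vertex into the attribute offset.
 * The result is what would be cast to a pointer and passed as the last
 * argument of glVertexAttribPointerARB (with stride VERTEX_SIZE). */
static size_t attrib_byte_offset(size_t first_vertex, size_t attrib_offset)
{
    return first_vertex * VERTEX_SIZE + attrib_offset;
}

/* Variant 2: keep attribute offsets at 0 and 12 and shift the indices.
 * 32-bit indices are assumed here: 300 objects x 600 verts no longer
 * fits in 16-bit indices once everything shares one VBO. */
static void rebase_indices(unsigned int *indices, size_t count,
                           unsigned int first_vertex)
{
    for (size_t i = 0; i < count; ++i)
        indices[i] += first_vertex;
}
```

Variant 1 needs the attribute pointers set again per object but leaves the index data untouched. Variant 2 sets the pointers once, at the price of wider indices.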
FYI for anyone following this thread: unfortunately, combining VBOs didn't make a very big difference in performance. =( I guess the way I was handling the data was fairly optimal to begin with, so batching the objects into a single VBO was a negligible boost at the cost of much less flexibility (it's hard to add or remove objects without another manager or brute-force re-allocations to close the fragmentation).

I think that I may have simply run into a bandwidth issue with my card =(. I did a test with around 300 objects which totaled close to 6.2MB of vertex buffer updates but only a few bytes of index buffer updates. The frame rate was running around 25fps-35fps (no rendering, only updates) which means:

(6.2)MB per frame x (25 to 35)fps = 155MB/s - 217MB/s upload...
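The same estimate as a tiny C helper, in case anyone wants to plug in their own numbers. upload_rate_mb_per_s is a hypothetical name, and the only inputs are the measured MB per frame and the frame rate.

```c
#include <assert.h>

/* Sustained upload rate in MB/s given the data rewritten each frame
 * (in MB) and the measured frame rate (frames per second). */
static double upload_rate_mb_per_s(double mb_per_frame, double fps)
{
    return mb_per_frame * fps;
}
```

For 6.2MB per frame this gives about 155MB/s at 25fps and about 217MB/s at 35fps, the range quoted above.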

video card:
NVidia GeForce 8600M GS

Wikipedia claims a memory bandwidth of 12.8 to 22.4 GB/s for this card... Of course, AGP 1x and PCI are in the range of 150-250MB/s, so the bus is likely the problem and not the video card.

[Edited by - digitalgibs on July 27, 2009 2:05:31 AM]

This topic is closed to new replies.
