Sign in to follow this  
jlashmet

Vertex Cache

Recommended Posts

jlashmet    124
Hi guys, I'm trying to implement a vertex cache similar to the one described by Yann a few years back. I've got the basics working, but the performance is terrible. I'm using directx and have allocated a single dynamic vertex and index buffer which are split into fixed chunks. When rendering a mesh of about 300 faces, I get about 30 frames a second. However, I've found that the performance highly depends on the location of the vertices in the buffer. When they are towards the beginning, I get around 900fps. If they get assigned to a block that is towards the middle, I get the drop to 30fps. Has anyone here run into issues like this? Should I be using separate buffers per block? Currently my single buffer is 24 megs. Jason

Share this post


Link to post
Share on other sites
Yann L    1802
Here is a copy of a PM I wrote during an idea exchange with someone implementing a similar system:

-------

Hey,

Well, lots of time passed since those threads were created, and hardware has changed. The whole sliding slot structure was optimal in the days where GPUs offered full manual control over VRAM allocation and synchronization, ie. we're talking about nVidias VAR OpenGL extension. It was highly optimized to get the last little bit of memory to use, trying to reduce wasted memory to an absolute minimum.

Those times are over. Newer allocation systems (VBO for OpenGL and mapped vertex buffers for D3D) won't allow such low level access anymore. While the whole allocator system can still be implemented by locking and substructuring large buffers, this is in no way optimal from a performance point of view, since driver manufacturers didn't expect such abuse of the system.

Also, VRAM became cheap. 256 MB cards will be the norm in the very near future, and 1GB cards are lurking just below the horizon. A few hundred kilobytes more or less won't make a difference anymore. With this in mind, cache allocation systems have to be adapted.

The good news is that they can be considerably simplified. I changed the sliding slot allocator to a simple combination of a pool allocator and a fixed slot allocator, and the results are almost as good as the sliding slot one - and will perfectly work with new hardware and APIs.

Basically, allocate a certain number of fixed size slots for the smaller size requests: x * 1k slots, y * 2k slots, z * 4k slots, etc. They may waste memory, but they're very fast and hardware friendly. And since the slots sizes are kept small, the wastage is acceptable.

Then, for the larger requests, use a standard pool allocator, with a binary search tree to get the optimal insertion position. You can fit anything into the pool allocator that didn't fit into a fixed slot. Both allocators should be controlled by an LRU scheme, based on object visibility and temporal coherence.

You'll have to experiment a little on the sizes and granularities. In my case, I usually fit both allocators into a 24 or 32MB cache area. Even on scenes with tens of million faces and 800k - 900k faces in the view, the 32MB version won't overflow. And even if it did - big deal, all you lose is a little framerate until the cache is flushed again.

Cheers,
Yann

-----

Share this post


Link to post
Share on other sites
python_regious    929
Quote:
Original post by Yann L
Basically, allocate a certain number of fixed size slots for the smaller size requests: x * 1k slots, y * 2k slots, z * 4k slots, etc. They may waste memory, but they're very fast and hardware friendly. And since the slots sizes are kept small, the wastage is acceptable.


So, say I had a 1.5k, and a 0.5k set of data, a 1k slot, and a 2k slot would be taken up ( rather than concatenating the two and just use a single 2k slot )?

As for larger data sizes, > than 4k in this example, can you expand on this:
Quote:

Then, for the larger requests, use a standard pool allocator, with a binary search tree to get the optimal insertion position. You can fit anything into the pool allocator that didn't fit into a fixed slot. Both allocators should be controlled by an LRU scheme, based on object visibility and temporal coherence.


Please? or possibly link to some papers generalising the technique?


Share this post


Link to post
Share on other sites
codehunter13    122
Hi,
ive just implemented the old system and now i read this :)
The only difference between the two systems i see,is that you use an extra allocator(the one with the little chunks) in the new system.The allocator for big chunks is the same as the one in the old system? Maybe without the step where you check for cache hits/misses and reorder the chunks accordingly. Is this correct?

Share this post


Link to post
Share on other sites
Yann L    1802
Quote:
Original post by codehunter13
Hi,
ive just implemented the old system and now i read this :)
The only difference between the two systems i see,is that you use an extra allocator(the one with the little chunks) in the new system.The allocator for big chunks is the same as the one in the old system? Maybe without the step where you check for cache hits/misses and reorder the chunks accordingly. Is this correct?

More or less. The idea is to avoid sub-buffer updates within an existing VBO, and dispatch rendering on seperate objects instead. This allows for much better performance on modern hardware, at the expense of a little more memory.

For the individual block allocators, one VBO should hold one to 8 individual slots. A typical scene will randomize the slot order enough to get a good statistical distribution over several different VBOs, and avoid many sub buffer updates in a single frame. Sharing up to 8 (16 will typically also work) slots per VBO reduces the number of allocated objects, which can also impact performance on some drivers.

For the large pool allocator, you'd have a single VBO with evil sub buffer updates, but this allocator will usually be updated far less often than the block allocators.

If you use D3D, then replace all instances of 'VBO' with 'vertex buffer' above :)

Share this post


Link to post
Share on other sites
Toji    535
Sorry to revive a somewhat dead (4 days old?) thread, but I'm very curious about something:

Yann, you mention that having a "slotted" setup like this is most effecient for the graphics cards, but it seems as if the main benifit is that you avoid Sub Buffer updates at the cost of a lot more VBO binding. This leads me to believe that this setup is primarily meant for buffers that are going to be updated frequently (as in nearly every frame), possibly for CPU skinning or particle effects. If you have a situation where you have many static buffers (level geometry, GPU skinned meshes, etc) wouldn't it be more benificial to use several large, subdivided buffers instead? While it would require more sub buffer updates upfront, the reduced VBO switching seems like it would be worth it.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this