VBO Pooling: Does it make sense?

8 comments, last by max343 11 years, 5 months ago

Can anyone provide a useful link or results of any experiment that could confirm the story about vertex post-transform cache on post-Fermi cards?
I have read a lot about it (and implemented some of the schemes), and there were indeed benefits on older cards, but I saw no improvement on Fermi.
It also depends heavily on the driver and on how vertices are distributed across the multiple processing units. We would have to delve deep into GPU architecture and driver design to get a definitive answer, so it is much simpler to run experiments. That's why I'm asking for your results: is there any benefit to optimizing indexing on modern GPUs?
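To make such experiments comparable, one quick way to score an index buffer offline is to simulate a post-transform cache and compute the ACMR (average cache miss ratio). This is only a rough sketch: the cache size (32 entries) and FIFO policy are assumptions, and modern parallel hardware does not behave like a single sequential cache.

```python
# Rough offline score for an index buffer: simulate a small FIFO
# post-transform cache and compute the ACMR (average number of vertex
# shader runs per triangle).  Cache size and FIFO policy are assumptions;
# real post-Fermi hardware batches vertices across parallel units.
from collections import deque

def acmr(indices, cache_size=32):
    cache = deque(maxlen=cache_size)  # FIFO eviction
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)
    return misses / (len(indices) // 3)

# Two triangles sharing an edge transform 4 vertices -> ACMR 2.0;
# the worst case for any mesh is 3.0 (no reuse at all).
```

Comparing the ACMR of your original and reordered index buffers at least tells you whether a reordering *should* help before you spend time benchmarking it on real hardware.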


The biggest difference in the memory department in Fermi was that NVIDIA introduced a proper L1/L2 hierarchy: about 768 KB of L2, and 16 KB of L1 (in the default configuration). This means you can now reason about how to use the cache well, or how to defeat it (if you're really into that). Before Fermi, caching in NVIDIA's hardware was essentially a black box and involved a lot of finger-crossing.
Since the introduction of a conventional cache hierarchy, raw VRAM transfer rate is not so interesting; what matters now are L2 misses. A rule of thumb is that a miss at any cache level costs roughly 10 times more cycles, because the same data has to be fetched from the next level up. And you always start from L1, where fetches are very cheap.
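The ~10x-per-level rule can be turned into a back-of-envelope cost model. The base latency (4 cycles for an L1 hit) and the factor of 10 are illustrative assumptions, not measurements from any specific GPU.

```python
# Back-of-envelope model of the ~10x-per-level rule of thumb.  The base
# latency (4 cycles for an L1 hit) and the factor are illustrative.
def avg_fetch_cycles(l1_hit, l2_hit, l1_lat=4, factor=10):
    """Expected cycles per fetch given L1 and L2 hit rates."""
    l2_lat = l1_lat * factor       # ~40 cycles: L1 miss, L2 hit
    mem_lat = l2_lat * factor      # ~400 cycles: all the way to VRAM
    return (l1_hit * l1_lat
            + (1 - l1_hit) * l2_hit * l2_lat
            + (1 - l1_hit) * (1 - l2_hit) * mem_lat)

# Even a 90% hit rate at both levels nearly triples the average cost
# relative to pure L1 hits (about 11.2 cycles vs 4).
```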

With larger buffers, all you'll probably see are capacity misses, and they generally apply only to L1. These are not so bad, and this is what has been measured up until now, except that pre-Fermi hardware had no L1/L2, so you'd essentially pay something like an L2 miss every time.
Triggering L2 misses, on the other hand, won't go unnoticed. They are not hard to provoke: keep in mind that the cache line size is 128 B (so there are about 6k lines), assume some associativity/eviction policy, and then use any of the well-known ways to generate conflict misses for that configuration. Random jumps through the buffer won't give you the desired result, since they map roughly uniformly onto the L2, but a simple stride pattern does the trick quite well.
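The stride construction above can be sketched in a toy simulator. The 128-byte lines and ~6k-line capacity come from the post; the 16-way associativity and LRU eviction are pure assumptions, since NVIDIA does not document the real policy.

```python
# Toy conflict-miss model using the figures from the post: 128-byte
# lines, ~6k lines of L2.  The 16-way associativity and LRU eviction
# are assumptions; the real policy is undocumented.
LINE = 128            # bytes per cache line (from the post)
LINES = 6144          # 768 KB / 128 B
WAYS = 16             # assumed associativity
SETS = LINES // WAYS  # 384 sets

def simulate(addresses):
    """Count misses for a byte-address stream, LRU within each set."""
    sets = [[] for _ in range(SETS)]
    misses = 0
    for addr in addresses:
        line = addr // LINE
        s, tag = line % SETS, line // SETS
        ways = sets[s]
        if tag in ways:
            ways.remove(tag)      # hit: refresh LRU position
        else:
            misses += 1
            if len(ways) == WAYS:
                ways.pop()        # evict least-recently-used tag
        ways.insert(0, tag)
    return misses

# A stride of SETS * LINE bytes funnels every access into one set:
# cycling through WAYS + 1 such addresses thrashes it, while the same
# number of contiguous lines misses only once each.
thrash = [i * SETS * LINE for i in range(WAYS + 1)] * 100
spread = [i * LINE for i in range(WAYS + 1)] * 100
```

With these assumptions, the strided stream misses on every single access while the contiguous stream misses only on the first pass, which is exactly the uniform-mapping-vs-pattern distinction made above.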

