Geometry batching?

Hi there! I somewhat recall there was an old thread started by Yann - the graphics guru :P - where he described his shader system along with a batching system for geometry. If I'm not entirely mistaken, the batching was sorted using some modified version of radix sort to minimize state changes during the rendering of a frame. Then there was the use of geometry slots pointing into a vertex buffer. As I understood it, each piece of geometry had its place in a large vertex buffer shared by many different geometries, to reduce vertex buffer state changes.

What I'm wondering is whether there really is any big performance gain nowadays in batching multiple objects into one batch, when speaking of models consisting of, let's say, 5000~10000 triangles? And something I didn't really get is how such caching would be implemented most efficiently. As I see it, the most optimal way would be to fill a vertex buffer with spatially local geometry that will most likely be rendered together. For static geometry that never moves in a map this could be somewhat achieved... Anyway, is there really any point in putting much effort and time into caching geometry in large vertex buffers?
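To make sure I understand the sorting part, I imagine something roughly like this (just my own sketch of the idea, not something taken from Yann's thread - the bit widths and field names are made up):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sort a frame's draw calls by a packed state key so that draws sharing
    // the same shader/texture/vertex buffer end up next to each other.
    struct DrawCall
    {
        uint32_t shaderId;   // most expensive state change -> highest bits
        uint32_t textureId;
        uint32_t vbId;       // which shared vertex buffer the geometry lives in
        // ... plus offsets into the VB/IB, etc.

        uint32_t SortKey() const
        {
            return (shaderId << 20) | (textureId << 10) | vbId;
        }
    };

    void SortDrawCalls(std::vector<DrawCall>& calls)
    {
        // Yann apparently used a modified radix sort over keys like this;
        // std::sort is enough to show the idea.
        std::sort(calls.begin(), calls.end(),
                  [](const DrawCall& a, const DrawCall& b)
                  { return a.SortKey() < b.SortKey(); });
    }

Is that roughly the right idea?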
AFAIK, the idea of having a single vertex buffer and drawing different chunks of geometry from ranges of it is that you do not have to detach/reattach different buffers.

This is true even of animated geometry, where multiple frames can be calculated and then blended/animated on the GPU, thereby avoiding having to re-upload geometry every frame.
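In D3D9 terms I picture it roughly like this (just a sketch off the top of my head, so the details might be off): attach the shared buffers once, then issue each chunk as a range of them.

    #include <d3d9.h>
    #include <vector>

    // Where each chunk of geometry lives inside the big shared VB/IB.
    struct ChunkRange
    {
        INT  baseVertex;     // first vertex of the chunk in the shared VB
        UINT numVertices;
        UINT startIndex;     // first index of the chunk in the shared IB
        UINT primCount;
    };

    void DrawChunks(IDirect3DDevice9* dev,
                    IDirect3DVertexBuffer9* sharedVB,
                    IDirect3DIndexBuffer9* sharedIB,
                    UINT stride,
                    const std::vector<ChunkRange>& chunks)
    {
        dev->SetStreamSource(0, sharedVB, 0, stride);   // attach once...
        dev->SetIndices(sharedIB);

        for (size_t i = 0; i < chunks.size(); ++i)      // ...then only draw ranges
        {
            const ChunkRange& c = chunks[i];
            dev->DrawIndexedPrimitive(D3DPT_TRIANGLELIST,
                                      c.baseVertex, 0, c.numVertices,
                                      c.startIndex, c.primCount);
        }
    }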

Someone correct me if I'm wrong?
Yeh, I would think so too. Does anyone know if the size matters, or does performance get worse once you go over some size?
It depends on the video card you're dealing with. In Yann's thread (god only knows how many times I've read and re-read that entire thread) he describes video cards supporting the array_range extension pretty much allocating the entirety of the available memory and sub-allocating from that. But video cards like my Radeon 9600 XT would have an absolute fit with that, I'd imagine.
I'm reading that the overhead is much, much smaller when working with DX10 cards. But the last time I implemented this in my engine (at that time on a GF6800), the performance difference was pretty huge. I don't remember exact numbers now, but way over 30%. My VB was about 15 MB, but it was compressed at a ratio of 3:1, so the uncompressed VB would be around ~40 MB. At that size even the bandwidth would probably start playing a role. But even when compressed to 15 MB, it still showed a huge performance increase. So, I'd say it's worth spending a day or two on this change.

VladR My 3rd person action RPG on GreenLight: http://steamcommunity.com/sharedfiles/filedetails/?id=92951596

Ok, good to hear. I already started implementing the sorting mechanism/state tree anyway :)

Another thing I was thinking about is how to best manage the vertex buffers.

For static geometry that won't be moving, I figure it would be best to store the data according to spatial position, since spatially close objects will probably be rendered together anyway.

Then, after dividing the vertex buffers into spatial areas, also try to sort vertices of the same type close to each other within the buffers.

So, basically I was thinking to have something like this:

World Map:
|------------------------------|
|              |               |
|              |               |
|     VB1      |      VB2      |
|              |               |
|______________|_______________|
|              |               |
|              |               |
|     VB3      |      VB4      |
|              |               |
|              |               |
|------------------------------|


Now assume that each vertex buffer has some "optimal" size.
Then also put data with the same vertex structure close to each other within the buffers.

Do you think this will affect performance much at all?

VladR, so you used only one large vertex buffer? Did you sort the vertex data according to the vertex structures in that buffer?

(btw, I know that I shouldn't worry about this now... yada yada... but I want to do it anyway)
Doesn't the optimal VBO size depend on the size of the graphics card's vertex cache? Exceeding that cache size would hurt performance, or something.

However, using a single large vertex buffer is not optimal; it is better to use several medium-sized buffers and also sort geometry by which VBO it is stored in.

Also, I believe that in another thread somewhere Yann suggested that the render queues use temporal coherence: they are retained from frame to frame and geometry is just added/removed as needed, so the queue is already partially sorted and can be efficiently re-sorted using a natural mergesort, which runs in (almost) linear time on nearly sorted data. The difference is that it runs with significantly less overhead than the radix sort and so is faster overall.
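Something like this is how I picture the retained queue (a rough sketch of the idea only, not Yann's actual code):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct QueueItem
    {
        uint32_t    sortKey;      // packed state key (shader/texture/VB etc.)
        const void* renderable;   // whatever your engine actually draws
    };

    struct RenderQueue
    {
        std::vector<QueueItem> items;   // retained from frame to frame

        void Add(const QueueItem& item) { items.push_back(item); }
        // Remove() for objects that left the scene, omitted for brevity.

        void SortForFrame()
        {
            // A natural mergesort would exploit the already-sorted runs
            // explicitly; std::stable_sort just keeps the sketch short.
            std::stable_sort(items.begin(), items.end(),
                             [](const QueueItem& a, const QueueItem& b)
                             { return a.sortKey < b.sortKey; });
        }
    };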

Static geometry can be pre-processed in the spatial partitioning so that it fits optimally into a cache; dynamic geometry, however, must be uploaded into your cache slots.
The thing with the slot system is the need to avoid fragmentation, which is easy if the slots are fixed in size (rather than allocated to exactly fit each request). Fixed-size slots do mean there is some wastage in the VBO whenever the entire slot isn't used (which would be quite often), but that is fine on modern hardware and probably faster than trying to implement some sort of run-time defragmentation.
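A fixed-size slot scheme could look roughly like this (my own sketch of the idea, not a definitive implementation):

    #include <cstdint>
    #include <vector>

    // The VB is cut into equal-sized slots; a free list hands them out,
    // so there is never any fragmentation to clean up.
    struct SlotCache
    {
        uint32_t slotSizeInVerts;            // every slot holds this many vertices
        std::vector<uint32_t> freeSlots;     // indices of unused slots

        SlotCache(uint32_t slotCount, uint32_t slotSize)
            : slotSizeInVerts(slotSize)
        {
            for (uint32_t i = 0; i < slotCount; ++i)
                freeSlots.push_back(i);
        }

        // Returns the first vertex of a free slot, or false if the request
        // doesn't fit / the cache is full (at which point you would evict).
        bool Allocate(uint32_t vertsNeeded, uint32_t& firstVertexOut)
        {
            if (vertsNeeded > slotSizeInVerts || freeSlots.empty())
                return false;
            uint32_t slot = freeSlots.back();
            freeSlots.pop_back();
            firstVertexOut = slot * slotSizeInVerts;   // wasted space if vertsNeeded < slot size
            return true;
        }

        void Free(uint32_t firstVertex)
        {
            freeSlots.push_back(firstVertex / slotSizeInVerts);
        }
    };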

All in all, sorting geometry by state and subdividing large vertex buffers ought to get you a good performance increase, even on modern hardware.
Obviously, it all depends on the style of game, the data representation of the level and the camera system. So far I've been working just on 3rd person games, so my experience is based around those. But I remember very clearly, from the first versions of my renderer ~5 yrs ago, that when I put all data (uncompressed at that time, so it used lots of VRAM) into a single VB and IB (draw calls stayed the same), the fps jumped from 45 to over 100 (on a shiny new GF3). Just by having everything in one VB and IB. It took half a day to implement, so it was definitely worth it.
However, when I upgraded my engine to continuous streaming in the background, it was time to change it, since these days the cards are so fast that they don't have a problem rendering 1M tris each frame at 100 fps. So I reduced the number of draw calls for static stuff (buildings/props) by grouping everything into sectors that get streamed. Surely there is some duplication, since all the crates and little stuff are the same, but since my vertex usually takes just 12 bytes (after some brutal compression), it's practically nothing. Besides, it gets streamed in the background, so whatever the size of each chunk, it's not slowing things down.

But the rendering got faster, since I render a whole chunk in a single DIP call. The redundant parts probably get clipped in the pipeline fast enough not to slow it down. Simply put, even if I render 30k-60k more tris per frame (they're outside the frustum, but you also save CPU cycles by not checking each object, or their quadtree node), the end result is again faster. Which is good.

Now it's not in a single VB anymore; just as you say, there are as many VBs as there are chunks, which is about 5-24 per scene (depending on camera angle).
But since I'm getting a framerate of ~250 with ~350k tris on a low-end GF6600, I don't feel the need to try connecting the cached VBs into a single VB (as in the GF3 example above) to see what kind of performance boost that would give me.
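The per-sector rendering loop looks roughly like this (a very simplified sketch; the real code does a bit more):

    #include <d3d9.h>

    // Each streamed sector owns its own VB/IB and is drawn with a single
    // DrawIndexedPrimitive call, with no per-object culling inside the sector.
    struct Sector
    {
        IDirect3DVertexBuffer9* vb;
        IDirect3DIndexBuffer9*  ib;
        UINT numVertices;
        UINT primCount;
    };

    void DrawVisibleSectors(IDirect3DDevice9* dev, UINT stride,
                            const Sector* sectors, UINT count)
    {
        for (UINT i = 0; i < count; ++i)    // typically 5-24 sectors per scene
        {
            dev->SetStreamSource(0, sectors[i].vb, 0, stride);
            dev->SetIndices(sectors[i].ib);
            dev->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                      sectors[i].numVertices, 0,
                                      sectors[i].primCount);
        }
    }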

Quote: did you sort the vertex data according to the vertex structures in that buffer?
Yes, with shaders you can have many different vertex types inside one VB. Of course, they must be sorted, but you don't do it per frame, so it's not a problem.

Quote: VladR, so you used only one large vertex buffer?
Well, not a single VB for the whole game, that's for sure. But previously (before the caching system), all static objects (buildings/trees/props) were in one VB, the terrain had a second one, and the characters had their own (per character type). But with the streaming, it is faster to just have a VB per terrain chunk - a VB consisting of just the raw 16-bit heightmap values (so it is rather small per chunk, just 32 KB); the rest gets calculated inside the shader.
Surely, you could have it all in one VB and update it slowly (after all, the visible chunks of the terrain mesh consume less than 2 MB of VB, no matter how huge the whole world is), but this way the terrain can be endless and consumes as little memory as possible, so I made a trade-off here.
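If it helps, a "heights only" chunk VB can be set up roughly like this (a simplified sketch of one way to do it, the real shader setup differs): a grid of xz coordinates shared by every chunk, plus a tiny per-chunk stream of 16-bit heights, combined in the vertex shader with the chunk's world offset.

    #include <cstdint>
    #include <vector>

    const int CHUNK_DIM = 128;   // e.g. 128x128 verts -> 32 KB of 16-bit heights

    // Built once and reused by every chunk: grid-local xz positions.
    std::vector<float> BuildSharedGridXZ()
    {
        std::vector<float> xz;
        xz.reserve(CHUNK_DIM * CHUNK_DIM * 2);
        for (int z = 0; z < CHUNK_DIM; ++z)
            for (int x = 0; x < CHUNK_DIM; ++x)
            {
                xz.push_back(float(x));
                xz.push_back(float(z));
            }
        return xz;
    }

    // Per-chunk data: nothing but the raw 16-bit heightmap values.
    std::vector<uint16_t> ExtractChunkHeights(const uint16_t* heightmap,
                                              int mapWidth, int chunkX, int chunkZ)
    {
        std::vector<uint16_t> heights;
        heights.reserve(CHUNK_DIM * CHUNK_DIM);
        for (int z = 0; z < CHUNK_DIM; ++z)
            for (int x = 0; x < CHUNK_DIM; ++x)
                heights.push_back(heightmap[(chunkZ * CHUNK_DIM + z) * mapWidth
                                            + chunkX * CHUNK_DIM + x]);
        return heights;
    }
    // In the vertex shader: worldPos = chunkOrigin + float3(xz.x, height * heightScale, xz.y).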

VladR My 3rd person action RPG on GreenLight: http://steamcommunity.com/sharedfiles/filedetails/?id=92951596

Yeh, that slot memory setup is interesting. I'm thinking I could live with some memory waste in the slots. (Anyway, it sounds cool to have "written a GPU memory slot system" :P)

Ok, big thanks to everyone ;)

