A Vertex Buffer can be released...?

I have all my static models split up into different VBs, and it has been fast so far. However, I haven't rendered a scene with a few hundred objects like you will be.
Hahaha, I was referring to 2.2 milliseconds per frame. But the number came right out of nowhere. It would depend heavily on your number of models and how large they are.
Firstly, 32 bytes per vertex is something that can probably be reduced.

Consider that each VB is a sort of heap. Rather than it being a heap of bytes, it's a heap of vertices. As such all the standard principles of memory managers apply...

Track ranges of vertices in each buffer and the model that they belong to. When you load a new model in, look for buffers with the right amount of space (if you're using index buffers then you can fix up the indices, so it doesn't even have to be a single consecutive chunk).

Once you've got a list of potential buffers, you want to pick the one that will give the best performance - i.e. one where the other models using that buffer are likely to be used at the same time as your new model, so that putting them together means you won't need to switch. That information can be built by hand (you give each model an ordered list of the other models it tends to be used alongside) or automatically (you play through the game with a 'usage recorder' turned on, which generates data about how often each pair of models was rendered in the same frame).
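To make the heap idea concrete, here's a minimal sketch of per-buffer range tracking (all names are invented for illustration; it assumes indexed rendering, so an allocation may span multiple ranges with the caller fixing up indices by each range's start):

```cpp
#include <algorithm>
#include <list>
#include <vector>

struct VertexRange { unsigned first; unsigned count; };

// Hypothetical per-VB heap: rather than bytes, it hands out vertices.
class VertexBufferHeap {
public:
    explicit VertexBufferHeap(unsigned totalVerts) {
        freeRanges.push_back({0, totalVerts});
    }

    // Carve 'count' vertices out of the free list. With an index buffer
    // the result doesn't need to be one consecutive chunk.
    bool allocate(unsigned count, std::vector<VertexRange>& out) {
        unsigned available = 0;
        for (const VertexRange& r : freeRanges) available += r.count;
        if (available < count) return false;   // caller tries another buffer
        for (auto it = freeRanges.begin(); count; ) {
            unsigned take = std::min(it->count, count);
            out.push_back({it->first, take});
            it->first += take; it->count -= take; count -= take;
            if (it->count == 0) it = freeRanges.erase(it);
            else ++it;
        }
        return true;
    }

    // Real code would coalesce adjacent ranges to fight fragmentation.
    void release(VertexRange r) { freeRanges.push_back(r); }

private:
    std::list<VertexRange> freeRanges;
};
```

A model placed at range.first just has range.first added to the indices that reference it, so nothing about the mesh data itself needs to be contiguous.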

Richard "Superpig" Fine - saving pigs from untimely fates - Microsoft DirectX MVP 2006/2007/2008/2009
"Shaders are not meant to do everything. Of course you can try to use it for everything, but it's like playing football using cabbage." - MickeyMouse

Yes, even at 2.2ms per frame, it's still a potentially huge amount of overhead that can be avoided at least partially. While I doubt it takes 2.2ms per frame, it does take something. Doing a heap approach like Superpig suggests allows multiple models to share buffers if the vertex formats match. It's also what I do. You could force allocations to be pow2 sized, or in fixed-size chunks, to help reduce fragmentation. Yann used some sort of dynamic pow2-based vertex manager. Once you're sharing models in a buffer, your render sorting function should attempt to put models using the same buffer together. How you choose to sort your models (perhaps sort by transparent objects first, then solids by shaders, then textures, then vertex buffers, then Z) is entirely dependent on your app. What works for one app may not be optimal for another. I'm just pointing out that the vertex buffer can factor into the sort somewhere.
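One common way to implement that kind of multi-criteria sort (a generic sketch, not anyone's actual engine code) is to pack the criteria into a single integer key, most significant field first, so one integer compare does the whole ordering:

```cpp
#include <cstdint>

// Hypothetical packed sort key: transparency in the top bit (flip it
// depending on whether transparent objects should come first or last),
// then shader, texture, vertex buffer, and quantized depth.
// Field widths are arbitrary.
uint64_t MakeSortKey(bool transparent, uint32_t shaderId, uint32_t textureId,
                     uint32_t vbId, uint32_t depth16)
{
    return (uint64_t(transparent)        << 63)
         | (uint64_t(shaderId  & 0x7FFF) << 48)
         | (uint64_t(textureId & 0xFFFF) << 32)
         | (uint64_t(vbId      & 0xFFFF) << 16)
         |  uint64_t(depth16   & 0xFFFF);
}
```

Sorting ascending by this key groups objects by shader, then texture, then buffer - exactly the order the fields were packed in.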

While 32-byte vertices can be reduced, you still want the vertex size to be a pow2 multiple for GPU cache purposes. Modern games also require more data. For example, let's say your vertex requires:

position, normal, diffuse, boneids, boneweights, uv, tangent.

At the very least, we need 4 bytes per element, or 28 bytes. Even if you omit a few elements, rounding up to 32 is still optimal. If you drop diffuse, bone IDs, and bone weights, you can cram the data into 16 bytes... assuming the lack of precision is fine for position, normal, and tangent. As Superpig's post points out, you can use SHORT2 for the UVs without trouble.
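Purely as an illustration (this layout isn't from the thread; the packed types mirror common D3D9 declaration types and assume the shader can decode them), a 32-byte version of that vertex might keep position at full precision, pack everything else to 4 bytes apiece, and drop the tangent to make it fit:

```cpp
#include <cstdint>

// Hypothetical 32-byte vertex: full-precision position, 4 bytes per
// remaining element (UBYTE4N normal, D3DCOLOR diffuse, UBYTE4 bone IDs,
// UBYTE4N bone weights, SHORT2 fixed-point UVs).
struct PackedVertex {
    float    pos[3];      // 12 bytes
    uint32_t normal;      //  4 bytes, biased x/y/z packed into a UBYTE4N
    uint32_t diffuse;     //  4 bytes, D3DCOLOR
    uint8_t  boneIds[4];  //  4 bytes, UBYTE4
    uint8_t  boneWts[4];  //  4 bytes, UBYTE4N
    int16_t  uv[2];       //  4 bytes, SHORT2
};

static_assert(sizeof(PackedVertex) == 32, "repack or pad to hit 32 bytes");
```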

Still, say we drop those vertices down to 16 bytes: we're still talking about 900MB of vertices (on the order of 57 million of them). Drop it to a minimal 4 bytes and it's still 228MB.

A game that manages this many resources is quite an undertaking. Since the OP is unclear about managing VB memory, I don't think he's quite ready to tackle such a project. Sure, aim high, but reasonably high. This is going to be one massive game, which usually isn't the best type of project with which to learn D3D.
Quote:Original post by Namethatnobodyelsetook
perhaps sort by transparent objects first, then solids by shaders, then textures, then vertex buffers, then Z


Just curious, what type of data structure and sorting method do you use to make this fast enough to do per frame?

Right now my scene graph is an N-ary tree (N unlimited), initially constructed spatially and not modified thereafter. I knew I would have to sort it from front to back eventually, but haven't done it yet. If you could point me in the right direction to do all those sorts efficiently, it would save me a lot of time.

Also, do you sort before frustum culling, or do you keep a separate scene graph of objects that pass frustum culling and then do the sorting?

Thanks.
Quote:Original post by Chris81
Quote:Original post by Namethatnobodyelsetook
perhaps sort by transparent objects first, then solids by shaders, then textures, then vertex buffers, then Z

Just curious, what type of data structure and sorting method do you use to make this fast enough to do per frame?

Usually, only your transparent objects need to be sorted per frame. Everything else can be sorted by texture as it is registered or loaded for rendering, with the stream source handled the same way: sort by texture first, then sub-sort by stream source under that. So if five objects use the same stream source and texture, there should be no state changes other than a matrix upload or maybe lighting.
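A sketch of that load-time ordering (the record type and IDs are invented): sort once when items are registered, by texture first and stream source second, so identical state runs end up adjacent:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical render-item record; a real one would hold D3D pointers.
struct RenderItem {
    unsigned textureId;
    unsigned streamSourceId; // which shared vertex buffer it lives in
};

// Run at registration/load time, not per frame.
void SortByState(std::vector<RenderItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const RenderItem& a, const RenderItem& b) {
                  if (a.textureId != b.textureId)
                      return a.textureId < b.textureId;
                  return a.streamSourceId < b.streamSourceId;
              });
}
```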

I wasn't aware that grouping by shader mattered more than grouping by texture. Is this true? As in, changing from one technique to another should be avoided more than changing between textures?
I gather what's visible, then break that into n sections, one per material. Each section's vital info (material, Z, original object pointer, etc.) is copied into a list node. These list nodes are all preallocated when an object is added to the scene, so no allocation/free occurs each frame.

The linked list is then sorted via a merge sort which is based on This, except that I use an Amiga-style doubly linked list (i.e. the list nodes ARE the data nodes; nodes don't just refer to data. The limitation is that a data item can only be in one list at a time; the gain is never needing to worry about allocation of list nodes).
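Here's a generic reconstruction (not the poster's code) of an intrusive node plus a linked-list merge sort; because the nodes are the data, sorting only relinks pointers and never allocates:

```cpp
// Intrusive ("Amiga-style") node: the node IS the data.
struct RenderNode {
    RenderNode* prev;
    RenderNode* next;
    unsigned long long sortKey; // e.g. packed material/buffer/Z key
    void* object;               // back-pointer to the scene object
};

// Merge two sorted singly-linked runs (prev links are rebuilt afterwards).
static RenderNode* Merge(RenderNode* a, RenderNode* b) {
    RenderNode head = {}; RenderNode* tail = &head;
    while (a && b) {
        if (a->sortKey <= b->sortKey) { tail->next = a; a = a->next; }
        else                          { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

RenderNode* MergeSort(RenderNode* list) {
    if (!list || !list->next) return list;
    RenderNode *slow = list, *fast = list->next;
    while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
    RenderNode* right = slow->next;
    slow->next = 0;                      // split the list in half
    return Merge(MergeSort(list), MergeSort(right));
}

void FixPrevLinks(RenderNode* list) {   // restore the doubly-linked invariant
    for (RenderNode* p = 0; list; p = list, list = list->next)
        list->prev = p;
}
```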

In addition to what's above, I also sort by mesh pointer, such that two objects with the same source mesh are placed next to each other. Now, when I go to render, I can check whether the next object in the list uses the same mesh and material, and decide to batch render instead. I just keep checking next, next, next, until I find a mismatch or I have too many objects to do in one batch.
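The batching walk might look like this (reusing the hypothetical RenderNode from above; SameMeshAndMaterial and DrawBatch are invented placeholders):

```cpp
const int MAX_BATCH = 64; // arbitrary cap on objects per draw batch

bool SameMeshAndMaterial(const RenderNode* a, const RenderNode* b); // placeholder
void DrawBatch(RenderNode* first, int count);                       // placeholder

// Walk the sorted list, growing a run of nodes that share mesh and
// material, then submit the whole run at once.
void RenderList(RenderNode* node) {
    while (node) {
        RenderNode* runStart = node;
        int count = 1;
        while (node->next && count < MAX_BATCH &&
               SameMeshAndMaterial(node->next, runStart)) {
            node = node->next;
            ++count;
        }
        DrawBatch(runStart, count);
        node = node->next;
    }
}
```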

For commercial-quality (though, sadly, we're not doing AAA quality) games, this sort takes roughly 0.2% of our CPU... so, basically, it's free. I could optimize it further by baking the sort parameters into the sort algorithm instead of doing a function call per comparison, but really, it's already blazingly fast.
Interesting, thanks.
Quote:Original post by Namethatnobodyelsetook
Doing a heap approach like Superpig suggests allows multiple models to share buffers if the vertex formats match.
Actually, with vertex declarations I /think/ you can put them in the same buffer even if the formats don't match. You just adjust your offsets accordingly. It's possible that they need to be the same size - which would certainly help from a management standpoint - but still, better than separate buffers.
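For reference, this is what a D3D9 vertex declaration looks like: every element carries an explicit stream, byte offset, and type, which is exactly what makes offset juggling within a shared buffer conceivable. The particular layout below is illustrative (it matches a 32-byte vertex):

```cpp
#include <d3d9.h>

// Declaration for a hypothetical 32-byte vertex; offsets are explicit,
// unlike the fixed layouts implied by FVF codes.
const D3DVERTEXELEMENT9 decl[] = {
    { 0,  0, D3DDECLTYPE_FLOAT3,   D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION,     0 },
    { 0, 12, D3DDECLTYPE_UBYTE4N,  D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL,       0 },
    { 0, 16, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR,        0 },
    { 0, 20, D3DDECLTYPE_UBYTE4,   D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_BLENDINDICES, 0 },
    { 0, 24, D3DDECLTYPE_UBYTE4N,  D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_BLENDWEIGHT,  0 },
    { 0, 28, D3DDECLTYPE_SHORT2,   D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD,     0 },
    D3DDECL_END()
};

// IDirect3DVertexDeclaration9* vdecl = NULL;
// device->CreateVertexDeclaration(decl, &vdecl);
```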

Quote:While 32-byte vertices can be reduced, you still want the vertex size to be a pow2 multiple for GPU cache purposes.
I'd be interested to know where you heard that. I've always been under the impression that fetches into the pre-transform cache were larger than 32 bytes at a time - more like 128 - which is why we reorder vertex buffers for indexed primitives such that sequentially used vertices get fetched into the cache all at the same time. As such, being non-pow2 shouldn't matter - it'd be more a question of whether your vertex size is a factor of the fetch size, or whether you're wasting space at the end of the fetch. (Of course, not working for an IHV, I could be totally off-base with this. Feel free to correct me.)

Richard "Superpig" Fine - saving pigs from untimely fates - Microsoft DirectX MVP 2006/2007/2008/2009
"Shaders are not meant to do everything. Of course you can try to use it for everything, but it's like playing football using cabbage." - MickeyMouse

True, matching stride is the important part of sharing a VB. I don't know how many cards support the new DX9 offset feature, so using that to match up models with different strides isn't always going to work. It does mean you can't use FVF vertex buffers, but as far as I know that just means you can't use the buffer as a destination for ProcessVertices, so it's not a biggie.
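The offset feature being referred to is the OffsetInBytes parameter of IDirect3DDevice9::SetStreamSource; hardware support is advertised through the D3DDEVCAPS2_STREAMOFFSET cap. A sketch (sharedVB, modelOffsetBytes, and modelStride are placeholders):

```cpp
// Bind a model that lives at an arbitrary byte offset inside a shared VB.
D3DCAPS9 caps;
device->GetDeviceCaps(&caps);
if (caps.DevCaps2 & D3DDEVCAPS2_STREAMOFFSET) {
    device->SetStreamSource(0, sharedVB, modelOffsetBytes, modelStride);
} else {
    // Fallback: bind at offset 0 and fold the offset into the
    // BaseVertexIndex argument of DrawIndexedPrimitive instead.
    device->SetStreamSource(0, sharedVB, 0, modelStride);
}
```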

This nVidia presentation mentions the 32-byte vertex, as does this ATI paper. The ATI paper claims the cache line is 256 bits, or 32 bytes, which is why 32- or 64-byte vertices are best. They suggest padding a 40-byte vertex out to 64. That's a lot of wasted RAM, so I'm sure the alignment does make a significant difference (perhaps an entirely theoretical one unless you're memory and vertex limited... who knows). The trouble with, say, 28-byte vertices is that the hardware may require fetches on 256-bit boundaries. To fetch a vertex that straddles a boundary, you may need two fetches. The occasional cache miss won't kill you, but now you're doubling the damage of each cache miss. Again, not terrible if you still usually access cached data, but depending on how well your model caches, it could be bad.
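To put a number on the straddling argument, here's a small back-of-the-envelope check assuming the 32-byte cache lines the ATI paper describes: a 28-byte stride makes most vertices cross a line boundary, while a 32-byte stride never does.

```cpp
#include <cstdio>

// Illustrative arithmetic only: count vertices whose bytes span two
// 32-byte cache lines for a given stride.
int main() {
    const unsigned LINE = 32, TOTAL = 1000;
    const unsigned strides[] = { 28, 32 };
    for (unsigned s = 0; s < 2; ++s) {
        unsigned stride = strides[s], straddles = 0;
        for (unsigned i = 0; i < TOTAL; ++i) {
            unsigned begin = i * stride;
            unsigned end   = begin + stride - 1;
            if (begin / LINE != end / LINE) ++straddles;
        }
        printf("stride %2u: %u of %u vertices span two cache lines\n",
               stride, straddles, TOTAL);
    }
    return 0;
}
```

At a 28-byte stride, six of every eight vertices cross a boundary (the pattern repeats every 224 bytes); at 32 bytes, alignment is perfect and every vertex is a single fetch.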

