d3d performance, vertex buffers

Started by
3 comments, last by Namethatnobodyelsetook 19 years, 6 months ago
I realize this is probably a popular topic, but I was wondering if anyone had any particular favorite links they could post where I could read more about this. Basically, from what poking around I have done on these forums, it seems like sending triangles to the card in batches is a really good idea. However, the engine I am developing is for a 2d/3d shooter; a lot of textured quads mixed with models. My question is, what's the best way to be rendering all this stuff? Should I make a giant vertex buffer, lock it once at the beginning of a frame, transform all the quads in software, copy the resulting verts into the vb, and then unlock and render? That doesn't seem that optimal to me; what's the point of having a graphics card that can transform verts fast if the cpu does it? Basically, I guess I'm confused about the whole process of batching - how can you draw a bunch of different models or quads or whatever using different worldview matrices if they are all hiding in the same buffer? Isn't the whole point of batching to render in one big drawprimitive call? Anyway, thanks in advance for your time. I realize this is probably a well covered topic - it's just hard to separate the good from the bad when you get so many hits.
There is a huge overhead to doing a Draw call. If you're drawing fewer than 100 vertices in a call, the CPU will take more time to set up the graphics card than it would have taken to just transform the data in software.

CPU cost of GPU Draw = 100
CPU cost of CPU Transform = 1.

Now let's say we're drawing 50 quads.

Using GPU to transform requires 50 draw calls.
Cost = 100 * 50 = 5000

Using CPU to transform requires 200 CPU transforms (4 * 50) and 1 draw call.
Cost = 100 + 200 * 1 = 300.

These costs aren't actual specific numbers, but illustrate the point.
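
The same arithmetic can be written as a tiny sketch. The constants are the made-up units from above, not measurements, and the function names are only for illustration:

```cpp
// Illustrative cost model only. The two constants are the made-up units
// from the post, not measured values.
constexpr int kDrawCallCost = 100;       // CPU cost of issuing one draw call
constexpr int kVertexTransformCost = 1;  // CPU cost of transforming one vertex

// GPU transform without batching: one draw call per quad.
constexpr int costGpuTransform(int quadCount) {
    return kDrawCallCost * quadCount;
}

// CPU transform: 4 vertices per quad transformed in software,
// then a single draw call for everything.
constexpr int costCpuTransform(int quadCount) {
    return kDrawCallCost + quadCount * 4 * kVertexTransformCost;
}
```

With 50 quads this gives 5000 for the per-quad draw calls and 300 for the software-transformed batch, matching the numbers above.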


Batching:
When doing CPU transform, the data can all be written at once into a dynamic VB and rendered from that.
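
As a rough sketch of the CPU side, something like the following fills one array with every quad's transformed corners. The `Quad` and `Vertex` layouts and the function name are assumptions for illustration; in D3D9 the resulting array would typically be copied into a dynamic vertex buffer between Lock and Unlock and drawn with one call.

```cpp
#include <cmath>
#include <vector>

// Hypothetical vertex layout: already-transformed 2D position plus texture coords.
struct Vertex { float x, y, u, v; };

// Hypothetical sprite description: centre, half extents and rotation in radians.
struct Quad { float cx, cy, halfW, halfH, angle; };

// Appends the four transformed corners of every quad to `out`.
// Corner order is (-1,-1), (1,-1), (-1,1), (1,1), which suits the
// 0,1,2 / 2,1,3 index pattern for a quad.
void transformQuads(const std::vector<Quad>& quads, std::vector<Vertex>& out) {
    static const float corner[4][2] = {
        {-1.0f, -1.0f}, {1.0f, -1.0f}, {-1.0f, 1.0f}, {1.0f, 1.0f}
    };
    out.reserve(out.size() + quads.size() * 4);
    for (const Quad& q : quads) {
        const float c = std::cos(q.angle);
        const float s = std::sin(q.angle);
        for (int i = 0; i < 4; ++i) {
            // Scale the unit corner, rotate it, then translate to the centre.
            const float lx = corner[i][0] * q.halfW;
            const float ly = corner[i][1] * q.halfH;
            out.push_back({ q.cx + lx * c - ly * s,
                            q.cy + lx * s + ly * c,
                            (corner[i][0] + 1.0f) * 0.5f,
                            (corner[i][1] + 1.0f) * 0.5f });
        }
    }
}
```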

When doing GPU transform, batching is often drawing static pre-determined world geometry. This is pre-processed to know which items in the same area of the world are using the same textures and states, and to put them in the VB as a single item. Another form of batching is to draw n copies of the same mesh at different locations. This is useful for rocks, trees, bullets and other common but small objects. This is done by putting n copies of the mesh into a VB, each with a unique ID, and using that ID to select a transform matrix (just like bones). The latest technique, available on the GeForce 6800, is similar, except that it uses extra data streams rather than shader constants to store the various transforms, and only requires one copy of the mesh in the VB.

edit: Changed CPU transform cost * 4 to account for being a quad, not a single vertex... It's still a huge performance gain to use CPU.

[Edited by - Namethatnobodyelsetook on September 28, 2004 9:22:06 PM]
Quote:Basically, I guess I'm confused about the whole process of batching - how can you draw a bunch of different models or quads or whatever using different worldview matrices if they are all hiding in the same buffer? Isn't the whole point of batching to render in one big drawprimitive call?


You can use index buffers and DrawIndexedPrimitive. Tutorial
______________________________
Perry Butler aka iosys
Quote:Original post by iosys

You can use index buffers and DrawIndexedPrimitive. Tutorial


Ah, so I would make a bunch of different draw calls, but using the same vertex and index buffers? I thought the point was to minimize draw calls, but apparently the point is to minimize switching index and vertex buffers?

-Eli
I'm not sure what iosys was thinking when posting that, so I'll have to guess.

When drawing a quad you can use a tristrip or a trilist. If you don't use indices, a tristrip needs 4 vertices and a trilist needs 6. If you want to draw multiple quads, they aren't necessarily connected, so you need to use a trilist, which will require 6 vertices per quad.

If, however, you use indices, you only need 4 vertices and 6 indices (i.e. 0, 1, 2, and 2, 1, 3) to make a quad. If you're transforming in software, you've just reduced your workload by 33% (4 vertices per quad instead of 6). So yeah, indices are good, use them, but it still doesn't cover the batching issue at all.
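
A minimal sketch of generating that index pattern for many unconnected quads in one trilist. The function name is made up, and 16-bit indices are an assumption that limits this to 16384 quads per buffer:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds trilist indices for `quadCount` unconnected quads, 4 vertices each.
// Every quad reuses the 0,1,2 / 2,1,3 pattern, offset by its first vertex.
// Assumes quadCount * 4 fits in 16 bits (at most 16384 quads).
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount) {
    static const std::uint16_t pattern[6] = {0, 1, 2, 2, 1, 3};
    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const std::size_t base = q * 4;  // index of this quad's first vertex
        for (std::uint16_t p : pattern) {
            indices.push_back(static_cast<std::uint16_t>(base + p));
        }
    }
    return indices;
}
```

Since the pattern never changes, an index buffer like this can be filled once at startup and reused every frame; only the vertices need rewriting.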

You DO want to limit your draw calls... and your material/effect/texture changes. Each draw call is slow, and each change of textures is slow. Typically you sort what you have to draw by texture (and other state changes), and try to batch as much of that together into as few draw calls as you can.

e.g.:

SetTexture1
DrawMesh1
SetTexture2
DrawMesh2
SetTexture1
DrawMesh1
SetTexture3
DrawMesh3
SetTexture2
DrawMesh2

is bad, while

SetTexture1
DrawMesh1
DrawMesh1
SetTexture2
DrawMesh2
DrawMesh2
SetTexture3
DrawMesh3

is better, and

SetTexture1
DrawMesh1 twice via some batching technique
SetTexture2
DrawMesh2 twice via some batching technique
SetTexture3
DrawMesh3

is ideal.
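
The sorting step can be sketched like this. `DrawItem` and both function names are hypothetical, and a real renderer would sort on a key covering all expensive state, not just the texture:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical draw-list entry: which texture and which mesh to draw.
struct DrawItem { int textureId; int meshId; };

// Counts how many SetTexture calls a draw list would need when walked in order.
// Assumes real texture ids are non-negative, so -1 can mean "nothing bound yet".
int countTextureChanges(const std::vector<DrawItem>& items) {
    int changes = 0;
    int current = -1;
    for (const DrawItem& item : items) {
        if (item.textureId != current) {
            ++changes;
            current = item.textureId;
        }
    }
    return changes;
}

// Groups items that share a texture. stable_sort keeps the original
// relative order of items within each texture group.
void sortByTexture(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.textureId < b.textureId;
                     });
}
```

For the "bad" list above (textures 1, 2, 1, 3, 2) this counts 5 texture changes before sorting and 3 after.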

If you don't need texture wrapping and can put all 3 textures into one larger texture (e.g. four 256x256 textures fit in a single 512x512 texture, as long as none of them needs to wrap), then you have a chance to go even further and do

SetMegaTexture1
DrawMeshes

though getting to this level of batching is pretty much reserved for pre-processed static world geometry. In a real game you're not going to know the exact number of copies of each mesh to put into the VB so that you can batch 3 types of meshes in a single draw call. The odds of getting the right texture combinations and mesh count combinations for dynamic world content pretty much rule this out. Of course, if you're doing software transform, the odds are better, since the data doesn't have to already be sitting in the VB.
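
For the combined-texture idea, each mesh's texture coordinates have to be squeezed into its own cell of the big texture. A minimal sketch, assuming a square grid of equally sized cells and coordinates that stay within [0,1] (the names are made up):

```cpp
// Texture coordinate pair.
struct UV { float u, v; };

// Maps a [0,1] texture coordinate into one cell of a square grid atlas.
// `cellIndex` counts cells row by row; `cellsPerRow` is 2 for four
// 256x256 textures packed into one 512x512 texture.
// Only valid when the original coordinates never leave [0,1] (no wrapping).
UV remapToAtlas(UV uv, int cellIndex, int cellsPerRow) {
    const float scale = 1.0f / static_cast<float>(cellsPerRow);
    const int col = cellIndex % cellsPerRow;
    const int row = cellIndex / cellsPerRow;
    return { (static_cast<float>(col) + uv.u) * scale,
             (static_cast<float>(row) + uv.v) * scale };
}
```

This is also why wrapping rules the trick out: a coordinate outside [0,1] would land in a neighbouring cell instead of repeating the texture.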


The technique I use is this (requires using a shader):
nMaxBatch = 1000 / vertcount // arbitrarily chosen number
nMaxBatch = max(nMaxBatch, 1)
nMaxBatch = min(nMaxBatch, 15)
Copy data to VB nMaxBatch times, each copy with a unique number in a float1. I use this to index into an array of transforms in my shader.

Copy data to IB nMaxBatch times; in each copy the indices are offset to point to the nth copy of the vertices.

When drawing I just need to use nPrimCount * copies to render, and nMaxVertex += nIncToMaxVertex * copies. I can't remember offhand why the max vertex count isn't a straight multiply, but I do remember that I did that.
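
Here is a CPU-side sketch of building those duplicated buffers. The struct layouts and names are assumptions, it only covers filling the arrays (not the shader or the D3D calls), and it assumes a non-empty mesh:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical source vertex: position only.
struct Position { float x, y, z; };

// Hypothetical batched vertex: position plus the copy number the shader
// would use to pick a transform from its constant array.
struct BatchVertex { float x, y, z; float instance; };

struct BatchedMesh {
    std::vector<BatchVertex> vertices;
    std::vector<std::uint16_t> indices;
    int maxBatch = 0;
};

// Same clamping as the pseudocode: about 1000 vertices per batch,
// at least 1 copy, at most 15. Assumes vertCount > 0.
int computeMaxBatch(int vertCount) {
    int n = 1000 / vertCount;
    n = std::max(n, 1);
    return std::min(n, 15);
}

// Duplicates the mesh maxBatch times. Copy k gets instance number k, and its
// indices are shifted by k * vertexCount so they point at the kth vertex copy.
BatchedMesh buildBatch(const std::vector<Position>& verts,
                       const std::vector<std::uint16_t>& indices) {
    BatchedMesh out;
    out.maxBatch = computeMaxBatch(static_cast<int>(verts.size()));
    for (int copy = 0; copy < out.maxBatch; ++copy) {
        for (const Position& p : verts) {
            out.vertices.push_back({p.x, p.y, p.z, static_cast<float>(copy)});
        }
        const std::size_t offset = verts.size() * static_cast<std::size_t>(copy);
        for (std::uint16_t i : indices) {
            out.indices.push_back(static_cast<std::uint16_t>(i + offset));
        }
    }
    return out;
}
```

To draw k copies from buffers built this way, the primitive count passed to the draw call would be the per-mesh triangle count times k, with the first k transforms loaded into the shader's array.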


For quads, and possibly for other small objects (10-20 poly rocks for example) you can probably get better performance using entirely software transform, writing the output into a dynamic vb and rendering that. It'll work on all hardware, new and old, no shaders required, and will give the best performance.

Because of constant limits in vs.1.1, I max out at 15 copies of an object in one draw. For a quad, that's 30 polys. Until you hit 300 polys you're probably limited by the CPU overhead of calling DrawIndexedPrimitive(). So, even using hardware batching, it's not good for quads. Knowing the quads aren't going to need texture transforms, or lighting, etc. you could pack position x,y, scale x,y into a single constant, allowing 96 quads per batch... or 192 polys... almost enough to be useful.
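
The packed-constant idea for quads might look like this on the CPU. It is only a sketch of the math a vertex shader would do; the struct names and the function are made up, and the 96 is simply the figure quoted above:

```cpp
// Four-component value, standing in for one shader constant register.
struct Float4 { float x, y, z, w; };
struct Float2 { float x, y; };

// Figures taken from the paragraph above: 96 constants, one per quad.
constexpr int kConstantsAvailable = 96;
constexpr int kTrianglesPerQuad = 2;

constexpr int maxTrianglesPerPackedBatch() {
    return kConstantsAvailable * kTrianglesPerQuad;
}

// `packed` holds (position x, position y, scale x, scale y) for one quad.
// `corner` is the untransformed corner, e.g. (-1,-1) or (1,1).
// Returns the corner scaled and moved into place, with no rotation.
Float2 expandCorner(Float4 packed, Float2 corner) {
    return { packed.x + corner.x * packed.z,
             packed.y + corner.y * packed.w };
}
```

One register per quad instead of a full matrix is what raises the batch size from 15 quads to 96.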

What you end up doing really depends on your target hardware and personal preference. There isn't just one technique that works for everything. Hopefully you've got a better idea of what to do, and what's possible, anyway.

