DrawPrimitive and batching

Started by
9 comments, last by LeGreg 17 years, 11 months ago
I am a little bit confused as to why calling DrawPrimitive() many times is bad for performance if you keep the same vertex buffer, material, and texture the whole time. It seems to me that there would be very little overhead in DrawPrimitive(), because you only need to send 12 bytes to the video card. Since the video card has to do all the work on each vertex separately anyway, not knowing what is needed from the full vertex buffer up front wouldn't be that big of a deal. Unless, of course, there is something about batching I don't really understand. On a related note, if you're not supposed to call DrawPrimitive() a lot, how do you draw lots of objects that require the same vertex buffer but are rotated and translated by different amounts? Is there a way of getting around using SetTransform() and DrawPrimitive() for each object?
The overhead isn't the 12 bytes passed to D3D, it's the fact that D3D has to get the calls through to the driver, and the driver has to submit them to the card. There's quite a big overhead in that.

As for transforming lots of objects, there's instancing (which isn't really that great), or vertex shaders. There are often better ways around it, though, depending on exactly what you're trying to do.
The big overhead is the kernel-mode context switch that a Draw**() call suffers.

So, basically you want to send as much data as possible for as few Draw**() calls. "Batching" and "instancing" are the two main ways.

Both tend to require a bit of clever "engineering" as they might not be the most immediately intuitive ways of rendering in your application.
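To make the batching idea concrete, here is a minimal CPU-side sketch. The names (`DrawPrimitiveStub`, `g_submissions`) are invented for illustration and are not D3D API; the stub just counts how many times you'd cross into the driver. The point is that N per-object calls cost N transitions, while one merged buffer costs one:

```cpp
#include <cassert>
#include <vector>

// Toy vertex type standing in for whatever your FVF/declaration describes.
struct Vertex { float x, y, z; };

// Counter standing in for the per-call kernel-transition cost.
static int g_submissions = 0;

// Stand-in for a DrawPrimitive() call: one call = one driver submission.
void DrawPrimitiveStub(const std::vector<Vertex>& vb) {
    (void)vb;
    ++g_submissions;
}

// Naive: one draw call per object -> N transitions.
void DrawNaive(const std::vector<std::vector<Vertex>>& objects) {
    for (const auto& obj : objects)
        DrawPrimitiveStub(obj);
}

// Batched: merge all objects into one buffer first -> 1 transition.
void DrawBatched(const std::vector<std::vector<Vertex>>& objects) {
    std::vector<Vertex> batch;
    for (const auto& obj : objects)
        batch.insert(batch.end(), obj.begin(), obj.end());
    DrawPrimitiveStub(batch);
}
```

With 100 objects the naive path pays the transition cost 100 times and the batched path once; the GPU-side vertex work is identical in both cases.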

The DirectX SDK has a sample about instancing - showing off various strategies/options.

<Shameless-Plug>
I recently wrote some sample code (see here and here) that demonstrates batching as an optimization.

It's not the greatest example around:

Notice how the smaller batch-sizes are identical in performance - the primary reason being that the context-switching overhead takes up most of the time, thus hiding much of the advantage of the GPU. As you start submitting bigger and bigger batches you allow the GPU to stretch its legs and the per-call overhead becomes less significant.
</Shameless-Plug>

hth
Jack

<hr align="left" width="25%" />
Jack Hoxley <small>[</small><small> Forum FAQ | Revised FAQ | MVP Profile | Developer Journal ]</small>

Quote:Original post by jollyjeffers
The big overhead is the kernel-mode context switch that a Draw**() call suffers.


This isn't the only reason. The kernel-mode context switch is expensive, but the runtime is capable of batching the batches (several draw calls in one transition; don't ask me what the heuristic is).

You should check several things in your code: see whether changing one state, or updating one new constant (for both vertex and pixel shaders), affects this performance, and by how much. (Maybe that's what you already do, but that's not clear from your graphs.)

Quote:is there a way of getting around using SetTransform() and DrawPrimitive() for each object?


Yes: hardware instancing. Be careful not to use it for big draw calls (there is a small performance overhead); use it only if you have a lot of small draw calls:

SM3.0 best practices.

Of course, if you have to change textures or other states between draw calls, this will make the use of instancing impossible (at least until D3D10 arrives).
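For intuition about what instancing buys you: you bind one shared mesh plus a second stream of per-instance data, and the hardware replays the mesh once per instance, combining the two in the vertex shader (in D3D9 this is set up with IDirect3DDevice9::SetStreamSourceFreq and the D3DSTREAMSOURCE_INSTANCEDATA / D3DSTREAMSOURCE_INDEXEDDATA flags). Below is a plain C++ sketch of the expansion the GPU performs conceptually; the types and the translation-only instance data are simplifications for illustration, not the actual API:

```cpp
#include <cassert>
#include <vector>

struct Vertex   { float x, y, z; };   // shared mesh, stream 0
struct Instance { float tx, ty, tz; } ; // per-instance translation, stream 1

// What instancing does conceptually: replay the mesh once per instance,
// applying that instance's data to every vertex - in one draw call.
std::vector<Vertex> ExpandInstances(const std::vector<Vertex>& mesh,
                                    const std::vector<Instance>& instances) {
    std::vector<Vertex> out;
    out.reserve(mesh.size() * instances.size());
    for (const Instance& inst : instances)
        for (const Vertex& v : mesh)
            out.push_back({v.x + inst.tx, v.y + inst.ty, v.z + inst.tz});
    return out;
}
```

The win is that however many instances you have, the CPU submits one call; the per-instance combine happens on the GPU, not in this loop.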

LeGreg
Quote:Original post by StormArmageddon
how do you draw lots of objects that require the same vertex buffer, but that are rotated and translated by different amounts, is there a way of getting around using SetTransform() and DrawPrimitive() for each object?

This may be obvious, but you always have the option of creating a larger vertex buffer (they aren't usually big memory hogs) and then transforming the vertices as a preprocess. The choice of static or dynamic vertex buffer depends on exactly what you're doing, but if your geometry is more or less set in place, just use a static vertex buffer and make the direct tradeoff of memory for rendering speed (since you can now transform and render the entire GROUP in one DrawPrimitive() call). Why do the same calculations over and over every frame if you have some memory to spare?
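A minimal sketch of that preprocess, assuming a simple rotation-about-Z plus translation per object (the `Transform` type and helper names are made up for illustration): bake every object's transformed vertices into one big array once, then upload that to a static vertex buffer and draw the whole group in a single call.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vertex    { float x, y, z; };
struct Transform { float angleZ; float tx, ty, tz; };  // Z rotation + translation

// Apply one object's transform to one vertex on the CPU.
Vertex Apply(const Transform& t, const Vertex& v) {
    const float c = std::cos(t.angleZ), s = std::sin(t.angleZ);
    return { c * v.x - s * v.y + t.tx,
             s * v.x + c * v.y + t.ty,
             v.z + t.tz };
}

// Bake all objects into one buffer; do this once, not per frame.
std::vector<Vertex> BuildStaticBuffer(const std::vector<Vertex>& mesh,
                                      const std::vector<Transform>& objects) {
    std::vector<Vertex> buffer;
    buffer.reserve(mesh.size() * objects.size());
    for (const Transform& t : objects)
        for (const Vertex& v : mesh)
            buffer.push_back(Apply(t, v));
    return buffer;
}
```

The resulting array maps straight onto a static vertex buffer: memory scales with total vertex count, but the per-frame cost drops to one draw call with no SetTransform() between objects.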

Thanks a bunch, didn't realize the drivers/kernel were the bottle-neck, but I guess that makes sense.

Quote:This may be obvious, but you always have the option of creating a larger vertex buffer (they aren't usually big memory hogs) and then transforming the vertices as a preprocess. The choice of static or dynamic vertex buffer depends on exactly what you're doing, but if your geometry is more or less set in place, just use a static vertex buffer and make the direct tradeoff of memory for rendering speed (since you can now transform and render the entire GROUP in one DrawPrimitive() call). Why do the same calculations over and over every frame if you have some memory to spare?

I was actually talking about objects that are moving dynamically, such as a physics simulation with lots of objects. I thought about just using dynamic vertex buffers but I couldn't see modifying and sending the entire vertex buffer every frame as being an efficient option.
Actually, to me it still doesn't make sense. This may be due to my unfamiliarity with how Windows works at the low level, but what extra work is being done in the context switch? It seems odd that Windows can't just accept the call and pass it off to the driver really easily. Are there a lot of checks or something that Windows has to do before the move to kernel mode is made?
Quote:Original post by Cypher19
Actually, to me it still doesn't make sense. This may be due to my unfamiliarity with how Windows works at the low level, but what extra work is being done in the context switch? It seems odd that Windows can't just accept the call and pass it off to the driver really easily. Are there a lot of checks or something that Windows has to do before the move to kernel mode is made?


GPUs are also very fast, so the CPU-side work needed to draw one triangle has to be very low to keep them fed.
I'm not too familiar with the inner secrets of the Windows kernel/OS, but you might be able to get more information from the DDK's if you're interested.

Quote:Original post by Cypher19
Are there a lot of checks or something that Windows has to do before the move to kernel mode is made?
Yup, you can't easily jump between the rings - various checks and moves and so on are made.

As was suggested, I don't think the individual context/kernel switch is slow so much as the architecture of D3D9 generates lots of them. Even lots of slightly slow calls end up building into a noticeable delay...

hth
Jack


A context switch costs about 1,000 cycles in each direction, probably a little more, so call it 2,000 cycles round trip. If the actual call does nothing, then on a 2 GHz machine running at 100 FPS you have 2×10^9 / 100 = 20 million cycles per frame, which gives a maximum of about 10K calls, assuming that nothing else is happening.
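That back-of-the-envelope budget is just three divisions; a tiny sketch (the numbers are the estimates from this thread, not measured values):

```cpp
#include <cassert>

// Upper bound on draw calls per frame if each call only pays the
// kernel round-trip cost: cycles-per-frame / cycles-per-call.
long long DrawCallBudget(long long cyclesPerSecond, int framesPerSecond,
                         int roundTripCyclesPerCall) {
    return cyclesPerSecond / framesPerSecond / roundTripCyclesPerCall;
}
```

Plugging in 2 GHz, 100 FPS, and 2,000 cycles per call gives the 10K figure above, and that's before the driver, runtime, and your own code use any of those cycles.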

Add in everything that is happening at the various layers of the system and you run out of room to make draw calls really quickly.
