DrawPrimitive and batching

StormArmageddon

I am a little bit confused as to why calling DrawPrimitive() many times is bad for performance if you keep the same vertex buffer, material, and texture the whole time. It seems to me that there would be very little overhead in DrawPrimitive(), because you only need to send 12 bytes to the video card. Since the video card has to do all the work on each vertex separately anyway, not knowing upfront what is needed from the full vertex buffer wouldn't be that big of a deal. Unless, of course, there is something about batching I don't really understand. On a related note, if you're not supposed to call DrawPrimitive() a lot, how do you draw lots of objects that use the same vertex buffer but are rotated and translated by different amounts? Is there a way of getting around using SetTransform() and DrawPrimitive() for each object?
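
To be concrete, the pattern I mean is roughly this (a minimal sketch; names like g_device, g_objects, and TRIS_PER_OBJECT are just placeholders):

    // Shared state, set once for the whole group.
    g_device->SetStreamSource(0, g_vertexBuffer, 0, sizeof(Vertex));
    g_device->SetTexture(0, g_texture);

    for (size_t i = 0; i < g_objects.size(); ++i)
    {
        // One SetTransform() + one DrawPrimitive() per object, so the
        // draw-call count grows linearly with the number of objects.
        g_device->SetTransform(D3DTS_WORLD, &g_objects[i].world);
        g_device->DrawPrimitive(D3DPT_TRIANGLELIST,
                                g_objects[i].startVertex,
                                TRIS_PER_OBJECT);
    }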

Evil Steve
The overhead isn't the 12 bytes passed to D3D; it's the fact that D3D has to get the calls through to the driver, and the driver has to submit them to the card. There's quite a big overhead in that.

As for transforming lots of objects, there's instancing (which isn't really that great), or vertex shaders. There are often better ways around it though, depending on exactly what you're trying to do.

jollyjeffers
The big overhead is the kernel-mode context switch that a Draw**() call suffers.

So, basically, you want to send as much data as possible in as few Draw**() calls as possible. "Batching" and "instancing" are the two main ways.

Both tend to require a bit of clever "engineering" as they might not be the most immediately intuitive ways of rendering in your application.

The DirectX SDK has a sample about instancing - showing off various strategies/options.

<Shameless-Plug>
I recently wrote some sample code (see here and here) that demonstrates batching as an optimization.

It's not the greatest example around:

Notice how the smaller batch-sizes are identical in performance - the primary reason being that the context-switching overhead takes up most of the time, thus hiding much of the advantage of the GPU. As you start submitting bigger and bigger batches you allow the GPU to stretch its legs and the per-call overhead becomes less significant.
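
The shape of the experiment is roughly this (a rough sketch, not the actual sample code; g_device, TOTAL_TRIS, and trisPerCall are placeholder names, and TOTAL_TRIS is assumed to be a multiple of trisPerCall):

    // Draw the same total number of triangles, varying how many go into
    // each DrawPrimitive() call. With a small trisPerCall the per-call
    // (user/kernel transition) cost dominates; with a large trisPerCall
    // the GPU's own vertex/pixel work dominates instead.
    for (UINT drawn = 0; drawn < TOTAL_TRIS; drawn += trisPerCall)
    {
        g_device->DrawPrimitive(D3DPT_TRIANGLELIST, drawn * 3, trisPerCall);
    }

Time each batch size over many frames and compare.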
</Shameless-Plug>

hth
Jack

LeGreg
Quote:
Original post by jollyjeffers
The big overhead is the kernel-mode context switch that a Draw**() call suffers.


This isn't the only reason. The kernel-mode context switch is expensive, but the runtime is capable of batching the batches (several draw calls in one transition; don't ask me what the heuristic is).

You should check several things in your code: see whether changing one state, or updating one new constant (for both vertex and pixel shaders), affects this performance, and by how much. (Maybe that's what you already do, but that's not clear from your graphs.)

Quote:
Is there a way of getting around using SetTransform() and DrawPrimitive() for each object?


Yes: hardware instancing. Be careful not to use it for big draw calls (there is a small performance overhead); use it only if you have a lot of small draw calls:

SM3.0 best practices.

Of course, if you have to change textures or other states between draw calls, this will make the use of instancing impossible (at least until D3D10 arrives).
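
A rough sketch of what hardware instancing looks like in D3D9 (placeholder names throughout; stream 0 holds the shared mesh, stream 1 holds the per-instance data the vertex shader uses to build each world matrix):

    // Stream 0: shared geometry, repeated for each instance.
    g_device->SetStreamSource(0, g_meshVB, 0, sizeof(MeshVertex));
    g_device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | numInstances);

    // Stream 1: per-instance data, advanced once per instance.
    g_device->SetStreamSource(1, g_instanceVB, 0, sizeof(InstanceData));
    g_device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);

    g_device->SetVertexDeclaration(g_declWithInstanceElements);
    g_device->SetIndices(g_meshIB);

    // One call draws every instance; the vertex shader applies the
    // per-instance transform fetched from stream 1.
    g_device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                   numMeshVerts, 0, numMeshTris);

    // Reset the stream frequencies so later draws behave normally.
    g_device->SetStreamSourceFreq(0, 1);
    g_device->SetStreamSourceFreq(1, 1);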

LeGreg

MasterWorks
Quote:
Original post by StormArmageddon
how do you draw lots of objects that require the same vertex buffer, but that are rotated and translated by different amounts, is there a way of getting around using SetTransform() and DrawPrimitive() for each object?

This may be obvious, but you always have the option of creating a larger vertex buffer (they aren't usually big memory hogs) and then transforming the vertices as a preprocess. The choice of static or dynamic vertex buffer depends on exactly what you're doing, but if your geometry is more or less set in place, just use a static vertex buffer and make the direct tradeoff of memory for rendering speed (since you can now transform and render the entire GROUP in one DrawPrimitive() call). Why do the same calculations over and over every frame if you have some memory to spare?
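
A minimal sketch of the idea (placeholder names; D3DXVec3TransformCoord is the D3DX helper that transforms a point by a matrix):

    #include <vector>
    #include <d3dx9.h>

    // Preprocess: bake each object's world transform into one big buffer.
    std::vector<Vertex> combined;
    combined.reserve(totalVertexCount);

    for (size_t i = 0; i < objects.size(); ++i)
    {
        for (size_t v = 0; v < objects[i].verts.size(); ++v)
        {
            Vertex out = objects[i].verts[v];
            D3DXVec3TransformCoord(&out.pos, &objects[i].verts[v].pos,
                                   &objects[i].world);
            combined.push_back(out);
        }
    }

    // Copy 'combined' into a static vertex buffer (D3DUSAGE_WRITEONLY),
    // then at render time set the world transform once (identity) and
    // draw the whole group with a single DrawPrimitive() call.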

StormArmageddon

Thanks a bunch. I didn't realize the drivers/kernel were the bottleneck, but I guess that makes sense.

Quote:
Original post by MasterWorks
This may be obvious, but you always have the option of creating a larger vertex buffer (they aren't usually big memory hogs) and then transforming the vertices as a preprocess. The choice of static or dynamic vertex buffer depends on exactly what you're doing, but if your geometry is more or less set in place, just use a static vertex buffer and make the direct tradeoff of memory for rendering speed (since you can now transform and render the entire GROUP in one DrawPrimitive() call). Why do the same calculations over and over every frame if you have some memory to spare?

I was actually talking about objects that are moving dynamically, such as in a physics simulation with lots of objects. I thought about just using dynamic vertex buffers, but I couldn't see modifying and re-sending the entire vertex buffer every frame as being an efficient option.
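
What I had in mind for the dynamic case was something like this (TransformObjectVertices is just a placeholder for the CPU transform step):

    // Re-fill a dynamic vertex buffer (created with D3DUSAGE_DYNAMIC)
    // every frame with CPU-transformed vertices.
    void* data = NULL;
    if (SUCCEEDED(g_dynamicVB->Lock(0, 0, &data, D3DLOCK_DISCARD)))
    {
        Vertex* dst = static_cast<Vertex*>(data);
        for (size_t i = 0; i < objects.size(); ++i)
            dst = TransformObjectVertices(objects[i], dst);
        g_dynamicVB->Unlock();
    }
    g_device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, totalTris);

...which is why I doubted it would be efficient for a large number of vertices.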

Cypher19
Actually, to me it still doesn't make sense. This may be due to my unfamiliarity with how Windows works at a low level, but what extra work is being done in the context switch? It seems odd that Windows can't just accept the call and pass it off to the driver easily. Are there a lot of checks or something that Windows has to do before the move to kernel mode is made?

LeGreg
Quote:
Original post by Cypher19
Actually, to me it still doesn't make sense. This may be due to my unfamiliarity with how Windows works at a low level, but what extra work is being done in the context switch? It seems odd that Windows can't just accept the call and pass it off to the driver easily. Are there a lot of checks or something that Windows has to do before the move to kernel mode is made?


The GPUs are also very fast, so the CPU-side work you do to draw one triangle has to be very low, or it becomes the bottleneck.

jollyjeffers
I'm not too familiar with the inner secrets of the Windows kernel/OS, but you might be able to get more information from the DDKs if you're interested.

Quote:
Original post by Cypher19
Are there a lot of checks or something that Windows has to do before the move to kernel mode is made?
Yup, you can't easily jump between the rings; various checks and state switches and so on have to be made.

As was suggested, I don't think the individual context/kernel switch is slow so much as that the architecture of D3D9 generates lots of them. Even lots of slightly slow calls end up building into a noticeable delay...

hth
Jack

Promit
A context switch costs about 1000 cycles in each direction, probably a little more. If the actual call does nothing, then on a 2 GHz machine running at 100 FPS you have 20 million cycles per frame; at roughly 2,000 cycles per round trip, that gives us a maximum of about 10K calls, assuming that nothing else is happening.

Add in everything else that happens at the various layers of the system and you run out of room to make draw calls really quickly.

LeGreg
Note: I already detailed some of the work that's necessary in an old post there.

It's a bit more work than sending 12 bytes to the video card. Of course, most of this work doesn't exist on consoles (like the Xbox) with a similar D3D interface, because the API can be made much closer to the hardware. They even have a super-optimized mode where they can save a whole pushbuffer (maybe not as powerful as a display list, but much more lightweight and efficient) and reuse it for several draw calls. Similarly, switching textures is much more efficient on consoles.

D3D9 on XP can't be made much more efficient. A driver usually has to support several hardware architectures with different requirements and caps, the hardware is not necessarily a good match for the API, and the API can cover several generations of D3D as well (the DDI has not been cleaned up; it will be in Vista). The driver doesn't live in the same address space as the application, it has to accept that somebody may have corrupted its draw calls, the hardware doesn't fail gracefully on all those errors, etc.
D3D9 on Vista may help, but performance will suffer on other fronts.

D3D10 on Vista should be much better, but as the hardware becomes faster and faster, if you plan to make small batches, the runtime/driver work per batch will have to be ridiculously low, and some other bottleneck may appear (like the ability of the chip's front end to process commands, bubbles in the pipeline, or even the available bandwidth to pass the pushbuffer around).

Catching up with GPU speed is an open problem that Microsoft and the GPU makers are constantly trying to solve.

LeGreg
