StormArmageddon

DrawPrimitive and batching


I am a little bit confused as to why calling DrawPrimitive() many times is bad for performance if you keep the same vertex buffer, material, and texture the whole time. It seems to me that there would be very little overhead in DrawPrimitive(), because you only need to send 12 bytes to the video card. Since the video card has to do all the work on each vertex separately anyway, not knowing what is needed from the full vertex buffer upfront shouldn't be that big of a deal. Unless, of course, there is something about batching I don't really understand. On a related note, if you're not supposed to call DrawPrimitive() a lot, how do you draw lots of objects that use the same vertex buffer but are rotated and translated by different amounts? Is there a way to avoid calling SetTransform() and DrawPrimitive() for each object?
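To be concrete, the loop I'm talking about looks something like this (the device pointer, Vertex type, and object fields are just illustrative):

    // One SetTransform() + DrawPrimitive() pair per object: the vertex buffer,
    // material, and texture never change, but each object still costs a draw call.
    g_pDevice->SetStreamSource(0, g_pSharedVB, 0, sizeof(Vertex));
    for (DWORD i = 0; i < numObjects; ++i)
    {
        g_pDevice->SetTransform(D3DTS_WORLD, &objects[i].world); // per-object transform
        g_pDevice->DrawPrimitive(D3DPT_TRIANGLELIST,
                                 objects[i].firstVertex,         // offset into shared VB
                                 objects[i].triangleCount);
    }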

The overhead isn't the 12 bytes passed to D3D, it's the fact that D3D has to get the calls through to the driver, and the driver has to submit them to the card. There's quite a big overhead in that.

As for transforming lots of objects, there's instancing (which isn't really that great), or vertex shaders. There are often better ways around it, though, depending on exactly what you're trying to do.
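To give a flavour of the vertex shader route (sometimes called "shader instancing"): you replicate the mesh a few times in one vertex buffer, tag each copy's vertices with an index, and upload a palette of world matrices as shader constants, so a single draw call covers several objects. A rough sketch only; the register number, names, and matrix-majorness details are all assumptions:

    // C++ side: upload numObjects world matrices starting at constant register c4
    // (each 4x4 matrix occupies four float4 registers).
    g_pDevice->SetVertexShaderConstantF(4, (const float*)worldMatrices, 4 * numObjects);
    g_pDevice->DrawPrimitive(D3DPT_TRIANGLELIST, 0, trianglesPerObject * numObjects);

    // HLSL side (sketch): each vertex carries the index of its object's matrix.
    //   float4x4 g_World[MAX_OBJECTS] : register(c4);
    //   output.pos = mul(input.pos, g_World[input.objectIndex]);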

The big overhead is the kernel-mode context switch that a Draw**() call suffers.

So, basically you want to send as much data as possible with as few Draw**() calls as possible. "Batching" and "instancing" are the two main ways.

Both tend to require a bit of clever "engineering" as they might not be the most immediately intuitive ways of rendering in your application.

The DirectX SDK has a sample about instancing - showing off various strategies/options.

<Shameless-Plug>
I recently wrote some sample code (see here and here) that demonstrates batching as an optimization.

It's not the greatest example around:

Notice how the smaller batch-sizes are identical in performance - the primary reason being that the context-switching overhead takes up most of the time, thus hiding much of the advantage of the GPU. As you start submitting bigger and bigger batches you allow the GPU to stretch its legs and the per-call overhead becomes less significant.
</Shameless-Plug>

hth
Jack

Quote:
Original post by jollyjeffers
The big overhead is the kernel-mode context switch that a Draw**() call suffers.


This isn't the only reason. The kernel-mode context switch is expensive, but the runtime is capable of batching the batches (several draw calls in one transition; don't ask me what the heuristic is).

You should check several things in your code: see whether changing one state or updating one new constant (for both vertex and pixel shaders) affects this performance, and by how much. (Maybe that's what you already do, but it isn't clear from your graphs.)

Quote:
Is there a way to avoid calling SetTransform() and DrawPrimitive() for each object?


Yes: hardware instancing. Be careful not to use it for big draw calls (there is a small performance overhead); use it only if you have a lot of small draw calls:

SM3.0 best practices.
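Roughly, the D3D9 stream setup looks like this (SM3.0 hardware only; buffer names and vertex types are illustrative). The shared geometry lives in stream 0 and the per-instance data, such as a world matrix, in stream 1:

    // Stream 0: the shared geometry, drawn numInstances times.
    g_pDevice->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | numInstances);
    g_pDevice->SetStreamSource(0, g_pGeometryVB, 0, sizeof(GeomVertex));

    // Stream 1: per-instance data, advanced once per instance.
    g_pDevice->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);
    g_pDevice->SetStreamSource(1, g_pInstanceVB, 0, sizeof(InstanceData));

    g_pDevice->SetIndices(g_pIB);
    g_pDevice->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                    numGeomVertices, 0, numGeomTriangles);

    // Restore the default stream frequencies afterwards.
    g_pDevice->SetStreamSourceFreq(0, 1);
    g_pDevice->SetStreamSourceFreq(1, 1);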

Of course, if you have to change textures or other states between draw calls, this will make the use of instancing impossible (at least until D3D10 arrives).

LeGreg

Quote:
Original post by StormArmageddon
how do you draw lots of objects that require the same vertex buffer, but that are rotated and translated by different amounts, is there a way of getting around using SetTransform() and DrawPrimitive() for each object?

This may be obvious, but you always have the option of creating a larger vertex buffer (they aren't usually big memory hogs) and then transforming the vertices as a preprocess. The choice of static or dynamic vertex buffer depends on exactly what you're doing, but if your geometry is more or less set in place, just use a static vertex buffer and make the direct tradeoff of memory for rendering speed (since you can now transform and render the entire GROUP in one DrawPrimitive() call). Why do the same calculations over and over every frame if you have some memory to spare?
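As a rough sketch of what I mean (the Vertex and Object types, FVF macro, and device pointer are illustrative, and error checking is omitted): bake each object's world matrix into its vertices once, fill a static write-only buffer, then render the whole group with one call.

    // Bake world transforms into the vertices up front.
    std::vector<Vertex> baked;
    for (size_t i = 0; i < objects.size(); ++i)
    {
        for (size_t v = 0; v < objects[i].mesh.size(); ++v)
        {
            Vertex out = objects[i].mesh[v];
            D3DXVec3TransformCoord(&out.pos, &out.pos, &objects[i].world);
            baked.push_back(out);
        }
    }

    // Copy the result into a static, write-only vertex buffer.
    IDirect3DVertexBuffer9* pVB = NULL;
    g_pDevice->CreateVertexBuffer((UINT)(baked.size() * sizeof(Vertex)),
                                  D3DUSAGE_WRITEONLY, VERTEX_FVF,
                                  D3DPOOL_MANAGED, &pVB, NULL);
    void* pData = NULL;
    pVB->Lock(0, 0, &pData, 0);
    memcpy(pData, &baked[0], baked.size() * sizeof(Vertex));
    pVB->Unlock();

    // Per frame: one draw call for the whole group.
    g_pDevice->SetStreamSource(0, pVB, 0, sizeof(Vertex));
    g_pDevice->DrawPrimitive(D3DPT_TRIANGLELIST, 0, (UINT)baked.size() / 3);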

Thanks a bunch. I didn't realize the driver/kernel transition was the bottleneck, but I guess that makes sense.

Quote:
This may be obvious, but you always have the option of creating a larger vertex buffer (they aren't usually big memory hogs) and then transforming the vertices as a preprocess. The choice of static or dynamic vertex buffer depends on exactly what you're doing, but if your geometry is more or less set in place, just use a static vertex buffer and make the direct tradeoff of memory for rendering speed (since you can now transform and render the entire GROUP in one DrawPrimitive() call). Why do the same calculations over and over every frame if you have some memory to spare?

I was actually talking about objects that are moving dynamically, such as a physics simulation with lots of objects. I thought about just using dynamic vertex buffers, but I couldn't see modifying and re-sending the entire vertex buffer every frame as an efficient option.
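The pattern I had in mind was refilling a dynamic vertex buffer every frame with D3DLOCK_DISCARD, something like this (a sketch with made-up names; the buffer would be created with D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY in D3DPOOL_DEFAULT):

    void* pData = NULL;
    if (SUCCEEDED(g_pDynamicVB->Lock(0, numVerts * sizeof(Vertex),
                                     &pData, D3DLOCK_DISCARD)))
    {
        // CPU-transformed vertices for this frame's object positions.
        memcpy(pData, transformedVerts, numVerts * sizeof(Vertex));
        g_pDynamicVB->Unlock();

        g_pDevice->SetStreamSource(0, g_pDynamicVB, 0, sizeof(Vertex));
        g_pDevice->DrawPrimitive(D3DPT_TRIANGLELIST, 0, numVerts / 3);
    }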

Actually, to me it still doesn't make sense. This may be due to my unfamiliarity with how Windows works at a low level, but what extra work is being done in the context switch? It seems odd that Windows can't just accept the call and pass it off to the driver easily. Are there a lot of checks or something that Windows has to do before the move to kernel mode is made?

Quote:
Original post by Cypher19
Actually, to me it still doesn't make sense. This may be due to my unfamiliarity with how Windows works at a low level, but what extra work is being done in the context switch? It seems odd that Windows can't just accept the call and pass it off to the driver easily. Are there a lot of checks or something that Windows has to do before the move to kernel mode is made?


GPUs are also very fast, so the CPU-side cost of submitting a single triangle has to be correspondingly low, or submission becomes the bottleneck.

I'm not too familiar with the inner secrets of the Windows kernel/OS, but you might be able to get more information from the DDKs if you're interested.

Quote:
Original post by Cypher19
Are there a lot of checks or something that Windows has to do before the move to kernel mode is made?
Yup, you can't easily jump between the rings; various checks and transitions have to be made.

As was suggested, I don't think the individual context/kernel switch is slow so much as the architecture of D3D9 generates lots of them. Even lots of slightly slow calls end up building into a noticeable delay...

hth
Jack

A context switch costs about 1000 cycles in each direction, probably a little more, so call it roughly 2000 cycles per draw call even if the call itself does nothing. On a 2 GHz machine running at 100 FPS you have about 20 million cycles per frame, and 20,000,000 / 2000 gives a maximum of about 10K calls, assuming that nothing else is happening.

Add in everything that is happening at the various layers of the system and you run out of room to make draw calls really quickly.
