Interesting performance notes from the new SDK docs

Started by
12 comments, last by IFooBar 19 years, 8 months ago
Quote: Now that you know about the command buffer and the effect it can have on profiling, you should know that there are a few other conditions that can cause the runtime to empty the command buffer. You need to watch out for these in your render sequences. Some of these conditions are in response to API calls, others are in response to resource changes in the runtime. Any of the following conditions will cause a mode transition:
- When one of the lock methods (IDirect3DVertexBuffer9::Lock) is called on a vertex buffer, index buffer, or texture (under certain conditions with certain flags).
- When a device or vertex buffer, index buffer, or texture is created.
- When a device or vertex buffer, index buffer, or texture is destroyed by the last release.
- When IDirect3DDevice9::ValidateDevice is called.
- When IDirect3DDevice9::Present is called.
- When the command buffer fills up.
- When IDirect3DQuery9::GetData is called with D3DGETDATA_FLUSH.
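The flush conditions in the quote can be pictured with a toy model. The sketch below is not the Direct3D runtime: `CommandBuffer`, its capacity, and `transitionsForFrame` are hypothetical names, and every flush is assumed to cost exactly one mode transition. It only illustrates why queued calls are cheap and why a lock in the middle of a render sequence adds a transition.

```cpp
#include <cstddef>

// Toy model of a user-mode command buffer (hypothetical, not the real runtime).
// Recording a command is cheap; handing the queue to the driver is assumed
// to cost one mode transition.
class CommandBuffer {
public:
    explicit CommandBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Queue one command (a state change or a draw call).
    // A full buffer is flushed first, as in "when the command buffer fills up".
    void record() {
        if (pending_ >= capacity_) flush();
        ++pending_;
    }

    // Stand-ins for two of the calls the quote lists as emptying the buffer.
    void lock()    { flush(); }
    void present() { flush(); }

    std::size_t transitions() const { return transitions_; }
    std::size_t pending() const { return pending_; }

private:
    // Each flush counts as one transition, even if nothing is queued.
    void flush() {
        ++transitions_;
        pending_ = 0;
    }

    std::size_t capacity_;
    std::size_t pending_ = 0;
    std::size_t transitions_ = 0;
};

// Count the transitions for one frame of `draws` commands ending in present().
// If lockBetweenDraws is true, a lock is issued after every command.
std::size_t transitionsForFrame(std::size_t draws, std::size_t capacity,
                                bool lockBetweenDraws) {
    CommandBuffer cb(capacity);
    for (std::size_t i = 0; i < draws; ++i) {
        cb.record();
        if (lockBetweenDraws) cb.lock();
    }
    cb.present();
    return cb.transitions();
}
```

With 10 commands and a capacity of 4, `transitionsForFrame(10, 4, false)` gives 3 transitions in this model, while `transitionsForFrame(10, 4, true)` gives 11.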
When they say "mode transition", they're talking about a switch from user mode to kernel mode (and back), which is one of the more expensive things you can do under Windows.

Interestingly enough, it doesn't say that a mode transition takes place when a shader is created. I'm assuming that's either an omission, or the shader is actually sent to the driver only when you call SetVertexShader() / SetPixelShader().

The fact that both creating a new buffer and calling Lock() cause a transition definitely goes a long way towards determining the best options when choosing between dynamic and static vertex buffers. Creating a new buffer could actually be more efficient if the data isn't going to change more than once every few seconds (as opposed to every frame, as would happen in a dynamic buffer). Avoiding an extra mode transition or two could definitely give you a performance boost (in exchange for the cost of copying extra data across the AGP bus). Obviously that all depends on how often you do the changes. It would never be more efficient to rebuild the buffer every frame, for example.

I'm still reading the doc, but overall I'm really thankful for the performance tweaking tools provided with the latest release (especially PIX) and the enhanced documentation for performance issues. This isn't just in the document: every sample now includes performance notes, which were sadly lacking in previous SDK revisions, meaning you didn't find out how slow a given process was until you saw it with your own eyes.

Edit: Also of note here are estimated state change costs.
Obviously these are going to be driver-dependent, but the ordering should be roughly the same in terms of most -> least expensive:

API Call                              Average number of Cycles
SetVertexDeclaration                  6500 - 11250
SetFVF                                6400 - 11200
SetVertexShader                       3000 - 12100
SetPixelShader                        6300 - 7000
SPECULARENABLE                        1900 - 11200
SetRenderTarget                       6000 - 6250
SetPixelShaderConstant (1 Constant)   1500 - 9000
NORMALIZENORMALS                      2200 - 8100
LightEnable                           1300 - 9000
SetStreamSource                       3700 - 5800
LIGHTING                              1700 - 7500
DIFFUSEMATERIALSOURCE                 900 - 8300
AMBIENTMATERIALSOURCE                 900 - 8200
COLORVERTEX                           800 - 7800
SetLight                              2200 - 5100
SetTransform                          3200 - 3750
SetIndices                            900 - 5600
AMBIENT                               1150 - 4800
SetTexture                            2500 - 3100
SPECULARMATERIALSOURCE                900 - 4600
EMISSIVEMATERIALSOURCE                900 - 4500
SetMaterial                           1000 - 3700
ZENABLE                               700 - 3900
WRAP0                                 1600 - 2700
MINFILTER                             1700 - 2500
MAGFILTER                             1700 - 2400
SetVertexShaderConstant (1 Constant)  1000 - 2700
COLOROP                               1500 - 2100
COLORARG2                             1300 - 2000
COLORARG1                             1300 - 1980
CULLMODE                              500 - 2570
CLIPPING                              500 - 2550
DrawIndexedPrimitive                  1200 - 1400
ADDRESSV                              1090 - 1500
ADDRESSU                              1070 - 1500
DrawPrimitive                         1050 - 1150
SRGBTEXTURE                           150 - 1500
STENCILMASK                           570 - 700
STENCILZFAIL                          500 - 800
STENCILREF                            550 - 700
ALPHABLENDENABLE                      550 - 700
STENCILFUNC                           560 - 680
STENCILWRITEMASK                      520 - 700
STENCILFAIL                           500 - 750
ZFUNC                                 510 - 700
ZWRITEENABLE                          520 - 680
STENCILENABLE                         540 - 650
STENCILPASS                           560 - 630
SRCBLEND                              500 - 685
TWOSIDEDSTENCILMODE                   450 - 590
ALPHATESTENABLE                       470 - 525
ALPHAREF                              460 - 530
ALPHAFUNC                             450 - 540
DESTBLEND                             475 - 510
COLORWRITEENABLE                      465 - 515
CCW_STENCILFAIL                       340 - 560
CCW_STENCILPASS                       340 - 545
CCW_STENCILZFAIL                      330 - 495
SCISSORTESTENABLE                     375 - 440
CCW_STENCILFUNC                       250 - 480
SetScissorRect                        150 - 340
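One way to act on figures like these is to order draw calls so that the expensive states change as rarely as possible. The sketch below is only an illustration of that idea, not something from the SDK document: `Draw`, `countChanges`, `sortByState`, and `estimateCycles` are hypothetical names, IDs are assumed to be non-negative, and the two cycle constants are simply the low ends of the SetVertexShader and SetTexture ranges listed above.

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical description of one draw call: which shader and texture it needs.
struct Draw {
    int shader;
    int texture;
};

struct ChangeCount {
    std::size_t shaderChanges;
    std::size_t textureChanges;
};

// Count the SetVertexShader / SetTexture calls a renderer would issue for this
// submission order if it skips calls that would re-set the current state.
ChangeCount countChanges(const std::vector<Draw>& draws) {
    ChangeCount c{0, 0};
    int curShader = -1;   // -1 means nothing is bound yet
    int curTexture = -1;
    for (const Draw& d : draws) {
        if (d.shader != curShader)   { ++c.shaderChanges;  curShader = d.shader; }
        if (d.texture != curTexture) { ++c.textureChanges; curTexture = d.texture; }
    }
    return c;
}

// Sort by shader first, then texture, so the costlier state changes least often.
std::vector<Draw> sortByState(std::vector<Draw> draws) {
    std::sort(draws.begin(), draws.end(), [](const Draw& a, const Draw& b) {
        return std::tie(a.shader, a.texture) < std::tie(b.shader, b.texture);
    });
    return draws;
}

// Low ends of the ranges quoted in the post; real costs are driver-dependent.
constexpr std::size_t kSetVertexShaderCycles = 3000;
constexpr std::size_t kSetTextureCycles = 2500;

std::size_t estimateCycles(const ChangeCount& c) {
    return c.shaderChanges * kSetVertexShaderCycles +
           c.textureChanges * kSetTextureCycles;
}
```

For four draws that alternate between two shader/texture pairs, this model estimates 22000 cycles of state changes in submission order and 11000 after sorting. The absolute numbers mean little; the point is the ratio.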

---------------------------Hello, and Welcome to some arbitrary temporal location in the space-time continuum.

I agree wholeheartedly; it's a very good read for any graphics programmer using D3D. Most of the concepts probably transfer to OpenGL as well, such as texture binding time and the like.

Niko Suni

Great info. Perhaps this could be made a sticky or put into the Forum FAQ?

Knowing roughly the number of cycles (even if it is driver dependent) of these API calls would help a lot in deciding what kind of trade-offs to make when deciding how to render things.

neneboricua
I've stickied it temporarily. If someone puts it into Q&A form, I'll add it to the FAQ.
(Perhaps the answer should just be a link to the doc page?)

Quote:Creating a new buffer could actually be more efficient if the data isn't going to change more than once every few seconds (as opposed to every frame, as would happen in a dynamic buffer).
How can that be true? The only way to get data into the new vertex buffer you just created is to lock it. By creating a new buffer you have two overheads, by simply locking it with the discard flag you only have one.
____________________________________________________________AAAAA: American Association Against Adobe AcrobatYou know you hate PDFs...
Quote:Original post by Raloth
Quote:Creating a new buffer could actually be more efficient if the data isn't going to change more than once every few seconds (as opposed to every frame, as would happen in a dynamic buffer).
How can that be true? The only way to get data into the new vertex buffer you just created is to lock it. By creating a new buffer you have two overheads, by simply locking it with the discard flag you only have one.


If I have a dynamic buffer, I'm going to be locking every single frame (costing me 2 mode changes + the cost of data transfer).

If I have a static buffer, I'm going to be doing 1 release, 1 create, 1 lock, and the cost of the data transfer.

After that, I have only the cost of DrawPrimitive each frame.

Like I said -- if you're only updating the buffer every once in a while (like maybe once every few seconds), creating a new static buffer is more efficient than using a dynamic buffer (because you avoid all that nasty AGP transfer overhead as well as the constant locking and unlocking).
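The accounting in this post, and the alternative raised in the question it answers (lock the existing buffer only when the data changes), can be put side by side. The sketch below is a hypothetical counting model, not a measurement: it counts only the flush-causing calls named in the SDK quote (lock, create, last release), treats them all as equally expensive, and ignores data transfer, pipeline stalls, and any driver overhead of buffer creation. All function names are made up for illustration.

```cpp
#include <cstddef>

// Number of buffer updates in `frames` frames when the data changes once
// every `interval` frames. `interval` must be greater than zero.
std::size_t updates(std::size_t frames, std::size_t interval) {
    return (frames + interval - 1) / interval;  // ceiling division
}

// Strategy A: lock the dynamic buffer every frame, whether or not data changed.
// One flush-causing call (the lock) per frame.
std::size_t lockEveryFrame(std::size_t frames) {
    return frames;
}

// Strategy B: on each update, release the old static buffer, create a new one,
// and lock it to fill it. Three flush-causing calls per update.
std::size_t recreateOnUpdate(std::size_t frames, std::size_t interval) {
    return 3 * updates(frames, interval);
}

// Strategy C: on each update, lock the existing buffer once.
// One flush-causing call per update.
std::size_t lockOnUpdate(std::size_t frames, std::size_t interval) {
    return updates(frames, interval);
}
```

Over 600 frames with an update every 120 frames, the model counts 600 calls for A, 15 for B, and 5 for C. So recreating beats locking every frame, but under these assumptions it does not beat locking only on update. Whether a stall on the lock changes that picture is exactly what the rest of the thread argues about, and this model cannot settle it.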

---------------------------Hello, and Welcome to some arbitrary temporal location in the space-time continuum.

When you create the new static vertex buffer you still have to lock and unlock it the same as if you had locked your dynamic vertex buffer. Creating a new one is going to screw around with the driver, so that adds even more overhead. There is no way that creating a new vertex buffer is more efficient. Locking the static buffer with the discard flag will achieve the same thing with better performance, and depending on the application there may be next to no hit at all.
____________________________________________________________AAAAA: American Association Against Adobe AcrobatYou know you hate PDFs...
Yes, if you were doing it **EVERY FRAME** (note the emphasis here, it's important), you'd kill performance with a static buffer.

However, if you're only updating once every few seconds, you can actually get superior performance out of static buffers. I've already implemented a system into my current engine that allows for both, and it works extremely well. Don't take my word for it, though -- try it yourself.

---------------------------Hello, and Welcome to some arbitrary temporal location in the space-time continuum.

I understand that static buffers have better performance when they aren't updated often. I too have a similar system. What I don't understand is why releasing the old one and creating a new one, plus doing the lock and unlock, would be cheaper than just locking the old one with discard.
____________________________________________________________AAAAA: American Association Against Adobe AcrobatYou know you hate PDFs...
Locking the old buffer with discard will force the card to flush its use of the buffer.

Delete can be delayed internally until D3D is done with it, and return immediately.

Create can be done immediately, as it's not interfering with current rendering.

Locking the new buffer is fine since D3D doesn't need its contents for anything buffered.

So, creating the new buffer is more efficient in that the driver doesn't HAVE to flush its render queue. Whether that works in reality is a different matter, but in theory it makes perfect sense.
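The "delete can be delayed" step can be sketched as a deferred-release queue. This is a guess at how such a scheme could work in principle, with no claim that the D3D runtime or any driver does it this way. `DeferredReleaseQueue` and its methods are hypothetical, and the model assumes buffers are released in the order of the frames that last used them.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical model of deferred release: a buffer released by the application
// is only destroyed once the GPU has finished the last frame that used it.
class DeferredReleaseQueue {
public:
    // Called when the application releases a buffer last used in frame
    // `lastUsedFrame`. Returns immediately; nothing waits on the GPU.
    // Assumes calls arrive with non-decreasing lastUsedFrame values.
    void release(int bufferId, std::uint64_t lastUsedFrame) {
        pending_.push_back({bufferId, lastUsedFrame});
    }

    // Called when the GPU reports that it has completed `completedFrame`.
    // Destroys every queued buffer whose last use is at or before that frame
    // and returns how many were destroyed.
    std::size_t collect(std::uint64_t completedFrame) {
        std::size_t destroyed = 0;
        while (!pending_.empty() &&
               pending_.front().lastUsedFrame <= completedFrame) {
            pending_.pop_front();  // a real system would free the memory here
            ++destroyed;
        }
        return destroyed;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    struct Entry {
        int bufferId;
        std::uint64_t lastUsedFrame;
    };
    std::deque<Entry> pending_;
};
```

In this model the application never blocks on `release`. The old buffer simply stays alive until `collect` is told that its last frame has finished, which is the behavior the argument above relies on.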
