ValMan

Wrote batching system to replace D3DXSprite, now poor performance


Hello,

I needed some extra functionality for rendering objects in my 2D app, so I decided to write my own class to handle batching. Unlike D3DXSprite, it can handle objects that use different Effects with different parameters, various geometry types (quads, meshes, line lists, line strips, points and point sprites) and so on. When the new system was ready, I found it was noticeably slower than D3DXSprite when compared with old builds (both running in release mode, pure device, release DX runtime, etc.).

After testing, I found the bottleneck is not related to fill rate, number of state changes, or number of batches. The number of renderables (in my test case, only quads) does, however, have a huge impact on the frame time. D3DXSprite, on the other hand, seems to have relatively constant performance no matter how many quads you draw, and it's mostly limited by fill rate and batch count.

I have fixed-size vertex and index buffers in system memory (currently sized for 10,000 items), and video memory vertex and index buffers of the same size. When a function is called to draw a quad, it adds the quad's vertices and indices to the system memory buffers, then stores pointers to them in a Renderable structure, along with the primitive type, vertex count, index count, transform, z-order and material used. The Renderable is then added to an array of Renderables, which is sorted and traversed by the FlushBatch() function.

When FlushBatch() iterates through the Renderables to draw batches, it locks the video memory vertex/index buffers in their entirety with the Discard flag and copies data from the system memory buffers to the video memory buffers for each renderable in a batch. Then it calls DrawIndexedPrimitive to submit the batch to the video card. All in all, this is pretty standard.

I noticed in PIX that D3DXSprite works differently because it operates only on quads. It seems to allocate and fill the index buffer only once at initialization time, which makes sense: it knows it only draws quads, so the indices are always the same pattern, repeated with an offset pre-applied. So its index buffer is probably static and not dynamic like mine. Second, instead of locking the entire video memory vertex buffer with the Discard flag before sending each batch like I do, it locks only the part not yet touched by the previous batch, using the No-Overwrite flag. As for the internal memory structures D3DXSprite uses, I really have no idea how mine compare. I just know sorting renderables is not the bottleneck, because taking out the sorting code makes no impact on performance.
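For illustration, the fixed quad index pattern described above could be generated once at startup along these lines. This is only a sketch: `buildQuadIndices` is a hypothetical helper, and the vertex layout and triangle winding are assumptions, not something read out of D3DXSprite.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds indices for `quadCount` quads. Assumes (for illustration only) that
// each quad owns four consecutive vertices 0-1-2-3 and is drawn as the two
// triangles (0,1,2) and (0,2,3).
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount) {
    // 16-bit indices can address 65536 vertices, i.e. at most 16384 quads.
    assert(quadCount <= 16384);

    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        // The per-quad vertex offset is baked into the indices up front.
        const std::uint16_t base = static_cast<std::uint16_t>(q * 4);
        indices.push_back(base);
        indices.push_back(static_cast<std::uint16_t>(base + 1));
        indices.push_back(static_cast<std::uint16_t>(base + 2));
        indices.push_back(base);
        indices.push_back(static_cast<std::uint16_t>(base + 2));
        indices.push_back(static_cast<std::uint16_t>(base + 3));
    }
    return indices;
}
```

The result would be copied a single time into a non-dynamic index buffer and never locked again, leaving only the vertex buffer to update per batch. With 10,000 quads that is 40,000 vertices, which still fits in 16-bit indices.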

Can anyone help me with this issue? Is there a problem with my buffer locking scheme, or is my way of using system memory buffers too inefficient?

Thank you,
Val

It kind of sounds like you figured out the answer to your own problem ... it appears it is less efficient to lock the whole dynamic VB range one time than it is to lock say 1/32 of the VB range 32 times.

The DirectX SDK documentation even seems to support this:

There are cases where the amount of data the application needs to store per lock is small, such as adding four vertices to render a sprite. D3DLOCK_NOOVERWRITE indicates that the application will not overwrite data already in use in the dynamic buffer. The lock call will return a pointer to the old data, allowing the application to add new data in unused regions of the vertex or index buffer. The application should not modify vertices or indices used in a draw operation as they might still be in use by the graphics processor. The application should then use D3DLOCK_DISCARD after the dynamic buffer is full to receive a new region of memory, discarding the old vertex or index data after the graphics processor is finished.

Essentially it sounds like even with the D3DLOCK_DISCARD flag you are introducing a pipeline stall if you lock the entire VB range and then draw from it. It might be worthwhile to investigate D3DLOCK_NOOVERWRITE also.
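A minimal sketch of the bookkeeping behind that discard/no-overwrite pattern might look like the following. It deliberately contains no Direct3D calls: `DynamicBufferCursor`, `LockRange` and `LockMode` are hypothetical names, and the two modes merely stand in for D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE (hypothetical enum).
enum class LockMode { Discard, NoOverwrite };

struct LockRange {
    std::size_t offset;  // first element to write, in elements (not bytes)
    std::size_t count;   // number of elements reserved
    LockMode mode;       // flag to pass to the lock call
};

// Tracks the unused tail of a dynamic buffer of `capacity` elements.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity)
        : capacity_(capacity), next_(0) {
        assert(capacity > 0);
    }

    // Reserves `count` elements for the next batch. Appends with NoOverwrite
    // while there is room; once the batch no longer fits, wraps to the start
    // and asks for Discard so the driver can hand back fresh memory.
    LockRange reserve(std::size_t count) {
        assert(count > 0 && count <= capacity_);
        LockMode mode = LockMode::NoOverwrite;
        if (next_ + count > capacity_) {
            next_ = 0;
            mode = LockMode::Discard;
        }
        const LockRange range{next_, count, mode};
        next_ += count;
        return range;
    }

private:
    std::size_t capacity_;
    std::size_t next_;
};
```

In a real renderer each returned range would be locked with the matching flag, filled, unlocked, and drawn starting at `offset`. Note that IDirect3DVertexBuffer9::Lock takes its offset and size in bytes, so both values would be multiplied by the vertex stride.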

The number of times you can use D3DLOCK_DISCARD per frame without getting a pipeline stall is limited. The exact number depends on available memory and other driver details. Therefore most people use the discard/no-overwrite solution already discussed. At the same time, they try to size their dynamic buffers so that they are large enough to hold the data for one frame.
