kovacsp

Some questions about batching


Hi,

I've read some texts about batching (I mentioned some of them in my previous post) to achieve better performance in Direct3D, but I still have some questions.

First: can I / should I render everything in one batch? If I want to render everything in one batch, then I have to use the same vertex declaration for everything, and that declaration must contain every field I'll ever need. But I remember a case where a Mobility Radeon rendered nothing when it was given extra fields the fixed-function pipeline didn't need (namely tangent and binormal, which were needed only for the bump-mapped objects). If not, then I need at least n batches, where n is the number of vertex declarations in use. But can I put objects that will be rendered with different render states in one buffer? (Since I can choose which index range to draw, I assume yes, but will this result in faster rendering? I'll have the same number of DrawIndexedPrimitive calls either way.) And can I batch index buffers too?

The other question: I should pad the vertex declarations to 32, 64, or however many bytes. What should I use for padding that won't cause incompatibility problems (maybe texture coordinates)? And which is the best method for updating the vertex buffer in 32-, 64-, ...-byte blocks? Should I build one vertex in system memory, update it, and memcpy it into the buffer, or what?

Is it better to use state blocks than changing render states manually? If yes, can I create them in advance (I mean, before rendering anything)?

Finally, please recommend some good source code that employs these kinds of optimizations. As far as I know, the SDK samples are coded for clarity rather than speed; I need some good source to learn from.

Thanks for reading my silly questions,
Peter

You can't realistically render in one batch. You need a batch per texture change at the least. Since your bumpmapped and non-bumpmapped objects won't share the same textures, you can use that opportunity to switch vertex buffers and declarations too.

Realistically, you won't batch everything that's batchable... it would require rebuilding the vertex buffer every frame to contain exactly what's on screen.

For example, if your VB contains 15 copies of a tree, 15 rocks, and 15 weeds, and after frustum culling you have 25 trees, 9 rocks, and 10 weeds to render, you can't do that at once.

You'll probably batch them like this:
draw 15 trees,
another 10 trees,
9 rocks,
and 10 weeds.

But the effort of determining that optimal mixed-model batching might be overkill. You want to aim for <=300 draw calls per frame. In the example above, we reduced 44 draws to 4; the effort involved in getting below 4 draws isn't really worth it.
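The grouping above can be sketched as a small helper. This is illustrative only (batchDraws is a made-up name): it splits a culled instance count into draw calls, each capped at the number of copies baked into the vertex buffer, so 25 visible trees with 15 baked copies become draws of 15 and 10.

```cpp
#include <algorithm>
#include <vector>

// Split a culled instance count into draw-call batches, each no larger
// than the number of copies baked into the vertex buffer.
std::vector<int> batchDraws(int visible, int copiesInBuffer) {
    std::vector<int> draws;
    while (visible > 0) {
        int n = std::min(visible, copiesInBuffer);
        draws.push_back(n);
        visible -= n;
    }
    return draws;
}
```

With 15 copies in the buffer, 25 trees + 9 rocks + 10 weeds come out as {15, 10}, {9}, and {10}: four draws total, matching the example.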


Yes, you can put geometry from different meshes that share a vertex format but use different states in the same buffer. This might be faster, despite needing extra draw calls, because you won't necessarily need to switch VBs. The gain won't be much, but when it happens, it's pretty much free.

nVidia recommends using vertex data that is a multiple of 32 bytes, because that's its cache-line size. If you could guarantee the vertices are used linearly (0,1,2, 1,3,2, 2,3,4, etc.), it wouldn't really matter, but realistically, parts of your mesh will use vertices that aren't adjacent in memory every now and then (0,1,2, 1,3,2, 10,1,2, etc.). Whenever you jump to a new place in video memory and your data isn't 32-byte aligned, it will take extra time to fetch (2 cache-line fetches instead of 1) and use twice as much cache space.

I'm not sure how much of a performance hit this is, but memory access is usually a GPU bottleneck, so I'd aim for the 32-byte vertex. If you can get down to 16 bytes and fit two vertices in one cache line while remaining aligned, that's even better.
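A 32-byte layout might look like the sketch below. The field choice is illustrative; the point is that the 8 bytes of padding are declared as a second (unused) texture coordinate set, which answers the earlier padding question: texcoords are a safe filler because they keep the declaration legal even on fixed-function hardware.

```cpp
#include <cstdint>

// Illustrative 32-byte vertex: position + diffuse + one UV set,
// padded to 32 bytes with a second texcoord set that the shader
// simply never reads.
struct Vertex32 {
    float    x, y, z;     // position        (12 bytes)
    uint32_t color;       // diffuse color   ( 4 bytes)
    float    u0, v0;      // texcoord set 0  ( 8 bytes) -> 24 so far
    float    pad0, pad1;  // unused texcoord set 1 (8 bytes) -> 32
};

static_assert(sizeof(Vertex32) == 32,
              "vertex must match the 32-byte cache line");
```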

AGP transfers are optimal in bursts. In order for a burst to work, all memory writes must occur sequentially in address and time. Depending on how you calculate your vertices, it may be faster to compute them in local memory and then memcpy to the GPU memory; then again, it might be fine to write directly to GPU memory. You'll need to make a release build to profile this properly, as debug builds will write to local loop-counter variables and such, breaking the bursts.
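The compute-locally-then-copy pattern might look like this. It's a sketch: fillVertexBuffer and the trivial per-vertex work are made up, and lockedPtr stands in for the pointer a real Lock() call would return.

```cpp
#include <cstring>
#include <vector>

struct Vertex { float x, y, z, u, v; };

// Compute all vertices in ordinary (cached) system memory first, then
// copy them to the (write-combined) locked buffer in one sequential
// pass so the AGP writes can burst. Interleaving the per-vertex math
// with writes to the locked pointer could break the bursts.
void fillVertexBuffer(Vertex* lockedPtr, size_t count) {
    std::vector<Vertex> scratch(count);
    for (size_t i = 0; i < count; ++i) {
        // ...whatever per-vertex work you need; kept trivial here
        scratch[i] = Vertex{ float(i), 0.0f, 0.0f, 0.0f, 1.0f };
    }
    std::memcpy(lockedPtr, scratch.data(), count * sizeof(Vertex));
}
```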

Stateblocks are supposedly faster. I've seen nVidia recommend you avoid them though, as they may produce many redundant state changes... however the D3D overhead of calling SetRenderState and SetTextureStageState might outweigh the cost of the redundant states. Try to reduce how often you change states.

Some problems with dynamic batch reduction (copying vertex lists to a single, dynamic vertex buffer, then rendering as a single batch with one DIP call):

1. you must also dynamically batch the index buffers, but with offsets for each index. This means that you can't block copy the index buffer, but must instead copy indices one-by-one whilst adding an offset.

2. the bandwidth requirements of copying this much data to AGP every frame can be overwhelming. In one case we had (dynamically batching 20 race cars of ~25k tri each), the memcpy dominated (over 40% CPU).

Any solutions to these two issues?

joe
image space

Statically define what you can batch. I allow a user-specified number of copies of a mesh, or 1000/vertexcount copies by default. This puts n copies into the VB and IB. Each copy in the VB has a unique id, used in the shader to look up the proper matrix. Each copy's indices in the IB are offset to point to the proper place in the VB. I place the objects with a vertex shader, using a unique world matrix, opacity value, and other values per object. If I want to draw extra copies, I just set a few extra constant registers, increase the primitive count and the vertex count, and I'm done.
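The VB/IB replication step described above can be sketched roughly like this. The exact scheme isn't shown in the post, so this is an assumption: replicateMesh is a made-up name, and copyId is stored as an extra float the vertex shader would use to index per-object constants.

```cpp
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z; float copyId; };

// Replicate one mesh n times into a shared VB/IB pair. Each copy's
// vertices carry a unique copyId, which a vertex shader can use to
// pick the right world matrix / opacity from constant registers.
// Copy k's indices are offset so they point at copy k's vertices.
void replicateMesh(const std::vector<Vertex>& mesh,
                   const std::vector<uint16_t>& indices,
                   int copies,
                   std::vector<Vertex>& vbOut,
                   std::vector<uint16_t>& ibOut) {
    for (int k = 0; k < copies; ++k) {
        uint16_t base = static_cast<uint16_t>(vbOut.size());
        for (Vertex v : mesh) {
            v.copyId = static_cast<float>(k);
            vbOut.push_back(v);
        }
        for (uint16_t idx : indices)
            ibOut.push_back(static_cast<uint16_t>(idx + base));
    }
}
```

Drawing m < n copies is then just one DrawIndexedPrimitive over the first m copies' index range, after filling m sets of constants.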

Thanks for the replies!

Now some more questions..
You mentioned frustum culling. If I build the matrix from the camera settings, I get a view frustum, and the video card culls anything outside it. So what's the point in doing the same thing myself? Is this all about transferring (or not transferring) the vertex and index data of the culled objects over the AGP bus?

And another: suppose I have a more or less static world where only the camera moves. In this case I can batch everything into one buffer (assuming everything uses the same vertex declaration). Is it worth it?

Another: how should I order my draws? One option is by material (shader), another is front to back. I can't do both at the same time, so which is better?

If I have redundant render states, who is going to drop them? D3D? The driver? Is it worth checking with GetRenderState before SetRenderState? Or I could keep a "mirror" of the states too.

What do you think about pre-transformed static geometry (thus avoiding matrix changes)?

The presentation called "DirectX 9 Performance" mentions that I should avoid material changes and compute multiple materials in a single pixel shader. Has anyone ever done this? It sounds really strange to me.

As an extreme: would it be a good idea to define a layer above D3D and make some optimizations on the data I get for rendering? Only simple sorting, etc.

And I'm still looking for some highly optimized sample code.. do you know of any?

Sorry for asking so many questions... I know I'll have to do lots of experiments to try all these ideas, but it's easier to ask more experienced people first, isn't it? ;)

Thanks again,
Peter

Quote:
Original post by kovacsp
You mentioned frustum culling. If I build the matrix from the camera settings, I get a view frustum, and the video card culls anything outside it. So what's the point in doing the same thing myself? Is this all about transferring (or not transferring) the vertex and index data of the culled objects over the AGP bus?
If a tree is behind you and you render everything anyway, you still have to switch textures for it and then issue the call to draw it. If you do the culling yourself, you don't need to make any D3D calls for that object at all.
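The CPU-side test is typically a bounding-sphere check against the six planes extracted from the view-projection matrix. A minimal sketch (the plane-extraction step is omitted, and sphereInFrustum is a made-up name):

```cpp
// A frustum plane in ax + by + cz + d = 0 form, normalized and
// pointing inward (positive half-space = inside).
struct Plane { float a, b, c, d; };

// A bounding sphere is outside the frustum if it lies fully behind
// any one of the six planes. Objects rejected here cost zero
// SetTexture/Draw* calls, which is the saving the GPU's own clipping
// can't give you.
bool sphereInFrustum(const Plane planes[6],
                     float cx, float cy, float cz, float radius) {
    for (int i = 0; i < 6; ++i) {
        float dist = planes[i].a * cx + planes[i].b * cy
                   + planes[i].c * cz + planes[i].d;
        if (dist < -radius)
            return false;  // fully behind this plane
    }
    return true;  // intersects or inside
}
```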

Quote:
Original post by kovacsp
Another: how should I order my draws? One option is by material (shader), another is front to back. I can't do both at the same time, so which is better?
I think there's an nVidia presentation about this somewhere. You should sort by texture (material) first, then vertex buffer, then other properties (I can't remember what). You can sort by more than one thing: if you have 5 objects with the same texture but in different VBs, you sub-sort those 5 objects by VB.
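One common way to get that multi-level ordering is to pack the criteria into a single integer and sort once. This is a sketch of the idea; makeSortKey and the field widths are arbitrary choices, not anything from the posts above.

```cpp
#include <cstdint>

// Pack sort criteria into one 64-bit key so a plain numeric sort
// orders the render queue by texture first, then vertex buffer, then
// a minor key (e.g. shader). Field widths are illustrative:
// 24 bits texture | 24 bits VB | 16 bits minor.
uint64_t makeSortKey(uint32_t textureId, uint32_t vbId, uint32_t minorId) {
    return (static_cast<uint64_t>(textureId & 0xFFFFFF) << 40)
         | (static_cast<uint64_t>(vbId      & 0xFFFFFF) << 16)
         | (minorId & 0xFFFF);
}
```

Sorting the queue by this key then naturally groups draws that share a texture, and within those, draws that share a VB.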

Quote:
Original post by kovacsp
If I have redundant render states, who is going to drop them? D3D? The driver? Is it worth checking with GetRenderState before SetRenderState? Or I could keep a "mirror" of the states too.
I think D3D drops the redundant states for you. Don't call GetRenderState(), since that's just another call into D3D. Keeping track of what states you change in your app might be useful, particularly if you need to reset the device (you'll lose all render states when that happens), and the tracking itself shouldn't be much of a performance hit.
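A state "mirror" of the kind being discussed might look like this sketch. StateCache is a made-up name, and the deviceSet callback stands in for the real IDirect3DDevice9::SetRenderState so the idea is shown without the D3D headers.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

// Remember the last value set for each render state and skip the
// device call when it wouldn't change anything. After a device reset,
// call invalidate() so every state gets re-applied on next use.
class StateCache {
public:
    explicit StateCache(std::function<void(uint32_t, uint32_t)> deviceSet)
        : set_(std::move(deviceSet)) {}

    void setRenderState(uint32_t state, uint32_t value) {
        auto it = mirror_.find(state);
        if (it != mirror_.end() && it->second == value)
            return;                 // redundant; no D3D call at all
        mirror_[state] = value;
        set_(state, value);
    }

    void invalidate() { mirror_.clear(); }  // e.g. after a device reset

private:
    std::function<void(uint32_t, uint32_t)> set_;
    std::unordered_map<uint32_t, uint32_t> mirror_;
};
```

Unlike a GetRenderState() check, the lookup never crosses into D3D, so filtering a redundant set is just a hash-map probe.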

Quote:
Original post by kovacsp
As an extreme: would it be a good idea to define a layer above D3D and make some optimizations on the data I get for rendering? Only simple sorting, etc.
I do this because it makes things easier for me. I pass objects to my renderer, and it sorts them by texture, VB, etc., then passes them to D3D. It also manages state changes so I can set all the render states back when the device is reset.

Thanks for your useful replies again!

Now I have only one more:
Assume I have enough video memory. I create a texture (load it from file, or whatever), then SetTexture and render something with it. Then I load another texture, set it to the same stage, and render something else with this new texture. In this case, is the first texture dropped from video memory? Or is it cached, so that when I set it again it won't be re-sent over AGP?
I can imagine that since D3D must keep a handle to every texture created, the currently unused textures are cached too (if I have enough memory).
And from another point of view: when will a texture be loaded into video memory? When I create it, when I set it, or when I render with it?

The same question applies to other resources too. Is there a document that describes the management behaviour of resources (or some document describing the inner workings of the card, driver and D3D in detail)?

Best regards,
Peter

Quote:
Original post by kovacsp
Assume I have enough video memory. I create a texture (load it from file, or whatever), then SetTexture and render something with it. Then I load another texture, set it to the same stage, and render something else with this new texture. In this case, is the first texture dropped from video memory? Or is it cached, so that when I set it again it won't be re-sent over AGP?
I can imagine that since D3D must keep a handle to every texture created, the currently unused textures are cached too (if I have enough memory).
If you're using the managed pool, then the texture is unloaded when you run out of video memory. If you have enough video memory, the texture stays there until you release it. The same for other resource types. With the default pool, the Create*() call fails with D3DERR_OUTOFVIDEOMEMORY, and then you'd need to unload another texture to make room. The driver (or D3D, I'm not sure which) does this for you automatically for the managed pool.

Quote:
Original post by kovacsp
And from another point of view: when will a texture be loaded into video memory? When I create it, when I set it, or when I render with it?
I think it's done when the texture is created, but I could be wrong. Someone else feel free to comment on this...

Quote:
Original post by kovacsp
The same question applies to other resources too. Is there a document that describes the management behaviour of resources (or some document describing the inner workings of the card, driver and D3D in detail)?
I believe that with the managed pool, resources are loaded into video memory and are unloaded when the resource is released. If you run out of video memory, then D3D will unload resources using a least-recently-used policy. The D3D documentation has some stuff on this, but I don't know how in-depth it goes (I'm reading through it now).

POOL_MANAGED will go into video memory when needed, and possibly be pulled from video memory if another resource needs the space. If you use less than the full video memory, everything will eventually end up in video memory, and no more transfer will be needed. This is optimal.

Data isn't put into video memory on load. You can call Preload() on the texture to put it in video memory. If you SetTexture() a texture that needs to be in video memory, but isn't, the texture will be transferred to video memory at that point... so preload isn't necessary, but...

Say you have 30MB of textures, and you'll never see 10MB of those until you get to the second half of the level and enter some other section of the world. You can preload those textures before the level starts so they're ready. If you don't preload them, there may be a glitch in framerate as you suddenly need 10MB of data you didn't require until now. Preloading can make things smoother if you have enough video memory... Of course, D3D has no accurate indication of how much video memory is available. GetAvailableTextureMem isn't accurate, as it includes AGP memory, which can be any amount of space, configurable via BIOS settings.

Certain things, like render targets and dynamic vertex buffers must be in POOL_DEFAULT, and always take video memory.

Quote:
Original post by Namethatnobodyelsetook
Data isn't put into video memory on load. You can call Preload() on the texture to put it in video memory. If you SetTexture() a texture that needs to be in video memory, but isn't, the texture will be transferred to video memory at that point... so preload isn't necessary, but...
Ah, I forgot about PreLoad(), thanks.

Quote:
Original post by Namethatnobodyelsetook
Certain things, like render targets and dynamic vertex buffers must be in POOL_DEFAULT, and always take video memory.
Actually, dynamic VBs don't need to be in the default pool, but there's no real reason for having them in the managed pool.
