
[XNA] 1000 cubes, 4 Textures = 55fps... should I expect more?


Hi all, I am just starting an XNA trial to see what it's all about. I wanted to build a 2D game with 3D graphics: like Mario, but in a world of cubes rather than flat blocks.

I have put together a simple test framework to check some performance and ideas. I create a camera, an animated character loaded from a .x file, and 1000 cubes, and I get about 55fps. Is this to be expected, or does it point to me doing something wrong? Each cube has a 128x128 texture applied to each face, and my test uses a total of 4 different textures across the 1000 cubes. Each cube creates its VertexPositionNormalTexture array from a given position and size, and tracks which faces have which texture. I then use this to do the rendering:
private void DrawBlocks(Matrix currentViewMatrix)
        {
            _beBlock.Begin();
            foreach (DictionaryEntry de in _htTextures)
            {
                string sTexName = de.Key.ToString();

                // Get all the triangles with this texture
                List<VertexPositionNormalTexture> lVerts = new List<VertexPositionNormalTexture>();
                foreach (Block b in _alBlocks) {
                    lVerts.AddRange(b.Get_Texture_Triangles(sTexName));
                }

                if (lVerts.Count > 0) {
                    
                    _beBlock.World = World;
                    _beBlock.View = View;
                    _beBlock.Projection = Projection;
                    _beBlock.Texture = (Texture2D)de.Value;
                    _beBlock.TextureEnabled = true;
                    _beBlock.DiffuseColor = new Vector3(1.0f, 1.0f, 1.0f);
                    _beBlock.AmbientLightColor = new Vector3(0.75f, 0.75f, 0.75f);
                    _beBlock.DirectionalLight0.Enabled = true;
                    _beBlock.DirectionalLight0.DiffuseColor = Vector3.One;
                    _beBlock.DirectionalLight0.Direction = Vector3.Normalize(new Vector3(1.0f, -1.0f, 1.0f));
                    _beBlock.DirectionalLight0.SpecularColor = Vector3.One;
                    _beBlock.LightingEnabled = true;


                    _dvbBlock = new DynamicVertexBuffer(GraphicsDevice, lVerts.Count * VertexPositionNormalTexture.SizeInBytes, BufferUsage.WriteOnly);
                    _dvbBlock.SetData(lVerts.ToArray(),0,lVerts.Count);
                    //_vbBlock = new VertexBuffer(GraphicsDevice, lVerts.Count * VertexPositionNormalTexture.SizeInBytes, BufferUsage.WriteOnly);

                    foreach (EffectPass ep in _beBlock.CurrentTechnique.Passes)
                    {
                        ep.Begin();

                        //GraphicsDevice.Vertices[0].SetSource(_vbBlock, 0, VertexPositionNormalTexture.SizeInBytes);
                        GraphicsDevice.Vertices[0].SetSource(_dvbBlock, 0, VertexPositionNormalTexture.SizeInBytes);
                        GraphicsDevice.VertexDeclaration = _vdBlock;
                        GraphicsDevice.DrawPrimitives(PrimitiveType.TriangleList, 0, lVerts.Count/3);

                        //GraphicsDevice.DrawPrimitives(PrimitiveType.TriangleList, 0, 2);
                        ep.End();
                    }
                }
                
            }
            _beBlock.End();
       }


Now, am I on the right track with this rendering idea? I get all the block faces that use a given texture, create and fill a DynamicVertexBuffer with the data, and draw it for each effect pass. Is there a better way? I have never used shaders before and am learning about them at the same time as XNA and its components. Thanks for any comments. Steele.

It looks like you are computing vertices at runtime. This is quite expensive by itself, as is using a dynamic vertex buffer. Using static buffers should run quite well, until the buffer starts to contain too much stuff that is offscreen (i.e. transformed but not visible).

Also, drawing indexed primitives will be quite a bit more efficient due to the caches involved. You could also compute index buffers instead of vertex buffers with a bit of work too probably, if it has to remain dynamic.
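To illustrate the indexed approach for cubes, here is a minimal sketch. It assumes each cube stores 24 vertices (4 per face, since normals and texture coordinates differ between faces) and that each face's corners are listed in a consistent winding order. `CubeIndices` and `Build` are hypothetical names, not part of XNA.

```csharp
using System;

// Hypothetical helper (not XNA API): builds the index list for a run of
// cubes whose vertices are stored as 24 per cube, 4 per face.
public static class CubeIndices
{
    public const int VerticesPerCube = 24; // 6 faces * 4 corners
    public const int IndicesPerCube = 36;  // 6 faces * 2 triangles * 3 indices

    public static int[] Build(int cubeCount)
    {
        int[] indices = new int[cubeCount * IndicesPerCube];
        int write = 0;
        for (int cube = 0; cube < cubeCount; cube++)
        {
            for (int face = 0; face < 6; face++)
            {
                // Index of the first corner of this face in the shared vertex buffer.
                int v = cube * VerticesPerCube + face * 4;

                // Split the quad into two triangles: (0,1,2) and (0,2,3).
                indices[write++] = v;
                indices[write++] = v + 1;
                indices[write++] = v + 2;
                indices[write++] = v;
                indices[write++] = v + 2;
                indices[write++] = v + 3;
            }
        }
        return indices;
    }
}
```

For 1000 cubes this gives 36,000 indices referencing 24,000 vertices, instead of 36,000 full vertices in a non-indexed triangle list. Because the pattern never changes, the array could be built once at load time.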

The vertices aren't being computed, just referenced at runtime.
The reason I was going with dynamic vertex buffers was that for the Mario-esque type game, the screen is only about 1/30 of the whole level. So I was trying to add in the ability to cull down the vertices to the current view (not yet implemented).

Are you saying that instead of building the vb from separate Block objects, I should put all Block vertices into one big vb and build the index buffers to draw instead? That is, make my Blocks object responsible for the vertex and index information, with each block just holding information about its textures per face and its bounds?

I guess I might be getting ahead of myself here and should just test out how it works for my end purpose (probably no more than 100 blocks on screen at a time), but I just expected more from my 8800 GT...

In addition to what Zoner said, are you visibility culling blocks at all? If all 1000 blocks are not on screen you should not draw them all. There are plenty of easy ways to do visibility culling on blocks in a Mario style side scroller.
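One of the easy culling options for a side scroller is a plain rectangle overlap test. The sketch below assumes the blocks sit in a single plane and that the visible area at that depth is known as a rectangle in world units. `BlockRect` and `SideScrollerCulling` are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical axis-aligned rectangle in world units.
public struct BlockRect
{
    public float MinX, MinY, MaxX, MaxY;

    public BlockRect(float minX, float minY, float maxX, float maxY)
    {
        MinX = minX;
        MinY = minY;
        MaxX = maxX;
        MaxY = maxY;
    }
}

public static class SideScrollerCulling
{
    // Returns the positions (within 'blocks') of every block whose rectangle
    // overlaps the view rectangle. Blocks touching the edge count as visible.
    public static List<int> VisibleIndices(IList<BlockRect> blocks, BlockRect view)
    {
        List<int> visible = new List<int>();
        for (int i = 0; i < blocks.Count; i++)
        {
            BlockRect b = blocks[i];
            bool overlaps = b.MaxX >= view.MinX && b.MinX <= view.MaxX
                         && b.MaxY >= view.MinY && b.MinY <= view.MaxY;
            if (overlaps)
            {
                visible.Add(i);
            }
        }
        return visible;
    }
}
```

A linear scan over 1000 rectangles is cheap. If the level grows much larger, keeping the blocks sorted by X or in a grid would avoid touching off-screen blocks at all.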

With that many DrawPrimitive calls you're likely CPU-bound. Calls to the graphics API often carry a lot of CPU overhead, and this applies particularly to DP calls. For a 2D platformer you'll probably never have to draw anywhere close to that many objects on screen at once, so you won't have to worry about it, but if for some reason you do, you'll need some form of batching. Static geometry is typically batched by putting it all into one large vertex buffer, while dynamic geometry has to be batched using instancing techniques.

Besides all the above how are you computing your framerate? Is the SynchronizeWithVerticalRetrace member of the graphics device set to true?
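For reference, one common way to measure frame rate is to count drawn frames and average them over roughly one second. This is a minimal sketch that assumes the caller passes in the elapsed time per frame in seconds; `FrameCounter` is a hypothetical class, not part of XNA.

```csharp
// Hypothetical helper: averages the frame rate over windows of about one second.
public sealed class FrameCounter
{
    private double _elapsedSeconds;
    private int _frames;
    private double _framesPerSecond;

    // Most recent completed measurement (0 until the first window ends).
    public double FramesPerSecond
    {
        get { return _framesPerSecond; }
    }

    // Call once per drawn frame with the time since the previous frame.
    public void OnFrameDrawn(double elapsedSeconds)
    {
        _frames++;
        _elapsedSeconds += elapsedSeconds;
        if (_elapsedSeconds >= 1.0)
        {
            _framesPerSecond = _frames / _elapsedSeconds;
            _frames = 0;
            _elapsedSeconds = 0.0;
        }
    }
}
```

With vertical sync enabled, a counter like this would be expected to top out around the display refresh rate, which is why the setting matters when comparing numbers.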

Thanks guys. This is very helpful.
Zoner - I will try the index primitive way. My only confusion is with handling the static vertexbuffer (or should that be vertexbuffers?). How do I know how big the vertexbuffer needs to be before I know the size of the geometry in view? Or do I just declare one of a large size and chunk as needed? If I'm using indexes, I imagine that once the vertexbuffer is built, it needs to remain the same regardless of size or scope. I take it the contents of this will remain the same and the whole thing passed to GPU and just the indexes get updated depending on visibility culling.
MJP - The number of DrawPrimitive calls is what I was attempting to reduce in my draw code, by batching all of the Block faces that use a particular Texture into one DrawPrimitive call. Again, I am not too cluey on Effects (BasicEffect), so what I am doing may not be what I thought, so please let me know! And what counts as a large vertexbuffer nowadays? In my test example I have 36 * 1000 vertices of type VertexPositionNormalTexture. Does this pose a problem?
Machaira - Both SynchronizeWithVerticalRetrace = false and IsFixedTimeStep = false.
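As a rough size check for the 36 * 1000 vertex case, here is a small worked calculation. It assumes a VertexPositionNormalTexture vertex is 32 bytes (three floats each for position and normal, two for the texture coordinate). `BufferSizeEstimate` is a hypothetical helper.

```csharp
// Hypothetical helper for estimating buffer sizes in bytes.
public static class BufferSizeEstimate
{
    // Position (3 floats) + normal (3 floats) + texture coordinate (2 floats) = 32 bytes.
    public const int BytesPerVertex = (3 + 3 + 2) * sizeof(float);

    public static long VertexBytes(int cubeCount, int verticesPerCube)
    {
        return (long)cubeCount * verticesPerCube * BytesPerVertex;
    }

    public static long IndexBytes(int cubeCount, int indicesPerCube, int bytesPerIndex)
    {
        return (long)cubeCount * indicesPerCube * bytesPerIndex;
    }
}
```

`VertexBytes(1000, 36)` comes to 1,152,000 bytes, a little over 1 MB. An indexed layout with 24 vertices per cube would need 768,000 bytes of vertices plus 144,000 bytes of 32-bit indices. Either way the data is small next to the hundreds of megabytes of video memory on a card of that class.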

1000 DrawPrimitive() calls per frame will be a heavy toll on any graphics card. As the others have said, the solution is to use vertex batching - collecting a larger number of vertices and issuing just a handful of DrawPrimitive() calls. It's exactly what the SpriteBatch class in XNA does for 2D graphics, only that you seem to need an equivalent for 3D graphics.

One way to do this would be to create a vertex buffer manager that allows each cube to allocate vertices in a larger (or multiple larger) vertex buffers. Certainly a bit complicated if the lifetime of your cubes varies.

Another approach would be to simply write vertices into an array instead of calling DrawPrimitive() and when a certain threshold is reached, write them into a dynamic vertex buffer, make the DrawPrimitive() call and continue filling up the array anew. On the PC, make sure you call SetData() with SetDataOptions.Discard to avoid stalling the graphics pipeline.

For one of my XNA projects, I've written a PrimitiveBatch class that does this, analogous to the SpriteBatch class, only that it works for vertices of any type. If you're interested, you can check out the sources here.

Getting accurate numbers on the optimal vertex batch size is hard and your best bet is to choose a target graphics card that you consider standard amongst your users and tune the vertex batch size to perform optimally on that card. The idea is to use small batches to regularly feed the graphics card (which is rendering asynchronously while the CPU already puts together the next batch), but not so small that the call overhead is bigger than the gain. It probably doesn't say much, but batches of 24K vertices seemed to perform optimally for me on a GeForce 8800 GTS 512.
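The "fill an array, flush at a threshold" idea from this post can be sketched without any graphics code. The class below is not the PrimitiveBatch mentioned above; `VertexAccumulator` and its flush callback are hypothetical stand-ins. In a real renderer the callback would be where the dynamic vertex buffer is filled and the draw call is issued.

```csharp
using System;

// Hypothetical sketch: collects vertices in a fixed array and hands them to
// a callback whenever the array would overflow.
public sealed class VertexAccumulator<T>
{
    private readonly T[] _buffer;
    private readonly Action<T[], int> _flush; // receives the array and the used count
    private int _count;

    public VertexAccumulator(int capacity, Action<T[], int> flush)
    {
        _buffer = new T[capacity];
        _flush = flush;
    }

    // Queues a group of vertices (for example, whole triangles).
    public void Add(T[] vertices)
    {
        if (vertices.Length > _buffer.Length)
        {
            throw new ArgumentException("Group is larger than the batch capacity.");
        }
        if (_count + vertices.Length > _buffer.Length)
        {
            Flush(); // no room left: submit what has been collected so far
        }
        Array.Copy(vertices, 0, _buffer, _count, vertices.Length);
        _count += vertices.Length;
    }

    // Submits any pending vertices; call once more at the end of the frame.
    public void Flush()
    {
        if (_count == 0)
        {
            return;
        }
        _flush(_buffer, _count);
        _count = 0;
    }
}
```

For triangle lists, the capacity should be a multiple of 3 and each `Add` should pass whole triangles, so that a flush never splits a triangle across two draw calls.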

This is not XNA related, but I have a question dealing with dynamic buffers in general. Is it almost as expensive to update vertices for a batched collection of meshes as it is to do many draw calls if the meshes were not batched into one buffer? I am using shader instancing so batching is limited to X amount of instances at a time, according to the shader model used (To render Y instances I'll have to use Y/X amount of draw calls). But I would have to lock the buffer every frame to update the location of every instanced mesh.

Quote:
Original post by JustChris
This is not XNA related, but I have a question dealing with dynamic buffers in general. Is it almost as expensive to update vertices for a batched collection of meshes as it is to do many draw calls if the meshes were not batched into one buffer? I am using shader instancing so batching is limited to X amount of instances at a time, according to the shader model used (To render Y instances I'll have to use Y/X amount of draw calls). But I would have to lock the buffer every frame to update the location of every instanced mesh.


That depends on the size of your batch, but generally I think updating a vertex buffer with 1000 positions is not as expensive as issuing 1000 draw calls.

Keep in mind that writing to shader parameters is not free either.

To summarize and throw in some new issues:

- Each draw call incurs overhead in the driver.
-> Minimize the number of draw calls per frame to a few hundred.
- Do not dynamically allocate memory each frame; create it at initialization time and reuse it.
-> It looks like you are creating a new VB for each iteration of the outer loop.
- If a block is not visible, do not issue a draw call for it.
-> Do a simple visibility check before drawing.
- Switching VBs involves overhead.
-> Put all vertex and index data into a single VB and IB and use the DrawPrimitive parameters to select ranges instead.

Just to add to the previous list, there is a very thin line with instancing between optimizing your code and making it slower. Too many vertices with too few instances will be costly, but too many instances with too few vertices will be costly as well (or something like that).

I once read that hardware instancing (with two vertex buffers) is 20% more costly per draw, so the savings have to be worth it...

Waterwalker - you pretty much summed it all up. Great!
I think I will pursue the chunked vb approach. It shouldn't be too hard with the setup I have. And I can visualise the output.
If I understand it correctly, if I create a fixed size vb on init, then use the SetData to fill and do DrawPrimitive, I should remove the overhead of creating dynamic vbs for each texture loop. I can see the benefit of this.

I can see further benefit in the one large static vb with index buffer method, but I'm not sure about having all the vertex data in GPU memory for access by the index buffer... or is this not how it works? From my reading, it seems that the vb is sent to the GPU and then the DrawIndexedPrimitives calls lower the required bandwidth by only sending the index values and not full vertex information. Does this mean that the GPU needs all of the vertex information on board, regardless of my visibility culling? Or is there something smarter going on that I don't get?

It depends. When using static buffers, all of the data resides on the GPU as long as there is enough VRAM available. If there is too much data, the driver moves the buffers that were not used during the last calls to AGP memory.

With dynamic buffers it also depends on the driver where it places the data but usually this should be AGP memory which requires the driver to upload the data to the graphics card for each rendering call if the data is not already in the VRAM from the last call.

With data not on the VRAM at the time of the draw call the driver should only copy the vertices from the vertex buffer that are used by the drawing calls (or rather all vertices from the first to the last vertex being used). Similarly the driver would only copy the indices from the index buffer that are used as defined by the primitive count and the start index.

Also, the graphics card would only transform those vertices that are referenced by indices when calling DrawIndexedPrimitive for performance's sake.

So using one big static vertex buffer is always a good idea, to avoid copying the vertex data for each rendering call. Of course there is a limit to the size you should use: if the buffer gets too big, it becomes slow when the driver has to move it from VRAM to AGP memory or back. But in your case you won't reach that size.

I finally had a chance to refactor my code to use one vb and an index buffer.
And it has made a BIG difference.
Same scene, was 55fps, now 300+fps.
I have one query, though, about the creation of the index buffer for the DrawIndexedPrimitive call.
During my Draw code, I get the indices of the Blocks that I can see for each texture and have found that I need to do a

IndexBuffer ib = new IndexBuffer(GraphicsDevice, typeof(int), _iIBMaxSize, BufferUsage.WriteOnly);


for each texture loop (ie fetch of index values).

private void DrawBlocks(Matrix currentViewMatrix)
{
    _beBlock.Begin();
    foreach (DictionaryEntry de in _htTextures)
    {
        string sTexName = de.Key.ToString();

        // Get all the triangles with this texture
        List<int> lIdxs = new List<int>();
        List<Block> al = GetVisibleBlocks();
        foreach (Block b in al) {
            lIdxs.AddRange(b.Get_Texture_Triangles(sTexName));
        }

        if (lIdxs.Count > 0)
        {

            _beBlock.World = World;
            _beBlock.View = View;
            _beBlock.Projection = Projection;
            _beBlock.Texture = (Texture2D)de.Value;
            _beBlock.TextureEnabled = true;
            _beBlock.DiffuseColor = new Vector3(1.0f, 1.0f, 1.0f);
            _beBlock.AmbientLightColor = new Vector3(0.75f, 0.75f, 0.75f);
            _beBlock.DirectionalLight0.Enabled = true;
            _beBlock.DirectionalLight0.DiffuseColor = Vector3.One;
            _beBlock.DirectionalLight0.Direction = Vector3.Normalize(new Vector3(1.0f, -1.0f, 1.0f));
            _beBlock.DirectionalLight0.SpecularColor = Vector3.One;
            _beBlock.LightingEnabled = true;


            //_dvbBlock = new DynamicVertexBuffer(GraphicsDevice, lVerts.Count * VertexPositionNormalTexture.SizeInBytes, BufferUsage.WriteOnly);
            //_dvbBlock.SetData(lVerts.ToArray(),0,lVerts.Count);

            //_vbBlock = new VertexBuffer(GraphicsDevice, lVerts.Count * VertexPositionNormalTexture.SizeInBytes, BufferUsage.WriteOnly);
            //_vbBlock.SetData(lVerts.ToArray(), 0, lVerts.Count);

            GraphicsDevice.Vertices[0].SetSource(_vbAllBlocks, 0, VertexPositionNormalTexture.SizeInBytes);
            //_ibAllBlocks.SetData(lIdxs.ToArray(), 0, lIdxs.Count);
            IndexBuffer ib = new IndexBuffer(GraphicsDevice, typeof(int), _iIBMaxSize, BufferUsage.WriteOnly);
            ib.SetData(lIdxs.ToArray(), 0, lIdxs.Count);
            GraphicsDevice.Indices = ib; // _ibAllBlocks;

            foreach (EffectPass ep in _beBlock.CurrentTechnique.Passes)
            {
                ep.Begin();

                //GraphicsDevice.Vertices[0].SetSource(_vbBlock, 0, VertexPositionNormalTexture.SizeInBytes);
                //GraphicsDevice.Vertices[0].SetSource(_dvbBlock, 0, VertexPositionNormalTexture.SizeInBytes);
                GraphicsDevice.VertexDeclaration = _vdBlock;
                GraphicsDevice.DrawIndexedPrimitives(PrimitiveType.TriangleList, 0, 0, _iVBActualSize, 0, lIdxs.Count / 3);
                //GraphicsDevice.DrawPrimitives(PrimitiveType.TriangleList, 0, lVerts.Count/3);

                //GraphicsDevice.DrawPrimitives(PrimitiveType.TriangleList, 0, 2);
                ep.End();
            }
        }

    }
    _beBlock.End();


}


When I tried having just a member variable that I could use over and over, I got an :
You may not modify a resource that has been set on a device, or after it has been used within a tiling bracket
error.
Is what I am doing ok? Or should I be using a single member variable DynamicIndexBuffer?
Thanks again for the help so far. I have learned a lot.

Quote:
Original post by Crow-knee
I finally had a chance to refactor my code to use one vb and an index buffer.
And it has made a BIG difference.
Same scene, was 55fps, now 300+fps.
I have one query, though, about the creation of the index buffer for the DrawIndexedPrimitive call.
During my Draw code, I get the indices of the Blocks that I can see for each texture and have found that I need to do a
IndexBuffer ib = new IndexBuffer(GraphicsDevice, typeof(int), _iIBMaxSize, BufferUsage.WriteOnly);
for each texture loop (ie fetch of index values).
When I tried having just a member variable that I could use over and over, I got an :
You may not modify a resource that has been set on a device, or after it has been used within a tiling bracket
error.
Is what I am doing ok? Or should I be using a single member variable DynamicIndexBuffer?
Thanks again for the help so far. I have learned a lot.

You should never create a resource in your render loop. At best it'll be slow, and at worst it'll fragment VRAM and cause your app to fail with an out of video memory error / exception.
If you can get away with making the index buffer static (i.e. only lock it very infrequently, less often than once per second), then do so. That's what the ID3DXSprite class does, and I assume the SpriteBatch class does too. The order of the vertices required to draw a cube will never change, so this can be precalculated.
Otherwise, create a large dynamic index buffer, and update its contents when you need to (you don't have to use the whole buffer every time).
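One way to use a single large index buffer across several textures is to pack every texture's visible indices into one preallocated array and remember where each texture's run starts. The sketch below covers only that CPU-side bookkeeping. `IndexRange` and `IndexPacker` are hypothetical names, and the upload into the actual index buffer is left to the caller.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record of where one texture's indices live in the shared array.
public struct IndexRange
{
    public int StartIndex;     // offset of the first index in the shared array
    public int PrimitiveCount; // number of triangles in the run
}

public static class IndexPacker
{
    // Copies each texture's index list into 'shared' back to back and returns
    // the range each texture occupies. 'shared' is allocated once by the caller
    // and reused every frame; 'used' reports how many entries were written.
    public static Dictionary<string, IndexRange> Pack(
        IDictionary<string, List<int>> indicesByTexture, int[] shared, out int used)
    {
        Dictionary<string, IndexRange> ranges = new Dictionary<string, IndexRange>();
        used = 0;
        foreach (KeyValuePair<string, List<int>> entry in indicesByTexture)
        {
            List<int> source = entry.Value;
            if (source.Count == 0)
            {
                continue; // nothing visible with this texture
            }
            if (used + source.Count > shared.Length)
            {
                throw new InvalidOperationException("Shared index array is too small.");
            }
            source.CopyTo(shared, used);

            IndexRange range;
            range.StartIndex = used;
            range.PrimitiveCount = source.Count / 3;
            ranges[entry.Key] = range;

            used += source.Count;
        }
        return ranges;
    }
}
```

The intent is that the first `used` entries are uploaded once per frame, before the buffer is set on the device. Each texture then gets one draw call that passes its `StartIndex` and `PrimitiveCount` as the last two arguments of DrawIndexedPrimitives. Uploading before binding should also sidestep the "may not modify a resource that has been set on a device" error, though that is an assumption to verify against your XNA version.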

Ahh... yes. Well, I kinda thought that creating an index buffer for every texture for every frame was a bad idea - but I am not quite sure of what I should do then.
As I see it, using the BasicEffect, I need the DrawIndexedPrimitive to use an (Dynamic)IndexBuffer that contains only the vertex indices of the triangles that use that texture AND are visible. How do I feed the IndexBuffer for each texture when the second time through the loop I get the error I mentioned above?
You may not modify a resource that has been set on a device, or after it has been used within a tiling bracket
You mention locking - I am not at my home PC (with XNA) but is that a method of the IndexBuffer? Is this something I should do in the Update method, locking the IndexBuffer and feeding in the texture indices, taking note of the offsets and lengths of each with the one Indexbuffer to use in the draw call?
If so, I could see that, with a more relaxed FOV culling method, the IndexBuffer updates could be done only a couple of times a second...

Quote:
You mention locking - I am not at my home PC (with XNA) but is that a method of the IndexBuffer? Is this something I should do in the Update method, locking the IndexBuffer and feeding in the texture indices, taking note of the offsets and lengths of each with the one Indexbuffer to use in the draw call?
If so, I could see that, with a more relaxed FOV culling method, the IndexBuffer updates could be done only a couple of times a second...


Crash course in the GPU Command Buffer:

Every D3D state management and drawing API call creates a data packet that goes into a buffer. This is the GPU command buffer, and the GPU reads from it to know what to do next. Fences are written into the buffer by certain commands (this is up to D3D; sadly we don't get to do it ourselves), and also whenever a threshold number of bytes has been written to the buffer. A fence id is just a serial number. When resources (textures, index buffers, vertex buffers, rendertargets, and shaders) are referenced, they are tagged with the current fence serial number, and are considered in use until the CPU can determine that the GPU has read and processed the commands up to that fence id.

Now for the major part:

When you lock a resource that has an outstanding fence, it stalls until the GPU has processed up to that fence. If this resource is currently set on the hardware, it will stall up until the GPU is completely idle.

Now, from the code you have, it still looks like you are constructing some buffers in the render loop, so they are somewhat immune to this, but you would see it if you tried to recycle them. This is generally fixed by allocating dynamic index buffers and locking with D3DLOCK_DISCARD. Half the time the drivers are really just allocating more memory and queuing the original memory to be freed, so the cost of allocating a brand new index or vertex buffer isn't that bad, despite the dire warnings in the documentation or from other forum users :) The real advantage is in the number of API calls: Release, Create, and Lock is going to cost more than just a Lock with a different parameter or two (D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE), as on the PC the context switches to the OS layer are pretty expensive.
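The fence tagging described in the crash course can be pictured with a toy model. This is only an illustration of the bookkeeping, not how Direct3D or any driver is implemented. `FenceTracker` and all of its members are hypothetical.

```csharp
using System.Collections.Generic;

// Toy model: resources are tagged with the serial number of the next fence,
// and count as "in use" until the GPU reports passing that fence.
public sealed class FenceTracker
{
    private int _nextFence = 1;      // serial number of the next fence to be written
    private int _completedFence = 0; // highest fence the GPU has processed
    private readonly Dictionary<string, int> _lastUse = new Dictionary<string, int>();

    // A command referencing the resource was queued: tag it with the fence
    // that will follow that command in the buffer.
    public void Reference(string resource)
    {
        _lastUse[resource] = _nextFence;
    }

    // A fence was written into the command buffer; returns its serial number.
    public int WriteFence()
    {
        return _nextFence++;
    }

    // The GPU has processed every command up to and including this fence.
    public void GpuReached(int fenceId)
    {
        if (fenceId > _completedFence)
        {
            _completedFence = fenceId;
        }
    }

    // Locking a resource while this returns true is what would cause a stall.
    public bool IsInUse(string resource)
    {
        int tag;
        return _lastUse.TryGetValue(resource, out tag) && tag > _completedFence;
    }
}
```

In this picture, a discard-style lock avoids the wait by handing back fresh memory instead of the tagged one, which matches the behaviour described above for D3DLOCK_DISCARD.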
