
flaXen

Performance issue


flaXen    122
My 2D sprite put routine in D3D isn't very fast and I'm curious if there is anything I can do to improve its performance. It uses a vertex buffer created outside of the main routine. I suspect that the Lock'ing and Unlock'ing of the vertex buffer is the cause of the problem. The texture SHOULD be in video memory (used D3DPOOL_MANAGED in CreateTexture()). The vertices are initialized by a simple routine which loads the data into the "sq" vertex list.

The result is about 3200 100x100 sprites per second drawn. That's not nearly enough performance for a tile-based game. The tiles are 44x44 and end up getting about 14400 per second, which ends up being a maximum of 480 of those tiles per frame. That might be enough, but not nearly as fast as I would need it to be in order to draw anything else (the foreground).

Here is the actual code:

D3DCUSTOMVERTEX sq[4];
void *pVertices;

if (spr == NULL) return;
if (frame < 0 || frame > spr->Frames) return;
if (spr->Sprites[frame].sprTexture == NULL) return;

if (FAILED(g_pVBS->Lock(0, sizeof(sq), (BYTE**)&pVertices, 0))) return;
D3DCreate2DSquare(sq, x, y, x + spr->sprWid, y + spr->sprHei, spr->padWid, spr->padHei);
memcpy(pVertices, sq, sizeof(sq));
g_pVBS->Unlock();

g_pd3dDevice->SetTexture(0, spr->Sprites[frame].sprTexture);
g_pd3dDevice->SetStreamSource(0, g_pVBS, sizeof(D3DCUSTOMVERTEX));
g_pd3dDevice->SetVertexShader(D3DFVF_CUSTOMVERTEX);
g_pd3dDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, 1);
g_pd3dDevice->SetRenderState(D3DRS_SRCBLEND, D3DBLEND_SRCALPHA);
g_pd3dDevice->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_INVSRCALPHA);
g_pd3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP, 0, 2);
//g_pd3dDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, 0);

Any performance hints would be really helpful! Thanks,

-- Dan (flaXen)

CrazedGenius    156
Not locking each time would certainly help...

Also, don't set the other states any more often than you have to...

With 2D, you're going to be mostly fill limited. Total up the number of pixels drawn per second and compare that against the benchmarks for your card. Are you in the ballpark? Also, are you trying to draw more than is actually being shown?
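To make that comparison concrete with the figures from the first post (my arithmetic, using the poster's own numbers):

3,200 sprites/s x (100 x 100 px) = 32 Mpixels/s
14,400 tiles/s x (44 x 44 px) ~= 27.9 Mpixels/s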

Draigan    130
Well, like the previous post said, don't lock and add each one to the vertex buffer individually. Also, don't change render states each time. Chances are you're gonna have a few tiles that are drawn more than once on the screen. Lock the VB, add all of these at once as a triangle list using index lists, unlock, set the texture, draw them all. Or hell, add them all to the vertex buffer and draw them in groups depending on the texture.
It looks like you have each little tile as an individual texture. I imagine that 64x64 tiles would be quicker, and you could batch a bunch into a single 256x256 texture. Then you could do larger batches of vertices each frame, depending on whether each little texture is contained inside the big texture. All you'd have to do is offset your texture coordinates for this. And if you have bilinear filtering on, put a little border around them.
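A rough sketch of what that batched, indexed draw could look like in DX8-style code (the buffer names, tile count, and the BuildTileQuad helper are illustrative assumptions, not code from the thread):

// Fill the whole batch of quads in one lock instead of one lock per tile.
D3DCUSTOMVERTEX* v;
if (SUCCEEDED(g_pVB->Lock(0, numTiles * 4 * sizeof(D3DCUSTOMVERTEX), (BYTE**)&v, 0)))
{
    for (int i = 0; i < numTiles; ++i)
        BuildTileQuad(&v[i * 4], tiles[i]);   // hypothetical helper: writes 4 vertices per tile
    g_pVB->Unlock();
}

// The index buffer (6 indices per quad) is filled once at startup and reused every frame.
g_pd3dDevice->SetTexture(0, pTileTexture);
g_pd3dDevice->SetStreamSource(0, g_pVB, sizeof(D3DCUSTOMVERTEX));
g_pd3dDevice->SetIndices(g_pIB, 0);
g_pd3dDevice->SetVertexShader(D3DFVF_CUSTOMVERTEX);
g_pd3dDevice->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, numTiles * 4, 0, numTiles * 2);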

invective    118
Generally nothing should go between the lock and unlock except the copy and any logic you must have there for the copy. Updating or creating the vertices in system memory should be done before the lock.

The bigger problem seems to be you are locking for every square and making a new copy of the squares every frame. You should have one big permanent vertex buffer that holds all the squares you are going to render, and keep a copy of it in system ram. Now when you update the squares you lock the buffer and copy only the changes. Draw all the squares with one DrawIndexedPrimitive call.

One lock, one render -- it's much faster. Note that you can store the bitmap for more than one sprite in your texture. For example, if you have 16 sprites measuring 32x32 and you adjust your texture coordinates, you can dump them all into a single 128x128 texture to minimize texture changes.
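As an illustration of that texture coordinate adjustment (the 4x4 packing into a 128x128 texture and the variable names are my own, following the example above):

// UV extent of one 32x32 cell inside a 128x128 atlas texture.
const float cell = 32.0f / 128.0f;
// spriteIndex is 0..15; sprites are packed 4 per row.
float u0 = (spriteIndex % 4) * cell;
float v0 = (spriteIndex / 4) * cell;
float u1 = u0 + cell;
float v1 = v0 + cell;
// Write (u0,v0), (u1,v0), (u0,v1), (u1,v1) into the quad's vertices instead of 0..1.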

flaXen    122
Ah... This is all very good advice. Here is an additional consideration: the screen isn't aligned to the tiles. That is, it smoothly scrolls, so every time it does, it must recreate all the coordinates for the tiles, hence there aren't any precalculated tile positions and whatnot.

I have to kick myself for not seeing the texture optimization. As it stands, each tile is loaded as a normal 32x32 texture extracted from tall columns of textures (e.g. 32x128 = 4 textures). Then they're rotated into a 43x44 tile and placed in a 64x64 buffer. Each frame gets its own 64x64 buffer. Instead, I'll have a single 64xWhatever texture and load them all into that. I can then use texture coordinates to switch to different frames. Excellent.

So then vertex buffer optimization. Would it be wise to create one single massive list of vertices and fill that? It does limit me to the number of sprites I can have on screen, but if the performance is that much better, a finite limit would be smarter.

For those who care: My custom tile rotator does a SUPERIOR job. It converts 32x32 tiles into 43x44 rotated tiles (so that they mesh properly w/o overlap). It oversamples (the equivalent of bilinear interpolation) so that it retains as much quality as possible while avoiding aliasing.

invective    118
You may want to use hardware transforms and clipping if you are going to be scrolling. See the 2D in Direct3D article on this site for how to set it up.

In this case, most of your locks will become unnecessary, since you will be using matrices to transform the geometry instead. You will need a different matrix for each sprite (except the background, since all tiles in it are always transformed by the same amount).

Keeping everything in one VB should not be a problem. If it is, you can divide them up into logical groups: one background, one enemy, one player, etc.

Your steps will now be something like

setWorldMatrix (background)
DrawPrimitive (background) // you can still use one VB, but you need to specify the correct offsets into the VB here
for each enemy:
if currentenemy is alive
setWorldMatrix (currentenemy)
DrawPrimitive (currentenemy)

etc.
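A minimal sketch of those steps with the DX8 fixed-function pipeline (this assumes untransformed D3DFVF_XYZ vertices rather than pre-transformed XYZRHW ones, and the variable names are illustrative):

// Position the sprite's quad by changing the world matrix -- no vertex buffer lock needed.
D3DXMATRIX matWorld;
D3DXMatrixTranslation(&matWorld, spriteX, spriteY, 0.0f);
g_pd3dDevice->SetTransform(D3DTS_WORLD, &matWorld);

// Draw the quad that lives at a fixed offset inside the shared vertex buffer.
g_pd3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP, spriteVertexOffset, 2);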

CrazedGenius    156
quote:
Original post by invective
You may want to use hardware transforms and clipping if you are going to be scrolling.


Yes!!! If you are scrolling a bunch of tiles, I can't think of any reason why you need to recompute all the vertices. This is exactly what the hardware is good at.

flaXen    122
Well, I managed to get a single large vertex buffer running, but the performance isn't actually any better. I'm still getting on the order of 13500 44x44 tiles per second, which is, surprisingly, even a bit less than what it was before the modification. I eliminate redundant setting of render states and textures, though I still have it set up to use individual textures for each tile. That shouldn't be a problem considering this test only uses 1 texture and thus only sets the texture once.

The video card I'm developing on is a Vanta. The system is an Athlon 750. nVidia claims the Vanta should do about 200 million pixels per second. 44x44x13500 is still only 26.14 million pixels per second. I know that the Vanta is a crappy little card, but it should be able to do better than this. I haven't run the test on my own system yet.

Seeing as having one large vertex buffer doesn't help, what else can I do to improve performance? How do I check to see if textures are thrashing or anything like that?

I can't rely on hardware transforms since this game needs to be able to run on wimpy machines too.

Thanks!
-- Dan (flaXen)

Guardian_Light    122
I hate to disappoint, but it's likely you will never meet your card's "theoretical" fill limit. The keyword is theoretical. In real-world applications, you'll have to find your own fill limit.

"So much fun, so little time."
~Michael Sikora

Prosper/LOADED    100
quote:
Original post by flaXen
g_pd3dDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, 1);



I'm more of an OpenGL guy than a D3D one, but you seem to use alpha blending for every tile. The matter is that alpha blending IS SLOW (much slower than standard drawing). If you just use it as a mask, you may use an alpha test instead (I don't know how to do this in D3D, but it certainly can handle this), which is much faster.

invective    118
quote:

The matter is that alpha blending IS SLOW (much slower than standard drawing).



That is a per-card issue. My GeForce 2 gives the same frame rate for my terrain program whether or not I have alpha blending on for the 8192 water triangles being rendered. It depends on how many texels and blending ops your card can do in one pass.

CrazedGenius    156
You can use matrix transformations without necessarily relying on *hardware* T&L. The point is that the transformations in the worst case are probably faster than locking the buffer and in the best case the hardware will really help.

invective    118
quote:

The point is that the transformations in the worst case are probably faster than locking the buffer and in the best case the hardware will really help



I don't know about that if you are doing simple 2D with no rotation/scaling. If you are doing software transforms, then doesn't D3D have to lock the buffer anyway? In that case it's faster to have your own vertex buffer copy and update the coordinates with simple addition instead of matrix multiplies, then lock and copy. The big benefit of hardware transforms is really that you don't have to send all the data over the AGP bus to the card -- you just send the matrix and the card does the transform locally. That said, I'd still use D3D for the transforms -- most computers are so fast you will never see the difference in a 2D game, unless you are really rendering an obscene amount of triangles.

Even at 1024x768, with 32x32 sprites and double full coverage, you are only rendering 3072 triangles a frame, or at 60 Hz a measly 184,320 triangles a second. Even an old TNT is supposed to be able to do 6 million triangles a second, or 30 times this amount. It is actually the fill rate that will be the issue, because the TNT is going to choke and die on that many texels a second... Anyway, the point being that the geometry is not going to be your bottleneck in a 2D app as long as you do a reasonably efficient job of implementing it.

flaXen    122
That makes sense. Streamlining my geometry did nothing for performance. But still, the performance is well below even the most horrible estimate of minimum speed.

If the geometry isn't to blame, then what is? Textures? It's true that the texture uses an alpha map, but as far as I can tell, there is no genuine alpha key. The texture I've been testing with is 15-bit+Alpha (A1R5G5B5). I'd assume the video card wouldn't do real alpha calculations on a binary alpha value. ARGH! This will drive me insane. Perhaps there is some other problem...

-- Dan (flaXen)

Prosper/LOADED    100
quote:
Original post by invective
That is a per-card issue. My GeForce 2 gives the same frame rate for my terrain program whether or not I have alpha blending on for the 8192 water triangles being rendered. It depends on how many texels and blending ops your card can do in one pass.



It gives you the same frame rate because you don't reach any limit. If you render three blended 1280x1024 quads, your frame rate will be far less than with three non-blended quads.

quote:

The texture I've been testing with is 15-bit+Alpha (A1R5G5B5). I'd assume the video card wouldn't do real alpha calculations on a binary alpha value.



With OpenGL you can call glAlphaFunc( GL_GREATER, 0.5f) and after that only texels with an alpha value greater than 0.5 will be rendered. I assume there is a way to do it with Direct3D.
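In Direct3D the equivalent is the alpha-test render states; a minimal sketch, assuming a DX8 device pointer named g_pd3dDevice as in the original post:

// Reject pixels whose alpha fails the test instead of blending them.
g_pd3dDevice->SetRenderState(D3DRS_ALPHATESTENABLE, TRUE);
g_pd3dDevice->SetRenderState(D3DRS_ALPHAREF, 0x7F);            // reference alpha (0-255)
g_pd3dDevice->SetRenderState(D3DRS_ALPHAFUNC, D3DCMP_GREATER); // pass only alpha > ref
// With a 1-bit alpha format like A1R5G5B5 this gives a hard mask with no blending cost.
g_pd3dDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, FALSE);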
