Performance issue

Started by flaXen
14 comments, last by flaXen 22 years, 5 months ago
My 2D sprite put routine in D3D isn't very fast, and I'm curious if there is anything I can do to improve its performance. It uses a vertex buffer created outside of the main routine. I suspect that the Lock'ing and Unlock'ing of the vertex buffer is the cause of the problem. The texture SHOULD be in video memory (I used D3DPOOL_MANAGED in CreateTexture()). The vertices are initialized by a simple routine which loads the data into the "sq" vertex list.

The result is about 3,200 100x100 sprites per second drawn. That's not nearly enough performance for a tile-based game. The tiles are 44x44, and I end up getting about 14,400 of those per second, which works out to a maximum of 480 tiles per frame. That might be enough, but not nearly as fast as I would need it to be in order to draw anything else (the foreground).

Here is the actual code:

D3DCUSTOMVERTEX sq[4];
void *pVertices;

// bail out on a missing sprite, bad frame index, or missing texture
if (spr == NULL) return;
if (frame < 0 || frame > spr->Frames) return;
if (spr->Sprites[frame].sprTexture == NULL) return;

// lock the shared vertex buffer and copy the quad into it
if (FAILED(g_pVBS->Lock(0, sizeof(sq), (BYTE**)&pVertices, 0))) return;
D3DCreate2DSquare(sq, x, y, x + spr->sprWid, y + spr->sprHei, spr->padWid, spr->padHei);
memcpy(pVertices, sq, sizeof(sq));
g_pVBS->Unlock();

// set texture, stream, shader, and blend states, then draw the two triangles
g_pd3dDevice->SetTexture(0, spr->Sprites[frame].sprTexture);
g_pd3dDevice->SetStreamSource(0, g_pVBS, sizeof(D3DCUSTOMVERTEX));
g_pd3dDevice->SetVertexShader(D3DFVF_CUSTOMVERTEX);
g_pd3dDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, 1);
g_pd3dDevice->SetRenderState(D3DRS_SRCBLEND, D3DBLEND_SRCALPHA);
g_pd3dDevice->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_INVSRCALPHA);
g_pd3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP, 0, 2);
//g_pd3dDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, 0);

Any performance hints would be really helpful! Thanks,

-- Dan (flaXen)
Not locking each time would certainly help...

Also, don't set the other states any more often than you have to...

With 2D, you're going to be mostly fill limited. Total up the number of pixels drawn per second and compare that against the benchmarks for your card. Are you in the ballpark? Also, are you trying to draw more than is actually being shown?
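(For example, with the numbers above: 3,200 100x100 sprites per second is 3,200 x 10,000 = 32 million pixels per second, which is the figure to compare against the card's rated fill rate.)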
Author, "Real Time Rendering Tricks and Techniques in DirectX", "Focus on Curves and Surfaces", A third book on advanced lighting and materials
Well, like the previous post said, don't lock and add to the vertex buffer for each one. Also, don't change render states each time. Chances are you're gonna have a few tiles that are drawn more than once on the screen. Lock the VB, add all of these at once as a triangle list using index lists, unlock, set the texture, and draw them all. Or hell, add them all to the vertex buffer and draw them in groups depending on the texture.
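A rough sketch of what that could look like (assumptions: the vertex buffer was created with D3DUSAGE_DYNAMIC so D3DLOCK_DISCARD is legal, g_pIB is an index buffer pre-filled once with two triangles per quad, and numVisible, visible[], BuildQuad() and g_pTileTexture are made-up names for illustration):

D3DCUSTOMVERTEX* v;
if (SUCCEEDED(g_pVBS->Lock(0, numVisible * 4 * sizeof(D3DCUSTOMVERTEX), (BYTE**)&v, D3DLOCK_DISCARD)))
{
    // copy every visible quad with a single lock
    for (int i = 0; i < numVisible; i++)
        BuildQuad(&v[i * 4], visible[i]); // fills the 4 vertices of quad i
    g_pVBS->Unlock();
}

// set states and texture once per batch, not once per sprite
g_pd3dDevice->SetTexture(0, g_pTileTexture);
g_pd3dDevice->SetStreamSource(0, g_pVBS, sizeof(D3DCUSTOMVERTEX));
g_pd3dDevice->SetVertexShader(D3DFVF_CUSTOMVERTEX);
g_pd3dDevice->SetIndices(g_pIB, 0);

// one call draws all the quads: 4 vertices and 2 triangles per sprite
g_pd3dDevice->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, numVisible * 4, 0, numVisible * 2);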
It looks like you have each little tile as an individual texture. I imagine that 64x64 tiles would be quicker, and you could batch a bunch into a single 256x256 texture. Then you could do larger batches of vertices each frame, depending on whether each little texture is contained inside the big texture. All you'd have to do is offset your texture coordinates for this. And if you have bilinear filtering on, put a little border around them.
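To make the offset idea concrete, here is a sketch for the 256x256 atlas of 64x64 tiles mentioned above (tile index n runs over a 4x4 grid; the half-texel inset is one common way of providing that border against bilinear bleeding):

const float TILE = 64.0f / 256.0f;  // one tile's size in texture coordinates
float u0 = (n % 4) * TILE;          // left edge of tile n in the atlas
float v0 = (n / 4) * TILE;          // top edge of tile n
float u1 = u0 + TILE;
float v1 = v0 + TILE;

// with bilinear filtering on, inset by half a texel so neighbours don't bleed
const float HALF_TEXEL = 0.5f / 256.0f;
u0 += HALF_TEXEL; v0 += HALF_TEXEL;
u1 -= HALF_TEXEL; v1 -= HALF_TEXEL;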
Generally, nothing should go between the Lock and Unlock except the copy and any logic you must have there for the copy. Updating or creating the vertices in system memory should be done before the lock.

The bigger problem seems to be that you are locking for every square and making a new copy of the squares every frame. You should have one big permanent vertex buffer that holds all the squares you are going to render, and keep a copy of it in system RAM. When you update the squares, you lock the buffer and copy only the changes. Draw all the squares with one DrawIndexedPrimitive call.

One lock, one render -- it's much faster. Note that you can store the bitmap for more than one sprite in your texture. For example, if you have 16 sprites measuring 32x32 and you adjust your texture coordinates, you can dump them all into a single 128x128 texture to minimize texture changes.
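As a sketch of that update step (sysVerts[], dirty[] and numDirty are invented names for the system-RAM copy and the list of changed squares; passing (0, 0) to Lock locks the whole buffer):

BYTE* p;
if (SUCCEEDED(g_pVBS->Lock(0, 0, &p, 0)))
{
    for (int k = 0; k < numDirty; k++)
    {
        int i = dirty[k]; // index of a square that changed this frame
        memcpy(p + i * 4 * sizeof(D3DCUSTOMVERTEX),
               &sysVerts[i * 4],
               4 * sizeof(D3DCUSTOMVERTEX)); // copy only the changed square
    }
    g_pVBS->Unlock();
}
// ...then render everything with a single DrawIndexedPrimitive call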
Ah... This is all very good advice. Here is an additional consideration: the screen isn't aligned to the tiles. That is, it smoothly scrolls, so every time it does, it must recreate all the coordinates for the tiles -- hence there aren't any precalculated tile positions and whatnot.

I have to kick myself for not seeing the texture optimization. As it stands, each tile is loaded as a normal 32x32 texture extracted from tall columns of textures (e.g. 32x128 = 4 textures). Then they're rotated into 43x44 tiles and placed in 64x64 buffers; each frame gets its own 64x64 buffer. Instead, I'll have a single 64xWhatever texture and load them all into that. I can then use texture coordinates to switch to different frames. Excellent.

Now for the vertex buffer optimization: would it be wise to create one single massive list of vertices and fill that? It does limit the number of sprites I can have on screen, but if the performance is that much better, a finite limit would be the smarter choice.

For those who care: My custom tile rotator does a SUPERIOR job. It converts 32x32 tiles into 43x44 rotated tiles (so that they mesh properly w/o overlap). It oversamples (the equivalent of bilinear interpolation) so that it retains as much quality as possible while avoiding aliasing.
You may want to use hardware transforms and clipping if you are going to be scrolling. See the "2D in Direct3D" article on this site for how to set it up.

In this case, most of your locks become unnecessary, since you will be using matrices to transform the geometry instead. You will need a different matrix for each sprite (except the background, since all of its tiles are always transformed by the same amount).

Keeping everything in one vb should not be a problem. If it is you can divide them up into logical groups: one background, one enemy, one player, etc.

Your steps will now be something like

setWorldMatrix (background)
DrawPrimitive (background) // you can still use one vb, but you need to specify the correct offsets into the vb here
for each enemy:
    if currentenemy is alive
        setWorldMatrix (currentenemy)
        DrawPrimitive (currentenemy)

etc.
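In D3D that might look roughly like this (a sketch only: it assumes untransformed D3DFVF_XYZ vertices so the fixed-function pipeline does the transform, a 640x480 target, and made-up names like scrollX, numEnemies, and enemy[i].vbOffset):

// one-time setup: a pixel-aligned orthographic projection via D3DX
D3DXMATRIX proj;
D3DXMatrixOrthoOffCenterLH(&proj, 0.0f, 640.0f, 480.0f, 0.0f, 0.0f, 1.0f);
g_pd3dDevice->SetTransform(D3DTS_PROJECTION, &proj);

// per frame: move geometry with matrices instead of re-locking the VB
D3DXMATRIX world;
D3DXMatrixTranslation(&world, -scrollX, -scrollY, 0.0f); // scroll the whole background at once
g_pd3dDevice->SetTransform(D3DTS_WORLD, &world);
g_pd3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP, 0, 2); // or the whole background batch

for (int i = 0; i < numEnemies; i++)
{
    if (!enemy[i].alive) continue;
    D3DXMatrixTranslation(&world, enemy[i].x, enemy[i].y, 0.0f);
    g_pd3dDevice->SetTransform(D3DTS_WORLD, &world);
    g_pd3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP, enemy[i].vbOffset, 2);
}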
quote:Original post by invective
You may want to use hardware transforms and clipping if you are going to be scrolling.


Yes!!! If you are scrolling a bunch of tiles, I can't think of any reason why you need to recompute all the vertices. This is exactly what the hardware is good at.

Author, "Real Time Rendering Tricks and Techniques in DirectX", "Focus on Curves and Surfaces", A third book on advanced lighting and materials
Well, I managed to get a single large vertex buffer running, but the performance isn't actually any better. I'm still getting on the order of 13,500 44x44 tiles per second, which is, surprisingly, a little less than what it was before the modification. I eliminated redundant setting of render states and textures, though I still have it set up to use individual textures for each tile. That shouldn't be a problem considering this test only uses one texture and thus only sets the texture once.

The video card I'm developing on is a Vanta; the system is an Athlon 750. nVidia claims the Vanta should do about 200 million pixels per second, and 44 x 44 x 13,500 is still only about 26.14 million pixels per second. I know the Vanta is a crappy little card, but it should be able to do better than this. I haven't run the test on my own system yet.

Seeing as having one large vertex buffer doesn't help, what else can I do to improve performance? How do I check whether textures are thrashing or anything like that?

I can't rely on hardware transform, since this game needs to be able to run on wimpy machines too.

Thanks!
-- Dan (flaXen)
I hate to disappoint, but it's likely you will never meet your card's "theoretical" fill limit. The keyword is theoretical. In real-world applications, you'll have to find your own fill limit.

"So much fun, so little time."
~Michael Sikora
quote:Original post by flaXen
g_pd3dDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, 1);


I'm more of an OpenGL guy than a D3D one, but you seem to use alpha blending for every tile. The thing is, alpha blending IS SLOW (much slower than standard drawing). If you just use it as a mask, you can use an alpha test instead (I don't know how to do this in D3D, but it can certainly handle it), which is much faster.
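For reference, the D3D render states for an alpha test look something like this (a sketch; the 0x80 reference value is an arbitrary choice):

// reject pixels whose alpha is below the reference instead of blending them
g_pd3dDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, FALSE);
g_pd3dDevice->SetRenderState(D3DRS_ALPHATESTENABLE, TRUE);
g_pd3dDevice->SetRenderState(D3DRS_ALPHAREF, 0x80);
g_pd3dDevice->SetRenderState(D3DRS_ALPHAFUNC, D3DCMP_GREATEREQUAL);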

