Slow drawing of textured quads with Direct3D8

Hi, I don't know what I'm doing wrong, but my game runs extremely slowly. I get 40 FPS, which isn't much (I have an ATI Radeon 7000). I've implemented an index buffer, and I work with textured quads. In my game loop, I'm drawing 150 quads. These are my D3D8 functions:
/* begins a set of textured quads */
/* lock vertex buffer */
this->vertex_buffer->Lock(0, 0, (uchar **) &(this->vertices), 0);

/* reset vertices count */
this->vertices_count = 0;

/* set texture */
this->device->SetTexture(0, texture);
}

int i;

/*
0 ----- 1
| \     |
|   \   |
|     \ |
3 ----- 2

0: x, y -> left, top
1: x + width, y -> right, top
2: x + width, y + height -> right, bottom
3: x, y + height -> left, bottom

0, 2 are doubled

6 indices, 4 vertices
*/

/* setup vertices */
i = this->vertices_count;

this->vertices[i].x     = (float) dest->left;
this->vertices[i].y     = (float) dest->top;
this->vertices[i].z     = 1.0f;
this->vertices[i].rhw   = 1.0f;
this->vertices[i].color = D3DCOLOR_XRGB(255, 255, 255);
this->vertices[i].u     = 0.0f;
this->vertices[i].v     = 0.0f;

this->vertices[i + 1].x     = (float) dest->right;
this->vertices[i + 1].y     = (float) dest->top;
this->vertices[i + 1].z     = 1.0f;
this->vertices[i + 1].rhw   = 1.0f;
this->vertices[i + 1].color = D3DCOLOR_XRGB(255, 255, 255);
this->vertices[i + 1].u     = 1.0f;
this->vertices[i + 1].v     = 0.0f;

this->vertices[i + 2].x     = (float) dest->right;
this->vertices[i + 2].y     = (float) dest->bottom;
this->vertices[i + 2].z     = 1.0f;
this->vertices[i + 2].rhw   = 1.0f;
this->vertices[i + 2].color = D3DCOLOR_XRGB(255, 255, 255);
this->vertices[i + 2].u     = 1.0f;
this->vertices[i + 2].v     = 1.0f;

this->vertices[i + 3].x     = (float) dest->left;
this->vertices[i + 3].y     = (float) dest->bottom;
this->vertices[i + 3].z     = 1.0f;
this->vertices[i + 3].rhw   = 1.0f;
this->vertices[i + 3].color = D3DCOLOR_XRGB(255, 255, 255);
this->vertices[i + 3].u     = 0.0f;
this->vertices[i + 3].v     = 1.0f;

/* increase vertices count */
this->vertices_count += 4;

/* flush the buffer if it's full */
if (this->vertices_count == (VERTEX_BUFFER_SIZE * 4)) {
/* unlock vertex buffer */
this->vertex_buffer->Unlock();

/* draw quads in the buffer */
this->device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, this->vertices_count, 0, this->vertices_count / 2);

/* reset vertices count */
this->vertices_count = 0;

/* lock vertex buffer */
this->vertex_buffer->Lock(0, 0, (uchar **) &(this->vertices), 0);
}
}

/* ends a set of textured quads */
/* unlock vertex buffer */
this->vertex_buffer->Unlock();

/* flush the buffer if it isn't empty */
if (this->vertices_count != 0) {
this->device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, this->vertices_count, 0, this->vertices_count / 2);

/* reset vertices count */
this->vertices_count = 0;
}
}

/* begins scene */
inline void d3d_class::begin_scene(void) {
/* clear the screen */
this->device->Clear(0, NULL, D3DCLEAR_TARGET, D3DCOLOR_XRGB(0, 0, 0), 0.0f, 0);

/* begin scene */
this->device->BeginScene();
}

/* ends scene */
inline void d3d_class::end_scene(void) {
/* end the scene */
this->device->EndScene();

/* present the scene */
this->device->Present(NULL, NULL, NULL, NULL);
}


Does anybody know what I'm doing wrong? I call begin_scene, begin_quad_set once, add_quad 150 times and then end_quad_set and end_scene. Thanks, Gerben VV
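
The index buffer setup is not shown in the post. A minimal sketch of the pattern the diagram implies, assuming 16-bit indices and the triangle order (0, 1, 2), (0, 2, 3) along the drawn diagonal; the function name build_quad_indices is hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the index list for `quad_count` quads, 4 vertices and 6 indices each.
// Per quad the triangles are (0, 1, 2) and (0, 2, 3), so the two vertices on
// the diagonal (0 and 2) are the ones referenced twice.
// With 16-bit indices this only works for up to 16384 quads (65536 vertices).
std::vector<std::uint16_t> build_quad_indices(std::size_t quad_count) {
    std::vector<std::uint16_t> indices;
    indices.reserve(quad_count * 6);
    for (std::size_t q = 0; q < quad_count; ++q) {
        const std::uint16_t base = static_cast<std::uint16_t>(q * 4);
        indices.push_back(base + 0);  // first triangle: top-left, top-right, bottom-right
        indices.push_back(base + 1);
        indices.push_back(base + 2);
        indices.push_back(base + 0);  // second triangle: top-left, bottom-right, bottom-left
        indices.push_back(base + 2);
        indices.push_back(base + 3);
    }
    return indices;
}
```

Since the pattern never changes, a list like this would only need to be copied into the index buffer once at startup.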

Hi.
Setting the vertices 150 times per render loop is much slower. Fill one large vertex buffer once, when you initialize your application.
Then you just render that vertex buffer once per render loop.

From a quick look at your code it seems that you'll be holding the lock on the buffer for a relatively long period of time. Whilst you hold that lock you could be stalling the pipeline - not good [smile]

Try re-writing it so that you compose the buffer entirely in system memory and then do a quick lock-memcpy-unlock operation when you need to.

Also, look into the various locking flags (discard for example) and verify that you're creating the vertex buffer with the correct usage flags. The Debug runtimes will usually scream-and-shout at you if you're doing anything obviously wrong here (you have run it against the debugs, right?)
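
A minimal sketch of the lock-memcpy-unlock idea, assuming the vertex layout from the original post. StandInVertexBuffer and upload are hypothetical names; the stand-in class only exists so the sketch compiles without the DirectX SDK, and its Lock/Unlock would be the real IDirect3DVertexBuffer8 calls in practice:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed layout of the pre-transformed, coloured, textured vertex.
struct Vertex {
    float x, y, z, rhw;
    std::uint32_t color;
    float u, v;
};

// Hypothetical stand-in for the real vertex buffer. With D3D8 the two calls
// would be IDirect3DVertexBuffer8::Lock (possibly with D3DLOCK_DISCARD on a
// buffer created as dynamic) and IDirect3DVertexBuffer8::Unlock.
class StandInVertexBuffer {
public:
    explicit StandInVertexBuffer(std::size_t max_vertices) : storage_(max_vertices) {}
    Vertex* Lock() { locked_ = true; return storage_.data(); }
    void Unlock() { locked_ = false; }
    bool is_locked() const { return locked_; }
    std::size_t capacity() const { return storage_.size(); }
    const Vertex& at(std::size_t i) const { return storage_[i]; }
private:
    std::vector<Vertex> storage_;
    bool locked_ = false;
};

// Copies as many staged vertices as fit, inside one short lock/unlock window.
// All per-quad work happens earlier, on the system-memory vector `staged`.
// Returns the number of vertices uploaded.
std::size_t upload(StandInVertexBuffer& vb, const std::vector<Vertex>& staged) {
    const std::size_t count =
        staged.size() < vb.capacity() ? staged.size() : vb.capacity();
    if (count == 0) return 0;
    Vertex* dst = vb.Lock();
    std::memcpy(dst, staged.data(), count * sizeof(Vertex));
    vb.Unlock();
    return count;
}
```

The point of the design is that the buffer is only locked for the duration of a single memcpy, not while 150 quads are being computed.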

One final note - the best optimization in this sort of case is at the algorithmic level. Resource modification is painful - you can alleviate some of the pain, but it's never going to be "nice"... so micro-optimization might get you somewhere, but the biggest gains are likely to come from changing the way the application uses/needs locks.

A trivial example - some sort of double-buffering approach so that you don't keep changing the same VB repeatedly. Maybe try to take advantage of any temporal coherency and cache common results between frames so that you only do work when something needs changing.

hth
Jack

Thanks,

Quote:
 From a quick look at your code it seems that you'll be holding the lock on the buffer for a relatively long period of time. Whilst you hold that lock you could be stalling the pipeline - not good [smile]

I tried to lock the VB as little as possible, because every lock takes time.
What I figured out is that it is the Present() call that takes a long time.
I don't really understand the stalling pipeline problem.

Quote:
 Try re-writing it so that you compose the buffer entirely in system memory and then do a quick lock-memcpy-unlock operation when you need to.

I'll try setting the VB content in system-memory and then memcpy it, but it takes extra memory.

Quote:
 A trivial example - some sort of double-buffering approach so that you don't keep changing the same VB repeatedly. Maybe try to take advantage of any temporal coherency and cache common results between frames so that you only do work when something needs changing.

Can you actually setup 2 VBs?

Algorithmic level? Could you explain this to me? I don't understand it.

Finally, I don't know how to set different textures if I copy the VB from system memory to video memory.

This'll have to be quick - I've gotta go out in 5 [smile]

Quote:
Original post by gerbenvv
Quote:
 From a quick look at your code it seems that you'll be holding the lock on the buffer for a relatively long period of time. Whilst you hold that lock you could be stalling the pipeline - not good [smile]

I tried to lock the VB as little as possible, because every lock takes time.
What I figured out is that it is the Present() call that takes a long time.
I don't really understand the stalling pipeline problem.

Direct3D's rendering is a pipeline - a series of connected stages, each taking input and then passing on its output to the next stage. If one stage has to wait for something else to finish, or for its input data to be ready, then it is considered "stalled" - it's not doing anything useful.

If you hold a lock on a vertex buffer then, by definition, no part of the graphics pipeline can render from it (because it might be in the process of being changed).

Quote:
Original post by gerbenvv
Quote:
 Try re-writing it so that you compose the buffer entirely in system memory and then do a quick lock-memcpy-unlock operation when you need to.

I'll try setting the VB content in system-memory and then memcpy it, but it takes extra memory.

Yup, it'll take extra memory - but that really shouldn't be a problem. Even a large vertex buffer will only take a few hundred kilobytes of system memory, which is a drop in the ocean for modern machines with 256, 512 or 1024 MB.

Quote:
Original post by gerbenvv
Quote:
 A trivial example - some sort of double-buffering approach so that you don't keep changing the same VB repeatedly. Maybe try to take advantage of any temporal coherency and cache common results between frames so that you only do work when something needs changing.

Can you actually setup 2 VBs?

Yup, create as many vertex buffers (IDirect3DVertexBuffer8 in D3D8, IDirect3DVertexBuffer9 in D3D9) as you want [smile] You typically use one at a time (although even this rule can be "broken") but there's nothing wrong with having many vertex buffers.

Quote:
Original post by gerbenvv
algorithmic level?? You really should explain this to me, coz I don't understand this.

The algorithmic level is how you design your code - the steps you implement to solve your problem. You might want to look up "Big O" notation if you're not familiar with it. It's mathematically provable that some algorithms scale better than others - for example, on large inputs "Quick Sort" is almost always going to be much faster than "Bubble Sort" no matter how much you try to optimize both [smile]

Quote:
Original post by gerbenvv
Finally, I don't know how to set different textures if I would copy the VB from system to video memory.

Each time the texture changes you have to call SetTexture() again before drawing; there's not really any sensible way around this. Remember, though, that if you set the parameters to DrawPrimitive() or DrawIndexedPrimitive() appropriately you don't have to render the entire vertex buffer in one call. You could quite conceivably render a large buffer in 4 different draw calls and change the texture between them.
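
As a rough sketch of what "setting the parameters appropriately" could look like for the quad layout in this thread (4 vertices and 6 indices per quad, index buffer holding absolute vertex indices with a base vertex index of 0). DrawRange and quad_range are hypothetical names; the four values correspond to the MinIndex, NumVertices, StartIndex and PrimitiveCount arguments of the D3D8 DrawIndexedPrimitive call:

```cpp
// Arguments for one DrawIndexedPrimitive(D3DPT_TRIANGLELIST, ...) call that
// renders only part of a shared vertex/index buffer.
struct DrawRange {
    unsigned min_index;        // lowest vertex index used by this draw
    unsigned num_vertices;     // number of vertices spanned by this draw
    unsigned start_index;      // first entry in the index buffer to read
    unsigned primitive_count;  // number of triangles to draw
};

// Computes the range covering `quad_count` consecutive quads, starting at
// quad number `first_quad` within the buffer.
DrawRange quad_range(unsigned first_quad, unsigned quad_count) {
    DrawRange range;
    range.min_index = first_quad * 4;
    range.num_vertices = quad_count * 4;
    range.start_index = first_quad * 6;
    range.primitive_count = quad_count * 2;
    return range;
}
```

For example, if quads 0-9 use one texture and quads 10-14 use another, the second draw would use quad_range(10, 5) after a SetTexture call for the second texture.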

hth
Jack

I just want to clear up a common problem with people using Big-O notation and "proofs".

Big-O notation is used to measure and compare "complexity": how an algorithm's running time or memory use grows when you have a very large number of elements. Two algorithms with the same runtime / space complexity are treated as theoretically the same speed, because constant factors are ignored.

This works in the classroom and on paper when doing research in computer science. But in the real world those constants that get ignored ( O ( 2198120938 x n ) ~= O ( n ) ) are very, very important... For example, the original poster is running some code on 150 textured quads, so n = 150 for simplicity's sake. Let's say one algorithm runs in O(n) time and uses 150 instructions per quad... that sounds fast, linear time, O(n), etc... but let's say you have an O(n log n) algorithm ("slower") that only uses 10 instructions per iteration...

O(n) < O(n log n)... but 150 * n (= 150 * 150 = 22,500) is worse than 10 * 150 * log(150) (about 10,800, taking log base 2)...
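
The same comparison as a small sketch, assuming the two made-up cost models above (150 instructions per element for the linear algorithm, 10 per element per log step for the other) and log base 2. The function names are hypothetical:

```cpp
#include <cmath>

// Estimated instruction count of the "fast" O(n) algorithm with a large constant.
double cost_linear(double n) { return 150.0 * n; }

// Estimated instruction count of the "slow" O(n log n) algorithm with a small constant.
double cost_nlogn(double n) { return 10.0 * n * std::log2(n); }
```

Under these assumptions the two costs are equal when log2(n) = 15, that is at n = 32768; below that the O(n log n) version is cheaper, and only above it does the O(n) version win.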

Sorry about the O(n) rant, but in the workplace far too often I'll show optimizations that greatly reduce the operational constants of a known algorithm that has the "best proven runtime complexity" and people just ignore them: "the constants are insignificant"... which is usually not the case. Unless you can prove that your n is going to be much larger than a couple of thousand, try optimizing the algorithms for constants as well.

to gerbenvv:

If you still care about optimizing code in general, every bit helps... You might want to avoid doing some of the things you have there...

Every time you write "this->vertices[i + some number].some variable = x;" the compiler may be generating a lot of instructions for you that you don't need...

Try setting a temporary variable to a reference to the vertex in question.

Vertex &tempVert = vertices[i + 3]; for example... Now not only are you doing i + 3 only once instead of 7 times (once per field), you are also not dereferencing the array pointer that many times... Sometimes the compiler can optimize this out, but in your case it's based on a for loop index, so it will probably assume there is a reason you are dereferencing that many times.
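
A minimal sketch of the reference idea, assuming the vertex layout from the original post; fill_vertex is a hypothetical helper, not something from the poster's code. Whether it is actually faster depends on the compiler, so it is worth measuring:

```cpp
#include <cstdint>

// Assumed layout of the pre-transformed, coloured, textured vertex.
struct Vertex {
    float x, y, z, rhw;
    std::uint32_t color;
    float u, v;
};

// Fills one vertex through a local reference, so the address of
// vertices[index] is computed once instead of once per field.
void fill_vertex(Vertex* vertices, int index, float x, float y, float u, float v) {
    Vertex& vert = vertices[index];
    vert.x = x;
    vert.y = y;
    vert.z = 1.0f;
    vert.rhw = 1.0f;
    vert.color = 0xFFFFFFFFu;  // opaque white, the value D3DCOLOR_XRGB(255, 255, 255) expands to
    vert.u = u;
    vert.v = v;
}
```

A quad would then be four calls, for example fill_vertex(vertices, i + 3, left, bottom, 0.0f, 1.0f) for the bottom-left corner.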

If you were only doing it once or twice, an extra 40 instructions per rendering loop would mean nothing to the computers of today... but you are doing it 40 times per loop iteration, and that really adds up... especially if you want to have time to do game logic, input, sound, etc., or port it to the GBA, where the processor has far fewer cycles available per second.

The second thing is using the D3DCOLOR_XRGB() macro to set all your colours to white... It doesn't look like those change, so you could do D3DCOLOR whiteColour = D3DCOLOR_XRGB(255, 255, 255); once and then just use that variable... This isn't a big deal, but the macro in question does a few bit shifts and will use some scratch registers to do it. I'm all about optimizing out as many instructions as you can... bit shifts are usually fast on all machines, but you never know...

Sorry that these are all terribly small things; just keep them in mind if you apply to game development studios doing work on non-PC hardware... They are often fighting for clock cycles, especially when the AI guys keep thinking they should be writing physics code in their AI code ;) damn AI....

And I agree with jollyjeffers on the locking issue... Stalling the graphics pipeline is always bad; you want to hold that lock for as short a time as possible. And don't worry, memcpy is fast... and if not, write an MMX / SSE version to be faster :P

btw I love your commenting style, I always draw little ASCII pictures in methods, it's fun :)

Well, thank you LEET_developer!

This is my new idea to do it:
setup quads -> realloc each time I add a new quad
sort quads by texture -> so I don't have to set texture many times
for each n vertices that fit in VB
    copy n vertices in sys memory

    lock VB
        copy n vertices to VB
    unlock VB

    for each n vertices that have same texture
        SetTexture
        DrawIndexedPrimitive n vertices
    end
end

Any ideas to improve this? [smile]
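
The "sort quads by texture" step of the pseudocode could be sketched roughly like this. Quad, Batch and sort_and_batch are hypothetical names, and the integer texture_id stands in for whatever texture handle the engine really uses:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A queued quad; texture_id is an assumed stand-in for a texture pointer.
struct Quad {
    int texture_id;
    float left, top, right, bottom;
};

// A run of consecutive quads (after sorting) that share one texture.
struct Batch {
    int texture_id;
    std::size_t first_quad;
    std::size_t quad_count;
};

// Sorts the quads by texture and returns one batch per texture, so that
// each batch needs a single SetTexture and a single draw call.
std::vector<Batch> sort_and_batch(std::vector<Quad>& quads) {
    std::stable_sort(quads.begin(), quads.end(),
                     [](const Quad& a, const Quad& b) {
                         return a.texture_id < b.texture_id;
                     });
    std::vector<Batch> batches;
    for (std::size_t i = 0; i < quads.size(); ++i) {
        if (batches.empty() || batches.back().texture_id != quads[i].texture_id) {
            batches.push_back(Batch{quads[i].texture_id, i, 0});
        }
        ++batches.back().quad_count;
    }
    return batches;
}
```

One caveat: sorting changes the draw order, which matters if quads overlap or use alpha blending. stable_sort at least keeps the original order among quads that share a texture.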

I won't go picking at your pseudocode :) but you sorted by texture and then called SetTexture on every iteration of the loop... I'm not sure if DirectX has a check to see if you've already set that texture or if it just goes and does it. In OpenGL I usually set a texture with a method I wrote that checks whether the texture is already the current one; it's like a "free" optimization.
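
A minimal sketch of that kind of wrapper. TextureBinder is a hypothetical name, and the callback stands in for the real call (device->SetTexture(0, texture) in D3D8, or the OpenGL bind call):

```cpp
#include <functional>
#include <utility>

// Remembers the last texture that was set and skips redundant calls.
class TextureBinder {
public:
    explicit TextureBinder(std::function<void(const void*)> set_texture)
        : set_texture_(std::move(set_texture)) {}

    // Returns true if the underlying call was actually made,
    // false if the texture was already the current one.
    bool bind(const void* texture) {
        if (has_current_ && texture == current_) {
            return false;
        }
        set_texture_(texture);
        current_ = texture;
        has_current_ = true;
        return true;
    }

private:
    std::function<void(const void*)> set_texture_;
    const void* current_ = nullptr;
    bool has_current_ = false;
};
```

This only stays correct if every texture change goes through the wrapper; any code that sets a texture on the device directly would leave the cached value stale.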

But sorting is a good idea... Another thing you could do (I don't know if you are familiar with data structures) is have a hash table of buckets.

A hash table is basically an associative array. Think of it like doing this:

quadList["texture name"] and that would return you a quad (or a list of quads, in my bucket example) that all have the same texture... This way you wouldn't have to sort them, you just add them to the appropriate list as they are created...
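
A small sketch of the bucket idea using the C++ standard library (std::unordered_map, available since C++11, rather than a hand-written hash table). Quad, QuadBuckets and add_quad are hypothetical names:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// A queued quad; only the destination rectangle is stored here.
struct Quad {
    float left, top, right, bottom;
};

// One bucket (list of quads) per texture name.
using QuadBuckets = std::unordered_map<std::string, std::vector<Quad>>;

// Appends the quad to the bucket for its texture, creating the bucket
// on first use. No sorting step is needed afterwards.
void add_quad(QuadBuckets& buckets, const std::string& texture_name, const Quad& quad) {
    buckets[texture_name].push_back(quad);
}
```

At draw time you would loop over the buckets, set each texture once and draw all quads in that bucket. Keying on the texture pointer instead of a name string would avoid hashing strings every time a quad is added.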

With every implementation there are tradeoffs, and it usually comes down to speed vs memory in the end... So you could even just allocate a big array for quads (if you want speed instead of space) and have the first 200 be for the first texture, the next 200 for the next texture, and so on... That also allows you to eliminate the sorting step. I don't know how often you add quads, but sorting could turn out to be an issue.

I don't know what limitations DirectX has on vertex buffers, but you could try to make one big enough for your maximum number of quads... Then at least you'll only do one lock and one memcpy... But again, I'm fairly new to DirectX, so maybe there is a limitation that won't allow you to make them big enough.

I didn't know you could use associative arrays in C++. (like PHP has)

But anyway, I'm gonna take a look at the MSDN docs for SetTexture.

And about the size of the VB, I don't know how many vertices I'm gonna draw, so I think I'm gonna code my pseudo code. [smile]

LEET_Developer - you make some good/valid points, but I have to disagree in some ways. Micro-optimization is indeed a useful trick, but in all my years of programming (not just games/graphics) it's yielded smaller returns than algorithmic optimization.

Two examples off the top of my head:

1. Terrain rendering. In my early days I spent *ages* micro-optimizing my brute-force approach. I eliminated almost every unnecessary cycle and optimized every data structure to be properly aligned. Then I spent a day implementing quadtrees - using some heavy data duplication - and performance went up 10-20 fold.

2. General object culling. I had a highly optimized frustum culling algorithm that went through all the objects and checked them - but only when the object or the camera had moved. It was about as tight as I could get it. I later went in and implemented hierarchical culling (with groups of objects as well as sub-objects) and it was almost as if culling was free - it no longer even registered on my profiler [grin]

Then there is always the absolute classic argument against micro-optimization: Compilers. If you study the theory behind even a simple compiler you'll see that they'll optimize away a lot of the things that you've mentioned. Expression simplification, constant folding, dead-code removal.. Not only does this mean that you don't have to worry about it, but it means you don't need to confuse your source code and make it harder to read.

Also, if you start implementing your own low-level optimizations (especially if you break-out to assembly [oh]) then you're limiting the choices that a compiler has - and if you're going to try and out-optimize an optimizing compiler you really need to be very sure that what you're writing is better than what it can generate.

Quote:
 Any ideas to improve this?

Based on your original code, that looks good to me. The only ways I can see of improving it are along the lines of temporal optimization - only update things if/when you have to. Whether you can actually do this very much depends on the rest of your program's flow/structure [smile]

hth
Jack

EDIT: Missed that SetTexture [headshake]. It's likely that the runtime or the driver will spot the duplicate SetTexture calls, but it's best not to rely on such things.
EDIT 2: I need to pay more attention [headshake]

[Edited by - jollyjeffers on December 8, 2005 8:45:02 AM]

Quote:
 Moving the SetTexture outside the loop is a good idea

Could you explain this to me?
Why? (I don't see why I should move it?)
Which loop? (There are two)
Where? (after inner, outer loop, before inner, outer loop)
setup quads -> realloc each time I add a new quad
sort quads by texture -> so I don't have to set texture many times
for each n vertices that fit in VB
    copy n vertices in sys memory

    lock VB
        copy n vertices to VB
    unlock VB

    for each n vertices that have same texture
        SetTexture
        DrawIndexedPrimitive n vertices
    end
end

Duplicate SetTexture calls? There is actually only one SetTexture call.

PS:

I wrote a complete DDraw engine, and after it was done, it turned out to be too slow. [grin]

Ignore that bit about moving the SetTexture call - it's wrong. Sorry!

Jack

gerbenvv:

Move the SetTexture call outside the inner loop. Maybe you were already planning on moving it outside, but your for loop says

"for n vertices that have the same texture"
and then on each of these iterations (the code inside the loop) you are setting the texture, but you already said in the for loop that they all have the same texture... no need to set it again... just set it once outside. Unless I'm confused about what you wrote, in which case I apologize.

And sorry, associative arrays are not part of the C++ language itself; I was just showing you what I meant by hash tables (depending on where / when you learn about them they can be called a million different things)... You'd have to implement one yourself or use the STL, which I would recommend against in a game context; you're better off writing your own to be faster and optimized for your uses.

jollyjeffers:

Thanks, and I'll have to disagree with your statements :P...

1. Sure, a quadtree will make rendering that much faster... If a graphics card can push 1 million triangles per second, and you are pushing one 20th of the triangles each frame, you'll get roughly 20 times the frames per second (give or take, with the page flipping and whatnot being constant across the board, but you get the picture). My point was that too often people ignore constant-level optimizations in algorithms, saying "well, it was n^2 and now it's n, so all is good", when if n is always less than your constant work factor, your n algorithm is actually slower... A good example would be if your terrain never had more than 30 vertices (really, really crappy terrain)... You wouldn't bother doing all the math involved in frustum culling, since your card can render more than 600 frames of this terrain as it is without culling.

This also applies to your hierarchical culling example... If you are making a fighting game or something, or even one of those old-school beat 'em ups that had at most 6 people somewhere in the room and maybe a weapon on the ground... you wouldn't bother with hierarchical culling for something like this, because it would actually be slower...

These are oversimplified examples. Usually what you want to do is design an algorithm, do case studies (best / average / worst case), and figure out how often you expect the worst case and how much slower it is than your average case, how important it is for the best case to be as fast as possible, etc... On an application level it's very important to remember that O(n) is not ALWAYS better than O(n^2) in the real world, where we do not have infinite data sets.

Compilers do a great job at optimizing your code, but there are specific things compilers cannot do.

I'm pretty sure that if you have a dynamic loop:

for (i = 0; i < variable; i++)

where variable is not guaranteed to be any specific value (maybe it's based on user input), the compiler has a hard time optimizing array accesses inside the loop based on that i index... as it should. I think compilers that are in use today will still generate better code if you set a temporary reference to the elements you are looking at, to avoid the dereferencing all the way through. That being said, I haven't looked into that for a long time, so I could definitely be wrong :)

I should have specified: by assembly I mean writing optimizations in MMX/SSE. Being able to divide your data set by 4 on some operations is very useful, and there aren't many compilers that take advantage of these features... You just have to make sure you have a legacy code path for people running Pentium 133s and lower :P but I doubt they'll have a card that supports DirectX 9 anyway.

And I definitely know that compilers won't turn all your divides into bit shifts or multiplications as necessary, and if you don't care about the small loss in precision, that's almost 8 cycles each you are saving :P heh
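
For reference, a tiny sketch of the divide-versus-shift equivalence being discussed, restricted to the case where it is exact: unsigned integers and a power-of-two divisor. The function names are hypothetical:

```cpp
#include <cstdint>

// Plain division by a constant power of two.
std::uint32_t divide_by_8(std::uint32_t x) { return x / 8u; }

// The same result written as a right shift by log2(8) = 3 bits.
std::uint32_t shift_by_3(std::uint32_t x) { return x >> 3; }
```

For unsigned values the two always agree. For negative signed values they generally do not (division truncates toward zero, an arithmetic shift rounds down), and optimizing compilers typically do this particular rewrite themselves when the divisor is a compile-time constant, so it is worth checking the generated code before doing it by hand.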
