lonewolff

Optimising my renderer

42 posts in this topic

Hi Guys,
 
For a while now I have been working on my own framework to replace my (unnecessary) reliance on Game Maker: Studio.

Today I decided to see how my engine benchmarks against GM:S. My code is reasonably tight (so I thought), but GM's renderer still runs rings around mine. Like mine, GM:S uses DirectX 9.0c.

These are the results (frames per second):

Sprites (256x256)      Mine      GM:S 1.3

1                      1570      ~1350
10                     706       ~1350
100                    106       ~1100
500                    25        ~620
1000                   13        ~450
My renderer is faster when displaying one sprite, but then drops off quite rapidly.

I am using the ID3DXSprite interface to create and render the sprites. This is my entire render code. The sprites are all stored in a vector and use the same image (for testing purposes).
 
void Renderer::renderSpriteQueue()
{
	SpriteSortByDepth();

	pSprite->Begin(D3DXSPRITE_ALPHABLEND);

	for(std::vector<Sprite>::iterator it=vSprite.begin();it!=vSprite.end();++it)
	{
		// source rectangle: the whole sprite image, starting at the texture's top-left
		RECT rectSpriteTextureArea;
		rectSpriteTextureArea.top=0;
		rectSpriteTextureArea.bottom=it->nSizeY;
		rectSpriteTextureArea.left=0;
		rectSpriteTextureArea.right=it->nSizeX;

		D3DXVECTOR3 v3Center(0,0,0);
		D3DXVECTOR3 v3Position(it->fPosX,it->fPosY,0);

		if(FAILED(pSprite->Draw(pTexture,&rectSpriteTextureArea,&v3Center,&v3Position,0xFFFFFFFF)))
			MessageBox(NULL,"Error","Error",MB_OK);
	}
	pSprite->End();	// End() flushes any queued sprites, so a separate Flush() call is not needed
}
Am I doing something inefficiently here? Would it be faster to just use a textured quad instead?

Any advice would be awesome :)

---

Am I doing something in-efficiently here? Would it be faster to just use a textured quad instead?

 

 

Probably, but at only 1000 sprites it's quite surprising to see such a huge drop in performance. Do the sprites cover the same amount of screen space in both tests?

Your test seems to scale pretty linearly over the number of sprites, which indicates that the problem is either in setup per sprite, or in fillrate.

If the sprites completely cover each other, perhaps GM optimizes away those behind. Try with something like 2x2 sprites instead of 256x256 to confirm whether it could be fillrate.
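The fillrate side of that question can be put into numbers. Assuming every 256x256 sprite is rasterised in full (no culling, no clipping at the screen edge), the pixels written per second are simply width * height * sprite count * FPS. A small C++ sketch of the arithmetic, using the 1000-sprite row of the first table (the function name is made up for illustration):

```cpp
#include <cstdint>

// Pixels rasterised per second if every sprite is drawn in full
// (ignores culling, screen-edge clipping and the extra cost of blending).
constexpr std::uint64_t pixelsPerSecond(std::uint64_t width, std::uint64_t height,
                                        std::uint64_t spriteCount, std::uint64_t fps) {
    return width * height * spriteCount * fps;
}

// 1000 sprites of 256x256, FPS figures from the first table:
static_assert(pixelsPerSecond(256, 256, 1000, 13) == 851968000ULL,
              "ID3DXSprite at 13 FPS: about 0.85 billion pixels/s");
static_assert(pixelsPerSecond(256, 256, 1000, 450) == 29491200000ULL,
              "GM:S at ~450 FPS: about 29.5 billion pixels/s");
```

If the card's rated fillrate is far above 0.85 billion pixels/s, fillrate alone is unlikely to explain the 13 FPS result and per-sprite overhead becomes the prime suspect. If ~29.5 billion pixels/s is beyond what the card can do, GM:S is probably not drawing every pixel of every sprite (or the two scenes differ in some way).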


---

Yeah, the scenes are set up identically, so screen coverage is the same.

I am also in the process of trying textured quads, but am having trouble applying a texture to a single triangle (I haven't used textured triangles before; I have another topic in this forum for that issue). I can render the triangle OK, but can't apply a texture (or don't properly know how to :) ).

---


A triangle is half of a quad; texturing it is just a matter of computing/assigning the correct texture coordinates. If you visualize a quad as being composed of 2 triangles, that should go a long way toward figuring out how to assign them.
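For a full quad, the 0/1 texture coordinates map onto the corners as below. This is a minimal sketch, assuming the same pre-transformed vertex layout (XYZRHW | DIFFUSE | TEX1) used elsewhere in this thread; SpriteVertex and makeQuad are hypothetical names, redeclared so the snippet compiles without the DirectX headers:

```cpp
#include <array>
#include <cstdint>

// Same field layout as an XYZRHW | DIFFUSE | TEX1 vertex.
struct SpriteVertex {
    float x, y, z, rhw;   // pre-transformed screen position
    std::uint32_t color;  // ARGB; 0xFFFFFFFF = opaque white, no tint
    float u, v;           // texture coordinates, (0,0) = top-left of the texture
};
static_assert(sizeof(SpriteVertex) == 28, "stride passed to SetStreamSource");

// One screen-space quad in triangle-strip order:
// top-left, top-right, bottom-left, bottom-right.
constexpr std::array<SpriteVertex, 4> makeQuad(float x, float y, float w, float h) {
    return {{
        {x,     y,     0.0f, 1.0f, 0xFFFFFFFFu, 0.0f, 0.0f},
        {x + w, y,     0.0f, 1.0f, 0xFFFFFFFFu, 1.0f, 0.0f},
        {x,     y + h, 0.0f, 1.0f, 0xFFFFFFFFu, 0.0f, 1.0f},
        {x + w, y + h, 0.0f, 1.0f, 0xFFFFFFFFu, 1.0f, 1.0f},
    }};
}
```

Drawn as D3DPT_TRIANGLESTRIP with 2 primitives, this gives the triangles (top-left, top-right, bottom-left) and (top-right, bottom-left, bottom-right). With pre-transformed vertices in Direct3D 9, texels only line up exactly with pixels if the positions are shifted by -0.5 (see the SDK article "Directly Mapping Texels to Pixels").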


---


A triangle is half of a quad, texture is just a matter of computing/assigning the correct texture coordinates. If you visualize a quad as being composed of 2 triangles then that should go a long way in figuring out how to assign the correct texture coordinates.


Yeah, I can visualise how the UVs should be, as it would be a simple 0 & 1 thing.

This is what I have, but I am just getting a white triangle (instead of a triangle with a PNG on it):
 
LPDIRECT3DVERTEXBUFFER9 pVertexObject = NULL;
void *pVertexBuffer = NULL; 

struct D3DVERTEX{
				float x,y,z,rhw;
				DWORD color;
				float u;
				float v;
					} vertices[3]; 

vertices[0].x = 50; 
vertices[0].y = 50; 
vertices[0].z = 0; 
vertices[0].rhw = 1.0f; 
vertices[0].color = 0xffffff;
vertices[0].u=0.0;
vertices[0].v=0.0;

vertices[1].x = 250; 
vertices[1].y = 50; 
vertices[1].z = 0; 
vertices[1].rhw = 1.0f; 
vertices[1].color = 0xffffff; 
vertices[1].u=1.0;
vertices[1].v=0.0;

vertices[2].x = 50; 
vertices[2].y = 250; 
vertices[2].z = 0; 
vertices[2].rhw = 1.0f;
vertices[2].color = 0xffffff;
vertices[2].u=0.0;
vertices[2].v=1.0;

if(FAILED(mRenderer->getDevice()->CreateVertexBuffer(3*sizeof(D3DVERTEX),0,D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX0,D3DPOOL_DEFAULT,&pVertexObject,NULL)))
	return(0);
 
if(FAILED(pVertexObject->Lock(0,3*sizeof(D3DVERTEX),&pVertexBuffer,0)))
	return(0);
memcpy(pVertexBuffer, vertices, 3*sizeof(D3DVERTEX));
pVertexObject->Unlock();

// do the actual render
mRenderer->getDevice()->SetStreamSource(0,pVertexObject,0,sizeof(D3DVERTEX));
mRenderer->getDevice()->SetFVF(D3DFVF_XYZRHW|D3DFVF_DIFFUSE);
mRenderer->getDevice()->DrawPrimitive(D3DPT_TRIANGLELIST,0,1);

mRenderer->getDevice()->SetTexture(0,mRenderer->pTexture);
Yes, doing this all in the draw call is nasty. I will clean this up once I get it texturing properly.

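[edit]
Found out what was happening there.

The SetFVF() call (the third-last line) was missing D3DFVF_TEX1, so it didn't match the vertex struct, which has texture coordinates. It should be mRenderer->getDevice()->SetFVF(D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1);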

---

OK, more results :)

I have now tested with a textured quad, and here are the results (frames per second):

Rendered sprites (256x256)   ID3DXSprite   Quad   GM:S 1.3

0                            1740          1740   ~1400
1                            1570          1740   ~1350
10                           706           1209   ~1350
100                          106           297    ~1100
500                          25            68     ~620
1000                         13            35     ~450

So, the results are much better (~double) when using a quad, but they are still far below GM:S.

Under heavy load GM:S is still ~10x quicker. How can that be?

Am I missing something here?

---

I have completely stripped down my render phase, so this is all that is happening:

// render 1000 objects
for(int i=0;i<1000;i++)
{
	mRenderer->getDevice()->DrawPrimitive(D3DPT_TRIANGLESTRIP,0,2);
}

// display framerate data
lps=mRenderer->framerateGetReal();
lps=lps-1;
if(lps<0)
	lps=0;

itoa(lps,szBuffer,10);
strcpy(szBuffer2,"Frame Rate: ");
strcat(szBuffer2,szBuffer);
strcat(szBuffer2," FPS");
mRenderer->renderDebugText(600,14,szBuffer2);
I guess it is possible that the way I am rendering the frame counter might be a bottleneck (it uses ID3DXFont). I might strip that out and see what I can gain.

Interesting stuff :)

---

Hmmm, still only 35 FPS if my entire render cycle is just this
 
for(int i=0;i<1000;i++)
        mRenderer->getDevice()->DrawPrimitive(D3DPT_TRIANGLESTRIP,0,2);

So, I must be missing some magic somewhere?

How can GM:S be faster than two lines of render code?

---

Awesome Ashaman73,

1. I wasn't aware of this one :)
2. I think I get what you are saying here (for the first part anyway). But I don't understand how you could draw all of this in a single draw call. Also, I can't get my head around how you would position them individually (in a single call).
3. I am aware of that one. For my tests I am using a single texture, so all good on that front. Later on, when I am happy with performance, I was actually planning on cramming all of my sprites onto a 2048 x 2048 texture sheet. (But that is later :) )

Thanks for the advice. I'll have a play with your suggestions and let you know how I go.

---


To improve it further, you need to get rid of alpha transparency (performance-wise this is really evil). OK, if you don't want to get rid of it, you can at least render the solid sprites in a more effective way. To do this, utilize the z-buffer. Video hardware is really good at using the z-buffer to avoid a lot of texture fetches. To take advantage of it, set the z-coordinate and render the sprites in front-to-back order. This only works for solid sprites (alpha masking is OK, but alpha blending will not work). In general a pipeline could look like this (pseudocode):

List sl = sprite_list
List solidList = getAllNonAlphaBlendedSprites(sl);
List alphaBlendList = getAllAlphaBlendedSprites(sl);

List buckets[NUMBER_OF_DIFFERENT_TEXTURE_ATLASES];
for(sprite in solidList) {
  int atlasIndex = sprite.getAtlasIndex();
  buckets[atlasIndex].add(sprite);
}

// the index buffer never changes, so you only need to initialise it once
Vertex quadBuffer[MAX_SPRITES_PER_BUFFER*4];
int indexBuffer[MAX_SPRITES_PER_BUFFER*6];
for(i=0 to MAX_SPRITES_PER_BUFFER) {
  indexBuffer[i*6+0] = i*4+0;
  indexBuffer[i*6+1] = i*4+1;
  indexBuffer[i*6+2] = i*4+3;
  indexBuffer[i*6+3] = i*4+1;
  indexBuffer[i*6+4] = i*4+2;
  indexBuffer[i*6+5] = i*4+3;
}

// render phase
for(singleBucket in buckets) {
  // sort from front to back
  singleBucket.sortFrontToBack();

  // activate atlas texture
  ...

  // fill batch
  int batchedSprites = 0;
  while(singleBucket.isNotEmpty()) {
    Sprite sp = singleBucket.removeFirstSprite();
    // transfer sprite to batch
    quadBuffer[batchedSprites*4+0] = sp.getVertex(0);
    ... // vertices 1 to 3 likewise
    batchedSprites++;

    // render?
    if(batchedSprites==MAX_SPRITES_PER_BUFFER) {
      DrawIndexTriArray(quadBuffer,indexBuffer,batchedSprites*2 /*count of tris*/);
      batchedSprites = 0;
    }
  }
  // render last batch
  if(batchedSprites>0) {
    DrawIndexTriArray(quadBuffer,indexBuffer,batchedSprites*2 /*count of tris*/);
  }
}
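A minimal C++ sketch of the first step of that pipeline (splitting solid sprites from alpha-blended ones and ordering each list), assuming depth values where smaller means closer to the viewer, as with Direct3D's default LESSEQUAL depth test. SpriteInfo, DrawOrder and splitAndSort are hypothetical names:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sprite record: depth in [0,1], smaller = closer to the viewer.
struct SpriteInfo {
    float depth;
    bool alphaBlended;
    int id;
};

struct DrawOrder {
    std::vector<SpriteInfo> opaque;   // drawn first, front to back, depth test + depth write
    std::vector<SpriteInfo> blended;  // drawn second, back to front, depth test only
};

DrawOrder splitAndSort(const std::vector<SpriteInfo>& sprites) {
    DrawOrder order;
    for (const SpriteInfo& s : sprites)
        (s.alphaBlended ? order.blended : order.opaque).push_back(s);

    // Opaque: nearest first, so the z-buffer can reject hidden pixels early.
    std::stable_sort(order.opaque.begin(), order.opaque.end(),
                     [](const SpriteInfo& a, const SpriteInfo& b) { return a.depth < b.depth; });
    // Blended: farthest first, so each sprite blends over what is behind it.
    std::stable_sort(order.blended.begin(), order.blended.end(),
                     [](const SpriteInfo& a, const SpriteInfo& b) { return a.depth > b.depth; });

    assert(order.opaque.size() + order.blended.size() == sprites.size());
    return order;
}
```

The opaque list would then be bucketed by atlas as in the pseudocode; the blended list has to keep its back-to-front order, so it can only be batched where consecutive sprites share a texture.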


---


Just to be sure: how do you display the FPS in your version? Isn't mRenderer->renderDebugText using D3DXFont to draw the text? It is not very efficient, so it may account for part of the FPS difference. But of course not all of it ;)


---

Thanks L.Spiro - I have some reading to do :)

Tom KQT - I suspected that D3DXFont might be a bottleneck as well.

So, now I am sampling the renders over a 10-second period by incrementing a counter, and after that I display the result in a MessageBox. Not the cleanest of framerate counters, but I am pretty confident that the loop is now running as fast as it can (DX calls aside).

	fastCount++;
	if(GetTickCount()>fastTime+10000)
	{
		char szBuffer[16];

		itoa(fastCount/10,szBuffer,10);	// frames counted over 10 s -> average FPS

		MessageBox(NULL,szBuffer,szBuffer,MB_OK);
		pVertexObject->Release();
		PostQuitMessage(0);
	}
Looks ugly, but should be quick (and leaks like a sieve - LOL)
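A possible alternative to the GetTickCount()/MessageBox counter, sketched in portable C++: accumulate each frame's duration from std::chrono::steady_clock, which has finer resolution than GetTickCount() (that one typically advances in 10-16 ms steps), and report averages once per window. FrameTimer is a hypothetical name, not part of any library:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Accumulates frame durations so an average can be reported once per window
// (say every 10 seconds) instead of drawing text every frame.
class FrameTimer {
public:
    using clock = std::chrono::steady_clock;

    void addFrame(std::chrono::microseconds frameTime) {
        assert(frameTime.count() >= 0);
        total_ += frameTime;
        ++frames_;
    }

    std::uint64_t frames() const { return frames_; }

    // Average milliseconds per frame (0 if nothing recorded yet).
    double averageMs() const {
        return frames_ ? total_.count() / 1000.0 / frames_ : 0.0;
    }

    // Average frames per second over the recorded time.
    double averageFps() const {
        return total_.count() > 0 ? frames_ * 1e6 / total_.count() : 0.0;
    }

    void reset() {
        total_ = std::chrono::microseconds::zero();
        frames_ = 0;
    }

private:
    std::chrono::microseconds total_{0};
    std::uint64_t frames_ = 0;
};
```

In the loop, read `FrameTimer::clock::now()` once per frame, pass `std::chrono::duration_cast<std::chrono::microseconds>(now - last)` to addFrame, and print and reset when the window has elapsed. Milliseconds per frame are often easier to compare than FPS because they add up: 35 FPS is about 28.6 ms per frame, ~450 FPS about 2.2 ms.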

---


So, I am currently trying to go through this list Microsoft recommends.

Using strikethrough as I go :)

General Performance Tips

• Clear only when you must.
  Only clearing the backbuffer.
• Minimize state changes and group the remaining state changes.
  How do you group state changes?
• Use smaller textures, if you can do so.
  256 x 256 recommended. Done.
• Draw objects in your scene from front to back.
  All objects are using the same z depth at the moment. Using for 2D only at this stage.
• Use triangle strips instead of lists and fans. For optimal vertex cache performance, arrange strips to reuse triangle vertices sooner, rather than later.
  Only making quads, but I am using strips.
• Gracefully degrade special effects that require a disproportionate share of system resources.
  Not applicable yet.
• Constantly test your application's performance.
  Well, that's what we are here for :)
• Minimize vertex buffer switches.
  Only have one vertex buffer in my app.
• Use static vertex buffers where possible.
  How do you know if it is static?
• Use one large static vertex buffer per FVF for static objects, rather than one per object.
  What if each object has the same vertex property? E.g. all objects are 256 x 256 quads? Reuse the same buffer?
• If your application needs random access into the vertex buffer in AGP memory, choose a vertex format size that is a multiple of 32 bytes. Otherwise, select the smallest appropriate format.
  Random access as in needing to change vertices at runtime?
• Draw using indexed primitives. This can allow for more efficient vertex caching within hardware.
  Trying this next. Again, what if each object has the same vertex property? Reuse the same buffer?
• If the depth buffer format contains a stencil channel, always clear the depth and stencil channels at the same time.
  Only using 2D with no stencils, so this shouldn't apply (I am guessing).
• Combine the shader instruction and the data output where possible.
  Not using shaders yet.

Does this sound like I am on the right path? And please correct me if anything I have written is wrong. :)


---


•Minimize state changes and group the remaining state changes. How do you group state changes?

- for example... changing the alpha blending. 

•Use static vertex buffers where possible. How do you know if it is static?

- I forget the flags, but basically... can you write to it after it's created? Then it isn't static.

Use one large static vertex buffer per FVF for static objects, rather than one per object. What if each object has the same vertex property? Eg. all objects are 256 x 256 quads? reuse the same buffer?

- I create 2-3 pools and switch pools every frame. Hopefully that reduces lag to the GPU.

•If your application needs random access into the vertex buffer in AGP memory, choose a vertex format size that is a multiple of 32 bytes. Otherwise, select the smallest appropriate format. Random access as in needing to change vertexes at runtime?

- data = vb->lock(); data + 13 = x; vb->unlock(); Lock/unlock as minimally as possible, and write to a locked buffer as minimally as possible.

 

One thing I noticed is that you are doing 1 draw call per sprite. Group sprites by texture to reduce draw calls. You may have 10k sprites, but do you have 10k textures? Maybe you have 10k sprites but they are on 2 textures. You don't need 10k draw calls; you can do it in 2.
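The grouping idea, sketched in C++ under stated assumptions: sort the draw list by texture so that every run of sprites sharing a texture becomes one batch, i.e. one draw call. DrawItem, Batch and groupByTexture are hypothetical names. Sorting by texture changes the draw order, so this is only safe for sprites that don't overlap or that rely on the z-buffer; alpha-blended sprites still need back-to-front order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-sprite record: which texture (or atlas) it uses and where it goes.
struct DrawItem {
    int textureId;
    float x, y;
};

// One contiguous run of items that share a texture = one draw call.
struct Batch {
    int textureId;
    std::size_t first;  // index of the first item in the run
    std::size_t count;  // number of items in the run
};

// Sorts items so equal textures are adjacent (stable_sort keeps the original
// order within each texture), then splits them into runs.
std::vector<Batch> groupByTexture(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) { return a.textureId < b.textureId; });

    std::vector<Batch> batches;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (batches.empty() || batches.back().textureId != items[i].textureId)
            batches.push_back({items[i].textureId, i, 0});
        ++batches.back().count;
    }
    // Sanity check: the last run ends exactly at the end of the list.
    assert(batches.empty() || batches.back().first + batches.back().count == items.size());
    return batches;
}
```

With 10k sprites spread over 2 textures this yields 2 batches; each batch's vertices then go into the vertex buffer and are drawn with one call after one SetTexture.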


---

Cool, thanks for the info.

I remember now that I have set the vertex buffer to read only. So, I have done the right thing there :)

I have no state changes at all (now I understand what a state change is), as I am purely doing a straight stress test render.

Also, I locked the vertex buffer only once, and this is done in my creation phase (outside and before the render loop), so no lag there either.

Your last point is an interesting one, though. I have read a lot today about drawing sprites in one draw call, but I haven't seen how this is actually achieved. So I have absolutely no idea how this can be done.

I would be extremely appreciative if you could shed light on this for me :)
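One common way to draw many sprites in a single call while still positioning each one individually: bake every sprite's screen position into its own four vertices on the CPU each frame, put all of them in one vertex array, and draw that array with one indexed draw call over a shared quad index list. A minimal sketch, assuming the XYZRHW | DIFFUSE | TEX1 layout used in this thread and a 16-bit index buffer; Sprite and buildBatch are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same field layout as the D3DVERTEX struct in this thread (XYZRHW | DIFFUSE | TEX1).
struct SpriteVertex {
    float x, y, z, rhw;
    std::uint32_t color;
    float u, v;
};

// Hypothetical per-sprite data: top-left screen position and size in pixels.
struct Sprite {
    float x, y, w, h;
};

// Writes 4 vertices per sprite into one array, corners in the order
// top-left, top-right, bottom-right, bottom-left (index pattern 0,1,3 / 1,2,3).
// Each sprite's position lives in its own vertices, so a single draw call
// still places every sprite independently.
std::vector<SpriteVertex> buildBatch(const std::vector<Sprite>& sprites) {
    // A 16-bit index buffer can address at most 65536 vertices (16384 sprites).
    assert(sprites.size() <= 65536 / 4);

    const std::uint32_t white = 0xFFFFFFFFu;  // opaque, untinted
    std::vector<SpriteVertex> out;
    out.reserve(sprites.size() * 4);
    for (const Sprite& s : sprites) {
        out.push_back({s.x,       s.y,       0.0f, 1.0f, white, 0.0f, 0.0f});  // top-left
        out.push_back({s.x + s.w, s.y,       0.0f, 1.0f, white, 1.0f, 0.0f});  // top-right
        out.push_back({s.x + s.w, s.y + s.h, 0.0f, 1.0f, white, 1.0f, 1.0f});  // bottom-right
        out.push_back({s.x,       s.y + s.h, 0.0f, 1.0f, white, 0.0f, 1.0f});  // bottom-left
    }
    return out;
}
```

In Direct3D 9, data rewritten every frame usually goes into a vertex buffer created with D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY in D3DPOOL_DEFAULT and locked with D3DLOCK_DISCARD; the array above is copied in with one memcpy and drawn with one DrawIndexedPrimitive call per texture.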

---

Thanks L. Spiro,

I think it is starting to get through my thick head - LOL :)

I'll hit the code and see what I come up with. I'll post what I have later on.

Thanks again everyone for helping out. Must be frustrating sometimes. :)

---


 

Clear only when you must. Only clearing the backbuffer

 

EDIT: The following information turned out to be incorrect, as I suspected :D

I'm not really sure now but I think I saw somewhere that you should clear everything at the same time, that means if you clear backbuffer, you should also clear depthbuffer by the same clear command? But I'm really not sure, I may be imagining this.

By Clear only when you must they mean if you don't need to call Clear(), don't call it at all.


---

Ok, I am trying to put in to action what L. Spiro suggested
 
In my create code (before the render loop) I have created a vertex array of 10 identical quads (a bit hacky, but I'll clean that up later). I truncated each vertex to one line to save space here. I also set the texture here, as I'll only be using the one texture for this test (saves calling it every frame in the render cycle).
 
void *pVertexBuffer = NULL; 

struct D3DVERTEX{float x,y,z,rhw;DWORD color;float u;float v;} vertices[40];	// 4 verts * 10 quads
LPDIRECT3DVERTEXBUFFER9 pVertexObject = NULL;

// vertex description for our sprites: quad n uses vertices[n*4+0] to vertices[n*4+3]
for(int n=0;n<10;n++)
{
	vertices[n*4+0].x=0;vertices[n*4+0].y=256;vertices[n*4+0].z=0;vertices[n*4+0].rhw=1.0f;vertices[n*4+0].color=0xffffff;vertices[n*4+0].u=0.0;vertices[n*4+0].v=1.0;
	vertices[n*4+1].x=0;vertices[n*4+1].y=0;vertices[n*4+1].z=0;vertices[n*4+1].rhw=1.0f;vertices[n*4+1].color=0xffffff;vertices[n*4+1].u=0.0;vertices[n*4+1].v=0.0;
	vertices[n*4+2].x=256;vertices[n*4+2].y=256;vertices[n*4+2].z=0;vertices[n*4+2].rhw=1.0f;vertices[n*4+2].color=0xffffff;vertices[n*4+2].u=1.0;vertices[n*4+2].v=1.0;
	vertices[n*4+3].x=256;vertices[n*4+3].y=0;vertices[n*4+3].z=0;vertices[n*4+3].rhw=1.0f;vertices[n*4+3].color=0xffffff;vertices[n*4+3].u=1.0;vertices[n*4+3].v=0.0;
}

mRenderer->getDevice()->SetTexture(0,mRenderer->pTexture);
Now for the renderer part.

// #1: Create vertex buffer. Not static/read-only. Dynamic.
void *pVertexBufferDynamic=NULL;
LPDIRECT3DVERTEXBUFFER9 pVertexObjectDynamic=NULL;

int nQuantity=10;

if(FAILED(mRenderer->getDevice()->CreateVertexBuffer(nQuantity*4*sizeof(D3DVERTEX),NULL,D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1,D3DPOOL_MANAGED,&pVertexObjectDynamic,NULL)))
	return(1);

// #2: Lock it.
if(FAILED(pVertexObject->Lock(0,nQuantity*4*sizeof(D3DVERTEX),&pVertexBuffer,0)))
	return(2);

// #3: Fill it with the sprite vertices. Drawing 32 sprites means you put 32×4 vertices into the buffer.
memcpy(pVertexBuffer,vertices,nQuantity*4*sizeof(D3DVERTEX));

// #4: Unlock it.
pVertexObject->Unlock();

// #5: Draw it using the pre-generated 16-bit index buffer.
mRenderer->getDevice()->SetStreamSource(0,pVertexObject,0,sizeof(D3DVERTEX));
mRenderer->getDevice()->SetFVF(D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1);

// **** then what?
// ? Draw it using the pre-generated 16-bit index buffer.

pVertexObjectDynamic->Release();
There are a couple of things of concern here. At the locking phase, I am getting a fatal exception due to adding nQuantity. I assumed that I would have to have this in here somewhere, as I am attempting to draw 10 quads. Am I right?

The other thing I am unsure of is "#5 Draw it using the pre-generated 16-bit index buffer." What index buffer? Do I need to add another step somewhere?

Otherwise, is my code looking like it is on the right track?

Thanks again everyone :)
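On the "pre-generated 16-bit index buffer" in step #5: because every sprite is a quad, the index list has the same shape for every batch, so it can be generated once, up front, and reused. A sketch under stated assumptions (corners stored as top-left, top-right, bottom-right, bottom-left; buildQuadIndices is a hypothetical name), using the same two-triangles-per-quad pattern as the pseudocode earlier in the thread:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Index list for `quadCount` quads stored as 4 consecutive vertices each
// (0 = top-left, 1 = top-right, 2 = bottom-right, 3 = bottom-left).
// Each quad becomes two triangles: (0,1,3) and (1,2,3).
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount) {
    // 16-bit indices can address at most 65536 vertices, i.e. 16384 quads.
    if (quadCount > 65536 / 4)
        throw std::length_error("too many quads for a 16-bit index buffer");

    static const int pattern[6] = {0, 1, 3, 1, 2, 3};
    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q)
        for (int corner : pattern)
            indices.push_back(static_cast<std::uint16_t>(q * 4 + corner));

    assert(indices.size() == quadCount * 6);
    return indices;
}
```

In Direct3D 9 this would be copied once into an index buffer created with D3DFMT_INDEX16 and bound with SetIndices; a batch of n quads is then a single DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, n * 4, 0, n * 2) call.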

---


 

 

Clear only when you must. Only clearing the backbuffer

I'm not really sure now but I think I saw somewhere that you should clear everything at the same time, that means if you clear backbuffer, you should also clear depthbuffer by the same clear command? But I'm really not sure, I may be imagining this.

By Clear only when you must they mean if you don't need to call Clear(), don't call it at all.

 

 

It's more the case that if you have depth and stencil, and if you're clearing depth, then you should clear stencil at the same time. This is because we typically see depth and stencil interleaved in a D24S8 format, so they're not separate buffers: they're a single buffer that contains the data for both, and clearing both at the same time allows the hardware to do a fast clear (which may be as fast as just swapping out a pointer or setting a flag).

 

You can't always do this of course, and some algorithms require clearing them individually, but where possible you should.

 

I don't recall ever seeing any advice about perf gains from clearing colour at the same time, but I would expect there would be none as these are separate buffers.
