
Optimising my renderer


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

42 replies to this topic

#1 DarkRonin   Members   -  Reputation: 638

Posted 16 April 2014 - 05:37 PM

Hi Guys,
 
For a while now I have been working on my own framework to replace my (unnecessary) reliance on Game Maker: Studio.
 
Today I decided to see how my engine benchmarks against GM:S. My code is reasonably tight (or so I thought), but GM's renderer still runs rings around mine. GM:S also uses DirectX 9c.
 
These are the results:
 
# Sprites (256x256)      Mine (FPS)   GM:S 1.3 (FPS)

1                        1570         ~1350
10                       706          ~1350
100                      106          ~1100
500                      25           ~620
1000                     13           ~450
My renderer is faster when displaying one sprite, but then drops off quite rapidly.

I am using the ID3DXSprite interface to create and render the sprites. This is my entire render code. The sprites are all stored in a vector and use the same image (for testing purposes).
 
void Renderer::renderSpriteQueue()
{
	SpriteSortByDepth();

	pSprite->Begin(D3DXSPRITE_ALPHABLEND);
	std::vector<Sprite>::iterator it;

	for(it=vSprite.begin();it!=vSprite.end();++it)
	{
		RECT rectSpriteTextureArea;
		D3DXVECTOR3 v3Center;
		D3DXVECTOR3 v3Position;

		// source rectangle: the whole image
		rectSpriteTextureArea.top=0;
		rectSpriteTextureArea.bottom=it->nSizeY;
		rectSpriteTextureArea.left=0;
		rectSpriteTextureArea.right=it->nSizeX;
		// draw from the top-left corner at the sprite's position
		v3Center=D3DXVECTOR3(0,0,0);
		v3Position=D3DXVECTOR3(it->fPosX,it->fPosY,0);

		if(FAILED(pSprite->Draw(pTexture,&rectSpriteTextureArea,&v3Center,&v3Position,0xFFFFFFFF)))
			MessageBox(NULL,"Error","Error",MB_OK);
	}
	pSprite->Flush();
	pSprite->End();
}
Am I doing something inefficiently here? Would it be faster to just use a textured quad instead?

Any advice would be awesome :)

Edited by lonewolff, 16 April 2014 - 05:38 PM.



#2 Pink Horror   Members   -  Reputation: 1669

Posted 16 April 2014 - 07:22 PM

What's the speed without the sort?

#3 DarkRonin   Members   -  Reputation: 638

Posted 16 April 2014 - 07:25 PM

Identical. I tried that out earlier. :)

#4 Erik Rufelt   Crossbones+   -  Reputation: 4337

Posted 16 April 2014 - 07:41 PM

Am I doing something inefficiently here? Would it be faster to just use a textured quad instead?

Probably, but at only 1000 sprites it's quite surprising to see such a huge drop in performance. Do the sprites cover the same amount of screen space in both tests?

Your test seems to scale pretty linearly over the number of sprites, which indicates that the problem is either in setup per sprite, or in fillrate.

If the sprites completely cover each other, perhaps GM optimizes away those behind. Try with like 2x2 sprites instead of 256x256 to confirm whether it can be fillrate.



#5 DarkRonin   Members   -  Reputation: 638

Posted 16 April 2014 - 07:50 PM

Yeah, the scenes are set up identically, so screen coverage is the same.

I am also in the process of trying with textured quads, but am having trouble applying a texture to a single triangle (I haven't used textured triangles before; I have another topic in this forum for that issue though). I can render the triangle OK, but can't apply a texture (or don't properly know how to :) )

#6 cgrant   Members   -  Reputation: 1003

Posted 16 April 2014 - 07:58 PM

A triangle is half of a quad; texturing it is just a matter of computing/assigning the correct texture coordinates. If you visualize a quad as being composed of 2 triangles then that should go a long way in figuring out how to assign the correct texture coordinates.
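
As a rough illustration of that idea, here is a C++ sketch that fills in the four corners of a quad with matching texture coordinates. The Vertex struct and makeQuad are hypothetical names, not Direct3D types; the field order simply mirrors a pre-transformed position, a diffuse colour and one set of UVs.

```cpp
#include <array>

// Screen-space vertex: position (x, y, z, rhw), colour, one UV pair.
struct Vertex {
    float x, y, z, rhw;
    unsigned int color;  // 0xAARRGGBB
    float u, v;
};

// Corners in triangle-strip order: top-left, top-right,
// bottom-left, bottom-right.
// Triangle 1 uses corners 0,1,2; triangle 2 uses corners 1,2,3.
std::array<Vertex, 4> makeQuad(float left, float top, float width, float height) {
    const float right = left + width;
    const float bottom = top + height;
    const unsigned int white = 0xFFFFFFFFu;  // opaque white, texture shows unmodified
    return {{
        {left,  top,    0.0f, 1.0f, white, 0.0f, 0.0f},
        {right, top,    0.0f, 1.0f, white, 1.0f, 0.0f},
        {left,  bottom, 0.0f, 1.0f, white, 0.0f, 1.0f},
        {right, bottom, 0.0f, 1.0f, white, 1.0f, 1.0f},
    }};
}
```

Each corner gets u = 0 on the left edge and 1 on the right, v = 0 at the top and 1 at the bottom, so the whole image is stretched across the quad.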



#7 DarkRonin   Members   -  Reputation: 638

Posted 16 April 2014 - 08:02 PM

A triangle is half of a quad; texturing it is just a matter of computing/assigning the correct texture coordinates. If you visualize a quad as being composed of 2 triangles then that should go a long way in figuring out how to assign the correct texture coordinates.

Yeah, I can visualise how the UVs should be, as it would be a simple 0 & 1 thing.

This is what I have, but I am just getting a white triangle (instead of a triangle with a PNG on it):
 
LPDIRECT3DVERTEXBUFFER9 pVertexObject = NULL;
void *pVertexBuffer = NULL; 

struct D3DVERTEX{
	float x,y,z,rhw;
	DWORD color;
	float u;
	float v;
} vertices[3]; 

vertices[0].x = 50; 
vertices[0].y = 50; 
vertices[0].z = 0; 
vertices[0].rhw = 1.0f; 
vertices[0].color = 0xffffff;
vertices[0].u=0.0;
vertices[0].v=0.0;

vertices[1].x = 250; 
vertices[1].y = 50; 
vertices[1].z = 0; 
vertices[1].rhw = 1.0f; 
vertices[1].color = 0xffffff; 
vertices[1].u=1.0;
vertices[1].v=0.0;

vertices[2].x = 50; 
vertices[2].y = 250; 
vertices[2].z = 0; 
vertices[2].rhw = 1.0f;
vertices[2].color = 0xffffff;
vertices[2].u=0.0;
vertices[2].v=1.0;

if(FAILED(mRenderer->getDevice()->CreateVertexBuffer(3*sizeof(D3DVERTEX),0,D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1,D3DPOOL_DEFAULT,&pVertexObject,NULL)))
	return(0);
 
if(FAILED(pVertexObject->Lock(0,3*sizeof(D3DVERTEX),&pVertexBuffer,0)))
	return(0);
memcpy(pVertexBuffer, vertices, 3*sizeof(D3DVERTEX));
pVertexObject->Unlock();

// do the actual render
mRenderer->getDevice()->SetStreamSource(0,pVertexObject,0,sizeof(D3DVERTEX));
mRenderer->getDevice()->SetFVF(D3DFVF_XYZRHW|D3DFVF_DIFFUSE);
mRenderer->getDevice()->DrawPrimitive(D3DPT_TRIANGLELIST,0,1);

mRenderer->getDevice()->SetTexture(0,mRenderer->pTexture);
Yes, doing this all in the draw call is nasty. I will clean this up once I get it texturing properly.

[edit]
Found out what was happening there.

The third-last line should be mRenderer->getDevice()->SetFVF(D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1);

Edited by lonewolff, 16 April 2014 - 08:05 PM.


#8 DarkRonin   Members   -  Reputation: 638

Posted 16 April 2014 - 10:38 PM

OK, more results :)

I have now tested with a textured quad and here are the results:

Rendered Sprites (256x256)   ID3DXSprite (FPS)   Quad (FPS)   GM:S 1.3 (FPS)

0                            1740                1740         ~1400
1                            1570                1740         ~1350
10                           706                 1209         ~1350
100                          106                 297          ~1100
500                          25                  68           ~620
1000                         13                  35           ~450

So, the results are much better (~double) when using a 'Quad', but they are still far below GM:S.

Under heavy load GM:S is still ~10x quicker. How can that be?

Am I missing something here?

#9 DarkRonin   Members   -  Reputation: 638

Posted 16 April 2014 - 11:02 PM

I have stripped my render phase right down, so this is all that is happening:

// render 1000 objects
for(int i=0;i<1000;i++)
{
	mRenderer->getDevice()->DrawPrimitive(D3DPT_TRIANGLESTRIP,0,2);
}

// display framerate data
lps=mRenderer->framerateGetReal();
lps=lps-1;
if(lps<0)
	lps=0;

itoa(lps,szBuffer,10);
strcpy(szBuffer2,"Frame Rate: ");
strcat(szBuffer2,szBuffer);
strcat(szBuffer2," FPS");
mRenderer->renderDebugText(600,14,szBuffer2);
I guess it is possible that the way I am rendering the frame counter might be a bottleneck (it uses ID3DXFont). I might strip that out and see what I can gain.

Interesting stuff :)

#10 DarkRonin   Members   -  Reputation: 638

Posted 16 April 2014 - 11:15 PM

Hmmm, still only 35 FPS if my entire render cycle is just this:
 
for(int i=0;i<1000;i++)
        mRenderer->getDevice()->DrawPrimitive(D3DPT_TRIANGLESTRIP,0,2);

So, I must be missing some magic somewhere?

How can GM:S be faster than two lines of render code?

Edited by lonewolff, 16 April 2014 - 11:16 PM.


#11 Ashaman73   Crossbones+   -  Reputation: 11087

Posted 16 April 2014 - 11:41 PM


for(int i=0;i<1000;i++) mRenderer->getDevice()->DrawPrimitive(D3DPT_TRIANGLESTRIP,0,2);

You are on the right track. API calls are among the most time-consuming things, so the trick is to reduce your API calls a lot. When using quads (the right way), the next step would be to group quads and render them in a single call. This is called batching. You are quite new to 3D rendering, but here are some hints on which direction you should investigate to render a few million sprites in the same time ;)

1. Use indexed primitives, that is, you have an array of vertices and an array of indices into these vertices.

2. Use batching by putting multiple sprites into a single array (10 sprites = 40 vertex array) and draw them with a single indexed draw call. Either use a triangle list (6 indices per quad) or a triangle strip and connect them. The triangle list is much easier and the performance impact would not be large.

3. Use a texture atlas, that is, put multiple sprites on a single larger texture. Then group (->batch) all sprites which use the same texture into a single batch and draw it.

 

The trick is to get rid of excess API calls!
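
To make points 1 and 2 a little more concrete, here is a minimal CPU-side sketch in C++ of filling one vertex array and one index array for many sprites. Each sprite's position is baked into its own four vertices, which is how individually placed sprites can share one call. Vertex, Sprite, Batch and buildBatch are hypothetical names for illustration; uploading the arrays and issuing the draw (for example IDirect3DDevice9::DrawIndexedPrimitive in D3D9) is deliberately left out.

```cpp
#include <cstdint>
#include <vector>

struct Vertex { float x, y, u, v; };
struct Sprite { float x, y, w, h; };  // top-left corner and size in pixels

struct Batch {
    std::vector<Vertex> vertices;        // 4 per sprite
    std::vector<std::uint16_t> indices;  // 6 per sprite (two triangles)
};

// Bake every sprite's position into its own four vertices so the whole
// batch can be submitted with a single indexed draw call.
// 16-bit indices limit one batch to 65536 / 4 = 16384 sprites.
Batch buildBatch(const std::vector<Sprite>& sprites) {
    Batch b;
    b.vertices.reserve(sprites.size() * 4);
    b.indices.reserve(sprites.size() * 6);
    for (const Sprite& s : sprites) {
        const std::uint16_t base = static_cast<std::uint16_t>(b.vertices.size());
        // Corners in perimeter order: top-left, top-right,
        // bottom-right, bottom-left.
        b.vertices.push_back({s.x,       s.y,       0.0f, 0.0f});
        b.vertices.push_back({s.x + s.w, s.y,       1.0f, 0.0f});
        b.vertices.push_back({s.x + s.w, s.y + s.h, 1.0f, 1.0f});
        b.vertices.push_back({s.x,       s.y + s.h, 0.0f, 1.0f});
        // Two triangles per quad: (0,1,3) and (1,2,3).
        const std::uint16_t quad[6] = {0, 1, 3, 1, 2, 3};
        for (std::uint16_t offset : quad)
            b.indices.push_back(static_cast<std::uint16_t>(base + offset));
    }
    return b;
}
```

For 10 sprites this yields the 40-vertex array mentioned in point 2 plus 60 indices. With a texture atlas, the 0/1 UVs would be replaced by each sprite's sub-rectangle on the atlas.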


Ashaman

 

Gnoblins: Website - Facebook - Twitter - Youtube - Steam Greenlit - IndieDB - Gamedev Log


#12 DarkRonin   Members   -  Reputation: 638

Posted 16 April 2014 - 11:53 PM

Awesome Ashaman73,

1. I wasn't aware of this one :)
2. I think I get what you are saying here (for the first part anyway). But I don't understand how you could draw all of this in a single draw call. Also, I can't get my head around how you would position them individually (in a single call).
3. I am aware of that one. For my tests I am using a single texture, so all good on that front. Later on, when I am happy with performance, I was actually planning on cramming all of my sprites onto a 2048 x 2048 texture sheet. (But that is later :) )

Thanks for the advice. I'll have a play with your suggestions and let you know how I go.

#13 Ashaman73   Crossbones+   -  Reputation: 11087

Posted 17 April 2014 - 12:00 AM

To improve it further, you need to get rid of alpha transparency (performance-wise this is really evil). OK, if you don't want to get rid of it, you can at least render the solid sprites in a more effective way. To do this, utilize the z-buffer. The video hardware is really good at utilizing the z-buffer, which prevents a lot of texture fetches. To utilize it, you should use the z-coord and render the sprites in front-to-back order. This only works for solid sprites (alpha masking is OK, but alpha blending will not work). In general a pipeline could look like this (pseudo code):

List sl = sprite_list;
List solidList = getAllNonAlphaBlendedSprites(sl);
List alphaBlendList = getAllAlphaBlendedSprites(sl);

// one bucket of solid sprites per texture atlas
List buckets[NUMBER_OF_DIFFERENT_TEXTURE_ATLASES];
for(sprite in solidList) {
  int atlasIndex = sprite.getAtlasIndex();
  buckets[atlasIndex].add(sprite);
}

// the index buffer is constant, you only need to initialise it once
Vertex quadBuffer[MAX_SPRITES_PER_BUFFER*4];
int indexBuffer[MAX_SPRITES_PER_BUFFER*6];
for(i=0 to MAX_SPRITES_PER_BUFFER) {
  indexBuffer[i*6+0] = i*4+0;
  indexBuffer[i*6+1] = i*4+1;
  indexBuffer[i*6+2] = i*4+3;
  indexBuffer[i*6+3] = i*4+1;
  indexBuffer[i*6+4] = i*4+2;
  indexBuffer[i*6+5] = i*4+3;
}

// render phase
for(singleBucket in buckets) {
  // sort from front to back
  singleBucket.sortFrontToBack();

  // activate atlas texture
  ...

  // fill batch
  int batchedSprites = 0;
  while(singleBucket.isNotEmpty()) {
    Sprite sp = singleBucket.removeFirstSprite();
    // transfer sprite to batch
    quadBuffer[batchedSprites*4+0] = ...sp.getVertex(0)..
    ..
    batchedSprites++;

    // render ?
    if(batchedSprites==MAX_SPRITES_PER_BUFFER) {
      DrawIndexTriArray(quadBuffer,indexBuffer,batchedSprites*2 /*count of tris*/);
      batchedSprites = 0;
    }
  }
  // render last batch
  if(batchedSprites>0) {
    DrawIndexTriArray(quadBuffer,indexBuffer,batchedSprites*2 /*count of tris*/);
  }
}
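
The bucketing and sorting steps of the pseudo code above could be sketched in real C++ roughly as follows. This is only an illustration: the Sprite fields and bucketSolidSprites are hypothetical, and it assumes that a smaller z means nearer to the viewer.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Sprite {
    int atlasIndex;     // which texture atlas the sprite lives on
    float z;            // depth; smaller = nearer the viewer (assumption)
    bool alphaBlended;  // blended sprites are handled separately
};

// Returns one bucket per atlas containing only the solid sprites,
// each bucket sorted front to back.
std::vector<std::vector<Sprite>> bucketSolidSprites(
        const std::vector<Sprite>& sprites, std::size_t atlasCount) {
    std::vector<std::vector<Sprite>> buckets(atlasCount);
    for (const Sprite& s : sprites) {
        if (!s.alphaBlended)
            buckets[static_cast<std::size_t>(s.atlasIndex)].push_back(s);
    }
    for (std::vector<Sprite>& bucket : buckets) {
        // stable_sort keeps submission order for sprites at equal depth
        std::stable_sort(bucket.begin(), bucket.end(),
            [](const Sprite& a, const Sprite& b) { return a.z < b.z; });
    }
    return buckets;
}
```

Alpha-blended sprites are left out here on purpose; they would normally be drawn afterwards in back-to-front order.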


Ashaman

 



#14 L. Spiro   Crossbones+   -  Reputation: 19239

Posted 17 April 2014 - 12:50 AM

#1: I can’t tell if you are using shaders. If not, use shaders.
#2: The vertex shader does not need a whole 4×4 matrix to do what it needs to do; it only needs a single vector with the normalized screen dimensions. This reduces bandwidth when updating uniforms.
#3: Don’t use sprites.
#4: Do use 2 vertex buffers and 1 pre-generated max-filled index buffer (see Ashaman73’s code above).
- #A: The index buffer should be 16 bits.
- #B: The vertex buffers should be double-buffered. Never overwrite part of a vertex buffer immediately after drawing it. If you have to write more than MAX_SPRITES_PER_BUFFER (borrowing from Ashaman73’s code) in a single frame, then you need to write to more than one buffer that frame (let’s say 3), then write to a new set of 3 buffers the next frame. But generally MAX_SPRITES_PER_BUFFER should be set high enough that it never gets overflowed in a single frame and you bounce back and forth between only 2 buffers each frame.



Your biggest bottleneck is how you are managing your vertex buffer.
There is extensive reading material on best practices when updating a vertex buffer.
http://msdn.microsoft.com/en-us/library/windows/desktop/bb147263(v=vs.85).aspx#Using_Dynamic_Vertex_and_Index_Buffers
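
A common pattern for dynamic vertex buffers in D3D9 is to append new data with D3DLOCK_NOOVERWRITE and, once the buffer is full, lock with D3DLOCK_DISCARD and start again from the beginning. The C++ sketch below models only the bookkeeping for that pattern; LockMode, LockRequest and DynamicBufferCursor are hypothetical stand-ins, and the actual Lock call on a buffer created with D3DUSAGE_DYNAMIC is not shown.

```cpp
#include <cstddef>

// Stand-ins for the two D3D9 lock flags used with dynamic buffers.
enum class LockMode { Discard, NoOverwrite };

struct LockRequest {
    std::size_t firstVertex;  // where to start writing
    LockMode mode;            // which flag to lock with
};

class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity)
        : capacity_(capacity), next_(0) {}

    // Reserve room for `count` vertices (count must not exceed capacity).
    LockRequest reserve(std::size_t count) {
        LockRequest r;
        if (next_ + count > capacity_) {
            // Not enough room left: request fresh memory and restart at 0.
            r.firstVertex = 0;
            r.mode = LockMode::Discard;
            next_ = count;
        } else {
            // Append after data the GPU may still be reading.
            r.firstVertex = next_;
            r.mode = LockMode::NoOverwrite;
            next_ += count;
        }
        return r;
    }

private:
    std::size_t capacity_;
    std::size_t next_;
};
```

The point of the two modes is that the CPU never writes over vertices a pending draw call might still use, so neither side has to wait for the other.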


L. Spiro

Edited by L. Spiro, 17 April 2014 - 12:51 AM.


#15 Tom KQT   Members   -  Reputation: 1638

Posted 17 April 2014 - 01:10 AM

Just to be sure - how do you display the FPS in your version? Isn't mRenderer->renderDebugText using D3DXFont to draw the text? It is not very efficient so it may be a source of a part of the FPS difference. But of course not all of it ;)



#16 DarkRonin   Members   -  Reputation: 638

Posted 17 April 2014 - 01:17 AM

Thanks L. Spiro - I have some reading to do :)

Tom KQT - I suspected that D3DXFont might be a bottleneck as well.

So, now I am sampling the renders over a 10 second period by incrementing a counter and displaying what I have after that in a MessageBox. Not the cleanest of framerate counters. But I am pretty confident that the loop is now running as fast as it can (DX calls aside).
 
	// count frames; after 10 seconds report the average FPS and quit
	fastCount++;
	if(GetTickCount()>fastTime+10000)
	{
		char szBuffer[16];

		itoa(fastCount/10,szBuffer,10);

		MessageBox(NULL,szBuffer,szBuffer,MB_OK);
		pVertexObject->Release();
		PostQuitMessage(0);
	}
Looks ugly, but should be quick (and leaks like a sieve - LOL)

Edited by lonewolff, 17 April 2014 - 01:18 AM.


#17 DarkRonin   Members   -  Reputation: 638

Posted 17 April 2014 - 01:37 AM

Oh, and I'm not using shaders at this point.

#18 DarkRonin   Members   -  Reputation: 638

Posted 17 April 2014 - 02:18 AM

So, I am currently trying to go through this list Microsoft recommends.
 
Using strikethrough as I go :)
 
 

General Performance Tips

Clear only when you must.
- Only clearing the backbuffer.
•Minimize state changes and group the remaining state changes.
- How do you group state changes?
Use smaller textures, if you can do so. 256 x 256 recommended.
- Done.
Draw objects in your scene from front to back.
- All objects use the same z depth at the moment. Using it for 2D only at this stage.
Use triangle strips instead of lists and fans. For optimal vertex cache performance, arrange strips to reuse triangle vertices sooner, rather than later.
- Only making quads. But I am using strips.

Gracefully degrade special effects that require a disproportionate share of system resources.
- Not applicable yet.
Constantly test your application's performance.
- Well, that's what we are here for :)
Minimize vertex buffer switches.
- Only have one vertex buffer in my app.
•Use static vertex buffers where possible.
- How do you know if it is static?
•Use one large static vertex buffer per FVF for static objects, rather than one per object.
- What if each object has the same vertex properties? E.g. all objects are 256 x 256 quads? Reuse the same buffer?
•If your application needs random access into the vertex buffer in AGP memory, choose a vertex format size that is a multiple of 32 bytes. Otherwise, select the smallest appropriate format.
- Random access as in needing to change vertices at runtime?
•Draw using indexed primitives. This can allow for more efficient vertex caching within hardware.
- Trying this next. Again, what if each object has the same vertex properties? Reuse the same buffer?
If the depth buffer format contains a stencil channel, always clear the depth and stencil channels at the same time.
- Only using 2D with no stencils, so this shouldn't apply (I am guessing).
Combine the shader instruction and the data output where possible.
- Not using shaders yet.


Does this sound like I am on the right path? And please correct me if anything I have written is wrong. :)


Edited by lonewolff, 17 April 2014 - 02:22 AM.


#19 hdxpete   Members   -  Reputation: 512

Posted 17 April 2014 - 03:07 AM

•Minimize state changes and group the remaining state changes. How do you group state changes?

- for example... changing the alpha blending. 

•Use static vertex buffers where possible. How do you know if it is static?

-i forget the flags but basically... can you write to it after it's created? then it isn't static

Use one large static vertex buffer per FVF for static objects, rather than one per object. What if each object has the same vertex property? Eg. all objects are 256 x 256 quads? reuse the same buffer?

- i create 2-3 pools. switch pools every frame. hopefully reduces lag to gpu

•If your application needs random access into the vertex buffer in AGP memory, choose a vertex format size that is a multiple of 32 bytes. Otherwise, select the smallest appropriate format. Random access as in needing to change vertexes at runtime?

- data = vb->lock(); data +13 = x; vb->unlock();   lock/unlock as little as possible. write to a locked buffer as little as possible.

 

one thing i noticed is you are doing 1 draw call per sprite. group sprites by texture to reduce draw calls. you may have 10k sprites but do you have 10k textures? maybe you have 10k sprites but they are on 2 textures. you don't need 10k draw calls. you can do it in 2.
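
The "10k sprites, 2 draw calls" arithmetic can be sketched as: sort the sprites by texture, then count how many runs of equal texture there are. The C++ below is a hypothetical illustration (Sprite and countDrawCalls are made-up names) and assumes every sprite sharing a texture fits into one batch.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Sprite {
    int textureId;  // which texture (or atlas) the sprite uses
    float x, y;
};

// Sort by texture, then count runs of equal texture IDs.
// Each run can be submitted as one batched draw call.
std::size_t countDrawCalls(std::vector<Sprite> sprites) {
    std::sort(sprites.begin(), sprites.end(),
        [](const Sprite& a, const Sprite& b) { return a.textureId < b.textureId; });
    std::size_t calls = 0;
    for (std::size_t i = 0; i < sprites.size(); ++i) {
        if (i == 0 || sprites[i].textureId != sprites[i - 1].textureId)
            ++calls;
    }
    return calls;
}
```

Sorting purely by texture ignores draw order, so for alpha-blended sprites that overlap, depth order would have to take priority and the number of calls may end up higher than the number of textures.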



#20 DarkRonin   Members   -  Reputation: 638

Posted 17 April 2014 - 03:49 AM

Cool, thanks for the info.

I remember now that I have set the vertex buffer to read only. So, I have done the right thing there :)

I have no state changes at all (now I understand what a state change is), as I am purely doing a straight stress test render.

Also, I locked the vertex buffer only once, and this is done in my creation phase (outside and before the render loop), so no lag there either.

Your last point is an interesting one though. I have read a lot today about drawing sprites in one draw call, but I haven't seen how this is actually achieved. So, I have absolutely no idea how this can be done.

I would be extremely appreciative if you could shed light on this for me :)



