Optimising my renderer

41 comments, last by Hodgman 9 years, 12 months ago


for(int i=0;i<1000;i++) mRenderer->getDevice()->DrawPrimitive(D3DPT_TRIANGLESTRIP,0,2);

You are on the right track. API calls are among the most time-consuming operations, so the trick is to reduce their number a lot. Once you are using quads (the right way), the next step is to group quads and render them in a single call. This is called batching. You are quite new to 3D rendering, but here are some hints on which direction to investigate in order to render a few million sprites in the same time ;)

1. Use indexed primitives, that is, you have an array of vertices and an array of indices into those vertices.

2. Use batching by putting multiple sprites into a single array (10 sprites = a 40-vertex array) and draw them with a single indexed draw call. Either use a triangle list (6 indices per quad) or a triangle strip with the quads connected. The triangle list is much easier and its performance cost should be small.

3. Use a texture atlas, that is, put multiple sprites on a single larger texture. Then group (->batch) all sprites which use the same texture into a single batch and draw it.
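To make point 3 a little more concrete, here is a minimal sketch of how a sprite's texture coordinates could be looked up in an atlas. It assumes the atlas is a uniform grid of equally sized cells numbered row by row; `UVRect` and `atlasCellUV` are hypothetical names invented for this illustration, not part of Direct3D.

```cpp
#include <cassert>

// Hypothetical helper type: a rectangle in texture space (0..1 on both axes).
struct UVRect { float u0, v0, u1, v1; };

// Returns the texture-coordinate rectangle of one cell in an atlas that is
// divided into a uniform grid. Cells are assumed to be numbered row by row,
// starting at the top-left corner.
inline UVRect atlasCellUV(int cellIndex, int columns, int rows)
{
    assert(columns > 0 && rows > 0);
    assert(cellIndex >= 0 && cellIndex < columns * rows);

    const int col = cellIndex % columns;
    const int row = cellIndex / columns;
    const float cellW = 1.0f / static_cast<float>(columns);
    const float cellH = 1.0f / static_cast<float>(rows);

    return UVRect{ col * cellW, row * cellH,
                   (col + 1) * cellW, (row + 1) * cellH };
}
```

Every sprite that takes its coordinates from the same atlas shares one texture, so all of them can go into the same batch.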

The trick is to get rid of excess API calls!

Awesome Ashaman73,

1. I wasn't aware of this one :)
2. I think I get what you are saying here (for the first part anyway). But, I don't understand how you could draw all of this in a single draw call. Also, I can't get my head around how you would position them individually (in a single call).
3. I am aware of that one. For my tests I am using a single texture, so all good on that front. Later on, when I am happy with performance, I am planning on cramming all of my sprites onto a 2048 x 2048 texture sheet. (But that is later :) )

Thanks for the advice. I'll have a play with your suggestions and let you know how I go.
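On the question in point 2, one possible way to picture it: each sprite's position is written directly into its four vertices, so a single vertex array can hold many quads in different places. The sketch below is CPU-side only and makes several assumptions: the `Vertex` layout, the `SpriteBatch` type and `addSprite` are hypothetical names, and positions are taken to be screen-space rectangles.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical vertex layout: position plus texture coordinate.
struct Vertex { float x, y, z, u, v; };

struct SpriteBatch {
    std::vector<Vertex> vertices;        // 4 per sprite
    std::vector<std::uint16_t> indices;  // 6 per sprite (triangle list)

    // Appends one quad. The sprite's position is baked into its vertices,
    // which is what lets every sprite in the batch sit somewhere different.
    void addSprite(float left, float top, float width, float height, float depth)
    {
        // 16-bit indices can address at most 65536 vertices.
        assert(vertices.size() + 4 <= 65536);
        const std::uint16_t base = static_cast<std::uint16_t>(vertices.size());

        vertices.push_back({left,         top,          depth, 0.0f, 0.0f});
        vertices.push_back({left + width, top,          depth, 1.0f, 0.0f});
        vertices.push_back({left + width, top + height, depth, 1.0f, 1.0f});
        vertices.push_back({left,         top + height, depth, 0.0f, 1.0f});

        // Two triangles per quad: (0,1,3) and (1,2,3), offset by 'base'.
        const std::uint16_t pattern[6] = {0, 1, 3, 1, 2, 3};
        for (std::uint16_t offset : pattern)
            indices.push_back(static_cast<std::uint16_t>(base + offset));
    }
};
```

After ten calls to `addSprite` the batch holds 40 vertices and 60 indices, which could then be submitted with one indexed draw call (in Direct3D 9 that would presumably be `DrawIndexedPrimitive` with `D3DPT_TRIANGLELIST`).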

To improve it further, you need to get rid of alpha transparency (performance-wise this is really evil). OK, if you don't want to get rid of it, you can at least render the solid sprites more efficiently by utilizing the z-buffer. Video hardware is really good at using the z-buffer to reject hidden pixels early, which prevents a lot of texture fetches. To take advantage of this, give the sprites a z-coordinate and render them in front-to-back order. This only works for solid sprites (alpha masking is OK, but alpha blending will not work). In general a pipeline could look like this (pseudo code):


List sl = sprite_list;
List solidList = getAllNonAlphaBlendedSprites(sl);
List alphaBlendList = getAllAlphaBlendedSprites(sl);

// one bucket of solid sprites per texture atlas
List buckets[NUMBER_OF_DIFFERENT_TEXTURE_ATLASES];
for(sprite in solidList) {
   int atlasIndex = sprite.getAtlasIndex();
   buckets[atlasIndex].add(sprite);
}

// the index buffer is constant, you only need to initialise it once;
// the quad (vertex) buffer is refilled for every batch
Vertex quadBuffer[MAX_SPRITES_PER_BUFFER*4];
int indexBuffer[MAX_SPRITES_PER_BUFFER*6];
for(int i=0; i<MAX_SPRITES_PER_BUFFER; i++) {
  indexBuffer[i*6+0] = i*4+0;
  indexBuffer[i*6+1] = i*4+1;
  indexBuffer[i*6+2] = i*4+3;
  indexBuffer[i*6+3] = i*4+1;
  indexBuffer[i*6+4] = i*4+2;
  indexBuffer[i*6+5] = i*4+3;
}

// render phase
for(singleBucket in buckets) {
  // sort from front to back
  singleBucket.sortFrontToBack();

  // activate atlas texture
  ...

  // fill batch
  int batchedSprites = 0;
  while(singleBucket.isNotEmpty()) {
    Sprite sp = singleBucket.removeFirstSprite();
    // transfer sprite to batch
    quadBuffer[batchedSprites*4+0] = ...sp.getVertex(0)..
    ..
    batchedSprites++;

    // buffer full ? then render it and start a new batch
    if(batchedSprites==MAX_SPRITES_PER_BUFFER) {
      DrawIndexTriArray(quadBuffer,indexBuffer,batchedSprites*2 /*count of tris*/);
      batchedSprites = 0;
    }
  }
  // render last (partially filled) batch
  if(batchedSprites>0) {
    DrawIndexTriArray(quadBuffer,indexBuffer,batchedSprites*2 /*count of tris*/);
  }
}
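For anyone who wants to experiment with the bucketing logic outside a renderer, the pseudo code above can be approximated in plain C++. This is only a sketch under stated assumptions: `Sprite` and `planSolidBatches` are hypothetical names, a smaller `depth` is assumed to mean closer to the viewer, and instead of drawing anything the function just reports how many sprites each draw call would contain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sprite record; a smaller depth is assumed to be closer.
struct Sprite { int atlasIndex; float depth; bool alphaBlended; };

// Returns the sprite count of each draw call that would be issued for the
// solid (non-blended) sprites: one bucket per atlas, sorted front to back,
// flushed whenever the buffer is full.
inline std::vector<std::size_t> planSolidBatches(const std::vector<Sprite>& sprites,
                                                 int atlasCount,
                                                 std::size_t maxSpritesPerBuffer)
{
    assert(atlasCount > 0 && maxSpritesPerBuffer > 0);

    std::vector<std::vector<Sprite>> buckets(static_cast<std::size_t>(atlasCount));
    for (const Sprite& s : sprites) {
        if (s.alphaBlended) continue;  // blended sprites are handled separately
        assert(s.atlasIndex >= 0 && s.atlasIndex < atlasCount);
        buckets[static_cast<std::size_t>(s.atlasIndex)].push_back(s);
    }

    std::vector<std::size_t> drawCalls;
    for (auto& bucket : buckets) {
        // Front to back, so the z-buffer can reject hidden pixels.
        std::sort(bucket.begin(), bucket.end(),
                  [](const Sprite& a, const Sprite& b) { return a.depth < b.depth; });

        for (std::size_t first = 0; first < bucket.size(); first += maxSpritesPerBuffer)
            drawCalls.push_back(std::min(maxSpritesPerBuffer, bucket.size() - first));
    }
    return drawCalls;
}
```

With five solid sprites on atlas 0, two on atlas 1 and a buffer size of two, this yields four draw calls instead of seven.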

#1: I can’t tell if you are using shaders. If not, use shaders.
#2: The vertex shader does not need a whole 4×4 matrix to do what it needs to do; it only needs a single vector with the normalized screen dimensions. This reduces bandwidth when updating uniforms.
#3: Don’t use sprites.
#4: Do use 2 vertex buffers and 1 pre-generated max-filled index buffer (see Ashaman73’s code above).
- #A: The index buffer should be 16 bits.
- #B: The vertex buffers should be double-buffered. Never overwrite part of a vertex buffer immediately after drawing it. If you have to write more than MAX_SPRITES_PER_BUFFER (borrowing from Ashaman73’s code) in a single frame, then you need to write to more than one buffer that frame (let’s say 3), then write to a new set of 3 buffers the next frame. But generally MAX_SPRITES_PER_BUFFER should be set high enough that it never overflows in a single frame and you bounce back and forth between only 2 buffers each frame.
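One possible reading of #2, sketched on the CPU: if sprite vertices are given in pixels, the transform to clip space is just a scale and an offset, so a single two-component vector can stand in for the 4×4 matrix. The names `Vec2`, `makeScreenScale` and `pixelToNdc` are hypothetical, the pixel origin is assumed to be the top-left corner with y pointing down, and any API-specific pixel-centre offset is ignored.

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// The one vector that would be uploaded as a shader constant:
// (2 / width, 2 / height).
inline Vec2 makeScreenScale(float width, float height)
{
    assert(width > 0.0f && height > 0.0f);
    return Vec2{ 2.0f / width, 2.0f / height };
}

// What the vertex shader would compute per vertex: map a pixel position
// (origin top-left, y down) to normalized device coordinates (-1..1, y up).
inline Vec2 pixelToNdc(Vec2 pixel, Vec2 scale)
{
    return Vec2{ pixel.x * scale.x - 1.0f, 1.0f - pixel.y * scale.y };
}
```

For a 1024 x 512 target, the pixel (512, 256) lands at the centre (0, 0) and the pixel (0, 0) at the top-left corner (-1, 1).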



Your biggest bottleneck is how you are managing your vertex buffer.
There is extensive reading material on best practices when updating a vertex buffer.
http://msdn.microsoft.com/en-us/library/windows/desktop/bb147263(v=vs.85).aspx#Using_Dynamic_Vertex_and_Index_Buffers
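As a rough illustration of the usual pattern for dynamic buffers: append new vertices behind the data the GPU may still be reading, and when the buffer runs out of room, discard it and start again from the beginning. The sketch below only does the bookkeeping on the CPU. `DynamicBufferCursor`, `LockMode` and `LockRequest` are hypothetical names; in Direct3D 9 the two modes would presumably correspond to locking with `D3DLOCK_NOOVERWRITE` and `D3DLOCK_DISCARD` on a buffer created with `D3DUSAGE_DYNAMIC`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two lock behaviours.
enum class LockMode { NoOverwrite, Discard };

struct LockRequest { LockMode mode; std::size_t offset; };

class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity) : capacity_(capacity)
    {
        assert(capacity > 0);
    }

    // Reserves room for 'count' vertices and says how the buffer should be
    // locked. Data already handed to the GPU is never overwritten in place:
    // either we append after it, or we ask for a fresh buffer (Discard).
    LockRequest reserve(std::size_t count)
    {
        assert(count <= capacity_);
        if (next_ + count > capacity_) {
            next_ = count;                      // restart at the beginning
            return {LockMode::Discard, 0};
        }
        const std::size_t offset = next_;
        next_ += count;
        return {LockMode::NoOverwrite, offset};
    }

private:
    std::size_t capacity_;
    std::size_t next_ = 0;
};
```

With a capacity of 100 vertices, three successive requests for 40 vertices would be placed at offsets 0 and 40 without overwriting, and the third would trigger a discard and start again at offset 0.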


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Just to be sure - how do you display the FPS in your version? Isn't mRenderer->renderDebugText using D3DXFont to draw the text? It is not very efficient, so it may account for part of the FPS difference. But of course not all of it ;)

Thanks L.Spiro - I have some reading to do :)

Tom KQT - I suspected that D3DXFont might be a bottleneck as well.

So, now I am sampling the renders over a 10 second period by incrementing a counter and displaying the result afterwards in a MessageBox. Not the cleanest of framerate counters. But, I am pretty confident that the loop is now running as fast as it can (DX calls aside)


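	fastCount++;	// one more frame rendered
	if(GetTickCount()>fastTime+10000)	// 10 seconds have elapsed
	{
		char szBuffer[16];

		// average frames per second over the 10 second sample
		itoa(fastCount/10,szBuffer,10);
		
		MessageBox(NULL,szBuffer,szBuffer,MB_OK);
		pVertexObject->Release();
		PostQuitMessage(0);
	}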
Looks ugly, but should be quick (and leaks like a sieve - LOL)
Oh, and I'm not using shaders at this point.
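A slightly tidier take on the same idea, as a sketch only: count frames and divide by the elapsed time, keeping the measurement separate from however the result is displayed. `FrameCounter` is a hypothetical helper built on the standard `std::chrono::steady_clock`, not something from the code above.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: counts frames and reports the average rate.
class FrameCounter {
public:
    using Clock = std::chrono::steady_clock;

    explicit FrameCounter(Clock::time_point start) : start_(start) {}

    // Call once per rendered frame.
    void onFrame() { ++frames_; }

    // Average frames per second between 'start' and 'now'.
    double averageFps(Clock::time_point now) const
    {
        const std::chrono::duration<double> elapsed = now - start_;
        assert(elapsed.count() > 0.0);
        return static_cast<double>(frames_) / elapsed.count();
    }

private:
    Clock::time_point start_;
    unsigned long long frames_ = 0;
};
```

In a render loop this would be constructed with `FrameCounter::Clock::now()`, `onFrame()` called after each present, and `averageFps(FrameCounter::Clock::now())` read once the sample period is over.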

So, I am currently trying to go through this list Microsoft recommends.

Using strikethrough as I go :)

General Performance Tips

•Clear only when you must.
- Only clearing the backbuffer.
•Minimize state changes and group the remaining state changes.
- How do you group state changes?
•Use smaller textures, if you can do so.
- 256 x 256 recommended. Done.
•Draw objects in your scene from front to back.
- All objects are using the same z depth at the moment. Using for 2D only at this stage.
•Use triangle strips instead of lists and fans. For optimal vertex cache performance, arrange strips to reuse triangle vertices sooner, rather than later.
- Only making quads. But, am using strips.
•Gracefully degrade special effects that require a disproportionate share of system resources.
- Not applicable yet.
•Constantly test your application's performance.
- Well that's what we are here for :)
•Minimize vertex buffer switches.
- Only have one vertex buffer in my app.
•Use static vertex buffers where possible.
- How do you know if it is static?
•Use one large static vertex buffer per FVF for static objects, rather than one per object.
- What if each object has the same vertex property? E.g. all objects are 256 x 256 quads? Reuse the same buffer?
•If your application needs random access into the vertex buffer in AGP memory, choose a vertex format size that is a multiple of 32 bytes. Otherwise, select the smallest appropriate format.
- Random access as in needing to change vertices at runtime?
•Draw using indexed primitives. This can allow for more efficient vertex caching within hardware.
- Trying this next. Again, what if each object has the same vertex property? Reuse the same buffer?
•If the depth buffer format contains a stencil channel, always clear the depth and stencil channels at the same time.
- Only using 2D with no stencils so this shouldn't apply (I am guessing).
•Combine the shader instruction and the data output where possible.
- Not using shaders yet.


Does this sound like I am on the right path? And please correct me if anything I have written is wrong. :)

•Minimize state changes and group the remaining state changes. How do you group state changes?

- for example... changing the alpha blending.

•Use static vertex buffers where possible. How do you know if it is static?

- i forget the flags but basically... can you write to it after it's created? then it isn't static

•Use one large static vertex buffer per FVF for static objects, rather than one per object. What if each object has the same vertex property? Eg. all objects are 256 x 256 quads? reuse the same buffer?

- i create 2-3 pools. switch pools every frame. hopefully reduces lag to gpu

•If your application needs random access into the vertex buffer in AGP memory, choose a vertex format size that is a multiple of 32 bytes. Otherwise, select the smallest appropriate format. Random access as in needing to change vertexes at runtime?

- data = vb->lock(); data +13 = x; vb->unlock(); lock/unlock as minimally as possible. write to a locked buffer as minimally as possible.

one thing i noticed is you are doing 1 draw call per sprite. group sprites by texture to reduce draw calls. you may have 10k sprites but do you have 10k textures? maybe you have 10k sprites but they are on 2 textures. you don't need 10k draw calls. you can do it in 2.
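To illustrate what grouping by state could look like in practice, here is a small sketch that sorts draw items so that items with identical state end up next to each other, then counts how many runs (and therefore draw calls or state changes) remain. `DrawItem` and `groupByState` are hypothetical names, and only two pieces of state (texture and alpha blending) are modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical record of the render state one sprite needs.
struct DrawItem { int textureId; bool alphaBlend; };

// Sorts the items so equal states are adjacent, then returns the number of
// batches, i.e. runs of items that share texture and blend state.
inline std::size_t groupByState(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return std::tie(a.alphaBlend, a.textureId)
                       < std::tie(b.alphaBlend, b.textureId);
              });

    std::size_t batches = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        const bool stateChanged = i == 0
            || items[i].textureId  != items[i - 1].textureId
            || items[i].alphaBlend != items[i - 1].alphaBlend;
        if (stateChanged) ++batches;
    }
    assert(batches <= items.size());
    return batches;
}
```

Ten thousand sprites that alternate between two textures collapse into two batches here. Note that sorting changes the draw order, so for alpha-blended sprites depth would likely need to be part of the sort key as well.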

Cool, thanks for the info.

I remember now that I have set the vertex buffer to read only. So, I have done the right thing there :)

I have no state changes at all (now I understand what a state change is), as I am purely doing a straight stress test render.

Also, I lock the vertex buffer only once, and this is done in my creation phase (outside and before the render loop), so no lag there either.

Your last point is an interesting one though. I have read a lot today about drawing sprites in one draw call, but I haven't seen how this is actually achieved. So, I have absolutely no idea how this can be done.

I would be extremely appreciative if you could shed light on this for me :)
