Optimising my renderer

DividedByZero · 2014-04-22T04:50:59

Hi guys,

For a while now I have been working on my own framework to replace my (unnecessary) reliance on Game Maker: Studio.

Today I decided to see how my engine benchmarks against GM:S. My code is reasonably tight (so I thought), but GM's renderer still runs rings around mine. GM:S uses DirectX 9c too.

These are the results:

# Sprites (256x256)    Mine    GM:S 1.3
1                      1570    ~1350
10                     706     ~1350
100                    106     ~1100
500                    25      ~620
1000                   13      ~450

My renderer is faster when displaying one sprite, but then drops off quite rapidly.

I am using the ID3DXSprite interface to create and render the sprites. This is my entire render code. The sprites are all stored in a vector and use the same image (for testing purposes).

void Renderer::renderSpriteQueue()
{
	SpriteSortByDepth();
	pSprite->Begin(D3DXSPRITE_ALPHABLEND);

	std::vector<Sprite>::iterator it;
	for(it=vSprite.begin();it<vSprite.end();it++)
	{
		RECT rectSpriteTextureArea;
		D3DXVECTOR3 v3Center;
		D3DXVECTOR3 v3Position;

		rectSpriteTextureArea.top=0;
		rectSpriteTextureArea.bottom=it->nSizeY;
		rectSpriteTextureArea.left=0;
		rectSpriteTextureArea.right=it->nSizeX;

		v3Center=D3DXVECTOR3(0,0,0);
		v3Position=D3DXVECTOR3(it->fPosX,it->fPosY,0);

		if(FAILED(pSprite->Draw(pTexture,&rectSpriteTextureArea,&v3Center,&v3Position,0xFFFFFFFF)))
			MessageBox(NULL,"Error","Error",NULL);
	}

	pSprite->Flush();
	pSprite->End();
}

Am I doing something inefficiently here? Would it be faster to just use a textured quad instead?

Any advice would be awesome.

Started by DividedByZero April 16, 2014 11:37 PM

41 comments, last by Hodgman 10 years ago

L. Spiro

25,818

April 17, 2014 10:06 AM

Your last point is an interesting one though. I have read a lot today about drawing sprites in one draw call, but I haven't seen how this is actually achieved. So, I have absolutely no idea how this can be done.

I would be extremely appreciative if you could shed light on this for me.

It was a bit implicit in previous posts.

#1: Create vertex buffer. Not static/read-only. Dynamic.
EACH FRAME
- #2: Lock it.
- #3: Fill it with the sprite vertices. Drawing 32 sprites means you put 32×4 vertices into the buffer. Obviously you will have to transform them on the CPU by the sprite’s position, rotation (only if applicable), and scale (only if applicable).
- #4: Unlock it.
- #5: Draw it using the pre-generated 16-bit index buffer.

As I mentioned, these vertex buffers should be double- or even triple-buffered and swapped each frame.
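
Steps #2 to #4 can be sketched without any Direct3D calls by treating the locked buffer as a plain array. This is only a rough illustration under assumptions: SpriteVertex, SpriteInstance, and fillSpriteVertices are made-up names, the corner order is chosen to suit a 0 1 2 / 0 2 3 index pattern, and in a real renderer the destination pointer would come from the vertex buffer's Lock() call rather than from ordinary memory.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pre-transformed vertex layout (screen position, rhw, colour, UV).
struct SpriteVertex {
    float x, y, z, rhw;
    std::uint32_t color;
    float u, v;
};

// Hypothetical per-sprite state, updated by the game logic each frame.
struct SpriteInstance {
    float posX, posY;     // screen position of the sprite's origin corner
    float width, height;  // unscaled size in pixels
    float rotation;       // radians, about the origin corner
    float scale;          // uniform scale
};

// Writes 4 vertices per sprite into dst (the memory a Lock() would return)
// and returns how many sprites were written.
std::size_t fillSpriteVertices(const std::vector<SpriteInstance>& sprites,
                               SpriteVertex* dst, std::size_t maxSprites)
{
    assert(dst != nullptr);
    const std::size_t count =
        sprites.size() < maxSprites ? sprites.size() : maxSprites;

    // Unit-quad corners: top-left, top-right, bottom-right, bottom-left.
    static const float corner[4][2] = { {0, 0}, {1, 0}, {1, 1}, {0, 1} };

    for (std::size_t i = 0; i < count; ++i) {
        const SpriteInstance& s = sprites[i];
        const float c  = std::cos(s.rotation);
        const float sn = std::sin(s.rotation);
        for (int k = 0; k < 4; ++k) {
            // Scale the corner, rotate it, then translate it on the CPU.
            const float lx = corner[k][0] * s.width  * s.scale;
            const float ly = corner[k][1] * s.height * s.scale;
            SpriteVertex& v = dst[i * 4 + k];
            v.x = s.posX + lx * c - ly * sn;
            v.y = s.posY + lx * sn + ly * c;
            v.z = 0.0f;
            v.rhw = 1.0f;
            v.color = 0xFFFFFFFFu;  // opaque white
            v.u = corner[k][0];
            v.v = corner[k][1];
        }
    }
    return count;
}
```

The returned count is what would decide how many primitives to draw in step #5 (two triangles per sprite).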

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

DividedByZero

1,261

Author

April 17, 2014 10:20 AM

Thanks L. Spiro,

I think it is starting to get through my thick head - LOL :)

I'll hit the code and see what I come up with. I'll post what I have later on.

Thanks again everyone for helping out. Must be frustrating sometimes. :)

TomKQT

1,708

April 17, 2014 10:23 AM

•Clear only when you must. Only clearing the backbuffer

EDIT: This following information turned out to be incorrect, as I thought :D

I'm not really sure now but I think I saw somewhere that you should clear everything at the same time, that means if you clear backbuffer, you should also clear depthbuffer by the same clear command? But I'm really not sure, I may be imagining this.

By Clear only when you must they mean if you don't need to call Clear(), don't call it at all.

DividedByZero

1,261

Author

April 17, 2014 11:35 AM

Ok, I am trying to put into action what L. Spiro suggested.

In my create code (before the render loop) I have created a vertex array of 10 identical quads (a bit hacky, but I'll clean that up later). I truncated each vertex to one line to save space here. I also set the texture here, as I'll only be using the one texture for this test (saves setting it every frame in the render cycle).

void *pVertexBuffer = NULL; 

struct D3DVERTEX{float x,y,z,rhw;DWORD color;float u;float v;} vertices[40];	// 4 verts * 10 quads
LPDIRECT3DVERTEXBUFFER9 pVertexObject = NULL;

// vertex description for our sprites
for(int n=0;n<10;n++)
{
	// each quad owns four consecutive vertices, starting at index 4*n
	vertices[4*n+0].x=0;vertices[4*n+0].y=256;vertices[4*n+0].z=0;vertices[4*n+0].rhw=1.0f;vertices[4*n+0].color=0xffffff;vertices[4*n+0].u=0.0;vertices[4*n+0].v=1.0;
	vertices[4*n+1].x=0;vertices[4*n+1].y=0;vertices[4*n+1].z=0;vertices[4*n+1].rhw=1.0f;vertices[4*n+1].color=0xffffff;vertices[4*n+1].u=0.0;vertices[4*n+1].v=0.0;
	vertices[4*n+2].x=256;vertices[4*n+2].y=256;vertices[4*n+2].z=0;vertices[4*n+2].rhw=1.0f;vertices[4*n+2].color=0xffffff;vertices[4*n+2].u=1.0;vertices[4*n+2].v=1.0;
	vertices[4*n+3].x=256;vertices[4*n+3].y=0;vertices[4*n+3].z=0;vertices[4*n+3].rhw=1.0f;vertices[4*n+3].color=0xffffff;vertices[4*n+3].u=1.0;vertices[4*n+3].v=0.0;
}

mRenderer->getDevice()->SetTexture(0,mRenderer->pTexture);

Now for the renderer part.

// #1: Create vertex buffer. Not static/read-only. Dynamic.
void *pVertexBufferDynamic=NULL;
LPDIRECT3DVERTEXBUFFER9 pVertexObjectDynamic=NULL;

int nQuantity=10;

if(FAILED(mRenderer->getDevice()->CreateVertexBuffer(nQuantity*4*sizeof(D3DVERTEX),NULL,D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1,D3DPOOL_MANAGED,&pVertexObjectDynamic,NULL)))
	return(1);

// #2: Lock it.
if(FAILED(pVertexObject->Lock(0,nQuantity*4*sizeof(D3DVERTEX),&pVertexBuffer,0)))
	return(2);

// #3: Fill it with the sprite vertices. Drawing 32 sprites means you put 32×4 vertices into the buffer.
memcpy(pVertexBuffer,vertices,nQuantity*4*sizeof(D3DVERTEX));

// #4: Unlock it.
pVertexObject->Unlock();

// #5: Draw it using the pre-generated 16-bit index buffer.
mRenderer->getDevice()->SetStreamSource(0,pVertexObject,0,sizeof(D3DVERTEX));
mRenderer->getDevice()->SetFVF(D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1);

// **** then what?
// ? Draw it using the pre-generated 16-bit index buffer.

pVertexObjectDynamic->Release();

There are a couple of things of concern here. At the locking phase, I am getting a fatal exception due to adding nQuantity. I assumed that I would have to have this in here somewhere as I am attempting to draw 10 quads. Am I right?

The other thing I am unsure of is "#5 Draw it using the pre-generated 16-bit index buffer.". What index buffer? Do I need to add another step somewhere?

Otherwise, is my code looking like it is on the right track?

Thanks again everyone

21st Century Moose

13,459

April 17, 2014 12:05 PM

•Clear only when you must. Only clearing the backbuffer

I'm not really sure now but I think I saw somewhere that you should clear everything at the same time, that means if you clear backbuffer, you should also clear depthbuffer by the same clear command? But I'm really not sure, I may be imagining this.

By Clear only when you must they mean if you don't need to call Clear(), don't call it at all.

It's more the case that if you have depth and stencil, and if you're clearing depth, then you should clear stencil at the same time. This is because we typically see depth and stencil interleaved in a D24S8 format, so they're not separate buffers: they're a single buffer that contains the data for both, and clearing both at the same time allows the hardware to do a fast clear (which may be as fast as just swapping out a pointer or setting a flag).

You can't always do this of course, and some algorithms require clearing them individually, but where possible you should.

I don't recall ever seeing any advice about perf gains from clearing colour at the same time, but I would expect there would be none as these are separate buffers.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

L. Spiro

25,818

April 17, 2014 02:16 PM

The other thing I am unsure of is "#5 Draw it using the pre-generated 16-bit index buffer.". What index buffer? Do I need to add another step somewhere?

Ashaman73 already gave you an example of it.
At the beginning of the program you make a single index buffer with 16-bit indices with the values [0 1 2 0 2 3 4 5 6 4 6 7] etc., with enough indices to draw the maximum number of sprites a vertex buffer can hold.
Do not draw the whole index buffer; you only draw as many primitives as you actually put into the vertex buffer.
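
As a rough sketch of that one-time setup, the index values can be generated in a loop: each quad contributes the pattern 0 1 2 and 0 2 3, offset by four times the quad number. The function names here are invented for illustration. In Direct3D 9 the resulting values would then be copied into an index buffer created with the D3DFMT_INDEX16 format, which is assumed rather than shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds indices for maxSprites quads, two triangles each.
// Quad q uses vertices 4q .. 4q+3 in the pattern 0 1 2, 0 2 3.
std::vector<std::uint16_t> buildQuadIndices(std::size_t maxSprites)
{
    // 16-bit indices can address at most 65536 vertices, i.e. 16384 quads.
    assert(maxSprites <= 16384);

    std::vector<std::uint16_t> indices;
    indices.reserve(maxSprites * 6);
    for (std::size_t q = 0; q < maxSprites; ++q) {
        const std::uint16_t base = static_cast<std::uint16_t>(q * 4);
        indices.push_back(base + 0);
        indices.push_back(base + 1);
        indices.push_back(base + 2);
        indices.push_back(base + 0);
        indices.push_back(base + 2);
        indices.push_back(base + 3);
    }
    return indices;
}

// Only part of the index buffer is drawn each frame:
// two triangles (six indices) per sprite actually written.
inline std::size_t primitiveCountFor(std::size_t spriteCount)
{
    return spriteCount * 2;
}
```

With 32 sprites in the vertex buffer, the draw call would cover 64 triangles regardless of how large the index buffer is.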

if(FAILED(mRenderer->getDevice()->CreateVertexBuffer(nQuantity*4*sizeof(D3DVERTEX),NULL,D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1,D3DPOOL_MANAGED,&pVertexObjectDynamic,NULL)))

That is not a dynamic vertex buffer.
You didn’t specify the D3DUSAGE_DYNAMIC flag nor did you put it in the D3DPOOL_SYSTEMMEM memory pool.
Just as a side note; white-space is not your enemy:

if( FAILED( mRenderer->getDevice()->CreateVertexBuffer( nQuantity*4*sizeof(D3DVERTEX),
    D3DUSAGE_DYNAMIC,
    D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1,
    D3DPOOL_SYSTEMMEM,
    &pVertexObjectDynamic,
    NULL ) ) )

pVertexObjectDynamic->Release();

Why are you releasing it? There is no information in your post about where these steps are happening.
Is this done on every loop? Once at shut-down?
I said create and destroy the vertex buffer only once, fill it every frame.

// #3: Fill it with the sprite vertices. Drawing 32 sprites means you put 32×4 vertices into the buffer.
memcpy(pVertexBuffer,vertices,nQuantity*4*sizeof(D3DVERTEX));

Just copying the vertices? Did you transform them?
Are the sprites actually moving? We have all been assuming you had 1,000 sprites moving on the screen.

If these sprites are all static, then go back to a static buffer and only put vertices into it once.
A dynamic vertex buffer is for frequent updating. If you aren’t going to move the sprites around then don’t update the vertex buffer.

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Juliean

7,344

April 17, 2014 02:30 PM

You didn’t specify the D3DUSAGE_DYNAMIC flag nor did you put it in the D3DPOOL_SYSTEMMEM memory pool.

Is there a specific reason you place the buffer in D3DPOOL_SYSTEMMEM instead of D3DPOOL_DEFAULT? Default resources can be locked when dynamic, and I always thought that was the preferred pool to store dynamic resources in. You know, since you have to write to the buffer from the CPU side, but the GPU also has to render from it afterwards.

Also L. Spiro, did you profile double/triple-buffering the vertex buffers versus just using the discard and nooverwrite lock flags: locking with discard after rendering, and locking with nooverwrite when submitting the sprites? I've never heard of anyone recommending manual buffering, and have always just read about locking like I described, which should have a similar effect to manual double-buffering. I don't have any data comparing the two, so that's why I'm asking. It would be interesting to hear if there really is an additional performance gain from this.
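
For readers unfamiliar with that locking scheme, the bookkeeping can be sketched on its own. This is a commonly described variant and not necessarily exactly what is meant above: append with no-overwrite while there is room, and restart at offset zero with discard when the buffer is full. LockMode, LockRequest, and DynamicBufferCursor are hypothetical names; in Direct3D 9 the two modes would map to the D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD flags passed to Lock().

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for the two lock modes (D3DLOCK_NOOVERWRITE / D3DLOCK_DISCARD).
enum class LockMode { NoOverwrite, Discard };

// Where to lock and with which mode; offsets are counted in vertices.
struct LockRequest {
    std::size_t offset;
    LockMode mode;
};

// Tracks the write position inside one dynamic vertex buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity)
        : capacity_(capacity), next_(0) {}

    // Reserves room for count vertices. Appends after earlier data while
    // space remains; otherwise wraps to offset 0 and asks for a discard,
    // so the driver can hand back fresh memory instead of stalling.
    LockRequest reserve(std::size_t count)
    {
        assert(count <= capacity_);
        LockRequest request;
        if (next_ + count > capacity_) {
            request.offset = 0;
            request.mode = LockMode::Discard;
        } else {
            request.offset = next_;
            request.mode = LockMode::NoOverwrite;
        }
        next_ = request.offset + count;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t next_;
};
```

Whether this beats manually swapping two or three buffers is exactly the open question; the sketch only shows the offset logic, not a measurement.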

Waterlimon

4,401

April 17, 2014 02:33 PM

It might be that GM:S uses different approaches for different kinds of sprite behaviors. Static sprites can all be in a single static VBO, dynamic sprites can use another method more suited for them (such as the one described by L. Spiro)

You might want to alter your test scenario such that the sprites are not static, but constantly in motion, to account for this kind of specific optimization that might be done by GM:S.

o3o

NightCreature83

5,061

April 17, 2014 02:42 PM

GM:S might also just be using instanced rendering with a minimal vertex shader which reads the quad's position and size from the vertex buffer and applies them as the transform; these can be stored as a single vec4.
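
What such a vertex shader would compute can be emulated on the CPU in a few lines. This is a guess at the idea, not GM:S's actual implementation: Float2, Float4, and expandCorner are made-up names, and the packing of (x, y, width, height) into one vec4 is an assumption.

```cpp
// Minimal stand-ins for shader vector types.
struct Float2 { float x, y; };
struct Float4 { float x, y, z, w; };

// inst holds one sprite's per-instance data: (x, y, width, height).
// corner is a unit-quad corner in [0,1] x [0,1], shared by all sprites.
// The result is the corner's screen position, as a vertex shader
// would produce it before the projection step.
inline Float2 expandCorner(const Float4& inst, const Float2& corner)
{
    Float2 out = { inst.x + corner.x * inst.z,
                   inst.y + corner.y * inst.w };
    return out;
}
```

The appeal is that only one vec4 per sprite changes each frame, while the four corner vertices stay in a static buffer.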

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

Erik Rufelt

5,903

April 17, 2014 02:42 PM

All these suggestions are excellent and I certainly don't mean this as disagreeing with them, but did you try using a much smaller sprite of just a few pixels?

Your numbers really don't fit overhead as the cause, at just 1000 sprites. If you had 100,000 then sure, but at 1000.. doubtful. What graphics card are you running on?

If your 256x256 sprites are on screen and never getting culled, then that's 65 million drawn pixels per frame.. which equals overdrawing fullscreen 1920x1080 32 times.. seems reasonable that would run at about 35 FPS, depending on the graphics card of course.

(If your sprites don't have translucency, GM could very well draw them front to back and let the depth test cut down the number of drawn pixels, or use other similar techniques to improve performance. At 450 FPS x 65M pixels they are approaching the theoretical limit of NVIDIA's $1000 graphics cards, a limit that is rarely reached in real numbers, so it seems doubtful that there's not some optimization going on).
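
The arithmetic behind those figures can be written out as a small sketch. The constant and function names are invented for illustration, and the numbers assume 1000 sprites of 256x256 that are all fully on screen with every pixel actually drawn.

```cpp
#include <cstdint>

// Pixels touched per frame if nothing is culled or rejected.
constexpr std::int64_t kSpriteCount    = 1000;
constexpr std::int64_t kSpritePixels   = 256 * 256;                     // 65,536
constexpr std::int64_t kPixelsPerFrame = kSpriteCount * kSpritePixels;  // 65,536,000
constexpr std::int64_t kScreenPixels   = 1920 * 1080;                   // 2,073,600

// How many times a full 1920x1080 screen gets covered (about 31.6).
constexpr double overdrawFactor()
{
    return static_cast<double>(kPixelsPerFrame) /
           static_cast<double>(kScreenPixels);
}

// Fill rate needed to sustain a given frame rate, in pixels per second.
constexpr double pixelsPerSecond(double fps)
{
    return fps * static_cast<double>(kPixelsPerFrame);
}
```

At 450 FPS this comes to roughly 29.5 billion pixels per second, against under 1 billion at 13 FPS.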
