Optimising my renderer

DividedByZero · 2014-04-22T04:50:59

Hi guys,

For a while now I have been working on my own framework to replace my (unnecessary) reliance on Game Maker: Studio. Today I decided to see how my engine benchmarks against GM:S. My code is reasonably tight (so I thought), but GM's renderer still runs rings around mine. GM:S uses DirectX 9c too.

These are the results (FPS):

# Sprites (256x256)    Mine    GM:S 1.3
1                      1570    ~1350
10                     706     ~1350
100                    106     ~1100
500                    25      ~620
1000                   13      ~450

My renderer is faster when displaying one sprite, but then drops off quite rapidly. I am using the ID3DXSprite interface to create and render the sprites.

This is my entire render code. The sprites are all stored in a vector and use the same image (for testing purposes).

void Renderer::renderSpriteQueue()
{
    SpriteSortByDepth();
    pSprite->Begin(D3DXSPRITE_ALPHABLEND);

    std::vector<Sprite>::iterator it;
    for(it=vSprite.begin();it!=vSprite.end();++it)
    {
        RECT rectSpriteTextureArea;
        D3DXVECTOR3 v3Center;
        D3DXVECTOR3 v3Position;

        rectSpriteTextureArea.top=0;
        rectSpriteTextureArea.bottom=it->nSizeY;
        rectSpriteTextureArea.left=0;
        rectSpriteTextureArea.right=it->nSizeX;

        v3Center=D3DXVECTOR3(0,0,0);
        v3Position=D3DXVECTOR3(it->fPosX,it->fPosY,0);

        if(FAILED(pSprite->Draw(pTexture,&rectSpriteTextureArea,&v3Center,&v3Position,0xFFFFFFFF)))
            MessageBox(NULL,"Error","Error",MB_OK);
    }

    pSprite->Flush();
    pSprite->End();
}

Am I doing something inefficiently here? Would it be faster to just use a textured quad instead?

Any advice would be awesome.


Started by DividedByZero April 16, 2014 11:37 PM

41 comments, last by Hodgman 10 years ago

L. Spiro

25,818

April 17, 2014 02:58 PM

Is there a specific reason you place the buffer in D3DPOOL_SYSTEMMEM instead of D3DPOOL_DEFAULT? Default resources can be locked when dynamic, I always thought that was the preferred pool to store dynamic resources to. You know, since you have to write to the buffer from CPU-side, but the GPU also has to render from it afterwards.

You are correct that it should be D3DPOOL_DEFAULT.
Of note to the original poster is that in any case you should always use the actual enumerated value, not 0 (and especially not NULL; it’s not a pointer and that is extremely misleading), even though D3DPOOL_DEFAULT is 0.

Also L. Spiro, did you profile double/triple-buffering the vertex buffers versus just using no-overwrite/discard as the lock flag after rendering, and locking with no-overwrite when submitting the sprites? I've never heard of anyone recommending this; I've only ever read about locking the way I described, which should have a similar effect to manual double-buffering. I don't have any data comparing the two, so that's why I'm asking. It would be interesting to hear whether there really is an additional performance gain from this.

We use double-buffering at work, and they did (not I) profile it on Xbox 360.
These days it may be very similar in performance, but it is more likely to be similar to orphaning in OpenGL (which is slower than double-buffering, and I have tested that, as others have) because they are basically the same process, and if the driver tries to secretly double-buffer behind the scenes then the magic that makes that happen is implicitly more cycles than manually double-buffering.
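
A minimal sketch of the manual double-buffering idea, assuming two buffers that alternate each frame so the CPU never writes into the one the GPU may still be reading from the previous frame. `DoubleBuffered` and its members are hypothetical names, and `std::vector` merely stands in for a real vertex buffer; none of this is taken from the engines discussed here.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Holds two buffers and hands out one of them per frame.
// Hypothetical helper for illustration only.
template <typename Buffer>
class DoubleBuffered {
public:
    // Buffer the CPU may write to during the current frame.
    Buffer& current() { return buffers_[index_]; }

    // Call once per frame after submitting draws: the other buffer
    // becomes the write target for the next frame.
    void endFrame() { index_ ^= 1; }

    std::size_t index() const { return index_; }

private:
    std::array<Buffer, 2> buffers_{};
    std::size_t index_ = 0;
};
```

With real Direct3D 9 resources, each slot would presumably hold its own vertex buffer pointer, and `endFrame()` would be called after the frame's draw calls have been issued.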

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

21st Century Moose

13,459

April 17, 2014 03:39 PM

It's worth noting however that you can draw from a vertex buffer in D3DPOOL_SYSTEMMEM.

Dynamic buffers are more suitable for cases where you're using the discard/no-overwrite pattern, i.e. you're continually appending to the buffer and never overwriting a region that's been previously written to since the last discard. A system memory buffer might be more useful if you're jumping around in the buffer and overwriting smaller regions of it with more random access.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

TomKQT

1,708

April 17, 2014 05:28 PM

It's more the case that if you have depth and stencil, and if you're clearing depth, then you should clear stencil at the same time. This is because we typically see depth and stencil interleaved in a D24S8 format, so they're not separate buffers: they're a single buffer that contains the data for both, and clearing both at the same time allows the hardware to do a fast clear (which may be as fast as just swapping out a pointer or setting a flag).

Yea, that's probably it. And it makes perfect sense. I somehow confused it in my memory.

hdxpete

512

April 18, 2014 01:39 AM

for(texture in textures)
    list.clear()
    for(sprite in sprites)
        if(sprite.texture == texture)
            list.append(sprite)
    if(list.size > 0)
        list.draw()

That's the basics. Implementation is up to you.
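
The pseudocode above could be sketched in C++ roughly as follows. This is only an illustration under assumptions: `Sprite`, `textureId`, and `groupByTexture` are hypothetical names, an integer id stands in for a texture pointer, and the grouping is done in a single pass over the sprites rather than one pass per texture.

```cpp
#include <map>
#include <vector>

// Hypothetical sprite record; only the fields needed for grouping.
struct Sprite {
    int textureId;  // stand-in for a texture pointer or handle
    float x, y;
};

// Groups sprites by texture so that each group can be submitted
// with one texture bind and one draw call.
std::map<int, std::vector<Sprite>> groupByTexture(const std::vector<Sprite>& sprites) {
    std::map<int, std::vector<Sprite>> batches;
    for (const Sprite& s : sprites) {
        batches[s.textureId].push_back(s);  // creates the group on first use
    }
    return batches;
}
```

Each entry of the returned map would then be drawn as one batch. Note that grouping purely by texture ignores depth ordering, so a renderer that sorts sprites by depth would need to reconcile the two.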

DividedByZero

1,261

Author

April 21, 2014 12:39 AM

Thanks guys. I have been away a few days and should be able to have another go at this today

DividedByZero

1,261

Author

April 21, 2014 03:01 AM

I have had another crack at this.

What I have here is mostly in the 'create' phase of my app (I'll fix this up later). Is this more or less a working version of instancing? I do now have two quads displaying from a single draw call.


void *pVertexBuffer=NULL;
LPDIRECT3DVERTEXBUFFER9 pVertexObject=NULL;
LPDIRECT3DINDEXBUFFER9 pIndexBuffer=NULL; // the pointer to the index buffer

struct D3DVERTEX{float x,y,z,rhw;DWORD color;float u;float v;};

D3DVERTEX vertices[8]={ 0,256,0,1.0f,0xffffff,0.0,1.0,
0,0,0,1.0f,0xffffff,0.0,0.0,
256,256,0,1.0f,0xffffff,1.0,1.0,
256,0,0,1.0f,0xffffff,1.0,0.0,

512,256,0,1.0f,0xffffff,0.0,1.0,
512,0,0,1.0f,0xffffff,0.0,0.0,
768,256,0,1.0f,0xffffff,1.0,1.0,
768,0,0,1.0f,0xffffff,1.0,0.0};

// Usage (2nd param) was D3DUSAGE_WRITEONLY; 0 means no usage flags
// 8 vertices = 2 quads
if(FAILED(mRenderer->getDevice()->CreateVertexBuffer(8*sizeof(D3DVERTEX),0,D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1,D3DPOOL_MANAGED,&pVertexObject,NULL)))
return(0);

if(FAILED(pVertexObject->Lock(0,8*sizeof(D3DVERTEX),&pVertexBuffer,0)))
return(0);

memcpy(pVertexBuffer,vertices,8*sizeof(D3DVERTEX));
pVertexObject->Unlock();

mRenderer->getDevice()->SetStreamSource(0,pVertexObject,0,sizeof(D3DVERTEX));
mRenderer->getDevice()->SetFVF(D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1);

mRenderer->getDevice()->SetTexture(0,mRenderer->pTexture);

// Do the indices
void *pVoid;

short indices[]={ 0,1,2,
2,1,3,

4,5,6,
6,5,7,};

mRenderer->getDevice()->CreateIndexBuffer(12*sizeof(short),0,D3DFMT_INDEX16,D3DPOOL_MANAGED,&pIndexBuffer,NULL);
pIndexBuffer->Lock(0,0,(void**)&pVoid,0);
memcpy(pVoid, indices, sizeof(indices));
pIndexBuffer->Unlock();

mRenderer->getDevice()->SetIndices(pIndexBuffer);

And the render phase


mRenderer->getDevice()->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 8, 0, 4);

I plan on cleaning this up to make it more 'dynamic', but is this the basics of how it is done?
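
One way the hard-coded arrays above might be made 'dynamic' is to generate the vertices and indices on the CPU for any number of quads. The sketch below assumes the same vertex layout and winding as the post; `Vertex`, `QuadBatch`, and `appendQuad` are hypothetical names, and no Direct3D calls are made. Strictly speaking this technique is usually called batching rather than hardware instancing.

```cpp
#include <cstdint>
#include <vector>

// Same layout as the D3DVERTEX struct in the post (XYZRHW | DIFFUSE | TEX1).
struct Vertex {
    float x, y, z, rhw;
    std::uint32_t color;
    float u, v;
};

struct QuadBatch {
    std::vector<Vertex> vertices;        // 4 per quad
    std::vector<std::uint16_t> indices;  // 6 per quad (two triangles)
};

// Appends one axis-aligned quad with its top-left corner at (left, top).
// 16-bit indices limit a batch to 65536 vertices, i.e. 16384 quads.
void appendQuad(QuadBatch& batch, float left, float top, float width, float height) {
    const std::uint16_t base = static_cast<std::uint16_t>(batch.vertices.size());
    const std::uint32_t white = 0xFFFFFFFFu;  // opaque white (assumption)
    const float right = left + width;
    const float bottom = top + height;

    batch.vertices.push_back({left,  bottom, 0.0f, 1.0f, white, 0.0f, 1.0f});
    batch.vertices.push_back({left,  top,    0.0f, 1.0f, white, 0.0f, 0.0f});
    batch.vertices.push_back({right, bottom, 0.0f, 1.0f, white, 1.0f, 1.0f});
    batch.vertices.push_back({right, top,    0.0f, 1.0f, white, 1.0f, 0.0f});

    // Same winding as the post: (0,1,2) and (2,1,3), offset by the base vertex.
    const std::uint16_t offsets[6] = {0, 1, 2, 2, 1, 3};
    for (std::uint16_t o : offsets) {
        batch.indices.push_back(static_cast<std::uint16_t>(base + o));
    }
}
```

The two arrays could then be copied into the locked vertex and index buffers, with the draw call taking `vertices.size()` as the vertex count and `indices.size() / 3` as the primitive count.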

DividedByZero

1,261

Author

April 21, 2014 08:40 AM

So, now I have managed to draw 100 quads in one draw call

Complete render loop

mRenderer->getDevice()->DrawIndexedPrimitive(D3DPT_TRIANGLELIST,0,0,sprites*4,0,sprites*2);

But, I am finding that the results are identical to before at ~300 FPS.

With an identical scene in GM:S I am still getting ~1100 FPS.

I am still somewhat bewildered. Drawing all of the sprites in one call didn't seem to make any difference whatsoever (compared to having a loop and drawing each quad individually).

I am surprised that instancing made no difference at all.

Juliean

7,344

April 21, 2014 09:59 AM

When you are writing the sprites to the vertex buffer, are you locking & unlocking it once, or for every sprite separately? I found that locking multiple times can have a very bad impact on performance. Also:

if(FAILED(pVertexObject->Lock(0,8*sizeof(D3DVERTEX),&pVertexBuffer,0)))

You shouldn't lock with "0" as the flag (last parameter). Since it does not appear you are double-buffering as L. Spiro suggested, you should lock once with "D3DLOCK_DISCARD | D3DLOCK_NOOVERWRITE" after rendering, and then with "D3DLOCK_NOOVERWRITE", ideally only once for all sprites.
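
A commonly described variant of discard/no-overwrite locking appends with no-overwrite while the buffer has room and switches to discard once it would overflow. The sketch below only models that bookkeeping; `LockMode`, `LockRequest`, and `DynamicBufferCursor` are hypothetical names standing in for the real `D3DLOCK_DISCARD` / `D3DLOCK_NOOVERWRITE` flags, which as far as I know are only valid on buffers created with `D3DUSAGE_DYNAMIC`.

```cpp
#include <cstddef>

// Stand-ins for D3DLOCK_DISCARD / D3DLOCK_NOOVERWRITE so the sketch
// compiles without the Direct3D headers.
enum class LockMode { Discard, NoOverwrite };

struct LockRequest {
    std::size_t offset;  // byte offset to lock at
    LockMode mode;       // flag the caller would pass to Lock()
};

// Tracks the write position inside a dynamic buffer of fixed capacity.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes), offset_(0), fresh_(true) {}

    // Reserves `bytes` (assumed <= capacity). Appends with NoOverwrite while
    // there is room; on first use or overflow, restarts at 0 with Discard.
    LockRequest reserve(std::size_t bytes) {
        LockMode mode = LockMode::NoOverwrite;
        if (fresh_ || offset_ + bytes > capacity_) {
            offset_ = 0;
            mode = LockMode::Discard;
            fresh_ = false;
        }
        LockRequest request{offset_, mode};
        offset_ += bytes;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t offset_;
    bool fresh_;
};
```

In real code the returned offset and mode would feed the offset and flags arguments of the vertex buffer's `Lock()` call, and the offset would also determine the start vertex for the subsequent draw.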

DividedByZero

1,261

Author

April 21, 2014 10:40 AM

Only have the single line of code (as above) in the entire render loop (just seeing how much I can throw at the renderer).

The locking is done before the render loop and never gets called again. So, it is literally a one liner in the render cycle, nothing more than that. So, no state changes or anything (I'll do all that later on once I am happy with the renderer).

I'll try the double-buffering and see what happens.

L. Spiro

25,818

April 21, 2014 12:42 PM

I'll try the double-buffering and see what happens.

Double-buffering doesn’t do anything unless you are updating the buffers every frame.
Most of the advice you have gotten has been under the assumption that you are.

If you aren’t, as I said, go back to static buffers and draw all the sprites in a single call. Your bottleneck would be only the fact that you make 1,000 render calls instead of 1.

L. Spiro

