Optimising my renderer

Started by
41 comments, last by Hodgman 10 years ago

Your last point is an interesting one though. I have read a lot today about drawing sprites in one draw call, but I haven't seen how this is actually achieved, so I have absolutely no idea how it can be done.

I would be extremely appreciative if you could shed light on this for me.

It was a bit implicit in previous posts.


#1: Create vertex buffer. Not static/read-only. Dynamic.
EACH FRAME
- #2: Lock it.
- #3: Fill it with the sprite vertices. Drawing 32 sprites means you put 32×4 vertices into the buffer. Obviously you will have to transform them on the CPU by the sprite’s position, rotation (only if applicable), and scale (only if applicable).
- #4: Unlock it.
- #5: Draw it using the pre-generated 16-bit index buffer.


As I mentioned, these vertex buffers should be double- or even triple-buffered and swapped each frame.
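
As a rough illustration of step #3 (not code from this thread), the CPU-side transform could look something like the sketch below. The SpriteVertex and Sprite structs and both function names are made up for the example, and the destination pointer stands in for whatever Lock() hands back. It assumes sprites are positioned by their centre and that each quad's corners are written in order around the perimeter.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical vertex layout for this sketch (pre-transformed position, colour, UV).
struct SpriteVertex {
    float x, y, z, rhw;
    unsigned int color;
    float u, v;
};

// Hypothetical per-sprite state; a real renderer will carry more.
struct Sprite {
    float x, y;          // centre position in screen pixels
    float width, height; // unscaled size in pixels
    float rotation;      // radians
    float scale;         // uniform scale factor
};

// Writes the 4 transformed vertices of one sprite into out[0..3].
// Corner order: top-left, top-right, bottom-right, bottom-left.
void writeSpriteVertices(const Sprite& s, SpriteVertex* out) {
    static const float corner[4][2] = {
        {-0.5f, -0.5f}, {0.5f, -0.5f}, {0.5f, 0.5f}, {-0.5f, 0.5f}};
    static const float uv[4][2] = {
        {0.0f, 0.0f}, {1.0f, 0.0f}, {1.0f, 1.0f}, {0.0f, 1.0f}};
    const float c = std::cos(s.rotation);
    const float sn = std::sin(s.rotation);
    for (int i = 0; i < 4; ++i) {
        // Scale the unit corner, rotate it, then translate to the sprite position.
        const float lx = corner[i][0] * s.width * s.scale;
        const float ly = corner[i][1] * s.height * s.scale;
        out[i].x = s.x + lx * c - ly * sn;
        out[i].y = s.y + lx * sn + ly * c;
        out[i].z = 0.0f;
        out[i].rhw = 1.0f;
        out[i].color = 0xffffffffu;
        out[i].u = uv[i][0];
        out[i].v = uv[i][1];
    }
}

// Fills dst (e.g. the pointer returned by Lock) with 4 vertices per sprite.
// Stops early if the buffer is full and returns the number of vertices written.
std::size_t fillSpriteVertices(const std::vector<Sprite>& sprites,
                               SpriteVertex* dst,
                               std::size_t capacityInVertices) {
    assert(dst != nullptr);
    std::size_t written = 0;
    for (const Sprite& s : sprites) {
        if (written + 4 > capacityInVertices) break;
        writeSpriteVertices(s, dst + written);
        written += 4;
    }
    return written;
}
```

The return value divided by 4 is the number of sprites actually in the buffer, which is what decides how many primitives to draw in step #5.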


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Thanks L. Spiro,

I think it is starting to get through my thick head - LOL :)

I'll hit the code and see what I come up with. I'll post what I have later on.

Thanks again everyone for helping out. Must be frustrating sometimes. :)

•Clear only when you must. Only clearing the backbuffer

EDIT: This following information turned out to be incorrect, as I thought :D

I'm not really sure now but I think I saw somewhere that you should clear everything at the same time, that means if you clear backbuffer, you should also clear depthbuffer by the same clear command? But I'm really not sure, I may be imagining this.

By Clear only when you must they mean if you don't need to call Clear(), don't call it at all.

Ok, I am trying to put into action what L. Spiro suggested.

In my create code (before the render loop) I have created a vertex array of 10 identical quads (a bit hacky, but I'll clean that up later). I truncated each vertex to one line to save space here. I also set the texture here, as I'll only be using the one texture for this test (saves setting it every frame in the render cycle).

void *pVertexBuffer = NULL; 

struct D3DVERTEX{float x,y,z,rhw;DWORD color;float u;float v;} vertices[40];	// 4 verts * 10 quads
LPDIRECT3DVERTEXBUFFER9 pVertexObject = NULL;

// vertex description for our sprites
for(int n=0;n<10;n++)
{
	// Each quad owns 4 consecutive vertices, so quad n starts at index 4*n.
	vertices[4*n+0].x=0;vertices[4*n+0].y=256;vertices[4*n+0].z=0;vertices[4*n+0].rhw=1.0f;vertices[4*n+0].color=0xffffff;vertices[4*n+0].u=0.0;vertices[4*n+0].v=1.0;
	vertices[4*n+1].x=0;vertices[4*n+1].y=0;vertices[4*n+1].z=0;vertices[4*n+1].rhw=1.0f;vertices[4*n+1].color=0xffffff;vertices[4*n+1].u=0.0;vertices[4*n+1].v=0.0;
	vertices[4*n+2].x=256;vertices[4*n+2].y=256;vertices[4*n+2].z=0;vertices[4*n+2].rhw=1.0f;vertices[4*n+2].color=0xffffff;vertices[4*n+2].u=1.0;vertices[4*n+2].v=1.0;
	vertices[4*n+3].x=256;vertices[4*n+3].y=0;vertices[4*n+3].z=0;vertices[4*n+3].rhw=1.0f;vertices[4*n+3].color=0xffffff;vertices[4*n+3].u=1.0;vertices[4*n+3].v=0.0;
}

mRenderer->getDevice()->SetTexture(0,mRenderer->pTexture);
Now for the renderer part.

// #1: Create vertex buffer. Not static/read-only. Dynamic.
void *pVertexBufferDynamic=NULL;
LPDIRECT3DVERTEXBUFFER9 pVertexObjectDynamic=NULL;

int nQuantity=10;

if(FAILED(mRenderer->getDevice()->CreateVertexBuffer(nQuantity*4*sizeof(D3DVERTEX),NULL,D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1,D3DPOOL_MANAGED,&pVertexObjectDynamic,NULL)))
	return(1);

// #2: Lock it.
if(FAILED(pVertexObject->Lock(0,nQuantity*4*sizeof(D3DVERTEX),&pVertexBuffer,0)))
	return(2);

// #3: Fill it with the sprite vertices. Drawing 32 sprites means you put 32×4 vertices into the buffer.
memcpy(pVertexBuffer,vertices,nQuantity*4*sizeof(D3DVERTEX));

// #4: Unlock it.
pVertexObject->Unlock();

// #5: Draw it using the pre-generated 16-bit index buffer.
mRenderer->getDevice()->SetStreamSource(0,pVertexObject,0,sizeof(D3DVERTEX));
mRenderer->getDevice()->SetFVF(D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1);

// **** then what?
// ? Draw it using the pre-generated 16-bit index buffer.

pVertexObjectDynamic->Release();
There are a couple of things of concern here. At the locking phase, I am getting a fatal exception due to adding nQuantity. I assumed that I would have to have this in here somewhere as I am attempting to draw 10 quads. Am I right?

The other thing I am unsure of is "#5 Draw it using the pre-generated 16-bit index buffer.". What index buffer? Do I need to add another step somewhere?

Otherwise, is my code looking like it is on the right track?

Thanks again everyone :)

•Clear only when you must. Only clearing the backbuffer

I'm not really sure now but I think I saw somewhere that you should clear everything at the same time, that means if you clear backbuffer, you should also clear depthbuffer by the same clear command? But I'm really not sure, I may be imagining this.

By Clear only when you must they mean if you don't need to call Clear(), don't call it at all.

It's more the case that if you have depth and stencil, and if you're clearing depth, then you should clear stencil at the same time. This is because we typically see depth and stencil interleaved in a D24S8 format, so they're not separate buffers: they're a single buffer that contains the data for both, and clearing both at the same time allows the hardware to do a fast clear (which may be as fast as just swapping out a pointer or setting a flag).

You can't always do this of course, and some algorithms require clearing them individually, but where possible you should.

I don't recall ever seeing any advice about perf gains from clearing colour at the same time, but I would expect there would be none as these are separate buffers.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

The other thing I am unsure of is "#5 Draw it using the pre-generated 16-bit index buffer.". What index buffer? Do I need to add another step somewhere?

Ashaman73 already gave you an example of it.
At the beginning of the program you make a single index buffer of 16-bit indices with the values [0 1 2 0 2 3 4 5 6 4 6 7] etc., with enough indices to draw the maximum number of sprites a vertex buffer can hold.
Do not draw the whole index buffer; you only draw as many primitives as you actually put into the vertex buffer.
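
For illustration only, here is one way the index values could be generated on the CPU before being copied into the index buffer. buildQuadIndices and kMaxSprites are names made up for this sketch. It assumes each sprite's 4 vertices are stored in order around the quad's perimeter, so that (0 1 2) and (0 2 3) form the two triangles; a strip-style vertex order would need a different pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 16-bit indices can address vertices 0..65535, i.e. at most 16384 quads.
constexpr std::size_t kMaxSprites = 65536 / 4;

// Builds 6 indices per sprite following the pattern 0 1 2 0 2 3,
// offset by 4 vertices for each successive sprite.
std::vector<std::uint16_t> buildQuadIndices(std::size_t spriteCount) {
    assert(spriteCount <= kMaxSprites);
    static const std::uint16_t pattern[6] = {0, 1, 2, 0, 2, 3};
    std::vector<std::uint16_t> indices;
    indices.reserve(spriteCount * 6);
    for (std::size_t i = 0; i < spriteCount; ++i) {
        for (std::size_t k = 0; k < 6; ++k) {
            indices.push_back(static_cast<std::uint16_t>(i * 4 + pattern[k]));
        }
    }
    return indices;
}
```

With N sprites in the vertex buffer, only the first N*6 indices are used: in Direct3D 9 terms that is a DrawIndexedPrimitive call with D3DPT_TRIANGLELIST, N*4 vertices and N*2 primitives.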


if(FAILED(mRenderer->getDevice()->CreateVertexBuffer(nQuantity*4*sizeof(D3DVERTEX),NULL,D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1,D3DPOOL_MANAGED,&pVertexObjectDynamic,NULL)))

That is not a dynamic vertex buffer.
You didn’t specify the D3DUSAGE_DYNAMIC flag nor did you put it in the D3DPOOL_SYSTEMMEM memory pool.
Just as a side note: whitespace is not your enemy:
if( FAILED( mRenderer->getDevice()->CreateVertexBuffer( nQuantity*4*sizeof(D3DVERTEX),
    D3DUSAGE_DYNAMIC,
    D3DFVF_XYZRHW|D3DFVF_DIFFUSE|D3DFVF_TEX1,
    D3DPOOL_SYSTEMMEM,
    &pVertexObjectDynamic,
    NULL ) ) )

pVertexObjectDynamic->Release();

Why are you releasing it? There is no information in your post about where these steps are happening.
Is this done on every loop? Once at shut-down?
I said create and destroy the vertex buffer only once, fill it every frame.


// #3: Fill it with the sprite vertices. Drawing 32 sprites means you put 32×4 vertices into the buffer.
memcpy(pVertexBuffer,vertices,nQuantity*4*sizeof(D3DVERTEX));

Just copying the vertices? Did you transform them?
Are the sprites actually moving? We have all been assuming you had 1,000 sprites moving on the screen.


If these sprites are all static, then go back to a static buffer and only put vertices into it once.
A dynamic vertex buffer is for frequent updating. If you aren’t going to move the sprites around then don’t update the vertex buffer.



L. Spiro


You didn’t specify the D3DUSAGE_DYNAMIC flag nor did you put it in the D3DPOOL_SYSTEMMEM memory pool.

Is there a specific reason you place the buffer in D3DPOOL_SYSTEMMEM instead of D3DPOOL_DEFAULT? Default resources can be locked when dynamic, and I always thought that was the preferred pool for dynamic resources. You know, since you have to write to the buffer from the CPU side, but the GPU also has to render from it afterwards.

Also L. Spiro, did you profile double/triple-buffering the vertex buffers versus just using discard as the lock flag after rendering, and locking with nooverwrite for submitting the sprites? I've never heard of anyone recommending manual buffering; I have only ever read about locking as I described, which should have a similar effect to manual double-buffering. I don't have any data comparing the two, so that's why I'm asking. It would be interesting to hear if there really is an additional performance gain from this.
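
To make the locking scheme in question concrete, here is a small sketch of the bookkeeping only, with no Direct3D calls. LockMode, LockRequest and DynamicBufferCursor are hypothetical names; the two enum values stand in for the real D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD flags, which only apply to buffers created with D3DUSAGE_DYNAMIC. The idea is to keep appending with nooverwrite while there is room, and to discard and restart at offset 0 once the buffer would overflow.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for D3DLOCK_NOOVERWRITE / D3DLOCK_DISCARD so this compiles without d3d9.h.
enum class LockMode { NoOverwrite, Discard };

// Where to lock and with which flag.
struct LockRequest {
    std::size_t offsetInVertices;
    LockMode mode;
};

// Tracks the next free vertex in a dynamic vertex buffer of fixed capacity.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityInVertices)
        : capacity_(capacityInVertices), next_(0) {}

    // Reserves room for vertexCount vertices and says how the lock should be made.
    LockRequest reserve(std::size_t vertexCount) {
        assert(vertexCount <= capacity_);
        if (next_ + vertexCount > capacity_) {
            // Not enough room left: ask for fresh memory and start from the beginning.
            next_ = vertexCount;
            return LockRequest{0, LockMode::Discard};
        }
        // Still room: append after data the GPU may still be reading.
        const std::size_t offset = next_;
        next_ += vertexCount;
        return LockRequest{offset, LockMode::NoOverwrite};
    }

private:
    std::size_t capacity_;
    std::size_t next_;
};
```

Whether this beats manually swapping two or three buffers is exactly the open question above; the sketch says nothing about performance either way.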

It might be that GM:S uses different approaches for different kinds of sprite behaviors. Static sprites can all be in a single static VBO, while dynamic sprites can use another method more suited to them (such as the one described by L. Spiro).

You might want to alter your test scenario such that the sprites are not static, but constantly in motion, to account for this kind of specific optimization that might be done by GM:S.

o3o

GM:S might also just be using instanced rendering with a minimal VS which reads the position and size of the quad from the vertex buffer and transforms it; this can be stored as a single vec4.
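
Purely as a guess at what such a vertex shader would compute (nobody here knows what GM:S actually does), the per-vertex work can be emulated on the CPU like this. Vec2, Vec4 and expandInstance are made-up names, and the packing of (x, y, width, height) into one vec4 is an assumption for the example.

```cpp
#include <cassert>

struct Vec2 { float x, y; };
struct Vec4 { float x, y, z, w; };

// Emulates a minimal instancing vertex shader on the CPU.
// instance   = (x, y, width, height) of the quad in pixels, one per sprite.
// unitCorner = corner of a shared unit quad, each component in [0, 1].
Vec2 expandInstance(const Vec4& instance, const Vec2& unitCorner) {
    assert(unitCorner.x >= 0.0f && unitCorner.x <= 1.0f);
    assert(unitCorner.y >= 0.0f && unitCorner.y <= 1.0f);
    // Scale the unit corner by the quad size, then offset by the quad position.
    return Vec2{instance.x + unitCorner.x * instance.z,
                instance.y + unitCorner.y * instance.w};
}
```

The point of the approach is that only one vec4 per sprite changes each frame, instead of four full vertices.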

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

All these suggestions are excellent and I certainly don't mean this as disagreeing with them, but did you try using a much smaller sprite of just a few pixels?

Your numbers really don't fit overhead as the cause, at just 1000 sprites. If you had 100,000 then sure, but at 1000.. doubtful. What graphics card are you running on?

If your 256x256 sprites are on screen and never getting culled, then that's 65 million drawn pixels per frame, which equals overdrawing fullscreen 1920x1080 about 32 times. It seems reasonable that would run at about 35 FPS, depending on the graphics card of course.

(If your sprites don't have translucency, GM could very well draw them front to back and let the depth test cut down the number of drawn pixels, or use other similar techniques to improve performance. At 450 FPS x 65M pixels they are approaching the theoretical limit of NVIDIA's $1000 graphics cards, which is rarely reached in practice, so it seems doubtful that there's not some optimization going on.)
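
The arithmetic behind those figures is easy to reproduce. The helper below is only a back-of-the-envelope sketch (FillEstimate and estimateFill are made-up names) and it assumes every sprite pixel is actually drawn, with no culling or depth rejection. For 1000 sprites of 256x256 on a 1920x1080 screen it gives 65,536,000 pixels per frame, roughly 31.6 fullscreen layers, and about 29.5 billion pixels per second at 450 FPS.

```cpp
#include <cassert>
#include <cstdint>

// Result of a naive fill-rate estimate.
struct FillEstimate {
    std::uint64_t pixelsPerFrame; // total sprite pixels drawn each frame
    double fullscreenLayers;      // equivalent number of full-screen overdraws
    double pixelsPerSecond;       // pixelsPerFrame * fps
};

// Assumes every pixel of every sprite is drawn (no culling, no depth rejection).
FillEstimate estimateFill(std::uint64_t spriteCount,
                          std::uint64_t spriteWidth, std::uint64_t spriteHeight,
                          std::uint64_t screenWidth, std::uint64_t screenHeight,
                          double fps) {
    assert(screenWidth > 0 && screenHeight > 0);
    FillEstimate e;
    e.pixelsPerFrame = spriteCount * spriteWidth * spriteHeight;
    e.fullscreenLayers = static_cast<double>(e.pixelsPerFrame) /
                         static_cast<double>(screenWidth * screenHeight);
    e.pixelsPerSecond = static_cast<double>(e.pixelsPerFrame) * fps;
    return e;
}
```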
