# Good Performance test?


## Recommended Posts

So I have finally got a batcher going using a vertex and index buffer.

But the real question in my head is... how good is it? Is there a good method of testing this?

It only renders untextured quads that are 32 x 32 (no rotation, etc.).

My buffers can hold exactly 10000 quads:

for(int i = 0; i < 10000; i++)
{
batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}


If my dev specs are:

Core 2 Duo 2.66 GHz

4.0 GB RAM

and I have a delta time (dt) of ~0.036 to 0.037 seconds per frame, how am I doing?

##### Share on other sites

One valid comparison would be with (1) an unbatched version of the same number of quads, and (2) the same number of quads in a static buffer.  Your result will be somewhere between the two, and you'll want to be nearer (much nearer) to (2) than you are to (1).

##### Share on other sites

One valid comparison would be with (1) an unbatched version of the same number of quads, and (2) the same number of quads in a static buffer.  Your result will be somewhere between the two, and you'll want to be nearer (much nearer) to (2) than you are to (1).

I'm a little confused about comparing it to an unbatched set of quads. Is it not already unbatched, since the buffer can hold the exact number of quads I'm attempting to draw?

Also, I think I failed to give enough detail. This is my render method. My quads are drawn at a random position every frame:

void render()
{
systemX11.deviceContext->ClearRenderTargetView(systemX11.backBufferRenderTarget, D3DXCOLOR(0.0f, 0.2f, 0.4f, 1.0f));
batcher->beginBatch();

for(int i = 0; i < 10000; i++)
{
batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}

batcher->endBatch();

systemX11.swapChain->Present(0,0);
}


##### Share on other sites

I'm not sure if this is what you're asking but what I do when testing performance is something like this:

// gets called every frame
void heartbeat()
{
performance_counter.start();
function_i_want_to_test();
elapsed+=performance_counter.stop();
frame_count++;

// keep this out of the counter
d3d_device->Present(NULL);
}

// average time to complete function
average=elapsed/frame_count;


If you aren't sure what a performance counter is just duckduckgo "Win32 Performance Counter".

I hope that helps.
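
For readers who want something concrete, here is a minimal sketch of the accumulate-and-average idea above. It uses `std::chrono::steady_clock` instead of the Win32 performance counter so it runs anywhere; the class name `SectionTimer` and its methods are hypothetical, not part of any code posted in this thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: accumulates the time spent in one section of the
// frame and reports the average per measurement.
class SectionTimer {
public:
    void start() {
        begin_ = std::chrono::steady_clock::now();
        running_ = true;
    }

    // Stops the current measurement and adds it to the running total.
    void stop() {
        assert(running_ && "stop() called without a matching start()");
        elapsed_ += std::chrono::steady_clock::now() - begin_;
        ++samples_;
        running_ = false;
    }

    std::uint64_t samples() const { return samples_; }

    // Average seconds per measured section; 0 if nothing was measured yet.
    double averageSeconds() const {
        if (samples_ == 0) return 0.0;
        return std::chrono::duration<double>(elapsed_).count() /
               static_cast<double>(samples_);
    }

private:
    std::chrono::steady_clock::time_point begin_{};
    std::chrono::steady_clock::duration elapsed_{};
    std::uint64_t samples_ = 0;
    bool running_ = false;
};
```

Call `start()` just before the code under test and `stop()` right after it, keeping `Present` outside the pair as in the snippet above, then read `averageSeconds()` after a few hundred frames. Keep in mind that this measures CPU time spent submitting work, not how long the GPU takes to execute it.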

##### Share on other sites

I'm not sure if this is what you're asking but what I do when testing performance is something like this:

I think we are doing the same thing in a slightly different way.

One valid comparison would be with (1) an unbatched version of the same number of quads, and (2) the same number of quads in a static buffer.  Your result will be somewhere between the two, and you'll want to be nearer (much nearer) to (2) than you are to (1).

So I went back and created a static buffer, and either I have coded my batcher very wrong or I am doing this test incorrectly (I'm hoping for the latter).

This is how I create the static vertex and index buffers:

D3D11_BUFFER_DESC vertexBufferDesc;
//Create the static buffer and fill it
ZeroMemory(&vertexBufferDesc, sizeof(D3D11_BUFFER_DESC));
vertexBufferDesc.Usage = D3D11_USAGE_DEFAULT;
vertexBufferDesc.CPUAccessFlags = 0;
vertexBufferDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
vertexBufferDesc.ByteWidth = maxVertices * sizeof(Vertex);
vertexBufferDesc.StructureByteStride = 0;
vertexBufferDesc.MiscFlags = 0;

static Vertex vertices[4 * 10000];
int position = 0;
for(std::vector<Quad>::iterator i = drawData.begin(); i != drawData.end(); i++)
{
memcpy(vertices + position, (*i).vertices, sizeof((*i).vertices));
position += 4;
}

D3D11_SUBRESOURCE_DATA resourceData;
ZeroMemory( &resourceData, sizeof( resourceData ) );
resourceData.pSysMem = vertices;

/* Rest of the create Code etc */

D3D11_BUFFER_DESC indexBufferDesc;
ZeroMemory(&indexBufferDesc, sizeof(D3D11_BUFFER_DESC));
indexBufferDesc.Usage = D3D11_USAGE_DEFAULT;
indexBufferDesc.CPUAccessFlags = 0;
indexBufferDesc.BindFlags = D3D11_BIND_INDEX_BUFFER;
indexBufferDesc.ByteWidth = maxIndices * sizeof(USHORT);
indexBufferDesc.StructureByteStride = 0;
indexBufferDesc.MiscFlags = 0;

static USHORT ind[6 * 10000]; //Named ind to match the uses below
int indexPosition = 0;
int vertPosition = 0;
for(std::vector<Quad>::iterator i = drawData.begin(); i != drawData.end(); i++)
{
ind[indexPosition] = vertPosition ;
ind[indexPosition + 1] = vertPosition + 1;
ind[indexPosition + 2] = vertPosition + 2;
ind[indexPosition + 3] = vertPosition + 3;
ind[indexPosition + 4] = vertPosition;
ind[indexPosition + 5] = vertPosition + 2;
indexPosition += 6;
vertPosition += 4;
}

D3D11_SUBRESOURCE_DATA resourceData2;
ZeroMemory( &resourceData2, sizeof( resourceData2 ) );
resourceData2.pSysMem = ind;

/* Other create code */


I load the quad data like this before any rendering is done:

srand(time(NULL));
for(int i = 0; i < 10000; i++)
{
batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}


Then I use my static render method:

void render()
{
//Calculate DT using the Query Performance Counter
mainClock.tick();
systemX11.deviceContext->ClearRenderTargetView(systemX11.backBufferRenderTarget, D3DXCOLOR(0.0f, 0.2f, 0.4f, 1.0f));

systemX11.deviceContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
systemX11.deviceContext->DrawIndexed(60000, 0, 0); //Draw 60k indices because of the 10000 quads

systemX11.swapChain->Present(0,0);

std::cout<<"DT: "<<mainClock.getDeltaTime()<<std::endl;
}


My DT is ~0.0029 to 0.0051 seconds.

Compare that to a DT of ~0.035 to 0.040 seconds when using dynamic vertex and index buffers, where my render method looks like this:

void render()
{
//Calculate DT using the Query Performance Counter
mainClock.tick();
systemX11.deviceContext->ClearRenderTargetView(systemX11.backBufferRenderTarget, D3DXCOLOR(0.0f, 0.2f, 0.4f, 1.0f));

for(int i = 0; i < 10000; i++)
{
batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}

//End the batch summary
//1. Lock the Vertex and Index buffers using a NO OVERWRITE
//2. Check to see if we are full; if so unlock, call DrawIndexed, and change the lock flag to DISCARD
//3. If we were full relock the buffer with DISCARD, reset positions, and change the flag back to NO OVERWRITE
//4. Place the vertex and index data into the mapped resource
//5. If we have no more Quads to draw; finish the method by drawing anything we have not drawn using DrawIndexed
//6. Calculate index offsets and etc
batcher->endBatch();

systemX11.swapChain->Present(0,0);
std::cout<<"DT: "<<mainClock.getDeltaTime()<<std::endl;
}


Even though it's only about a 0.03-second difference, I still feel that it's really bad, assuming I did this correctly. Then again, I'm not sure how you compare the two times when one version reloads 10000 quads every frame and the other loads 10000 quads once at startup and then just makes the draw call.

Edited by noodleBowl

##### Share on other sites

Have you tried precaching your random values in a big table?

Here it looks to me that you're testing rand() performance more than/as much as your batcher's.

##### Share on other sites

Have you tried precaching your random values in a big table?

Here it looks to me that you're testing rand() performance more than/as much as your batcher's.

I don't think this is the case.

If I place this in my startup method:

for(int i = 0, j = 0; i < 10000; i++, j+=2)
{
spots[j] = (rand() % 700) + 10;
spots[j+1] = (rand() % 500) + 10;
}


and then change the render method to:

void render()
{
mainClock.tick();
systemX11.deviceContext->ClearRenderTargetView(systemX11.backBufferRenderTarget, D3DXCOLOR(0.0f, 0.2f, 0.4f, 1.0f));

for(int i = 0, j = 0; i < 10000; i++, j+=2)
{
batcher->draw(spots[j], spots[j+1], 32.0f, 32.0f);
}

batcher->endBatch();

systemX11.swapChain->Present(0,0);
std::cout<<"DT: "<<mainClock.getDeltaTime()<<std::endl;
}


I get roughly the same time. DT is ~0.034 to 0.036 seconds.

I'm not sure what needs to be done. Is there just something conceptually wrong that I am doing?

void SystemBatcher::endBatch()
{
//Lock the buffers; starts off with NO OVERWRITE
batchContext->Map(vertexBuffer, 0, mappingFlag, 0, &mapVertexResource);
batchContext->Map(indexBuffer, 0, mappingFlag, 0, &mapIndexResource);

for(i = drawData.begin(); i != drawData.end(); i++)
{
//Check if the buffer is full
if(vertexBufferSize - drawDataAmount == 0)
{
//IF so, unlock
batchContext->Unmap(indexBuffer, 0);
batchContext->Unmap(vertexBuffer, 0);

//Draw if things need to be drawn
if(indexDrawCount > 0)
{
batchContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
batchContext->DrawIndexed(indexDrawCount, indexOffset, 0);
}

//Change the lock flag to discard; Reset the positions
vertexBufferPosition = 0;
indexBufferPosition = 0;
drawDataAmount = 0;
indexDrawCount = 0;
indexOffset = 0;

//Relock the buffers; change the flag back to NO OVERWRITE
batchContext->Map(vertexBuffer, 0, mappingFlag, 0, &mapVertexResource);
batchContext->Map(indexBuffer, 0, mappingFlag, 0, &mapIndexResource);
mappingFlag = D3D11_MAP_WRITE_NO_OVERWRITE;
}

//Place in the vertex data
((Vertex*)mapVertexResource.pData)[vertexBufferPosition] = (*i).vertices[0];
((Vertex*)mapVertexResource.pData)[vertexBufferPosition + 1] = (*i).vertices[1];
((Vertex*)mapVertexResource.pData)[vertexBufferPosition + 2] = (*i).vertices[2];
((Vertex*)mapVertexResource.pData)[vertexBufferPosition + 3] = (*i).vertices[3];

//Place in the index data
((USHORT*)mapIndexResource.pData)[indexBufferPosition] = vertexBufferPosition;
((USHORT*)mapIndexResource.pData)[indexBufferPosition + 1] = vertexBufferPosition + 1;
((USHORT*)mapIndexResource.pData)[indexBufferPosition + 2] = vertexBufferPosition + 2;
((USHORT*)mapIndexResource.pData)[indexBufferPosition + 3] = vertexBufferPosition + 3;
((USHORT*)mapIndexResource.pData)[indexBufferPosition + 4] = vertexBufferPosition;
((USHORT*)mapIndexResource.pData)[indexBufferPosition + 5] = vertexBufferPosition + 2;

//Increment counts
vertexBufferPosition += 4;
indexBufferPosition += 6;
indexDrawCount += 6;
}

//Second Unlock of buffers;
batchContext->Unmap(indexBuffer, 0);
batchContext->Unmap(vertexBuffer, 0);

//Draw anything left that has not been drawn
batchContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
batchContext->DrawIndexed(indexDrawCount, indexOffset, 0);

//Increment needed counts
indexOffset += indexDrawCount;

//Reset the draw count
indexDrawCount = 0;

drawData.clear();
}

Edited by noodleBowl

##### Share on other sites

Is there just something conceptually wrong that I am doing?

For one, you are potentially updating a vertex buffer/index buffer while it is still in use by the GPU (if the buffer is full).
Use double-buffering and growable vertex buffers. Do work on buffer A, present, next frame work on buffer B, present, back to A. If you run out of room in the vertex buffer, just grow it. It will only be slow for a little bit at the start, if even that (perhaps growing once is enough).

Secondly, avoid all that memory-copying.
First you generate data, copy it into a vector, then map the vertex buffers and copy from the vector into there.
The routine should be:

batcher->beginBatch(); // Calls Map() internally.
for(int i = 0; i < 10000; i++)
{
// Places a quad directly into the mapped pointer.
batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}
batcher->endBatch(); // Unmaps.
batcher->draw(); // Draws.

Thirdly, if you use double-buffering and start from the beginning of the buffer on each frame, you no longer need to update your index buffer at all, as each time it would be writing the same values into itself.

L. Spiro
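
As a rough illustration of the third point, the index data for a quad batcher depends only on the quad count, so it can be generated once and uploaded into a static index buffer. The sketch below is a CPU-side helper only; `buildQuadIndices` is a hypothetical name, and the 0,1,2 / 3,0,2 pattern is taken from the code posted earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the index data for `quadCount` quads, four vertices per quad,
// using the same 0,1,2 / 3,0,2 pattern as the code earlier in the thread.
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount) {
    // 16-bit indices can address at most 65536 vertices, i.e. 16384 quads.
    assert(quadCount <= 65536 / 4);

    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const std::uint16_t base = static_cast<std::uint16_t>(q * 4);
        const std::uint16_t pattern[6] = {0, 1, 2, 3, 0, 2};
        for (std::uint16_t offset : pattern) {
            indices.push_back(static_cast<std::uint16_t>(base + offset));
        }
    }
    return indices;
}
```

The resulting vector could be passed as `pSysMem` when creating the index buffer, in the same way as the static buffer test earlier in the thread. After that, only the vertex buffer needs to be written each frame.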

##### Share on other sites

I have so many questions.

For one, you are potentially updating a vertex buffer/index buffer while it is still in use by the GPU (if the buffer is full).

Can you explain this a little more? I don't understand how I could be updating the buffer while the GPU is still using it.

Does the DISCARD flag not give me a new location in memory to use or am I thinking of this wrong?

Use double-buffering and growable vertex buffers. Do work on buffer A, present, next frame work on buffer B, present, back to A. If you run out of room in the vertex buffer, just grow it. It will only be slow for a little bit at the start, if even that (perhaps growing once is enough).

What exactly do you mean by a growable vertex buffer? Also does this mean you are saying to DISCARD on every frame?

Also I only half understand the A then B buffer update. How does this exactly save bandwidth, how does it work?

It seems like I would stall on the GPU somewhere.

Secondly, avoid all that memory-copying.
First you generate data, copy it into a vector, then map the vertex buffers and copy from the vector into there.
The routine should be:

batcher->beginBatch(); // Calls Map() internally.
for(int i = 0; i < 10000; i++)
{
// Places a quad directly into the mapped pointer.
batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}
batcher->endBatch(); // Unmaps.
batcher->draw(); // Draws.
Thirdly, if you use double-buffering and start from the beginning of the buffer on each frame, you no longer need to update your index buffer at all, as each time it would be writing the same values into itself.

This part definitely needs to happen. It is my main bottleneck for sure.

But what I am a little confused about is: are you saying to only use the DISCARD flag? Does this not defeat the purpose of using a dynamic buffer and reusing buffer slots with NO OVERWRITE? I also assume I do not need to check whether I hit my MaxBufferSize, because this is where the growable buffer would come in, right?

##### Share on other sites

Does the DISCARD flag not give me a new location in memory to use or am I thinking of this wrong?

Even if so, that means an extra allocation and free by the driver, which is a costly operation.

What exactly do you mean by a growable vertex buffer? Also does this mean you are saying to DISCARD on every frame?

If you run out of room, release the buffer and allocate a new one of the correct size.
Basically like std::vector. A little slow at the start but once it gets big enough it won’t be resized anymore and performance will be optimal.
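
A minimal sketch of that growth policy, assuming capacity is counted in quads and doubles whenever it is too small. The doubling factor and the name `grownCapacity` are assumptions for illustration; the post only says to reallocate at a size that fits.

```cpp
#include <cassert>
#include <cstddef>

// Returns the capacity (in quads) the buffer should have to hold `needed`
// quads. The current capacity is kept if it is large enough; otherwise it
// is doubled until it fits, mirroring the amortized growth of std::vector.
std::size_t grownCapacity(std::size_t current, std::size_t needed) {
    assert(current > 0);
    std::size_t capacity = current;
    while (capacity < needed) {
        capacity *= 2;
    }
    return capacity;
}
```

When the returned value differs from the current capacity, the batcher would release the old buffer and create a new one of that size before mapping it. After the first few frames the capacity should settle and no further reallocation occurs.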

Also I only half understand the A then B buffer update. How does this exactly save bandwidth, how does it work?
It seems like I would stall on the GPU somewhere

It avoids stalling by ensuring you are only updating resource B while resource A is being used by the GPU and vice versa.
It isn’t meant to save bandwidth. When you submit render commands, they are queued until the driver detects a flush is needed. Until that time it is costly to update a vertex buffer that has been submitted for rendering, so don’t. Update the other one.
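
The A/B alternation can be sketched without any Direct3D calls. In the example below, `FakeBuffer` is a plain `std::vector` standing in for a GPU vertex buffer, and `DoubleBuffer` is a hypothetical helper; both are assumptions made so that the sketch runs anywhere.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a GPU vertex buffer; a plain vector so the sketch runs anywhere.
using FakeBuffer = std::vector<float>;

class DoubleBuffer {
public:
    // Buffer the CPU may write this frame.
    FakeBuffer& writable() { return buffers_[frame_ % 2]; }

    // Buffer submitted last frame, which the GPU may still be reading.
    const FakeBuffer& inFlight() const { return buffers_[(frame_ + 1) % 2]; }

    // Call once per frame, after Present.
    void endFrame() { ++frame_; }

private:
    std::array<FakeBuffer, 2> buffers_;
    std::size_t frame_ = 0;
};
```

Each frame the batcher would fill `writable()`, issue its draw calls, present, and then call `endFrame()`, so the buffer written on the next frame is never the one submitted most recently.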

But what I am a little confused about is: are you saying to only use the DISCARD flag? Does this not defeat the purpose of using a dynamic buffer and reusing buffer slots with NO OVERWRITE?

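You still need the buffers to be D3D11_USAGE_DYNAMIC, and you update whichever buffer is not being used by the GPU. Note that dynamic resources cannot be mapped with plain D3D11_MAP_WRITE; Map() accepts only D3D11_MAP_WRITE_DISCARD or D3D11_MAP_WRITE_NO_OVERWRITE for them.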

L. Spiro
