I'm not sure if this is what you're asking but what I do when testing performance is something like this:
I think we are doing the same thing in a slightly different way.
One valid comparison would be with (1) an unbatched version of the same number of quads, and (2) the same number of quads in a static buffer. Your result will be somewhere between the two, and you'll want to be nearer (much nearer) to (2) than you are to (1).
So I went back and created a static buffer, and either I have coded my batcher very wrong or I am doing this test incorrectly (I'm hoping for the latter).
When I create a static index and vertex buffer:
D3D11_BUFFER_DESC vertexBufferDesc;
//Create the static buffer and fill it
ZeroMemory(&vertexBufferDesc, sizeof(D3D11_BUFFER_DESC));
vertexBufferDesc.Usage = D3D11_USAGE_DEFAULT;
vertexBufferDesc.CPUAccessFlags = 0;
vertexBufferDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
vertexBufferDesc.ByteWidth = maxVertices * sizeof(Vertex);
vertexBufferDesc.StructureByteStride = 0;
vertexBufferDesc.MiscFlags = 0;
static Vertex vertices[4 * 10000];
int position = 0;
for(std::vector<Quad>::iterator i = drawData.begin(); i != drawData.end(); i++)
{
memcpy(vertices + position, (*i).vertices, sizeof((*i).vertices));
position += 4;
}
D3D11_SUBRESOURCE_DATA resourceData;
ZeroMemory( &resourceData, sizeof( resourceData ) );
resourceData.pSysMem = vertices;
/* Rest of the create Code etc */
D3D11_BUFFER_DESC indexBufferDesc;
ZeroMemory(&indexBufferDesc, sizeof(D3D11_BUFFER_DESC));
indexBufferDesc.Usage = D3D11_USAGE_DEFAULT;
indexBufferDesc.CPUAccessFlags = 0;
indexBufferDesc.BindFlags = D3D11_BIND_INDEX_BUFFER;
indexBufferDesc.ByteWidth = maxIndices * sizeof(USHORT);
indexBufferDesc.StructureByteStride = 0;
indexBufferDesc.MiscFlags = 0;
static USHORT ind[6 * 10000]; //Named ind to match the loop and pSysMem below
int indexPosition = 0;
int vertPosition = 0;
for(std::vector<Quad>::iterator i = drawData.begin(); i != drawData.end(); i++)
{
ind[indexPosition] = vertPosition ;
ind[indexPosition + 1] = vertPosition + 1;
ind[indexPosition + 2] = vertPosition + 2;
ind[indexPosition + 3] = vertPosition + 3;
ind[indexPosition + 4] = vertPosition;
ind[indexPosition + 5] = vertPosition + 2;
indexPosition += 6;
vertPosition += 4;
}
D3D11_SUBRESOURCE_DATA resourceData2;
ZeroMemory( &resourceData2, sizeof( resourceData2 ) );
resourceData2.pSysMem = ind;
/* Other create code */
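Since the index pattern for a quad never changes, the loop above could be pulled into a small helper that fills the index list once. This is only a sketch: buildQuadIndices is a made-up name that is not part of my batcher, and it assumes the same 0,1,2 / 3,0,2 pattern used above with 16-bit indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the original batcher): builds the index
// list for quadCount quads, 4 vertices and 6 indices per quad, using the
// same 0,1,2 / 3,0,2 pattern as the loop above.
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount)
{
    // 16-bit indices can address at most 65536 vertices, i.e. 16384 quads.
    assert(quadCount * 4 <= 65536);

    static const std::uint16_t pattern[6] = { 0, 1, 2, 3, 0, 2 };

    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q)
    {
        const std::size_t base = q * 4; // first vertex of this quad
        for (std::uint16_t offset : pattern)
        {
            indices.push_back(static_cast<std::uint16_t>(base + offset));
        }
    }
    return indices;
}
```

The result could then be handed to the create call through pSysMem using indices.data(), with ByteWidth set to indices.size() * sizeof(std::uint16_t).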
And load it like this before any rendering is done:
srand(time(NULL));
for(int i = 0; i < 10000; i++)
{
batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}
Then use my static render method:
void render()
{
//Calculate DT using the Query Performance Counter
mainClock.tick();
systemX11.deviceContext->ClearRenderTargetView(systemX11.backBufferRenderTarget, D3DXCOLOR(0.0f, 0.2f, 0.4f, 1.0f));
//Draw the quads loaded into the buffer
systemX11.deviceContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
systemX11.deviceContext->DrawIndexed(60000, 0, 0); //Draw 60k indices because of the 10000 quads
systemX11.swapChain->Present(0,0);
std::cout<<"DT: "<<mainClock.getDeltaTime()<<std::endl;
}
My DT is ~0.0029 - 0.0051,
compared to a DT of ~0.035 - 0.040 when using dynamic vertex and index buffers, where my render method looks like this:
void render()
{
//Calculate DT using the Query Performance Counter
mainClock.tick();
systemX11.deviceContext->ClearRenderTargetView(systemX11.backBufferRenderTarget, D3DXCOLOR(0.0f, 0.2f, 0.4f, 1.0f));
//Place all the quads into the Quad Vector
for(int i = 0; i < 10000; i++)
{
//Places a quad into the quad vector
batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}
//End the batch summary
//1. Lock the Vertex and Index buffers using a NO OVERWRITE
//2. Check to see if we are full; if so unlock, call DrawIndexed, and change the lock flag to DISCARD
//3. If we were full relock the buffer with DISCARD, reset positions, and change the flag back to NO OVERWRITE
//4. Place the vertex and index data into the mapped resource
//5. If we have no more Quads to draw; finish the method by drawing anything we have not drawn using DrawIndexed
//6. Calculate index offsets and etc
//7. Clear the Quad Vector
batcher->endBatch();
systemX11.swapChain->Present(0,0);
std::cout<<"DT: "<<mainClock.getDeltaTime()<<std::endl;
}
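To make the endBatch steps in the comments above easier to check, here is a rough CPU-only model of the NO OVERWRITE / DISCARD logic. It is a sketch under assumptions, not my actual batcher: BatchModel, MapMode and submit are made-up names, no Direct3D calls are made, and it issues one draw per mapped chunk purely so the draw calls and discards can be counted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Stand-ins for D3D11_MAP_WRITE_NO_OVERWRITE and D3D11_MAP_WRITE_DISCARD.
enum class MapMode { NoOverwrite, Discard };

// Hypothetical model of a dynamic buffer that holds `capacity` quads.
struct BatchModel
{
    std::size_t capacity;      // quads that fit in the buffer
    std::size_t writePos = 0;  // next free quad slot
    int drawCalls = 0;         // DrawIndexed calls that would be issued
    int discards = 0;          // times the buffer was remapped with DISCARD
    MapMode lastMode = MapMode::NoOverwrite;

    explicit BatchModel(std::size_t cap) : capacity(cap) { assert(cap > 0); }

    // Models one frame's worth of quads going through the batch.
    void submit(std::size_t quadCount)
    {
        std::size_t remaining = quadCount;
        while (remaining > 0)
        {
            lastMode = MapMode::NoOverwrite;
            if (writePos == capacity)
            {
                // Buffer is full: wrap to the start and map with DISCARD.
                lastMode = MapMode::Discard;
                writePos = 0;
                ++discards;
            }
            // The real code would Map, copy n quads, Unmap, then draw them.
            const std::size_t n = std::min(remaining, capacity - writePos);
            writePos += n;
            remaining -= n;
            ++drawCalls;
        }
    }
};
```

With a capacity of 4096 quads, submitting 10000 quads in this model gives 3 draw calls and 2 discards, which is a quick sanity check to compare against what the real batcher does per frame.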
Even though it's only about a .0100 difference, I still feel that it's really bad, assuming I did this correctly. Then again, I'm not sure how you compare times when in one case you reload 10000 quads every frame and in the other you load 10000 quads once at start and then just make the draw call.
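One thing that might make the numbers easier to compare is averaging DT over many frames instead of printing it every frame, since the std::cout call (and std::endl, which flushes) sits inside the measured frame in both tests. A minimal sketch, assuming DT is available in seconds from something like mainClock.getDeltaTime(); FrameStats is a made-up helper, not something I have in my code:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: accumulates per-frame delta times (in seconds) so an
// average can be reported every N frames rather than once per frame.
class FrameStats
{
public:
    void add(double dtSeconds)
    {
        assert(dtSeconds >= 0.0);
        total_ += dtSeconds;
        ++frames_;
    }

    std::size_t frames() const { return frames_; }

    // Average frame time in seconds; 0 if no frames were recorded.
    double averageSeconds() const
    {
        return frames_ ? total_ / static_cast<double>(frames_) : 0.0;
    }

    // Same average expressed in milliseconds.
    double averageMs() const { return averageSeconds() * 1000.0; }

    void reset()
    {
        total_ = 0.0;
        frames_ = 0;
    }

private:
    double total_ = 0.0;
    std::size_t frames_ = 0;
};
```

The idea would be to call add() once per frame in both render methods, print averageMs() every few hundred frames, and then reset(). Keeping the rand() calls out of the timed section, or using the same pre-generated positions in both tests, should also make the dynamic and static numbers more comparable.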