Good Performance test?

Started by noodleBowl December 09, 2013 12:21 AM

13 comments, last by noodleBowl 10 years, 4 months ago

noodleBowl

718

Author

December 09, 2013 12:21 AM

So I have finally got a batcher going using a Vertex and Index buffer

But the real question in my head is... how good is it? Is there a good method of testing this?

It is only rendering untextured quads that are 32 x 32 (no rotation, etc.)

And my buffers can hold exactly 10000 Quads


	for(int i = 0; i < 10000; i++)
	{
		batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
	}

If my dev specs are:

Core 2 Duo 2.66 GHz

4.0 GB RAM

ATI Mobility Radeon HD 4670

With a delta time (dt) of ~0.036 to 0.037 seconds per frame, how am I doing?
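To put a delta time like this in context, it can help to convert it into frames per second and quads per second. The sketch below does only that arithmetic; the helper names are made up for illustration and are not part of the batcher in this thread.

```cpp
#include <cstdio>

// Hypothetical helpers (not from the batcher above): convert a frame's
// delta time in seconds into more familiar throughput figures.
double framesPerSecond(double dtSeconds)
{
    return 1.0 / dtSeconds;
}

double quadsPerSecond(int quadsPerFrame, double dtSeconds)
{
    return quadsPerFrame / dtSeconds;
}

// Prints both figures for one measurement.
void report(int quadsPerFrame, double dtSeconds)
{
    std::printf("dt %.4f s -> %.1f fps, %.0f quads/s\n",
                dtSeconds,
                framesPerSecond(dtSeconds),
                quadsPerSecond(quadsPerFrame, dtSeconds));
}
```

Calling `report(10000, 0.036)` with the figures from the post works out to a little under 28 frames per second, or roughly 278,000 quads per second.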

21st Century Moose

13,459

December 09, 2013 12:31 AM

One valid comparison would be with (1) an unbatched version of the same number of quads, and (2) the same number of quads in a static buffer. Your result will be somewhere between the two, and you'll want to be nearer (much nearer) to (2) than you are to (1).
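One way to make "nearer to (2) than to (1)" concrete is to normalize the batched time against the two reference times. This is only a sketch of that arithmetic: the function name is hypothetical, and it assumes the unbatched and static times differ.

```cpp
// Hypothetical helper: where does the batched time sit between the two
// reference measurements? Returns 0.0 when it matches the static buffer
// and 1.0 when it is as slow as the unbatched version.
// Assumes unbatchedDt != staticDt.
double positionBetween(double staticDt, double unbatchedDt, double batchedDt)
{
    return (batchedDt - staticDt) / (unbatchedDt - staticDt);
}
```

A result close to 0 would suggest the batcher adds little overhead on top of a static buffer; a result close to 1 would suggest it is barely better than not batching at all.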

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

noodleBowl

718

Author

December 09, 2013 12:51 AM

One valid comparison would be with (1) an unbatched version of the same number of quads, and (2) the same number of quads in a static buffer. Your result will be somewhere between the two, and you'll want to be nearer (much nearer) to (2) than you are to (1).

I'm a little confused about comparing it to an unbatched version of the quads. Is it not already unbatched, since the buffer can hold the exact number of quads I'm attempting to draw?

Also, I think I failed to give enough detail. This is my render method; my quads are drawn at a random position every frame.


void render()
{
	systemX11.deviceContext->ClearRenderTargetView(systemX11.backBufferRenderTarget, D3DXCOLOR(0.0f, 0.2f, 0.4f, 1.0f));
	batcher->beginBatch();

	for(int i = 0; i < 10000; i++)
	{
		batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
	}

	batcher->endBatch();

	systemX11.swapChain->Present(0,0);
}

Endemoniada

431

December 09, 2013 01:30 AM

I'm not sure if this is what you're asking but what I do when testing performance is something like this:


// gets called every frame
void heartbeat()
{
 performance_counter.start();
 function_i_want_to_test();
 elapsed+=performance_counter.stop();
 frame_count++;
 
 // keep this out of the counter
 d3d_device->Present(NULL);
}
 
// average time to complete function
average=elapsed/frame_count;

If you aren't sure what a performance counter is just duckduckgo "Win32 Performance Counter".

I hope that helps.
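For reference, the same start/stop/average idea can be sketched without the Win32 API by using `std::chrono::steady_clock`. The class and method names below are illustrative only and are not the poster's actual counter.

```cpp
#include <chrono>

// Portable stand-in for the Win32 performance counter, built on
// std::chrono::steady_clock. Names are illustrative.
class SectionTimer
{
    using Clock = std::chrono::steady_clock;

public:
    // Marks the beginning of the section being measured.
    void start() { begin_ = Clock::now(); }

    // Ends the current measurement and adds it to the running total.
    void stop()
    {
        elapsed_ += std::chrono::duration<double>(Clock::now() - begin_).count();
        ++samples_;
    }

    int samples() const { return samples_; }
    double totalSeconds() const { return elapsed_; }

    // Average seconds per measured section; 0 if nothing was measured yet.
    double averageSeconds() const
    {
        return samples_ > 0 ? elapsed_ / samples_ : 0.0;
    }

private:
    Clock::time_point begin_{};
    double elapsed_ = 0.0;
    int samples_ = 0;
};
```

Note that a CPU-side timer like this measures how long the calls take to return. Since Direct3D queues work for the GPU, that is not necessarily the time the GPU spends executing it.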

noodleBowl

718

Author

December 09, 2013 03:02 AM

I'm not sure if this is what you're asking but what I do when testing performance is something like this:

I think we are doing the same thing in a slightly different way

One valid comparison would be with (1) an unbatched version of the same number of quads, and (2) the same number of quads in a static buffer. Your result will be somewhere between the two, and you'll want to be nearer (much nearer) to (2) than you are to (1).

So I went back and created a static buffer and either I have coded my batcher very wrong or I am doing this test incorrectly ( <- I'm hoping for this)

When I create a static index and vertex buffer


	D3D11_BUFFER_DESC vertexBufferDesc;
        //Create the static buffer and fill it
	ZeroMemory(&vertexBufferDesc, sizeof(D3D11_BUFFER_DESC));
	vertexBufferDesc.Usage = D3D11_USAGE_DEFAULT;
	vertexBufferDesc.CPUAccessFlags = 0;
	vertexBufferDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
	vertexBufferDesc.ByteWidth = maxVertices * sizeof(Vertex);
	vertexBufferDesc.StructureByteStride = 0;
	vertexBufferDesc.MiscFlags = 0;

	static Vertex vertices[4 * 10000];
	 int position = 0;
	 for(std::vector<Quad>::iterator i = drawData.begin(); i != drawData.end(); i++)
	{
		memcpy(vertices + position, (*i).vertices, sizeof((*i).vertices));
		position += 4;
	 }

	D3D11_SUBRESOURCE_DATA resourceData;
	ZeroMemory( &resourceData, sizeof( resourceData ) );
	resourceData.pSysMem = vertices;

        /* Rest of the create Code etc */

	D3D11_BUFFER_DESC indexBufferDesc;
	ZeroMemory(&indexBufferDesc, sizeof(D3D11_BUFFER_DESC));
	indexBufferDesc.Usage = D3D11_USAGE_DEFAULT;
	indexBufferDesc.CPUAccessFlags = 0;
	indexBufferDesc.BindFlags = D3D11_BIND_INDEX_BUFFER;
	indexBufferDesc.ByteWidth = maxIndices * sizeof(USHORT);
	indexBufferDesc.StructureByteStride = 0;
	indexBufferDesc.MiscFlags = 0;

	static USHORT ind[6 * 10000];
	int indexPosition = 0;
        int vertPosition = 0;
	for(std::vector<Quad>::iterator i = drawData.begin(); i != drawData.end(); i++)
	{
		ind[indexPosition] = vertPosition ;
		ind[indexPosition + 1] = vertPosition + 1;
		ind[indexPosition + 2] = vertPosition + 2;
		ind[indexPosition + 3] = vertPosition + 3;
		ind[indexPosition + 4] = vertPosition;
		ind[indexPosition + 5] = vertPosition + 2;
		indexPosition += 6;
		vertPosition += 4;
	}

	D3D11_SUBRESOURCE_DATA resourceData2;
	ZeroMemory( &resourceData2, sizeof( resourceData2 ) );
	resourceData2.pSysMem = ind;
       
        /* Other create code */

And load it like this before any rendering is done


srand(time(NULL));
for(int i = 0; i < 10000; i++)
{
	batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}

Then use my static render method


void render()
{
        //Calculate DT using the Query Performance Counter
	mainClock.tick();
	systemX11.deviceContext->ClearRenderTargetView(systemX11.backBufferRenderTarget, D3DXCOLOR(0.0f, 0.2f, 0.4f, 1.0f));

        //Draw the quads loaded into the buffer
	systemX11.deviceContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
	systemX11.deviceContext->DrawIndexed(60000, 0, 0); //Draw 60k indices because of the 10000 quads

	systemX11.swapChain->Present(0,0);

	std::cout<<"DT: "<<mainClock.getDeltaTime()<<std::endl;
}

My DT time is ~0.0029 - 0.0051

Compared to the DT time of ~0.035 - 0.040 when using Dynamic vertex and Index buffers where my render method looks like


void render()
{
        //Calculate DT using the Query Performance Counter
	mainClock.tick();
	systemX11.deviceContext->ClearRenderTargetView(systemX11.backBufferRenderTarget, D3DXCOLOR(0.0f, 0.2f, 0.4f, 1.0f));

        //Place all the quads into the Quad Vector
	for(int i = 0; i < 10000; i++)
	{
	     //Places a quad into the quad vector
	     batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
	}

        //End the batch summary
        //1. Lock the Vertex and Index buffers using a NO OVERWRITE
        //2. Check to see if we are full; if so unlock, call DrawIndexed, and change the lock flag to DISCARD
        //3. If we were full relock the buffer with DISCARD, reset positions, and change the flag back to NO OVERWRITE
        //4. Place the vertex and index data into the mapResouce
        //5. If we have no more Quads to draw; finish the method by drawing anything we have not drawn using DrawIndex
        //6. Calculate index offsets and etc
        //7. Clear the Quad Vector
	batcher->endBatch();

	systemX11.swapChain->Present(0,0);
	std::cout<<"DT: "<<mainClock.getDeltaTime()<<std::endl;
}

Even though it's only about a 0.03 second difference, I still feel that it's really bad, assuming I did this correctly. Then again, I'm not sure how you compare times when one version reloads 10000 quads every frame and the other loads 10000 quads once at startup and then just makes the draw call.

sunaiac

112

December 09, 2013 02:45 PM

Have you tried precaching your random values in a big table?

Here it looks to me that you're testing rand() performance more than/as much as your batcher's.

noodleBowl

718

Author

December 10, 2013 12:57 AM

Have you tried precaching your random values in a big table?

Here it looks to me that you're testing rand() performance more than/as much as your batcher's.

I don't think this is the case

If I place this in my start up method


	for(int i = 0, j = 0; i < 10000; i++, j+=2)
	{
		spots[j] = (rand() % 700) + 10;
		spots[j+1] = (rand() % 500) + 10;
	}

Then change the render to


void render()
{
	mainClock.tick();
	systemX11.deviceContext->ClearRenderTargetView(systemX11.backBufferRenderTarget, D3DXCOLOR(0.0f, 0.2f, 0.4f, 1.0f));

	for(int i = 0, j = 0; i < 10000; i++, j+=2)
	{
		batcher->draw(spots[j], spots[j+1], 32.0f, 32.0f);
	}

	batcher->endBatch();

	systemX11.swapChain->Present(0,0);
	std::cout<<"DT: "<<mainClock.getDeltaTime()<<std::endl;
}

I get roughly around the same time. DT is ~0.034 - 0.036

I'm not sure what needs to be done. Is there just something conceptually wrong that I am doing?


void SystemBatcher::endBatch()
{
	//Lock the buffers; starts off with NO OVERWRITE
	batchContext->Map(vertexBuffer, 0, mappingFlag, 0, &mapVertexResource);
	batchContext->Map(indexBuffer, 0, mappingFlag, 0, &mapIndexResource);

	//For every Quad in the Quad Vector [this case 10K]
	for(i = drawData.begin(); i != drawData.end(); i++)
	{
		//Check if the buffer is full
		if(vertexBufferSize - drawDataAmount == 0)
		{
			//IF so, unlock
			batchContext->Unmap(indexBuffer, 0);
			batchContext->Unmap(vertexBuffer, 0);
			
			//Draw if things need to be drawn
			if(indexDrawCount > 0)
			{
				batchContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
				batchContext->DrawIndexed(indexDrawCount, indexOffset, 0);
			}
			
			//Change the lock flag to discard; Reset the positions
			mappingFlag = D3D11_MAP_WRITE_DISCARD;
			vertexBufferPosition = 0;
			indexBufferPosition = 0;
			drawDataAmount = 0;
			indexDrawCount = 0;
			indexOffset = 0;

			//Relock the buffers; change the flag back to NO OVERWRITE
			batchContext->Map(vertexBuffer, 0, mappingFlag, 0, &mapVertexResource);
			batchContext->Map(indexBuffer, 0, mappingFlag, 0, &mapIndexResource);
			mappingFlag = D3D11_MAP_WRITE_NO_OVERWRITE;
		}

		//Place in the vertex data
		((Vertex*)mapVertexResource.pData)[vertexBufferPosition] = (*i).vertices[0];
		((Vertex*)mapVertexResource.pData)[vertexBufferPosition + 1] = (*i).vertices[1];
		((Vertex*)mapVertexResource.pData)[vertexBufferPosition + 2] = (*i).vertices[2];
		((Vertex*)mapVertexResource.pData)[vertexBufferPosition + 3] = (*i).vertices[3];

		//Place in the index data
		((USHORT*)mapIndexResource.pData)[indexBufferPosition] = vertexBufferPosition;
		((USHORT*)mapIndexResource.pData)[indexBufferPosition + 1] = vertexBufferPosition + 1;
		((USHORT*)mapIndexResource.pData)[indexBufferPosition + 2] = vertexBufferPosition + 2;
		((USHORT*)mapIndexResource.pData)[indexBufferPosition + 3] = vertexBufferPosition + 3;
		((USHORT*)mapIndexResource.pData)[indexBufferPosition + 4] = vertexBufferPosition;
		((USHORT*)mapIndexResource.pData)[indexBufferPosition + 5] = vertexBufferPosition + 2;

		//Increment counts
		vertexBufferPosition += 4;
		indexBufferPosition += 6;
		drawDataAmount += dataPerQuad;
		indexDrawCount += 6;
	}

	//Second Unlock of buffers; 
	batchContext->Unmap(indexBuffer, 0);
	batchContext->Unmap(vertexBuffer, 0);

	//Draw anything left that has not been drawn
	batchContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
	batchContext->DrawIndexed(indexDrawCount, indexOffset, 0);

	//Increment needed counts
	indexOffset += indexDrawCount;
	
	//Reset the draw count
	indexDrawCount = 0;

	//Clear out the Quad Vector
	drawData.clear();
}
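The append-then-wrap bookkeeping behind the NO OVERWRITE / DISCARD scheme can be isolated from Direct3D and checked on its own. The following is a CPU-only sketch under that assumption: `BufferCursor`, `Reservation`, and `MapMode` are hypothetical names, and nothing here calls `Map()` or `Unmap()`.

```cpp
#include <cstddef>

// Which map mode a write should use. These mirror the idea behind
// D3D11_MAP_WRITE_NO_OVERWRITE and D3D11_MAP_WRITE_DISCARD, but this
// sketch is CPU-only and does not call Direct3D.
enum class MapMode { NoOverwrite, Discard };

struct Reservation
{
    std::size_t offset; // first element the caller may write
    MapMode mode;       // map mode to use for this write
};

// Hypothetical cursor over a dynamic buffer holding `capacity` elements.
class BufferCursor
{
public:
    explicit BufferCursor(std::size_t capacity) : capacity_(capacity) {}

    // Reserves `count` elements (count must not exceed the capacity).
    // Appends after the last write while there is room; otherwise wraps
    // to the start of the buffer and asks for a discard.
    Reservation reserve(std::size_t count)
    {
        MapMode mode = MapMode::NoOverwrite;
        if (position_ + count > capacity_)
        {
            position_ = 0;
            mode = MapMode::Discard;
        }
        Reservation r{position_, mode};
        position_ += count;
        return r;
    }

private:
    std::size_t capacity_;
    std::size_t position_ = 0;
};
```

With the capacity check separated out like this, the real batcher would only need to map with the returned mode, write at the returned offset, and draw from that offset.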

L. Spiro

25,818

December 10, 2013 03:32 AM

Is there just something conceptually wrong that I am doing?

For one, you are potentially updating a vertex buffer/index buffer while it is still in use by the GPU (if the buffer is full).
Use double-buffering and growable vertex buffers. Do work on buffer A, present, next frame work on buffer B, present, back to A. If you run out of room in the vertex buffer, just grow it. It will only be slow for a little bit at the start, if even that (perhaps growing once is enough).

Secondly, avoid all that memory-copying.
First you generate data, copy it into a vector, then map the vertex buffers and copy from the vector into there.
The routine should be:

batcher->beginBatch(); // Calls Map() internally.
for(int i = 0; i < 10000; i++)
{
     // Places a quad directly into the mapped pointer.
     batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}
batcher->endBatch(); // Unmaps.
batcher->draw(); // Draws.
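As a rough illustration of "places a quad directly into the mapped pointer", the sketch below writes the four corners of each quad straight into a destination pointer with no intermediate vector. It is an assumption-laden stand-in: `QuadWriter` is a hypothetical name, the `Vertex` layout is simplified (the real one is not shown in this thread), the corner order is a guess, and the pointer here is ordinary memory rather than the `pData` returned by `Map()`.

```cpp
#include <cstddef>

// Simplified vertex; the real layout is not shown in the thread.
struct Vertex { float x, y; };

// Hypothetical writer that places quads straight into mapped memory.
// `mapped` stands in for the pData pointer returned by Map().
class QuadWriter
{
public:
    QuadWriter(Vertex* mapped, std::size_t maxQuads)
        : dst_(mapped), maxQuads_(maxQuads) {}

    // Writes the four corners of one quad; returns false when full.
    // Corner order (assumed): top-left, top-right, bottom-right, bottom-left.
    bool draw(float x, float y, float w, float h)
    {
        if (count_ == maxQuads_) return false;
        Vertex* v = dst_ + count_ * 4;
        v[0] = {x,     y    };
        v[1] = {x + w, y    };
        v[2] = {x + w, y + h};
        v[3] = {x,     y + h};
        ++count_;
        return true;
    }

    std::size_t quadCount() const { return count_; }

private:
    Vertex* dst_;
    std::size_t maxQuads_;
    std::size_t count_ = 0;
};
```

The writer only ever writes forward through the destination and never reads it back, which is the usual advice for memory obtained from a write-only map.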

Thirdly, if you use double-buffering and start from the beginning of the buffer on each frame, you no longer need to update your index buffer at all, as each time it would be writing the same values into itself.
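Since the index values depend only on the quad number, they can be generated once up front. A minimal sketch, assuming the same per-quad index pattern as the code earlier in the thread; `buildQuadIndices` is a hypothetical name, and the result would be uploaded once as the initial data of an index buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the index data for `quadCount` quads once, using the same
// per-quad pattern as the code earlier in the thread:
// (v, v+1, v+2) and (v+3, v, v+2), where v is the quad's first vertex.
// With 16-bit indices, at most 16384 quads (65536 vertices) fit.
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount)
{
    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q)
    {
        const std::size_t v = q * 4; // first vertex of this quad
        const std::size_t pattern[6] = {v, v + 1, v + 2, v + 3, v, v + 2};
        for (std::size_t index : pattern)
            indices.push_back(static_cast<std::uint16_t>(index));
    }
    return indices;
}
```

For the 10000 quads discussed here this yields 60000 indices with a largest value of 39999, which still fits in a 16-bit index.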

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

noodleBowl

718

Author

December 10, 2013 04:43 AM

I have so many questions

For one, you are potentially updating a vertex buffer/index buffer while it is still in use by the GPU (if the buffer is full).

Can you explain this a little more? I don't understand how I can potentially be updating while the GPU is still in use.

Does the DISCARD flag not give me a new location in memory to use or am I thinking of this wrong?

Use double-buffering and growable vertex buffers. Do work on buffer A, present, next frame work on buffer B, present, back to A. If you run out of room in the vertex buffer, just grow it. It will only be slow for a little bit at the start, if even that (perhaps growing once is enough).

What exactly do you mean by a growable vertex buffer? Also does this mean you are saying to DISCARD on every frame?

Also I only half understand the A then B buffer update. How does this exactly save bandwidth, how does it work?

It seems like I would stall on the GPU somewhere

Secondly, avoid all that memory-copying.
First you generate data, copy it into a vector, then map the vertex buffers and copy from the vector into there.
The routine should be:
batcher->beginBatch(); // Calls Map() internally.
for(int i = 0; i < 10000; i++)
{
     // Places a quad directly into the mapped pointer.
     batcher->draw((rand()%700) + 10, (rand()%500) + 10, 32.0f, 32.0f);
}
batcher->endBatch(); // Unmaps.
batcher->draw(); // Draws.
Thirdly, if you use double-buffering and start from the beginning of the buffer on each frame, you no longer need to update your index buffer at all, as each time it would be writing the same values into itself.

This part definitely needs to happen, this is my main bottleneck for sure.

But what I am a little confused about is this: are you saying to only use the DISCARD flag? Does this not defeat the purpose of using a dynamic buffer and reusing buffer slots with NO OVERWRITE? I also assume I do not need to check whether I hit my MaxBufferSize, because this is where the growable buffer would come in, right?

L. Spiro

25,818

December 10, 2013 08:59 PM

Does the DISCARD flag not give me a new location in memory to use or am I thinking of this wrong?

Even if so, that means an extra allocation and free by the driver, which is a costly operation.

What exactly do you mean by a growable vertex buffer? Also does this mean you are saying to DISCARD on every frame?

If you run out of room, release the buffer and allocate a new one of the correct size.
Basically like std::vector. A little slow at the start but once it gets big enough it won’t be resized anymore and performance will be optimal.
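The sizing decision for such a growable buffer might look like the sketch below. Doubling is one common growth policy and is an assumption here, as is the function name; releasing the old buffer and creating the new one at the returned size is left out.

```cpp
#include <cstddef>

// Hypothetical growth policy for a vertex buffer: keep doubling the
// current capacity until the requested element count fits. Returns the
// current capacity unchanged when it is already large enough.
std::size_t grownCapacity(std::size_t current, std::size_t required)
{
    std::size_t capacity = current > 0 ? current : 1;
    while (capacity < required)
        capacity *= 2;
    return capacity;
}
```

Because the capacity grows geometrically, a buffer that starts too small only gets reallocated a handful of times before it settles at a size that fits every frame.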

Also I only half understand the A then B buffer update. How does this exactly save bandwidth, how does it work?
It seems like I would stall on the GPU somewhere

It avoids stalling by ensuring you are only updating resource B while resource A is being used by the GPU and vice versa.
It isn’t meant to save bandwidth. When you submit render commands, they are queued until the driver detects a flush is needed. Until that time it is costly to update a vertex buffer that has been submitted for rendering, so don’t. Update the other one.
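The A/B alternation itself is just an index that flips once per frame. A CPU-only sketch, assuming a hypothetical `PingPong` wrapper where `Buffer` is a placeholder for whatever owns the real vertex buffer:

```cpp
#include <array>
#include <cstddef>

// CPU-only sketch of alternating between two buffers. `Buffer` is a
// placeholder for whatever wraps the real vertex buffer.
template <typename Buffer>
class PingPong
{
public:
    // Buffer the CPU should fill this frame.
    Buffer& current() { return buffers_[index_]; }

    // Buffer that was filled (and submitted) last frame.
    Buffer& previous() { return buffers_[1 - index_]; }

    // Call once per frame, after Present.
    void swap() { index_ = 1 - index_; }

    std::size_t currentIndex() const { return index_; }

private:
    std::array<Buffer, 2> buffers_{};
    std::size_t index_ = 0;
};
```

Each frame the batcher would fill `current()`, draw from it, present, and then call `swap()`, so the next frame's writes go to the buffer the GPU is not reading from.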

But what I am a little confused about is, are you saying to only use the DISCARD flag. Does this not defeat the purpose of using a dynamic buffer and reusing buffer slots with NO OVERWRITE?

You still need the buffers to be D3D11_USAGE_DYNAMIC and update with D3D11_MAP_WRITE on the buffer that is not being used by the GPU.

L. Spiro

