
DX11 Trying to find bottlenecks in my renderer


I just finished up my 1st iteration of my sprite renderer and I'm sort of questioning its performance.

Currently, I am trying to render 10K 64x64 textured sprites in an 800x600 window. These sprites all use the same texture, vertex shader, and pixel shader, so there are basically no state changes. The sprite renderer itself is dynamic, using D3D11_MAP_WRITE_NO_OVERWRITE and then D3D11_MAP_WRITE_DISCARD when the vertex buffer is full. The buffer is large enough to hold all 10K sprites and draw them in a single draw call. Cutting the buffer size down to only being able to fit 1000 sprites before a draw call is executed does not seem to matter / improve performance. When I clock the time it takes to complete the render method for my sprite renderer (the only renderer that is running) I'm getting about 40ms. Aside from trying to adjust the size of the vertex buffer, I have tried using a 1x1 texture and making the window smaller (640x480) as a quick and dirty check to see if the GPU was the bottleneck, but I still get 40ms with both of those cases.

I'm kind of at a loss. What are some of the ways that I could figure out where my bottleneck is?
I feel like only being able to render 10K sprites is really low, but I'm not sure whether I coded a poor renderer with a bottleneck somewhere or whether I'm being limited by my hardware.

Just some other info:

Dev PC specs:

GPU: Intel HD Graphics 4600 / Nvidia GTX 850M (Nvidia is set to be the preferred GPU in the Nvidia control panel. Vsync is set to off)
CPU: Intel Core i7-4710HQ @ 2.5GHz

Renderer:

//The renderer has a working depth buffer

//Sprites have matrices that are precomputed. These pretransformed vertices are placed into the buffer
Matrix4 model = sprite->getModelMatrix();
verts[0].position = model * verts[0].position;
verts[1].position = model * verts[1].position;
verts[2].position = model * verts[2].position;
verts[3].position = model * verts[3].position;
verts[4].position = model * verts[4].position;
verts[5].position = model * verts[5].position;

//Vertex buffer is flagged for dynamic use
vertexBuffer = BufferModule::createVertexBuffer(D3D11_USAGE_DYNAMIC, D3D11_CPU_ACCESS_WRITE, sizeof(SpriteVertex) * MAX_VERTEX_COUNT_FOR_BUFFER);

//The vertex buffer is mapped to when adding a sprite to the buffer
//vertexBufferMapType could be D3D11_MAP_WRITE_NO_OVERWRITE or D3D11_MAP_WRITE_DISCARD depending on the data already in the vertex buffer
D3D11_MAPPED_SUBRESOURCE resource = vertexBuffer->map(vertexBufferMapType); 
memcpy(((SpriteVertex*)resource.pData) + vertexCountInBuffer, verts, BYTES_PER_SPRITE);
vertexBuffer->unmap();

//The constant buffer used for the MVP matrix is updated once per draw call
D3D11_MAPPED_SUBRESOURCE resource = mvpConstBuffer->map(D3D11_MAP_WRITE_DISCARD);
memcpy(resource.pData, projectionMatrix.getData(), sizeof(Matrix4));
mvpConstBuffer->unmap();

Vertex / Pixel Shader:

cbuffer mvpBuffer : register(b0)
{
	matrix mvp;
}

struct VertexInput
{
	float4 position : POSITION;
	float2 texCoords : TEXCOORD0;
	float4 color : COLOR;
};

struct PixelInput
{
	float4 position : SV_POSITION;
	float2 texCoords : TEXCOORD0;
	float4 color : COLOR;
};

PixelInput VSMain(VertexInput input)
{
	input.position.w = 1.0f;

	PixelInput output;
	output.position = mul(mvp, input.position);
	output.texCoords = input.texCoords;
	output.color = input.color;

	return output;
}

Texture2D shaderTexture;
SamplerState samplerType;
float4 PSMain(PixelInput input) : SV_TARGET
{
	float4 textureColor = shaderTexture.Sample(samplerType, input.texCoords);
	
	return textureColor;
}

 

If any more info is needed feel free to ask. I would really like to know how I can improve this, assuming I'm not hardware limited

2 hours ago, noodleBowl said:

Nvidia is set to be the preferred GPU in the Nvidia control panel.

Add this to some non-empty .cpp file (if I put them in an empty .cpp file, it seems to be ignored) to automatically choose the dedicated instead of the integrated GPU:

extern "C" {
    __declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;
}
extern "C" {
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}

This will avoid changing the Nvidia control panel for all your different builds.

Edited by matt77hias


What happens if you don't update sprite vertices per frame? (I assume uploading that much data is the bottleneck; you may consider uploading the transforms instead, which would be 4 values per sprite instead of 6 * 4.)

Edit: Additionally you probably should use double buffering or a ring buffer to allow some frames of latency for the GPU, if you don't already.

I tried something similar with Vulkan and Fury GPU:

Render 2 million textured boxes, vertex.w = integer index to pick the proper 4*4 matrix from a regular buffer (not uniform as usual) -> 80 fps.

I do not remember if this number was with or without per-frame upload, probably without, but the upload was definitely the bottleneck, especially because I did not use double buffering IIRC.
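
To make the "upload transforms instead of vertices" idea concrete, a rough sketch (the struct layout, buffer names and the map() wrapper are just illustrative, mirroring the wrapper style used elsewhere in this thread - not code I have actually run):

//Per-sprite instance data: 4 floats instead of 6 full vertices
struct SpriteInstance
{
    float x, y;      //position
    float rotation;  //radians
    float scale;     //uniform scale
};

//instances: a std::vector<SpriteInstance> filled during the frame
//Once per frame: a single map/copy of 10K * 16 bytes
D3D11_MAPPED_SUBRESOURCE resource = instanceBuffer->map(D3D11_MAP_WRITE_DISCARD);
memcpy(resource.pData, instances.data(), instances.size() * sizeof(SpriteInstance));
instanceBuffer->unmap();

//One shared quad (4 vertices + 6 indices), expanded once per instance
context->DrawIndexedInstanced(6, UINT(instances.size()), 0, 0, 0);

The vertex shader would then rebuild the corner positions from the per-instance data (bound through a second vertex buffer slot marked D3D11_INPUT_PER_INSTANCE_DATA), so the CPU never touches per-vertex data at all.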

 

Edited by JoeJ

6 hours ago, noodleBowl said:

input.position.w = 1.0f;

Just use a float3 Position instead of float4 Position, you will get the w coordinate of 1.0f for free. Furthermore, it does not make sense to use a float4 and to immediately overwrite the w coordinate. Just use an explicit float3 to inform the compiler.

6 hours ago, noodleBowl said:

Matrix4 model = sprite->getModelMatrix();
verts[0].position = model * verts[0].position;
verts[1].position = model * verts[1].position;
verts[2].position = model * verts[2].position;
verts[3].position = model * verts[3].position;
verts[4].position = model * verts[4].position;
verts[5].position = model * verts[5].position;

A sprite is basically a quad consisting of two triangles. You can reuse the position of the shared vertices. This will reduce the number of matrix multiplications by 1/3.
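
For example (a sketch only - device/context are the raw ID3D11 interfaces, indexBuffer, MAX_SPRITES_PER_BUFFER and spriteCount are made up, and the index pattern assumes a 4-corner order of (-w,-h), (w,h), (w,-h), (-w,h)), an immutable index buffer built once at init lets you upload 4 vertices per sprite instead of 6:

//6 indices per sprite referencing 4 unique vertices, built once
std::vector<unsigned short> indices(MAX_SPRITES_PER_BUFFER * 6);
for (unsigned int i = 0; i < MAX_SPRITES_PER_BUFFER; ++i)
{
    unsigned short base = (unsigned short)(i * 4);
    unsigned short* tri = &indices[i * 6];
    tri[0] = base + 0; tri[1] = base + 1; tri[2] = base + 2;  //first triangle
    tri[3] = base + 0; tri[4] = base + 3; tri[5] = base + 1;  //second triangle
}

D3D11_BUFFER_DESC ibDesc = {};
ibDesc.Usage = D3D11_USAGE_IMMUTABLE;
ibDesc.ByteWidth = UINT(indices.size() * sizeof(unsigned short));
ibDesc.BindFlags = D3D11_BIND_INDEX_BUFFER;
D3D11_SUBRESOURCE_DATA ibData = { indices.data(), 0, 0 };
device->CreateBuffer(&ibDesc, &ibData, &indexBuffer);

//At draw time (16-bit indices are fine up to ~16K sprites per buffer)
context->IASetIndexBuffer(indexBuffer, DXGI_FORMAT_R16_UINT, 0);
context->DrawIndexed(spriteCount * 6, 0, 0);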

6 hours ago, noodleBowl said:

When I clock the time it takes to complete the render method for my sprite renderer (the only renderer that is running) I'm getting about 40ms. Aside from trying to adjust the size of the vertex buffer, I have tried using a 1x1 texture and making the window smaller (640x480) as a quick and dirty check to see if the GPU was the bottleneck, but I still get 40ms with both of those cases.

If you skip the draw, do you still have +- 40ms? If this is the case, skip the map/unmaps as well. If you still have 40ms, your CPU is definitely the culprit (and not the code that you are showing).

Edited by matt77hias

7 hours ago, Michael Aganier said:

Look at your CPU usage in the task manager. If the rendering thread is at 100%, then your renderer is the bottleneck. If your rendering thread is not at 100%, then your GPU is the bottleneck.

It's almost 2018. Update your Windows 10 and your Task Manager will be able to show GPU usage % and GPU memory usage.

13 hours ago, noodleBowl said:

When I clock the time it takes to complete the render method for my sprite renderer (the only renderer that is running) I'm getting about 40ms. 

Is that just the map, memcpy, unmap shown above? Or does it involve drawing / Present too? 

Add more detail to the timing - see if you can find which specific function is using most of that time. Also measure how long Present is taking. 
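
For example, a small scoped timer (a hypothetical helper, not taken from your code) lets you wrap each stage separately and see exactly where the 40ms goes:

#include <Windows.h>
#include <iostream>

//Prints how long the enclosing scope took when it is destroyed
struct ScopedCpuTimer
{
    const char* name;
    LARGE_INTEGER start, freq;

    explicit ScopedCpuTimer(const char* n) : name(n)
    {
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);
    }

    ~ScopedCpuTimer()
    {
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);
        double ms = double(end.QuadPart - start.QuadPart) * 1000.0 / double(freq.QuadPart);
        std::cout << name << ": " << ms << "ms\n";
    }
};

//Usage: one per stage
//{ ScopedCpuTimer t("map/memcpy/unmap"); /* fill vertex buffer */ }
//{ ScopedCpuTimer t("Draw");             /* draw call */ }
//{ ScopedCpuTimer t("Present");          /* Present */ }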


Thanks for all the responses! Tried to cover everything, let me know if I missed something

20 hours ago, Michael Aganier said:

Look at your CPU usage in the task manager. If the rendering thread is at 100%, then your renderer is the bottleneck. If your rendering thread is not at 100%, then your GPU is the bottleneck.

12 hours ago, Zaoshi Kaba said:

It's almost 2018. Update your Windows 10 and your Task Manager will be able to show GPU usage % and GPU memory usage.

Not sure how helpful this is, but looking at my task manager it says:

CPU: ~21% (Amount used by my application. Not total CPU usage)
GPU 0 [Intel HD Graphics]: ~11%
GPU 1 [NVidia GeForce GTX 850M]: ~18%

This is rendering 10K sprites with a 64x64 texture in an 800x600 window

 

14 hours ago, JoeJ said:

What happens if you don't update sprite vertices per frame? (I assume uploading that much data is the bottleneck; you may consider uploading the transforms instead, which would be 4 values per sprite instead of 6 * 4.)

So I don't think this is exactly what you mean, but speaking from a map/unmap standpoint, if I move things around and only map once per draw call my time goes down to 25ms. To do this I created an intermediate array that is the same size as my vertex buffer. Then I place my sprite data into this intermediate array, and when I need to draw I just do a memcpy straight into the vertex buffer.

//Created at Sprite Renderer init
vertices = new SpriteVertex[MAX_VERTEX_COUNT_FOR_BUFFER];

//Inside my function that flushes the buffer
resource = vertexBuffer->map(vertexBufferMapType);
memcpy(resource.pData, vertices, vertexCountInBuffer * sizeof(SpriteVertex));
vertexBuffer->unmap();

graphicsDevice->getDeviceContext()->Draw(vertexCountToDraw, vertexCountDrawnOffset);

 

14 hours ago, matt77hias said:

Just use a float3 Position instead of float4 Position, you will get the w coordinate of 1.0f for free. Furthermore, it does not make sense to use a float4 and to immediately overwrite the w coordinate. Just use an explicit float3 to inform the compiler.

Currently my SpriteVertex class is using a float3 for the position on the CPU side.

class SpriteVertex
{

public:
	SpriteVertex();
	SpriteVertex(Vector3 position, Vector2 texCoords, Color color);
	~SpriteVertex();
	Vector3 position;
	Vector2 texCoords;
	Color color;
};

On the shader side I have it as float4 because of the MVP matrix. Changing the position to float3 (shader side) just makes the window show red. I assume I'm super zoomed into the sprites or something. I removed the unneeded input.position.w = 1.0f though

14 hours ago, matt77hias said:

A sprite is basically a quad consisting of two triangles. You can reuse the position of the shared vertices. This will reduce the number of matrix multiplications by 1/3.

Currently I have no index buffer set up, so I will have to go back and try this out. I do believe this would help at least a little bit, because you are right, I would do fewer matrix calculations this way

14 hours ago, matt77hias said:

If you skip the draw, do you still have +- 40ms? If this is the case, skip the map/unmaps as well. If you still have 40ms, your CPU is definitely the culprit (and not the code that you are showing).

So if I comment out the Draw call I still have ~40ms. If I also take out the map/unmap calls I get around ~36ms. So there is a minor difference, but I'm starting to think my CPU is the issue.
 

8 hours ago, Hodgman said:

Is that just the map, memcpy, unmap shown above? Or does it involve drawing / Present too? 

Add more detail to the timing - see if you can find which specific function is using most of that time. Also measure how long Present is taking.

The 40ms time is just the cost of doing the render, so this is just the Draw and unmap/map calls. When I time this function I'm doing it like so:

void SpriteRenderer::render(double deltaTime)
{
	//Get the start time
	QueryPerformanceCounter(&startTime);

	renderStart(); //Setup/reset since other renderers may have run. Only this renderer is running
	sortRenderList(); //This is only done once. On the first frame. Only sorting by texture too
	
	Sprite* sprite = nullptr;
	for (std::vector<Sprite*>::iterator i = renderList.begin(); i != renderList.end(); ++i)
	{
		sprite = (*i);
		if (sprite->isVisible() == false)
			continue;

      		//Put the sprite into the buffer. This is where the map/unmap calls are
		addToVertexBuffer(sprite);
	}

    	//Draw the sprites that were placed in the buffer. Draw call is here
	flushVertexBuffer();
      
    	//Get the end time and calculate how long it took
	QueryPerformanceCounter(&endTime);
	Logger::info("RENDER TIME: " + std::to_string(((endTime.QuadPart - startTime.QuadPart) * 1000) / frq.QuadPart));

}

void SpriteRenderer::addToVertexBuffer(Sprite* sprite)
{
	Texture* spriteTexture = sprite->getTexture();
	if (spriteTexture != boundTexture)
	{
		flushVertexBuffer();
		bindTexture(spriteTexture);
	}

	if (vertexCountInBuffer == MAX_VERTEX_COUNT_FOR_BUFFER)
	{
		flushVertexBuffer();
		vertexCountInBuffer = 0;
		vertexCountDrawnOffset = 0;
		vertexBufferMapType = D3D11_MAP_WRITE_DISCARD;
	}

  	/* Code to setup the sprite. Vertex transform, flipping, applying texture clip rect, etc */
  
  	//Put the sprite in the buffer
	D3D11_MAPPED_SUBRESOURCE resource = vertexBuffer->map(vertexBufferMapType);
	memcpy(((SpriteVertex*)resource.pData) + vertexCountInBuffer, verts, BYTES_PER_SPRITE);
	vertexBuffer->unmap();

	vertexCountToDraw += VERTEX_PER_QUAD;
	vertexCountInBuffer += VERTEX_PER_QUAD;
	vertexBufferMapType = D3D11_MAP_WRITE_NO_OVERWRITE;
}

void SpriteRenderer::renderStart()
{
	graphicsDevice = GraphicsDeviceModule::getGraphicsDevice();
	graphicsDevice->getDeviceContext()->VSSetShader(defaultVertexShader->getShader(), 0, 0);
	graphicsDevice->getDeviceContext()->VSSetConstantBuffers(0, 1, mvpConstBuffer->getBuffer());
	graphicsDevice->getDeviceContext()->PSSetShader(defaultPixelShader->getShader(), 0, 0);
	graphicsDevice->getDeviceContext()->IASetInputLayout(inputLayout->getInputLayout());
	graphicsDevice->getDeviceContext()->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
	graphicsDevice->getDeviceContext()->IASetVertexBuffers(0, 1, vertexBuffer->getBuffer(), &STRIDE_PER_VERTEX, &VERTEX_BUFFER_OFFSET);

	boundTexture = nullptr;
}

Now the one thing I'm not sure about is that when I time like the above (using QueryPerformanceCounter), am I really timing my method calls or am I timing how long they take to return? This probably makes more sense with something like timing Present:

QueryPerformanceCounter(&startTime);
GraphicsDeviceModule::getGraphicsDevice()->present();
QueryPerformanceCounter(&endTime);
Logger::info("PRESENT TIME: " + std::to_string(((endTime.QuadPart - startTime.QuadPart) * 1000) / frq.QuadPart));

Did I just time how long it really takes to present everything to the screen, or did I just time how long it took to post the command to the GPU? I think I'm timing how long it takes to return, since my time comes back as 0ms.
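
(For reference, CPU-side timing around Present mostly measures how long it takes to hand the command to the driver. To see how long the GPU itself spends, D3D11 timestamp queries are the usual tool - a minimal sketch, assuming device/context are the usual ID3D11Device / ID3D11DeviceContext:)

//Created once
ID3D11Query* disjointQuery = nullptr;
ID3D11Query* startQuery = nullptr;
ID3D11Query* endQuery = nullptr;
D3D11_QUERY_DESC desc = {};
desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
device->CreateQuery(&desc, &disjointQuery);
desc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&desc, &startQuery);
device->CreateQuery(&desc, &endQuery);

//Around the work to measure
context->Begin(disjointQuery);
context->End(startQuery);   //timestamp before the draws
/* ... draw calls / Present ... */
context->End(endQuery);     //timestamp after the draws
context->End(disjointQuery);

//Read back later (ideally a frame or two later to avoid stalling the GPU)
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData;
UINT64 gpuStart = 0, gpuEnd = 0;
while (context->GetData(disjointQuery, &disjointData, sizeof(disjointData), 0) == S_FALSE) {}
while (context->GetData(startQuery, &gpuStart, sizeof(gpuStart), 0) == S_FALSE) {}
while (context->GetData(endQuery, &gpuEnd, sizeof(gpuEnd), 0) == S_FALSE) {}

if (!disjointData.Disjoint)
{
	double gpuMs = double(gpuEnd - gpuStart) * 1000.0 / double(disjointData.Frequency);
	Logger::info("GPU TIME: " + std::to_string(gpuMs));
}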

Edited by noodleBowl

1 hour ago, noodleBowl said:

Not sure how helpful this is, but looking at my task manager it says:


CPU: ~21% (Amount used by my application. Not total CPU usage)
GPU 0 [Intel HD Graphics]: ~11%
GPU 1 [NVidia GeForce GTX 850M]: ~18%

You have to look at individual cores, but if you have your total CPU at 21%, it means one of the cores might be running at 100% which inflates the average.

The GPU usage is not important because we are not measuring the performance of the GPU. We are measuring the performance of your renderer to prepare instructions on the CPU.

Knowing the usage of the rendering thread is important because if it is at 100%, it means that the GPU is waiting for more instructions because you're not sending them fast enough or in such a way that the GPU can parallelize them. If this is the case, you have a 100% confirmation that the problem is your renderer.

This is an answer to your first question:

23 hours ago, noodleBowl said:

What are some of the ways that I could figure out where my bottleneck is?

 

11 hours ago, noodleBowl said:

The 40ms time is just the cost of doing the render, so this is just the Draw and unmap/map calls. When I time this function I'm doing it like so:

<wall of code>

Am I understanding this correctly: your renderList contains 10,000 sprites and you Map() > memcpy() > Unmap() each one individually? No wonder you're having issues. It should be a single Map() > memcpy() > Unmap() call.

14 hours ago, noodleBowl said:

So I don't think this is exactly what you mean, but speaking from a map/unmap standpoint, if I move things around and only map once per draw call my time goes down to 25ms. To do this I created an intermediate array that is the same size as my vertex buffer. Then I place my sprite data into this intermediate array, and when I need to draw I just do a memcpy straight into the vertex buffer.

This is a much better way of doing things rather than mapping and unmapping 10,000 times. But 25ms still seems too much for what you're doing.

14 hours ago, noodleBowl said:

Currently my SpriteVertex class is using a float3 for the position on the CPU side.

Are you creating 10,000 * 6 sprite vertices per frame individually? Or up front? 60,000 dynamic memory allocations a frame might slow you down, especially on a laptop CPU.

What does flushvertexbuffer do exactly?

 

13 hours ago, Michael Aganier said:

Knowing the usage of the rendering thread is important because if it is at 100%, it means that the GPU is waiting for more instructions because you're not sending them fast enough or in such a way that the GPU can parallelize them. If this is the case, you have a 100% confirmation that the problem is your renderer.

So I'm not sure if I did this right, but I went back and looked at my CPU with the graph set to show logical processors

[Attached image: cpuScreen.PNG - Task Manager CPU usage per logical processor]

Based on this it looks like my CPU is crunching hard

12 hours ago, matt77hias said:

BTW if you use Visual Studio, you can use the built-in profiler. This will give you a rough idea of the methods taking most of the time. Furthermore, it does not use your timer, so you can rule out the issues you think you have with your timer.

Didn't know this was a thing. This is awesome. I feel like this confirms it's my CPU. I'm not sure if I can set it to look at one run of a function or not, but I set the "look at" frame as small as I can, which is about 44ms

[Attached image: perfRender.PNG - Visual Studio profiler report for SpriteRenderer::render]

[Attached image: perfAddVert.PNG - Visual Studio profiler report for addToVertexBuffer]

Looking at this, a huge chunk of time is spent in the addToVertexBuffer method, which makes sense since this method is run once per sprite. This method basically sets up the sprite; in here we are doing things like setting positions for the vertices, checking what tex coords to map, etc. Honestly I don't know if this is the right place for this. I feel like the code in here might be better suited to the actual Sprite class instead of having to redo it every frame for each sprite. If it were moved to the Sprite class itself this stuff could be "precomputed" and addToVertexBuffer would just be a data copy method

void SpriteRenderer::addToVertexBuffer(Sprite* sprite)
{
	Texture* spriteTexture = sprite->getTexture();
	if (spriteTexture != boundTexture)
	{
		flushVertexBuffer();
		bindTexture(spriteTexture);
	}

	if (vertexCountInBuffer == MAX_VERTEX_COUNT_FOR_BUFFER)
	{
		flushVertexBuffer();
		vertexCountInBuffer = 0;
		vertexCountDrawnOffset = 0;
		vertexBufferMapType = D3D11_MAP_WRITE_DISCARD;
	}

	float width;
	float height;
	float u = 0.0f;
	float v = 0.0f;
	float uWidth = 1.0f;
	float vHeight = 1.0f;
	float textureWidth = (float)spriteTexture->getWidth();
	float textureHeight = (float)spriteTexture->getHeight();
	SpriteVertex verts[6];

	Rect* rect = sprite->getTextureClippingRectangle();
	if (rect == nullptr)
	{
		width = textureWidth / 2.0f;
		height = textureHeight / 2.0f;
	}
	else
	{
		width = rect->width / 2.0f;
		height = rect->height / 2.0f;

		u = rect->x / textureWidth;
		v = rect->y / textureHeight;
		uWidth = (rect->x + rect->width) / textureWidth;
		vHeight = (rect->y + rect->height) / textureHeight;
	}
	verts[0].position.setXYZ(-width, -height, 0.0f);
	verts[1].position.setXYZ(width, height, 0.0f);
	verts[2].position.setXYZ(width, -height, 0.0f);
	verts[3].position.setXYZ(-width, -height, 0.0f);
	verts[4].position.setXYZ(-width, height, 0.0f);
	verts[5].position.setXYZ(width, height, 0.0f);

	if (sprite->isFlipped() == false)
	{
		verts[0].texCoords.setXY(u, vHeight);
		verts[1].texCoords.setXY(uWidth, v);
		verts[2].texCoords.setXY(uWidth, vHeight);
		verts[3].texCoords.setXY(u, vHeight);
		verts[4].texCoords.setXY(u, v);
		verts[5].texCoords.setXY(uWidth, v);
	}
	else
	{
		verts[0].texCoords.setXY(uWidth, vHeight);
		verts[1].texCoords.setXY(u, v);
		verts[2].texCoords.setXY(u, vHeight);
		verts[3].texCoords.setXY(uWidth, vHeight);
		verts[4].texCoords.setXY(uWidth, v);
		verts[5].texCoords.setXY(u, v);
	}

	verts[0].color.setRGB(0.0f, 0.0f, 0.0f);
	verts[1].color.setRGB(0.0f, 0.0f, 0.0f);
	verts[2].color.setRGB(0.0f, 0.0f, 0.0f);
	verts[3].color.setRGB(0.0f, 0.0f, 0.0f);
	verts[4].color.setRGB(0.0f, 0.0f, 0.0f);
	verts[5].color.setRGB(0.0f, 0.0f, 0.0f);

	//Pre transform the positions
	Matrix4 model = sprite->getModelMatrix();
	verts[0].position = model * verts[0].position;
	verts[1].position = model * verts[1].position;
	verts[2].position = model * verts[2].position;
	verts[3].position = model * verts[3].position;
	verts[4].position = model * verts[4].position;
	verts[5].position = model * verts[5].position;

	D3D11_MAPPED_SUBRESOURCE resource = vertexBuffer->map(vertexBufferMapType);
	memcpy(((SpriteVertex*)resource.pData) + vertexCountInBuffer, verts, BYTES_PER_SPRITE);
	vertexBuffer->unmap();
  
	vertexCountToDraw += VERTEX_PER_QUAD;
	vertexCountInBuffer += VERTEX_PER_QUAD;
	vertexBufferMapType = D3D11_MAP_WRITE_NO_OVERWRITE;
}

After this most of my time is spent doing the matrix multiplications. By just commenting out the code that does this:

//Pre transform the positions
Matrix4 model = sprite->getModelMatrix();
verts[0].position = model * verts[0].position;
verts[1].position = model * verts[1].position;
verts[2].position = model * verts[2].position;
verts[3].position = model * verts[3].position;
verts[4].position = model * verts[4].position;
verts[5].position = model * verts[5].position;

My SpriteRenderer::render method time drops down to ~30ms. It's not crazy great, but it's still a pretty solid drop, so using an index buffer to cut out 2 of those pre-transformations might help too. Also, sprite->getModelMatrix() under the hood is really doing translation * rotation * scale, so that's another set of matrix multiplications. I wonder if I should just recreate the matrix directly from the position, rotation, and scale vectors, as it might be less math in the end
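
As a rough sketch of that last idea (assuming the Sprite can expose its position, rotation and scale directly, which it currently may not since it stores matrices), the whole translation * rotation * scale chain for a 2D sprite collapses to one sin/cos per sprite plus a few multiply-adds per corner, instead of building and multiplying full 4x4 matrices:

//Hypothetical accessors - getRotation/getScale/getPosition don't exist in my code yet
float c = cosf(sprite->getRotation());
float s = sinf(sprite->getRotation());
float sx = sprite->getScale().x * width;   //width/height are the half extents from above
float sy = sprite->getScale().y * height;
float px = sprite->getPosition().x;
float py = sprite->getPosition().y;

//Corner (cx, cy) in {-1, +1} quad space -> scaled, rotated, translated
//4-corner layout for the index buffer version: 0=(-w,-h), 1=(w,h), 2=(w,-h), 3=(-w,h)
auto transformCorner = [&](float cx, float cy, SpriteVertex& out)
{
	float lx = cx * sx;
	float ly = cy * sy;
	out.position.setXYZ(px + lx * c - ly * s,
	                    py + lx * s + ly * c,
	                    0.0f);
};

transformCorner(-1.0f, -1.0f, verts[0]);
transformCorner( 1.0f,  1.0f, verts[1]);
transformCorner( 1.0f, -1.0f, verts[2]);
transformCorner(-1.0f,  1.0f, verts[3]);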

3 hours ago, Zaoshi Kaba said:

Am I understanding this correctly: your renderList contains 10,000 sprites and you Map() > memcpy() > Unmap() each one individually? No wonder you're having issues. It should be a single Map() > memcpy() > Unmap() call.

You are looking at that correctly and I 100% agree. I have actually done a quick test, where I place everything into a normal array first and then do a Map() > memcpy() from the normal array to the vertex buffer > Unmap() only when I am about to do a Draw call. By doing this it makes my SpriteRenderer::render method drop down to only taking ~25ms. Definitely a change that I need to make

51 minutes ago, Infinisearch said:

This is a much better way of doing things rather than mapping and unmapping 10,000 times. But 25ms still seems too much for what you're doing.

Are you creating 10,000 * 6 sprite vertices per frame individually? Or up front? 60,000 dynamic memory allocations a frame might slow you down, especially on a laptop CPU.

What does flushvertexbuffer do exactly?

Oh no, the sprites are all created when the application starts and added to the render list. Once that is done, that's it. Nothing is added or removed during the test.

flushVertexBuffer is where my Draw call takes place. It basically updates the constant buffer for the MVP matrix and then does the draw call, drawing as much data as I need (which in this case is all 10K sprites). Then any counters needed to determine where we are in the buffer, where to draw from, etc. are updated/reset.

 

Sorry for the post walls, just trying to cover everything :)

Edited by noodleBowl

14 hours ago, noodleBowl said:

flushVertexBuffer is where my Draw call takes place. It basically updates the constant buffer for the MVP matrix and then does the draw call, drawing as much data as I need (which in this case is all 10K sprites). Then any counters needed to determine where we are in the buffer, where to draw from, etc. are updated/reset.

Half of that sentence is very suspicious. Could you show code for your flushVertexBuffer method? Or even better - upload all code in .zip (if possible), it's somewhat hard to track things when it's all over the place.

I see that addToVertexBuffer(Sprite* sprite) also creates geometry for sprites every frame. Most of the time that's not the right way to do it, and it probably still uses the majority of your CPU.

On 12/8/2017 at 12:18 AM, noodleBowl said:

The 40ms time is just the cost of doing the render, so this is just the Draw and unmap/map calls. When I time this function I'm doing it like so:

That should just be the CPU side of the equation... but if you're getting 40 or even 25ms for that you should focus your efforts into the CPU side.

I'd like to see your flushvertexbuffer method as well.

Edited by Infinisearch


 So here is the code for the sprite renderer in full.

5 hours ago, Zaoshi Kaba said:

Half of that sentence is very suspicious. Could you show code for your flushVertexBuffer method?

2 hours ago, Infinisearch said:

That should just be the CPU side of the equation... but if you're getting 40 or even 25ms for that you should focus your efforts into the CPU side.

I'd like to see your flushvertexbuffer method as well.

Here is the code for my Sprite Renderer. Also here is a quick look at flushVertexBuffer:

void SpriteRenderer::flushVertexBuffer()
{
    if (vertexCountToDraw == 0)
        return;
 
    D3D11_MAPPED_SUBRESOURCE resource = mvpConstBuffer->map(D3D11_MAP_WRITE_DISCARD);
    memcpy(resource.pData, projectionMatrix.getData(), sizeof(Matrix4));
    mvpConstBuffer->unmap();
 
    //Draw the sprites that we need to
    graphicsDevice->getDeviceContext()->Draw(vertexCountToDraw, vertexCountDrawnOffset);
    vertexCountDrawnOffset += vertexCountToDraw;
    vertexCountToDraw = 0;
 
    ++drawCallCount;
 
}
5 hours ago, Zaoshi Kaba said:

I see that addToVertexBuffer(Sprite* sprite) also creates geometry for sprites every frame. Most of the time that's not the right way to do it, and it probably still uses the majority of your CPU.

I agree with this. Most of my CPU crunching comes from this method; based on the performance tests done with the profiler in Visual Studio, and manually confirming it, this seems to be the case.

I'm not really sure how dynamic sprites should be handled other than this way. I have thought about having the sprite data be precomputed in the Sprite class itself and then only updated when it's needed. Then my addToVertexBuffer(Sprite* sprite) method would just become a simple copy method to place the sprites that should display into the vertex buffer.
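
Something like this dirty-flag pattern is what I have in mind (the field and method names are just made up for the sketch):

class Sprite
{
public:
	void setPosition(const Vector3& p) { position = p; dirty = true; }
	void setRotation(float r)          { rotation = r; dirty = true; }

	const SpriteVertex* getVertices()
	{
		if (dirty)
		{
			rebuildVertices();  //the work currently done in addToVertexBuffer
			dirty = false;
		}
		return cachedVerts;
	}

private:
	//Positions, tex coords, color, pre-transform - only when something changed
	void rebuildVertices() { /* ... */ }

	SpriteVertex cachedVerts[4];  //4 with an index buffer, 6 without
	Vector3 position;
	float rotation = 0.0f;
	bool dirty = true;
};

//addToVertexBuffer then becomes roughly:
//memcpy(spriteVertexData + vertexCountInBuffer, sprite->getVertices(), 4 * sizeof(SpriteVertex));

Sprites that never move would then cost nothing per frame beyond the memcpy.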


Generating 10k sprites per frame should be feasible at fairly low performance cost. I've worked on games where we generated dozens of entire player models (10k verts each) per frame on the CPU and streamed them to the GPU.

On 09/12/2017 at 7:21 AM, noodleBowl said:

I have actually done a quick test, where I place everything into a normal array first and then do a Map() > memcpy() from the normal array to the vertex buffer > Unmap() only when I am about to do a Draw call. By doing this it makes my SpriteRenderer::render method drop down to only taking ~25ms.

25ms is still massive :o Do some more timing to measure each individual step.

Do you by chance have any of the iterator debug features of std::vector enabled? Iterating standard containers can be horribly slow when you do.


So I went back and did some refactoring. Changes:

  • Set up an index buffer that is flagged immutable and is filled at the renderer's init
  • Reduced the number of vertex calculations I have to do because of the index buffer usage
  • I moved my mvpConstBuffer map/unmap to only happen at the start of my SpriteRenderer::render method
  • I only map/unmap my vertex buffer once per Draw call instead of every time I add a sprite

Current render time for doing 10K 64x64 textured sprites over 2 Draw calls (I limited the vertex buffer to only hold 5K worth of sprite data) is ~22.5ms. So it's an improvement, but still not that great

23 hours ago, Hodgman said:

Do you by chance have any of the iterator debug features of std::vector enabled? Iterating standard containers can be horribly slow when you do.

Didn't even know this was a thing. According to the MSDN docs this is enabled by default for debug mode. For some reason I can't get it to shut off (assuming it's still on). I tried putting _ITERATOR_DEBUG_LEVEL 0 in the preprocessor settings of my Visual Studio project properties, but it seems to make no difference for the debug build. If I try putting it at the top of my main .CPP I get conflicting errors where it says _ITERATOR_DEBUG_LEVEL 0 does not match _ITERATOR_DEBUG_LEVEL 2 for ****.obj

Regardless, running in Debug Mode I get ~22.5ms for 10K sprites. Running in Release Mode I get ~1.5ms - ~4.3ms for 10K sprites. If I bump the sprite count up to 125K for Release Mode my render method time is ~19.8ms - ~24.75ms

Even when I look at the Release Mode numbers I still think that's pretty low...

23 hours ago, Hodgman said:

25ms is still massive  Do some more timing to measure each individual step.

I went back and started timing each step like you said, and this is what I'm looking at:

For 10K sprites. Debug mode

Render method: ~22.5ms
addSpriteToVertexBuffer: ~0.002053ms per sprite. Total time for the 10K: ~20.53ms
Setting the sprite data/vertices (Done inside addSpriteToVertexBuffer): ~0.000411ms per sprite. Total time for the 10K: ~4.11ms


//Set up the vertex data for the sprite
float textureWidth = (float)spriteTexture->getWidth();
float textureHeight = (float)spriteTexture->getHeight();
float width = textureWidth / 2.0f;
float height = textureHeight / 2.0f;
spriteVertexData[vertexCountInBuffer].position.setXYZ(-width, -height, 0.0f);
spriteVertexData[vertexCountInBuffer + 1].position.setXYZ(width, height, 0.0f);
spriteVertexData[vertexCountInBuffer + 2].position.setXYZ(width, -height, 0.0f);
spriteVertexData[vertexCountInBuffer + 3].position.setXYZ(-width, height, 0.0f);

float u = 0.0f;
float v = 0.0f;
float uWidth = 1.0f;
float vHeight = 1.0f;
spriteVertexData[vertexCountInBuffer].texCoords.setXY(u, vHeight);
spriteVertexData[vertexCountInBuffer + 1].texCoords.setXY(uWidth, v);
spriteVertexData[vertexCountInBuffer + 2].texCoords.setXY(uWidth, vHeight);
spriteVertexData[vertexCountInBuffer + 3].texCoords.setXY(u, v);

spriteVertexData[vertexCountInBuffer].color.setRGB(0.0f, 0.0f, 0.0f);
spriteVertexData[vertexCountInBuffer + 1].color.setRGB(0.0f, 0.0f, 0.0f);
spriteVertexData[vertexCountInBuffer + 2].color.setRGB(0.0f, 0.0f, 0.0f);
spriteVertexData[vertexCountInBuffer + 3].color.setRGB(0.0f, 0.0f, 0.0f);
	

 

Matrix calculations (done inside addSpriteToVertexBuffer): ~0.001232ms per sprite. Total time for the 10K: ~12.32ms


//Matrix calculations
Matrix4 model = sprite->getModelMatrix(); //returns a matrix that is translation * rotation * scale
spriteVertexData[vertexCountInBuffer].position = model * spriteVertexData[vertexCountInBuffer].position;
spriteVertexData[vertexCountInBuffer + 1].position = model * spriteVertexData[vertexCountInBuffer + 1].position;
spriteVertexData[vertexCountInBuffer + 2].position = model * spriteVertexData[vertexCountInBuffer + 2].position;
spriteVertexData[vertexCountInBuffer + 3].position = model * spriteVertexData[vertexCountInBuffer + 3].position;
QueryPerformanceCounter(&endTime);

//getModelMatrix Method
Matrix4 SpriteV2::getModelMatrix()
{
	return translationMatrix * rotationMatrix * scaleMatrix;
}

 

A decent chunk of my time is spent doing Matrix calculations, but I really don't know how I can get the timing down on those. The profiler that comes with visual studio seems to back this up too. Just in case anyone is curious here are the profiler's reports: 10KSpritesPerfReports.zip

 

I also noticed something kind of weird while timing the individual steps. For example, when timing the matrix calculation block my application's window would be blank. Not blank as in I saw the clear color, but blank as in I only saw the window background. Once all my debug statements were done printing I saw my sprites. I think this has something to do with the way that I am printing out the debug statements. Currently I'm allocating a console at the start of the application, so I have the Win32 app and then a traditional console, and to print any debug statements I just do

std::cout << myDebugStatementAsString << std::endl;

So it got me thinking, could this also be crippling everything? How should I normally print out debug information? Trying to grasp at anything here
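
One thing worth trying (just an idea, not something from the profiler reports): console writes through std::cout with std::endl are synchronous and flush on every line, so printing thousands of timing lines can stall the loop by itself. Buffering the lines and flushing once per frame avoids that:

#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> frameLog;

void logLine(const std::string& line)
{
	frameLog.push_back(line);      //cheap during the frame
}

void flushFrameLog()
{
	for (const std::string& line : frameLog)
		std::cout << line << '\n'; //'\n' instead of std::endl: no flush per line
	std::cout.flush();             //one flush at the end of the frame
	frameLog.clear();
}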

Edited by noodleBowl

30 minutes ago, noodleBowl said:

A decent chunk of my time is spent doing Matrix calculations, but I really don't know how I can get the timing down on those.

Are you using SIMD math?  Moreover are you using DirectXMath?

https://github.com/Microsoft/DirectXMath

https://msdn.microsoft.com/en-us/library/windows/desktop/hh437833(v=vs.85).aspx

9 hours ago, Infinisearch said:

Are you using SIMD math?  Moreover are you using DirectXMath?

I'm not using DirectXMath and my system currently does not use SIMD.

BUT I did go back and made a SIMD test, where I timed multiplying a Matrix4x4 by a Vector3 using SIMD operations and then timed the same thing using normal math operations. I tried to mimic what my 10K sprite test is doing, so I run the Matrix4x4 * Vector3 operation 4 times and then repeat this 10K times.

The weird thing is the SIMD method runs a little slower than the normal math operations. I really would have thought it would have been the other way around.

Test Results:

//Debug Mode
SIMD TIME: 5.204931ms
NORM TIME: 4.222079ms

//Release Mode
SIMD TIME: 0.300521ms
NORM TIME: 0.242634ms


This is my complete test:


#include <iostream>
#include <memory>
#include <pmmintrin.h>
#include <Windows.h>
#include <string>

class Vector3 
{

public:
	Vector3()
	{
		x = 0.0f;
		y = 0.0f;
		z = 0.0f;
	}

	~Vector3()
	{
	}

	float x;
	float y;
	float z;
};

class Matrix4
{

public:
	Matrix4()
	{
		data[0] = 1.0f;
		data[1] = 0.0f;
		data[2] = 0.0f;
		data[3] = 0.0f;

		data[4] = 0.0f;
		data[5] = 1.0f;
		data[6] = 0.0f;
		data[7] = 0.0f;

		data[8] = 0.0f;
		data[9] = 0.0f;
		data[10] = 1.0f;
		data[11] = 0.0f;

		data[12] = 0.0f;
		data[13] = 0.0f;
		data[14] = 0.0f;
		data[15] = 1.0f;
	}
	~Matrix4() {}

	float data[16];

};

Vector3 normMul(const Matrix4& m, const Vector3 &b)
{

	Vector3 r;
	r.x = m.data[0] * b.x + m.data[4] * b.y + m.data[8]  * b.z + m.data[12] * 1.0f;
	r.y = m.data[1] * b.x + m.data[5] * b.y + m.data[9]  * b.z + m.data[13] * 1.0f;
	r.z = m.data[2] * b.x + m.data[6] * b.y + m.data[10] * b.z + m.data[14] * 1.0f;

	return r;
}

Vector3 simdMul(const Matrix4& m, const Vector3 &b)
{

	//Setup
	Vector3 r;
	__m128 m1 = _mm_set_ps(m.data[12], m.data[8], m.data[4], m.data[0]);
	__m128 m2 = _mm_set_ps(m.data[13], m.data[9], m.data[5], m.data[1]);
	__m128 m3 = _mm_set_ps(m.data[14], m.data[10], m.data[6], m.data[2]);
	__m128 vec = _mm_set_ps(1.0f, b.z, b.y, b.x);

	//Multiply the vector with the matrix columns; matrices are column-major
	m1 = _mm_mul_ps(m1, vec);
	m2 = _mm_mul_ps(m2, vec);
	m3 = _mm_mul_ps(m3, vec);

	//Get result x
	m1 = _mm_hadd_ps(m1, m1);
	r.x = _mm_cvtss_f32(_mm_hadd_ps(m1, m1));

	//Get result y
	m2 = _mm_hadd_ps(m2, m2);
	r.y = _mm_cvtss_f32(_mm_hadd_ps(m2, m2));

	//Get result z
	m3 = _mm_hadd_ps(m3, m3);
	r.z = _mm_cvtss_f32(_mm_hadd_ps(m3, m3));

	return r;
}

int main()
{

	LARGE_INTEGER startTime;
	LARGE_INTEGER endTime;
	LARGE_INTEGER frq;
	QueryPerformanceFrequency(&frq);

	Vector3 result;
	Matrix4 mat1;
	Vector3 v1;
	v1.x = 2.0f;
	v1.y = 5.0f;
	v1.z = 10.0f;
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			result = simdMul(mat1, v1);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "SIMD TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;

	Matrix4 mat2;
	Vector3 v2;
	v2.x = 2.0f;
	v2.y = 5.0f;
	v2.z = 10.0f;
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			result = normMul(mat2, v2);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "NORM TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;

	std::cout << "Complete" << std::endl;
}

 

 

Edited by noodleBowl

1 hour ago, noodleBowl said:

The weird thing is the SIMD method runs a little slower than the normal math operations. I really would have thought it would have been the other way around.

Using SIMD correctly is more than just using intrinsics... I didn't check your code, but I suspect you're not doing things correctly. Trust me when I say just use DirectXMath; vertex transformation is one of the things it was basically made to do. Make a separate project just as a test of using DirectXMath if you don't believe me. It's been a long time since I messed with SIMD, but IIRC you transform 4 vertices at once and use an SOA (structure of arrays) layout for your input vertex data. (You should check what I just said, it's been a really long time.)

edit - BTW the main reason I say just use DirectXMath is that now there is SSE, SSE2, SSE3, SSE4, SSE 4.1, AVX, AVX2, and AVX-512... DirectXMath supports all or most of them. When you're just trying to get up and running it's a blessing. You could always learn SIMD programming as a side project, but you don't want to get sidetracked. Anyway, here is a link to an old article on transforming a 3D vector by a 4x4 matrix; it will give you an idea of what's involved: http://www.hugi.scene.org/online/hugi25/hugi 25 - coding corner optimizing cort optimizing for sse a case study.htm  Basically I think you should do a test using DirectXMath, but that's just my opinion.

Edited by Infinisearch


Alright, so I had some time to investigate the DirectXMath lib and run some tests, and I have some questions.

So here are my results for my tests

//============ DEBUG MODE Times ==========
//Using normal math operations
Norm TIME: 4.275861ms

//Using DirectX Math where items were loaded from XMFLOAT3/XMFLOAT4X4 and then stored back to XMFLOAT3
DirectX Math XMFLOAT TIME: 4.965582ms

//Using DirectX Math where XMVector/XMMatrix were used directly
DirectX Math RAW SIMD TIME: 2.183706ms

//Using custom solution where __m128 was directly used
New RAW SIMD Solution TIME: 1.502607ms

//Original attempt, used loaded data from a Vector3/Matrix44 and stored the result back into a Vector3
Original SIMD Solution TIME: 5.034964ms

Code used in case anyone is interested


#include <iostream>
#include <pmmintrin.h>
#include <Windows.h>
#include <string>
#include <DirectXMath.h>

class Vector3 
{

public:
	Vector3()
	{
		x = 0.0f;
		y = 0.0f;
		z = 0.0f;
	}

	~Vector3()
	{
	}

	float x;
	float y;
	float z;
};

class Matrix4
{

public:
	Matrix4()
	{
		data[0] = 1.0f;
		data[1] = 0.0f;
		data[2] = 0.0f;
		data[3] = 0.0f;

		data[4] = 0.0f;
		data[5] = 1.0f;
		data[6] = 0.0f;
		data[7] = 0.0f;

		data[8] = 0.0f;
		data[9] = 0.0f;
		data[10] = 1.0f;
		data[11] = 0.0f;

		data[12] = 0.0f;
		data[13] = 0.0f;
		data[14] = 0.0f;
		data[15] = 1.0f;
	}
	~Matrix4() {}

	float data[16];

	void set(float* b)
	{
		data[0] = b[0];
		data[1] = b[1];
		data[2] = b[2];
		data[3] = b[3];

		data[4] = b[4];
		data[5] = b[5];
		data[6] = b[6];
		data[7] = b[7];

		data[8] = b[8];
		data[9] = b[9];
		data[10] = b[10];
		data[11] = b[11];

		data[12] = b[12];
		data[13] = b[13];
		data[14] = b[14];
		data[15] = b[15];
	}

};

class SIMDVector3
{

public:
	SIMDVector3()
	{
		data = _mm_setzero_ps();
	}

	SIMDVector3(__m128 data)
	{
		this->data = data;
	}

	SIMDVector3(float x, float y, float z)
	{
		data = _mm_set_ps(1.0f, z, y, x);
	}

	~SIMDVector3()
	{
	}

	__m128 data;
};

class SIMDMatrix4
{

public:
	SIMDMatrix4()
	{
		data[0] = _mm_set_ps(1.0f, 0.0f, 0.0f, 0.0f);
		data[1] = _mm_set_ps(0.0f, 1.0f, 0.0f, 0.0f);
		data[2] = _mm_set_ps(0.0f, 0.0f, 1.0f, 0.0f);
		data[3] = _mm_set_ps(0.0f, 0.0f, 0.0f, 1.0f);
	}

	SIMDMatrix4(float* b)
	{
		data[0] = _mm_set_ps(b[3], b[2], b[1], b[0]);
		data[1] = _mm_set_ps(b[7], b[6], b[5], b[4]);
		data[2] = _mm_set_ps(b[11], b[10], b[9], b[8]);
		data[3] = _mm_set_ps(b[15], b[14], b[13], b[12]);
	}

	~SIMDMatrix4()
	{
	}

	__m128 data[4];
};


Vector3 normMul(const Matrix4 &m, const Vector3 &b)
{

	Vector3 r;
	r.x = m.data[0] * b.x + m.data[4] * b.y + m.data[8]  * b.z + m.data[12] * 1.0f;
	r.y = m.data[1] * b.x + m.data[5] * b.y + m.data[9]  * b.z + m.data[13] * 1.0f;
	r.z = m.data[2] * b.x + m.data[6] * b.y + m.data[10] * b.z + m.data[14] * 1.0f;

	return r;
}

Vector3 origSIMDMul(const Matrix4 &m, const Vector3 &b)
{

	//Setup
	Vector3 r;
	__m128 m1 = _mm_set_ps(m.data[12], m.data[8], m.data[4], m.data[0]);
	__m128 m2 = _mm_set_ps(m.data[13], m.data[9], m.data[5], m.data[1]);
	__m128 m3 = _mm_set_ps(m.data[14], m.data[10], m.data[6], m.data[2]);
	__m128 vec = _mm_set_ps(1.0f, b.z, b.y, b.x);

	//Multiply the vector with the matrix columns; matrices are column-major
	m1 = _mm_mul_ps(m1, vec);
	m2 = _mm_mul_ps(m2, vec);
	m3 = _mm_mul_ps(m3, vec);

	//Get result x
	m1 = _mm_hadd_ps(m1, m1);
	r.x = _mm_cvtss_f32(_mm_hadd_ps(m1, m1));

	//Get result y
	m2 = _mm_hadd_ps(m2, m2);
	r.y = _mm_cvtss_f32(_mm_hadd_ps(m2, m2));

	//Get result z
	m3 = _mm_hadd_ps(m3, m3);
	r.z = _mm_cvtss_f32(_mm_hadd_ps(m3, m3));

	return r;
}

void simdMul(const SIMDMatrix4 &m, const SIMDVector3 &b, SIMDVector3 &r)
{

	__m128 x = _mm_mul_ps(m.data[0], _mm_shuffle_ps(b.data, b.data, _MM_SHUFFLE(0, 0, 0, 0)));
	__m128 y = _mm_mul_ps(m.data[1], _mm_shuffle_ps(b.data, b.data, _MM_SHUFFLE(1, 1, 1, 1)));
	__m128 z = _mm_mul_ps(m.data[2], _mm_shuffle_ps(b.data, b.data, _MM_SHUFFLE(2, 2, 2, 2)));
	r.data = _mm_add_ps(x, _mm_add_ps(y, _mm_add_ps(z, m.data[3])));

}

int main()
{
	
	LARGE_INTEGER startTime;
	LARGE_INTEGER endTime;
	LARGE_INTEGER frq;
	QueryPerformanceFrequency(&frq);

	DirectX::XMFLOAT3 xmResult;
	DirectX::XMFLOAT3 xmVec3(2.0f, 5.0f, 10.0f);
	DirectX::XMFLOAT4X4 xmMat44(1.0f, 0.0f, 0.0f, 0.0f,
							  0.0f, 1.0f, 0.0f, 0.0f, 
							  0.0f, 0.0f, 1.0f, 0.0f, 
							  0.0f, 0.0f, 0.0f, 1.0f);
	DirectX::XMVECTOR rawVec;
	DirectX::XMMATRIX rawMat;
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		rawMat = DirectX::XMLoadFloat4x4(&xmMat44);
		for (int j = 0; j < 4; ++j)
		{
			rawVec = DirectX::XMLoadFloat3(&xmVec3);
			DirectX::XMStoreFloat3(&xmResult, DirectX::XMVector3Transform(rawVec, rawMat));
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "DirectX Math XMFLOAT TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;


	DirectX::XMVECTOR xmSimdResult;
	DirectX::XMVECTOR xmSimdVec = DirectX::XMVectorSet(2.0f, 5.0f, 10.0f, 1.0f);
	DirectX::XMMATRIX xmSimdMat = DirectX::XMMatrixIdentity();
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			xmSimdResult = DirectX::XMVector3Transform(xmSimdVec, xmSimdMat);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "DirectX Math RAW SIMD TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;
	
	SIMDVector3 smRes;
	float data[16];
	data[0] = 1.0f;
	data[1] = 5.0f;
	data[2] = 9.0f;
	data[3] = 13.0f;

	data[4] = 2.0f;
	data[5] = 6.0f;
	data[6] = 10.0f;
	data[7] = 14.0f;

	data[8] = 3.0f;
	data[9] = 7.0f;
	data[10] = 11.0f;
	data[11] = 15.0f;

	data[12] = 4.0f;
	data[13] = 8.0f;
	data[14] = 12.0f;
	data[15] = 16.0f;
	SIMDMatrix4 smMat(data);
	SIMDVector3 smVec(2.0f, 5.0f, 10.0f);
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			simdMul(smMat, smVec, smRes);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "New RAW SIMD Solution TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;

	Vector3 vecRes;
	Matrix4 mat1;
	Vector3 v1;
	v1.x = 2.0f;
	v1.y = 5.0f;
	v1.z = 10.0f;
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			vecRes = origSIMDMul(mat1, v1);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "Original SIMD Solution TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;

	Matrix4 mat2;
	Vector3 v2;
	v2.x = 2.0f;
	v2.y = 5.0f;
	v2.z = 10.0f;
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			vecRes = normMul(mat2, v2);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "Norm TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;

	std::cout << "Complete" << std::endl;
}

 

 

On 12/11/2017 at 6:25 AM, Hodgman said:

This function spends more time converting between non-SIMD-arranged data and SIMD-arranged data, and back again, than it does actually doing any calculations.

Looking at the test times, @Hodgman is 100% right. Loading data into the SIMD registers and then getting it back out completely outweighs the benefit of the fast SIMD calculations. This can also be seen in the DirectX Math test where I use the XMFLOAT3 / XMFLOAT4X4 types, as these need to be loaded/stored. I have a question about this further down.

 

On 12/11/2017 at 3:36 AM, Infinisearch said:

Using SIMD correctly is more than just using intrinsics

SIMD operations are insanely fast. When running in release mode the timing on the RAW SIMD tests can't even register (0ms). I can bump the loop up to simulate over 100 million vector transformations against a matrix and it still comes back as 0ms on the timer. So you can really do some serious work if you directly use the SIMD __m128 type and do not load/unload things often.

 

Now this brings me back to my questions about DirectXMath and how to use the lib. According to the MSDN DirectXMath guide, the XMVECTOR and XMMATRIX types are the workhorses of the DirectXMath library, which makes total sense, but then they go on to say:

Quote

Allocations from the heap, however, are more complicated. As such, you need to be careful whenever you use either XMVECTOR or XMMATRIX as a member of a class or structure to be allocated from the heap. On Windows x64, all heap allocations are 16-byte aligned, but for Windows x86, they are only 8-byte aligned. There are options for allocating structures from the heap with 16-byte alignment (see Properly Align Allocations). For C++ programs, you can use operator new/delete/new[]/delete[] overloads (either globally or class-specific) to enforce optimal alignment if desired.

Which I understand, but I guess I'm not really sure what is expected in the overloaded new/delete/new[]/delete[]. I just know that doing:

class Sprite
{
public:
	Sprite(){}
	~Sprite(){}
	XMVECTOR position;
	XMVECTOR texCoords;
	XMVECTOR color;
};

Sprite* mySprite = new Sprite;

Is going to mess up the alignment and make SIMD operations take a performance hit
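
From what I can tell, the overloads the docs are hinting at would look something like this (a sketch using the CRT aligned allocator; C++17's aligned operator new is apparently another option):

#include <DirectXMath.h>
#include <malloc.h>  //_aligned_malloc / _aligned_free

class Sprite
{
public:
	void* operator new(size_t size)    { return _aligned_malloc(size, 16); }
	void  operator delete(void* p)     { _aligned_free(p); }
	void* operator new[](size_t size)  { return _aligned_malloc(size, 16); }
	void  operator delete[](void* p)   { _aligned_free(p); }

	DirectX::XMVECTOR position;
	DirectX::XMVECTOR texCoords;
	DirectX::XMVECTOR color;
};

Sprite* mySprite = new Sprite();  //now 16-byte aligned even in a 32-bit build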


Then they go on to say

Quote

However, often it is easier and more compact to avoid using XMVECTOR or XMMATRIX directly in a class or structure. Instead, make use of the XMFLOAT3, XMFLOAT4, XMFLOAT4X3, XMFLOAT4X4, and so on, as members of your structure. Further, you can use the Vector Loading and Vector Storage functions to move the data efficiently into XMVECTOR or XMMATRIX local variables, perform computations, and store the results. There are also streaming functions (XMVector3TransformStream, XMVector4TransformStream, and so on) that efficiently operate directly on arrays of these data types

And that's where I get thrown off

Am I normally supposed to be using the XMFLOAT[n] / XMFLOAT[n]X[m] types?
Based on the above statement it sounds like I should, but that does not make sense to me if I want to take advantage of SIMD operations, as having to load/unload data causes a major performance hit, often making the timings worse than using normal math operations
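
For what it's worth, my reading of the streaming functions mentioned in that quote is that the XMFLOAT types are meant to stay in your classes and then be loaded, transformed and stored in batches, so the load/store cost is paid once per batch rather than once per vertex. A rough sketch (modelMatrix being an XMFLOAT4X4 member is an assumption on my part):

#include <DirectXMath.h>

//Plain structs live in the class; SIMD registers only exist inside the batch call
DirectX::XMFLOAT3 inPositions[4];   //e.g. the four corners of one sprite, filled elsewhere
DirectX::XMFLOAT4 outPositions[4];  //transformed results

DirectX::XMMATRIX model = DirectX::XMLoadFloat4x4(&modelMatrix);
DirectX::XMVector3TransformStream(outPositions, sizeof(DirectX::XMFLOAT4),
                                  inPositions, sizeof(DirectX::XMFLOAT3),
                                  4, model);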

 

Also, I noticed something during my tests (and this may be my fault): it seems like I have to transpose the matrix before multiplying it by the vector to get the correct result when using DirectXMath. Is this normal?

//Multiplying matrix by vec should get me the result vector of 46, 118, 190, 262
//But this only happens if I transpose the matrix first
//If I DO NOT transpose the matrix first I get the result vector of 130, 148, 166, 184 which is wrong?
DirectX::XMVECTOR vec = DirectX::XMVectorSet(2.0f, 5.0f, 10.0f, 1.0f);
DirectX::XMMATRIX mat = {
  1.0f, 2.0f, 3.0f, 4.0f,
  5.0f, 6.0f, 7.0f, 8.0f,
  9.0f, 10.0f, 11.0f, 12.0f,
  13.0f, 14.0f, 15.0f, 16.0f,
};
mat = DirectX::XMMatrixTranspose(mat);
DirectX::XMVECTOR r = DirectX::XMVector3Transform(vec, mat);
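
If I'm reading the docs right, that transpose is just the row-vector vs column-vector convention rather than a bug: DirectXMath treats vectors as row vectors, so XMVector3Transform computes v * M, while my matrix above is written for M * v. Transposing converts between the two, since v * transpose(M) has the same components as M * v:

DirectX::XMVECTOR v = DirectX::XMVectorSet(2.0f, 5.0f, 10.0f, 1.0f);
DirectX::XMMATRIX m = { 1.0f, 2.0f, 3.0f, 4.0f,
                        5.0f, 6.0f, 7.0f, 8.0f,
                        9.0f, 10.0f, 11.0f, 12.0f,
                        13.0f, 14.0f, 15.0f, 16.0f };

//Row-vector convention (what DirectXMath does): 130, 148, 166, 184
DirectX::XMVECTOR rowResult = DirectX::XMVector3Transform(v, m);

//Column-vector convention via transpose: 46, 118, 190, 262
DirectX::XMVECTOR colResult = DirectX::XMVector3Transform(v, DirectX::XMMatrixTranspose(m));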


 

 

Edited by noodleBowl

