Trying to find bottlenecks in my renderer

37 comments, last by Matias Goldberg 6 years, 3 months ago
1 hour ago, noodleBowl said:

The weird thing is the SIMD method runs a little slower than the normal math operations. I really would have thought it would have been the other way around.

Using SIMD correctly is more than just using intrinsics... I didn't check your code, but I suspect you're not doing things correctly.  Trust me when I say just use DirectXMath; vertex transformation is one of the things it was basically made to do.  Make a separate project just as a test of using DirectXMath if you don't believe me.  It's been a long time since I messed with SIMD, but IIRC you should transform 4 vertices at once and use an SOA (structure of arrays) layout on your input vertex data. (You should check what I just said, it's been a really long time.)
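A minimal sketch of that SoA idea, assuming SSE and a column-major matrix laid out like the thread's Matrix4 (transform4SoA is a hypothetical name, not from any library): each register holds the same component of four different vertices, so one batch of multiplies and adds transforms four points at once, with no horizontal adds or shuffles needed.

```cpp
#include <xmmintrin.h>  // SSE1 is enough for the SoA form

// Transform 4 points at once. m is column-major, like Matrix4 in this thread:
// the columns are m[0..3], m[4..7], m[8..11], m[12..15].
void transform4SoA(const float m[16],
                   const float xs[4], const float ys[4], const float zs[4],
                   float ox[4], float oy[4], float oz[4])
{
    __m128 x = _mm_loadu_ps(xs);
    __m128 y = _mm_loadu_ps(ys);
    __m128 z = _mm_loadu_ps(zs);

    // rx = m0*x + m4*y + m8*z + m12, computed for all four points in parallel
    __m128 rx = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[0]), x), _mm_mul_ps(_mm_set1_ps(m[4]), y)),
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[8]), z), _mm_set1_ps(m[12])));
    __m128 ry = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[1]), x), _mm_mul_ps(_mm_set1_ps(m[5]), y)),
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[9]), z), _mm_set1_ps(m[13])));
    __m128 rz = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[2]), x), _mm_mul_ps(_mm_set1_ps(m[6]), y)),
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[10]), z), _mm_set1_ps(m[14])));

    _mm_storeu_ps(ox, rx);
    _mm_storeu_ps(oy, ry);
    _mm_storeu_ps(oz, rz);
}
```

Every lane does useful work and nothing has to be swizzled back into AoS order until the very end, which is why the SoA layout tends to beat one-vector-at-a-time intrinsics.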

edit - BTW the main reason I say just use DirectXMath is that there is now SSE, SSE2, SSE3, SSE4, SSE4.1, AVX, AVX2, and AVX-512... DirectXMath supports all or most of them.  When you're just trying to get up and running, it's a blessing.  You could always learn SIMD programming as a side project, but you don't want to get sidetracked.  Anyway, here is a link to an old article on transforming a 3D vector by a 4x4 matrix; it will give you an idea of what's involved:  http://www.hugi.scene.org/online/hugi25/hugi 25 - coding corner optimizing cort optimizing for sse a case study.htm  Basically I think you should do a test using DirectXMath, but that's just my opinion.

-potential energy is easily made kinetic-

3 hours ago, noodleBowl said:

Vector3 simdMul(const Matrix4& m, const Vector3 &b) {

This function spends more time converting between non-simd arranged data to simd-arranged data, and back again, than it does actually doing any calculations.

Alright, so I had some time to investigate the DirectX Math lib and run some tests. And I got some questions

So here are my results for my tests


//============ DEBUG MODE Times ==========
//Using normal math operations
Norm TIME: 4.275861ms

//Using DirectX Math where items were loaded from XMFLOAT3/XMFLOAT4X4 and then stored to XMFLOAT3
DirectX Math XMFLOAT TIME: 4.965582ms

//Using DirectX Math where XMVector/XMMatrix were used directly
DirectX Math RAW SIMD TIME: 2.183706ms

//Using custom solution where __m128 was directly used
New RAW SIMD Solution TIME: 1.502607ms

//Original attempt, loaded data from a Vector3/Matrix4 and stored the result back into a Vector3
Original SIMD Solution TIME: 5.034964ms

Code used in case anyone is interested

Spoiler


#include <iostream>
#include <pmmintrin.h>
#include <Windows.h>
#include <string>
#include <DirectXMath.h>

class Vector3 
{

public:
	Vector3()
	{
		x = 0.0f;
		y = 0.0f;
		z = 0.0f;
	}

	~Vector3()
	{
	}

	float x;
	float y;
	float z;
};

class Matrix4
{

public:
	Matrix4()
	{
		data[0] = 1.0f;
		data[1] = 0.0f;
		data[2] = 0.0f;
		data[3] = 0.0f;

		data[4] = 0.0f;
		data[5] = 1.0f;
		data[6] = 0.0f;
		data[7] = 0.0f;

		data[8] = 0.0f;
		data[9] = 0.0f;
		data[10] = 1.0f;
		data[11] = 0.0f;

		data[12] = 0.0f;
		data[13] = 0.0f;
		data[14] = 0.0f;
		data[15] = 1.0f;
	}
	~Matrix4() {}

	float data[16];

	void set(float* b)
	{
		data[0] = b[0];
		data[1] = b[1];
		data[2] = b[2];
		data[3] = b[3];

		data[4] = b[4];
		data[5] = b[5];
		data[6] = b[6];
		data[7] = b[7];

		data[8] = b[8];
		data[9] = b[9];
		data[10] = b[10];
		data[11] = b[11];

		data[12] = b[12];
		data[13] = b[13];
		data[14] = b[14];
		data[15] = b[15];
	}

};

class SIMDVector3
{

public:
	SIMDVector3()
	{
		data = _mm_setzero_ps();
	}

	SIMDVector3(__m128 data)
	{
		this->data = data;
	}

	SIMDVector3(float x, float y, float z)
	{
		data = _mm_set_ps(1.0f, z, y, x);
	}

	~SIMDVector3()
	{
	}

	__m128 data;
};

class SIMDMatrix4
{

public:
	SIMDMatrix4()
	{
		data[0] = _mm_set_ps(1.0f, 0.0f, 0.0f, 0.0f);
		data[1] = _mm_set_ps(0.0f, 1.0f, 0.0f, 0.0f);
		data[2] = _mm_set_ps(0.0f, 0.0f, 1.0f, 0.0f);
		data[3] = _mm_set_ps(0.0f, 0.0f, 0.0f, 1.0f);
	}

	SIMDMatrix4(float* b)
	{
		data[0] = _mm_set_ps(b[3], b[2], b[1], b[0]);
		data[1] = _mm_set_ps(b[7], b[6], b[5], b[4]);
		data[2] = _mm_set_ps(b[11], b[10], b[9], b[8]);
		data[3] = _mm_set_ps(b[15], b[14], b[13], b[12]);
	}

	~SIMDMatrix4()
	{
	}

	__m128 data[4];
};


Vector3 normMul(const Matrix4 &m, const Vector3 &b)
{

	Vector3 r;
	r.x = m.data[0] * b.x + m.data[4] * b.y + m.data[8]  * b.z + m.data[12] * 1.0f;
	r.y = m.data[1] * b.x + m.data[5] * b.y + m.data[9]  * b.z + m.data[13] * 1.0f;
	r.z = m.data[2] * b.x + m.data[6] * b.y + m.data[10] * b.z + m.data[14] * 1.0f;

	return r;
}

Vector3 origSIMDMul(const Matrix4 &m, const Vector3 &b)
{

	//Setup
	Vector3 r;
	__m128 m1 = _mm_set_ps(m.data[12], m.data[8], m.data[4], m.data[0]);
	__m128 m2 = _mm_set_ps(m.data[13], m.data[9], m.data[5], m.data[1]);
	__m128 m3 = _mm_set_ps(m.data[14], m.data[10], m.data[6], m.data[2]);
	__m128 vec = _mm_set_ps(1.0f, b.z, b.y, b.x);

	//Multiply the vec with the columns; matrices are column-major order
	m1 = _mm_mul_ps(m1, vec);
	m2 = _mm_mul_ps(m2, vec);
	m3 = _mm_mul_ps(m3, vec);

	//Get result x
	m1 = _mm_hadd_ps(m1, m1);
	r.x = _mm_cvtss_f32(_mm_hadd_ps(m1, m1));

	//Get result y
	m2 = _mm_hadd_ps(m2, m2);
	r.y = _mm_cvtss_f32(_mm_hadd_ps(m2, m2));

	//Get result z
	m3 = _mm_hadd_ps(m3, m3);
	r.z = _mm_cvtss_f32(_mm_hadd_ps(m3, m3));

	return r;
}

void simdMul(const SIMDMatrix4 &m, const SIMDVector3 &b, SIMDVector3 &r)
{

	__m128 x = _mm_mul_ps(m.data[0], _mm_shuffle_ps(b.data, b.data, _MM_SHUFFLE(0, 0, 0, 0)));
	__m128 y = _mm_mul_ps(m.data[1], _mm_shuffle_ps(b.data, b.data, _MM_SHUFFLE(1, 1, 1, 1)));
	__m128 z = _mm_mul_ps(m.data[2], _mm_shuffle_ps(b.data, b.data, _MM_SHUFFLE(2, 2, 2, 2)));
	r.data = _mm_add_ps(x, _mm_add_ps(y, _mm_add_ps(z, m.data[3])));

}

int main()
{
	
	LARGE_INTEGER startTime;
	LARGE_INTEGER endTime;
	LARGE_INTEGER frq;
	QueryPerformanceFrequency(&frq);

	DirectX::XMFLOAT3 xmResult;
	DirectX::XMFLOAT3 xmVec3(2.0f, 5.0f, 10.0f);
	DirectX::XMFLOAT4X4 xmMat44(1.0f, 0.0f, 0.0f, 0.0f,
							  0.0f, 1.0f, 0.0f, 0.0f, 
							  0.0f, 0.0f, 1.0f, 0.0f, 
							  0.0f, 0.0f, 0.0f, 1.0f);
	DirectX::XMVECTOR rawVec;
	DirectX::XMMATRIX rawMat;
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		rawMat = DirectX::XMLoadFloat4x4(&xmMat44);
		for (int j = 0; j < 4; ++j)
		{
			rawVec = DirectX::XMLoadFloat3(&xmVec3);
			DirectX::XMStoreFloat3(&xmResult, DirectX::XMVector3Transform(rawVec, rawMat));
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "DirectX Math XMFLOAT TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;


	DirectX::XMVECTOR xmSimdResult;
	DirectX::XMVECTOR xmSimdVec = DirectX::XMVectorSet(2.0f, 5.0f, 10.0f, 1.0f);
	DirectX::XMMATRIX xmSimdMat = DirectX::XMMatrixIdentity();
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			xmSimdResult = DirectX::XMVector3Transform(xmSimdVec, xmSimdMat);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "DirectX Math RAW SIMD TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;
	
	SIMDVector3 smRes;
	float data[16];
	data[0] = 1.0f;
	data[1] = 5.0f;
	data[2] = 9.0f;
	data[3] = 13.0f;

	data[4] = 2.0f;
	data[5] = 6.0f;
	data[6] = 10.0f;
	data[7] = 14.0f;

	data[8] = 3.0f;
	data[9] = 7.0f;
	data[10] = 11.0f;
	data[11] = 15.0f;

	data[12] = 4.0f;
	data[13] = 8.0f;
	data[14] = 12.0f;
	data[15] = 16.0f;
	SIMDMatrix4 smMat(data);
	SIMDVector3 smVec(2.0f, 5.0f, 10.0f);
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			simdMul(smMat, smVec, smRes);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "New RAW SIMD Solution TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;

	Vector3 vecRes;
	Matrix4 mat1;
	Vector3 v1;
	v1.x = 2.0f;
	v1.y = 5.0f;
	v1.z = 10.0f;
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			vecRes = origSIMDMul(mat1, v1);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "Original SIMD Solution TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;

	Matrix4 mat2;
	Vector3 v2;
	v2.x = 2.0f;
	v2.y = 5.0f;
	v2.z = 10.0f;
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < 10000; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			vecRes = normMul(mat2, v2);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "Norm TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000) / (double)frq.QuadPart) + "ms" << std::endl;

	std::cout << "Complete" << std::endl;
}

 

 

On 12/11/2017 at 6:25 AM, Hodgman said:

This function spends more time converting between non-simd arranged data to simd-arranged data, and back again, than it does actually doing any calculations.

Looking at the test times, @Hodgman is 100% right. Loading data into the SIMD registers and then getting it back out completely outweighs the benefit of the fast SIMD calculations. This can also be seen in the DirectX Math test where I use the XMFLOAT3 / XMFLOAT4X4 types, as these need to be loaded/stored. I have a question about this later down the line

 

On 12/11/2017 at 3:36 AM, Infinisearch said:

Using SIMD correctly is more than just using intrinsics

SIMD operations are insanely fast. When running in release mode, the timing on the RAW SIMD tests can't even register (0 ms). I can bump the loop up to simulate over 100 million vector transformations against a matrix and it still comes out as 0 ms on the timer. So you can really do some serious work if you use the SIMD __m128 type directly and do not load/unload things often

 

Now this brings me back to my questions about DirectX Math and how to use the lib. According to the MSDN DirectXMath guide, the XMVECTOR and XMMATRIX types are the workhorses of the library. Which makes total sense, but then they go on to say:

Quote

Allocations from the heap, however, are more complicated. As such, you need to be careful whenever you use either XMVECTOR or XMMATRIX as a member of a class or structure to be allocated from the heap. On Windows x64, all heap allocations are 16-byte aligned, but for Windows x86, they are only 8-byte aligned. There are options for allocating structures from the heap with 16-byte alignment (see Properly Align Allocations). For C++ programs, you can use operator new/delete/new[]/delete[] overloads (either globally or class-specific) to enforce optimal alignment if desired.

Which I understand, but I guess I'm not really sure what is expected in the overloaded new/delete/new[]/delete[]. I just know that doing:


class Sprite
{
public:
	Sprite(){}
	~Sprite(){}
	DirectX::XMVECTOR position;
	DirectX::XMVECTOR texCoords;
	DirectX::XMVECTOR color;
};

Sprite* mySprite = new Sprite;

Is going to mess up the alignment and make SIMD operations take a performance hit
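For the overloaded new/delete question, here is a minimal sketch of what the docs are asking for, using a hypothetical AlignedSprite (std::aligned_alloc from C++17 is used for illustration; on MSVC you would more likely reach for _aligned_malloc/_aligned_free):

```cpp
#include <cstdlib>   // std::aligned_alloc, std::free
#include <new>       // std::bad_alloc

struct Float4 { float x, y, z, w; };  // stand-in for XMVECTOR's 16-byte payload

// Hypothetical sprite with class-specific operator new/delete that guarantee
// 16-byte alignment even on heaps that only promise 8 bytes (32-bit Windows).
struct alignas(16) AlignedSprite
{
    Float4 position;
    Float4 texCoords;
    Float4 color;

    static void* operator new(std::size_t size)
    {
        // aligned_alloc requires the size to be a multiple of the alignment
        void* p = std::aligned_alloc(16, (size + 15) & ~std::size_t(15));
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) noexcept { std::free(p); }
};
```

Worth noting: since C++17, `new` on a type whose `alignas` exceeds the default new alignment is routed through aligned allocation overloads automatically, so explicit overloads like these matter mainly for pre-C++17 toolchains.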


Then they go on to say

Quote

However, often it is easier and more compact to avoid using XMVECTOR or XMMATRIX directly in a class or structure. Instead, make use of the XMFLOAT3, XMFLOAT4, XMFLOAT4X3, XMFLOAT4X4, and so on, as members of your structure. Further, you can use the Vector Loading and Vector Storage functions to move the data efficiently into XMVECTOR or XMMATRIX local variables, perform computations, and store the results. There are also streaming functions (XMVector3TransformStream, XMVector4TransformStream, and so on) that efficiently operate directly on arrays of these data types

And that's where I get thrown off

Am I normally supposed to be using the XMFLOAT[n] / XMFLOAT[n]X[m] types?
Based on the above statement it sounds like I should, but that does not make sense to me if I want to take advantage of SIMD operations, since having to load/unload data causes a major performance hit, often making the timings worse than using normal math operations

 

Also, I noticed during my tests (and this may be my fault) that it seems like I have to transpose the matrix before multiplying it by the vector to get the correct vector result when using DirectXMath. Is this normal?


//Multiplying matrix by vec should get me the result vector of 46, 118, 190, 262
//But this only happens if I transpose the matrix first
//If I DO NOT transpose the matrix first I get the result vector of 130, 148, 166, 184 which is wrong?
DirectX::XMVECTOR vec = DirectX::XMVectorSet(2.0f, 5.0f, 10.0f, 1.0f);
DirectX::XMMATRIX mat = {
  1.0f, 2.0f, 3.0f, 4.0f,
  5.0f, 6.0f, 7.0f, 8.0f,
  9.0f, 10.0f, 11.0f, 12.0f,
  13.0f, 14.0f, 15.0f, 16.0f,
};
mat = DirectX::XMMatrixTranspose(mat);
DirectX::XMVECTOR r = DirectX::XMVector3Transform(vec, mat);
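The two result vectors fall out of the multiplication convention: XMVector3Transform treats the vector as a row vector (v·M), while this thread's normMul treats it as a column vector (M·v). A scalar sketch reproduces both sets of numbers from the snippet above:

```cpp
// M stored row-by-row, exactly as written in the snippet above
static const float M[4][4] = {
    { 1.0f,  2.0f,  3.0f,  4.0f},
    { 5.0f,  6.0f,  7.0f,  8.0f},
    { 9.0f, 10.0f, 11.0f, 12.0f},
    {13.0f, 14.0f, 15.0f, 16.0f},
};

// Column-vector convention: r_i = sum_j M[i][j] * v[j]
// With v = (2, 5, 10, 1) this gives (46, 118, 190, 262).
void mulMv(const float v[4], float r[4])
{
    for (int i = 0; i < 4; ++i)
    {
        r[i] = 0.0f;
        for (int j = 0; j < 4; ++j) r[i] += M[i][j] * v[j];
    }
}

// Row-vector convention (what XMVector3Transform uses): r_j = sum_i v[i] * M[i][j]
// With the same v this gives (130, 148, 166, 184) -- identical to multiplying
// the transposed matrix by a column vector, which is why the transpose "fixes" it.
void mulvM(const float v[4], float r[4])
{
    for (int j = 0; j < 4; ++j)
    {
        r[j] = 0.0f;
        for (int i = 0; i < 4; ++i) r[j] += v[i] * M[i][j];
    }
}
```

So neither answer is "wrong"; the transpose is just converting between the two conventions.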


 

 

On 12/13/2017 at 10:23 PM, noodleBowl said:

So here are my results for my tests

On 12/13/2017 at 10:23 PM, noodleBowl said:

SIMD operations are insanely fast. When running in release mode, the timing on the RAW SIMD tests can't even register (0 ms). I can bump the loop up to simulate over 100 million vector transformations against a matrix and it still comes out as 0 ms on the timer. So you can really do some serious work if you use the SIMD __m128 type directly and do not load/unload things often

1. Never profile in debug mode, it's pointless.  Throw out those results.  Benchmark your version vs DirectXMath in release mode.

2. Profile exactly what you're gonna do, which is transform either four or six vertices per matrix.

3. If you're not registering any time in milliseconds then you should try microseconds or even nanoseconds.  Alternatively you can benchmark more work (which you sort of tried to do... although I don't trust your results).

4. __m128 isn't really used in your class; the difference in your original results most likely stems from your use of debug mode.  Data in your classes is stored in memory (instead of registers) no matter what; the only question is whether or not your function call is using a faster way of passing data to the function.  See this: https://msdn.microsoft.com/en-us/library/ayeb3ayc.aspx

On 12/13/2017 at 10:23 PM, noodleBowl said:

Also, I noticed during my tests (and this may be my fault) that it seems like I have to transpose the matrix before multiplying it by the vector to get the correct vector result when using DirectXMath. Is this normal?

IIRC this is correct, and should be the same for your code as well, depending on whether you're using row-major or column-major matrices (or something like that).  If you're not transposing, I think you most likely have multiple mistakes that cancel each other out.

On 12/13/2017 at 10:23 PM, noodleBowl said:

Is going to mess up the alignment and make SIMD operations take a performance hit

No, because position is the first member and you don't have any virtual methods.  This might be a good case for using SOA, but I can't say from experience, since for a good while now I do all my transforms on the GPU (3D instead of 2D).

 

Anyway sorry for the late reply, I was gonna reply but then I got busy and after that my computer broke and I had to fix it.

-potential energy is easily made kinetic-

On 12/19/2017 at 10:18 AM, Infinisearch said:

1. Never profile in debug mode, it's pointless.  Throw out those results.  Benchmark your version vs DirectXMath in release mode.

2. Profile exactly what you're gonna do, which is transform either four or six vertices per matrix.

I am definitely simulating what my renderer will do, but why should I not profile in debug mode? I understand that in release mode the compiler will apply optimizations (speaking of the default ones) and that timings in debug mode are inflated because of various debug checks. But I figure that if I get a low time in debug mode, then my release mode timing will definitely be better and overall the application will run better. Might be flawed logic

On 12/19/2017 at 10:18 AM, Infinisearch said:

3. If you're not registering any time in milliseconds then you should try microseconds or even nanoseconds.  Alternatively you can benchmark more work (which you sort of tried to do... although I don't trust your results).

Currently I'm using QueryPerformanceCounter to do my timing, but it looks like my start time and end time in release mode come out the same, so I get 0 ms. This happens even if I use INT_MAX as the limit on my for-loop. I am not sure if this means I am doing something wrong or if it is really just that fast. How can I get microsecond / nanosecond precision?

Code I'm using

Spoiler

 



//-----------[ classes / methods ]----------------
class SIMDVector3
{

public:
	SIMDVector3()
	{
		data = _mm_setzero_ps();
	}

	SIMDVector3(__m128 data)
	{
		this->data = data;
	}

	SIMDVector3(float x, float y, float z)
	{
		data = _mm_set_ps(1.0f, z, y, x);
	}

	~SIMDVector3()
	{
	}

	__m128 data;
};

class SIMDMatrix4
{

public:
	SIMDMatrix4()
	{
		data[0] = _mm_set_ps(1.0f, 0.0f, 0.0f, 0.0f);
		data[1] = _mm_set_ps(0.0f, 1.0f, 0.0f, 0.0f);
		data[2] = _mm_set_ps(0.0f, 0.0f, 1.0f, 0.0f);
		data[3] = _mm_set_ps(0.0f, 0.0f, 0.0f, 1.0f);
	}

	SIMDMatrix4(float* b)
	{
		data[0] = _mm_set_ps(b[3], b[2], b[1], b[0]);
		data[1] = _mm_set_ps(b[7], b[6], b[5], b[4]);
		data[2] = _mm_set_ps(b[11], b[10], b[9], b[8]);
		data[3] = _mm_set_ps(b[15], b[14], b[13], b[12]);
	}

	~SIMDMatrix4()
	{
	}

	__m128 data[4];
};

void simdMul(const SIMDMatrix4 &m, const SIMDVector3 &b, SIMDVector3 &r)
{

	__m128 x = _mm_mul_ps(m.data[0], _mm_shuffle_ps(b.data, b.data, _MM_SHUFFLE(0, 0, 0, 0)));
	__m128 y = _mm_mul_ps(m.data[1], _mm_shuffle_ps(b.data, b.data, _MM_SHUFFLE(1, 1, 1, 1)));
	__m128 z = _mm_mul_ps(m.data[2], _mm_shuffle_ps(b.data, b.data, _MM_SHUFFLE(2, 2, 2, 2)));
	r.data = _mm_add_ps(x, _mm_add_ps(y, _mm_add_ps(z, m.data[3])));

}

//------------[ In main method ]-----------

	LARGE_INTEGER startTime;
	LARGE_INTEGER endTime;
	LARGE_INTEGER frq;
	QueryPerformanceFrequency(&frq);	

	SIMDVector3 smRes;
	float data[16];
	data[0] = 1.0f;
	data[1] = 5.0f;
	data[2] = 9.0f;
	data[3] = 13.0f;

	data[4] = 2.0f;
	data[5] = 6.0f;
	data[6] = 10.0f;
	data[7] = 14.0f;

	data[8] = 3.0f;
	data[9] = 7.0f;
	data[10] = 11.0f;
	data[11] = 15.0f;

	data[12] = 4.0f;
	data[13] = 8.0f;
	data[14] = 12.0f;
	data[15] = 16.0f;
	SIMDMatrix4 smMat(data);
	SIMDVector3 smVec(2.0f, 5.0f, 10.0f);
	QueryPerformanceCounter(&startTime);
	for (int i = 0; i < INT_MAX; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			simdMul(smMat, smVec, smRes);
		}
	}
	QueryPerformanceCounter(&endTime);
	std::cout << "New RAW SIMD Solution TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000000) / (double)frq.QuadPart) + "micro" << std::endl;
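On the precision question: a portable option is std::chrono::steady_clock, which reports integer nanoseconds and sidesteps the manual QueryPerformanceFrequency math (QPC itself usually ticks well below a microsecond, so a 0 ms reading is more likely the optimizer deleting the loop than a clock limitation). A sketch with a hypothetical helper, using a volatile sink so release builds can't remove the work:

```cpp
#include <chrono>

// Times `iterations` float additions and returns elapsed nanoseconds.
// The volatile sink forces the compiler to keep the loop in release builds.
long long timeAdditionsNs(int iterations, float* outSum)
{
    using clock = std::chrono::steady_clock;

    volatile float sink = 0.0f;
    clock::time_point t0 = clock::now();
    for (int i = 0; i < iterations; ++i)
        sink = sink + 1.0f;
    clock::time_point t1 = clock::now();

    *outSum = sink;  // observable result, so the timing brackets real work
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

The same duration_cast works for std::chrono::microseconds if nanosecond resolution is more than you need.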

 

 

On 12/19/2017 at 10:18 AM, Infinisearch said:

4. __m128 isn't really used in your class; the difference in your original results most likely stems from your use of debug mode.  Data in your classes is stored in memory (instead of registers) no matter what; the only question is whether or not your function call is using a faster way of passing data to the function.  See this: https://msdn.microsoft.com/en-us/library/ayeb3ayc.aspx

I don't really understand what you mean here. Are you saying that I shouldn't use __m128 (XMVECTOR) in my classes when I'm creating an application and should use the XMFLOATn types instead, OR are you saying that my classes (the ones provided in my example code) do not use the __m128 (XMVECTOR) type as class members?

Anyway, I'm definitely getting a faster time because of the way I am doing my method. I am passing a ref to where the result should be stored, whereas DirectXMath sends back a copy. If I also send back a copy, they come out the same in terms of timing

On 12/19/2017 at 10:18 AM, Infinisearch said:

IIRC this is correct, and should be the same for your code as well depending on if you're using row major or column major matrix's (or something like that).  If you're not transposing I think you most likely have multiple mistakes that cancel each other out.

I am using matrices in column-major order. As I understand it, DirectX traditionally uses row-major order and you are supposed to transpose matrices before handing them to shaders, because shaders expect column-major order by default. When it comes to DirectXMath, I'm not sure if this is just an extension of that idea or what

It kind of makes me think that I'm doing everything wrong, or that something is wrong when I need to transpose my matrices in order to get the correct result. If I only had to transpose them to pass them along to a shader, I would feel more comfortable

This is probably a stupid question, but why do you multiply each vertex by the matrix on the CPU instead of passing the matrix to the shader and doing the multiplication inside the shader? I mean this:


//Sprites have matrices that are precomputed. These pretransformed vertices are placed into the buffer
Matrix4 model = sprite->getModelMatrix();
verts[0].position = model * verts[0].position;
verts[1].position = model * verts[1].position;
verts[2].position = model * verts[2].position;
verts[3].position = model * verts[3].position;
verts[4].position = model * verts[4].position;
verts[5].position = model * verts[5].position;

Or maybe think about using instancing with DX11, so there would be less data being sent, since all the instances are just quads.

I skimmed through a very long thread just to find that only the last post (Mekamani's) points it out.

You're multiplying each vertex against a matrix on your CPU, with no threading. Taking 40 ms for doing 40k matrix multiplications per frame on a single core sounds about correct. That's your problem.

8 hours ago, Matias Goldberg said:

I skimmed through a very long thread just to find that only the last post (Mekamani's) points it out.

You're multiplying each vertex against a matrix on your CPU, with no threading. Taking 40 ms for doing 40k matrix multiplications per frame on a single core sounds about correct. That's your problem.

I definitely agree this is the problem. It's kind of been called out a few times, which is why there is talk about DirectXMath and SIMD instructions. Which is awesome, because I like hearing anything anyone has to offer. I am trying to learn / get better, which is why I like that @Mekamani brought up instancing, since it's a change in my approach

8 hours ago, Mekamani said:

This is probably a stupid question, but why do you multiply each vertex by the matrix on the CPU instead of passing the matrix to the shader and doing the multiplication inside the shader? I mean this:



//Sprites have matrices that are precomputed. These pretransformed vertices are placed into the buffer
Matrix4 model = sprite->getModelMatrix();
verts[0].position = model * verts[0].position;
verts[1].position = model * verts[1].position;
verts[2].position = model * verts[2].position;
verts[3].position = model * verts[3].position;
verts[4].position = model * verts[4].position;
verts[5].position = model * verts[5].position;

Or maybe think about using instancing with DX11, so there would be less data being sent, since all the instances are just quads.

Stupid question, not at all! :)

So the reason I'm doing all of this CPU-side is that I didn't want to issue a draw call per sprite, since each sprite has its own model matrix. That also means updating the constant buffer with the sprite's model matrix, mapping the vertex buffer with the sprite's data, etc.

I figured it would be better performance-wise to pretransform the vertices, batch them up, and then do a single draw call per batch, which is why that code snippet above exists. But as you can see from this thread, it might not be the best idea in the world, or at least one that was executed poorly on my part

You do bring up a good point with the whole instancing thing. I haven't tried using instancing before, but with the little I understand, this is something I should explore. I'm just worried that because my sprites are dynamic, instancing is something I can't use; but like I said, my knowledge about it is very, very limited

 

18 hours ago, noodleBowl said:

Currently I'm using QueryPerformanceCounter to do my timing, but it looks like my start time and end time in release mode come out the same, so I get 0 ms. This happens even if I use INT_MAX as the limit on my for-loop. I am not sure if this means I am doing something wrong or if it is really just that fast. How can I get microsecond / nanosecond precision?

Code I'm using (same snippet as in my previous post)

You're never using the results of your matrix multiplication, so the entire benchmarking loop will be optimized away in release mode.
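A common fix is to feed every result into a value the program observably uses, such as a running checksum that gets printed afterwards. Sketched here with a hypothetical scalar transform standing in for simdMul:

```cpp
#include <cstdio>

struct Vec3 { float x, y, z; };

// Hypothetical stand-in for simdMul: translate by (1, 2, 3)
static Vec3 translate(const Vec3& v)
{
    return Vec3{v.x + 1.0f, v.y + 2.0f, v.z + 3.0f};
}

// Accumulating every result into `checksum` -- and printing it afterwards --
// gives the loop an observable effect, so release builds can't delete it.
float benchmarkChecksum(int iterations)
{
    Vec3 v{2.0f, 5.0f, 10.0f};
    float checksum = 0.0f;
    for (int i = 0; i < iterations; ++i)
    {
        Vec3 r = translate(v);
        checksum += r.x + r.y + r.z;
    }
    std::printf("checksum: %f\n", checksum);  // observable use of the results
    return checksum;
}
```

Wrap the timing calls around just the loop, and the measured interval now brackets work the compiler is forced to keep.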

20 hours ago, noodleBowl said:

I figured it would be better performance-wise to pretransform the vertices, batch them up, and then do a single draw call per batch, which is why that code snippet above exists. But as you can see from this thread, it might not be the best idea in the world, or at least one that was executed poorly on my part

You don't have to issue one draw call per sprite/batch!!!

Create a const buffer with the matrices and index them via SV_VertexID (assuming 6 vertices per sprite):


cbuffer Matrices : register(b0)
{
	float4x4 modelMatrix[1024]; //65536 bytes max CB size per buffer / 64 bytes per matrix = 1024
	// Alternatively
	//float4x3 modelMatrix[1365]; //65536 bytes max CB size per buffer / 48 bytes per matrix = 1365.3333
};

uint idx = svVertexId / 6u;
outVertex = mul( modelMatrix[idx], inVertex );

That means you need a DrawPrimitive call every 1024 sprites (or every 1365 sprites if you use affine matrices).

You could make it a single draw call by using a texture buffer instead (which doesn't have the 64 KB limit).

This will yield much better performance. Even then, it's not ideal, because accessing a different matrix every 6 threads in a wavefront will lead to bank conflicts.

A more optimal path would be to update the vertices using a compute shader that processes all 6 vertices in the same thread, thus each thread in a wavefront will access a different bank (i.e. one thread per sprite).

20 hours ago, noodleBowl said:

I figured it would be better performance-wise to pretransform the vertices, batch them up, and then do a single draw call per batch, which is why that code snippet above exists. But as you can see from this thread, it might not be the best idea in the world, or at least one that was executed poorly on my part

You do bring up a good point with the whole instancing thing. I haven't tried using instancing before, but with the little I understand, this is something I should explore. I'm just worried that because my sprites are dynamic, instancing is something I can't use; but like I said, my knowledge about it is very, very limited
 

Instancing will not lead to good performance, as each sprite will very likely be given its own wavefront unless you're lucky (on an AMD GPU, you'd be using 9.4% of processing capacity while the rest is wasted!)

See Vertex Shader Tricks by Bill Bilodeau.

This topic is closed to new replies.
