Square Root Profiler

Started by
8 comments, last by eq 18 years, 9 months ago
While taking an Optimization course, a solution for the old Fast Square Root problem was discussed. One of the ways we were taught to optimize code, is by considering using the assembly language for our taget platform. While asm limits you to one platform, if the functions are fairly simple, and make a big impact on time, they are definitly worth considering. So I decided to test out this theory. I was amazed with the results... and they are not at all what I expected. Here is the profiler I wrote: I used Visual Studio .NET 2003 for compiling, with all of the default settings. EDIT: The assembly code is written for the intel x86 processor. Hopefully this doesn't exclude too many people from being able to compile.

//////////////////////////////////////////////////////////////////////////
// Square Root Profiler
// Author: Shane Colliatie
// Date: 7/23/05
//////////////////////////////////////////////////////////////////////////
#include <iostream>
#include <math.h>
using namespace std;

//#define UINT unsigned int
typedef unsigned int UINT;
#define ITERATIONS 10000000

//////////////////////////////////////////////////////////////////////////
// Assembly implementation of Square Root approximation
//////////////////////////////////////////////////////////////////////////
float FastSqrt (float f) {
	float RetVal;
	_asm{ mov eax, f
		sub eax, 0x3f800000
		sar eax, 1
		add eax, 0x3f800000
		mov RetVal, eax };
	return RetVal;
}

//////////////////////////////////////////////////////////////////////////
// System implementation of Square Root
//////////////////////////////////////////////////////////////////////////
float SlowSqrt (float f) {
	return sqrt(f);
}

//////////////////////////////////////////////////////////////////////////
// Quake 2 implementation of Square Root Approximation
//////////////////////////////////////////////////////////////////////////
float CarmSqrt(float x){
	union{
		int intPart;
		float floatPart;
	} convertor;
	union{
		int intPart;
		float floatPart;
	} convertor2;
	convertor.floatPart = x;
	convertor2.floatPart = x;
	convertor.intPart = 0x1FBCF800 + (convertor.intPart >> 1);
	convertor2.intPart = 0x5f3759df - (convertor2.intPart >> 1);
	return 0.5f*(convertor.floatPart + (x * convertor2.floatPart));
}

//////////////////////////////////////////////////////////////////////////
// Test function to calculate square roots
// Designed to ensure equality in all implementations
//
// Known issues:
//	Does not take initial cache misses into consideration
//////////////////////////////////////////////////////////////////////////
void testroot(float num_to_test, UINT NUM_ITERS)
{
	float answer = 0.0f;
	{
		__int64 FastStart, FastEnd, FastElapsed;
		// Get the time using assembly. VERY high precision
		_asm 
		{ 
			RDTSC
				mov DWORD PTR FastStart, eax
				mov DWORD PTR FastStart+4, edx 
		};

		UINT FastCalcs = 0;
		while( FastCalcs < NUM_ITERS)
		{
			FastSqrt(num_to_test);
			FastCalcs++;
		}
		answer = FastSqrt(num_to_test);
		_asm
		{ 
			RDTSC
				mov DWORD PTR FastEnd, eax
				mov DWORD PTR FastEnd+4, edx 
		};

		FastElapsed = FastEnd - FastStart;
		
		cout << "Fast:\t" << FastElapsed << "\t" << answer << '\n';
	}

	{
		__int64 SlowStart, SlowEnd, SlowElapsed;
		_asm 
		{ 
			RDTSC
				mov DWORD PTR SlowStart, eax
				mov DWORD PTR SlowStart+4, edx 
		};

		UINT SlowCalcs = 0;
		while( SlowCalcs < NUM_ITERS)
		{
			SlowSqrt(num_to_test);
			SlowCalcs++;
		}

		answer = SlowSqrt(num_to_test);

		_asm{ RDTSC
			mov DWORD PTR SlowEnd, eax
			mov DWORD PTR SlowEnd+4, edx 
		};

		SlowElapsed = SlowEnd - SlowStart;

		cout << "Slow:\t" << SlowElapsed << "\t" << answer << '\n';
	}

	{
		__int64 CarmStart, CarmEnd, CarmElapsed;

		_asm 
		{ 
			RDTSC
				mov DWORD PTR CarmStart, eax
				mov DWORD PTR CarmStart+4, edx 
		};

		UINT CarmCalcs = 0;
		while( CarmCalcs < NUM_ITERS)
		{
			CarmSqrt(num_to_test);
			CarmCalcs++;
		}

		answer = CarmSqrt(num_to_test);

		_asm
		{ 
			RDTSC
				mov DWORD PTR CarmEnd, eax
				mov DWORD PTR CarmEnd+4, edx 
		};

		CarmElapsed = CarmEnd - CarmStart;

		cout << "Carm:\t" << CarmElapsed << "\t" << answer << "\n\n";
	}
}

//////////////////////////////////////////////////////////////////////////
// Test function to calculate random square roots
// Designed to ensure equality in all implementations
//
// Known issues:
//	int to float typecast conversion has a small impact on clock cycles
//////////////////////////////////////////////////////////////////////////
void testrootrand(UINT NUM_ITERS)
{
	float answer = 0.0f;
	{
		__int64 FastStart, FastEnd, FastElapsed;
		_asm 
		{ 
			RDTSC
				mov DWORD PTR FastStart, eax
				mov DWORD PTR FastStart+4, edx 
		};

		UINT FastCalcs = 0;
		while( FastCalcs < NUM_ITERS)
		{
			FastSqrt((float)rand());
			FastCalcs++;
		}
		_asm
		{ 
			RDTSC
				mov DWORD PTR FastEnd, eax
				mov DWORD PTR FastEnd+4, edx 
		};

		FastElapsed = FastEnd - FastStart;

		cout << "Fast:\t" << FastElapsed << '\n';
	}

	{
		__int64 SlowStart, SlowEnd, SlowElapsed;
		_asm 
		{ 
			RDTSC
				mov DWORD PTR SlowStart, eax
				mov DWORD PTR SlowStart+4, edx 
		};

		UINT SlowCalcs = 0;
		while( SlowCalcs < NUM_ITERS)
		{
			SlowSqrt((float)rand());
			SlowCalcs++;
		}

		_asm{ RDTSC
			mov DWORD PTR SlowEnd, eax
			mov DWORD PTR SlowEnd+4, edx 
		};

		SlowElapsed = SlowEnd - SlowStart;

		cout << "Slow:\t" << SlowElapsed << '\n';
	}

	{
		__int64 CarmStart, CarmEnd, CarmElapsed;

		_asm 
		{ 
			RDTSC
				mov DWORD PTR CarmStart, eax
				mov DWORD PTR CarmStart+4, edx 
		};

		UINT CarmCalcs = 0;
		while( CarmCalcs < NUM_ITERS)
		{
			CarmSqrt((float)rand());
			CarmCalcs++;
		}

		_asm
		{ 
			RDTSC
				mov DWORD PTR CarmEnd, eax
				mov DWORD PTR CarmEnd+4, edx 
		};

		CarmElapsed = CarmEnd - CarmStart;

		cout << "Carm:\t" << CarmElapsed << "\n\n";
	}
}

int main()
{
	__int64 nTime;
	_asm
	{ 
		RDTSC
		mov DWORD PTR nTime, eax
		mov DWORD PTR nTime+4, edx 
	};

	srand((UINT)nTime);
	printf("Square Root Profiler:\t%i iterations\n\n", ITERATIONS);
	printf(" -root of 1000.0f\n\tTime\t\tResult\n");
	testroot(1000.0f, ITERATIONS);
	printf(" -root of 100.0f\n\tTime\t\tResult\n");
	testroot(100.0f, ITERATIONS);
	printf(" -root of 10.0f\n\tTime\t\tResult\n");
	testroot(10.0f, ITERATIONS);
	printf(" -root of 4.0f\n\tTime\t\tResult\n");
	testroot(4.0f, ITERATIONS);
	printf(" -random square root\n\tTime\n");
	testrootrand(ITERATIONS);

	return 0;
}


I will post my results in my reply.. [Edited by - Days on July 24, 2005 3:51:17 AM]
Advertisement
As you can see, I tested 4 floats, and a random (int typecasted to a float, just to better test, due to cache storing).

In debug mode the results were as follows:

Square Root Profiler: 10000000 iterations

-root of 1000.0f
Fast Algorithm:
Time- 1890110303 Result: 31.625
Slow Algorithm:
Time- 7065462318 Result: 31.6228
Carm Algorithm:
Time- 2752639711 Result: 30.9033

-root of 100.0f
Time- 1880284847 Result: 10.25
Slow Algorithm:
Time- 6358608847 Result: 10.0
Carm Algorithm:
Time- 2774583948 Result: 10.1902

-root of 10.0f
Time- 2026974744 Result: 3.25
Slow Algorithm:
Time- 6319140354 Result: 3.16228
Carm Algorithm:
Time- 2748562967 Result: 3.23561

-root of 4.0f
Time- 2302119787 Result: 2.0
Slow Algorithm:
Time- 6438465964 Result: 2.0
Carm Algorithm:
Time- 2764252675 Result: 1.95437

-random square root
Time- 2558229105
Slow Algorithm:
Time- 6435323820
Carm Algorithm:
Time- 3000144309


___________________________________

Now these results were on par for what I expected, but I knew that nothing is ever released in a debug build, and profiling is nearly useless in debug mode, so I switched over to release, and was floored by the results.

___________________________________

Square Root Profiler: 10000000 iterations

-root of 1000.0f
Fast Algorithm:
Time- 41967263 Result: 31.625
Slow Algorithm:
Time- 94 Result: 31.6228
Carm Algorithm:
Time- 102 Result: 30.9033

-root of 100.0f
Time- 40539866 Result: 10.25
Slow Algorithm:
Time- 102 Result: 10.0
Carm Algorithm:
Time- 102 Result: 10.1902

-root of 10.0f
Time- 40432715 Result: 3.25
Slow Algorithm:
Time- 102 Result: 3.16228
Carm Algorithm:
Time- 102 Result: 3.23561

-root of 4.0f
Time- 43093929 Result: 2.0
Slow Algorithm:
Time- 102 Result: 2.0
Carm Algorithm:
Time- 102 Result: 1.95437

-random square root
Time- 404257016
Slow Algorithm:
Time- 193824964
Carm Algorithm:
Time- 212256271


___________________________________

So from my results, in release mode, with the same square root, done 10 million times, the internal cache was amazingly effecient for the c++ functions, however my x86 asm code, which I was sure would outperform the other two methods, got completely smoked!

102 clock cycles to process 10 million square roots? I was sure that wouldn't be correct, but since the compiler unrolls loops, it knew that it only had to perform that loop once.

The random square root is a more accrurate implementation, as you should never have to call a sqrt function with the same number 10 million times, but even then, the system sqrt() function does a better job than even Carmack's square root.

I am not sure what to think about all of this, but I am glad I wrote the profiler. If anybody has any input on the matter, please reply with your results, and if you modifid any code, please let me know what you did.

Hope you enjoyed my little test.
Days
I want to draw your attention to the following segment of code:
while( FastCalcs < NUM_ITERS){    FastSqrt(num_to_test);    FastCalcs++;}

Why would the compiler leave this loop in the release build? The only variable that is modified is FastCalcs, and it is only used in the loop itself. That entire block can be cut right out without changing anything. And chances are, it was.

CM
Have you tried the SIMD versions?
Quote:Original post by Conner McCloud
Why would the compiler leave this loop in the release build? The only variable that is modified is FastCalcs, and it is only used in the loop itself. That entire block can be cut right out without changing anything. And chances are, it was.

CM


I wrote this through a trial-and-error process, and when I realised that it was cutting that loop out, I threw the random in, just to get a better sample.

I realise that it is disposing of that loop in the Slow and Carm versions, but why is the Fast version taking so long? You would think that it would cut that one out too.

Regardless, why in release, is the system sqrt() function faster than both the x86 asm and Carmacks square root while processing random numbers? No matter how many samples I do, it is always the same.

And no, I have not tried the SIMD versions of the FastRoot assembly, mainly because I am not an expert in asm, and it took me a while to get even this version correct. If you have a way to rewrite it, I would be grateful.

Days

Try taking a look at the disassembly to see what the optimizer is doing.
John BoltonLocomotive Games (THQ)Current Project: Destroy All Humans (Wii). IN STORES NOW!
Sse version (no assembly needed):
inline float SseSqrt(float f){	_mm_store_ss(&f, _mm_sqrt_ss(_mm_load_ss(&f)));	return f;}


Edit: Also consider that you can make 4 sqrt's at the same time using SSE in about the same time as one.
Incidentally, shouldn't you be using sqrtf?
the compiler doesn't optimize away the asm block because VS treats asm as a "black box" and makes no attempt to see inside or optimize anything around it. So it can't optimize away the loop because it thinks there might be side-effects going on inside. Depending on if the compiler is REALLY dumb, You might even be able to put in a asm { xor eax, eax } in the other loops to make the test more accurate (i.e. none of the versions will get context-specific optimizations, such as loop unrolling).

SSE is no problem, although I think that the SSE sqrt is actually slower than rsqrt and rcp!

You should test:

asm {
sqrtss xmm1, value
movss result, xmm1
};


against

asm {
rsqrtss xmm1, value
rcpss xmm2, xmm1
movss result, xmm2
};


it'd be interesting to see... (I'm not at a computer with a compiler, or I'd test it myself).

Also, note that GCC does use an asm syntax that allows you to communicate the side-effects that happen in your asm blocks so that the compiler can schedule and optimize around your asm. It is much more powerful-- you're free to mix and match asm and C... though of course it is slightly more complicated also. With MSVC you pretty much have to code your entire test loop in asm to get a good performance metric.

Also try eq's intrinsic method (note that you need to #include ). I'm interested to see the assmebly that it generates. MSVC is known to be really fiendishly terrible with intrinsics, so with a case as brain-dead simple as this, I'd like to know if it generates anything different from my first asm block above.

--Ajas
Quote:note that you need to #include

Oops, missed that one:
#include <xmmintrin.h>

One way to force the compiler to save your loop is this:
float sum = 0.0f;while( FastCalcs < NUM_ITERS){    sum += FastSqrt(num_to_test);    FastCalcs++;}printf("%f\n", sum); // printf is an external function and can't be optimized away, hence sum must be calculated

To get a more exact timing of the actual square root function, I'd also suggest you to subtract the time taken for an empty loop:
float sum = 0.0f;while( FastCalcs < NUM_ITERS){    sum += num_to_test;    FastCalcs++;}printf("%f\n", sum);

This topic is closed to new replies.

Advertisement