Surely this optimisation can't be right...

Started by
5 comments, last by MENTAL 21 years, 8 months ago
okay, according to my test application, using "Number |= (uint32) pow (2, Pos);" to turn on a bit is faster than using "Number |= 1 << Pos". The test does 1000000000 calculations, and every time i've run it the pow () method comes out tops. I was wondering if you could check out the source code for errors (knowing me there's something stupidly wrong with it).
    

// Bit Test.cpp : Defines the entry point for the console application.


#define WIN32_LEAN_AND_MEAN

#include <stdio.h>
#include <conio.h>
#include <windows.h>
#include <math.h>

typedef unsigned __int32 uint32;

const __int64 NUM_MAX = 1000000000;

int main(int argc, char* argv[])
{
	__int64 i;
	uint32 Pos;
	uint32 Number = 0;
	LARGE_INTEGER Time1, Time2, TimeT;

	Pos = 3;

	printf ("==================== Bit Test ====================\n\n");

	printf ("This is a test to compare two different ways of\n");
	printf ("turning on and off bits. %I64i operations will be\n", NUM_MAX);
	printf ("performed for each test.\n\n");

	printf ("Press any key to continue...");

	getch ();

	printf ("\n\nPerforming test for \"1 << BitPos\"... ");

	QueryPerformanceCounter (&TimeT);

	for (i = 0; i < NUM_MAX; i++)
	{
		Number |= 1 << Pos;
	}

	QueryPerformanceCounter (&Time1);

	Time1.QuadPart -= TimeT.QuadPart;

	printf ("completed.\n\n");

	printf ("Performing test for \"(uint32) pow (2, Pos)\"... ");

	Number = 0;

	QueryPerformanceCounter (&TimeT);

	for (i = 0; i < NUM_MAX; i++)
	{
		Number |= (uint32) pow (2, Pos);
	}

	QueryPerformanceCounter (&Time2);

	Time2.QuadPart -= TimeT.QuadPart;

	printf ("completed.\n\n");

	printf ("Time for method 1: %I64i\n", Time1.QuadPart);
	printf ("Time for method 2: %I64i\n", Time2.QuadPart);
	
	if (Time1.QuadPart < Time2.QuadPart)
		printf ("Method 1 is the quicker.\n\n");

	if (Time1.QuadPart > Time2.QuadPart)
		printf ("Method 2 is the quicker.\n\n");

	if (Time1.QuadPart == Time2.QuadPart)
		printf ("Both methods took exactly the same amount of time.\n\n");

	getch ();

	return 0;
}

    
Download the program here. the output is raw CPU cycles. ==================== Bit Test ==================== This is a test to compare two different ways of turning on and off bits. 1000000000 operations will be performed for each test. Press any key to continue... Performing test for "1 << BitPos"... completed. Performing test for "(uint32) pow (2, Pos)"... completed. Time for method 1: 20205700 Time for method 2: 19945321 Method 2 is the quicker. EDIT: corrected url [edited by - MENTAL on August 11, 2002 7:48:45 AM]
Advertisement
That''s a hell of a lot of loop overhead, so I''d think the majority of your time for both tests is spent updating the loop (which explains the similar times). Try to unroll it bigtime, say 32x. That will focus on the operation(s) inside the loop, not the loop update itself.
And for what it''s worth, these are the results I get:

Time for method 1: 4218462Time for method 2: 5367457Method 1 is the quicker. 


I tested several times and method 1 always comes out on top by 10-15%
Another factor that bloats the loop overhead is that you use __int64 as the counter. Correct me if I''m wrong, but this is not a supported data type by the CPU, so it''s emulated with code (emulate = slow). Every time you perform an operation on it, VC probably calls a library function (major slowdown). Even if it inlines it, the slowdown is still probably substantial. The point is, your counter only goes up to a billion, and this fits inside a 32-bit counter, so use that instead.

Also be careful that the compiler is not doing any optimizations that mess up the results... one obvious one is "Pos" which is always 3, so it probably doesn''t bother checking from memory and uses immediate operands. I think the keyword volatile forces it to do the memory read every time.

To sum up, try using a 32-bit counter, unrolling the loop, and disabling optimizations.
I took the liberty of doing the changes myself (and lowering NUM_MAX which was a ridiculous number. no way your test program actually used this btw, the test would take minutes/hours to run)

without unrolling:

method 1 --> ~0.00631 seconds
method 2 --> ~0.25993 seconds
ratio = 41.23

with unrolling:
method 1 --> ~0.0053344 seconds
method 2 --> ~0.2576543 seconds
ratio = 48.3

I think that's more like what you expected?

[edited by - foofightr on August 11, 2002 9:04:38 AM]
Hi

hehe this is cool..i was astonished at first too that method 2 is faster...but then i noticed that VC++'s speed optimization kicks in there fully

if you let the compiler generate a assembler listing you will see that in method 1 it does a "or esi, 8"
in method 2, it just calls pow and ftol once before entering the loop and stores the result in eax, ( since (uint32)pow(2, pos) is constant since pos stays at 3), and then in the loops just does a "or esi, eax", which seems to be slightly faster then the "or esi, 8"

if i run the thing in debug mode withoud speed optimization method 2 takes a multiple of the time method 1 uses

edit: sorry foofightr, didn't see that you already mentioned that it's a compiler optimization

------------
Runicsoft

[edited by - Burning_Ice on August 11, 2002 6:05:57 PM]
Time for method 1: 6570184
Time for method 2: 8334828
Method 1 is the quicker.

This topic is closed to new replies.

Advertisement