Nearest power of 2

Started by
134 comments, last by SiCrane 19 years, 10 months ago
"Turn off the compiler optimizations during the algorithm benchmarking!"
You should realize that you are the only one here. Benchmarks in debug mode, LOL !!! With exception handling on ? With PDB edit and continue ? What else ? With functions like ftol() called instead of being inlined !!! That's a non sense. Quake 3 has not been distributed and compiled in debug configuration !!!

I told you to use my benchmark system.

- Care has been taken so that the computations are all effectively done. No dead code removal is possible with the k+=func(input) …. printf("%d", k); stuff.<br><br>"and do use the previous results as a kind of unfair gains"<br>- With random consecutive numbers ? I'd like to see how !<br>- No prediction is possible from the inputs.<br><br>- Cache is a non issue, since every version benefits of L1. It's pointless to include the motherboard and buses in the measurements. This heavilly depends &#111;n each particular machine. Anyway any speed is always given as optimal, because a good coder manages the cache !<br><br>"Those gains are not handy, when we use the algorithm for real."<br><br>A good code runs slow as hell in debug with the max assertions possible and at full speed in release. For real I never take any debug speed into account. Debug configuration does not schedule instructions. Release options attempt to do it at max. A function is also meaningful by how it can be parallelize in broader, real life, contexts. Thus my X4 unrolling to simulate it somehow.<br><br>EDIT : Last and not the least. I always take a look at the asm output to see if anything unfair is being done by the CC. Dead code removal or insane optimization shortcuts are instantly detected this way. In the other direction things like calls to ftol that might slow down things for stupid reasons also get easilly culled.
"Coding math tricks in asm is more fun than Java"
Advertisement
I am sorry folks! It is a trivial error in the Critticall's code.

Remove the first line, please!

if (n>1048576) return 2097152;

It is obsolete. Or set mode to 2097152, not to 1048576, what is now.

Then run it again.

This for start!
Now, please run this.

#include <math.h>#include <stdio.h>#include <windows.h>#include <conio.h>// define bits. Don't go over 22, since the original code may fail.#define BITS 20unsigned NextPow2_Critticall( int n ) {/*if (n>1048576) return 2097152;if (n>524288) return 1048576;if (n>262144) return 524288;if (n>131072) return 262144;if (n>65536) return 131072;if (n>32768) return 65536;if (n>16384) return 32768;if (n>8192) return 16384;if (n>4096) return 8192;if (n>2048) return 4096;if (n>1024) return 2048;if (n>512) return 1024;if (n>256) return 512;if (n>128) return 256;if (n>64) return 128;if (n>32) return 64;if (n>16) return 32;if (n>8) return 16;if (n>4) return 8;if (n>2) return 4;if (n>1) return 2;if (n>=0) return 1;*/int actions=0;actions++;if (n>1048576) return actions+1;actions++;if (n>524288) return actions+1;actions++;if (n>262144) return actions+1;actions++;if (n>131072) return actions+1;actions++;if (n>65536) return actions+1;actions++;if (n>32768) return actions+1;actions++;if (n>16384) return actions+1;actions++;if (n>8192) return actions+1;actions++;if (n>4096) return actions+1;actions++;if (n>2048) return actions+1;actions++;if (n>1024) return actions+1;actions++;if (n>512) return actions+1;actions++;if (n>256) return actions+1;actions++;if (n>128) return actions+1;actions++;if (n>64) return actions+1;actions++;if (n>32) return actions+1;actions++;if (n>16) return actions+1;actions++;if (n>8) return actions+1;actions++;if (n>4) return actions+1;actions++;if (n>2) return actions+1;actions++;if (n>1) return actions+1;actions++;if (n>=0) return actions+1;return actions+1;}static const unsigned bit[0x100] = {	1, 2, 4, 4, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16,		32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,		64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,		64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,		128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,		128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,		128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,		128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,		256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,		256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,		256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,		256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,		256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,		256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,		256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,		256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256 };unsigned __stdcall nextpow2_DD3(unsigned x){/*	--x;	int shift = 0;	if( x & 0xFFFF0000)	{		x >>= 16;		shift = 16;	}	if( x & 0xFF00)	{		x >>= 8;		shift += 8;	}	return bit[x] << shift;*/int actions=0;	--x;actions++;	int shift = 0;actions++;	actions++;actions++;	if( x & 0xFFFF0000)	{		actions++;		x >>= 16;		actions++;		shift = 16;	}	actions++;actions++;	if( x & 0xFF00)	{		actions++;		x >>= 8;		actions++;		shift += 8;	}	actions++;actions++;	//return bit[x] << shift;	return actions+1;}int main(int argc, char* argv[]){	int mode=1;mode<<=BITS+1;	int t=GetTickCount();	LARGE_INTEGER c1;	LARGE_INTEGER c2;	int k;int n;	k = 0;	t=GetTickCount();	QueryPerformanceCounter(&c1);	for (n=0; n<100000000; n++) {		k += nextpow2_DD3(n%mode);	}	QueryPerformanceCounter(&c2);	printf("\n DD3       : %d useconds. %d=Actions taken." ,GetTickCount()-t, k);	printf(" %d precise c. ", c2.QuadPart-c1.QuadPart);	k = 0;	t=GetTickCount();	QueryPerformanceCounter(&c1);	for (n=0; n<100000000; n++) {		k += NextPow2_Critticall(n%mode);	}	QueryPerformanceCounter(&c2);	printf("\n Critticall: %d useconds.  %d=Actions taken.",GetTickCount()-t, k);	printf(" %d precise c. ", c2.QuadPart-c1.QuadPart);	printf("\n");	printf("\n");	system("pause");	return 0;}


Note, that mode is 2097152 now. So no need for removing the first line.

The original lines are disabled and the actions counting code is active.
I really don't understand what purpose that was supposed to have.
Sure using your action counting benchmark you showed that you did fewer operations but still the LUT runs quite much faster, the only thing you've accomplished is to show that your code simply isn't going to cut it on modern hardware that utilizes branchprediction and superscalar execution.

One rule of thumb that was quite widely known in optimization circles when the first pentium was released (and that still holds) was that you often could double performance by doubleling the number of instructions simply by breaking apart complex instructions to simpler ones that could execute in parallel. Another very important factor for writing highly efficent code on todays hardware is branch removal, branching often cost to much in terms of stalls.

You still haven't beaten the speed, you just keep yelling at us that your approach is the most efficent and optimal when clearly the only way you get the benchmarks to say that is when using a debug build could you please just for a moment try to understand that unless you can in a real enviorment (that is spelled release build) show that your solution somehow is faster using a reproducable benchmark and a good framework like the one Charles has provided, noone will consider your approach faster since it's clearly is twice as slow as the competition from the LUT, and also gets it ass whooped from the Logical version, both wich work up to 32bits.
HardDrop - hard link shell extension."Tread softly because you tread on my dreams" - Yeats
Not all compilers will execute your code faster. As you acknowledge, there is much more to do with your code. Several times more.

It is well suited for sequential benchmarking on certain compilers, but this is vague.

Your code uses two ifs and many other commands. It is optimal for the serial bench marking, though!

What doesn't mean, that is the best code in real life. It isn't.

Testing algorithms is more, than the naive approach you are advocating.
Take the bets. Under GCC, ICC Borland, whatever modern CC, the LUT will always be faster in release. I know Intel asm far enough to know that any compiler will produce the quasi-same code in release, in such small functions, because compilation here is trivial. Clock counted won't change. I also know the G4, G5 and I also bet with 100% certainty that the hierarchy remains the same and the orders of magnitude too. All modern hardwares are equivalent considering the simple and elementary instructions involved in the lut or bin search solutions. Only tricky operations like bsr, can greatly change.

LUT (6 cycles)is roughly twice faster than any bin search (11 cycles). And I benched your function by starting the first test at 1<<16, the RAND_MAX of Visual, thus favoring it at maximum (avoiding 16 tests for nothing !!!). Else, of course it's totally loosing (30-40 cycles).


BTW I could win one half cycle by slightly changing DD3. But under one cycle it's too much dependent on specific Intel compatible machines. It worked best on my Athlon with something like :

if(x > (1UL<<16UL)){
if(x > (1UL<<24UL)){ x>>24UL; ... }
}
else
{
if(x > (1UL<<8UL)){ x>>8UL; ... }
...
}

What doesn't mean, that is the best code in real life. It isn't.
The 8 bit LUT version is not only efficient in array processing. I could detail that once it's the the code cache (*) it will work at least at fast as in the benchmark, in 'real life'. It requries a certain knowledge of superscalar processors and branch predictions to see it though. But I let you think about it, just in case, it's rather trivial.

(*) It's necessarilly the case if a coder spends (wastes) time optimizing it. It means it's called quite a few billion times in his program.

Last if someone who started asm at the age of 12, 22 years ago is naive in such discussions, then you won't find many people able to follow you.



"Coding math tricks in asm is more fun than Java"
You may just broad your version to 64 bits interval and you will see how it breaks.

While Critticall's will still need 3 operations per number on average.

Mathematically solid algorithm always win on a longer run.
In the case of 1024 bit numbers, Critticall's (binary) algorithm would still need only two (2) operations for every second number. Three (3) on average!
Just for fun... crank up the BITS to 30 and make the needed changes to the Criticall routine, the LUT then wins by a factor of 3 using your own benchmark. The 64 and 1024 bits examples are just getting out of hand, you're clutching after straws and really its getting embarassing.

Couldn't you please try to listen to what I and Charles B are trying to tell you? And really when you need performance you need it on a particular platform or a range of platforms you don't need it on some imaginary 1024bit computer. I got nowhere near the experience that Charles has, but nontheless I do consider myself quite skilled when it comes to performance optimization and tuning, and one of the most important skills you can have when doing those things are the ability to try new solutions and see what works in the real world becuase that's where you need the performance you simply can't have dead slow software and a note in the manual saying "when you get a 1024bit Pentium9 this algorithm will be the best so stop whining about performance", really end users don't care if you got the mathimaticly best method they want the one that runs the best on their current system and clearly that's not the criticall code.
HardDrop - hard link shell extension."Tread softly because you tread on my dreams" - Yeats
The benchmarking of a loop, is not equal to the loop of benchmarks.

You have to switch off some default compiler optimizations, to get the real result.

This way, your compiler optimizes the generation of the number k and doesn't gives you a realistic picture.

DD3 algorithm is not preserved under the settings you have. If it was, it would mean more than 6 billion C++ operations per second, what is not feasible on today personal computers.

Bottom line.

Your compiler does some optimization on the whole program - on the sequence of calls of DD3 (or Critticall's, binary) algorithm. The measurement is flawed under these conditions.

In other words:

The generation of the variable k, had it _realy_ been optimized would take only several clock cycles for the fixed number of n - 100000000. Had this was the only variable, and not fixed to 10^8, the program would be a function from upper bound to k. A quite simple function again.












This topic is closed to new replies.

Advertisement