#### Archived

This topic is now archived and is closed to further replies.

# Nearest power of 2

This topic is 5600 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Woah !

There's a bug in visual : the bsr() function above works in debug build (thus not inlined) but fails in release build ! Don't know what to say ... (rdtsc() works ok in both builds with your framework)

[edited by - cnstrnd on June 13, 2004 11:21:51 AM]

##### Share on other sites
Damn it was to beautifull. I had tested many things with Visual 7.0, and found nothing really changed. Still I don''t remember all I have tried. And you put the doubt in my mind.

inline unsigned __fastcall bsr( unsigned x )
{__asm bsr eax,ecx; // bsr regOUT, regIN }

Yes, and no, this is not really a bug. Or it''s a known bug if you see what I mean. MSDN strongly warns against using __fastcall and inline asm. As you wrote it VCC (the Visual C Compiler) normally issues two warnings one for x being not used. And one because there is no return value.

You can write x; to avoid the first. And you can use the __declspec(naked) attribute to avoid the last. But then, normally, you are supposed to write the ret in asm, which prevents any inlining.

If you don''t do that, then I suppose that the optimizer just inlines the opcodes, without worrying about any syntactic meaning related to the C prototype. So ecx has any random value, inherited from any other previous register allocation. It (ecx) will not be set with x.

INLINE uint64 rdtsc() {__asm{ rdtsc }}
I think this works, if the default calling convention is stdcall. There is no input. Then it''s normal to return an int64 into eax (lo), edx(hi). You''ll just have a warning for no returned value I suppose.

But if you change the default calling convention it will certainly bug. But here it''s simple to solidify :
INLINE uint64 __stdcall rdtsc() {__asm{ rdtsc }}

##### Share on other sites
With default project configuration (warning level 3), VC doesn't raise anything. On the other hand, it does raise a C4100 warning about the unreferenced x parameter (nothing about the missing return value).

MSDN is kind of evasive on this subject. The __fastcall definition, which comes from the vc98core (VC6) doc, states : "Note: Future compiler versions may use different registers to store parameters."
So bad we can't simply write "bsr eax,x" with x being whatever the compiler sets it.

--

I don't think you need to 'solidify' the rdtsc call ... the returned value is ALWAYS (for integer types) in [EDX:]EAX.

[edited by - cnstrnd on June 13, 2004 5:10:20 PM]

##### Share on other sites
Right about rdtsc. Although I dont remember how cdecl returns integers. I hardly ever use it, never with asm. What I fear is that in some debug settings the output is supposed to be returned on the stack.

The fastest version with bsr is the __fastcall pure function despite the call ret0 sequence : 14-15 cycles. That''s so frustrating because, all Visual would really have to do is actually replace :

call _nextpow2_BSR
by the actual content (without ret) of _nextpow2_BSR

Since it''s so simple to do, there comes something to my mind. Write a simple postprocessing program that would just do that. That would be really solid and bug-proof. On the scale of a program written with my math lib this should probably boost.

I have simulated it by writing the asm as if Visual did this job. And it wins 3 cycles : 11-12 cycles.

Well but this remains conjectures. In real life, I''ll work with Visual for its convenient IDE. And I''ll release the product with gcc or the Intel CC. Well thanks to my portable libraries I use and write.

##### Share on other sites
Don''t know if it helps but here''s something you might find interesting : C--

##### Share on other sites
Just FYI for all of you bsr haters, you can write your own bsr replacement very easy, and it is generally just as fast or faster (varying greatly depending upon what kind of computer you're using). For instance, if you're using a PIII, bsr will kick the hell out of just about anything you can write (it's like 1-2 cycles or something). bsr isn't that fast on Athlons or P4s.

We ("we" being chess programming nerds) have beaten this issue to death multiple times. We use 64-bit values to represent the chessboard (sometimes) and since bsf/bsr are not so great on the newest CPUs, we invented lots of replacements. Here are a few of them that I've come across.

#include <cstdio>#include <cstdlib>#include <windows.h>const int times = 40000000;const unsigned t = 3456789;typedef unsigned (* fun)(unsigned);unsigned rand32 (){    return rand() ^ (rand() << 15) ^ (rand() << 30);}void test_func (const char * func_str, fun f){    srand(t);    unsigned x = 0;    unsigned start, stop;    unsigned total = 0;    printf("%s:  ", func_str);    start = GetTickCount();    // Check the first 'times' numbers    for (unsigned int i = 1; i <= times; i++)    {        x ^= f(i);    }    stop = GetTickCount();    printf("%-6u ", stop - start);    total += stop - start;    // Now check a bunch of random 32-bit numbers    start = GetTickCount();    for (int i = 0; i < times; i++)    {        x ^= f(rand32() | 1); // avoid zero    }    stop = GetTickCount();    printf("%-6u ", stop - start);    total += stop - start;    // Test random 24-bit numbers    start = GetTickCount();    for (int i = 0; i < times; i++)    {        x ^= f(rand32() & 0xFFFFFF | 1); // avoid zero    }    stop = GetTickCount();    printf("%-6u ", stop - start);    total += stop - start;    // Test random 16-bit numbers    start = GetTickCount();    for (int i = 0; i < times; i++)    {        x ^= f(rand32() & 0xFFFF | 1); // avoid zero    }    stop = GetTickCount();    printf("%-6u ", stop - start);    total += stop - start;    // Test random 8-bit numbers    start = GetTickCount();    for (int i = 0; i < times; i++)    {        x ^= f(rand32() & 0xFF | 1); // avoid zero    }    stop = GetTickCount();    printf("%-6u ", stop - start);    total += stop - start;    printf("%-3u   %u\n", x, total);}#define test(f) test_func(#f, f)unsigned bit_mask (int i){    return 1 << i;}void print_bits (unsigned x){    for (int i = 31; i >= 0; i--)    {        if (x & bit_mask(i))            printf("1");        else            printf("0");    }    printf("\n");}//----------------------------------------------------------------------------// 8-bit bsr lookup table, free for all functions to useunsigned bsr8bit[256] = {   -1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,    5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,    5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,    6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,    6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,    6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,    6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,};//----------------------------------------------------------------------------unsigned BSR_X (unsigned x){    return 0;}//----------------------------------------------------------------------------unsigned BSR_0 (unsigned x){    for (int i = 31; i >= 0; i--)    {        if (bit_mask(i) & x)            break;    }    return i;}//----------------------------------------------------------------------------unsigned BSR_1 (unsigned x){    _asm bsr eax, dword ptr [x];}//----------------------------------------------------------------------------unsigned BSR_2 (unsigned x){    unsigned long r, s, t, u, w;    unsigned long result;    unsigned *p;    p = bsr8bit;    t = x;    u = (unsigned short) x;    result = 0;        w = t >> 16;    if (u == t)        r = (unsigned char) t;    else {        t = w;        r = (unsigned char) w;        result += 16;    }    s = t >> 8;    if (t == r)        p += t;    else {        p += s;        result += 8;    }    return result + *p;}//----------------------------------------------------------------------------unsigned BSR_3 (unsigned n){    int i = 0;    if (n & 0xffff0000) {        i += 16;        n >>= 16;    }    if (n & 0xff00) {        i += 8;        n >>= 8;    }    return i + bsr8bit[n];}//----------------------------------------------------------------------------unsigned BSR_4 (unsigned x){    unsigned tmp;    tmp = x >> 24;    if (tmp)        return bsr8bit[tmp] + 24;    tmp = (x >> 16) & 0xff;    if (tmp)        return bsr8bit[tmp] + 16;    tmp = (x >> 8) & 0xff;    if (tmp)        return bsr8bit[tmp] + 8;    return bsr8bit[x & 0xff];}//----------------------------------------------------------------------------const unsigned magic = 0x077cb531;int table[32];void init_table (){    for (int i = 0; i < 32; i++)        table[(magic << i) >> 27] = i;}unsigned BSF_0 (unsigned x){    x &= -x;    x *= magic;    x >>= 27;    return table[x];}//----------------------------------------------------------------------------int main (){    init_table();    printf("        times  32     24     16     8      key   total\n");    test(BSR_X);    test(BSR_0);    test(BSR_1);    test(BSR_2);    test(BSR_3);    test(BSR_4);    test(BSF_0);    return 0;}

Here is the benchmark that I ran on my computer (Athlon 2400+, 2GHz). I'll explain. On the left we have the name of each function:

BSR_X - Baseline, does nothing.
BSR_0 - Dumb brute force, used to make sure our key is valid
BSR_1 - Assembly bsr
BSR_2 - I think this is a binary search
BSR_3 - Binary search (simplified) with lookup table
BSR_4 - Sequential search with lookup table
BSF_0 - Not BSR at all, BSF instead. It's just cool, so I added it.

The 'times', 32, 24, 16, 8, and 'total' columns are in milliseconds. 'times' used numbers 1 to the value of the variable 'times'. The 32, 24, 16, and 8 columns used 32-bit random numbers, 24-bit random numbers, etc. Usually with a 32-bit random number the high bit was in the high byte, and that wasn't quite so "random", so I tested the case where the high bit would lie in each of the four bytes. 'key' is a kind of hash key or checksum that makes sure the routines all produce the same results. 'total' is the total time, just the sum of the other times in that row.
        times  32     24     16     8      key   totalBSR_X:  160    1161   1222   1152   1142   0     4837BSR_0:  881    1612   2153   2804   3245   1     10695BSR_1:  330    1513   1492   1472   1412   1     6219BSR_2:  330    1452   1463   1402   1341   1     5988BSR_3:  241    1382   1482   1372   1192   1     5669BSR_4:  230    1392   1482   1452   1372   1     5928BSF_0:  391    1532   1532   1502   1432   6     6389

As you can see, BSR_1 (the assembly instruction) isn't the fastest. BSR_3 is the one I prefer. For me, this has also been the fastest in actual code. Other chess programmers say that bsr is in fact the fastest for them, but not by much. This is all very dependent upon what type of machine you're running on. The point is not that bsf/bsr suck. They are very competitive, and sometimes they are the fastest. The point is that there are reasonable alternatives if you want to avoid assembly language or maintain portability, and so on.

[edited by - Russell on June 14, 2004 1:38:42 AM]

##### Share on other sites
quote:
Original post by cnstrnd
Don''t know if it helps but here''s something you might find interesting : C--
I thought this was more interesting.

##### Share on other sites
EDIT: after reading russles post I realized that this is basicly the same as the b-search variant there, slight modification yields this instead:

Faster version using the 1kb LUT, this one clocks in at 6.2 that's
faster than the IEEE trick. Precomputing the table and storing it internally really isn't much trouble and really 1kb worth of storage is nothing =)
inline unsigned __stdcall nextpow2_DD3(unsigned x){	--x;	int shift = 0;	if( x & 0xFFFF0000)	{		x >>= 16;		shift += 16;	}	if( x & 0xFF00)	{		x >>= 8;		shift += 8;	}	return bit2[x] << shift;}

[edited by - DigitalDelusion on June 14, 2004 6:42:45 AM]

##### Share on other sites
@Russell:

Thanks, you gave the answer to this proposal I made earlier :
Probably your first one revisited with two consecutive tests and four cases. => 256 bytes or 1Kb lut.

Traducing your benchmarks into clock cycles, it''s easier to compare with what we already had with the bigger functions : nextpow2. 200 ms in your test is 10 cycles per iteration.

So I reiterate, using function pointers is not a good method for such small atomic functions. Here you mostly measure the indirect call and ret. Use inlines. If you have time try to merge my previous source code (rdtsc, the test macro etc...) and your functions. It will give much clearer differences between the choices. 11-12 against 13-14 is unclear, but without the call burden, 2-3 against 4-5 is not.

Basically 2,3,4 are the same. I think some CC might optimize 4 as a bin search. 3 is simply the cleanest code, and thus it lets the compiler do its job.

Still IEEE 754, is a standard, respected by most machines. So the IEEE trick can be made totally portable. Plus there is no limitation in inpu values. BTW the 23bits manissa issue discussed earlier can be solved easilly. All you need is a (float)int, store float, read int, shift. I does not use a lut. I am sure this competes on many machines.

Now I am sure the cache friendly, 8-bit lut, solution can be improved. I would eliminate the tests and replace it by logical arithmetic to get the bias to the 8 bit lut output :

For instance, using the sign bit :

// Just for clarity. This could be even better with
// one and replacing one shift.
bias = ( ((2UL<<16UL) - x ) >> 31UL )*16UL;

is the same as :
bias = (x >= (2UL<<16UL)) ? 16 : 0;

##### Share on other sites
// nextpow.cpp : Defines the entry point for the console application.//#include "stdafx.h"#include <math.h>#include <stdio.h>#include <windows.h>// define bits. Don't go over 22, since the original code may fail.#define BITS 21// //unsigned NextPow2( unsigned n ) {	int MantissaMask = (1<<23) - 1;	(*(float*)&n) = (float) n;	n = n + MantissaMask & ~MantissaMask;	n = (unsigned) (*(float*)&n);	return n;} int NextPow2_Critticall( int n ) {int p=0;int i=n;int limit=1;limit<<=(BITS-1);p=limit;//   ************* slightly modified Critticall's code begins     ****************while (p>8) {      p>>=1;if (i>p) {p=p+p;goto over;}}p=5;if (i>=p) {    p=8;} else {    p=2;    if (i>=p) {        if (p==i) {            goto over;        }        p=4;        goto over;    }    p=1;}over:;//   ************* slightly modified Critticall's  code ends ****************return p;}int main(int argc, char* argv[]){	int mode=1;mode<<=BITS;	int t=GetTickCount();	for (int n=0; n<100000000; n++) { 	    NextPow2(n%mode);	}	printf("time NextPow2 %d miliseconds. ",GetTickCount()-t);    system("pause");	t=GetTickCount();	for (n=0; n<100000000; n++) { 	    NextPow2_Critticall(n%mode);	}	printf("time NextPow2_Critticall %d miliseconds. ",GetTickCount()-t);    system("pause");		return 0;}

##### Share on other sites

This topic is 5600 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

This topic is now closed to further replies.

• ### Game Developer Survey

We are looking for qualified game developers to participate in a 10-minute online survey. Qualified participants will be offered a \$15 incentive for your time and insights. Click here to start!

• 16
• 13
• 9
• 11
• 15