Jump to content
  • Advertisement

Archived

This topic is now archived and is closed to further replies.

cnstrnd

Nearest power of 2

This topic is 5600 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Woah !

There's a bug in visual : the bsr() function above works in debug build (thus not inlined) but fails in release build ! Don't know what to say ... (rdtsc() works ok in both builds with your framework)

[edited by - cnstrnd on June 13, 2004 11:21:51 AM]

Share this post


Link to post
Share on other sites
Advertisement
Damn it was to beautifull. I had tested many things with Visual 7.0, and found nothing really changed. Still I don''t remember all I have tried. And you put the doubt in my mind.

inline unsigned __fastcall bsr( unsigned x )
{__asm bsr eax,ecx; // bsr regOUT, regIN }

Yes, and no, this is not really a bug. Or it''s a known bug if you see what I mean. MSDN strongly warns against using __fastcall and inline asm. As you wrote it VCC (the Visual C Compiler) normally issues two warnings one for x being not used. And one because there is no return value.

You can write x; to avoid the first. And you can use the __declspec(naked) attribute to avoid the last. But then, normally, you are supposed to write the ret in asm, which prevents any inlining.

If you don''t do that, then I suppose that the optimizer just inlines the opcodes, without worrying about any syntactic meaning related to the C prototype. So ecx has any random value, inherited from any other previous register allocation. It (ecx) will not be set with x.

INLINE uint64 rdtsc() {__asm{ rdtsc }}
I think this works, if the default calling convention is stdcall. There is no input. Then it''s normal to return an int64 into eax (lo), edx(hi). You''ll just have a warning for no returned value I suppose.

But if you change the default calling convention it will certainly bug. But here it''s simple to solidify :
INLINE uint64 __stdcall rdtsc() {__asm{ rdtsc }}

Share this post


Link to post
Share on other sites
With default project configuration (warning level 3), VC doesn't raise anything. On the other hand, it does raise a C4100 warning about the unreferenced x parameter (nothing about the missing return value).

MSDN is kind of evasive on this subject. The __fastcall definition, which comes from the vc98core (VC6) doc, states : "Note: Future compiler versions may use different registers to store parameters."
So bad we can't simply write "bsr eax,x" with x being whatever the compiler sets it.

--

I don't think you need to 'solidify' the rdtsc call ... the returned value is ALWAYS (for integer types) in [EDX:]EAX.

[edited by - cnstrnd on June 13, 2004 5:10:20 PM]

Share this post


Link to post
Share on other sites
Right about rdtsc. Although I dont remember how cdecl returns integers. I hardly ever use it, never with asm. What I fear is that in some debug settings the output is supposed to be returned on the stack.

The fastest version with bsr is the __fastcall pure function despite the call ret0 sequence : 14-15 cycles. That''s so frustrating because, all Visual would really have to do is actually replace :

call _nextpow2_BSR
by the actual content (without ret) of _nextpow2_BSR

Since it''s so simple to do, there comes something to my mind. Write a simple postprocessing program that would just do that. That would be really solid and bug-proof. On the scale of a program written with my math lib this should probably boost.

I have simulated it by writing the asm as if Visual did this job. And it wins 3 cycles : 11-12 cycles.

Well but this remains conjectures. In real life, I''ll work with Visual for its convenient IDE. And I''ll release the product with gcc or the Intel CC. Well thanks to my portable libraries I use and write.

Share this post


Link to post
Share on other sites
Just FYI for all of you bsr haters, you can write your own bsr replacement very easy, and it is generally just as fast or faster (varying greatly depending upon what kind of computer you're using). For instance, if you're using a PIII, bsr will kick the hell out of just about anything you can write (it's like 1-2 cycles or something). bsr isn't that fast on Athlons or P4s.

We ("we" being chess programming nerds) have beaten this issue to death multiple times. We use 64-bit values to represent the chessboard (sometimes) and since bsf/bsr are not so great on the newest CPUs, we invented lots of replacements. Here are a few of them that I've come across.


#include <cstdio>
#include <cstdlib>
#include <windows.h>

const int times = 40000000;
const unsigned t = 3456789;

typedef unsigned (* fun)(unsigned);

unsigned rand32 ()
{
return rand() ^ (rand() << 15) ^ (rand() << 30);
}

void test_func (const char * func_str, fun f)
{
srand(t);
unsigned x = 0;
unsigned start, stop;
unsigned total = 0;

printf("%s: ", func_str);

start = GetTickCount();
// Check the first 'times' numbers

for (unsigned int i = 1; i <= times; i++)
{
x ^= f(i);
}
stop = GetTickCount();
printf("%-6u ", stop - start);
total += stop - start;

// Now check a bunch of random 32-bit numbers

start = GetTickCount();
for (int i = 0; i < times; i++)
{
x ^= f(rand32() | 1); // avoid zero

}
stop = GetTickCount();
printf("%-6u ", stop - start);
total += stop - start;

// Test random 24-bit numbers

start = GetTickCount();
for (int i = 0; i < times; i++)
{
x ^= f(rand32() & 0xFFFFFF | 1); // avoid zero

}
stop = GetTickCount();
printf("%-6u ", stop - start);
total += stop - start;

// Test random 16-bit numbers

start = GetTickCount();
for (int i = 0; i < times; i++)
{
x ^= f(rand32() & 0xFFFF | 1); // avoid zero

}
stop = GetTickCount();
printf("%-6u ", stop - start);
total += stop - start;

// Test random 8-bit numbers

start = GetTickCount();
for (int i = 0; i < times; i++)
{
x ^= f(rand32() & 0xFF | 1); // avoid zero

}
stop = GetTickCount();
printf("%-6u ", stop - start);
total += stop - start;

printf("%-3u %u\n", x, total);
}

#define test(f) test_func(#f, f)

unsigned bit_mask (int i)
{
return 1 << i;
}

void print_bits (unsigned x)
{
for (int i = 31; i >= 0; i--)
{
if (x & bit_mask(i))
printf("1");
else
printf("0");
}
printf("\n");
}

//----------------------------------------------------------------------------


// 8-bit bsr lookup table, free for all functions to use


unsigned bsr8bit[256] = {
-1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
};

//----------------------------------------------------------------------------


unsigned BSR_X (unsigned x)
{
return 0;
}

//----------------------------------------------------------------------------


unsigned BSR_0 (unsigned x)
{
for (int i = 31; i >= 0; i--)
{
if (bit_mask(i) & x)
break;
}
return i;
}

//----------------------------------------------------------------------------


unsigned BSR_1 (unsigned x)
{
_asm bsr eax, dword ptr [x];
}

//----------------------------------------------------------------------------


unsigned BSR_2 (unsigned x)
{
unsigned long r, s, t, u, w;
unsigned long result;
unsigned *p;

p = bsr8bit;
t = x;
u = (unsigned short) x;
result = 0;

w = t >> 16;
if (u == t)
r = (unsigned char) t;
else {
t = w;
r = (unsigned char) w;
result += 16;
}
s = t >> 8;
if (t == r)
p += t;
else {
p += s;
result += 8;
}
return result + *p;
}

//----------------------------------------------------------------------------


unsigned BSR_3 (unsigned n)
{
int i = 0;

if (n & 0xffff0000) {
i += 16;
n >>= 16;
}

if (n & 0xff00) {
i += 8;
n >>= 8;
}

return i + bsr8bit[n];
}

//----------------------------------------------------------------------------


unsigned BSR_4 (unsigned x)
{
unsigned tmp;
tmp = x >> 24;
if (tmp)
return bsr8bit[tmp] + 24;
tmp = (x >> 16) & 0xff;
if (tmp)
return bsr8bit[tmp] + 16;
tmp = (x >> 8) & 0xff;
if (tmp)
return bsr8bit[tmp] + 8;
return bsr8bit[x & 0xff];
}

//----------------------------------------------------------------------------


const unsigned magic = 0x077cb531;

int table[32];

void init_table ()
{
for (int i = 0; i < 32; i++)
table[(magic << i) >> 27] = i;
}

unsigned BSF_0 (unsigned x)
{
x &= -x;
x *= magic;
x >>= 27;
return table[x];
}

//----------------------------------------------------------------------------


int main ()
{
init_table();

printf(" times 32 24 16 8 key total\n");

test(BSR_X);
test(BSR_0);
test(BSR_1);
test(BSR_2);
test(BSR_3);
test(BSR_4);
test(BSF_0);

return 0;
}


Here is the benchmark that I ran on my computer (Athlon 2400+, 2GHz). I'll explain. On the left we have the name of each function:

BSR_X - Baseline, does nothing.
BSR_0 - Dumb brute force, used to make sure our key is valid
BSR_1 - Assembly bsr
BSR_2 - I think this is a binary search
BSR_3 - Binary search (simplified) with lookup table
BSR_4 - Sequential search with lookup table
BSF_0 - Not BSR at all, BSF instead. It's just cool, so I added it.

The 'times', 32, 24, 16, 8, and 'total' columns are in milliseconds. 'times' used numbers 1 to the value of the variable 'times'. The 32, 24, 16, and 8 columns used 32-bit random numbers, 24-bit random numbers, etc. Usually with a 32-bit random number the high bit was in the high byte, and that wasn't quite so "random", so I tested the case where the high bit would lie in each of the four bytes. 'key' is a kind of hash key or checksum that makes sure the routines all produce the same results. 'total' is the total time, just the sum of the other times in that row.

times 32 24 16 8 key total
BSR_X: 160 1161 1222 1152 1142 0 4837
BSR_0: 881 1612 2153 2804 3245 1 10695
BSR_1: 330 1513 1492 1472 1412 1 6219
BSR_2: 330 1452 1463 1402 1341 1 5988
BSR_3: 241 1382 1482 1372 1192 1 5669
BSR_4: 230 1392 1482 1452 1372 1 5928
BSF_0: 391 1532 1532 1502 1432 6 6389

As you can see, BSR_1 (the assembly instruction) isn't the fastest. BSR_3 is the one I prefer. For me, this has also been the fastest in actual code. Other chess programmers say that bsr is in fact the fastest for them, but not by much. This is all very dependent upon what type of machine you're running on. The point is not that bsf/bsr suck. They are very competitive, and sometimes they are the fastest. The point is that there are reasonable alternatives if you want to avoid assembly language or maintain portability, and so on.

[edited by - Russell on June 14, 2004 1:38:42 AM]

Share this post


Link to post
Share on other sites
EDIT: after reading russles post I realized that this is basicly the same as the b-search variant there, slight modification yields this instead:

Faster version using the 1kb LUT, this one clocks in at 6.2 that's
faster than the IEEE trick. Precomputing the table and storing it internally really isn't much trouble and really 1kb worth of storage is nothing =)

inline unsigned __stdcall nextpow2_DD3(unsigned x)
{
--x;
int shift = 0;
if( x & 0xFFFF0000)
{
x >>= 16;
shift += 16;
}
if( x & 0xFF00)
{
x >>= 8;
shift += 8;
}
return bit2[x] << shift;
}


[edited by - DigitalDelusion on June 14, 2004 6:42:45 AM]

Share this post


Link to post
Share on other sites
@Russell:

Thanks, you gave the answer to this proposal I made earlier :
Probably your first one revisited with two consecutive tests and four cases. => 256 bytes or 1Kb lut.

Traducing your benchmarks into clock cycles, it''s easier to compare with what we already had with the bigger functions : nextpow2. 200 ms in your test is 10 cycles per iteration.

So I reiterate, using function pointers is not a good method for such small atomic functions. Here you mostly measure the indirect call and ret. Use inlines. If you have time try to merge my previous source code (rdtsc, the test macro etc...) and your functions. It will give much clearer differences between the choices. 11-12 against 13-14 is unclear, but without the call burden, 2-3 against 4-5 is not.

Basically 2,3,4 are the same. I think some CC might optimize 4 as a bin search. 3 is simply the cleanest code, and thus it lets the compiler do its job.

Still IEEE 754, is a standard, respected by most machines. So the IEEE trick can be made totally portable. Plus there is no limitation in inpu values. BTW the 23bits manissa issue discussed earlier can be solved easilly. All you need is a (float)int, store float, read int, shift. I does not use a lut. I am sure this competes on many machines.


Now I am sure the cache friendly, 8-bit lut, solution can be improved. I would eliminate the tests and replace it by logical arithmetic to get the bias to the 8 bit lut output :

For instance, using the sign bit :

// Just for clarity. This could be even better with
// one and replacing one shift.
bias = ( ((2UL<<16UL) - x ) >> 31UL )*16UL;

is the same as :
bias = (x >= (2UL<<16UL)) ? 16 : 0;

Share this post


Link to post
Share on other sites


// nextpow.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

#include <math.h>
#include <stdio.h>
#include <windows.h>

// define bits. Don't go over 22, since the original code may fail.
#define BITS 21
//
//

unsigned NextPow2( unsigned n ) {
int MantissaMask = (1<<23) - 1;
(*(float*)&n) = (float) n;
n = n + MantissaMask & ~MantissaMask;
n = (unsigned) (*(float*)&n);
return n;
}



int NextPow2_Critticall( int n ) {

int p=0;
int i=n;
int limit=1;limit<<=(BITS-1);
p=limit;
// ************* slightly modified Critticall's code begins ****************
while (p>8) {
p>>=1;if (i>p) {p=p+p;goto over;}
}
p=5;
if (i>=p) {
p=8;
} else {
p=2;
if (i>=p) {
if (p==i) {
goto over;
}
p=4;
goto over;
}
p=1;
}
over:;

// ************* slightly modified Critticall's code ends ****************


return p;

}

int main(int argc, char* argv[])
{

int mode=1;mode<<=BITS;
int t=GetTickCount();
for (int n=0; n<100000000; n++) {
NextPow2(n%mode);
}

printf("time NextPow2 %d miliseconds. ",GetTickCount()-t);
system("pause");

t=GetTickCount();
for (n=0; n<100000000; n++) {
NextPow2_Critticall(n%mode);
}

printf("time NextPow2_Critticall %d miliseconds. ",GetTickCount()-t);
system("pause");

return 0;
}

Share this post


Link to post
Share on other sites

This topic is 5600 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Guest
This topic is now closed to further replies.

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!