Fast integer log2?

Started by
19 comments, last by Charles B 19 years, 9 months ago
I'm not sure that it is really necessary. Step through it with a debugger and see what BSR leaves in EAX when it's 0. If it's still zero after executing the instruction then you don't need the test. I know BSR will clear ZF if the src is 0, but I'm not sure what it will leave in the dest.

EDIT: Did I miss something? What prolog code?

EDIT: Ahh, as in the stack setup, not the language prolog :)
It allocates room on the stack for any local variables.
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V
Advertisement
For the prolog-epilog code see this article on MSDN

Test eax,0 sets the zero flag if eax is equal to 0, allowing you to jz (jump-if-zero)
Quote:Original post by joanusdmentia
I'm not sure that it is really necessary.


It probably isn't, but when the spec says you get undefined results it's sometimes better to err on the side of caution.
just a quick note: replace test eax,0 with test eax,eax - it runs quicker on some cpus
If you use it in a SIMD (3DNow or SSE) context you might even get it faster by loading a float and getting its exponent.

For instance (2X int log2(float)) if a mm register contains a positive float it's nearly free. Only a shift is required.

Something similar has been discussed quite heavilly and recently here

(No need to read the end it turned to be silly. :)
"Coding math tricks in asm is more fun than Java"
Yeah, FPU looks to be the best on Athlons. BSR is actually quite slow there - it costs 10 clocks + 2 lost decode slots. Then again, shifts are bad on P4s.

This is the kind of thing where there's a good bit of wiggle room for asm optimization depending on platform, but the problem is, it's not worth selecting one approach at runtime (lest that overhead outweigh the gains). Instead, you'd need to inline the log2 into the higher level algorithm, and swap that out.
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Charles B beat me to it, but look at the eariler thread, we discuessed the issue to death more or less and some very fast routines came out of it.
HardDrop - hard link shell extension."Tread softly because you tread on my dreams" - Yeats
Here's a few different versions for you: two plain integer ones, one and one fp/int trick.

Btw did you know there is a FPU assembler instruction to calculate y*log2(x)? It's called FYL2X
it works like this:
fld y
fld x
fyl2x
fstp [dest]

Ok, now on to the functions

// basic iterative version
unsigned int log2(unsigned int x)
{
int count = 0;

while(x>>=1)
count++;

return count;
}

// folding + 1 counting trick
unsigned int log2(unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x >>= 1;
x -= (x >> 1) & 0x55555555;
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return x & 63;
}

// floating point trick
//This is accurate up to 2^25. After that, it gets these ones off by 1:
//2^25-1
//2^26-2 to 2^26-1
//2^27-4 to 2^27-1
//2^28-8 to 2^28-1
//2^29-16 to 2^29-1
//2^30-32 to 2^30-1
//2^31-64 to 2^31-1
//2^32-128 to 2^32-1

unsigned int log2_integer_approximate(unsigned int n){
*((float*)&n) = (float)n;
return ((n & (~((1<<23) - 1))) >> 23) - 127;
}

I figure you could make a really fast one with sse2 that calculates 4 of these at once. Unfortunately my processor doesn't have sse2 so if someone could test this for me that would be cool. Here is the asm pseudocode

__declspec(align(16)) int mask[4] = {0xff800000, 0xff800000, 0xff800000, 0xff800000};
__declspec(align(16)) int shifts[4] = {23, 23, 23, 23};
__declspec(align(16)) int subtracts[4] = {127, 127, 127, 127};

__asm{
movdqa xmm1, mask ; setup constants
movdqa xmm2, shifts
movdqa xmm3, subtracts
loop_label:
cvtdq2ps xmm0, [source + loop * 16] ; get numbers
andpd xmm0, xmm1 ; compute log2 approximation
psrld xmm0, xmm2
psubd xmm0, xmm3
movdqa [dest + loop * 16], xmm0 ; store result

if(termination-condition-not-met)
goto loop_label;
}
K then your SSE example is based on what I described earlier. Still why a and ? If you don't cope with negative inputs (the real domain of log2 is R*+), logical shifts will be enough.
"Coding math tricks in asm is more fun than Java"
Yes, you're right of course, the AND is not nessesary for unsigned numbers. I'm not quite sure why I left it in there :)

So the final version can do 4 integer log2 approximations in 4 sse2 instructions. Wow, that is massively fast, but I wonder, does this have a practical use? Does anyone need to calculate millions of log2's?

This topic is closed to new replies.

Advertisement