I'm not sure that it is really necessary. Step through it with a debugger and see what BSR leaves in EAX when it's 0. If it's still zero after executing the instruction then you don't need the test. I know BSR will clear ZF if the src is 0, but I'm not sure what it will leave in the dest.
EDIT: Did I miss something? What prolog code?
EDIT: Ahh, as in the stack setup, not the language prolog :)
It allocates room on the stack for any local variables.
Fast integer log2?
For the prolog-epilog code see this article on MSDN
Test eax,0 sets the zero flag if eax is equal to 0, allowing you to jz (jump-if-zero)
Test eax,0 sets the zero flag if eax is equal to 0, allowing you to jz (jump-if-zero)
Quote:Original post by joanusdmentia
I'm not sure that it is really necessary.
It probably isn't, but when the spec says you get undefined results it's sometimes better to err on the side of caution.
If you use it in a SIMD (3DNow or SSE) context you might even get it faster by loading a float and getting its exponent.
For instance (2X int log2(float)) if a mm register contains a positive float it's nearly free. Only a shift is required.
Something similar has been discussed quite heavilly and recently here
(No need to read the end it turned to be silly. :)
For instance (2X int log2(float)) if a mm register contains a positive float it's nearly free. Only a shift is required.
Something similar has been discussed quite heavilly and recently here
(No need to read the end it turned to be silly. :)
Yeah, FPU looks to be the best on Athlons. BSR is actually quite slow there - it costs 10 clocks + 2 lost decode slots. Then again, shifts are bad on P4s.
This is the kind of thing where there's a good bit of wiggle room for asm optimization depending on platform, but the problem is, it's not worth selecting one approach at runtime (lest that overhead outweigh the gains). Instead, you'd need to inline the log2 into the higher level algorithm, and swap that out.
This is the kind of thing where there's a good bit of wiggle room for asm optimization depending on platform, but the problem is, it's not worth selecting one approach at runtime (lest that overhead outweigh the gains). Instead, you'd need to inline the log2 into the higher level algorithm, and swap that out.
Charles B beat me to it, but look at the eariler thread, we discuessed the issue to death more or less and some very fast routines came out of it.
Here's a few different versions for you: two plain integer ones, one and one fp/int trick.
Btw did you know there is a FPU assembler instruction to calculate y*log2(x)? It's called FYL2X
it works like this:
fld y
fld x
fyl2x
fstp [dest]
Ok, now on to the functions
// basic iterative version
unsigned int log2(unsigned int x)
{
int count = 0;
while(x>>=1)
count++;
return count;
}
// folding + 1 counting trick
unsigned int log2(unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x >>= 1;
x -= (x >> 1) & 0x55555555;
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return x & 63;
}
// floating point trick
//This is accurate up to 2^25. After that, it gets these ones off by 1:
//2^25-1
//2^26-2 to 2^26-1
//2^27-4 to 2^27-1
//2^28-8 to 2^28-1
//2^29-16 to 2^29-1
//2^30-32 to 2^30-1
//2^31-64 to 2^31-1
//2^32-128 to 2^32-1
unsigned int log2_integer_approximate(unsigned int n){
*((float*)&n) = (float)n;
return ((n & (~((1<<23) - 1))) >> 23) - 127;
}
I figure you could make a really fast one with sse2 that calculates 4 of these at once. Unfortunately my processor doesn't have sse2 so if someone could test this for me that would be cool. Here is the asm pseudocode
__declspec(align(16)) int mask[4] = {0xff800000, 0xff800000, 0xff800000, 0xff800000};
__declspec(align(16)) int shifts[4] = {23, 23, 23, 23};
__declspec(align(16)) int subtracts[4] = {127, 127, 127, 127};
__asm{
movdqa xmm1, mask ; setup constants
movdqa xmm2, shifts
movdqa xmm3, subtracts
loop_label:
cvtdq2ps xmm0, [source + loop * 16] ; get numbers
andpd xmm0, xmm1 ; compute log2 approximation
psrld xmm0, xmm2
psubd xmm0, xmm3
movdqa [dest + loop * 16], xmm0 ; store result
if(termination-condition-not-met)
goto loop_label;
}
Btw did you know there is a FPU assembler instruction to calculate y*log2(x)? It's called FYL2X
it works like this:
fld y
fld x
fyl2x
fstp [dest]
Ok, now on to the functions
// basic iterative version
unsigned int log2(unsigned int x)
{
int count = 0;
while(x>>=1)
count++;
return count;
}
// folding + 1 counting trick
unsigned int log2(unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x >>= 1;
x -= (x >> 1) & 0x55555555;
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return x & 63;
}
// floating point trick
//This is accurate up to 2^25. After that, it gets these ones off by 1:
//2^25-1
//2^26-2 to 2^26-1
//2^27-4 to 2^27-1
//2^28-8 to 2^28-1
//2^29-16 to 2^29-1
//2^30-32 to 2^30-1
//2^31-64 to 2^31-1
//2^32-128 to 2^32-1
unsigned int log2_integer_approximate(unsigned int n){
*((float*)&n) = (float)n;
return ((n & (~((1<<23) - 1))) >> 23) - 127;
}
I figure you could make a really fast one with sse2 that calculates 4 of these at once. Unfortunately my processor doesn't have sse2 so if someone could test this for me that would be cool. Here is the asm pseudocode
__declspec(align(16)) int mask[4] = {0xff800000, 0xff800000, 0xff800000, 0xff800000};
__declspec(align(16)) int shifts[4] = {23, 23, 23, 23};
__declspec(align(16)) int subtracts[4] = {127, 127, 127, 127};
__asm{
movdqa xmm1, mask ; setup constants
movdqa xmm2, shifts
movdqa xmm3, subtracts
loop_label:
cvtdq2ps xmm0, [source + loop * 16] ; get numbers
andpd xmm0, xmm1 ; compute log2 approximation
psrld xmm0, xmm2
psubd xmm0, xmm3
movdqa [dest + loop * 16], xmm0 ; store result
if(termination-condition-not-met)
goto loop_label;
}
K then your SSE example is based on what I described earlier. Still why a and ? If you don't cope with negative inputs (the real domain of log2 is R*+), logical shifts will be enough.
Yes, you're right of course, the AND is not nessesary for unsigned numbers. I'm not quite sure why I left it in there :)
So the final version can do 4 integer log2 approximations in 4 sse2 instructions. Wow, that is massively fast, but I wonder, does this have a practical use? Does anyone need to calculate millions of log2's?
So the final version can do 4 integer log2 approximations in 4 sse2 instructions. Wow, that is massively fast, but I wonder, does this have a practical use? Does anyone need to calculate millions of log2's?
This topic is closed to new replies.
Advertisement
Popular Topics
Advertisement