Sign in to follow this  
Promit

Fast integer log2?

Recommended Posts

I'm looking for a really fast integer log2. It can assume that the input is always a power of 2, or it can simply round off non-pow2, or whatever. This is partly useful for me and mostly just curiosity to see what kind of wacky optimizations/fast routines you guys can come up with.

Share this post


Link to post
Share on other sites
x86 ASM opcode: BSR (bit scan reverse). Returns the index of the highest non-zero bit. I.e. the log base two, rounding down.
Results are undefined if the original number is zero.

Share this post


Link to post
Share on other sites
Damn, that was easy.

[EDIT] Uh...I'm not too good with assembly. If I tell it to put the result in eax, can I use that to not give the function a real return statement?

Share this post


Link to post
Share on other sites

DWORD dwNumber=something other than 0;
DWORD dwLog2;

__asm
{
mov dwLog2, dwNumber
bsr dwLog2
}


I'm not good with asm (well not x86 at least) either, but that would be my best guess. Maybe someone can point out any mistakes I made?

Share this post


Link to post
Share on other sites
Here is the code with stack prolog-epilog code:


#include <iostream>

using namespace std;


// logarithm function
__declspec (naked) int log2(int val)
{
// prepare stack
__asm {
push ebp
mov ebp, esp
sub esp, __LOCAL_SIZE
}

__asm {
mov eax,[val]
jz log2end
bsr eax,eax
log2end:
}

// Cleanup function
__asm
{
mov esp, ebp
pop ebp
ret
}

}

int main()
{

cout << "Result: " << log2(2) <<endl;

return 0;
}

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
The zero case will not work correctly for that however. mov doesn't set flags, you need a test as well

Share this post


Link to post
Share on other sites
I'm not sure that it is really necessary. Step through it with a debugger and see what BSR leaves in EAX when it's 0. If it's still zero after executing the instruction then you don't need the test. I know BSR will clear ZF if the src is 0, but I'm not sure what it will leave in the dest.

EDIT: Did I miss something? What prolog code?

EDIT: Ahh, as in the stack setup, not the language prolog :)
It allocates room on the stack for any local variables.

Share this post


Link to post
Share on other sites
Quote:
Original post by joanusdmentia
I'm not sure that it is really necessary.


It probably isn't, but when the spec says you get undefined results it's sometimes better to err on the side of caution.

Share this post


Link to post
Share on other sites
If you use it in a SIMD (3DNow or SSE) context you might even get it faster by loading a float and getting its exponent.

For instance (2X int log2(float)) if a mm register contains a positive float it's nearly free. Only a shift is required.

Something similar has been discussed quite heavilly and recently here

(No need to read the end it turned to be silly. :)

Share this post


Link to post
Share on other sites
Yeah, FPU looks to be the best on Athlons. BSR is actually quite slow there - it costs 10 clocks + 2 lost decode slots. Then again, shifts are bad on P4s.

This is the kind of thing where there's a good bit of wiggle room for asm optimization depending on platform, but the problem is, it's not worth selecting one approach at runtime (lest that overhead outweigh the gains). Instead, you'd need to inline the log2 into the higher level algorithm, and swap that out.

Share this post


Link to post
Share on other sites
Here's a few different versions for you: two plain integer ones, one and one fp/int trick.

Btw did you know there is a FPU assembler instruction to calculate y*log2(x)? It's called FYL2X
it works like this:
fld y
fld x
fyl2x
fstp [dest]

Ok, now on to the functions

// basic iterative version
unsigned int log2(unsigned int x)
{
int count = 0;

while(x>>=1)
count++;

return count;
}

// folding + 1 counting trick
unsigned int log2(unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x >>= 1;
x -= (x >> 1) & 0x55555555;
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return x & 63;
}

// floating point trick
//This is accurate up to 2^25. After that, it gets these ones off by 1:
//2^25-1
//2^26-2 to 2^26-1
//2^27-4 to 2^27-1
//2^28-8 to 2^28-1
//2^29-16 to 2^29-1
//2^30-32 to 2^30-1
//2^31-64 to 2^31-1
//2^32-128 to 2^32-1

unsigned int log2_integer_approximate(unsigned int n){
*((float*)&n) = (float)n;
return ((n & (~((1<<23) - 1))) >> 23) - 127;
}

I figure you could make a really fast one with sse2 that calculates 4 of these at once. Unfortunately my processor doesn't have sse2 so if someone could test this for me that would be cool. Here is the asm pseudocode

__declspec(align(16)) int mask[4] = {0xff800000, 0xff800000, 0xff800000, 0xff800000};
__declspec(align(16)) int shifts[4] = {23, 23, 23, 23};
__declspec(align(16)) int subtracts[4] = {127, 127, 127, 127};

__asm{
movdqa xmm1, mask ; setup constants
movdqa xmm2, shifts
movdqa xmm3, subtracts
loop_label:
cvtdq2ps xmm0, [source + loop * 16] ; get numbers
andpd xmm0, xmm1 ; compute log2 approximation
psrld xmm0, xmm2
psubd xmm0, xmm3
movdqa [dest + loop * 16], xmm0 ; store result

if(termination-condition-not-met)
goto loop_label;
}

Share this post


Link to post
Share on other sites
K then your SSE example is based on what I described earlier. Still why a and ? If you don't cope with negative inputs (the real domain of log2 is R*+), logical shifts will be enough.

Share this post


Link to post
Share on other sites
Yes, you're right of course, the AND is not nessesary for unsigned numbers. I'm not quite sure why I left it in there :)

So the final version can do 4 integer log2 approximations in 4 sse2 instructions. Wow, that is massively fast, but I wonder, does this have a practical use? Does anyone need to calculate millions of log2's?

Share this post


Link to post
Share on other sites
Yes in terrain rendering for instance (say geomipmap with geomorphing), at each vertex or triangle, check if you need to subdivide (higher LOD)

metric error/view distance -> LOD

by doing :

LOD = _log2(err)-_log2(dist); // no need to remove the bias then

Or with more precision :
LOD = _log2(err/dist);

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this