Nearest power of 2


Thanks for the effort. I'll do my own benchmarks. Some criticism of the method, though:

- Don't call rand() in the inner loop; you'd mostly be benchmarking rand(), not the function itself. Fill an array with random values beforehand instead.

- Don't benchmark through function calls; in real life such small functions should be inlined. So use a template or a #define to benchmark the function, not a function pointer.

- Give results in clock cycles per iteration using rdtsc. That's much more convenient: it is independent of CPU frequency, it compares directly with known instruction timings, and it also reveals whether something is wrong in the compiler settings. (A minimal sketch of such a harness follows this list.)
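A minimal sketch of that kind of harness, assuming MSVC on 32-bit x86; every name here is illustrative only, and the full harness posted later in this thread is more thorough:

#include <stdio.h>
#include <stdlib.h>

#define N 128

static unsigned input[N];

/* read the low 32 bits of the CPU time-stamp counter (MSVC inline asm) */
static __forceinline unsigned read_tsc(void)
{
    unsigned t;
    __asm {
        rdtsc
        mov t, eax
    }
    return t;
}

int main(void)
{
    unsigned i, best = ~0u, sum = 0;
    int run;

    /* fill the array once, outside the timed loop */
    for (i = 0; i < N; i++) input[i] = rand();

    /* keep the best of many runs to reject interrupts and cache misses */
    for (run = 0; run < 128; run++) {
        unsigned t = read_tsc();
        for (i = 0; i < N; i++)
            sum += input[i] | 1;   /* stand-in for the function under test */
        t = read_tsc() - t;
        if (t < best) best = t;
    }
    printf("%.2f cycles per iteration (%u)\n", (double)best / N, sum);
    return 0;
}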

I am doing a quick test on Visual and gcc; I'll put the results here. I can even test the 3DNow trick, I am sure it's unbeatable.

Note: you did not test the same FPU-trick version. Variable-length shifts (1 << n) used to be very slow; I wonder how they fare now.

[edited by - Charles B on June 11, 2004 1:36:13 PM]

Obviously, if you want portability you're not necessarily going to be able to use the float trick; PowerPC, for example, doesn't have single precision. On ARM (e.g. GBA, PDA, mobile phone) naoztar's version would be the fastest:

; round R0 up to the next power of two: subtract 1, smear the top bit down, add 1
SUB R0, R0, #1
ORR R0, R0, R0, LSR #1
ORR R0, R0, R0, LSR #2
ORR R0, R0, R0, LSR #4
ORR R0, R0, R0, LSR #8
ORR R0, R0, R0, LSR #16
ADD R0, R0, #1

should be 7 cycles (one instruction per clock), according to the official timings

Here are my results with Visual C++ in release mode on an Athlon:

naive : 52.75 cycles (Anonymous Poster 1)

recursive : 68.88 cycles (essentially the same, just less optimized)

logical : 11.007 cycles
I said: "It might compete with the IEEE integer representation trick."
Confirmed; I couldn't have been more accurate.

IEEE trick : 8.97 cycles
I said: "Note that this solution is faster than any other proposed, but it's not very good due to 2 store-and-load dependencies with the pointer casts. I suppose it runs in around 10-20 cycles."
Confirmed; I was even a bit pessimistic. Note that you did not find it that fast because Visual does float-to-int conversions with a call (!) to __ftol, which is terribly slow. Other compilers optimize this away; the processor has fistp for it. So I replaced the float-to-int conversion with an inline asm macro. This is one of the main optimizations one should do in any game code. Beware of the standard library, especially the Visual implementation.

3DNow (IEEE trick) : 10.27 cycles
I said: "However this can be done at great speed in SIMD, 3DNow or SSE, because the integer representation of floating-point values is free." Not totally confirmed at first glance, but beyond the raw number it remains true, because:
- in a truly SIMD context, you don't need the two femms;
- Visual implements the intrinsics badly, but gcc and IntelC don't;
- as I said, if one codes the float-to-float version and starts with floats already in MMX registers, it's two instructions: that is roughly one cycle in a global code context (see the sketch right after this list).
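A minimal sketch of that float-to-float variant, assuming the value is already a float sitting in an MMX register as an __m64, so no conversions or femms are needed; the function name and mask constant are illustrative, not part of the benchmark code below:

#include <mm3dnow.h>

/* Round a positive float held in an MMX register up to the next power of
   two, staying in the float domain: add the mantissa mask to the integer
   representation (paddd), then clear the mantissa bits (pandn). The result
   is the rounded float, still in the MMX register. */
static __inline __m64 nextpow2f_3DNow_reg(__m64 xf)
{
    static const __m64 mantissaMask = {0x00000000007FFFFFL};
    xf = _m_paddd(xf, mantissaMask);     /* irep + mantissa_mask    */
    return _m_pandn(mantissaMask, xf);   /* (~mantissa_mask) & irep */
}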

I was too lazy to do the unrolled ifs, but it's certain to be much slower. The recursive version is interesting but depends heavily on the compiler; nonetheless it probably won't be faster than the inlined loop.

BSR : 19.5 cycles
I said: "It was very slow, so on the first Pentium the IEEE trick was the fastest way to do it. I wonder how fast it is now (say on a Pentium 4). What's the latency?" The problem here is with Visual again: it's impossible to inline such asm code efficiently. With gcc it would be better; maybe then it competes with the two or three other best results. I could write the whole loop in asm just to see.

PS: Maybe you could take a look at my "Horse power math lib(2)" thread.

Believe me, I know what I am doing: I can estimate a piece of code's absolute speed, even complex code, with very good accuracy. No magic, just ten years of doing cycle-precise benchmarks of many, many routines.

I'll insert the code sample here later; it can be useful the next time you want accurate and relevant benchmarks.


So basically all the "fast" routines were inlined except for the bsr, making it look slow as molasses. I'm not saying that it's not valid, but doing some cycle counting of my own I found that just making a call takes somewhere close to 15 cycles on my Athlon. So the question that arises is: in a complex application, or for users of a math lib that may be precompiled, I would actually say the function-pointer benchmarks are at least as valid.

But as always, the world isn't black or white, and your results match what I got when I let the optimizer do its job just for the benchmark: the IEEE method then wins because it gets inlined. If it has to carry the burden of call overhead it loses to bsr on my machine.

So basically all the "fast" routines were inlined except for the bsr, making it look slow as molasses. I'm not saying that it's not valid, but doing some cycle counting of my own I found that just making a call takes somewhere close to 15 cycles.

I hope you were not being polemical, since I mentioned: "the problem here is with Visual again" (not bsr).

If you mean that it doesn't allow comparing bsr as such, there is the gcc inline-asm solution for that. I can try it if you want.

15 cycles if you use the stack (cdecl or stdcall), yes, when the compiler can't inline. But not with fastcall: passing arguments in registers, it's more like 6-8 cycles, only the call/ret overhead. I tried various calling conventions and prototypes and took the fastest one. I found no way to get a non-buggy inlined version faster than the fastcall one. I have not only benchmarked but also analysed the asm output file, as I always do. This is because Visual does not allow anonymous registers in asm the way gcc does. Visual is to blame, not my method. GCC would make the BSR-based function run perfectly; it could probably become the fastest choice, as I said, probably around 8-12 cycles.
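For illustration, a hedged sketch of the kind of prototypes being compared here; the names are hypothetical, but with MSVC __fastcall does pass the first integer arguments in ECX/EDX instead of on the stack:

/* hypothetical declarations, for illustration only */
unsigned __fastcall nextpow2_regs (unsigned x);  /* x arrives in ECX: only call/ret overhead   */
unsigned __stdcall  nextpow2_stack(unsigned x);  /* x is pushed on the stack: extra store/load */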

EDIT: TESTED: ANYWAY, BSR ALONE (NO CALL, PURE ASM) COSTS 9-10 CYCLES. SO THIS CONFIRMS IT'S STILL A SLOW INSTRUCTION => IT CAN'T BEAT THE IEEE TRICK HERE.

So the question that arises is: in a complex application, or for users of a math lib that may be precompiled, I would actually say the function-pointer benchmarks are at least as valid.

No, I totally disagree. I am writing the highest-performance math lib possible, and in any case I try to provide an inlined version. For such small functions a compiler usually has an intrinsic version (cos, sin, etc.) that ends up inlined, so an extended math lib should do the same. It's pointless to pretend to write a speedy math lib without inlined functions.

Since in this thread we're comparing speeds, it's irrelevant to test the called versions when inlining is possible and totally justified by the size of the code. Inlining brings a real-life dimension to the algorithm: it is how the instructions get scheduled within the context of other instructions.

Lastly, there is not much difference between 15 and 20 cycles (call overhead), but between 5 and 10 there is. So the absolute figures are only meaningful when comparing inlined versions.

But as always, the world isn't black or white.
I suppose you have noticed that I tried to do this like a pro: I put every version in its best possible implementation given the context of Visual C++, and then added context-independent judgements concerning 3DNow (femms) and bsr (call). The point is to measure the intrinsic efficiency of each algorithm, not polluted by the compiler wasting CPU time.

If it has to carry the burden of call overhead it loses to bsr on my machine.
That's why gcc is potentially superior to Visual, thanks to its inline asm: it enables +20% to +50% performance in many cases compared to Visual. IntelC is also something you can try if you use many small, optimized functions.

Console application (test.c) using rdtsc and printing out the results:


#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <mm3dnow.h>

// Visual C: force inlining of the small test functions

#if defined(_MSC_VER)
#define inline __forceinline
#endif

unsigned rdtsc(void)
{
    unsigned timer;
#if defined(_MSC_VER)
    /* Visual C++ inline assembly code */
    __asm{
        rdtsc
        mov timer, eax
    };
#elif defined(__GNUC__)
    /* GCC inline assembly code */
    unsigned lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    (void)hi;
    timer = lo;
#endif
    return timer;
}

#define nInnerLoops 128
#define nOuterLoops 128
static unsigned pArray[nInnerLoops];

/* Time f over the pre-filled array; keep the best of nOuterLoops runs and
   print cycles per iteration. */
#define test(f) { \
    int i, j, k = 0; \
    unsigned tmin = nInnerLoops * 100000UL; \
    float timing; \
    unsigned *p = pArray; \
    for (j = 0; j < nOuterLoops; j++) { \
        unsigned t = rdtsc(); \
        for (i = 0; i < nInnerLoops; i += 4) { \
            k += f(p[i+0]); k += f(p[i+1]); \
            k += f(p[i+2]); k += f(p[i+3]); \
        } \
        t = rdtsc() - t; \
        tmin = (t < tmin) ? t : tmin; \
    } \
    timing = (float)tmin / (float)nInnerLoops; \
    printf("%s : f(865)=%d, %f %d\n", #f, f(865), timing, k); \
}



// Naive version

inline unsigned nextpow2_Naive(unsigned n)
{
    unsigned x;
    if(n==0) return 0;
    x=1;
    while(x < n) x<<=1;
    return x;
}



// Shifts

inline unsigned nextpow2_Logical(unsigned x)
{
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return ++x;
}



inline unsigned nextpow2_Recursive(unsigned x)
{
    unsigned y = ((x-1) & x);
    return y ? nextpow2_Recursive(y) : x<<1;
}



//inline int __fastcall _ftoi(float x)

inline int __stdcall _ftoi(float x)
{
    int r;
    __asm fld x
    __asm fistp dword ptr r;
    return r;
}



// IEEE trick

inline int nextpow2_IEEE(int x)
{
    static const unsigned _MantissaMask = (1UL<<23UL) - 1UL;
    unsigned irep;
    *(float*)&irep = (float)x;                      /* integer bit pattern of (float)x */
    irep = (irep + _MantissaMask) & ~_MantissaMask; /* bump the exponent unless the mantissa is 0, then clear it */
    x = (unsigned)_ftoi(*(float*)&irep);
    return x;
}

/*
__declspec(naked)
int __fastcall nextpow2_BSR(int x)
{
    __asm
    {
        dec ecx
        mov eax, 2
        bsr ecx, ecx
        rol eax, cl
        ret
    }
}
*/



// Find a way for pure inline asm ???

inline unsigned __stdcall nextpow2_BSR(unsigned x)
{
    unsigned shift;
    __asm
    {
        mov ecx, x
        dec ecx
        //mov eax, 2
        bsr ecx, ecx
        mov shift, ecx
        //rol eax, cl
    }
    return 2<<shift;
}



inline unsigned nextpow2_3DNow(unsigned x)
{
    static const __m64 _MantissaMask64 = {0x00000000007FFFFFL};
    register int y;
    register __m64 t;
    _m_femms();
    t = _mm_cvtsi32_si64( x );
    t = _m_pi2fd(t);
    t = _m_paddd(t, _MantissaMask64);
    t = _m_pandn(_MantissaMask64, t);
    t = _m_pf2id(t);
    y = _mm_cvtsi64_si32( t );
    _m_femms();
    return y;
}



int main(void)
{
    /* Init array */
    { int i; for(i=0; i<nInnerLoops; i++) pArray[i]=rand(); }

    printf("ftoi(1.1) = %d \n", _ftoi(1.1f) );

    test(nextpow2_Naive);
    test(nextpow2_Logical);
    test(nextpow2_Recursive);
    test(nextpow2_IEEE);
    test(nextpow2_BSR);
    test(nextpow2_3DNow);

    getch();
    return 0;
}



[edited by - Charles B on June 12, 2004 6:31:16 AM]

[edited by - Charles B on June 12, 2004 7:19:03 AM]

No, no polemics, and I do consider your approach professional. I just wanted to point out that for many cases the other benchmarks were valid, and I do respect your opinion that they're not, given your context.

And I really have to look closer at GCC; it seems like they've done some great things with it and its support for good inline asm.

No, no polemics

I know, no problem. I like arguing the fine points of asm and optimizations, because I always need to gain more knowledge and experience for this math library. Even if the bsr solution is not the best, it's only because it has not been judged very useful by AMD and Intel; no doubt they could have wired it as a one-clock-latency instruction. It's not much more complex than an add in elementary logic.

I still had to check again, and it's really impossible to get a speedy inlined version, because Visual puts the input value on the stack one way or another: a damn slow store-to-load-forwarding dependency:

mov ecx, [src]
mov [esp+tmp], ecx // store to the stack... STALL
mov ecx, [esp+tmp] // ...and immediately reload it. AGAIN
dec ecx
bsr ecx, ecx

That's pretty stupid, but that's Visual. It costs 4-5 cycles, I guess.

Gcc would simply do:
mov ecx, [src]
dec ecx
bsr ecx, ecx

Anyway, I have checked the Athlon optimization guide, and it confirms my benchmark: BSR alone has a latency of 10 cycles, which is fully exposed here since cl is used as input right afterwards in the rol. This means that either way the bsr version costs more than 10 cycles, probably 15 at best.
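For reference, a minimal sketch of what the gcc inline-asm version could look like; untested here, and the function name and constraints are mine, not taken from the thread:

/* gcc picks the registers itself, so no stack round trip: bsr finds the
   index of the highest set bit of x-1, and 2 << index is then the next
   power of two (undefined for x == 1, like the MSVC version above). */
static inline unsigned nextpow2_BSR_gcc(unsigned x)
{
    unsigned index;
    __asm__("bsr %1, %0" : "=r" (index) : "r" (x - 1));
    return 2u << index;
}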

[edited by - Charles B on June 12, 2004 7:31:05 AM]

OK, here are my results:
ftoi(1.1) = 1
nextpow2_Naive : f(865)=1024, 43.382813 362651648
nextpow2_Logical : f(865)=1024, 11.882813 362651648
nextpow2_Recursive : f(865)=1024, 66.906250 362651648
nextpow2_IEEE : f(865)=1024, 8.125000 362651648
nextpow2_BSR : f(865)=1024, 14.710938 362651648
nextpow2_3DNow : f(865)=1024, 9.523438 362651648
nextpow2_DD : f(865)=1024, 6.109375 362651648


nextpow2_DD is my new vanilla C version =)

Charles B.

When you talk about Visual, which one do you refer to?
Visual Studio 6?

Have you tried the Visual .NET compiler?
The Microsoft Visual C++ Toolkit 2003 is freely available from Microsoft; it is the same compiler as Visual .NET (and I think it can be used with the IDE from Visual Studio 6: you just have to copy the new Include, Bin, and Lib directories inside those of Visual Studio 6...).

This is fun... I'm now in the sub-5-cycle range...

edit:
I discovered something uncanny while working on the new routine... the IEEE trick is fast but the results aren't always correct. Some example values:

i is the value sent to nextpow2, r is the result, and c is the value
obtained using the naive method.


IEEE mismatch: i=16777217 r=16777216 c=33554432
IEEE mismatch: i=33554433 r=33554432 c=67108864
IEEE mismatch: i=33554434 r=33554432 c=67108864
IEEE mismatch: i=67108865 r=67108864 c=134217728
IEEE mismatch: i=67108866 r=67108864 c=134217728
IEEE mismatch: i=67108867 r=67108864 c=134217728
IEEE mismatch: i=67108868 r=67108864 c=134217728
IEEE mismatch: i=134217729 r=134217728 c=268435456


I'm guessing FPU precision is to blame here: a single-precision float has only a 24-bit mantissa, so integers above 2^24 = 16777216 are not all exactly representable, and for example 16777217 rounds to 16777216, which is already a power of two. As you can see the values are very high, so it's probably not a problem in practice.
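A tiny standalone check (my own sketch, not from the thread) of the rounding behind these mismatches:

#include <stdio.h>

int main(void)
{
    /* 16777217 = 2^24 + 1 has no exact single-precision representation
       (the mantissa is only 24 bits), so it rounds to 16777216, which is
       already a power of two; the IEEE trick then returns 2^24 instead
       of the expected 2^25. */
    float f = (float)16777217;
    printf("%.1f\n", f);   /* prints 16777216.0 */
    return 0;
}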

[edited by - DigitalDelusion on June 12, 2004 9:46:16 AM]
