Archived

This topic is now archived and is closed to further replies.

Nearest power of 2

This topic is 5599 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

Recommended Posts

Original post by Bourla

When you talk about Visual, which one you refer to ?
Visual Studio 6 ?

Yes

Have you tried Visual .Net compiler ?
Microsoft Visual C++ Toolkit 2003 ...
Yes as soon as I have seen the news. I have tried some small test projects such as the one of my last post, benchmarked and analysed the output asm files. There is nothing really new, most of the main wastes I had surrounded in VS6 are still present in this compiler. They have these new target hardwares : Athlon, SSE, etc... But I don''t think this will bring much compared to my math lib that uses intrisics and does not rely on uncertain vectorization.

- Visual is reluctant to use op reg, mem while this often relaxes register pressure and thus enables perfect scheduling.
- Visual still has no performant inline asm such as gcc (or even the old Watcom !).
- Still many oher details that waste CPU power.

Share on other sites
quote:
Original post by cnstrnd
33 -> 64

wouldnt 33 -> 32? Or did i mis understand, becasue if i know correctly 32 is a power of 2.

Sharp Basic - Coming summer 2004!

Share on other sites
quote:
Original post by DigitalDelusion
This is fun... Im now in the sub 5 range...
Im guessing FPU precision is to blame

Good point. Right I forgot to mention it. But the integer input can not be higher than 2^23 (23 bits the mantissa mantissa size). Higher, the lower bits are crashed in the first int to float conversion. Which gives, one, then 2, then 4 errors, etc...

This could be solved by using doubles. And the code would be a bit longer thus slower, having to mask an int64 (or two int32). But you are right this is not a fully valid function. Well this is if it's in the doc.

But if this function is used to find a 2 power texture size that would be enough. Still this shows the best choice for a solid library is probably the "logical" version.

EDIT : tags

[edited by - Charles B on June 13, 2004 4:19:41 AM]

Share on other sites
quote:
Original post by DerAnged
quote:
Original post by cnstrnd
33 -> 64

wouldnt 33 -> 32? Or did i mis understand, becasue if i know correctly 32 is a power of 2.

Sharp Basic - Coming summer 2004!

He wants the next superior (i.e. greater) power of 2, not necessarily the closest power of 2.

Share on other sites
Well my routine clocks in at a solid 4.85 quite nice for the benchmark but sadly practicly unusable. The way I do it is using a 256kb LUT and simply doing it like this:
static unsigned bit[0x10000];inline unsigned __stdcall nextpow2_DD(unsigned x){	return --x > 0xFFFF ? bit[x >> 16] << 16 : bit[x];}

using a 1kb LUT i get 8.25 on par with IEEE but correct even for higher values and quite much faster than the Logical

static unsigned bit2[0x100];inline unsigned __stdcall nextpow2_DD2(unsigned x){	--x;	int shift = 0;	if( x > 0xFFFFFF)	{		x >>= 24;		shift = 24;	}	else if( x > 0xFFFF)	{		x >>= 16;		shift = 16;	}	else if( x > 0xFF)	{		x >>= 8;		shift = 8;	}	return bit2[x] << shift;}

the LUT:s are created like this:
//LUT for version 1	bit[0] = 1;	for(int i = 1; i != 0x10000; ++i)		bit[i] = 2 << topbit(i);//LUT for version 2 	bit2[0] = 1;	for(int i = 1; i != 0x100; ++i)		bit2[i] = 2 << topbit(i);//return the upper most set bit __declspec( naked) unsigned __fastcall topbit(unsigned u){	_asm	{		xor eax, eax		bsr ecx, ecx		mov eax, ecx		ret	}}

in a real application at least the smaller one could without problems be embedded to get rid of the initialization step.

edit: fixed LUT:s to work exactly like "Logical" version

[edited by - DigitalDelusion on June 12, 2004 3:36:18 PM]

Share on other sites
@Charles:

About BSR instruction, it's said to require 3 µOps on PII and PIII (Don't know about Athlon) and like you said bsr is not a lot more than an ADD. With the timing framework you submitted, I get about 11 cycles with something like this ...

inline unsigned __fastcall bsr( unsigned x ){	__asm	{		bsr eax,ecx // bsr regOUT, regIN	}}inline unsigned nextpow2_BSR2(unsigned x){	return 2<<bsr(--x); // 0 is a special case ...}

... which is 8 cycles less than the monolythic full-featured inline assembly function. I am sure you could get pretty good result with 'anonymous registers' under gcc. Since I don't have it here,

@DigitalDelusion:

Hey ! That looks like old days of hardcore rotozoom optimisation ! I am sure you'll manage to get negative cycle count !
me gets warm.

@Everybody:

Did you notice, when converting a integer to a float, it's faster to use a "signed int" because of VC bound checking (Still the 23bits mantissa issue) ? I learnt something.

[edited by - cnstrnd on June 12, 2004 8:27:33 PM]

[edited by - cnstrnd on June 13, 2004 10:58:27 AM]

Share on other sites
quote:
Original post by DigitalDelusion
doynax I''ve always loved compression and now how hard it is getting decent speed from huffies, so that''s quite intresting to hear, any good links to info on canocial huffman coders?

haven''t found much (good) info on canonical coders, the only decent
one was arturo campos'' tutorial.

i finally got the decoder working so it''d be interesting to discuss how to get some performance out of it, esp since this seems to be something of a black art.
send me a message at 213799814 (icq) or mail me at johan AT commsoft DOT se if you''re interested and i might even send some source your way

Share on other sites
quote:
Original post by cnstrnd
About BSR instruction, it's said to require 3 µOps on PII and PIII (Don't know about Athlon) and like you said bsr is not a lot more than an ADD. With the timing framework you submitted, I get about 11 cycles with something like this ...

inline unsigned __fastcall bsr( unsigned x )
{
__asm
{
bsr eax,ecx // bsr regOUT, regIN
}
}

Great if it works. I have tried this solution but it bugged. What happened is the compiler refusing to put something (the test array element) in ecx. So this was speedy but it did not compute anything valid.
So :
- Visual 6 or .Net ?
- are you sure it gives the right outputs results (right output asm) ?
- are you sure it won't bug in other contexts ? Using bsr anywhere else.

... which is 8 cycles less than the monolythic full-featured inline assembly function.

@Everybody:
Did you notice, when converting a integer to a float, it's faster to use a "signed int" because of VC bound checking (Still the 23bits mantissa issue) ? I learnt something.

Yes but there are so any rules to remember. In the K6 optimizations, Athlon optimizations, and Intel docs are full of such recommandations when one has to choose between signed or unsigned.

But the best for me is to take a look at the output asm code when something looks wrong to me. As you could see I am rather accurate at benchmarking a priori. So I know by intuition and quick computations when a FPS is abnormally low or when a compuatation takes too long.

For instance I had the IEEE function much too slow compared to my prediction. So I read the asm output, and I detected the bloody call to __ftol(). There are many awful things like this to avoid with C compilers. For instance atoi, toupper (exact name ? mem leak), are well known atrocious perf killers. Also misalignement is sometimes hard to detect.

@Digital
Good thing trying the lut option. That's really for fun this time since the init is tricky and the mem wasted too. But it is generally an option to examine in many cases of strategic routines. So, again, as if nextPowerOf2 was strategic, I am sure you can make something faster with luts. The smaller, the better. The end shift by an variable should be avoided, that is a new algo with luts has to be found. I'll try to think about it and find something even better.

Probably your first one revisited with two consecutive tests and four cases. => 256 bytes or 1Kb lut.

[edited by - Charles B on June 13, 2004 4:48:40 AM]

Share on other sites
@Charles:

I use VC7.1

The BSR function will work exactly as an RDTSC one would :

typedef unsigned __int64 uint64;#define INLINE __forceinlineINLINE uint64 rdtsc() {__asm{ rdtsc }}

It doesn''t give you this layer of abstraction one can have with anonymous register you describe : first two DWORD parameters are in ECX,EDX. Result is in AL,AX,EAX,EDX:EAX,RAX or whatever.

You said you were ''adding'' SSE intrinsics to the default set. Are they inline functions or something else ?

Cheers.

Share on other sites
You should refer to my "Horse power math lib (2)" thread. It explains it all. But shortly, I replace the gcc builtins (currently inefficient) by inline asm.

With Visual it''s impossible since it would freeze the register roles (this is what happens with the builtin implementation of the last gcc version). So for Visual 6/7 CC and Intel CC will mostly use the intrisics unless the function is big enough, then I''ll use naked and fastcall functions. The only functions worth using really inlined asm where the simple int/float func(float) functions as shown with my inline _fild() in the last posts.

It''s a good thing to know, that asm in Visual 7 is slightly more powerful though. I might use it efficiently in my library. I am pretty certain this rtdsc() code would not work on Visual 6. I''ll test that on the 6.0 and 7.0 though. Thanks for the tip.

Share on other sites

This topic is 5599 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

This topic is now closed to further replies.

• Game Developer Survey

We are looking for qualified game developers to participate in a 10-minute online survey. Qualified participants will be offered a \$15 incentive for your time and insights. Click here to start!

• 15
• 11
• 9
• 11
• 15