micca

Optimizations.. jmp bad?


Some thoughts inspired by clipByte.
-----------------------------------
Compilers and CPUs sometimes strike back at inline ASM coders... or what is wrong here? Strange things happen on my Athlon XP.

I tested different optimization settings (speed, size, etc.). Here is the VC++ 6.0 project, if someone wants to enlighten me: http://www.kulturteknologerna.se/Projext.zip
    
inline unsigned __int8 clipByte1 (int value)
{
    // possible to reduce... but we let the compiler take care of it...

    value = (0 & (-(int)(value < 0))) | (value & (-(int)!(value < 0)));
    value = (255 & (-(int)(value > 255))) | (value & (-(int)!(value > 255)));
    return value ;
}
/*
	mov	ecx, DWORD PTR _value$[esp-4]
	xor	eax, eax
	test	ecx, ecx
	setge	al
	neg	eax
	and	ecx, eax
	cmp	ecx, 255				; 000000ffH
	setle	al
	neg	al
	and	al, cl
	cmp	ecx, 255				; 000000ffH
	setg	cl
	neg	cl
	or	al, cl
	ret	0
*/

inline unsigned __int8 clipByte2 (int value)
{
     return value < 0 ? 0 : value > 255 ? 255 : value  ;
}
/* 
	mov	eax, DWORD PTR _value$[esp-4]
	test	eax, eax
	jge	SHORT $L1744
	xor	eax, eax
$L1744:
	cmp	eax, 255				; 000000ffH
	jle	SHORT $L1743
	mov	eax, 255				; 000000ffH
$L1743:
	ret	0
*/

// No jumps here

inline unsigned __int8 clipByte3 (int value)
{
        __asm // try out different instruction orders!

        {
            mov     eax, value
            mov     ecx, 0
            cmp     eax, ecx
            cmovs   eax, ecx
            mov     ebx, 255
            cmp     eax, ebx
            cmova   eax, ebx
        }
}

// same as clipByte2 !!

inline unsigned __int8 clipByte4 (int value) 
{
        __asm 
        {
	        mov     eax, value
	        test    eax, eax
	        jge	    SHORT GEZERO
	        xor	    eax, eax
        GEZERO:
	        cmp	    eax, 255				; 000000ffH
	        jle	    SHORT LEFF
	        mov	    eax, 255				; 000000ffH
        LEFF:
        }
}
  

Some numbers I got, in ticks.

      
Timer       In loop             Outside loop
            ----------------    ----------------
clipByte    1    2    3    4    1    2    3    4
================================================
  
Athlon XP
VOL1 = 0
VOL2 = 0
PRE = 0
========
Opt u486
Opt size    7    0    4    0   12    5    7    6
Opt Speed   2    0    5    5   12    7   17   16  

uPPro
Opt size    7    0    4    0   12    5    7    6
Opt Speed   0    1    3    0   11    7    8    7

Blend
Opt size    7    0    4    0   12    5    7    6
Opt Speed   1    0    5    5   12    7   17   17
 

Athlon XP
VOL1 = 1 ;  memory stall 1
VOL2 = 0
PRE = 0
========
Opt u486
Opt size    6    10   5    5   14    6   11    9
Opt Speed   9    10   6    7   18    9   18   19  

uPPro
Opt size    6    10   5    5   14    6   11    9
Opt Speed   7     4   5    5   16    8   11    9

Blend
Opt size    6    10   5    5   14    6   11    9
Opt Speed   9    10   6    7   18    9   18   19
   

Optimizing for size is still a good bet.

It seems clipByte2 is not a bad choice...
Does it manage to jump past the stall now and then,
which makes it the faster solution? Is jmp always bad?

      
            Athlon XP           PII
            ----------------    ----------------
clipByte    1    2    3    4    1    2    3    4
================================================
 
VOL1 = 1 ;  memory stall 1
VOL2 = 0
PRE = 0
uPPro
========
Opt size    14   6   11    9   17    7   13   10
Opt Speed   16   8   11    9   16    9   12   10
 

VOL1 = 1 ;  memory stall 1
VOL2 = 1 ;  memory stall 2
PRE = 0
uPPro
========
Opt size    14  12    9    7   17    8   12   10
Opt Speed   16   8   10    9   16    9   12   10
 

VOL1 = 1 ;  memory stall 1
VOL2 = 1 ;  memory stall 2
PRE = 1  ;  preempt ... let something else use the CPU
			and have some bad luck..
uPPro
========
Opt size    14  12   11   10   17    8   12   10
Opt Speed   16  15   11   10   16    9   12   10
  
So what is going on here?

micca

Edited by - micca on November 23, 2001 12:38:54 PM
Edited by - micca on November 23, 2001 12:44:14 PM

Yes, that was what I thought...
but cmp/jmp is faster here!

What is the penalty?
Loading from memory seems a lot more critical.

In my personal experience, CMOVs aren't as good as they are supposed to be. Sometimes, if you have a well-defined branch pattern, it's better to keep the conditional jumps, since cmovs don't benefit from the branch prediction logic and jmps do.

Matt
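
To make the point about branch patterns concrete, here is a minimal C sketch of the two styles being compared, plus two input generators: one that a branch predictor should handle easily and one that it should not. This is not code from the thread's project; the names `clip_branchy`, `clip_branchless`, `fill_predictable`, `fill_random` and `sum_clipped` are hypothetical. Whether a compiler actually emits jumps, setcc or cmov for either function is its own decision, so the generated assembly should be checked before drawing conclusions from timings.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form, same logic as clipByte2 in the first post. */
static uint8_t clip_branchy(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Branch-free form built from comparison masks, in the spirit of clipByte1. */
static uint8_t clip_branchless(int v)
{
    int neg = -(v < 0);              /* all ones when v < 0, else 0 */
    v &= ~neg;                       /* negative values become 0 */
    int over = -(v > 255);           /* all ones when v > 255, else 0 */
    v = (v & ~over) | (255 & over);  /* values above 255 become 255 */
    return (uint8_t)v;
}

/* Predictable input: every value is already in range, so both
   comparisons always go the same way. */
static void fill_predictable(int *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = (int)(i % 200);
}

/* Hard-to-predict input: a simple LCG producing values in [-256, 511]. */
static void fill_random(int *buf, size_t n, uint32_t seed)
{
    uint32_t s = seed;
    for (size_t i = 0; i < n; ++i) {
        s = s * 1664525u + 1013904223u;
        buf[i] = (int)((s >> 16) % 768u) - 256;
    }
}

/* Loop to wrap in whatever tick counter is being used. */
static unsigned long sum_clipped(const int *buf, size_t n, uint8_t (*clip)(int))
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += clip(buf[i]);
    return sum;
}
```

Timing `sum_clipped` with each clip function over both kinds of buffer is one way to see whether the jump version only wins when the input is predictable.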

Dude, Athlon XPs are superscalar. Writing optimized asm for them is, as Meyers would say, challenging.

On the 586 and later it is often faster to do more work if you can eliminate a jump and/or replace it with a loop.

Alignment makes a HUGE difference. I've seen a couple of nops take 30% off the execution time of small routines.

Isn't there a single opcode to sign extend and clip a DWORD to a BYTE and vice versa?

The thing with clipByte1 is that the compiler is generating a mixture of code that manipulates both 8-bit and 32-bit registers at the same time, for example %al and %eax. The CPU doesn't like this very much (because of register renaming issues). It can add stalls of up to 6 cycles each time you do that.

So let's say you do something like this:

mov al, 5
add eax, 10

The first instruction will cause the CPU to create a temporary register, aliased to "al", so that it can perform the operation now and write it back later. It's kind of like disk caching, but on CPU registers. Now at the second line you access eax, of which al is a part. So the CPU needs to read the new value of eax before it can do anything with it, which forces the temporary al to be written back, which takes time. And tada! 6 cycle stall.

I suppose MSVC generated that code? Try it out with gcc (use -S to get the assembly output) and you may get better results.

Hi again. I tested gcc; it does a better job with:

    

inline unsigned __int8 clipByte1 (int value)
{
value = (0 & (-(int)(value < 0))) | (value & (-(int)!(value < 0)));
value = (255 & (-(int)(value > 255))) | (value & (-(int)!(value > 255)));
return value ;
}
/* MSVC
mov ecx, DWORD PTR _value$[esp-4]
xor eax, eax
test ecx, ecx
setge al
neg eax
and ecx, eax
cmp ecx, 255 ; 000000ffH
setle al
neg al
and al, cl
cmp ecx, 255 ; 000000ffH
setg cl
neg cl
or al, cl
ret 0
*/

/* GCC
/APP
STARTclipbyte1: ; STARTc
/NO_APP
xorl %ecx,%ecx
movl %ecx,%edx
cmpl $0,8(%ebp)
cmovge 8(%ebp),%edx
movl $255,%ebx
movl %edx,%eax
orl %ecx,%eax
cmpl $255,%edx
cmovle %eax,%ebx
/APP
STOPclipbyte1: ; STOPc
/NO_APP0
*/




On the PII the latter takes 11 ticks instead of 20, that's nice.

But the interesting thing is that on the Athlon XP both run in 10 ticks!!!

So the CPU strikes back, dudes.


For the following, GCC does worse than MSVC:

  

inline unsigned __int8 clipByte2 (int value)
{
return value < 0 ? 0 : value > 255 ? 255 : value ;
}

/* MSVC
mov eax, DWORD PTR _value$[esp-4]
test eax, eax
jge SHORT $L1744
xor eax, eax
$L1744:
cmp eax, 255 ; 000000ffH
jle SHORT $L1743
mov eax, 255 ; 000000ffH
$L1743:
ret 0
*/

/* GCC
/APP
STARTclipbyte2: ; STARTcb2
/NO_APP
testl %edx,%edx
jl L511
movl $255,%eax
cmpl $256,%edx
cmovl %edx,%eax
jmp L513
.p2align 4,,7
L511:
xorl %eax,%eax
L513:
/APP
STOPclipbyte2: ; STOPcb2
/NO_APP0
*/




On the PII the first runs in 6 ticks and the latter takes 8.

On the Athlon XP it is the same: 6 for MSVC and 8 for GCC.

OK, thanks for your input.

/micca

Edited by - micca on November 25, 2001 7:50:53 AM

try this:

  
// it's better to return BYTE as DWORD.

inline DWORD ClipByte( int x )
{
return ( x | -(DWORD(x)>255) ) & ~(x>>31);
}
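
For readers following along, here is a hedged walk-through of why that one-liner clamps, written as a C sketch with the two masks named. It assumes a 32-bit `int` and that `>>` on a negative value is an arithmetic shift (implementation-defined in C, but what MSVC and GCC do on x86). The names `clip_mask` and `clip_reference` are hypothetical and only used for illustration.

```c
#include <stdint.h>

/* Same expression as ClipByte above, split into named masks.
   Assumes arithmetic right shift of negative values. */
static uint32_t clip_mask(int32_t x)
{
    /* The unsigned compare catches both out-of-range cases at once:
       negative x wraps to a huge unsigned value, so it is also > 255. */
    uint32_t over = 0u - (uint32_t)((uint32_t)x > 255u); /* all ones if x < 0 or x > 255 */
    uint32_t neg  = (uint32_t)(x >> 31);                 /* all ones if x < 0 */

    /* OR with 'over' saturates every out-of-range value to all ones;
       AND with ~neg then forces the negative ones back down to 0. */
    return ((uint32_t)x | over) & ~neg;
}

/* Plain reference clamp to compare against. */
static uint8_t clip_reference(int32_t x)
{
    return x < 0 ? 0 : x > 255 ? 255 : (uint8_t)x;
}
```

Note that for inputs above 255 the result is 0xFFFFFFFF rather than 255, so only the low byte is meaningful; the caller is expected to store or mask the low 8 bits.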

Super, Serge.

I rearranged it a bit:

        
InlineCall FastCall unsigned int clipByte6 (int value)
{
value &= ~(value >> 31) ;
value |= (255 - value) >> 31 ;
return LOBYTE(value) ;
}


It is as fast as the asm.

But that is because it is just impossible
to get optimal code when using inline asm in C
(OK, for me 8) and the compilers I have used up to now).

Why isn't the following possible, letting the
compiler allocate the registers to use?

  
InlineCall FastCall int clipByte3 (register a)
{
register b, c ;
__asm
{
xor b, b
mov c, 255
cmp a, b
cmovs a, b
cmp a, c
cmova a, c
}
return a;
}


Extended asm in gcc...

        
inline unsigned int clipByte4 (register int value)
{
asm
(
"cmpl %1, %0 \n\t"
"cmovs %1, %0 \n\t"
"cmpl %2, %0 \n\t"
"cmova %2, %0 "
: "=r" (value) : "r" (0), "r" (255)
);
return value ;
}


That makes it possible, but with optimization turned on it fails.
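
One likely cause, offered as an assumption rather than something confirmed in the thread, is the `"=r"` constraint. It tells GCC that `value` is write-only, so an optimizing build is free not to load the incoming value into `%0` before the asm runs; without optimization the `register` variable may simply happen to be in that register already. The sketch below marks the operand read-write with `"+r"` and declares the flags as clobbered with `"cc"`. The function name `clip_asm` is hypothetical, the asm path is guarded so it is only used on GCC-compatible x86 compilers, and it needs a CPU with cmov (Pentium Pro or later).

```c
static unsigned int clip_asm(int value)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __asm__ (
        "cmpl  %1, %0 \n\t"   /* set flags from value - 0 */
        "cmovs %1, %0 \n\t"   /* negative: value = 0 */
        "cmpl  %2, %0 \n\t"   /* set flags from value - 255 */
        "cmova %2, %0"        /* above 255 (unsigned compare): value = 255 */
        : "+r" (value)        /* value is both read and written */
        : "r" (0), "r" (255)  /* the compiler picks the registers */
        : "cc"                /* the condition codes are modified */
    );
#else
    /* Portable fallback for other compilers and targets. */
    if (value < 0)
        value = 0;
    else if (value > 255)
        value = 255;
#endif
    return (unsigned int)value;
}
```

With the operand declared read-write the compiler still chooses the registers itself, which is what the hypothetical `register a, b, c` version above was asking for.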

/micca

Edited by - micca on November 29, 2001 9:27:59 AM

Edited by - micca on November 29, 2001 9:29:27 AM

Edited by - micca on November 29, 2001 9:31:22 AM
