#### Archived


## Recommended Posts

Some thoughts inspired by clipByte.
-----------------------------------

Compilers and CPUs sometimes strike back at inline ASM coders... or, what is wrong?? Strange things happen on my Athlon XP. I tested different optimizations ... speed ..size ..etc..

VC++ 6.0 project, if someone wants to enlighten me... http://www.kulturteknologerna.se/Projext.zip
```cpp
inline unsigned __int8 clipByte1 (int value)
{
    // possible to reduce... but we let the compiler take care of it...
    value = (0   & (-(int)(value < 0)))   | (value & (-(int)!(value < 0)));
    value = (255 & (-(int)(value > 255))) | (value & (-(int)!(value > 255)));
    return value;
}
/*
    mov   ecx, DWORD PTR _value$[esp-4]
    xor   eax, eax
    test  ecx, ecx
    setge al
    neg   eax
    and   ecx, eax
    cmp   ecx, 255          ; 000000ffH
    setle al
    neg   al
    and   al, cl
    cmp   ecx, 255          ; 000000ffH
    setg  cl
    neg   cl
    or    al, cl
    ret   0
*/

inline unsigned __int8 clipByte2 (int value)
{
    return value < 0 ? 0 : value > 255 ? 255 : value;
}
/*
    mov   eax, DWORD PTR _value$[esp-4]
    test  eax, eax
    jge   SHORT $L1744
    xor   eax, eax
$L1744:
    cmp   eax, 255          ; 000000ffH
    jle   SHORT $L1743
    mov   eax, 255          ; 000000ffH
$L1743:
    ret   0
*/

// No jumps here
inline unsigned __int8 clipByte3 (int value)
{
    __asm   // try out different instruction orders !!
    {
        mov   eax, value
        mov   ecx, 0
        cmp   eax, ecx
        cmovs eax, ecx
        mov   ebx, 255
        cmp   eax, ebx
        cmova eax, ebx
    }
}

// same as clipByte2 !!
inline unsigned __int8 clipByte4 (int value)
{
    __asm
    {
        mov   eax, value
        test  eax, eax
        jge   SHORT GEZERO
        xor   eax, eax
GEZERO:
        cmp   eax, 255      // 000000ffH
        jle   SHORT LEFF
        mov   eax, 255      // 000000ffH
LEFF:
    }
}
```

Some numbers I got, in ticks. The "in" columns have the timer inside the loop, the "out" columns outside the loop, for clipByte1-4 respectively.

**Athlon XP, VOL1 = 0, VOL2 = 0, PRE = 0**

| Target | Optimization | in 1 | in 2 | in 3 | in 4 | out 1 | out 2 | out 3 | out 4 |
|--------|--------------|------|------|------|------|-------|-------|-------|-------|
| u486   | size         | 7    | 0    | 4    | 0    | 12    | 5     | 7     | 6     |
| u486   | speed        | 2    | 0    | 5    | 5    | 12    | 7     | 17    | 16    |
| uPPro  | size         | 7    | 0    | 4    | 0    | 12    | 5     | 7     | 6     |
| uPPro  | speed        | 0    | 1    | 3    | 0    | 11    | 7     | 8     | 7     |
| Blend  | size         | 7    | 0    | 4    | 0    | 12    | 5     | 7     | 6     |
| Blend  | speed        | 1    | 0    | 5    | 5    | 12    | 7     | 17    | 17    |

**Athlon XP, VOL1 = 1 (memory stall 1), VOL2 = 0, PRE = 0**

| Target | Optimization | in 1 | in 2 | in 3 | in 4 | out 1 | out 2 | out 3 | out 4 |
|--------|--------------|------|------|------|------|-------|-------|-------|-------|
| u486   | size         | 6    | 10   | 5    | 5    | 14    | 6     | 11    | 9     |
| u486   | speed        | 9    | 10   | 6    | 7    | 18    | 9     | 18    | 19    |
| uPPro  | size         | 6    | 10   | 5    | 5    | 14    | 6     | 11    | 9     |
| uPPro  | speed        | 7    | 4    | 5    | 5    | 16    | 8     | 11    | 9     |
| Blend  | size         | 6    | 10   | 5    | 5    | 14    | 6     | 11    | 9     |
| Blend  | speed        | 9    | 10   | 6    | 7    | 18    | 9     | 18    | 19    |

Optimizing for size is still a good bet. It seems clipByte2 is not a bad choice... does it succeed in jumping past the stall now and then, making it the faster solution??
Is jmp always bad?? The first four columns are Athlon XP, the last four are PII, for clipByte1-4 respectively (timer outside the loop).

**VOL1 = 1 (memory stall 1), VOL2 = 0, PRE = 0, uPPro**

| Optimization | XP 1 | XP 2 | XP 3 | XP 4 | PII 1 | PII 2 | PII 3 | PII 4 |
|--------------|------|------|------|------|-------|-------|-------|-------|
| size         | 14   | 6    | 11   | 9    | 17    | 7     | 13    | 10    |
| speed        | 16   | 8    | 11   | 9    | 16    | 9     | 12    | 10    |

**VOL1 = 1 (memory stall 1), VOL2 = 1 (memory stall 2), PRE = 0, uPPro**

| Optimization | XP 1 | XP 2 | XP 3 | XP 4 | PII 1 | PII 2 | PII 3 | PII 4 |
|--------------|------|------|------|------|-------|-------|-------|-------|
| size         | 14   | 12   | 9    | 7    | 17    | 8     | 12    | 10    |
| speed        | 16   | 8    | 10   | 9    | 16    | 9     | 12    | 10    |

**VOL1 = 1 (memory stall 1), VOL2 = 1 (memory stall 2), PRE = 1 (preempt ... let someone else use the CPU and have some bad luck..), uPPro**

| Optimization | XP 1 | XP 2 | XP 3 | XP 4 | PII 1 | PII 2 | PII 3 | PII 4 |
|--------------|------|------|------|------|-------|-------|-------|-------|
| size         | 14   | 12   | 11   | 10   | 17    | 8     | 12    | 10    |
| speed        | 16   | 15   | 11   | 10   | 16    | 9     | 12    | 10    |
So what is going on here?

/micca

##### Share on other sites
Conditional jumps are bad for your pipeline. The optimizer should know that.

##### Share on other sites
Yes, that was what I thought...
but cmp/jmp is faster here !!

What is the penalty?

##### Share on other sites
In my personal experience, CMOVs aren't as good as they are supposed to be... sometimes, if you have a well-defined branch pattern, it's better to leave the conditional jmps in, since cmovs don't benefit from the branch-prediction logic and jmps do.

Matt

##### Share on other sites
Dude, Athlon XPs are superscalar. Writing optimized asm for them is, as Meyers would say, challenging.

On 586 and up it is often faster to do more work if you can eliminate a jump and/or replace it with a loop.

Alignment makes a HUGE difference. I've seen a couple of nops take 30% off the execution time of small routines.

Isn't there a single opcode to sign-extend and clip a DWORD to a BYTE and vice versa?

##### Share on other sites
The thing with clipByte1 is that the compiler generates a mixture of code that manipulates both 8-bit and 32-bit registers at the same time, for example %al and %eax. The CPU doesn't like this very much (because of register-renaming issues). It can add stalls of up to 6 cycles each time you do that.

So let''s say you do something like this:

mov al, 5       ; write only the low byte of eax
mov ebx, eax    ; then read the full eax

The first instruction causes the CPU to create a temporary register aliased to al, so that it can perform the operation now and write it back later. It's kind of like disk caching, but on CPU registers. Now at the second line you access eax, of which al is a part. The CPU needs the new value of eax before it can do anything with it, which forces the temporary al to be written back first, and that takes time. And tada! A 6-cycle stall.

I suppose MSVC generated that code? Try it out with gcc (use -S to get the assembly output) and you may get better results.

##### Share on other sites
Hi again .. I tested gcc and it does a better job with:

```cpp
inline unsigned __int8 clipByte1 (int value)
{
    value = (0   & (-(int)(value < 0)))   | (value & (-(int)!(value < 0)));
    value = (255 & (-(int)(value > 255))) | (value & (-(int)!(value > 255)));
    return value;
}
/* MSVC
    mov   ecx, DWORD PTR _value$[esp-4]
    xor   eax, eax
    test  ecx, ecx
    setge al
    neg   eax
    and   ecx, eax
    cmp   ecx, 255          ; 000000ffH
    setle al
    neg   al
    and   al, cl
    cmp   ecx, 255          ; 000000ffH
    setg  cl
    neg   cl
    or    al, cl
    ret   0
*/
/* GCC
#APP
STARTclipbyte1:             # STARTc
#NO_APP
    xorl   %ecx,%ecx
    movl   %ecx,%edx
    cmpl   $0,8(%ebp)
    cmovge 8(%ebp),%edx
    movl   $255,%ebx
    movl   %edx,%eax
    orl    %ecx,%eax
    cmpl   $255,%edx
    cmovle %eax,%ebx
#APP
STOPclipbyte1:              # STOPc
#NO_APP
*/
```

On PII the latter takes 11 ticks instead of 20, that's nice.

But the interesting thing is that on Athlon XP both run in 10 ticks !!!

So the CPU strikes back, dudes.

For the following, GCC does worse than MSVC:

```cpp
inline unsigned __int8 clipByte2 (int value)
{
    return value < 0 ? 0 : value > 255 ? 255 : value;
}
/* MSVC
    mov  eax, DWORD PTR _value$[esp-4]
    test eax, eax
    jge  SHORT $L1744
    xor  eax, eax
$L1744:
    cmp  eax, 255           ; 000000ffH
    jle  SHORT $L1743
    mov  eax, 255           ; 000000ffH
$L1743:
    ret  0
*/
/* GCC
#APP
STARTclipbyte2:             # STARTcb2
#NO_APP
    testl %edx,%edx
    jl    L511
    movl  $255,%eax
    cmpl  $256,%edx
    cmovl %edx,%eax
    jmp   L513
    .p2align 4,,7
L511:
    xorl  %eax,%eax
L513:
#APP
STOPclipbyte2:              # STOPcb2
#NO_APP
*/
```

On PII the first runs in 6 ticks and the latter takes 8.

On Athlon XP the same ... that is, 6 for MSVC and 8 for GCC.

OK, thanks for your input..

/micca


##### Share on other sites
try this:

```cpp
// it's better to return BYTE as DWORD.
inline DWORD ClipByte( int x )
{
    return ( x | -(DWORD(x)>255) ) & ~(x>>31);
}
```

##### Share on other sites
Super, Serge.

I rearranged it some ...

```cpp
InlineCall FastCall unsigned int clipByte6 (int value)
{
    value &= ~(value >> 31);
    value |= (255 - value) >> 31;
    return LOBYTE(value);
}
```

It is as fast as the asm.

But that is because the inline asm is just impossible to get optimal when using inline asm in C (OK, at least for me 8) ... and with the compilers I have used up to now).

Why isn't the following possible, letting the compiler allocate the registers to use..?

```cpp
InlineCall FastCall int clipByte3 (register int a)
{
    register int b, c;
    __asm
    {
        xor   b, b
        mov   c, 255
        cmp   a, b
        cmovs a, b
        cmp   a, c
        cmova a, c
    }
    return a;
}
```

Extended asm in gcc...

```cpp
inline unsigned int clipByte4 (register int value)
{
    asm ( "cmpl  %1, %0 \n\t"
          "cmovs %1, %0 \n\t"
          "cmpl  %2, %0 \n\t"
          "cmova %2, %0"
          : "=r" (value)
          : "r" (0), "r" (255) );
    return value;
}
```

This makes it possible ... but with optimization on, it fails.

/micca

