x86 assembly language question

21 comments, last by Washu 19 years, 7 months ago
Also, I'd compare the performance of rep with the performance of a plain loop... rep might actually be slower :(
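One way to make that comparison is a small timing harness. The sketch below is only an illustration: `fill_loop`, `time_fill` and `fill_sink` are made-up names, and it times a plain C store loop with `clock()`. To compare against rep stosd you would time the inline-assembly version under the same harness, with the same buffer size and repeat count.

```c
#include <stddef.h>
#include <time.h>

/* Written once per pass so the compiler cannot discard the fills as dead stores. */
volatile unsigned int fill_sink;

/* Plain dword store loop; what the compiler turns this into is up to the compiler. */
void fill_loop(unsigned int *dst, unsigned int val, size_t count)
{
    while (count--) {
        *dst++ = val;
    }
}

/* Runs `repeats` fills of `count` dwords and returns the elapsed CPU time in seconds. */
double time_fill(unsigned int *buf, size_t count, unsigned long repeats)
{
    unsigned long i;
    clock_t start = clock();

    for (i = 0; i < repeats; ++i) {
        fill_loop(buf, (unsigned int)i, count);
        if (count > 0) {
            fill_sink = buf[count - 1];  /* read back one element per pass */
        }
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

`clock()` is coarse, so the repeat count needs to be large enough that the total runs for a good fraction of a second before the numbers mean anything.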
Quote:Original post by Dmytry
Also, I'd compare the performance of rep with the performance of a plain loop... rep might actually be slower :(
Quote:From How to Optimize for the Pentium® Family of Microprocessors (pentopt.pdf) section 18.4

...


REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. Always use
the DWORD version if possible, and make sure that both source and destination are aligned
by 8.

...

On PPro, P2 and P3, REP MOVS and REP STOS can perform fast by moving an entire
cache line at a time. This happens only when the following conditions are met:
• both source and destination must be aligned by 8
• direction must be forward (direction flag cleared)
• the count (ECX) must be greater than or equal to 64
• the difference between EDI and ESI must be numerically greater than or equal to 32
• the memory type for both source and destination must be either write-back or write-combining
(you can normally assume this).
Under these conditions, the number of uops issued is approximately 215+2*ECX for REP
MOVSD and 185+1.5*ECX for REP STOSD, giving a speed of approximately 5 bytes per clock
cycle for both instructions, which is almost 3 times as fast as when the above conditions are
not met.
I'm not entirely sure how long a loop would take (whether it uses jcc with sub or dec, or the loop instruction, around simple pairable instructions such as mov and sub or a complex instruction such as stosd), but I'm very sure rep stosd is the fastest way to do it in this case.
Ra
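The conditions and uop formulas in the quoted passage can be written down as a few lines of C. This is just a sketch of the quoted numbers, not a measurement: the function names are invented here, and "numerically greater than or equal to 32" is read as the absolute distance between the two pointers, which is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when the quoted PPro/P2/P3 fast-path conditions hold for a forward
   copy (direction flag clear; memory type assumed write-back or write-combining). */
int rep_movsd_fast_path(const void *src, const void *dst, size_t dword_count)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;
    uintptr_t gap = (d > s) ? (d - s) : (s - d);  /* assumed: absolute distance */

    return (s % 8 == 0)           /* source aligned by 8 */
        && (d % 8 == 0)           /* destination aligned by 8 */
        && (dword_count >= 64)    /* ECX >= 64 */
        && (gap >= 32);           /* |EDI - ESI| >= 32 */
}

/* Approximate uops issued on the fast path, per the quoted formulas. */
unsigned long rep_movsd_uops(unsigned long ecx)
{
    return 215UL + 2UL * ecx;
}

unsigned long rep_stosd_uops(unsigned long ecx)
{
    return 185UL + (3UL * ecx) / 2UL;  /* 185 + 1.5 * ECX, rounded down */
}
```

For the minimum count of 64 dwords the formulas give 343 uops for REP MOVSD and 281 for REP STOSD, so the fixed startup cost dominates for small counts, which fits the "if the repeat count is not too small" remark.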
As the quote Ra provided shows, the registers used with rep movsd are esi and edi.

Here's an example that uses this construct to copy arguments to a stack (Source).


DWORD Call_cdecl( const void* args, size_t sz, DWORD func )
{
    DWORD rc;               // here's our return value...
    __asm
    {
        mov   ecx, sz       // get size of buffer
        mov   esi, args     // get buffer
        sub   esp, ecx      // allocate stack space
        mov   edi, esp      // start of destination stack frame
        shr   ecx, 2        // make it dwords
        rep   movsd         // copy params to real stack
        call  [func]        // call the function
        mov   rc,  eax      // save the return value
        add   esp, sz       // restore the stack pointer
    }
    return ( rc );
}


"I thought what I'd do was, I'd pretend I was one of those deaf-mutes." - the Laughing Man
Then write it in C...

void* memset4(void* t, unsigned int val, size_t count) {
	void *dst = t;
	while(count--) {
		*(unsigned int*)dst = val;
		dst = (unsigned int*)dst + 1;
	}
	return t;
}


And since I just KNOW you're going to say: "Well, that's not optimized."

Oh?
004014BD B8 34 12 00 00   mov         eax,1234h
004014C2 B9 64 00 00 00   mov         ecx,64h
004014C7 8D 7C 24 08      lea         edi,[esp+8]
004014CB F3 AB            rep stos    dword ptr [edi]


That's what the above code generates.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.
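The constants in the listing (eax = 1234h, ecx = 64h, i.e. 100 dwords, with edi pointing into the stack frame) suggest a call site roughly like the one below. The actual call wasn't posted, so `fill_example` and its local buffer are guesses for illustration; `memset4` is the same loop as above, written with a typed pointer.

```c
#include <stddef.h>

/* Same behaviour as the memset4 posted above: store `val` into `count` dwords. */
void *memset4(void *t, unsigned int val, size_t count)
{
    unsigned int *dst = t;
    while (count--) {
        *dst++ = val;
    }
    return t;
}

/* Hypothetical call site matching the listing: value 0x1234, count 0x64 (100). */
unsigned int fill_example(void)
{
    unsigned int buffer[100];       /* local array, so it lives on the stack */

    memset4(buffer, 0x1234, 100);
    return buffer[0] + buffer[99];  /* use the result so the fill is kept */
}
```

With the value and count known at compile time and the function inlined, the compiler has everything it needs to reduce the loop to a three-instruction setup plus rep stos.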

Isn't that very compiler specific though? What compiler did you use for this?
Frederic Ferland, Strategy First, Inc., http://www.strategyfirst.com
Oh... and I suppose __asm ISN'T? HRM????

I used Visual Studio .Net 2003 Enterprise Architect.

If your compiler generates anything but code that is VERY VERY similar to that...it's a piece of shit and you should probably upgrade. The VC++ Toolkit IS free.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

I believe memset is "rep stosb" and memcpy is "rep movsb". In other words they are both single-byte only (what the last 'b' stands for).
~CGameProgrammer( );
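Which instruction a particular library's memset ends up using varies between implementations, and nothing in this thread settles it. The byte versus dword distinction itself is easy to show in C, though: filling with a byte is the same as filling with a dword made of four copies of that byte. The sketch below assumes a 32-bit `unsigned int`; `replicate_byte` and `fills_match` are names invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Builds a dword whose four bytes all equal `b` (assumes 32-bit unsigned int). */
unsigned int replicate_byte(unsigned char b)
{
    return (unsigned int)b * 0x01010101u;
}

/* Returns 1 if a byte-wise fill and a dword-wise fill leave identical memory. */
int fills_match(unsigned char b)
{
    unsigned char bytes[64];
    unsigned int dwords[16];
    size_t i;

    memset(bytes, b, sizeof bytes);      /* one byte at a time, conceptually */
    for (i = 0; i < 16; ++i) {
        dwords[i] = replicate_byte(b);   /* four bytes per store */
    }
    return memcmp(bytes, dwords, sizeof bytes) == 0;
}
```

So a byte-valued fill doesn't have to be done one byte per store; an implementation is free to do the bulk in dwords and finish any leftover bytes separately.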
Washu,

When I said "compiler specific", I wasn't referring to Microsoft extensions such as __asm. I know that's not portable and I don't really care since I won't be compiling my code with another compiler anyway. What I meant is that your C function might not compile to the same optimized asm code on another compiler such as gcc, Borland or whatever.

I tried compiling your C function and it does indeed produce optimized assembly code for the Release build target. I must admit that I am very surprised about this, but nonetheless, I will use inline assembly code anyway because the code generated for the Debug build target is much less optimized.

Oh, and what does "HRM????" mean by the way?
Frederic Ferland, Strategy First, Inc., http://www.strategyfirst.com
Quote:Original post by i1977
I will use inline assembly code anyway because the code generated for the Debug build target is much less optimized.


Why do you want to optimise your debug build?
Quote:Original post by i1977
Washu,

When I said "compiler specific", I wasn't referring to Microsoft extensions such as __asm. I know that's not portable and I don't really care since I won't be compiling my code with another compiler anyway. What I meant is that your C function might not compile to the same optimized asm code on another compiler such as gcc, Borland or whatever.

I tried compiling your C function and it does indeed produce optimized assembly code for the Release build target. I must admit that I am very surprised about this, but nonetheless, I will use inline assembly code anyway because the code generated for the Debug build target is much less optimized.

Oh, and what does "HRM????" mean by the way?

ROFL! You think that the few nanoseconds gained by having an optimized memset in a debug build are going to help in debug mode? Well, they won't. All debug mode code is unoptimized, so your memset is not going to give you any speed-up. In fact, it will make debugging harder because it won't have certain things built into it that my C function does.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

This topic is closed to new replies.
