i1977

x86 assembly language question


My assembly language is rusty, and I can't remember how to copy a 32-bit value multiple times into an array. I vaguely remember that there is a specific instruction for that: you put the count in cx (I think) and the destination address in some other register, then execute an instruction that does the work. Can someone please refresh my memory? Example:

const int nTimesToCopy = 512;
const DWORD dwValue = 0xFF00FF00;

__asm
{
    // what goes here?
}

I believe rep stosd is what you're looking for. It'll take whatever value is in eax and put it at es:edi, add 4 to edi, and repeat that ecx times.
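
For readers who think more easily in C, the behaviour described above can be modelled roughly as follows. This is only a sketch: `model_rep_stosd` is a made-up name, the parameters are named after the registers they stand in for, and the model assumes the direction flag is clear so that EDI moves forward.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): a plain C model of what
   `rep stosd` does when the direction flag is clear. */
static uint32_t *model_rep_stosd(uint32_t *edi, uint32_t eax, size_t ecx)
{
    assert(edi != NULL || ecx == 0);
    while (ecx != 0) {   /* rep: repeat until ECX reaches zero */
        *edi = eax;      /* stosd: store EAX at [EDI] */
        edi += 1;        /* advance EDI by one dword (4 bytes) */
        ecx -= 1;        /* rep decrements ECX after each store */
    }
    return edi;          /* EDI ends up one past the last dword written */
}
```

Calling `model_rep_stosd(buf, 0xFF00FF00u, 512)` on a buffer of at least 512 dwords corresponds to the case in the question.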

Well, first make sure that EDI points to the destination array, load ECX with the number of DWORDs to write, then load EAX with the value to store. Then just REP STOSD.

Something along the lines of:

const int nTimesToCopy = 512;
const DWORD dwValue = 0xFF00FF00;
unsigned char array[ nTimesToCopy * sizeof( dwValue )];

__asm
{
    cld                      // make sure stosd moves forward
    lea EDI, array           // EDI = address of the array (mov would load its contents)
    mov ECX, nTimesToCopy    // number of DWORDs to store
    mov EAX, dwValue         // value to store
    rep stosd                // store EAX at [EDI], ECX times
}

No guarantees that it will work, as it's from memory, but it gives you the general idea.

Thanks! I'll try that.

Washu, the reason I can't use memset is that even though it takes an int as a parameter, it only copies 1 byte repeatedly, not 4 bytes like what I need to do.

Hello!

I'm not sure you'll do better than a C version of your copy algorithm. Compilers generally do a very good job with loops.
If you still want to go the assembly way, you may want to look at the memcpy() code. It is heavily optimized. Maybe you just need to modify it so that it writes 32-bit ints instead of bytes.

HTH,

Quote:
Original post by i1977
Washu, the reason I can't use memset is that even though it takes an int as a parameter, it only copies 1 byte repeatedly, not 4 bytes like what I need to do.


Are you sure? I was under the impression that it would copy the largest blocks possible until the remaining number of bytes is smaller than said block. Then it would copy any remaining bytes the slow way.

Hi smr,
memset() fills a block of memory with a single byte value. Whether it writes 4 bytes at a time or not is simply a matter of optimisation. The OP wants to initialise a block of memory with a repeating pattern of 4 different bytes, which is a completely different operation, and memset does not allow that.

Truly,
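
A small experiment makes the point concrete. The sketch below uses a hypothetical helper, `first_dword_after_memset`, and relies only on the standard rule that memset converts its `int` argument to `unsigned char` before filling. Every byte of the buffer receives the low byte of the value, so a 4-byte pattern cannot survive.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the thread): fill a small buffer with
   memset and report what the first dword looks like afterwards. */
static uint32_t first_dword_after_memset(int value)
{
    uint32_t buf[4];
    memset(buf, value, sizeof buf);   /* value is converted to unsigned char */
    assert(buf[0] == buf[3]);         /* every dword holds the same repeated byte */
    return buf[0];
}
```

For example, `first_dword_after_memset(0x12345678)` yields `0x78787878`. With the thread's `0xFF00FF00` the low byte is `0x00`, so on typical compilers memset would simply zero the buffer.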

Quote:
Original post by Dmytry
Also, I'd compare the performance of rep with the performance of a loop... rep might actually be slower :(
Quote:
From How to Optimize for the Pentium® Family of Microprocessors (pentopt.pdf) section 18.4

...


REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. Always use the DWORD version if possible, and make sure that both source and destination are aligned by 8.

...

On PPro, P2 and P3, REP MOVS and REP STOS can perform fast by moving an entire cache line at a time. This happens only when the following conditions are met:
• both source and destination must be aligned by 8
• direction must be forward (direction flag cleared)
• the count (ECX) must be greater than or equal to 64
• the difference between EDI and ESI must be numerically greater than or equal to 32
• the memory type for both source and destination must be either write-back or write-combining (you can normally assume this).
Under these conditions, the number of uops issued is approximately 215+2*ECX for REP MOVSD and 185+1.5*ECX for REP STOSD, giving a speed of approximately 5 bytes per clock cycle for both instructions, which is almost 3 times as fast as when the above conditions are not met.

I'm not entirely sure how long an explicit loop would take (whether built from sub or dec plus a jcc, or from the loop instruction, and whether the body uses simple pairable instructions such as mov and sub or a complex one such as stosd), but I'm very sure rep stosd is the fastest way to do it in this case.
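
The quoted conditions can be turned into a quick check for the REP STOSD case. The following is only a sketch with made-up names (`stosd_fast_path_likely`, `stosd_uops_estimate`). It covers the two conditions that portable C can test, namely alignment by 8 and a count of at least 64. The direction flag and the memory type cannot be inspected this way, and the EDI/ESI distance rule is left out on the assumption that it only matters when there is a source, as with REP MOVS. The estimate is simply the quoted 185+1.5*ECX figure for PPro, P2 and P3, not a measurement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: tests the quoted fast-path conditions that can be
   checked from C for a REP STOSD fill. */
static bool stosd_fast_path_likely(const void *dst, size_t dword_count)
{
    bool aligned_by_8 = ((uintptr_t)dst % 8u) == 0;  /* destination aligned by 8 */
    bool count_ok = dword_count >= 64;               /* ECX >= 64 */
    return aligned_by_8 && count_ok;
}

/* Hypothetical helper: the quoted approximate uop count for REP STOSD,
   185 + 1.5 * ECX, which only applies on the fast path. */
static double stosd_uops_estimate(size_t dword_count)
{
    assert(dword_count >= 64);
    return 185.0 + 1.5 * (double)dword_count;
}
```

For the 512-dword fill in the original question, the formula gives 185 + 1.5 * 512 = 953 uops, provided the destination is aligned by 8.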

As the quote Ra provided shows, the registers used with rep movsd are esi and edi.

Here's an example that uses this construct to copy arguments to a stack (Source).



DWORD Call_cdecl( const void* args, size_t sz, DWORD func )
{
    DWORD rc;               // here's our return value...
    __asm
    {
        mov ecx, sz         // get size of buffer in bytes (must be a multiple of 4)
        mov esi, args       // get buffer
        sub esp, ecx        // allocate stack space
        mov edi, esp        // start of destination stack frame
        shr ecx, 2          // make it dwords
        rep movsd           // copy params to real stack
        call [func]         // call the function
        mov rc, eax         // save the return value
        add esp, sz         // restore the stack pointer
    }
    return ( rc );
}


Then write it in C...


void* memset4(void* t, unsigned int val, size_t count) {
    void *dst = t;
    while(count--) {
        *(unsigned int*)dst = val;
        dst = (unsigned int*)dst + 1;
    }
    return t;
}


And since I just KNOW you're going to say: "Well, that's not optimized."

Oh?

004014BD B8 34 12 00 00 mov eax,1234h
004014C2 B9 64 00 00 00 mov ecx,64h
004014C7 8D 7C 24 08 lea edi,[esp+8]
004014CB F3 AB rep stos dword ptr [edi]


That's what the above code generates.

Oh... and I suppose __asm ISN'T? HRM????

I used Visual Studio .Net 2003 Enterprise Architect.

If your compiler generates anything but code that is VERY VERY similar to that...it's a piece of shit and you should probably upgrade. The VC++ Toolkit IS free.

Washu,

When I said "compiler specific", I wasn't referring to Microsoft extensions such as __asm. I know that's not portable and I don't really care, since I won't be compiling my code with another compiler anyway. What I meant is that your C function might not compile to the same optimized asm code on another compiler such as gcc, Borland or whatever.

I tried compiling your C function and it does indeed produce optimized assembly code for the Release build target. I must admit that I am very surprised about this, but nonetheless, I will use inline assembly code anyway because the code generated for the Debug build target is much less optimized.

Oh, and what does "HRM????" mean by the way?
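
On the portability concern, one option is to keep the fill in standard C and check the generated code on each compiler of interest. The sketch below is not from the thread: `fill_u32` is a made-up name, and it uses memcpy for each store so that the destination needs no particular alignment or declared type. Whether a given compiler turns it into rep stosd, a vectorized loop, or something else is not established here and would need to be checked in the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: writes `count` copies of a 32-bit value starting
   at dst and returns dst, like memset does. */
static void *fill_u32(void *dst, uint32_t value, size_t count)
{
    unsigned char *p = dst;
    assert(dst != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        /* memcpy of 4 bytes is well defined for any alignment of p */
        memcpy(p + i * sizeof value, &value, sizeof value);
    }
    return dst;
}
```

Usage mirrors the original question: `fill_u32(array, 0xFF00FF00u, 512)` for a buffer of at least 512 * 4 bytes.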

Quote:
Original post by i1977
I will use inline assembly code anyway because the code generated for the Debug build target is much less optimized.


Why do you want to optimise your debug build?

Quote:
Original post by i1977
Washu,

When I said "compiler specific", I wasn't referring to Microsoft extensions such as __asm. I know that's not portable and I don't really care, since I won't be compiling my code with another compiler anyway. What I meant is that your C function might not compile to the same optimized asm code on another compiler such as gcc, Borland or whatever.

I tried compiling your C function and it does indeed produce optimized assembly code for the Release build target. I must admit that I am very surprised about this, but nonetheless, I will use inline assembly code anyway because the code generated for the Debug build target is much less optimized.

Oh, and what does "HRM????" mean by the way?

ROFL! You think that the few nanoseconds gained by having an optimized memset in a debug build are going to help in debug mode? Well, they won't. All debug mode code is unoptimized, so your memset is not going to give you any speedup. In fact, it will make debugging harder because it won't have certain things built into it that my C function does.

Quote:
Original post by CGameProgrammer
I believe memset is "rep stosb" and memcpy is "rep movsb". In other words they are both single-byte only (what the last 'b' stands for).


Hello,

This is stolen from the VC6 memcpy:


CopyUp:
test edi,11b ;U - destination dword aligned?
jnz short CopyLeadUp ;V - if we are not dword aligned already, align

shr ecx,2 ;U - shift down to dword count
and edx,11b ;V - trailing byte count

cmp ecx,8 ;U - test if small enough for unwind copy
jb short CopyUnwindUp ;V - if so, then jump

rep movsd ;N - move all of our dwords

jmp dword ptr TrailUpVec[edx*4] ;N - process trailing bytes



You can find it in ${Vs6Dir}\vc98\crt\intel\memcpy.asm.

In a similar fashion, memset also uses "rep stosd".

Both functions are heavily optimized for Intel processors and take some neat features into account: pointer alignment, instruction pairing, and so on. This is why I think that an optimized version of a 4-byte memset should use them as base code.

HTH,
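
The overall shape of the CopyUp excerpt can be sketched in C. This is an illustration only, with a made-up name (`copy_up_sketch`), not a translation of the VC6 source: the unrolled path for small counts (CopyUnwindUp) and the jump table for trailing bytes are replaced by simple loops, and the buffers are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of a forward copy in three phases: leading bytes
   until the destination is dword aligned, a dword bulk copy, then the
   trailing bytes. */
static void *copy_up_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert((d != NULL && s != NULL) || n == 0);

    /* Leading bytes: advance until the destination is dword aligned. */
    while (n != 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        --n;
    }

    size_t dwords = n >> 2;     /* shr ecx,2: dword count */
    size_t trailing = n & 3u;   /* and edx,11b: trailing byte count */

    /* Bulk copy, one dword per step: the job rep movsd does in the excerpt. */
    for (size_t i = 0; i < dwords; ++i) {
        uint32_t tmp;
        memcpy(&tmp, s, sizeof tmp);   /* source may be unaligned */
        memcpy(d, &tmp, sizeof tmp);
        s += sizeof tmp;
        d += sizeof tmp;
    }

    /* Trailing bytes that did not fill a whole dword. */
    while (trailing-- != 0) {
        *d++ = *s++;
    }
    return dst;
}
```

A 4-byte-pattern fill built on the same idea would replace the source reads with a constant value, which is where rep stosd comes in.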

Quote:
Original post by i1977
Thanks for your help guys! Even Washu, who apparently knows more about what I need and am trying to do. ;)

I do. My years of experience tell me that dropping to assembly language for something as simple as a memset4 is a silly idea, especially when it will cost you in debuggability. (Of course, if you don't use the debugger then you have far worse problems.) :)

Not to mention: premature optimization is the devil. Profile first.
