 x86 assembly language question
My assembly language is rusty and I can't remember how to copy a 32-bit value multiple times into an array. I vaguely remember that there is a specific instruction for that, where you put the count in cx (I think), the destination address in some other register, and then execute the instruction to do the work. Can someone please refresh my memory?

Ex:

const int nTimesToCopy = 512;
const DWORD dwValue = 0xFF00FF00;

__asm
{
// what goes here?
}



I believe rep stosd is what you're looking for. It stores whatever value is in eax at es:edi, adds 4 to edi (assuming the direction flag is clear), and repeats that ecx times.


Well, first make sure that EDI points to the destination array, load ECX with the number of dwords to write, and load EAX with the value to store. Then just REP STOSD.
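For anyone who prefers to read it as C, here is a rough model of what rep stosd does with those three registers. It is only a sketch that assumes the direction flag is clear; the function name rep_stosd_model is made up for illustration, and the parameters are simply named after the registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative C model of `rep stosd` with the direction flag clear.
   The parameter names mirror the registers; the function name is made up. */
static uint32_t *rep_stosd_model(uint32_t *edi, uint32_t eax, size_t ecx)
{
    assert(edi != NULL || ecx == 0);
    while (ecx != 0) {   /* REP: repeat until ECX reaches zero       */
        *edi = eax;      /* STOSD: store EAX at [EDI]                */
        edi += 1;        /* EDI advances by 4 bytes (one dword)      */
        ecx -= 1;        /* REP decrements ECX after each store      */
    }
    return edi;          /* EDI ends one past the last dword written */
}
```

Calling rep_stosd_model(buf, 0xFF00FF00, 512) would correspond to the fill the original poster asked about.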


rep will work, but there is actually a string copy instruction... but I can't remember it either... sorry :(


Of course, the real question is: Why not just use memset?


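Something along the lines of:

const int nTimesToCopy = 512;
const DWORD dwValue = 0xFF00FF00;
unsigned char array[ nTimesToCopy * sizeof( dwValue )];

__asm
{
    cld                    // make sure stosd moves forward through memory
    lea edi, array         // EDI = address of the destination buffer
    mov ecx, nTimesToCopy  // ECX = number of dwords to store
    mov eax, dwValue       // EAX = the 32-bit value to store
    rep stosd              // store EAX at [EDI], ECX times
}

No guarantees that it will work, as it's from memory, but it gives you the general idea. Note that EDI must hold the address of array, which is why this uses lea: mov EDI, [array] would load the first four bytes stored in the array instead.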


Thanks! I'll try that.

Washu, the reason I can't use memset is that even though it takes an int as a parameter, it only copies 1 byte repeatedly, not 4 bytes like what I need to do.
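A small C sketch of that limitation, in case it helps: memset converts its int argument to unsigned char, so only the low byte of the value ends up in memory. The helper name first_dword_after_memset is hypothetical. With 0x12345678 every dword comes out as 0x78787878, and with a value whose low byte is 0x00 (as in 0xFF00FF00) the buffer is simply zeroed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fills `count` dwords with memset and returns the first resulting dword.
   memset converts its int argument to unsigned char, so only the low byte
   of `value` is used. The function name is hypothetical. */
static uint32_t first_dword_after_memset(uint32_t *dst, size_t count, int value)
{
    assert(dst != NULL && count > 0);
    memset(dst, value, count * sizeof *dst);
    return dst[0];
}
```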


Hello !

Not sure you'll do better than a C version of your copy algorithm. Compilers generally do a very good job with loops.
If you still want to go the assembly way, you may want to look at the memcpy() code. It is heavily optimized. Maybe you just need to modify it so that it writes 32-bit ints instead of bytes.

HTH,


-- Emmanuel D. [blog, in French] [blog, very bad googlized translation] [NEW: English version of teh blog! (WIP)]


Quote:
Original post by i1977
Washu, the reason I can't use memset is that even though it takes an int as a parameter, it only copies 1 byte repeatedly, not 4 bytes like what I need to do.


Are you sure? I was under the impression that it would copy the largest blocks possible until the remaining number of bytes is smaller than said block. Then it would copy any remaining bytes the slow way.


Hi smr,
memset() fills a block of memory with a single byte value. Whether it writes 4 bytes at a time or not is simply a matter of optimisation. The OP wants to initialise a block of memory with a repeating pattern of 4 different bytes, which is a completely different thing, and memset does not allow that.
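To illustrate the distinction, here is a sketch of a byte fill that writes four bytes per step. It is not the actual CRT code, and the name byte_fill_wordwise is invented for this example. The point is that the 32-bit word it stores is always the same byte repeated four times, so the result is still a single-byte fill however wide the stores are.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte fill that writes four bytes at a time where it can.
   Hypothetical name; not the actual CRT implementation. */
static void *byte_fill_wordwise(void *dst, unsigned char byte, size_t n)
{
    unsigned char *p = dst;
    uint32_t word = (uint32_t)byte * 0x01010101u; /* same byte in all four lanes */

    assert(dst != NULL || n == 0);
    while (n >= sizeof word) {          /* bulk: one 32-bit store per step */
        memcpy(p, &word, sizeof word);  /* memcpy sidesteps alignment and aliasing issues */
        p += sizeof word;
        n -= sizeof word;
    }
    while (n-- > 0)                     /* tail: the remaining 0..3 bytes */
        *p++ = byte;
    return dst;
}
```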

Truly,

-- Emmanuel D.


Also, I'd compare the performance of rep with the performance of a loop... rep might actually be slower :(


Quote:
Original post by Dmytry
Also, I'd compare the performance of rep with the performance of a loop... rep might actually be slower :(
Quote:
From How to Optimize for the Pentium® Family of Microprocessors (pentopt.pdf) section 18.4

...


REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. Always use
the DWORD version if possible, and make sure that both source and destination are aligned
by 8.

...

On PPro, P2 and P3, REP MOVS and REP STOS can perform fast by moving an entire
cache line at a time. This happens only when the following conditions are met:
• both source and destination must be aligned by 8
• direction must be forward (direction flag cleared)
• the count (ECX) must be greater than or equal to 64
• the difference between EDI and ESI must be numerically greater than or equal to 32
• the memory type for both source and destination must be either write-back or write-combining (you can normally assume this).
Under these conditions, the number of uops issued is approximately 215+2*ECX for REP
MOVSD and 185+1.5*ECX for REP STOSD, giving a speed of approximately 5 bytes per clock
cycle for both instructions, which is almost 3 times as fast as when the above conditions are
not met.
I'm not entirely sure how long a hand-written loop would take (whether it is built from jcc with sub or dec, or from the loop instruction, and whether its body uses simple pairable instructions such as mov and sub or a complex one such as stosd), but I'm very sure rep stosd is the fastest way to do it in this case.
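As a worked example of the quoted estimates, the two formulas can be written as small C helpers. The function names are made up, and the numbers are only as good as the quoted approximation for PPro, P2 and P3 on the fast path. For the original poster's count of 512 dwords they give 185 + 1.5 * 512 = 953 uops for REP STOSD and 215 + 2 * 512 = 1239 uops for REP MOVSD.

```c
#include <assert.h>

/* Approximate uop counts from the quoted passage (fast path on PPro/P2/P3).
   The formulas only apply when the count is at least 64, hence the asserts.
   Function names are hypothetical. */
static double rep_movsd_uops(unsigned ecx)
{
    assert(ecx >= 64);
    return 215.0 + 2.0 * ecx;
}

static double rep_stosd_uops(unsigned ecx)
{
    assert(ecx >= 64);
    return 185.0 + 1.5 * ecx;
}
```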


As the quote Ra provided shows, the registers used with rep movsd are esi and edi.

Here's an example that uses this construct to copy arguments to a stack (Source).


DWORD Call_cdecl( const void* args, size_t sz, DWORD func )
{
    DWORD rc;               // here's our return value...
    __asm
    {
        mov   ecx, sz       // get size of buffer
        mov   esi, args     // get buffer
        sub   esp, ecx      // allocate stack space
        mov   edi, esp      // start of destination stack frame
        shr   ecx, 2        // make it dwords
        rep   movsd         // copy params to real stack
        call  [func]        // call the function
        mov   rc,  eax      // save the return value
        add   esp, sz       // restore the stack pointer
    }
    return ( rc );
}





Then write it in C...

void* memset4(void* t, unsigned int val, size_t count) {
	void *dst = t;
	while(count--) {
		*(unsigned int*)dst = val;
		dst = (unsigned int*)dst + 1;
	}
	return t;
}


And since I just KNOW you're going to say, "Well, that's not optimized."

Oh?
004014BD B8 34 12 00 00   mov         eax,1234h 
004014C2 B9 64 00 00 00   mov         ecx,64h 
004014C7 8D 7C 24 08      lea         edi,[esp+8] 
004014CB F3 AB            rep stos    dword ptr [edi] 


That's what the above code generates.
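If you want to convince yourself the C version does the job before looking at the disassembly, a quick check along these lines should do. This is a sketch: fill32 is the same loop as memset4 but typed against uint32_t so no casts are needed, and both helper names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same loop as memset4, written against uint32_t so no casts are needed.
   Hypothetical name. */
static uint32_t *fill32(uint32_t *dst, uint32_t val, size_t count)
{
    uint32_t *p = dst;
    assert(dst != NULL || count == 0);
    while (count--)
        *p++ = val;
    return dst;
}

/* Returns 1 if every one of `count` dwords equals `val`, otherwise 0. */
static int all_equal32(const uint32_t *buf, uint32_t val, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (buf[i] != val)
            return 0;
    return 1;
}
```

Filling a 512-entry buffer with fill32(buf, 0xFF00FF00, 512) and then calling all_equal32 on it is the whole test.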


Isn't that very compiler specific though? What compiler did you use for this?


Oh...and i suppose __asm ISN'T? HRM????

I used Visual Studio .Net 2003 Enterprise Architect.

If your compiler generates anything but code that is VERY VERY similar to that...it's a piece of shit and you should probably upgrade. The VC++ Toolkit IS free.


I believe memset is "rep stosb" and memcpy is "rep movsb". In other words they are both single-byte only (what the last 'b' stands for).

~CGameProgrammer( );


Washu,

When I said "compiler specific", I wasn't referring to Microsoft extensions such as __asm. I know that's not portable and I don't really care since I won't be compiling my code with another compiler anyway. What I meant is that your C function might not compile to the same optimized asm code on another compiler such as gcc, Borland or whatever.

I tried compiling your C function and it does indeed produce optimized assembly code for the Release build target. I must admit that I am very surprised about this, but nonetheless, I will use inline assembly code anyway because the code generated for the Debug build target is much less optimized.

Oh, and what does "HRM????" mean by the way?



Quote:
Original post by i1977
I will use inline assembly code anyway because the code generated for the Debug build target is much less optimized.


Why do you want to optimise your debug build?


Quote:
Original post by i1977
Washu,

When I said "compiler specific", I wasn't referring to Microsoft extensions such as __asm. I know that's not portable and I don't really care since I won't be compiling my code with another compiler anyway. What I meant is that your C function might not compile to the same optimized asm code on another compiler such as gcc, Borland or whatever.

I tried compiling your C function and it does indeed produce optimized assembly code for the Release build target. I must admit that I am very surprised about this, but nonetheless, I will use inline assembly code anyway because the code generated for the Debug build target is much less optimized.

Oh, and what does "HRM????" mean by the way?

ROFL! You think that the few nanoseconds gained by having an optimized memset in a debug build are going to help in debug mode? Well, they won't. All debug mode code is unoptimized, so your memset is not going to give you a speedup. In fact, it will make debugging harder because it won't have certain things built into it that my C function does.


Quote:
Original post by CGameProgrammer
I believe memset is "rep stosb" and memcpy is "rep movsb". In other words they are both single-byte only (what the last 'b' stands for).


Hello,

this is stolen from VC6 memcpy

CopyUp:
        test    edi,11b         ;U - destination dword aligned?
        jnz     short CopyLeadUp ;V - if we are not dword aligned already, align

        shr     ecx,2           ;U - shift down to dword count
        and     edx,11b         ;V - trailing byte count

        cmp     ecx,8           ;U - test if small enough for unwind copy
        jb      short CopyUnwindUp ;V - if so, then jump

        rep     movsd           ;N - move all of our dwords

        jmp     dword ptr TrailUpVec[edx*4] ;N - process trailing bytes



You can find it in ${Vs6Dir}\vc98\crt\intel\memcpy.asm.

In a similar fashion, memset also uses "rep stosd".

Both functions are heavily optimized for Intel processors and take some neat features into account: pointer alignment, instruction pairing, and so on. This is why I think that an optimized version of a 4-byte memset should use these as base code.
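For readers who don't want to dig through the assembly, here is a rough C sketch of the structure the CopyUp fragment suggests: copy lead bytes until the destination is dword aligned, then whole dwords, then the 0 to 3 trailing bytes. It is my own simplification, not the CRT source. The name copy_up_sketch is invented, the unrolled small-copy path is left out, and the regions are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Forward copy split into lead bytes, whole dwords and trailing bytes.
   Hypothetical name; source and destination must not overlap. */
static void *copy_up_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert((d != NULL && s != NULL) || n == 0);

    while (n > 0 && ((uintptr_t)d & 3u) != 0) { /* like "test edi,11b"   */
        *d++ = *s++;
        n--;
    }
    size_t dwords = n >> 2;                      /* like "shr ecx,2"      */
    size_t trail  = n & 3u;                      /* like "and edx,11b"    */
    while (dwords-- > 0) {                       /* stands in for "rep movsd" */
        uint32_t w;
        memcpy(&w, s, sizeof w);                 /* the source may be unaligned */
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
    }
    while (trail-- > 0)                          /* trailing 0..3 bytes   */
        *d++ = *s++;
    return dst;
}
```

A 4-byte fill could follow the same outline, with the dword loop storing a fixed value instead of loading one from the source.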

HTH,

-- Emmanuel D.


Thanks for your help guys! Even Washu, who apparently knows more about what I need and am trying to do. ;)


Quote:
Original post by i1977
Thanks for your help guys! Even Washu, who apparently knows more about what I need and am trying to do. ;)

I do. My years of experience tell me that dropping to assembly language for something as simple as a memset4 is a silly idea, especially when it will cost you in debuggability. (Of course, if you don't use the debugger then you have far worse problems.) :)

Not to mention: premature optimization is the devil. Profile first.

