jeffakew

fast memcpy

Hello. Could anyone give me some help with improving the performance of the standard memcpy() function, please? My game runs about 25% faster without it. I know that memcpy works with bytes only, but I don't know asm, so I can't write a function to copy qwords. What can I do? Any help would be much appreciated, thanks.

You can copy words at a time, but sorry, I don't know how.
What OS are you programming for, and what are you trying to copy memory to? If it's in DOS I can help.

I don't know if this helps, but in C you could do this:

        

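/* Copies nlongs long words from 'from' to 'to'. The loop body is unrolled
   100 times, so the loop test runs only once per 100 longs copied. */
void copy (long *from, long *to, int nlongs)
{
int blocks = nlongs / 100;    /* number of full 100-long blocks */
int remainder = nlongs % 100; /* longs left over after the full blocks */

while (blocks--) {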
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
}

// copy the remaining longs one at a time
while (remainder--)
*to++ = *from++;
}


Edited by - bishop_pass on June 19, 2000 2:03:53 AM

Hi,
Thanks a lot. I'm at work right now, but when I get home I'll try that function. I'm programming for Win32 and I'm trying to copy a system memory buffer to VRAM just by locking the VRAM. The pitch of the VRAM is the same as that of the system memory. Surely after the VRAM is locked it is just the same as DOS (I mean the same to write to)? Thanks again for your help.
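For the case described here, where the locked surface and the system buffer share the same pitch, a single memcpy over the whole buffer is enough. When the pitches differ, the copy has to go row by row. A minimal sketch of both cases follows; the name copy_rows and its parameters are made up for illustration and are not part of any Win32 or DirectDraw API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `rows` rows of `row_bytes` bytes each.
   src_pitch and dst_pitch are the byte distances between row starts. */
void copy_rows(unsigned char *dst, size_t dst_pitch,
               const unsigned char *src, size_t src_pitch,
               size_t row_bytes, size_t rows)
{
    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);

    if (dst_pitch == src_pitch && row_bytes == src_pitch) {
        /* Rows are contiguous in both buffers: one large copy. */
        memcpy(dst, src, row_bytes * rows);
        return;
    }

    /* Pitches differ (or rows are padded): copy one row at a time. */
    for (size_t y = 0; y < rows; ++y)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
}
```

The destination pointer and pitch would come from whatever the lock call returns; the source values describe the system memory buffer.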

Hi,

Don't use the function above, it's slow! He's just copying one byte after another...

To copy fast (paste this stuff):

----- cut here ------

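void mov2scr_32(unsigned char *source,unsigned char *dest,unsigned long count)
{
__asm
{
cld                 ; copy forward
mov esi,source
mov edi,dest
mov ebx,count
mov edx,edi
and edx,3           ; edx = misalignment of dest (0..3)
jz m2s_memaligned
mov ecx,4
sub ecx,edx         ; ecx = bytes needed to reach a dword boundary
cmp ecx,ebx
jbe m2s_head
mov ecx,ebx         ; fewer bytes than that: copy them all
m2s_head:
sub ebx,ecx         ; adjust the count first, rep movsb leaves ecx = 0
rep movsb

m2s_memaligned:
mov edx,ebx
and edx,3           ; edx = trailing bytes
mov ecx,ebx
shr ecx,2           ; ecx = whole dwords
rep movsd
mov ecx,edx
rep movsb
}
}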

------ cut here -----

Just pass the memory pointers and the number of bytes to be copied to the function, that's all.
The function works out how many dwords to copy and how many bytes remain, so it uses the fast dword copy wherever possible.
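The split the function performs can be written out in plain C to make the arithmetic easier to follow. This is only a sketch of the bookkeeping, not a replacement for the assembly; the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: how a copy of `count` bytes is divided. */
struct copy_split {
    size_t head;   /* single bytes copied until dest is 4-byte aligned */
    size_t dwords; /* 4-byte words copied once dest is aligned */
    size_t tail;   /* single bytes left over at the end */
};

struct copy_split split_copy(const void *dest, size_t count)
{
    struct copy_split s;
    size_t total = count;
    size_t misalign = (size_t)((uintptr_t)dest & 3u);

    s.head = misalign ? 4u - misalign : 0u;
    if (s.head > count)
        s.head = count;          /* short copy: never exceed count */
    count -= s.head;
    s.dwords = count >> 2;       /* same as the shr ecx,2 step */
    s.tail = count & 3u;         /* same as the and edx,3 step */

    /* The three parts always add up to the requested byte count. */
    assert(s.head + 4u * s.dwords + s.tail == total);
    return s;
}
```

For example, a 10-byte copy to a destination one byte past a dword boundary splits into 3 head bytes, 1 dword, and 3 tail bytes.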

Hi, thanks, that's great, it's just what I was looking for. When I get in I'll convert my code to use it and let you know what happens. Once again, thanks a lot.

My function is slow?

I ran some tests on both functions and copied more than 100 billion bytes in the tests to ensure fairness.

My function was 6% faster on an AMD 350. It is also machine independent.
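The test program itself was not posted, so the following is only a guess at the general shape of such a comparison: a small timing helper built on the standard clock() function. The names time_copy and copy_fn are hypothetical, and any function under test would need to be wrapped to match the memcpy-style signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by the functions being compared. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Runs fn(dst, src, n) `reps` times and returns the CPU time in seconds.
   clock() is coarse, so reps should be large enough to take a while. */
double time_copy(copy_fn fn, void *dst, const void *src, size_t n, int reps)
{
    assert(reps > 0);

    clock_t start = clock();
    for (int i = 0; i < reps; ++i)
        fn(dst, src, n);
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy(memcpy, dst, src, size, reps) gives a baseline to compare the hand-written versions against, using the same buffers and sizes the game actually copies.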

I should note that my function does not do the initial verifying that the assembly version does, but this could be added. My function takes a number of long words to copy. It does not copy bytes at a time.

Also, Jacen/SE, you should note that my function incurs loop maintenance only every 400 bytes. It appears yours does maintenance every 4 bytes.

The fastest function would be a hybrid of the 2.
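One way such a hybrid might look in portable C is sketched below: align the destination byte by byte as the assembly version does, copy the bulk in an unrolled loop as the C version does, then finish with the leftovers. The name hybrid_copy and the unroll factor of 8 are arbitrary choices for illustration, and nothing here has been benchmarked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one 32-bit word. A fixed-size memcpy lets the compiler emit a plain
   load and store while remaining valid when the source is unaligned. */
static void copy_word(unsigned char *d, const unsigned char *s)
{
    uint32_t w;
    memcpy(&w, s, sizeof w);
    memcpy(d, &w, sizeof w);
}

void *hybrid_copy(void *dest, const void *source, size_t count)
{
    unsigned char *d = dest;
    const unsigned char *s = source;

    assert(count == 0 || (d != NULL && s != NULL));

    /* Head: single bytes until dest is 4-byte aligned. */
    while (count > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        --count;
    }

    /* Body: 8 words (32 bytes) per loop test. */
    while (count >= 32) {
        copy_word(d,      s);      copy_word(d + 4,  s + 4);
        copy_word(d + 8,  s + 8);  copy_word(d + 12, s + 12);
        copy_word(d + 16, s + 16); copy_word(d + 20, s + 20);
        copy_word(d + 24, s + 24); copy_word(d + 28, s + 28);
        d += 32; s += 32; count -= 32;
    }

    /* Remaining whole words, then the tail bytes. */
    while (count >= 4) {
        copy_word(d, s);
        d += 4; s += 4; count -= 4;
    }
    while (count > 0) {
        *d++ = *s++;
        --count;
    }
    return dest;
}
```

Like memcpy, this sketch assumes the two buffers do not overlap.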

Guest Anonymous Poster
memcpy only takes bytes AS A PARAMETER. That doesn't mean it copies them byte-by-byte internally. It most likely tries to move them as fast as possible (i.e. as DWORDs).

Hi,

OK, bishop, you're right: it copies dwords. Sorry, I didn't have much time to read it carefully...
BUT: what if you just need to copy, for example, 2 dwords?
Then your function is useless...

Sorry again.
Little correction: your function isn't useless, but your big loop is. In that case your function won't be faster than mine...

Both functions work fine, but there wasn't any speed difference, so I'll have to try something else. Thanks for all your help. Sorry I didn't reply yesterday, but I couldn't use the net at home.

The following is an extract from "How to optimize for the Pentium family of microprocessors" at http://www.agner.org/assem/ . It's written by Agner Fog. It seems to indicate that the best approach in general is to use REP MOVSD.





27.8 Moving blocks of data (all processors)

There are several ways of moving blocks of data. The most common method is REP MOVSD, but under certain conditions other methods are faster.
On PPlain and PMMX it is faster to move 8 bytes at a time using floating point registers if the destination is not in the cache:

TOP:    FILD    QWORD PTR [ESI]
        FILD    QWORD PTR [ESI+8]
        FXCH
        FISTP   QWORD PTR [EDI]
        FISTP   QWORD PTR [EDI+8]
        ADD     ESI, 16
        ADD     EDI, 16
        DEC     ECX
        JNZ     TOP

The source and destination should of course be aligned by 8. The extra time used by the slow FILD and FISTP instructions is compensated for by the fact that you only have to do half as many write operations. Note that this method is only advantageous on the PPlain and PMMX and only if the destination is not in the level 1 cache. You cannot use FLD and FSTP (without I) on arbitrary bit patterns because denormal numbers are handled slowly and certain bit patterns are not preserved unchanged.

On the PMMX processor it is faster to use MMX instructions to move eight bytes at a time if the destination is not in the cache:

TOP:    MOVQ    MM0,[ESI]
        MOVQ    [EDI],MM0
        ADD     ESI,8
        ADD     EDI,8
        DEC     ECX
        JNZ     TOP

There is no need to unroll this loop or optimize it further if cache misses are expected, because memory access is the bottleneck here, not instruction execution.

On PPro, PII and PIII processors the REP MOVSD instruction is particularly fast when the following conditions are met (see chapter 26.3):

both source and destination must be aligned by 8
direction must be forward (direction flag cleared)
the count (ECX) must be greater than or equal to 64
the difference between EDI and ESI must be numerically greater than or equal to 32
the memory type for both source and destination must be either writeback or write-combining (you can normally assume this).

On the PII it is faster to use MMX registers if the above conditions are not met and the destination is likely to be in the level 1 cache. The loop may be rolled out by two, and the source and destination should of course be aligned by 8.
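The conditions above that depend only on the pointers and the count can be checked from C before choosing a copy routine. The sketch below is an illustration, not part of the extract: the function name is invented, the count is taken to be in dwords (as ECX is for REP MOVSD), and the pointer difference is treated as an absolute distance, which is an assumption. The direction flag and memory type cannot be tested this way.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check for the pointer and count conditions listed above.
   Does not cover the direction flag or the memory type. */
bool rep_movsd_fast_path(const void *src, const void *dst, size_t dword_count)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;
    uintptr_t distance = d > s ? d - s : s - d;  /* assumed: absolute value */

    return (s & 7u) == 0 && (d & 7u) == 0  /* both aligned by 8 */
        && dword_count >= 64               /* count (ECX) >= 64 */
        && distance >= 32;                 /* pointers at least 32 apart */
}
```

A caller could use the result to pick between a REP MOVSD routine and one of the register-based loops described in the extract.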

On the PIII the fastest way of moving data is to use the MOVAPS instruction if the above conditions are not met or if the destination is in the level 1 or level 2 cache:

        SUB     EDI, ESI
TOP:    MOVAPS  XMM0, [ESI]
        MOVAPS  [ESI+EDI], XMM0
        ADD     ESI, 16
        DEC     ECX
        JNZ     TOP

Unlike FLD, MOVAPS can handle any bit pattern without problems. Remember that source and destination must be aligned by 16.
If the number of bytes to move is not divisible by 16 then you may round up to the nearest number divisible by 16 and put some extra space at the end of the destination buffer to receive the superfluous bytes. If this is not possible then you have to move the remaining bytes by other methods.
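For readers without an assembler, roughly the same loop can be written in C with the SSE intrinsics from xmmintrin.h, where _mm_load_ps and _mm_store_ps correspond to aligned MOVAPS loads and stores. This is a sketch under assumptions: the function names are invented, the compiler must target SSE for the intrinsic path to be used, and a plain memcpy stands in otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_COPY 1
#else
#define HAVE_SSE_COPY 0
#endif

/* Round a byte count up to the next multiple of 16, as the extract suggests
   when the destination buffer has spare room at the end. */
size_t round_up_16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/* Copy `bytes` bytes, 16 at a time. Both pointers must be 16-byte aligned
   and `bytes` must be a multiple of 16. */
void copy_aligned_16(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15u) == 0 && ((uintptr_t)src & 15u) == 0);
    assert((bytes & 15u) == 0);

#if HAVE_SSE_COPY
    float *d = dst;
    const float *s = src;
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_store_ps(d + 4 * i, _mm_load_ps(s + 4 * i));
#else
    memcpy(dst, src, bytes);   /* fallback when SSE is not available */
#endif
}
```

When rounding up, both buffers need to be at least round_up_16(n) bytes long, since the extra bytes are read from the source as well as written to the destination.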

On the PIII you also have the option of writing directly to RAM without involving the cache by using the MOVNTQ or MOVNTPS instruction. This can be useful if you don't want the destination to go into a cache. MOVNTPS is only slightly faster than MOVNTQ.
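In C, the _mm_stream_ps intrinsic from xmmintrin.h generates MOVNTPS, so the cache-bypassing variant can be sketched as below. The function name is hypothetical, the alignment and size requirements are the same as for MOVAPS, and a plain memcpy is used when the compiler is not targeting SSE. Whether this helps for any particular destination has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_STREAM 1
#else
#define HAVE_SSE_STREAM 0
#endif

/* Copy `bytes` bytes with non-temporal 16-byte stores. Both pointers must be
   16-byte aligned and `bytes` must be a multiple of 16. */
void copy_streaming_16(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15u) == 0 && ((uintptr_t)src & 15u) == 0);
    assert((bytes & 15u) == 0);

#if HAVE_SSE_STREAM
    float *d = dst;
    const float *s = src;
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_ps(d + 4 * i, _mm_load_ps(s + 4 * i));
    _mm_sfence();              /* make the streamed stores globally visible */
#else
    memcpy(dst, src, bytes);   /* fallback when SSE is not available */
#endif
}
```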
