would this be faster than memcpy()?

Started by
16 comments, last by relsoft 19 years, 3 months ago
Your algorithm is full of casts, shifts, mods, sums, and conditionals... do you really think it can be faster?
Professional programmers who write memcpy implementations are not idiots...
I don't believe that memcpy uses movsd!
Maybe on some Intel compilers, but not by default.
First of all, you have to check whether the size of the data you want to transfer is divisible by 4, and what if it isn't? There has to be a backup plan!

The easiest and the slowest way is movsb.
With a data array of 16 bytes you have 4x movsb, and what if it's 19?

4x movsb, 1x movsw, 1x movsb? Do you think memcpy is so smart?
Quote:Original post by Samurai Jack
I don't believe that memcpy uses movsd!
Maybe on some Intel compilers, but not by default.
First of all, you have to check whether the size of the data you want to transfer is divisible by 4, and what if it isn't? There has to be a backup plan!

The easiest and the slowest way is movsb.
With a data array of 16 bytes you have 4x movsb, and what if it's 19?

4x movsb, 1x movsw, 1x movsb? Do you think memcpy is so smart?

(I suppose this is meant to be 4x movsd, 1x movsw, 1x movsb.)

The Intel version of memcpy(), which can be found in VS.NET and icc, indeed is. It also adds something that caesar4 forgot, and is therefore a lot faster: memory alignment. movsd with memory aligned on a 4-byte boundary is a lot faster than an unaligned movsd.

And AFAIR gcc for x86 also uses movsd. It would be silly not to use it.

@blizzard99: actually, they are not idiots. But intel's implementation is still full of logical operations.

This kind of function is well known, and all the optimizations you might make to speed it up have already been done by your compiler vendor before you.

Regards,
Quote:Original post by Anonymous Poster
Nevertheless, before trying asm optimization, please just try to forget dwords--. Postfix operators are a really bad idea, because they push the initial value on the stack.

Try it :
++dwords;
--idest;
--isrc;
while(--dwords)
*++idest = *++isrc;

I know the compiler can guess it, but... it's better to be sure


Hum. That's not the right code. First, it is a pain to read. Second, these --idest and --isrc are just an ugly hack. Third, your compiler is not stupid.

You can use postfix operators, as in *idest++ = *isrc++;, because you want the pointers to be incremented AFTER the copy. In this case the compiler will not create a temporary variable, because the code does not say the old value will be stored. This is not a = b++;, this is a++, b++;. As for the while (dwords--) loop, you can translate it to assembly yourself:

mov ecx, dwords
or ecx, ecx
jz endloop
dec ecx


If your compiler cannot generate this, update it ;)
The dword copying thing does not work too well unless you copy aligned dwords. So if it starts on, say, a 4x+3 boundary, you're going to have a real mess on your hands. So first you copy however many bytes you need to align the data (if you can assume data is aligned, it's really good). Then you copy as many dwords as you can, and then copy the bytes at the end.

I'm looking at the AMD memcpy implementation now, which appears to have four separate copy modes: tiny block copy, in-cache copy, out-of-cache copy, and block-prefetch copy. Each mode is for increasing sizes of memory. It skips the alignment stage for blocks of memory that are between 32k and 64k in size ("it appears to be slower"). Most of the stuff in this AMD code can't be written in C, because it's not only using MMX instructions, it also uses unrolled assembly loops and special instructions that request cache prefetches or skip the cache and do a direct memory read/write.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
This is a nano-optimization; if you need this kind of optimization you're going to need to custom-tailor it to your situation anyway. Your code will be slow if the addresses are not dword-aligned, or it may crash on some architectures (such as some of the "new" 64-bit processors).

Instead of premature optimization and worrying about whether code is too slow or not, implement it and find out (with a profiler) whether it is too slow, and then, only if it IS too slow, do you optimize.

memcpy is a generalized tool, and you can be assured that it is probably just about as optimized as such a generalized tool can get. Either that or you've got a cruddy setup which should be replaced ASAP.
The memcpy on my windows and OS X machines both have optimized paths for properly aligned data.

They copy the first few bytes until the address is aligned and then use the aligned data.

Cheers
Chris
Quote:Original post by Samurai Jack
I don't believe that memcpy uses movsd!
Maybe on some Intel compilers, but not by default.
First of all, you have to check whether the size of the data you want to transfer is divisible by 4, and what if it isn't? There has to be a backup plan!

The easiest and the slowest way is movsb.
With a data array of 16 bytes you have 4x movsb, and what if it's 19?

4x movsb, 1x movsw, 1x movsb? Do you think memcpy is so smart?



The backup plan is to check, of course; hence the "and" I wrote before the blit.

ie:

mov ecx, dword ptr[numbytes]
mov eax, ecx
shr ecx, 2
rep movsd
and eax, 3
mov ecx, eax
rep movsb

It works as I've been using it for years.

http://rel.betterwebber.com/junk.php?id=29

Yes, the above is an all-software 3D renderer using TinyPTC and inline ASM.

Hi.

This topic is closed to new replies.
