would this be faster than memcpy()?

Started by
16 comments, last by relsoft 19 years, 3 months ago
Your algorithm is full of casts, shifts, mods, sums, and conditionals... do you really think it can be faster?
Professional programmers who write memcpy implementations are not idiots...
I don't believe that memcpy uses movsd!
Maybe on some Intel compilers, but not by default.
First of all, you have to check whether the size of the data you want to transfer is divisible by 4, and what if it isn't? There has to be a backup plan!

The easiest and the slowest way is movsb.
With a data array of 16 bytes you have 4x movsb, and what if it's 19?

4x movsb, 1x movsw, 1x movsb? Do you think memcpy is so smart?
Quote:Original post by Samurai Jack
I don't believe that memcpy uses movsd!
Maybe on some Intel compilers, but not by default.
First of all, you have to check whether the size of the data you want to transfer is divisible by 4, and what if it isn't? There has to be a backup plan!

The easiest and the slowest way is movsb.
With a data array of 16 bytes you have 4x movsb, and what if it's 19?

4x movsb, 1x movsw, 1x movsb? Do you think memcpy is so smart?

(I suppose this is meant to be 4x movsd, 1x movsw, 1x movsb.)

The Intel version of memcpy(), which can be found in VS.NET and icc, indeed is. It also adds something that caesar4 forgot, and is therefore a lot faster: memory alignment. movsd with memory aligned on a 4-byte boundary is a lot faster than an unaligned movsd.

And AFAIR gcc for x86 also uses movsd. It would be silly not to use it.

@blizzard99: actually, they are not idiots. But intel's implementation is still full of logical operations.

This kind of function is well known, and all the optimizations you might make to speed it up have already been done by your compiler vendor before you.

Regards,
Quote:Original post by Anonymous Poster
Nevertheless, before trying asm optimization, please just try to forget dwords--. Postfix operators are a really bad idea, because they push the initial value on the stack.

Try it :
++dwords;
--idest;
--isrc;
while(--dwords)
*++idest = *++isrc;

I know the compiler can guess it, but... it's better to be sure


Hum. That's not the right code. First, it is a pain to read. Second, these --idest and --isrc are just an ugly hack. Third, your compiler is not stupid.

You can use postfix operators, as in *idest++ = *isrc++;, because you want the pointers to be incremented AFTER the copy. In this case the compiler will not create a temporary variable, because the code does not say the old value will be stored. This is not a = b++;, this is a++, b++;. As for the while (dwords--) loop, you can translate it to assembly yourself:

mov ecx, dwords
or ecx, ecx
jz endloop
dec ecx


If your compiler cannot generate this, update it ;)
The dword copying thing does not work too well unless you copy aligned dwords. So if it starts on, say, a 4x+3 boundary, you're going to have a real mess on your hands. So first you copy however many bytes you need to align the data (if you can assume data is aligned, it's really good). Then you copy as many dwords as you can, and then copy the bytes at the end.

I'm looking at the AMD memcpy implementation now, which appears to have four separate copy modes: tiny block copy, in-cache copy, out-of-cache copy, and block-prefetch copy. Each mode is for increasing sizes of memory. It skips the alignment stage for blocks of memory that are between 32k and 64k in size ("it appears to be slower"). Most of the stuff in this AMD code can't be written in C, because it's not only using MMX instructions, it also uses unrolled assembly loops and special instructions that request cache prefetches or skip the cache and do a direct memory read/write.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
This is a nano-optimization; if you need this kind of optimization you're going to need to custom-tailor it to your situation anyway. Your code will be slow if the addresses are not dword-aligned, or it may crash on some architectures (such as some of the "new" 64-bit processors).

Instead of premature optimization and worrying about whether code is too slow or not, implement it and find out (with a profiler) whether it is too slow, and then, only if it IS too slow, do you optimize.

memcpy is a generalized tool, and you can be assured that it is probably just about as optimized as such a generalized tool can get. Either that or you've got a cruddy setup which should be replaced ASAP.
The memcpy on my windows and OS X machines both have optimized paths for properly aligned data.

They copy the first few bytes until the address is aligned and then use the aligned data.

Cheers
Chris
Quote:Original post by Samurai Jack
I don't believe that memcpy uses movsd!
Maybe on some Intel compilers, but not by default.
First of all, you have to check whether the size of the data you want to transfer is divisible by 4, and what if it isn't? There has to be a backup plan!

The easiest and the slowest way is movsb.
With a data array of 16 bytes you have 4x movsb, and what if it's 19?

4x movsb, 1x movsw, 1x movsb? Do you think memcpy is so smart?



The backup plan is to check, of course; hence the "and" I wrote before the blit.

ie:

mov ecx, dword ptr[numbytes]
mov eax, ecx
shr ecx, 2
rep movsd
and eax, 3
mov ecx, eax
rep movsb

It works as I've been using it for years.

http://rel.betterwebber.com/junk.php?id=29

Yes, the above is an all-software 3D renderer using TinyPTC and inline ASM.

Hi.

This topic is closed to new replies.
