# would this be faster than memcpy()?



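    /* assumes a 4-byte int */
    void *memcpy_1(void *dest, const void *src, int bytes){
        int minor = bytes % 4;   /* leftover bytes after the dword copy */
        int dwords = bytes >> 2; /* number of whole 4-byte words */
        int *idest = (int *)dest;
        const int *isrc = (const int *)src;
        while(dwords--)
            *idest++ = *isrc++;

        /* void * arithmetic is not standard C, so cast to char * first */
        char *cdest = (char *)dest + bytes - minor;
        const char *csrc = (const char *)src + bytes - minor;
        while(minor--)
            *cdest++ = *csrc++;

        return dest;
    }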


For small copies this would probably be slower than the original, but for large copy operations would it be almost 4x faster, since it moves 4 bytes at a time instead of a single byte? Any asm optimizations are welcome.

##### Share on other sites
I suggest you look at the actual implementation of memcpy. This is not faster...

##### Share on other sites
It depends on how your compiler implements memcpy. Some might have an SSE2 version. Some might use a for loop copying 3 bytes at a time (OK, so that's unlikely).

The point is, is your code faster than memcpy? Which version of memcpy? Running on what processor?

If you really want to know the answer for your particular compiler, profile it. In general, trying to micro-optimise things like memcpy is usually a waste of time. Profile your code, find where it's actually slow, and optimise the algorithm.

Alan
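As a rough illustration of the "profile it" advice, here is a minimal timing sketch in C. The names `copy_bytes`, `copy_fn` and `time_copy` are made up for this example and do not come from the thread. It uses `clock()` from the standard library, which is coarse, and an optimizing compiler may still transform the loops, so treat the numbers as a first impression rather than a proper benchmark.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical candidate under test: a plain byte-by-byte copy. */
void *copy_bytes(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dest;
}

/* Any function with memcpy's shape can be timed, including memcpy itself. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Returns the CPU seconds spent doing `reps` copies of `n` bytes,
   or -1.0 if n is zero or an allocation fails. */
double time_copy(copy_fn fn, size_t n, int reps)
{
    if (n == 0)
        return -1.0;

    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xAB, n);

    /* Reading the result through a volatile makes it harder for the
       compiler to discard the copies as dead code. */
    volatile unsigned char sink = 0;
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        fn(dst, src, n);
        sink ^= dst[n - 1];
    }
    clock_t end = clock();
    (void)sink;

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing `time_copy(memcpy, 1 << 20, 1000)` with `time_copy(copy_bytes, 1 << 20, 1000)` on your own compiler and CPU, at several block sizes, answers the question for your setup only.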

##### Share on other sites
Just about any memcpy implementation probably already does the 4-bytes-at-a-time optimization, in addition to other, more platform-specific tweaks, and is implemented directly in (inline) asm. It is usually safe to assume that standard library writers know what they are doing, and that standard functions are about as fast as they can get. Additionally, compilers are allowed to recognize uses of standard library functions and do extra optimization magic on them. For example, the compiler may know that sqrt() has no side effects, so it can fold multiple calls with the same parameter into one, or calculate the result at compile time, something it cannot always do with user-defined functions.
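One place where this compiler knowledge shows up is small, fixed-size copies. The sketch below uses hypothetical helper names (`load_u32`, `store_u32`) that are not from the thread. Mainstream optimizing compilers typically turn a `memcpy` of a known 4-byte size into a single load or store rather than a function call, though that depends on the compiler and its settings.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an address that may not be 4-byte aligned.
   The size is a compile-time constant, so the compiler is free to
   replace the call with a plain load where the target allows it. */
uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 32-bit value to an address that may not be 4-byte aligned. */
void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

This is also the portable way to do an unaligned or type-punned access in C, so using `memcpy` here costs nothing and avoids undefined behaviour.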

##### Share on other sites
For your purposes you can probably assume that *nothing* is faster than memcpy.
This is probably not where you should be focusing your optimisation efforts, though it depends largely upon the context of what you are using it for. Tell us what you are doing with memcpy, and perhaps post some code, and we can help you make some real performance gains, perhaps at the algorithmic level first.

##### Share on other sites
Quote:
Original post by iMalc
For your purposes you can probably assume that *nothing* is faster than memcpy.

Except for possibly AMD memcpy! (for SSE-capable processors). It is significantly faster, I use it (along with Doug Lea's malloc) with custom allocators for all my containers now, and it's been a beauty.

##### Share on other sites
I suppose memcpy does something like this:

    mov ecx, dword ptr[numdwords]
    ; or this: mov eax, dword ptr[numbytes] and check for powers of 2
    rep movsd

So memcpy should be faster.

##### Share on other sites
To my mind, if it could be faster than memcpy, it's not because it moves 4 bytes, but because your copy algorithm uses units whose size equals the processor's word size.
Moreover, I tried it in a personal project. Sometimes it is faster (with the standard optimization level in VS .NET 2003, no SSE-like optimizations). It really depends on the array size.

Nevertheless, before trying asm optimization, please just try to forget dwords--. Postfix operators are a really bad idea, because they push the initial value onto the stack.

Try this:

    ++dwords;
    --idest;
    --isrc;
    while(--dwords)
        *++idest = *++isrc;

I know the compiler can guess it, but... it's better to be sure

##### Share on other sites
Quote:
Original post by ajas95
Quote:
Original post by iMalc
For your purposes you can probably assume that *nothing* is faster than memcpy.

Except for possibly AMD memcpy! (for SSE-capable processors). It is significantly faster, I use it (along with Doug Lea's malloc) with custom allocators for all my containers now, and it's been a beauty.

Yes, that's quite interesting, though by the sounds of it not totally portable. I wouldn't recommend it to the average programmer because they'll use it for everything and unnecessarily cut out some of the audience of their program because it won't run on their CPU.

caesar: It's great that you've come up with the DWORD copying mechanism, but unfortunately many others have already thought of that, and much more (e.g. Duff's Device), before you. memcpy probably already does what you wrote (except it would copy aligned DWORDs, I imagine). Sorry if that sounds negative, but there is a good chance that you're looking in the wrong place for optimisation. Right now we know nothing about what you are using it for, or whether it is indeed a performance bottleneck. E.g. some people want a faster memcpy because their bubble sort is too slow with 1000 items, when the real solution is a different algorithm.

I'm quite possibly wrong but there is no way for me to know. Please tell us what you are doing that involves memcpying.
It is a good idea when posting about optimisation to include in your initial post what part of your program is performing too slowly, and what you have already tried etc. Oh and if you are just making a code library then that's cool, just let us know.

Our goal is the same as yours, to make your program run faster. It's just quite likely that there are better ways to achieve this that we can only think of through a better understanding of what you are doing.

Remember, the fastest instruction is the one that is never executed!
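For readers who have not met Duff's Device, mentioned above, here is a minimal byte-copy sketch of the idea: a `switch` jumps into the middle of an 8x unrolled `do`/`while` loop so the leftover bytes are handled without a separate tail loop. The function name `copy_duff` is made up for this example, and nothing here suggests that any particular memcpy is implemented this way.

```c
#include <stddef.h>

/* Byte copy unrolled 8x in the style of Duff's Device.
   The case labels deliberately fall through into each other. */
void *copy_duff(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (n == 0)
        return dest;

    size_t passes = (n + 7) / 8;   /* loop passes, including a partial first one */
    switch (n % 8) {               /* jump in so the first pass copies n % 8 bytes */
    case 0: do { *d++ = *s++;
    case 7:      *d++ = *s++;
    case 6:      *d++ = *s++;
    case 5:      *d++ = *s++;
    case 4:      *d++ = *s++;
    case 3:      *d++ = *s++;
    case 2:      *d++ = *s++;
    case 1:      *d++ = *s++;
            } while (--passes > 0);
    }
    return dest;
}
```

For 19 bytes the switch enters at `case 3`, copies 3 bytes, and then runs two full passes of 8. Modern compilers unroll loops on their own, so this is mostly of historical interest.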

##### Share on other sites
There is really no point in optimising memcpy since the function is, as the name suggests, going to be limited by the memory bandwidth. RAM is significantly slower than the CPU so any extensive memory access is going to kill your performance. With small memory copies, the caches can hide this to some degree.

The best approach is to avoid doing the copies.

Skizz

##### Share on other sites
Your algo is full of casts, shifts, mods, sums, conditionals... do you really think it can be faster???
Professional programmers who wrote memcpy implementations are not idiots...

##### Share on other sites
I don't believe that memcpy uses movsd!
Maybe with some Intel compilers, but not by default.
First of all, you have to check whether the size of the data you would like to transfer is divisible by 4, and what if it isn't? There has to be a backup plan!

The easiest and the slowest way is movsb.
With a data array of 16 bytes you have 4x movsb, and what if it's 19?

4xmovsb, 1xmovsw, 1movsb? Do you think memcpy is so smart?

##### Share on other sites
Quote:
Original post by Samurai Jack
I don't believe that memcpy uses movsd!
Maybe with some Intel compilers, but not by default.
First of all, you have to check whether the size of the data you would like to transfer is divisible by 4, and what if it isn't? There has to be a backup plan!
The easiest and the slowest way is movsb.
With a data array of 16 bytes you have 4x movsb, and what if it's 19?
4xmovsb, 1xmovsw, 1movsb? Do you think memcpy is so smart?

(I suppose this is 4xmovsd, 1xmovsw, 1movsb)

The Intel version of memcpy(), which can be found in VS.NET and icc, indeed is. It also adds something that caesar4 forgot, and is therefore a lot faster: memory alignment. movsd on memory aligned to a 4-byte boundary is a lot faster than an unaligned movsd.

And AFAIR gcc for x86 also uses movsd. It would be silly not to use it.

@blizzard99: actually, they are not idiots. But Intel's implementation is still full of logical operations.

This kind of function is well known, and all the optimizations you might make to speed it up have already been done by your compiler vendor.

Regards,

##### Share on other sites
Quote:
Original post by Anonymous Poster
Nevertheless, before trying asm optimization, please just try to forget dwords--. Postfix operators are a really bad idea, because they push the initial value onto the stack.
Try this:
    ++dwords;
    --idest;
    --isrc;
    while(--dwords)
        *++idest = *++isrc;
I know the compiler can guess it, but... it's better to be sure

Hum. That's not the right code. First, it is a pain to read. Second, those --idest and --isrc are just an ugly hack. Third, your compiler is not stupid.

You can use postfix operators in place of *++idest = *++isrc; because you want the pointers to be incremented AFTER the copy. In our case, the compiler will not create a temporary variable, because the code does not say the old value will be stored. This is not a = b++; this is a++, b++;. As for the while (dwords--) thing, you can translate it to assembly yourself:

    mov ecx, dwords
    or ecx, ecx
    jz endloop
    dec ecx

If your compiler cannot write this, update it ;)
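To make the point concrete, here is a small sketch of the same word-copy loop written with and without postfix operators. The function names are made up for this example. Neither version needs a temporary for the old pointer value, because that value is consumed immediately by the dereference, so optimizing compilers typically emit equivalent code for both. Note also that the quoted `--idest; --isrc;` trick forms a pointer before the start of the buffer, which the C standard does not define, so the explicit version below avoids it.

```c
#include <stddef.h>
#include <stdint.h>

/* Postfix form, as in the original post. */
void copy_words_postfix(uint32_t *dest, const uint32_t *src, size_t count)
{
    while (count--)
        *dest++ = *src++;
}

/* The same loop spelled out without any postfix operators. */
void copy_words_explicit(uint32_t *dest, const uint32_t *src, size_t count)
{
    for (; count != 0; --count) {
        *dest = *src;
        ++dest;
        ++src;
    }
}
```

If in doubt, compare the assembly your compiler generates for the two functions with optimization enabled.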

##### Share on other sites
The dword copying thing does not work too well unless you copy aligned dwords. So if it starts on, say, a 4x+3 boundary, you're going to have a real mess on your hands. So first you copy however many bytes you need to align the data (if you can assume data is aligned, it's really good). Then you copy as many dwords as you can, and then copy the bytes at the end.
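The three stages described above (head bytes, aligned dwords, tail bytes) can be sketched in C roughly as follows. This is an illustration under assumptions, not how any real memcpy is written: `copy_aligned` is a made-up name, the word size is fixed at 4 bytes, the low bits of a pointer converted to `uintptr_t` are assumed to reflect its alignment (true on mainstream platforms, but implementation-defined), and accessing the buffers through `uint32_t *` is a strict-aliasing grey area that library implementations handle with compiler-specific means.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes, using 32-bit moves only where both pointers are aligned.
   The regions must not overlap. */
void *copy_aligned(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Word copies are only possible if both pointers can be aligned at
       the same time, i.e. they share the same offset within a word. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3u) == 0) {
        /* 1. Head: byte copies until the pointers are 4-byte aligned. */
        while (n > 0 && ((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            n--;
        }

        /* 2. Body: aligned 32-bit copies. */
        uint32_t *dw = (uint32_t *)d;
        const uint32_t *sw = (const uint32_t *)s;
        size_t words = n / 4;
        n %= 4;
        while (words--)
            *dw++ = *sw++;

        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* 3. Tail: remaining bytes, or the whole block if the two pointers
       have different offsets within a word. */
    while (n--)
        *d++ = *s++;

    return dest;
}
```

When source and destination are misaligned relative to each other, this sketch simply falls back to bytes. Production implementations do more work there, for example by aligning only the destination.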

I'm looking at the AMD memcpy implementation now, which appears to have four separate copy modes: Tiny block copy, in-cache copy, out-of-cache copy, and block-prefetch copy. Each mode is for increasing sizes of memory. It skips the alignment stage for blocks of memory that are between 32k and 64k in size ("it appears to be slower"). Most of the stuff in this AMD code can't be written in C, because it's not only using MMX instructions, it also uses unrolled assembly loops and special instructions that request cache prefetches or skip the cache and do a direct memory read/write.

##### Share on other sites
This is a nano-optimization; if you need this kind of optimization you're going to need to custom-tailor it to your situation anyway. Your code will be slow if the addresses are not dword aligned, or may crash on some architectures (such as some of the "new" 64-bit processors).

Instead of optimizing prematurely and worrying about whether code is too slow, implement it and find out (with a profiler) whether it is too slow, and then, only if it IS too slow, optimize.

memcpy is a generalized tool, and you can be assured that it is probably just about as optimized as such a generalized tool can get. Either that or you've got a cruddy setup which should be replaced ASAP.

##### Share on other sites
The memcpy implementations on my Windows and OS X machines both have optimized paths for properly aligned data.

They copy the first few bytes until the address is aligned and then use the aligned path.

Cheers
Chris

##### Share on other sites
Quote:
Original post by Samurai Jack
I don't believe that memcpy uses movsd!
Maybe with some Intel compilers, but not by default.
First of all, you have to check whether the size of the data you would like to transfer is divisible by 4, and what if it isn't? There has to be a backup plan!
The easiest and the slowest way is movsb.
With a data array of 16 bytes you have 4x movsb, and what if it's 19?
4xmovsb, 1xmovsw, 1movsb? Do you think memcpy is so smart?

The backup plan is to check, of course, hence the "or" I wrote before the blit.

i.e.:

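    ; assumes esi = source, edi = destination, direction flag clear
    mov ecx, dword ptr[numbytes]
    mov eax, ecx
    shr ecx, 2          ; ecx = number of whole dwords
    rep movsd           ; copy the dwords
    and eax, 3          ; eax = leftover bytes (0 to 3)
    mov ecx, eax
    rep movsb           ; copy the leftover bytes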

It works as I've been using it for years.
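The arithmetic in that snippet answers the earlier "what if it's 19?" question. As a small C sketch of the same split (the `copy_plan` struct and `plan_copy` function are made up for this example), a shift gives the number of whole dwords and a mask gives the leftover bytes:

```c
#include <stddef.h>

/* How a byte count splits into 4-byte moves plus single-byte moves. */
struct copy_plan {
    size_t dwords;      /* count for rep movsd */
    size_t tail_bytes;  /* count for rep movsb, always 0 to 3 */
};

struct copy_plan plan_copy(size_t numbytes)
{
    struct copy_plan p;
    p.dwords = numbytes >> 2;      /* same as shr ecx, 2 */
    p.tail_bytes = numbytes & 3;   /* same as and eax, 3 */
    return p;
}
```

For 19 bytes this gives 4 dwords and 3 tail bytes, and for 16 bytes it gives 4 dwords and no tail, so no special case for divisibility by 4 is needed.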

http://rel.betterwebber.com/junk.php?id=29

Yes, the above is an all-software 3D renderer using TinyPTC and inline ASM.