# would this be faster than memcpy()?


## Recommended Posts

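void *memcpy_1(void *dest, const void *src, int bytes){
int minor = bytes % 4; // leftover bytes after the 4-byte copies
int dwords = bytes >> 2; // number of whole 4-byte words
int *idest = (int *)dest;
const int *isrc = (const int *)src;
while(dwords--)
*idest++ = *isrc++;

// copy the remaining 0-3 bytes; cast to char* because arithmetic on void* is not standard C
char *cdest = (char *)dest+bytes-minor;
const char *csrc = (const char *)src+bytes-minor;
while(minor--)
*cdest++ = *csrc++;

return dest;
}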


For small copies this would probably be slower than the original, but for large copy operations, would it be almost 4x faster, since it moves 4 bytes at a time instead of a single byte? Any asm optimizations are welcome.

##### Share on other sites
I suggest you look at the actual implementation of memcpy. This is not faster...

##### Share on other sites
It depends on how your compiler implements memcpy. Some might have an SSE2 version. Some might use a for loop copying 3 bytes at a time (OK, so that's unlikely).

The point is, is your code faster than memcpy? Which version of memcpy? Running on what processor?

If you really want to know the answer for your particular compiler, profile it. In general, trying to micro-optimise things like memcpy is usually a waste of time. Profile your code, find where it's actually slow, and optimise the algorithm.

Alan
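As a rough illustration of the "profile it" advice, here is a minimal timing sketch in C. The names `word_copy`, `time_copy` and `run_benchmark` are made up for this example, and `word_copy` is only a cleaned-up stand-in for the code in the first post. It assumes both buffers are 4-byte aligned (true for blocks returned by `malloc`) and uses `clock()`, which is coarse, so treat the numbers as indicative at best.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Word-at-a-time copy in the spirit of the first post.
   Assumption: dest and src are both 4-byte aligned. */
static void *word_copy(void *dest, const void *src, size_t bytes)
{
    uint32_t *d = dest;
    const uint32_t *s = src;
    size_t words = bytes / 4;
    size_t tail = bytes % 4;

    while (words--)
        *d++ = *s++;

    /* Copy the remaining 0-3 bytes one at a time. */
    unsigned char *cd = (unsigned char *)d;
    const unsigned char *cs = (const unsigned char *)s;
    while (tail--)
        *cd++ = *cs++;

    return dest;
}

/* Hypothetical helper: average CPU seconds per call over `reps` calls. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t bytes, int reps)
{
    assert(reps > 0);
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, bytes);
    return (double)(clock() - start) / CLOCKS_PER_SEC / reps;
}

/* Times memcpy and word_copy on the same source data.
   Returns 0 if both produced identical output, 1 if they differ,
   and -1 if allocation failed. */
static int run_benchmark(size_t bytes, int reps)
{
    unsigned char *src = malloc(bytes);
    unsigned char *a = malloc(bytes);
    unsigned char *b = malloc(bytes);
    if (!src || !a || !b) {
        free(src); free(a); free(b);
        return -1;
    }
    for (size_t i = 0; i < bytes; i++)
        src[i] = (unsigned char)(i * 31u);

    double t_std = time_copy(memcpy, a, src, bytes, reps);
    double t_word = time_copy(word_copy, b, src, bytes, reps);
    int mismatch = memcmp(a, b, bytes) != 0;

    printf("memcpy: %.9f s/call, word_copy: %.9f s/call\n", t_std, t_word);

    free(src); free(a); free(b);
    return mismatch;
}
```

Calling something like `run_benchmark(1 << 20, 200)` from `main` with a few different sizes, at the optimization level you actually ship with, gives a first impression. An optimizer may elide some of the repeated copies, so a real profiler run on the real program is still the more trustworthy measurement.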

##### Share on other sites
Just about any memcpy implementation probably already does the 4-bytes-at-a-time optimization, in addition to other, more platform-specific tweaks, and is implemented directly in (inline) asm. It is usually safe to assume that standard library writers know what they are doing, and that standard functions are about as fast as they can get. Additionally, compilers are allowed to recognize uses of standard library functions and do extra optimization magic on them. For example, the compiler may know that sqrt() has no side effects, so it can fold multiple calls with the same argument into one, or calculate the result at compile time, something it cannot always do with user-defined functions.
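One place where this compiler knowledge tends to show up is a small, fixed-size memcpy. The sketch below uses a made-up helper, `float_bits`, and assumes `float` is 32 bits wide. With optimization enabled, mainstream compilers typically turn a call like this into a plain register move instead of a library call, although that is an optimization they are allowed to make, not one they are required to make.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw bit pattern of a float.
   The size is a compile-time constant, so the compiler is free to
   replace the memcpy call with a single 4-byte move. */
static uint32_t float_bits(float value)
{
    uint32_t bits;
    assert(sizeof bits == sizeof value); /* assumption: 32-bit float */
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

Looking at the generated assembly for your own compiler and flags is the only way to know what actually happened.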

##### Share on other sites
For your purposes you can probably assume that *nothing* is faster than memcpy.
This is probably not where you should be focusing your optimisation efforts, though it depends largely upon the context of what you are using it for. Tell us what you are doing with memcpy, and perhaps post some code, and we can help you make some real performance gains, perhaps at the algorithmic level first.

##### Share on other sites
Quote:
Original post by iMalc
For your purposes you can probably assume that *nothing* is faster than memcpy.

Except for possibly AMD memcpy! (for SSE-capable processors). It is significantly faster, I use it (along with Doug Lea's malloc) with custom allocators for all my containers now, and it's been a beauty.

##### Share on other sites
I suppose memcpy does something like this:

mov ecx, dword ptr[numdwords]
;or this mov eax, dword ptr[numbytes] and check for powers of 2
rep movsd

So memcpy should be faster.

##### Share on other sites
To my mind, if it is faster than memcpy, it's not because it moves 4 bytes as such, but because your copy algorithm works in units equal to the processor's word size.
Moreover, I tried it in a personal project. Sometimes it is faster (with the standard optimization level in VS .NET 2003, no SSE-like optimizations). It really depends on the array size.

Nevertheless, before trying asm optimization, please just try to get rid of dwords--. Postfix operators are a bad idea here, because they can require keeping a copy of the initial value.

Try this:
++dwords;
--idest;
--isrc;
while(--dwords)
*++idest = *++isrc;

I know the compiler can guess it, but... it's better to be sure

##### Share on other sites
Quote:
Original post by ajas95
Quote:
Original post by iMalc
For your purposes you can probably assume that *nothing* is faster than memcpy.

Except for possibly AMD memcpy! (for SSE-capable processors). It is significantly faster, I use it (along with Doug Lea's malloc) with custom allocators for all my containers now, and it's been a beauty.

Yes, that's quite interesting, though by the sound of it not totally portable. I wouldn't recommend it to the average programmer, because they'll use it for everything and unnecessarily cut out some of their program's audience because it won't run on their CPU.

caesar: It's great that you've come up with the DWORD copying mechanism, but unfortunately many others have already thought of that, and much more (e.g. Duff's Device), before you. memcpy probably already does what you wrote (except that it would copy aligned DWORDs, I imagine). Sorry if that sounds negative, but there is a good chance that you're looking in the wrong place for optimisation. Right now we know nothing about what you are using it for, or whether it is indeed a performance bottleneck. For example, some people want a faster memcpy because their bubble sort is too slow with 1000 items, when the real solution is a different algorithm.
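To make the "aligned DWORDs" remark concrete, here is a minimal sketch in C of what such a copy might look like. It is not how any particular library implements memcpy, and the name `aligned_word_copy` is invented for this example. It also assumes the buffers may legally be accessed as `uint32_t` (for instance, memory from `malloc`); real library code is built with compiler-specific guarantees that ordinary application code does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies `bytes` bytes from src to dest (regions must not overlap),
   using 4-byte moves only while both pointers are 4-byte aligned. */
static void *aligned_word_copy(void *dest, const void *src, size_t bytes)
{
    assert(bytes == 0 || (dest != NULL && src != NULL));

    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Word copies are only possible when both pointers share the same
       misalignment, so that aligning one also aligns the other. */
    if ((uintptr_t)d % 4 == (uintptr_t)s % 4) {
        /* Head: byte copies until d reaches a 4-byte boundary. */
        while (bytes > 0 && (uintptr_t)d % 4 != 0) {
            *d++ = *s++;
            bytes--;
        }

        /* Body: aligned 4-byte copies. */
        uint32_t *dw = (uint32_t *)d;
        const uint32_t *sw = (const uint32_t *)s;
        while (bytes >= 4) {
            *dw++ = *sw++;
            bytes -= 4;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Tail, or the whole copy when the misalignments differ. */
    while (bytes > 0) {
        *d++ = *s++;
        bytes--;
    }
    return dest;
}
```

Even with the alignment handling, there is no reason to expect this to beat a tuned library memcpy. It only shows the extra bookkeeping the code in the first post leaves out.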

I'm quite possibly wrong but there is no way for me to know. Please tell us what you are doing that involves memcpying.
It is a good idea when posting about optimisation to include in your initial post what part of your program is performing too slowly, and what you have already tried etc. Oh and if you are just making a code library then that's cool, just let us know.

Our goal is the same as yours, to make your program run faster. It's just quite likely that there are better ways to achieve this that we can only think of through a better understanding of what you are doing.

Remember, the fastest instruction is the one that is never executed!

##### Share on other sites
There is really no point in optimising memcpy since the function is, as the name suggests, going to be limited by the memory bandwidth. RAM is significantly slower than the CPU so any extensive memory access is going to kill your performance. With small memory copies, the caches can hide this to some degree.

The best approach is to avoid doing the copies.

Skizz
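One common way to avoid a copy altogether, sketched here in C with invented names (`struct double_buffer`, `double_buffer_swap`), is to keep two buffers and exchange the pointers to them instead of copying the contents from one to the other. Whether this applies depends entirely on what the copy was being used for, which the thread never establishes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical double buffer: one side is read while the other is written. */
struct double_buffer {
    unsigned char *front; /* data currently being read */
    unsigned char *back;  /* data currently being written */
    size_t size;          /* size of each buffer in bytes */
};

/* Publishes the back buffer by exchanging two pointers.
   The cost is constant, no matter how large the buffers are. */
static void double_buffer_swap(struct double_buffer *db)
{
    assert(db != NULL);
    unsigned char *tmp = db->front;
    db->front = db->back;
    db->back = tmp;
}
```

Compared with `memcpy(front, back, size)`, the swap touches no buffer memory at all, which is the point of the advice above.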
