New Mini-Article: Speeding up memcpy

Started by
44 comments, last by Jan Wassenberg 18 years, 4 months ago
Howdy. I've just written a Technical Report on speeding up memcpy. It presents source code and techniques behind an implementation that beats VC7.1's memcpy() by 7..300%, depending on transfer size. There are no special CPU requirements (it runs on 'all' CPUs from the original Pentium MMX on) and it can easily be dropped into other projects. Hope it helps someone! Feedback is welcome. [Edited by - Jan Wassenberg on December 8, 2005 5:22:05 AM]
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
That is sweeeeeet.

I rely heavily on memcpy for cloning assets around so this will give a real performance boost in my project. Many thanks!
Winterdyne Solutions Ltd is recruiting - this thread for details!
The whole article is very interesting - a good read. The only negative point is that the code can't be used in anything else but a GPL'ed project.
Quote:
Source Code:
This is licensed under the GPL.

As a consequence, it will be harder to use. At least, I can't use it :S

But it is still good work, and an impressive achievement! [smile]
Looks quite interesting - what CPU types has it been tested on? It'd be interesting to see what it would do on a Celeron-type with a smaller on-board cache...

_winterdyne_: Not to get off topic here, but can I ask (without knowing anything about your project) why you need to clone assets rather than re-use pointers to the same asset? (Just curious.)
I'm developing a flexible MMO infrastructure - part of the mandate is to allow heavy asset reuse where possible but also to allow live editing. Portions of the descriptors for areas that are being edited are cloned before alteration if they're already in use elsewhere, and the new, altered asset is compared against the old to generate a delta patch. Whether this will be done on the server or on a trusted superclient is still up in the air.

The heavy asset reuse and checking should allow for download size to be minimised, and those delta packages can be sent to clients rather than distribute an entirely new asset. That's the plan anyway. :-)
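
For anyone wondering what the clone-then-diff idea could look like in code, here is a minimal sketch. It is only an illustration, not code from the project described above: `DeltaRun`, `clone_asset`, `make_delta` and `apply_delta` are hypothetical names, assets are assumed to be flat byte blobs, and the old and new blobs are assumed to have the same size (a real delta format would also have to handle insertions and removals).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical type for illustration: one contiguous run of changed bytes.
struct DeltaRun {
    std::size_t offset;               // where the changed bytes start
    std::vector<std::uint8_t> bytes;  // the replacement bytes
};

// Clone an asset blob before editing, so other users keep the original.
std::vector<std::uint8_t> clone_asset(const std::vector<std::uint8_t>& src) {
    std::vector<std::uint8_t> copy(src.size());
    if (!src.empty())
        std::memcpy(copy.data(), src.data(), src.size());
    return copy;
}

// Compare old and new blobs and collect the runs of bytes that differ.
// Assumption: both blobs have the same size; any extra tail is ignored.
std::vector<DeltaRun> make_delta(const std::vector<std::uint8_t>& oldData,
                                 const std::vector<std::uint8_t>& newData) {
    std::vector<DeltaRun> runs;
    const std::size_t n =
        oldData.size() < newData.size() ? oldData.size() : newData.size();
    std::size_t i = 0;
    while (i < n) {
        if (oldData[i] == newData[i]) { ++i; continue; }
        const std::size_t start = i;
        while (i < n && oldData[i] != newData[i]) ++i;
        DeltaRun run;
        run.offset = start;
        run.bytes.assign(newData.begin() + static_cast<std::ptrdiff_t>(start),
                         newData.begin() + static_cast<std::ptrdiff_t>(i));
        runs.push_back(run);
    }
    return runs;
}

// Apply a delta to a client's copy of the old asset.
// Runs that would write past the end of the target are skipped.
void apply_delta(std::vector<std::uint8_t>& target,
                 const std::vector<DeltaRun>& runs) {
    for (const DeltaRun& r : runs) {
        if (r.offset + r.bytes.size() <= target.size())
            std::memcpy(target.data() + r.offset, r.bytes.data(), r.bytes.size());
    }
}
```

The clone is a single bulk copy, which is the place a faster memcpy would show up; the delta only carries the runs that changed, which is what keeps the download small.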

Winterdyne Solutions Ltd is recruiting - this thread for details!
This looks like really nice work.
To add this to a C++ project, would I have to write a header declaring the function like this?

void* __declspec(naked) ia32_memcpy(void* dst, const void* src, size_t nbytes);
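
A hedged sketch of what such a header might look like, not the article's actual header. As far as I know, MSVC only accepts __declspec(naked) on a function definition and rejects it on a prototype, so the declaration below leaves it out. The `extern "C"` linkage and the default calling convention are assumptions about how the routine is built, and the definition that forwards to std::memcpy is a stand-in so the sketch links on its own.

```cpp
#include <cstddef>
#include <cstring>

// What the header might contain. Assumptions: the routine is exported with
// C linkage and uses the compiler's default calling convention.
extern "C" void* ia32_memcpy(void* dst, const void* src, std::size_t nbytes);

// Stand-in definition so this sketch links by itself. This is NOT the
// article's implementation; it simply forwards to the standard memcpy.
extern "C" void* ia32_memcpy(void* dst, const void* src, std::size_t nbytes) {
    return std::memcpy(dst, src, nbytes);
}
```

In a real project the stand-in definition would be dropped and the object file containing the actual routine linked in instead.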
Very (very) nice article, I can probably use that knowledge for lots of stuff. I never thought of speeding up memcpy, well I did once when reading Andre LaMothe's books, but then I realized lots of his optimization tricks were horrible (today at least) and from that point I never thought of implementing my own memcpy.

[Edited by - CTar on November 29, 2005 9:19:10 AM]
Cool, it looks like you borrowed a bunch of tricks from an old AMD paper on increasing memcpy throughput :) Damn, where is that paper :)

Thanks for sharing.

Cheers
Chris
Just a quick thought, but does memset() suffer the same failings as memcpy()?

An adapted version of this could help optimise initialisation of large data structures, rather than using a for loop on smaller element sizes.

Edit: Or worse, several for loops on differing element types.
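
To make the comparison concrete, here is a small sketch of the two initialisation styles. `Particle` and both function names are hypothetical, chosen only for illustration. The memset version only works for byte-wise fill patterns such as all zeros, and it assumes a trivially copyable type whose all-zero-bits representation means zero (true for IEEE-754 floats and for ints).

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical record type, for illustration only.
struct Particle {
    float pos[3];
    float vel[3];
    int   flags;
};

// One small loop per field: the "several for loops" case.
void clear_with_loops(Particle* p, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        for (int k = 0; k < 3; ++k) p[i].pos[k] = 0.0f;
        for (int k = 0; k < 3; ++k) p[i].vel[k] = 0.0f;
        p[i].flags = 0;
    }
}

// A single bulk fill over the whole array.
void clear_with_memset(Particle* p, std::size_t count) {
    static_assert(std::is_trivially_copyable<Particle>::value,
                  "memset is only safe on trivially copyable types");
    std::memset(p, 0, count * sizeof(Particle));
}
```

Whether the bulk fill is actually faster depends on the compiler and the transfer size, so it is worth measuring both before switching.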
Winterdyne Solutions Ltd is recruiting - this thread for details!

This topic is closed to new replies.
