New Mini-Article: Speeding up memcpy
Howdy. I've just written a Technical Report on speeding up memcpy.
It presents the source code and techniques behind an implementation that beats VC7.1's memcpy() by 7..300%, depending on transfer size. There are no special CPU requirements (it runs on 'all' CPUs from the original Pentium MMX on), and it can easily be dropped into other projects.
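For readers who haven't opened the report yet, the core idea is dispatching on transfer size: tiny copies avoid setup overhead, medium copies use wide in-cache moves, and huge copies use non-temporal stores so they don't evict the working set. The sketch below is only an illustration of that dispatch structure, not the report's actual code; the thresholds are made-up placeholders, and plain loops/memcpy stand in for the MMX (movq) and streaming (movntq) paths of the real implementation.

```cpp
#include <cstring>
#include <cstddef>

// Placeholder cutoffs for illustration; the report derives its own
// thresholds from benchmarks on real hardware.
static const size_t TINY_MAX  = 64;         // below this, call overhead dominates
static const size_t CACHE_MAX = 192 * 1024; // above this, bypass the cache

void* fast_memcpy_sketch(void* dst, const void* src, size_t nbytes)
{
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);

    if (nbytes <= TINY_MAX) {
        // Tiny copies: a simple byte loop (the real code uses mov/movsd).
        for (size_t i = 0; i < nbytes; ++i)
            d[i] = s[i];
    } else if (nbytes <= CACHE_MAX) {
        // Medium copies: wide in-cache moves (MMX movq in the real code);
        // the library memcpy stands in for that here.
        std::memcpy(dst, src, nbytes);
    } else {
        // Huge copies: the real code uses non-temporal stores (movntq)
        // with prefetching to avoid polluting the cache; again memcpy
        // is only a stand-in.
        std::memcpy(dst, src, nbytes);
    }
    return dst;
}
```

The point of the structure is that each size class has a different bottleneck (call overhead, cache bandwidth, memory bandwidth), so one code path can't be optimal for all of them.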
Hope it helps someone! Feedback is welcome.
[Edited by - Jan Wassenberg on December 8, 2005 5:22:05 AM]
That is sweeeeeet.
I rely heavily on memcpy for cloning assets around so this will give a real performance boost in my project. Many thanks!
The whole article is very interesting - a good read. The only negative point is that the code can't be used in anything else but a GPL'ed project.
Quote:
Source Code:
This is licensed under the GPL.
As a consequence, it will be harder to use. At least, I can't use it :S
But it is still a good work, and an impressive achievement ! [smile]
Looks quite interesting - what CPU types has it been tested on? It'd be interesting to see what it would do on a Celeron-type with a smaller on-board cache...
_winterdyne_: Not to get off topic here, but can I ask (without knowing anything about your project) why do you need to clone assets rather than to re-use pointers to the same asset? (Just curious - )
I'm developing a flexible MMO infrastructure - part of the mandate is to allow heavy asset reuse where possible, but also to allow live editing. Portions of the descriptors for areas being edited are cloned before alteration if they're already in use elsewhere, and the new, altered asset is compared against the old to generate a delta patch. Whether this will be done on the server or on a trusted superclient is still up in the air.
The heavy asset reuse and checking should allow download size to be minimised, and those delta packages can be sent to clients rather than distributing an entirely new asset. That's the plan anyway. :-)
To add this to a C++ project, would I have to write a header declaring the function like this?
void* __declspec(naked) ia32_memcpy(void* dst, const void* src, size_t nbytes);
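Almost - two details are worth noting. In MSVC, `__declspec(naked)` conventionally goes before the return type, and it only matters on the *definition*; callers don't need to see it. What the header does need, if the implementation is built as C (or assembly with C linkage), is `extern "C"` so the C++ compiler doesn't mangle the name. A hedged sketch of what such a header might look like (filename and guard are my own invention):

```cpp
// ia32_memcpy.h -- hypothetical header sketch
#ifndef IA32_MEMCPY_H
#define IA32_MEMCPY_H

#include <cstddef>

// extern "C" prevents C++ name mangling so the linker can match the
// C/asm implementation. __declspec(naked) belongs on the definition
// in the .c/.asm file, not on this declaration.
extern "C" void* ia32_memcpy(void* dst, const void* src, size_t nbytes);

#endif // IA32_MEMCPY_H
```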
Very (very) nice article, I can probably use that knowledge for lots of stuff. I never thought of speeding up memcpy, well I did once when reading Andre LaMothe's books, but then I realized lots of his optimization tricks were horrible (today at least) and from that point I never thought of implementing my own memcpy.
[Edited by - CTar on November 29, 2005 9:19:10 AM]
Cool, it looks like you borrowed a bunch of tricks from an old AMD paper on increasing memcpy throughput. :) Damn, where is that paper? :)
Thanks for sharing.
Cheers
Chris
Just a quick thought, but does memset() suffer the same failings as memcpy()?
An adapted version of this could help optimise the initialisation of large data structures, rather than using a for loop over smaller element sizes.
Edit: Or worse, several for loops on differing element types.
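To illustrate the point about replacing per-element loops: when an all-zero byte pattern is a valid initial state for every field (true for ints, and for floats on IEEE-754 platforms), one bulk memset over the whole array can replace several typed loops. The struct below is a made-up example, not from the article:

```cpp
#include <cstring>
#include <cstddef>

// Hypothetical example struct; any POD type with an all-zero
// initial state works the same way.
struct Particle {
    float pos[3];
    float vel[3];
    int   age;
};

// One memset over the whole block replaces separate loops over
// pos, vel, and age. This is only valid because zero bytes decode
// to 0.0f and 0 for these field types.
void clear_particles(Particle* p, size_t count)
{
    std::memset(p, 0, count * sizeof(Particle));
}
```

Whether a streaming (non-temporal) memset variant pays off would, like memcpy, depend on the fill size relative to the cache.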