New Mini-Article: Speeding up memcpy

Jan Wassenberg    999
Howdy. I've just written a Technical Report on speeding up memcpy. It presents the source code and techniques behind an implementation that beats VC7.1's memcpy() by 7..300%, depending on transfer size. There are no special CPU requirements (it runs on 'all' CPUs from the original Pentium MMX on) and it can easily be dropped into other projects. Hope it helps someone! Feedback is welcome. [Edited by - Jan Wassenberg on December 8, 2005 5:22:05 AM]

The whole article is very interesting - a good read. The only negative point is that the code can't be used in anything else but a GPL'ed project.
Quote:

Source Code:
This is licensed under the GPL.

As a consequence, it will be harder to use. At least, I can't use it :S

But it is still good work, and an impressive achievement! [smile]

DJHoy    378
Looks quite interesting - what CPU types has it been tested on? It'd be interesting to see what it would do on a Celeron-type with a smaller on-board cache...

_winterdyne_: Not to get off topic here, but can I ask (without knowing anything about your project) why you need to clone assets rather than re-use pointers to the same asset? (Just curious - )

_winterdyne_    530
I'm developing a flexible MMO infrastructure - part of the mandate is to allow heavy asset reuse where possible, but also to allow live editing. Portions of the descriptors for areas being edited are cloned before alteration if they're already in use elsewhere, and the new, altered asset is compared against the old to generate a delta patch. Whether this will be done on the server or on a trusted superclient is still up in the air.

The heavy asset reuse and checking should allow the download size to be minimised, and those delta packages can be sent to clients rather than distributing an entirely new asset. That's the plan anyway. :-)

Nitage    1107
To add this to a C++ project, would I have to write a header declaring the function like this?

void* __declspec(naked) ia32_memcpy(void* dst, const void* src, size_t nbytes);

CTar    1134
Very (very) nice article; I can probably use that knowledge for lots of stuff. I never thought about speeding up memcpy - well, I did once when reading Andre LaMothe's books, but then I realized many of his optimization tricks are horrible (today, at least), and from that point on I never considered implementing my own memcpy.

[Edited by - CTar on November 29, 2005 9:19:10 AM]

chollida1    532
Cool, it looks like you borrowed a bunch of tricks from an old AMD paper on increasing memcpy throughput:) Damn, where is that paper:)

Thanks for sharing.

Cheers
Chris

_winterdyne_    530
Just a quick thought, but does memset() suffer the same failings as memcpy()?

An adapted version of this could help optimise initialisation of large data structures, rather than using a for.. loop on smaller element sizes.

Edit: Or worse, several for loops on differing element types.

snk_kid    1312
Quote:
Original post by Fruny
How does it compare with std::copy ?


Just to highlight a point: std::copy on modern compilers (VC++ 7.1 and later, or G++) will dispatch at compile time to memmove (wherever possible), not memcpy, because the input and output ranges are permitted to overlap.

So, Jan Wassenberg, see if you can do a new article on speeding up memmove - if you can [wink].

Jan Wassenberg    999
Hey, thanks for the kind words :)

Quote:
The only negative point is that the code can't be used in anything else but a GPL'ed project.

hehe, true. I am willing to entertain discussion of an alternate license in exchange for other considerations :)

Quote:
Looks quite interesting - what CPU types has it been tested on? It'd be interesting to see what it would do on a Celeron-type with a smaller on-board cache...

I've also tested on a Pentium III (laptop) system and measured a 3% speed gain in the small-transfer benchmark. As transfers get larger, the same kind of huge increases as on Athlon DDR systems are expected.
Prediction of performance on a Celeron:
there are 2 Celeron versions, one based on the Pentium II and a later one with a Pentium III core. Obviously the latter will do much better because it also supports SSE instructions (PREFETCH, MOVNTQ); the former is stuck with the MMX MOVQ copy technique. Celerons are basically the same processor, only with half of the L2 cache disabled (due to manufacturing defects in the wafer). Therefore, performance of the software prefetch+MOVNTQ technique is expected to suffer (it would be better to switch to block prefetch earlier due to L2 cache thrashing). Unfortunately, messing with BP_THRESHOLD would penalize newer processors because they'd unnecessarily switch to block prefetching, so it's probably best to leave it as-is.

Quote:
To add this to a C++ project, would I have to write a header decalring the function like this?
void* __declspec(naked) ia32_memcpy(void* dst, const void* src, size_t nbytes);

No, fortunately that's not necessary. __declspec(naked) only affects definitions, not declarations.
When declaring a function that's actually implemented in asm, no further work is needed; the asm code is implemented such that the compiler doesn't have to care what exactly it is calling (the calling convention matches).
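For illustration, a plain declaration along these lines is all that's needed (a sketch; it assumes the asm routine is exported with C linkage under the name used in the question):

#include <cstddef>   // size_t

// ordinary declaration; __declspec(naked) belongs only on the (asm) definition
extern "C" void* ia32_memcpy(void* dst, const void* src, size_t nbytes);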

Quote:
Cool, it looks like you borrowed a bunch of tricks from an old AMD paper on increasing memcpy throughput:) Damn, where is that paper:)

Referenced as [AMD1] and [AMD2] in the article/bibliography! :)

Quote:
Just a quick thought, but does memset() suffer the same failings as memcpy()?
An adapted version of this could help optimise initialisation of large data structures, rather than using a for.. loop on smaller element sizes.

Ah, yes - good idea! Zeroing e.g. the aforementioned transposition tables (if you're not marking them invalid) would be time-critical.
Should be easy to do via copy+paste if you need it immediately, but I'll think about a nicer way to do that.
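To sketch the idea (this is not the article's code; the names are made up, and it assumes MMX+SSE support for MOVNTQ and that the filled buffer won't be read back immediately):

#include <mmintrin.h>    // MMX (__m64, _mm_set1_pi8, _mm_empty)
#include <xmmintrin.h>   // SSE (_mm_stream_pi = MOVNTQ, _mm_sfence)
#include <stddef.h>

// fill a large buffer without polluting the cache
static void stream_memset(void* dst, unsigned char value, size_t nbytes)
{
    const __m64 v = _mm_set1_pi8((char)value);
    __m64* p = (__m64*)dst;
    const size_t qwords = nbytes / 8;
    for(size_t i = 0; i < qwords; i++)
        _mm_stream_pi(p + i, v);         // non-temporal 8-byte store (MOVNTQ)
    _mm_sfence();                        // make the streaming stores visible
    _mm_empty();                         // EMMS: leave MMX state

    // write any remaining tail bytes with ordinary stores
    unsigned char* tail = (unsigned char*)dst + qwords*8;
    for(size_t i = 0; i < nbytes % 8; i++)
        tail[i] = value;
}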

Quote:
How does it compare with std::copy ?

Interesting question. I'll test that later tonight.

Jan Wassenberg    999
Quote:
nutz you just missed my post :(

hehe :)

Fruny: results are in! See below.

Quote:
Just to highlight a point: std::copy on modern compilers (VC++ 7.1 and later, or G++) will dispatch at compile time to memmove, not memcpy, because the input and output ranges are permitted to overlap.
So, Jan Wassenberg, see if you can do a new article on speeding up memmove - if you can.

Unfortunately this [std::copy] looks like a case of poor library design that cannot be saved by optimization. memcpy() gives the guarantee that its parameters are not aliased, while any implementation of memmove()/std::copy() has to check whether the ranges overlap (or stick with the safe-but-slow method of always copying backwards, which hurts performance because it foils hardware prefetching).
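To illustrate the extra work, a minimal sketch (not any particular library's implementation):

// naive memmove: the overlap check is unavoidable
void* memmove_sketch(void* dst, const void* src, size_t n)
{
    char* d = (char*)dst;
    const char* s = (const char*)src;
    if(d <= s || d >= s + n)
    {
        // no harmful overlap: copy forwards (the memcpy fast path)
        for(size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    else
    {
        // dst overlaps the tail of src: must copy backwards to stay correct,
        // which foils hardware prefetching
        for(size_t i = n; i != 0; i--)
            d[i-1] = s[i-1];
    }
    return dst;
}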

Therefore, std::copy will always have more work to do than a good memcpy() implementation. Which brings us to the next point: I just looked at VC7.1's code in more detail, and am absolutely amazed - memcpy == memmove! They both always do the copy-up and copy-down checks.
So what does that mean? Both std::copy and memcpy are slower than they need to be - and now we know why.

I've cleaned up the benchmark (code below) and it tells us: [for align=64, max_size=1024, max_misalign=63]

AMD : 15617949 (+7.6%)
VC7.1 : 18674264 (+28.6%)
Jan : 14518968 (+0.0%)
std::copy : 19266087 (+32.7%)
std::copy2: 18604376 (+28.1%)

(these figures are total # CPU clocks elapsed over transfers of all sizes and misalignments)
"Jan" is the new code; it performs quite well :) AMD hangs back a bit due to the problems explained in the article. Clearly the others (all memcpy derivatives and basically the same) cannot compete - and this is even disregarding SSE transfers which kick in at sizes of 64KB and above.

A note on std::copy: performance is expected to match memcpy but differences arise from how these implementations are called from the benchmark. std::copy2 reflects performance when calling it (i.e. memmove) directly; std::copy wraps it in a function and calls via function pointer (same treatment as for the others).
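The trampoline itself isn't reproduced below; it's nothing more than a wrapper matching the PFmemcpy signature, roughly:

// sketch of the "std::copy" wrapper; called via function pointer like the others
static void* std_copy_trampoline(void* dst, const void* src, size_t nbytes)
{
    std::copy((const u8*)src, (const u8*)src + nbytes, (u8*)dst);
    return dst;
}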

One additional caveat is that this benchmark favors the other implementations. The reason is that we must run each transfer several times to rule out external interference, which helps the implementations containing more conditional jumps because those jumps are then sure to be predicted correctly. Unfortunately, I don't see any way around this (increasing thread priority isn't fool-proof).

The source:


// call through a function pointer for convenience and to foil optimizations.
typedef void*(*PFmemcpy)(void* restrict, const void* restrict, size_t);

// allocated memory is aligned to this; benchmark adds misalign.
static const size_t ALIGN = 64;
// transfer size. loop starts at 1 (0 makes no sense).
static const size_t MAX_SIZE = 64*1024;
// misalignment within buffer. loop starts at 0 (i.e. aligned).
static const size_t MAX_MISALIGN = 63;

// perform the benchmark for <impl>, copying from src_mem to dst_mem.
// the execution time of transfers of all sizes and misalignments up to
// MAX_* are added together and returned [in CPU clocks].
static i64 benchmark_impl(PFmemcpy impl, void* dst_mem, const void* src_mem)
{
    i64 total_dt = 0;
    for(size_t misalign = 0; misalign <= MAX_MISALIGN; misalign++)
    {
        for(size_t size = 1; size <= MAX_SIZE; size++)
        {
            void* dst = (char*)dst_mem + misalign;
            const void* src = (const char*)src_mem + misalign;

            // repeat several times and take the *best* value to avoid
            // external interference (e.g. task switch)
            i64 min_dt = LLONG_MAX;
            for(int rep = 0; rep < 10; rep++)
            {
                const i64 t0 = rdtsc();
                impl(dst, src, size);
                const i64 t1 = rdtsc();

                const i64 dt = t1-t0;
                min_dt = MIN(dt, min_dt);
            }
            debug_assert(min_dt > 0); // shouldn't be negative or 0

            total_dt += min_dt;
        }
    }

    return total_dt;
}

static i64 benchmark_std_copy(void* dst_mem, const void* src_mem)
{
    i64 total_dt = 0;
    for(size_t misalign = 0; misalign <= MAX_MISALIGN; misalign++)
    {
        for(size_t size = 1; size <= MAX_SIZE; size++)
        {
            void* dst = (char*)dst_mem + misalign;
            const void* src = (const char*)src_mem + misalign;

            // repeat several times and take the *best* value to avoid
            // external interference (e.g. task switch)
            i64 min_dt = LLONG_MAX;
            for(int rep = 0; rep < 10; rep++)
            {
                const i64 t0 = rdtsc();
                std::copy<const u8*, u8*>((const u8*)src, (const u8*)src+size, (u8*)dst);
                const i64 t1 = rdtsc();

                const i64 dt = t1-t0;
                min_dt = MIN(dt, min_dt);
            }
            debug_assert(min_dt > 0); // shouldn't be negative or 0

            total_dt += min_dt;
        }
    }

    return total_dt;
}


void memcpy_benchmark()
{
    const size_t total_size = MAX_SIZE + ALIGN + MAX_MISALIGN+1;
    void* dst_mem = malloc(total_size);
    const void* src_mem = (const void*)malloc(total_size);
    if(!dst_mem || !src_mem)
    {
        debug_warn("memcpy_benchmark failed; couldn't allocate buffers");
        return;
    }

    // align pointers [just so that misalignment actually goes from 0..63,
    // instead of e.g. 21..20 (mod 64)]
    void* aligned_dst = (void*)round_up((uintptr_t)dst_mem, ALIGN);
    const void* aligned_src = (const void*)round_up((uintptr_t)src_mem, ALIGN);

    struct Contender
    {
        const char* name;
        PFmemcpy impl;
        i64 elapsed_clocks;
    };
    static struct Contender contenders[] =
    {
        {"AMD ", memcpy_amd},
        {"VC7.1 ", memcpy},
        {"Jan ", memcpy2},
        {"std::copy ", std_copy_trampoline},
        {"std::copy2", 0}
    };
    for(int i = 0; i < ARRAY_SIZE(contenders)-1; i++)
        contenders[i].elapsed_clocks = benchmark_impl(contenders[i].impl, aligned_dst, aligned_src);
    contenders[ARRAY_SIZE(contenders)-1].elapsed_clocks = benchmark_std_copy(aligned_dst, aligned_src);

    // find minimum
    i64 min_clocks = LLONG_MAX;
    for(int i = 0; i < ARRAY_SIZE(contenders); i++)
        min_clocks = MIN(min_clocks, contenders[i].elapsed_clocks);

    // display results with delta to best value
    debug_printf("MEMCPY BENCHMARKS\n");
    for(int i = 0; i < ARRAY_SIZE(contenders); i++)
    {
        const float difference = (contenders[i].elapsed_clocks - min_clocks) / (float)min_clocks;
        debug_printf(" %s: %I64d (+%.1f%%)\n", contenders[i].name, contenders[i].elapsed_clocks, difference*100.f);
    }

    free(dst_mem);
    free((void*)src_mem);
}

Promit    13246
Here's a function that can be used to obtain a standard-library-style copy that goes to memcpy rather than memmove:

#include <cstddef>   // std::ptrdiff_t, std::size_t
#include <cstring>   // memcpy

//special copy (danger! Only for types for which memcpy is valid! Does not dispatch to memmove!)
template<typename InIt, typename OutIt>
inline OutIt MemCopy( InIt First, InIt Last, OutIt Dest )
{
    //this implementation is based on VC 7.1's std::copy implementation
    std::ptrdiff_t Diff = Last - First;
#ifdef _M_IX86
    void* memcpy_amd( void* dest, const void* src, std::size_t n );
    return ( (OutIt) memcpy_amd( &*Dest, &*First, Diff * sizeof(*First) ) + Diff );
#else
    return ( (OutIt) memcpy( &*Dest, &*First, Diff * sizeof(*First) ) + Diff );
#endif
}


This was written to use AMD's memcpy implementation, but a quick change will make it call Jan's instead. Needless to say, you should drop it in a namespace and change the name to conform with your own conventions. Keep in mind that this does not include the iterator trait based dispatching code that std::copy does, so it won't correctly dispatch for types with a non-trivial copy constructor. In other words, don't use it on anything that memcpy wouldn't be safe on.
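For instance, a hypothetical call (assuming the element type is memcpy-safe):

float src[256], dst[256];
MemCopy(src, src + 256, dst);   // forwards to memcpy_amd (or memcpy) under the hood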

nmi    978
In your paper you say that the best-suited copy technique depends on the CPU type (e.g. support for MMX or SSE) and the size of the memory block.

I'm wondering if it would be possible to determine the correct algorithm depending on size in cases where the size of the memory block is constant, i.e. known at compile time.

So if there are up to 64 bytes to copy, the MOV instructions can be inlined directly. If there are more bytes to copy, or the size is only known at runtime, you generate a call to the other assembly functions, which will check the CPU type etc.

This will reduce the overhead for small blocks of data and will create the same code in the other cases.

Guest Anonymous Poster
Another thing to remember is that the best method to use depends on whether you intend to use the data afterwards. If you use non-caching writes it will be slower to access immediately after than if you had used a standard memcpy.

ALWAYS profile your app before and after inserting things like this because in many cases you may end up slowing down your application even if the memcpy routine appears to be faster.

nmi    978
Quote:
Original post by Anonymous Poster
Another thing to remember is that the best method to use depends on whether you intend to use the data afterwards. If you use non-caching writes it will be slower to access immediately after than if you had used a standard memcpy.

Maybe you missed that line which can be found in the paper:
- 64 KB (L1 data cache size; beyond this, techniques that avoid polluting the cache are a big win)

So for large amounts of data you should use the non-caching writes, or chances are high you will end up with the wrong data in your cache.

Quote:

ALWAYS profile your app before and after inserting things like this because in many cases you may end up slowing down your application even if the memcpy routine appears to be faster.


I think that's what Jan did.

doynax    850
Quote:
Original post by nmi
Quote:

ALWAYS profile your app before and after inserting things like this because in many cases you may end up slowing down your application even if the memcpy routine appears to be faster.

I think that's what Jan did.
I think he meant that a synthetic benchmark may not be ideal, since the optimum method may depend on what you're going to do with the data afterwards. At least it's a good idea to be careful with explicit cache control, especially since many modern CPUs have more than 256 kB of L2 cache.
Also, how does the new function compare to the compiler's intrinsic for small (especially fixed-size) copies?

BTW, don't the original (pre-XP) Athlons support MOVNTQ too, through the extended MMX instruction set? And what about 16-byte SSE transfers with MOVAPS/MOVDQA?

Jan Wassenberg    999
Promit: interesting. Is the purpose familiarity/taste, or is there a pressing need for such a wrapper?

nmi:
Quote:
I'm wondering if it would be possible to determine the correct algorithm depending on size in cases where the size of the memory block is constant, i.e. known at compile time.
So if there are up to 64 bytes to copy, the MOV instructions can be inlined directly. If there are more bytes to copy, or the size is only known at runtime, you generate a call to the other assembly functions, which will check the CPU type etc.
This will reduce the overhead for small blocks of data and will create the same code in the other cases.

hm. I don't see a way to generate such code at compile-time. Is this a "wouldn't it be nice if" kind of thing? :)
BTW, overhead for small blocks is already quite low: the more complicated logic to select technique based on size only runs when size > 64. But yes, compiler-generated intrinsics are even better when available. For example, you could assign structs to each other rather than using memcpy.
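For example (a trivial illustration, not from the article):

struct Particle { float pos[3]; float vel[3]; };

void copy_particle(Particle* dst, const Particle* src)
{
    // size is known at compile time, so the compiler can emit inline MOVs
    // instead of calling memcpy(dst, src, sizeof(Particle))
    *dst = *src;
}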

AP:
Quote:
Quote:
Another thing to remember is that the best method to use depends on whether you intend to use the data afterwards. If you use non-caching writes it will be slower to access immediately after than if you had used a standard memcpy.

Maybe you missed that line which can be found in the paper:
- 64 KB (L1 data cache size; beyond this, techniques that avoid polluting the cache are a big win)
So for large amounts of data you should use the non-caching writes, or chances are high you will end up with the wrong data in your cache.

Let me expand on that.
The thresholds for switching to the non-in-cache techniques are chosen at the limit where cache thrashing would otherwise result (i.e. the cache size). Beyond that point they are clearly beneficial, but they are not used at all otherwise; the implementation always tries to keep your data in-cache where that makes sense (i.e. where it probably won't be displaced before being used).
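Schematically, the dispatch looks something like this (function names and exact thresholds are illustrative only; the real values are derived from the cache sizes as described in the article):

if(nbytes <= 64)                           // tiny: simple unrolled MOVs
    copy_tiny(dst, src, nbytes);
else if(nbytes < L1_THRESHOLD)             // fits in cache: keep it there (MMX MOVQ)
    copy_movq(dst, src, nbytes);
else if(nbytes < BP_THRESHOLD)             // larger: software prefetch + MOVNTQ
    copy_prefetch_movntq(dst, src, nbytes);
else                                       // huge: block prefetch + MOVNTQ
    copy_block_prefetch_movntq(dst, src, nbytes);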

Doynax:
Quote:
I think he meant that a synthetic benchmark may not be ideal, since the optimum method may depend on what you're going to do with the data afterwards.

Yes, measuring performance in context is always better.

Quote:
Also, how does the new function compare to small (especially fixed size) compiler intrinsics?
And what about 16-byte SSE transfers with MOVAPS/MOVDQA?

Good questions. Will investigate tomorrow; am dog tired.
I suspect that on Athlons, at least, MOVAPS will not help because IIRC the internal datapaths are only 64 bits wide, so larger accesses are split up. I'll give it a try, though. Unfortunately I lack SSE2 hardware to test MOVDQA.
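For reference, the kind of inner loop in question would look roughly like this (untested sketch; assumes SSE, 16-byte-aligned pointers and a size that's a multiple of 16):

#include <xmmintrin.h>

static void copy_sse_16(void* dst, const void* src, size_t nbytes)
{
    const float* s = (const float*)src;
    float* d = (float*)dst;
    for(size_t i = 0; i < nbytes/16; i++)
    {
        const __m128 v = _mm_load_ps(s + i*4);   // MOVAPS: aligned 16-byte load
        _mm_stream_ps(d + i*4, v);               // MOVNTPS: non-temporal store
    }
    _mm_sfence();
}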

Quote:
BTW doesn't the original (pre-XP) Athlons support MOVNTQ too, through the extended MMX instruction set?

Yes! Thanks for pointing that out. I will update the init routine and then you will have seriously boosted performance on old Athlons :)

Zahlman    1682
Quote:
Original post by Jan Wassenberg
hm. I don't see a way to generate such code at compile-time. Is this a "wouldn't it be nice if" kind of thing? :)


Templating, with recursion and partial specialization to create 'ranges' of memory sizes?


// probably horribly broken, but gets the idea across
// (function templates can't be partially specialized, so dispatch via a class
//  template; the recursion walks 'type' down from 'size' until it hits 64 or 0)
template <int size, int type>
struct my_memcpy_impl {
    static void copy(char* src, char* dest) {
        my_memcpy_impl<size, type - 1>::copy(src, dest);
    }
};

template <int size>
struct my_memcpy_impl<size, 0> {
    static void copy(char* src, char* dest) {
        my_smallblock_memcpy(src, dest, size);
    }
};

template <int size>
struct my_memcpy_impl<size, 64> {
    static void copy(char* src, char* dest) {
        my_largeblock_memcpy(src, dest, size);
    }
};

// array copy.
template <int size>
inline void my_array_memcpy(char (&src)[size], char (&dest)[size]) {
    my_memcpy_impl<size, size>::copy(src, dest);
}

snk_kid    1312
Quote:
Original post by Jan Wassenberg
... std::copy wraps it in a function and calls via function pointer (same treatment as for the others).


I didn't quite get what you meant here; there shouldn't be a need to use pointers to functions internally in generic algorithms, unless you were referring to your version of std::copy or something.

Anyway, bear in mind (this goes for others too) that just as with memcpy/memmove, you cannot use this code on arrays of non-POD types. In C++ you should be using std::copy because it takes care of those "special cases" while still handling arrays of POD types; this is all transparent and incurs no cost in efficiency.

If you would like to use Jan's version of memcpy with something similar to std::copy that handles those "special cases" transparently and efficiently, I've written some base code you can use; all you have to do is replace every occurrence of memmove (there's only one [wink]) with a call to Jan's code.

The code depends on the Boost type traits and concept checking libraries. The Boost type traits can easily be replaced with TR1 type traits (if your compiler supports TR1) or your own type traits, and the concept checking can be removed if required.


#include <cstring>   // std::memmove
#include <iterator>
#include <boost/type_traits/integral_constant.hpp>
#include <boost/type_traits/has_trivial_assign.hpp>
#include <boost/concept_check.hpp>

namespace custom_algorithm {

template < typename Tp >
struct is_iterator_wrapper : boost::false_type {
    typedef is_iterator_wrapper<Tp> type;
};

// a little trick for G++ ;-)
#ifdef __GNUG__
template < typename Iterator, typename Container >
struct is_iterator_wrapper< std::__normal_iterator<Iterator, Container> >
    : boost::true_type {
    typedef is_iterator_wrapper< std::__normal_iterator<Iterator, Container> > type;
};
#endif

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy__( InputIterator first, InputIterator last,
                              OutputIterator result, std::input_iterator_tag) {
    for (; first != last; ++result, ++first)
        *result = *first;
    return result;
}

template< typename RandomAccessIterator, typename OutputIterator >
inline OutputIterator copy__( RandomAccessIterator first, RandomAccessIterator last,
                              OutputIterator result, std::random_access_iterator_tag) {

    typedef typename std::iterator_traits<RandomAccessIterator>::difference_type
        Distance;

    for(Distance n = last - first; n > 0; --n) {
        *result = *first;
        ++first;
        ++result;
    }
    return result;
}

template< typename Tp >
inline Tp* copy_trivial(const Tp* first, const Tp* last, Tp* result) {
    std::memmove(result, first, sizeof(Tp) * (last - first));
    return result + (last - first);
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_aux2(InputIterator first, InputIterator last,
                                OutputIterator result, boost::false_type) {
    typedef typename std::iterator_traits<InputIterator>::iterator_category itr_cat;
    return copy__(first, last, result, itr_cat());
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_aux2(InputIterator first, InputIterator last,
                                OutputIterator result, boost::true_type) {
    typedef typename std::iterator_traits<InputIterator>::iterator_category itr_cat;
    return copy__(first, last, result, itr_cat());
}

template< typename Tp >
inline Tp* copy_aux2(Tp* first, Tp* last, Tp* result, boost::true_type)
{ return copy_trivial(first, last, result); }

template< typename Tp >
inline Tp* copy_aux2(const Tp* first, const Tp* last, Tp* result, boost::true_type)
{ return copy_trivial(first, last, result); }

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_ni2(InputIterator first, InputIterator last,
                               OutputIterator result, boost::true_type) {
    typedef typename std::iterator_traits<InputIterator>::value_type
        ValueType;
    typedef typename boost::has_trivial_assign<ValueType>::type
        Trivial;
    return OutputIterator(copy_aux2(first, last, result.base(), Trivial()));
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_ni2(InputIterator first, InputIterator last,
                               OutputIterator result, boost::false_type) {
    typedef typename std::iterator_traits<InputIterator>::value_type
        ValueType;
    typedef typename boost::has_trivial_assign<ValueType>::type
        Trivial;
    return copy_aux2(first, last, result, Trivial());
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_ni1(InputIterator first, InputIterator last,
                               OutputIterator result, boost::true_type) {

    typedef typename is_iterator_wrapper<OutputIterator>::type is_wrapper;
    return copy_ni2(first.base(), last.base(), result, is_wrapper());
}

template<typename InputIterator, typename OutputIterator>
inline OutputIterator copy_ni1(InputIterator first, InputIterator last,
                               OutputIterator result, boost::false_type) {
    typedef typename is_iterator_wrapper<OutputIterator>::type is_wrapper;
    return copy_ni2(first, last, result, is_wrapper());
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy( InputIterator first, InputIterator last,
                            OutputIterator result) {
    using namespace boost;
    // concept requirements
    function_requires< InputIteratorConcept<InputIterator> >();
    function_requires< OutputIteratorConcept<OutputIterator,
        typename std::iterator_traits<InputIterator>::value_type> >();
    // dispatch on whether the *input* iterators are wrappers (so .base() can be taken);
    // copy_ni1 then dispatches on whether the output iterator is a wrapper.
    typedef typename is_iterator_wrapper<InputIterator>::type is_wrapper;
    return copy_ni1(first, last, result, is_wrapper());
}

} // namespace custom_algorithm
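For example, hypothetical usage (not part of the code above):

// POD element type: goes through copy_trivial, i.e. the memmove/Jan path
int src[100], dst[100];
custom_algorithm::copy(src, src + 100, dst);

// non-POD element type (std::string): falls back to element-wise assignment
std::string in[8], out[8];
custom_algorithm::copy(in, in + 8, out);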
