New Mini-Article: Speeding up memcpy

How does it compare with std::copy ?
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." — Brian W. Kernighan
Quote:Original post by Fruny
How does it compare with std::copy ?


Just to highlight a point: std::copy on modern compilers such as VC++ 7.1 and later, or G++, will dispatch at compile time to memmove (wherever possible), not memcpy, because the input and output ranges are permitted to overlap.

So, Jan Wassenberg, see if you can do a new article on speeding up memmove, if you can [wink].
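For illustration, here is a rough sketch (my own, not the actual VC++ or G++ library source, and using modern type traits for brevity) of how that compile-time dispatch can work:

#include <cstring>
#include <type_traits>

// generic fallback: element-by-element copy
template<class InIt, class OutIt>
OutIt copy_sketch(InIt first, InIt last, OutIt dest)
{
	for(; first != last; ++first, ++dest)
		*dest = *first;
	return dest;
}

// chosen when both ranges are raw pointers to a trivially copyable type;
// memmove (not memcpy), because std::copy's ranges are allowed to overlap
template<class T>
typename std::enable_if<std::is_trivially_copyable<T>::value, T*>::type
copy_sketch(const T* first, const T* last, T* dest)
{
	const std::size_t n = static_cast<std::size_t>(last - first);
	std::memmove(dest, first, n * sizeof(T));
	return dest + n;
}

The pointer overload is only enabled for trivially copyable element types; everything else falls back to the element-wise loop.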
Hey, thanks for the kind words :)

Quote:The only negative point is that the code can't be used in anything else but a GPL'ed project.

hehe, true. I am willing to entertain discussion of an alternate license in exchange for other considerations :)

Quote:Looks quite interesting - what CPU types has it been tested on? It'd be interesting to see what it would do on a Celeron-type with a smaller on-board cache...

I've also tested on a Pentium III (laptop) system and measured a 3% speed gain in the small-transfer benchmark. As transfers get larger, the same kind of huge increases as on Athlon DDR systems are expected.
Prediction of performance on Celeron:
there are two Celeron versions, one based on the Pentium II and a later one with a Pentium III core. Obviously the latter will do much better, because it also supports SSE instructions (PREFETCH, MOVNTQ); the former is stuck with the MMX MOVQ copy technique. Celerons are basically the same processor, only with half of the L2 cache disabled (due to a manufacturing defect in the wafer). Therefore, performance of the software prefetch+MOVNTQ technique is expected to suffer (it would be better to switch to block prefetch earlier, due to L2 cache thrashing). Unfortunately, messing with BP_THRESHOLD would penalize newer processors because they'd unnecessarily switch to block prefetching, so it's probably best to leave it as-is.

Quote:To add this to a C++ project, would I have to write a header declaring the function like this?
void* __declspec(naked) ia32_memcpy(void* dst, const void* src, size_t nbytes);

No, fortunately that's not necessary. __declspec(naked) only affects definitions, not declarations.
When declaring a function that's actually implemented in asm, no further work is needed; the asm code is implemented such that the compiler doesn't have to care what exactly it is calling (the calling convention matches).
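If the routine lives in a separate .asm file, the declaration just needs C linkage so the symbol name isn't mangled; a sketch (the header layout here is my assumption, not taken from the article):

#include <stddef.h>

// hypothetical header for the asm routine; the name matches the question above
#ifdef __cplusplus
extern "C" {
#endif

void* ia32_memcpy(void* dst, const void* src, size_t nbytes);

#ifdef __cplusplus
}
#endif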

Quote:Cool, it looks like you borrowed a bunch of tricks from an old AMD paper on increasing memcpy throughput :) Damn, where is that paper? :)

Referenced as [AMD1] and [AMD2] in the article/bibliography! :)

Quote:Just a quick thought, but does memset() suffer the same failings as memcpy()?
An adapted version of this could help optimise initialisation of large data structures, rather than using a for loop on smaller element sizes.

Ah, yes - good idea! Zeroing e.g. the aforementioned transposition tables (if you're not marking them invalid) would be time-critical.
Should be easy to do via copy+paste if you need it immediately, but I'll think about a nicer way to do that.

Quote:How does it compare with std::copy ?

Interesting question. I'll test that later tonight.
Quote:Original post by Jan Wassenberg
Interesting question. I'll test that later tonight.


nutz you just missed my post :(

Quote:nutz you just missed my post :(

hehe :)

Fruny: results are in! See below.

Quote:Just to highlight a point: std::copy on modern compilers such as VC++ 7.1 and later, or G++, will dispatch at compile time to memmove, not memcpy, because the input and output ranges are permitted to overlap.
So, Jan Wassenberg, see if you can do a new article on speeding up memmove, if you can.

Unfortunately this [std::copy] looks like a case of poor library design that cannot be saved by optimization. memcpy() gives the guarantee that its parameters are not aliased, while any implementation of memmove()/std::copy() will have to check whether there is overlap (or stick with the safe but slow method of copying backwards, which sucks performance-wise because it foils hardware prefetches).

Therefore, std::copy will always have more work to do than a good memcpy() implementation. Which brings us to the next point: I just looked at VC7.1's code in more detail, and am absolutely amazed - memcpy == memmove! They both always do the copy-up and copy-down checks.
So what does that mean? Both std::copy and memcpy are much slower - and we now know why.
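To make the extra work concrete, here is a byte-wise sketch of the direction check that every memmove-style copy has to perform (illustrative only, not the library code; real implementations copy in larger units):

#include <stddef.h>

void* memmove_sketch(void* dst, const void* src, size_t n)
{
	char* d = (char*)dst;
	const char* s = (const char*)src;
	if(d <= s || d >= s + n)
	{
		// no harmful overlap: copy forwards (prefetch-friendly)
		for(size_t i = 0; i < n; i++)
			d[i] = s[i];
	}
	else
	{
		// dst overlaps the tail of src: copy backwards to avoid clobbering
		for(size_t i = n; i != 0; i--)
			d[i-1] = s[i-1];
	}
	return dst;
}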

I've cleaned up the benchmark (code below) and it tells us: [for align=64, max_size=1024, max_misalign=63]
  AMD       : 15617949 (+7.6%)
  VC7.1     : 18674264 (+28.6%)
  Jan       : 14518968 (+0.0%)
  std::copy : 19266087 (+32.7%)
  std::copy2: 18604376 (+28.1%)

(these figures are total # CPU clocks elapsed over transfers of all sizes and misalignments)
"Jan" is the new code; it performs quite well :) AMD hangs back a bit due to the problems explained in the article. Clearly the others (all memcpy derivatives and basically the same) cannot compete - and this is even disregarding SSE transfers which kick in at sizes of 64KB and above.

A note on std::copy: performance is expected to match memcpy but differences arise from how these implementations are called from the benchmark. std::copy2 reflects performance when calling it (i.e. memmove) directly; std::copy wraps it in a function and calls via function pointer (same treatment as for the others).

One additional caveat is that this benchmark favors the other implementations. The reason is that we must run each transfer several times to rule out external interference, which helps the implementations containing more conditional jumps, because those branches are then sure to be predicted correctly. Unfortunately, I don't see any way around this (increasing thread priority isn't fool-proof).

The source:
// call through a function pointer for convenience and to foil optimizations.
typedef void*(*PFmemcpy)(void* restrict, const void* restrict, size_t);

// allocated memory is aligned to this; benchmark adds misalign.
static const size_t ALIGN = 64;
// transfer size. loop starts at 1 (0 makes no sense).
static const size_t MAX_SIZE = 64*1024;
// misalignment within buffer. loop starts at 0 (i.e. aligned).
static const size_t MAX_MISALIGN = 63;

// perform the benchmark for <impl>, copying from src_mem to dst_mem.
// the execution time of transfers of all sizes and misalignments up to
// MAX_* are added together and returned [in CPU clocks].
static i64 benchmark_impl(PFmemcpy impl, void* dst_mem, const void* src_mem)
{
	i64 total_dt = 0;
	for(size_t misalign = 0; misalign <= MAX_MISALIGN; misalign++)
	{
		for(size_t size = 1; size <= MAX_SIZE; size++)
		{
			void* dst = (char*)dst_mem + misalign;
			const void* src = (const char*)src_mem + misalign;
			// repeat several times and take the *best* value to avoid
			// external interference (e.g. task switch)
			i64 min_dt = LLONG_MAX;
			for(int rep = 0; rep < 10; rep++)
			{
				const i64 t0 = rdtsc();
				impl(dst, src, size);
				const i64 t1 = rdtsc();
				const i64 dt = t1-t0;
				min_dt = MIN(dt, min_dt);
			}
			debug_assert(min_dt > 0);	// shouldn't be negative or 0
			total_dt += min_dt;
		}
	}
	return total_dt;
}

static i64 benchmark_std_copy(void* dst_mem, const void* src_mem)
{
	i64 total_dt = 0;
	for(size_t misalign = 0; misalign <= MAX_MISALIGN; misalign++)
	{
		for(size_t size = 1; size <= MAX_SIZE; size++)
		{
			void* dst = (char*)dst_mem + misalign;
			const void* src = (const char*)src_mem + misalign;
			// repeat several times and take the *best* value to avoid
			// external interference (e.g. task switch)
			i64 min_dt = LLONG_MAX;
			for(int rep = 0; rep < 10; rep++)
			{
				const i64 t0 = rdtsc();
				std::copy<const u8*, u8*>((const u8*)src, (const u8*)src+size, (u8*)dst);
				const i64 t1 = rdtsc();
				const i64 dt = t1-t0;
				min_dt = MIN(dt, min_dt);
			}
			debug_assert(min_dt > 0);	// shouldn't be negative or 0
			total_dt += min_dt;
		}
	}
	return total_dt;
}

void memcpy_benchmark()
{
	const size_t total_size = MAX_SIZE + ALIGN + MAX_MISALIGN+1;
	void* dst_mem = malloc(total_size);
	const void* src_mem = (const void*)malloc(total_size);
	if(!dst_mem || !src_mem)
	{
		debug_warn("memcpy_benchmark failed; couldn't allocate buffers");
		return;
	}

	// align pointers [just so that misalignment actually goes from 0..63,
	// instead of e.g. 21..20 (mod 64)]
	void* aligned_dst = (void*)round_up((uintptr_t)dst_mem, ALIGN);
	const void* aligned_src = (const void*)round_up((uintptr_t)src_mem, ALIGN);

	struct Contender
	{
		const char* name;
		PFmemcpy impl;
		i64 elapsed_clocks;
	};
	static struct Contender contenders[] =
	{
		{"AMD       ", memcpy_amd},
		{"VC7.1     ", memcpy},
		{"Jan       ", memcpy2},
		{"std::copy ", std_copy_trampoline},
		{"std::copy2", 0}
	};

	for(int i = 0; i < ARRAY_SIZE(contenders)-1; i++)
		contenders[i].elapsed_clocks = benchmark_impl(contenders[i].impl, aligned_dst, aligned_src);
	contenders[ARRAY_SIZE(contenders)-1].elapsed_clocks = benchmark_std_copy(aligned_dst, aligned_src);

	// find minimum
	i64 min_clocks = LLONG_MAX;
	for(int i = 0; i < ARRAY_SIZE(contenders); i++)
		min_clocks = MIN(min_clocks, contenders[i].elapsed_clocks);

	// display results with delta to best value
	debug_printf("MEMCPY BENCHMARKS\n");
	for(int i = 0; i < ARRAY_SIZE(contenders); i++)
	{
		const float difference = (contenders[i].elapsed_clocks - min_clocks) / (float)min_clocks;
		debug_printf("  %s: %I64d (+%.1f%%)\n", contenders[i].name, contenders[i].elapsed_clocks, difference*100.f);
	}

	free(dst_mem);
	free((void*)src_mem);
}
I have this function that can be used to obtain a standard library style copy function that goes to memcpy rather than memmove:
//special copy (danger! Only for types for which memcpy is valid! Does not dispatch to memmove!)
template<typename InIt, typename OutIt>
inline OutIt MemCopy( InIt First, InIt Last, OutIt Dest )
{
	//this implementation is based on VC 7.1's std::copy implementation
	std::ptrdiff_t Diff = Last - First;
#ifdef _M_IX86
	void* memcpy_amd( void* dest, const void* src, std::size_t n );
	return ( (OutIt) memcpy_amd( &*Dest, &*First, Diff * sizeof(*First) ) + Diff );
#else
	return ( (OutIt) memcpy( &*Dest, &*First, Diff * sizeof(*First) ) + Diff );
#endif
}

This was written to use AMD's memcpy implementation, but a quick change will make it call Jan's instead. Needless to say, you should drop it in a namespace and change the name to conform with your own conventions. Keep in mind that this does not include the iterator trait based dispatching code that std::copy does, so it won't correctly dispatch for types with a non-trivial copy constructor. In other words, don't use it on anything that memcpy wouldn't be safe on.
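A quick usage example (sketch only; the element type must be one for which a raw byte copy is valid):

float src[256] = {};
float dst[256];
// behaves like std::copy, but always goes straight to the memcpy-style routine;
// returns a pointer one past the last element written, i.e. dst + 256
float* end = MemCopy(src, src + 256, dst);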
In your paper you say that which copy technique is best suited depends on the CPU type (e.g. support for MMX or SSE) and the size of the memory block.

I'm wondering whether it would be possible to select the correct algorithm at compile time in cases where the size of the memory block is constant, i.e. known at compile time.

So if there are up to 64 bytes to copy, the mov instructions can be inlined directly. If there are more bytes to copy, or the size is only known at runtime, you generate a call to the other assembler functions, which check the CPU type etc.

This would reduce the overhead for small blocks of data and generate the same code in the other cases.
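Something along these lines could work; a minimal sketch (the helper names and the 64-byte cut-off are my assumptions, and ia32_memcpy stands in for the article's routine):

#include <cstddef>
#include <cstring>

void* ia32_memcpy(void* dst, const void* src, std::size_t nbytes); // CPU-dispatching asm routine

// size known at compile time: let the compiler inline small copies itself
template<std::size_t N>
inline void* copy_fixed(void* dst, const void* src)
{
	if(N <= 64)
		return std::memcpy(dst, src, N);	// expands to a few movs for constant N
	return ia32_memcpy(dst, src, N);	// larger block: call the asm routine
}

// size only known at runtime: always call the asm routine
inline void* copy_runtime(void* dst, const void* src, std::size_t nbytes)
{
	return ia32_memcpy(dst, src, nbytes);
}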
Another thing to remember is that the best method to use depends on whether you intend to use the data afterwards. If you use non-caching writes it will be slower to access immediately after than if you had used a standard memcpy.

ALWAYS profile your app before and after inserting things like this because in many cases you may end up slowing down your application even if the memcpy routine appears to be faster.
Quote:Original post by Anonymous Poster
Another thing to remember is that the best method to use depends on whether you intend to use the data afterwards. If you use non-caching writes it will be slower to access immediately after than if you had used a standard memcpy.

Maybe you missed this line, which can be found in the paper:
- 64 KB (L1 data cache size; beyond this, techniques that avoid polluting the cache are a big win)

So for large amounts of data you should bypass the cache, or chances are high you will end up with the wrong data in your cache.
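For reference, a bare-bones sketch of such a cache-bypassing copy (my own illustration, using SSE2 intrinsics rather than the MMX MOVNTQ path from the paper; alignment and tail handling omitted):

#include <emmintrin.h>	// SSE2: _mm_stream_si128 (MOVNTDQ)
#include <stddef.h>

// assumes dst is 16-byte aligned and nbytes is a multiple of 64
void copy_nontemporal(void* dst, const void* src, size_t nbytes)
{
	__m128i* d = (__m128i*)dst;
	const __m128i* s = (const __m128i*)src;
	for(size_t i = 0; i < nbytes / 16; i += 4)
	{
		_mm_stream_si128(d+i+0, _mm_loadu_si128(s+i+0));	// stores bypass the cache
		_mm_stream_si128(d+i+1, _mm_loadu_si128(s+i+1));
		_mm_stream_si128(d+i+2, _mm_loadu_si128(s+i+2));
		_mm_stream_si128(d+i+3, _mm_loadu_si128(s+i+3));
	}
	_mm_sfence();	// make the streaming stores globally visible
}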

Quote:
ALWAYS profile your app before and after inserting things like this because in many cases you may end up slowing down your application even if the memcpy routine appears to be faster.


I think that's what Jan did.
Quote:Original post by nmi
Quote:
ALWAYS profile your app before and after inserting things like this because in many cases you may end up slowing down your application even if the memcpy routine appears to be faster.

I think that's what Jan did.
I think he meant that a synthetic benchmark may not be ideal, since the optimum method may depend on what you're going to do with the data afterwards. At least it's a good idea to be careful with explicit cache control, especially since many modern CPUs have more than 256 kB of L2 cache.
Also, how does the new function compare to the compiler's intrinsic memcpy for small (especially fixed-size) transfers?

BTW, don't the original (pre-XP) Athlons support MOVNTQ too, through the extended MMX instruction set? And what about 16-byte SSE transfers with MOVAPS/MOVDQA?

