# memset/memcpy


## Recommended Posts

Are these functions actually faster than doing it manually? I mean, is memset(array, 0, sizeof(array)) really faster than setting every element of the array to 0 in a for loop? And is memcpy(&array1, &array2, sizeof(array2)) faster than looping through and copying the values over?

---
Yes, it really is faster. But if you're skeptical, just write a little test app where you initialize a huge array both ways and time the two solutions.

-me

---
If you're using C++, look into std::copy and std::fill. If you're using C, it's probably best to stick with memset/memcpy.

---
Moved to General Programming.

---
memset/memcpy are likely to be the most efficient setting/copying implementations available on your system, so the best a for loop will ever do is a draw. That wasn't always true, though: when Michael Abrash was working on Quake, he found that the memcpy they had was a byte-by-byte copy, which he beat easily.

---
Sorry, I didn't see a C++ section, so I just put it in DirectX.
Thanks

---
Wait, when you say he beat it... does that mean it's possible to write a copy function that's faster than memcpy?

---
Probably not these days. Once upon a time it was possible.

---
On MSVC++ 2003, memset doesn't even perform a proper function call - a short assembly routine is inserted inline, which uses "rep stosd" to zero memory, a dword at a time.

Having said that, you _can_ beat it by using the FPU to fill 4 words at a time, or SSE to fill 8 words at a time.

---
You should always prefer a standard library call to do these sorts of things, because that call provides more useful information to the compiler. The compiler can more easily be written to detect "oh, that's a function call that I have a builtin superfast replacement for" than "oh, that loop is *really* just doing X which I have a builtin superfast replacement for".

In C++, std::copy/std::fill should be used because (a) it provides the above sort of information (and further, template magic can be used to make it use different 'superfast replacements' depending on the types of data involved) and (b) it will *not* do "superfast replacements" - or even nearly-as-fast manual scribbling over memory - in cases where it is not safe to do so (e.g. with an array of non-POD types).

---
Who cares if memcpy is faster than a custom loop? It's easier to type, understand, and maintain. If you think the performance of your app is bounded by the speed of memcpy, then you have much bigger problems and it's time to start redesigning.

---
> Probably not these days. Once upon a time it was possible.

Actually it is precisely "these days" where huge gains are to be had due to more complex memory hierarchies. By taking advantage of the cache, you can just about redline your memory (reaching almost peak bandwidth). For comparison, VC7's memcpy manages about 1/3 (!) of that.

> On MSVC++ 2003, memset doesn't even perform a proper function call - a short assembly routine is inserted inline, which uses "rep stosd" to zero memory, a dword at a time.

Direct-addressed function call overhead is basically zilch. In fact, since CPUs are getting faster and DRAMs much less so, it is worthwhile to trade CPU effort for increased memory throughput. That means worry about the actual copy code, not whether it's being inlined.

> Having said that, you _can_ beat it by using the FPU to fill 4 words at a time, or SSE to fill 8 words at a time.

Yes! The keywords here are "non-temporal transfers" that do not pollute your cache, and prefetching data into cache in large blocks. There are SSE instructions for this.

> Who cares if memcpy is faster than a custom loop? It's easier to type, understand, and maintain. If you think the perf of your app is bounded by the speed of memcpy then you have way bigger problems and it's time to start redesigning.

Hm, the luxury of working on non-time-critical applications :) On the other hand, there are cases like copying/clearing huge chess transposition tables, or copying data around in a file cache (where zero-copy IO isn't possible), that can significantly impact performance.

Since this comes up rather often and most people don't know how much can be gained from a simple drop-in memcpy() replacement, an article on the topic would probably be helpful.
For the record, the following is 3x as fast on large (>192KB) transfers, and gains 7% on average for small (<64B) copies.

```nasm
;-------------------------------------------------------------------------------
; fast general memcpy
; Copyright (c) 2003-2005 Jan Wassenberg
; licensed under GPL
;-------------------------------------------------------------------------------

; optimized for Athlon XP: 7.3% faster (cumulative) than VC7.1's memcpy over
; all 1..64 byte transfer lengths and misalignments. approaches maximum
; mem bandwidth (2000 MiB/s) for transfers >= 192KiB!
; Pentium III performance: about 3% faster in above small buffer benchmark.
;
; disables specialized large transfer (> 64KiB) implementations if SSE
; isn't available; we do assume MMX support, though (quite safe).

; if memcpy size is greater than this,
; .. it's too big for L1. use non-temporal instructions.
UC_THRESHOLD    equ     64*1024
; .. it also blows L2. pull chunks into L1 ("block prefetch").
BP_THRESHOLD    equ     192*1024

; maximum that can be copied by IC_MOVSD.
; if you change this, be sure to expand the movs* table(s)!
IC_SIZE         equ     67

; size of one block prefetch chunk.
; if you change this, make sure "push byte BP_SIZE/128" doesn't overflow!
BP_SIZE         equ     8*1024

; > ecx = size (<= IC_SIZE)
; x eax, ecx
;
; determined to be fastest approach by testing. a movsd table followed by
; rep movsb is a bit smaller but 6.9% slower; everything else is much worse.
%macro IC_MOVSD 0
    mov     eax, ecx
    shr     ecx, 2                      ; dword count
    neg     ecx
    add     ecx, %%movsd_table_end
    jmp     ecx
align 8
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
%%movsd_table_end:
    and     eax, 3
    neg     eax
    add     eax, %%movsb_table_end
    jmp     eax
    movsb
    movsb
    movsb
%%movsb_table_end:
%endm

; align destination address to multiple of 8.
; not done for small transfers because it doesn't help IC_MOVSD.
%macro IC_ALIGN 0
    mov     eax, 8
    sub     eax, edi
    and     eax, byte 7                 ; eax = # misaligned bytes
    sub     ecx, eax                    ; reduce copy count
    neg     eax
    add     eax, %%align_table_end
    jmp     eax
align 4
    movsb
    movsb
    movsb
    movsb
    movsb
    movsb
    movsb
    movsb
%%align_table_end:
%endm

; > ecx = size
; x edx
%macro IC_MOVQ 0
align 16
    mov     edx, 64
%%loop:
    cmp     ecx, edx
    jb      %%done
    prefetchnta [esi + (200*64/34+192)]
    movq    mm0, [esi+0]
    movq    mm1, [esi+8]
    movq    [edi+0], mm0
    movq    [edi+8], mm1
    movq    mm2, [esi+16]
    movq    mm3, [esi+24]
    movq    [edi+16], mm2
    movq    [edi+24], mm3
    movq    mm0, [esi+32]
    movq    mm1, [esi+40]
    movq    [edi+32], mm0
    movq    [edi+40], mm1
    movq    mm2, [esi+48]
    movq    mm3, [esi+56]
    movq    [edi+48], mm2
    movq    [edi+56], mm3
    add     esi, edx
    add     edi, edx
    sub     ecx, edx
    jmp     %%loop
%%done:
%endm

; > ecx = size (> 64)
; x
%macro UC_MOVNTQ 0
    mov     edx, 64
align 16
%%1:
    prefetchnta [esi + (200*64/34+192)]
    movq    mm0, [esi+0]
    add     edi, edx
    movq    mm1, [esi+8]
    add     esi, edx
    movq    mm2, [esi-48]
    movntq  [edi-64], mm0
    movq    mm0, [esi-40]
    movntq  [edi-56], mm1
    movq    mm1, [esi-32]
    movntq  [edi-48], mm2
    movq    mm2, [esi-24]
    movntq  [edi-40], mm0
    movq    mm0, [esi-16]
    movntq  [edi-32], mm1
    movq    mm1, [esi-8]
    movntq  [edi-24], mm2
    movntq  [edi-16], mm0
    sub     ecx, edx
    movntq  [edi-8], mm1
    cmp     ecx, edx
    jae     %%1
%endm

; > ecx = size (> 8KiB)
; x eax, edx
;
; somewhat optimized for size (futile attempt to avoid near jump)
%macro UC_BP_MOVNTQ 0
%%prefetch_and_copy_chunk:
    ; touch each cache line within chunk in reverse order (prevents HW prefetch)
    push    byte BP_SIZE/128            ; # iterations
    pop     eax
    add     esi, BP_SIZE
align 8
%%prefetch_chunk:
    mov     edx, [esi-64]
    mov     edx, [esi-128]
    sub     esi, 128
    dec     eax
    jnz     %%prefetch_chunk

    ; copy 64 byte blocks
    mov     eax, BP_SIZE/64             ; # iterations (> signed 8 bit)
    push    byte 64
    pop     edx
align 8
%%copy_block:
    movq    mm0, [esi+ 0]
    movq    mm1, [esi+ 8]
    movq    mm2, [esi+16]
    movq    mm3, [esi+24]
    movq    mm4, [esi+32]
    movq    mm5, [esi+40]
    movq    mm6, [esi+48]
    movq    mm7, [esi+56]
    add     esi, edx
    movntq  [edi+ 0], mm0
    movntq  [edi+ 8], mm1
    movntq  [edi+16], mm2
    movntq  [edi+24], mm3
    movntq  [edi+32], mm4
    movntq  [edi+40], mm5
    movntq  [edi+48], mm6
    movntq  [edi+56], mm7
    add     edi, edx
    dec     eax
    jnz     %%copy_block
    sub     ecx, BP_SIZE
    cmp     ecx, BP_SIZE
    jae     %%prefetch_and_copy_chunk
%endm

[section .bss]

; this is somewhat "clever". the 2 specialized transfer implementations
; that use SSE are jumped to if transfer size is greater than a threshold.
; we simply set the requested transfer size to 0 if the CPU doesn't
; support SSE so that those are never reached (done by masking with this).
sse_mask        resd    1

__SECT__

; void* __declspec(naked) ia32_memcpy(void* dst, const void* src, size_t nbytes)
; Return dst to make ia32_memcpy usable as a standard library memcpy drop-in
global sym(ia32_memcpy)
sym(ia32_memcpy):
    push    edi
    push    esi
    mov     edi, [esp+8+4+0]            ; dst
    mov     esi, [esp+8+4+4]            ; src
    mov     ecx, [esp+8+4+8]            ; nbytes
    cmp     ecx, byte IC_SIZE
    ja      .choose_larger_method

.ic_movsd:
    IC_MOVSD
    mov     eax, [esp+8+4+0]            ; return dst
    pop     esi
    pop     edi
    ret

.choose_larger_method:
    IC_ALIGN
    mov     eax, [sse_mask]
    mov     edx, ecx
    and     edx, eax                    ; edx = (SSE)? remaining_bytes : 0
    cmp     edx, BP_THRESHOLD
    jae     near .uc_bp_movntq
    cmp     edx, UC_THRESHOLD
    jae     .uc_movntq

.ic_movq:
    IC_MOVQ
    emms
    jmp     .ic_movsd

.uc_movntq:
    UC_MOVNTQ
    sfence
    emms
    jmp     .ic_movsd

.uc_bp_movntq:
    UC_BP_MOVNTQ
    sfence
    jmp     .ic_movq
```

---
Jan, how does your routine compare to the AMD memcpy implementation? It mentions a lot of similar tricks in the source code, such as non-temporal memory copies, in-cache copies, block prefetch, etc. It's rather dated (VC6 days) but I'm curious to see how it does compared to yours anyway.

---
Yes - for large transfers, performance is similar (the same technique is used). Mine is better on smaller transfers (<64KB) for the following reasons:
1) straight-line code (my conditional jump is correctly statically predicted)
2) a MOVSB table is 7% faster than REP MOVSB
3) AMD's code incurs an EMMS and SFENCE even when not needed
4) fewer registers used => less spilling

Unfortunately I no longer have the benchmark values for my code vs. amd_memcpy, but I remember it not being a real contest (mostly due to #3).

Grr, I'm having trouble with this and other posts - I often get:
`Microsoft VBScript runtime error '800a0009' Subscript out of range: '[number: 1]' /community/forums/post_info.asp, line 229`

It then works after about 10 retries; this has been happening for the past week or so. Browser = Opera 8.5.