michael879

memset/memcpy

michael879    100
Are these functions actually faster than doing it manually? I mean, is memset(array,0,sizeof(array)) really faster than setting every element of the array to 0 in a for loop? And is memcpy(&array1,&array2,sizeof(array2)) faster than looping through and copying the values over?

Palidine    1315
Yes, it really is faster. But if you're skeptical, just write a little test app where you initialize a huge array both ways and time the two solutions.
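Something along these lines would do as a starting point (an untested sketch, just to illustrate; a real test should run each version many times and keep the optimizer from throwing the work away):

#include <cstdio>
#include <cstring>
#include <ctime>

const int N = 1 << 22;   // a few million ints, big enough to notice
static int buf[N];

int main()
{
    std::clock_t t0 = std::clock();
    for (int i = 0; i < N; ++i)        // manual loop
        buf[i] = 0;
    std::clock_t t1 = std::clock();
    std::memset(buf, 0, sizeof(buf));  // library call
    std::clock_t t2 = std::clock();

    std::printf("loop:   %ld ticks\n", (long)(t1 - t0));
    std::printf("memset: %ld ticks\n", (long)(t2 - t1));
    std::printf("%d\n", buf[0]);       // give the data a use so it isn't optimized away entirely
    return 0;
}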

-me

ZQJ    496
memset/memcpy are likely to be the most efficient setting/copying implementations possible on your system, so the best a for loop will ever do is a draw. That wasn't always true, though: when Michael Abrash was working on Quake he found that the memcpy they had was a byte-by-byte copy, which he beat easily.

bakery2k1    712
On MSVC++ 2003, memset doesn't even perform a proper function call - a short assembly routine is inserted inline, which uses "rep stosd" to zero memory, a dword at a time.

Having said that, you _can_ beat it by using the FPU to fill 4 words at a time, or SSE to fill 8 words at a time.
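For example, here's a rough SSE2 intrinsics sketch of the fill case (sse_fill is just a made-up name for illustration; it assumes a 16-byte-aligned destination and a count that's a multiple of 4):

#include <emmintrin.h>   // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fill 'count' ints with 'value', one 16-byte (8-word) store per iteration.
void sse_fill(int* dst, int value, std::size_t count)
{
    assert(((std::uintptr_t)dst & 15) == 0 && count % 4 == 0);
    __m128i v = _mm_set1_epi32(value);
    for (std::size_t i = 0; i < count; i += 4)
        _mm_store_si128((__m128i*)(dst + i), v);
}

For buffers much larger than the cache, the non-temporal store variants discussed further down buy you more.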

Zahlman    1682
You should always prefer a standard library call to do these sorts of things, because that call provides more useful information to the compiler. The compiler can more easily be written to detect "oh, that's a function call that I have a builtin superfast replacement for" than "oh, that loop is *really* just doing X which I have a builtin superfast replacement for".

In C++, std::copy/std::fill should be used because (a) they provide the above sort of information (and further, template magic can be used to select different 'superfast replacements' depending on the types of data involved) and (b) they will *not* do "superfast replacements" - or even nearly-as-fast manual scribbling over memory - in cases where it is not safe to do so (e.g. with an array of non-POD types).
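A quick usage sketch (the comments describe what typical standard library implementations do; the exact code generated depends on your compiler):

#include <algorithm>
#include <string>
#include <vector>

int main()
{
    int raw[1024];
    std::fill(raw, raw + 1024, 0);    // POD element type: usually dispatched to a memset-style fill

    std::vector<std::string> src(100, "hello"), dst(100);
    std::copy(src.begin(), src.end(), dst.begin());   // non-POD: element-wise copy assignment, the only safe option
    return 0;
}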

Anon Mike    1098
Who cares if memcpy is faster than a custom loop. It's easier to type, understand, and maintain. If you think the perf of your app is bounded by the speed of memcpy then you have way bigger problems and it's time to start redesigning.

Jan Wassenberg    999
Quote:
Probably not these days. Once upon a time it was possible.

Actually it is precisely "these days" where huge gains are to be had due to more complex memory hierarchies. By taking advantage of the cache, you can just about redline your memory (reaching almost peak bandwidth). For comparison, VC7's memcpy manages about 1/3 (!) of that.

Quote:
On MSVC++ 2003, memset doesn't even perform a proper function call - a short assembly routine is inserted inline, which uses "rep stosd" to zero memory, a dword at a time.

Direct-addressed function call overhead is basically zilch. In fact, since CPUs are getting faster and DRAMs much less so, it is worthwhile to trade CPU effort for increased memory throughput. That means worry about the actual copy code, not whether it's being inlined.

Quote:
Having said that, you _can_ beat it by using the FPU to fill 4 words at a time, or SSE to fill 8 words at a time.

Yes! The keywords here are "non-temporal transfers" that do not pollute your cache, and prefetching data into cache in large blocks. There are SSE instructions for this.
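In C++ with intrinsics, the bare idea looks roughly like this (streaming_copy is just an illustrative name, not the routine further down; assumes SSE2, a 16-byte-aligned destination and a size that is a multiple of 16):

#include <emmintrin.h>   // SSE2: _mm_loadu_si128, _mm_stream_si128
#include <xmmintrin.h>   // SSE:  _mm_prefetch, _mm_sfence
#include <cstddef>

void streaming_copy(void* dst, const void* src, std::size_t bytes)
{
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (std::size_t i = 0; i < bytes; i += 16)
    {
        _mm_prefetch(s + i + 512, _MM_HINT_NTA);              // pull source data toward the cache ahead of use
        __m128i v = _mm_loadu_si128((const __m128i*)(s + i));
        _mm_stream_si128((__m128i*)(d + i), v);               // non-temporal store: bypasses the cache
    }
    _mm_sfence();   // streaming stores are weakly ordered; fence before anyone reads dst
}

The assembly below does the same thing, plus block prefetching for very large transfers and special cases for small copies.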

Quote:
Who cares if memcpy is faster than a custom loop. It's easier to type, understand, and maintain. If you think the perf of your app is bounded by the speed of memcpy then you have way bigger problems and it's time to start redesigning.

Hm, the luxury of working on non-time-critical applications :) On the other hand, there are things like copying/clearing huge chess transposition tables, or copying data around in a file cache (in cases where zero-copy IO isn't possible), whose speed can significantly impact performance.

Since this comes up rather often and most people don't realize how much can be gained from a simple drop-in memcpy() replacement, it looks like an article on the topic would be helpful.
For the record, the following is 3x as fast on large (>192KB) transfers, and gains 7% on average for small (<64B) copies.


;-------------------------------------------------------------------------------
; fast general memcpy
; Copyright (c) 2003-2005 Jan Wassenberg
; licensed under GPL
;-------------------------------------------------------------------------------

; optimized for Athlon XP: 7.3% faster (cumulative) than VC7.1's memcpy over
; all 1..64 byte transfer lengths and misalignments. approaches maximum
; mem bandwidth (2000 MiB/s) for transfers >= 192KiB!
; Pentium III performance: about 3% faster in above small buffer benchmark.
;
; disables specialized large transfer (> 64KiB) implementations if SSE
; isn't available; we do assume MMX support, though (quite safe).

; if memcpy size is greater than this,
; .. it's too big for L1. use non-temporal instructions.
UC_THRESHOLD equ 64*1024
; .. it also blows L2. pull chunks into L1 ("block prefetch").
BP_THRESHOLD equ 192*1024

; maximum that can be copied by IC_MOVSD.
; if you change this, be sure to expand the movs* table(s)!
IC_SIZE equ 67

; size of one block prefetch chunk.
; if you change this, make sure "push byte BP_SIZE/128" doesn't overflow!
BP_SIZE equ 8*1024


; > ecx = size (<= IC_SIZE)
; x eax, ecx
;
; determined to be fastest approach by testing. a movsd table followed by
; rep movsb is a bit smaller but 6.9% slower; everything else is much worse.
%macro IC_MOVSD 0
mov eax, ecx
shr ecx, 2 ; dword count
neg ecx
add ecx, %%movsd_table_end
jmp ecx
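; (movsd assembles to a single byte, so jumping 'dword count' bytes back from
; %%movsd_table_end executes exactly that many copies)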
align 8
movsd
movsd
movsd
movsd
movsd
movsd
movsd
movsd
movsd
movsd
movsd
movsd
movsd
movsd
movsd
movsd
%%movsd_table_end:

and eax, 3
neg eax
add eax, %%movsb_table_end
jmp eax
movsb
movsb
movsb
%%movsb_table_end:
%endm


; align destination address to multiple of 8.
; not done for small transfers because it doesn't help IC_MOVSD.
%macro IC_ALIGN 0
mov eax, 8
sub eax, edi
and eax, byte 7 ; eax = # misaligned bytes
sub ecx, eax ; reduce copy count
neg eax
add eax, %%align_table_end
jmp eax
align 4
movsb
movsb
movsb
movsb
movsb
movsb
movsb
movsb
%%align_table_end:
%endm


; > ecx = size
; x edx
%macro IC_MOVQ 0
align 16
mov edx, 64
%%loop:
cmp ecx, edx
jb %%done
prefetchnta [esi + (200*64/34+192)]
movq mm0, [esi+0]
movq mm1, [esi+8]
movq [edi+0], mm0
movq [edi+8], mm1
movq mm2, [esi+16]
movq mm3, [esi+24]
movq [edi+16], mm2
movq [edi+24], mm3
movq mm0, [esi+32]
movq mm1, [esi+40]
movq [edi+32], mm0
movq [edi+40], mm1
movq mm2, [esi+48]
movq mm3, [esi+56]
movq [edi+48], mm2
movq [edi+56], mm3
add esi, edx
add edi, edx
sub ecx, edx
jmp %%loop
%%done:
%endm


; > ecx = size (> 64)
; x
%macro UC_MOVNTQ 0
mov edx, 64
align 16
%%1:
prefetchnta [esi + (200*64/34+192)]
movq mm0,[esi+0]
add edi, edx
movq mm1,[esi+8]
add esi, edx
movq mm2,[esi-48]
movntq [edi-64], mm0
movq mm0,[esi-40]
movntq [edi-56], mm1
movq mm1,[esi-32]
movntq [edi-48], mm2
movq mm2,[esi-24]
movntq [edi-40], mm0
movq mm0,[esi-16]
movntq [edi-32], mm1
movq mm1,[esi-8]
movntq [edi-24], mm2
movntq [edi-16], mm0
sub ecx, edx
movntq [edi-8], mm1
cmp ecx, edx
jae %%1
%endm


; > ecx = size (> 8KiB)
; x eax, edx
;
; somewhat optimized for size (futile attempt to avoid near jump)
%macro UC_BP_MOVNTQ 0
%%prefetch_and_copy_chunk:

; touch each cache line within chunk in reverse order (prevents HW prefetch)
push byte BP_SIZE/128 ; # iterations
pop eax
add esi, BP_SIZE
align 8
%%prefetch_chunk:
mov edx, [esi-64]
mov edx, [esi-128]
sub esi, 128
dec eax
jnz %%prefetch_chunk

; copy 64 byte blocks
mov eax, BP_SIZE/64 ; # iterations (> signed 8 bit)
push byte 64
pop edx
align 8
%%copy_block:
movq mm0, [esi+ 0]
movq mm1, [esi+ 8]
movq mm2, [esi+16]
movq mm3, [esi+24]
movq mm4, [esi+32]
movq mm5, [esi+40]
movq mm6, [esi+48]
movq mm7, [esi+56]
add esi, edx
movntq [edi+ 0], mm0
movntq [edi+ 8], mm1
movntq [edi+16], mm2
movntq [edi+24], mm3
movntq [edi+32], mm4
movntq [edi+40], mm5
movntq [edi+48], mm6
movntq [edi+56], mm7
add edi, edx
dec eax
jnz %%copy_block

sub ecx, BP_SIZE
cmp ecx, BP_SIZE
jae %%prefetch_and_copy_chunk
%endm


[section .bss]

; this is somewhat "clever". the 2 specialized transfer implementations
; that use SSE are jumped to if transfer size is greater than a threshold.
; we simply set the requested transfer size to 0 if the CPU doesn't
; support SSE so that those are never reached (done by masking with this).
sse_mask resd 1

__SECT__

; void* __declspec(naked) ia32_memcpy(void* dst, const void* src, size_t nbytes)
; Return dst to make ia32_memcpy usable as a standard library memcpy drop-in
global sym(ia32_memcpy)
sym(ia32_memcpy):
push edi
push esi

mov edi, [esp+8+4+0] ; dst
mov esi, [esp+8+4+4] ; src
mov ecx, [esp+8+4+8] ; nbytes

cmp ecx, byte IC_SIZE
ja .choose_larger_method

.ic_movsd:
IC_MOVSD
mov eax, [esp+8+4+0] ; return dst
pop esi
pop edi
ret

.choose_larger_method:
IC_ALIGN

mov eax, [sse_mask]
mov edx, ecx
and edx, eax ; edx = (SSE)? remaining_bytes : 0
cmp edx, BP_THRESHOLD
jae near .uc_bp_movntq
cmp edx, UC_THRESHOLD
jae .uc_movntq

.ic_movq:
IC_MOVQ
emms
jmp .ic_movsd

.uc_movntq:
UC_MOVNTQ
sfence
emms
jmp .ic_movsd

.uc_bp_movntq:
UC_BP_MOVNTQ
sfence
jmp .ic_movq

Promit    13246
Jan, how does your routine compare to the AMD memcpy implementation? It mentions a lot of similar tricks in the source code, such as non-temporal memory copies, in-cache copies, block prefetch, etc. It's rather dated (VC6 days) but I'm curious to see how it does compared to yours anyway.

Jan Wassenberg    999
Yes, for large transfers, performance is similar (same technique is used). Mine is better on smaller transfers (<64KB) due to the following:
1) straight-line (my conditional jump is correctly statically predicted)
2) MOVSB table is 7% faster than REP MOVSB
3) AMD's code incurs an EMMS and SFENCE even if not needed
4) fewer registers used => less spilling

Unfortunately I no longer have the benchmark values for my code vs amd_memcpy, but I remember it not being a real contest (mostly due to #3).


Grr, having trouble with this and other posts - am often getting
Microsoft VBScript runtime error '800a0009' 
Subscript out of range: '[number: 1]'
/community/forums/post_info.asp, line 229

It then works after about 10 retries; this has been happening the past week or so. Browser=Opera 8.5.

