memset/memcpy
Who cares whether memcpy is faster than a custom loop? It's easier to type, understand, and maintain. If you think the perf of your app is bounded by the speed of memcpy, then you have way bigger problems and it's time to start redesigning.
Quote:Probably not these days. Once upon a time it was possible.
Actually, it is precisely "these days" where huge gains are to be had, due to more complex memory hierarchies. By taking advantage of the cache, you can just about redline your memory (reaching almost peak bandwidth). For comparison, VC7's memcpy manages about 1/3 (!) of that.
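If you want to check those numbers on your own machine, a trivial harness like the following suffices (my sketch, not from the original posts); the buffer must be much larger than L2 so you measure DRAM bandwidth rather than cache:

/* minimal memcpy bandwidth harness (hypothetical sketch).
   Buffer is far beyond L2 so DRAM, not cache, is measured. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t size = 16 * 1024 * 1024;  /* 16 MiB: well beyond any L2 */
    const int reps = 50;
    char* src = malloc(size);
    char* dst = malloc(size);
    if (!src || !dst) return 1;
    memset(src, 1, size);  /* touch pages up front so faults aren't timed */
    memset(dst, 1, size);

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        memcpy(dst, src, size);
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("%.0f MiB/s\n", (double)size * reps / (1024.0 * 1024.0) / sec);
    free(src);
    free(dst);
    return 0;
}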
Quote:On MSVC++ 2003, memset doesn't even perform a proper function call - a short assembly routine is inserted inline, which uses "rep stosd" to zero memory, a dword at a time.
Direct-addressed function call overhead is basically zilch. In fact, since CPUs are getting faster and DRAMs much less so, it is worthwhile to trade CPU effort for increased memory throughput. That means worry about the actual copy code, not whether it's being inlined.
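For concreteness, the expansion the quote describes boils down to something like this sketch (32-bit MSVC inline asm; my illustration of the idea, not the compiler's actual output):

/* MSVC x86 only; the compiler saves/restores registers around __asm blocks */
void zero_dwords(void* dst, size_t dword_count)
{
    __asm {
        mov  edi, dst          // destination pointer
        mov  ecx, dword_count  // number of dwords to store
        xor  eax, eax          // value to store: zero
        rep  stosd             // store eax to [edi], ecx times, edi += 4 each
    }
}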
Quote:Having said that, you _can_ beat it by using the FPU to fill 4 words at a time, or SSE to fill 8 words at a time.
Yes! The keywords here are "non-temporal transfers" that do not pollute your cache, and prefetching data into cache in large blocks. There are SSE instructions for this.
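For the curious, here is roughly what that looks like with SSE2 intrinsics. A minimal sketch of a non-temporal fill; it assumes dst is 16-byte aligned and n is a multiple of 64, and a real routine would also need head/tail handling plus a small-size fallback, since streaming stores only pay off beyond cache-sized buffers:

#include <emmintrin.h>  /* SSE2: _mm_stream_si128; pulls in _mm_sfence */
#include <stddef.h>

void fill_nt(void* dst, int value, size_t n)
{
    __m128i v = _mm_set1_epi8((char)value);
    char* p = (char*)dst;
    /* movntdq: write straight to memory, bypassing (not polluting) the cache */
    for (size_t i = 0; i < n; i += 64) {
        _mm_stream_si128((__m128i*)(p + i +  0), v);
        _mm_stream_si128((__m128i*)(p + i + 16), v);
        _mm_stream_si128((__m128i*)(p + i + 32), v);
        _mm_stream_si128((__m128i*)(p + i + 48), v);
    }
    _mm_sfence();  /* order the streaming stores before subsequent normal stores */
}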
Quote:Who cares whether memcpy is faster than a custom loop? It's easier to type, understand, and maintain. If you think the perf of your app is bounded by the speed of memcpy, then you have way bigger problems and it's time to start redesigning.
hm, the luxury of working on non-time-critical applications :) On the other hand, there are things like copying/clearing huge chess transposition tables, or copying data around in a file cache (in cases where zero-copy IO isn't possible), where memcpy speed can significantly impact performance.
Since this comes up rather often and most people don't know how much can be gained from a simple drop-in memcpy() replacement, it looks like an article on the topic would be helpful.
For the record, the following is 3x as fast on large (>192KB) transfers, and gains 7% on average for small (<64B) copies.
;-------------------------------------------------------------------------------
; fast general memcpy
; Copyright (c) 2003-2005 Jan Wassenberg
; licensed under GPL
;-------------------------------------------------------------------------------

; optimized for Athlon XP: 7.3% faster (cumulative) than VC7.1's memcpy over
; all 1..64 byte transfer lengths and misalignments. approaches maximum
; mem bandwidth (2000 MiB/s) for transfers >= 192KiB!
; Pentium III performance: about 3% faster in above small buffer benchmark.
;
; disables specialized large transfer (> 64KiB) implementations if SSE
; isn't available; we do assume MMX support, though (quite safe).

; if memcpy size is greater than this,
; .. it's too big for L1. use non-temporal instructions.
UC_THRESHOLD	equ	64*1024
; .. it also blows L2. pull chunks into L1 ("block prefetch").
BP_THRESHOLD	equ	192*1024

; maximum that can be copied by IC_MOVSD.
; if you change this, be sure to expand the movs* table(s)!
IC_SIZE		equ	67

; size of one block prefetch chunk.
; if you change this, make sure "push byte BP_SIZE/128" doesn't overflow!
BP_SIZE		equ	8*1024

; > ecx = size (<= IC_SIZE)
; x eax, ecx
;
; determined to be fastest approach by testing. a movsd table followed by
; rep movsb is a bit smaller but 6.9% slower; everything else is much worse.
%macro IC_MOVSD 0
	mov	eax, ecx
	shr	ecx, 2			; dword count
	neg	ecx
	add	ecx, %%movsd_table_end
	jmp	ecx
align 8
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
	movsd
%%movsd_table_end:
	and	eax, 3
	neg	eax
	add	eax, %%movsb_table_end
	jmp	eax
	movsb
	movsb
	movsb
%%movsb_table_end:
%endm

; align destination address to multiple of 8.
; not done for small transfers because it doesn't help IC_MOVSD.
%macro IC_ALIGN 0
	mov	eax, 8
	sub	eax, edi
	and	eax, byte 7		; eax = # misaligned bytes
	sub	ecx, eax		; reduce copy count
	neg	eax
	add	eax, %%align_table_end
	jmp	eax
align 4
	movsb
	movsb
	movsb
	movsb
	movsb
	movsb
	movsb
	movsb
%%align_table_end:
%endm

; > ecx = size
; x edx
%macro IC_MOVQ 0
align 16
	mov	edx, 64
%%loop:
	cmp	ecx, edx
	jb	%%done
	prefetchnta [esi + (200*64/34+192)]
	movq	mm0, [esi+0]
	movq	mm1, [esi+8]
	movq	[edi+0], mm0
	movq	[edi+8], mm1
	movq	mm2, [esi+16]
	movq	mm3, [esi+24]
	movq	[edi+16], mm2
	movq	[edi+24], mm3
	movq	mm0, [esi+32]
	movq	mm1, [esi+40]
	movq	[edi+32], mm0
	movq	[edi+40], mm1
	movq	mm2, [esi+48]
	movq	mm3, [esi+56]
	movq	[edi+48], mm2
	movq	[edi+56], mm3
	add	esi, edx
	add	edi, edx
	sub	ecx, edx
	jmp	%%loop
%%done:
%endm

; > ecx = size (> 64)
; x
%macro UC_MOVNTQ 0
	mov	edx, 64
align 16
%%1:
	prefetchnta [esi + (200*64/34+192)]
	movq	mm0, [esi+0]
	add	edi, edx
	movq	mm1, [esi+8]
	add	esi, edx
	movq	mm2, [esi-48]
	movntq	[edi-64], mm0
	movq	mm0, [esi-40]
	movntq	[edi-56], mm1
	movq	mm1, [esi-32]
	movntq	[edi-48], mm2
	movq	mm2, [esi-24]
	movntq	[edi-40], mm0
	movq	mm0, [esi-16]
	movntq	[edi-32], mm1
	movq	mm1, [esi-8]
	movntq	[edi-24], mm2
	movntq	[edi-16], mm0
	sub	ecx, edx
	movntq	[edi-8], mm1
	cmp	ecx, edx
	jae	%%1
%endm

; > ecx = size (> 8KiB)
; x eax, edx
;
; somewhat optimized for size (futile attempt to avoid near jump)
%macro UC_BP_MOVNTQ 0
%%prefetch_and_copy_chunk:
	; touch each cache line within chunk in reverse order (prevents HW prefetch)
	push	byte BP_SIZE/128	; # iterations
	pop	eax
	add	esi, BP_SIZE
align 8
%%prefetch_chunk:
	mov	edx, [esi-64]
	mov	edx, [esi-128]
	sub	esi, 128
	dec	eax
	jnz	%%prefetch_chunk

	; copy 64 byte blocks
	mov	eax, BP_SIZE/64		; # iterations (> signed 8 bit)
	push	byte 64
	pop	edx
align 8
%%copy_block:
	movq	mm0, [esi+ 0]
	movq	mm1, [esi+ 8]
	movq	mm2, [esi+16]
	movq	mm3, [esi+24]
	movq	mm4, [esi+32]
	movq	mm5, [esi+40]
	movq	mm6, [esi+48]
	movq	mm7, [esi+56]
	add	esi, edx
	movntq	[edi+ 0], mm0
	movntq	[edi+ 8], mm1
	movntq	[edi+16], mm2
	movntq	[edi+24], mm3
	movntq	[edi+32], mm4
	movntq	[edi+40], mm5
	movntq	[edi+48], mm6
	movntq	[edi+56], mm7
	add	edi, edx
	dec	eax
	jnz	%%copy_block

	sub	ecx, BP_SIZE
	cmp	ecx, BP_SIZE
	jae	%%prefetch_and_copy_chunk
%endm

[section .bss]

; this is somewhat "clever". the 2 specialized transfer implementations
; that use SSE are jumped to if transfer size is greater than a threshold.
; we simply set the requested transfer size to 0 if the CPU doesn't
; support SSE so that those are never reached (done by masking with this).
sse_mask	resd	1

__SECT__

; void* __declspec(naked) ia32_memcpy(void* dst, const void* src, size_t nbytes)
; Return dst to make ia32_memcpy usable as a standard library memcpy drop-in
global sym(ia32_memcpy)
sym(ia32_memcpy):
	push	edi
	push	esi
	mov	edi, [esp+8+4+0]	; dst
	mov	esi, [esp+8+4+4]	; src
	mov	ecx, [esp+8+4+8]	; nbytes

	cmp	ecx, byte IC_SIZE
	ja	.choose_larger_method

.ic_movsd:
	IC_MOVSD
	mov	eax, [esp+8+4+0]	; return dst
	pop	esi
	pop	edi
	ret

.choose_larger_method:
	IC_ALIGN

	mov	eax, [sse_mask]
	mov	edx, ecx
	and	edx, eax		; edx = (SSE)? remaining_bytes : 0
	cmp	edx, BP_THRESHOLD
	jae	near .uc_bp_movntq
	cmp	edx, UC_THRESHOLD
	jae	.uc_movntq

.ic_movq:
	IC_MOVQ
	emms
	jmp	.ic_movsd

.uc_movntq:
	UC_MOVNTQ
	sfence
	emms
	jmp	.ic_movsd

.uc_bp_movntq:
	UC_BP_MOVNTQ
	sfence
	jmp	.ic_movq
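To be explicit about the C side (my sketch; the names mirror the asm above): the routine takes its arguments on the stack and returns dst, so it drops in for memcpy, but sse_mask must be initialized once at startup - all-ones if CPUID reports SSE, zero otherwise - so the movntq paths are never entered on non-SSE CPUs.

#include <stddef.h>

extern void* ia32_memcpy(void* dst, const void* src, size_t nbytes);
extern unsigned sse_mask;  /* lives in the asm module's .bss */

/* hypothetical init helper; caller supplies the CPUID SSE feature bit */
void ia32_memcpy_init(int have_sse)
{
    sse_mask = have_sse ? ~0u : 0u;
}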
Jan, how does your routine compare to the AMD memcpy implementation? It mentions a lot of similar tricks in the source code, such as non-temporal memory copies, in-cache copies, block prefetch, etc. It's rather dated (VC6 days), but I'm curious to see how it does compared to yours anyway.
Yes, for large transfers, performance is similar (same technique is used). Mine is better on smaller transfers (<64KB) due to the following:
1) straight-line (my conditional jump is correctly statically predicted)
2) MOVSB table is 7% faster than REP MOVSB (see the C sketch below)
3) AMD's code incurs an EMMS and SFENCE even if not needed
4) fewer registers used => less spilling
Unfortunately I no longer have the benchmark values for my code vs. amd_memcpy, but I remember it not being a real contest (mostly due to #3).
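Regarding point 2, the jump-into-a-table-of-movsb trick has a well-known C analogue, the fall-through switch (a hypothetical illustration of the idea, not the asm above): branch once to the right entry point and copy the exact remainder, with no loop and no rep startup cost.

/* copy the 0..3 leftover bytes by falling through a switch -
   the C analogue of jumping into the movsb table */
static void copy_tail(char* dst, const char* src, size_t n)  /* n in 0..3 */
{
    switch (n) {
    case 3: dst[2] = src[2];  /* fall through */
    case 2: dst[1] = src[1];  /* fall through */
    case 1: dst[0] = src[0];  /* fall through */
    case 0: break;
    }
}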
Grr, having trouble with this and other posts - am often getting
Microsoft VBScript runtime error '800a0009' Subscript out of range: '[number: 1]' /community/forums/post_info.asp, line 229
It then works after about 10 retries; this has been happening the past week or so. Browser=Opera 8.5.