# New Mini-Article: Speeding up memcpy


## Recommended Posts

I've tweaked the benchmark a bit. All implementations are now called via function pointers, so std::copy is no longer advantaged. Also, I have adjusted the misalignment values to more accurately reflect reality (averaging over 0..63 unduly favors my implementation).
With that and one additional improvement (replacing MOVSB table with a plain loop), the results are as follows:
```
SMALL TRANSFERS
size         AMD                  VC7.1                Jan                  std::copy            intrinsic
64           90014 (+22.0%)       73765 (+0.0%)        74793 (+1.4%)        79794 (+8.2%)        85894 (+16.4%)
128          194942 (+18.6%)      166969 (+1.6%)       164314 (+0.0%)       177892 (+8.3%)       188542 (+14.7%)
256          426550 (+17.9%)      394309 (+9.0%)       361846 (+0.0%)       416561 (+15.1%)      443454 (+22.6%)
512          967694 (+16.1%)      1013445 (+21.5%)     833830 (+0.0%)       1053348 (+26.3%)     1151986 (+38.2%)
2048         6537103 (+9.2%)      9311588 (+55.6%)     5985845 (+0.0%)      9471943 (+58.2%)     10965431 (+83.2%)
8192         66095286 (+4.1%)     121148415 (+90.9%)   63470328 (+0.0%)     121783488 (+91.9%)   145574481 (+129.4%)

LARGE TRANSFERS
65536        62353 (+31.4%)       67971 (+43.2%)       47464 (+0.0%)        68006 (+43.3%)       101206 (+113.2%)
131072       124065 (+0.1%)       135725 (+9.5%)       123908 (+0.0%)       135772 (+9.6%)       202230 (+63.2%)
196608       203983 (+0.0%)       853990 (+318.7%)     205426 (+0.7%)       809232 (+296.7%)     862066 (+322.6%)
262144       493363 (+0.0%)       1097035 (+122.4%)    504223 (+2.2%)       1110072 (+125.0%)    1139694 (+131.0%)
393216       743397 (+0.0%)       1673832 (+125.2%)    760964 (+2.4%)       1661200 (+123.5%)    1720660 (+131.5%)
524288       1049521 (+0.0%)      2255720 (+114.9%)    1060874 (+1.1%)      2198060 (+109.4%)    2318947 (+121.0%)
1048576      2112846 (+0.0%)      4589070 (+117.2%)    2127714 (+0.7%)      4505785 (+113.3%)    4794404 (+126.9%)
```

Analysis: I see potential for improvement on < 64 bytes: could do alignment, and combine that with the IC_MOVSD block. The new code wins handily everywhere except there and in the block prefetch range (> 192KB), where it hangs back by 1..2% vs. the AMD implementation. This is probably due to some stall or screwup in my code; I'll do pipeline analysis later.

Unfortunately I'm running out of time, heading out of town for the week (-> updates will be less frequent).
Quick replies though:
Zahlman (template specialization): hm, that looks like it'd work. However, it only avoids one well-predicted jump, so it's probably not worth the effort.

snk_kid:
Quote:
 I didn't quite get what you mean here; there shouldn't be a need to use pointers to functions internally in generic algorithms, unless you were referring to your version of std::copy or something.

Yes, this was just an artifact of how the benchmark is calling std::copy.

255:
Quote:
 Any chance of this going into glibc and/or gcc?

That would be an honor, but I have no idea how to go about submitting it :)

AP:
Quote:
 So basically you rehashed a bunch of knowledge already known for about 5+ years now...

This is surprising to hear. I clearly stated in the article what is new, and that includes 5..20% gains over the old code. I wonder why you are saying this, and why you have to hide under cover of anonymity. Anyway, if you have nothing productive to add, sit down and shut up. Thank you.

AP2:
Quote:
 A few samples of what 'block sizes' correspond to these wildly varying speedups (7%..300%)? I would assume: the bigger the block, the better (if you're using SIMD moves).

Sure; see above. Yeah.

Promit:
Quote:
 Familiarity. That function functions as a drop-in replacement for std::copy when used with PoD types.

Gotcha. For me, it's the opposite :)

AP3:
Quote:
 Is there an .obj file available anywhere, or am I going to have to bite the bullet and install nasm to get this to work :)

Sorry, there is currently no .obj. The code is still under development (thought up a few improvements today), but I'll post one when it's stabilized.

##### Share on other sites
Wanting to try out your code, I grabbed the latest version of nasm and plopped your asm code into my project, but I am having a problem with:

```nasm
global sym(ia32_memcpy)
sym(ia32_memcpy):
```

Nasm gives:
```
c:\Tools\Lib\memcpy.asm:221: identifier expected after GLOBAL
c:\Tools\Lib\memcpy.asm:222: parser: instruction expected
```

It doesn't seem to like 'sym'.

##### Share on other sites
Ah, sorry - that wasn't included in the code. Definition is as follows:
```nasm
; Usage:
; use sym(ia32_cap) instead of _ia32_cap - on relevant platforms, sym() will add
; the underlines automagically, on others it won't
%ifdef DONT_USE_UNDERLINE
%define sym(a) a
%else
%define sym(a) _ %+ a
%endif
```

It's wrapped around all symbols visible to the C program and takes care of adding underscore (VC wants it, GCC doesn't). Too bad ASM isn't source/binary compatible between those 2 compilers otherwise. *sigh*

##### Share on other sites
Quote:
 Original post by Jan Wassenberg
Ah, sorry - that wasn't included in the code. Definition is as follows: [sym() macro definition snipped]
It's wrapped around all symbols visible to the C program and takes care of adding underscore (VC wants it, GCC doesn't). Too bad ASM isn't source/binary compatible between those 2 compilers otherwise. *sigh*

Couldn't that be solved by simply making the sym macro expand to both versions?

Alternatively, could one choose a single internal convention and expose it under both mangled and unmangled names? Shouldn't 'external' or similar be able to do that?

Long time since I used nasm though. Seems like good results.

##### Share on other sites
Nice one Jan. Can you submit this article to GameDev (perhaps they should implement Flipcode's COTD)?
It doesn't matter that it's short.
This way me and others can find it in future.

##### Share on other sites
Update: Fixed a bug in the benchmark; only now are real-world transfer sizes and misaligns reflected correctly.
With that, the unrolled copy technique employed by VC7.1 memcpy for <32 bytes comes out on top [on my P3 laptop].
An improved version (3% faster, smaller code footprint) of it has been developed.

I had a rather nifty idea on the train yesterday. MMX copy can only handle 64 bytes; VC bridges the gap between 32 and that with rep movsd (slow). This is because expanding the table of instructions to handle 64 bytes would end up costing a lot of icache.
As an alternative, we can "re-enter" the copy code with minimal overhead and thereby handle the full 63 bytes that way. This comes out 7% faster in that range; the previously mentioned speedups still hold for larger transfers.

I will update the article and explain this in more detail.

DigitalDelusion:
Quote:
 Couldn't that be solved by simply making the sym macro expand to both versions?
Alternatively, could one choose a single internal convention and expose it under both mangled and unmangled names? Shouldn't 'external' or similar be able to do that?

Yes, that's what I was doing before; our Unix dude fixed/changed it to the above. We also call into C functions from asm, so that may be why sym() is necessary. (we could expose both symbols to C, but not the reverse; asm code must reference exactly what the particular compiler generates)

zedzeek: thanks! Good idea, I will submit it after making the above changes and drawing pretty graphs of the speedup (when back on a real computer :) )

##### Share on other sites
Update: extensive measurements and further optimization have resulted in even larger gains. Current code outperforms VC memcpy by 20% (!) even on the non-MMX and SSE codepaths. Exact figure for maximum speedup is 358%.
Moral of the story: micro-optimization in integer code can indeed bring huge gains :)

Article has been updated, adding information on benchmark+optimizations and graphs of speed vs. transfer_size. Again, feedback is very welcome!

##### Share on other sites
Can I suggest that you propose the article to GameDev? You just have to add more background (why it may be useful to optimize memcpy, and maybe other thoughts about optimization) and it will be great!

Regards (and congratulations on your achievement)

##### Share on other sites
Thanks. What exactly would you propose adding? There is already a very short introduction on why to optimize memcpy at all (2 use cases), and the micro-optimization techniques that were successfully applied are listed.

I will submit this article after waiting a bit for more suggestions/tips.

##### Share on other sites
Quote:
 Original post by Jan WassenbergI will submit this article after waiting a bit for more suggestions/tips.
A somewhat automated test package would be nice, I'm sure there'd be plenty of gamedev users (including me) willing to submit P4 results.

I'm still quite interested in finding out how 16-byte SSE transfers perform, especially MOVAPS on SSE2-equipped machines. Besides, you'd get rid of the EMMS instruction, which I assume is a bottleneck for small(ish) transfers.

Oh, and precompiled object files for the NASM-illiterate. I guess I could post some myself but binaries are obviously more trustworthy from the author himself.
