New Mini-Article: Speeding up memcpy

Nice article.

I would be interested to see how the various memcpy methods perform compared to a copy loop written in plain C++ (with maximum optimizations), and the same loop with some unrolling.
(Sorry if the benchmark already does this.)
Quote:Original post by PaulCesar
Completely Brilliant, Makes me sick really :)


I second that. Jan, you're just too good :)
Let me add my voice to the congratulations; well done, sir!
To win one hundred victories in one hundred battles is not the acme of skill. To subdue the enemy without fighting is the acme of skill.
I've had a chance to examine yours ^_^.
There are some places I haven't ventured into yet, but the rest is great.

Thank you.
--> The great thing about Object Oriented code is that it can make small, simple problems look like large, complex ones <--
Quote:A somewhat automated test package would be nice; I'm sure there'd be plenty of gamedev users (including me) willing to submit P4 results.

Sure, I can upload the test rig. It was ripped out of the 0ad codebase, so there are a few dependencies (zlib and dbghelp) and some bloat in there, but hey. Be advised that it runs at REALTIME priority and thus completely locks up your system for the duration (estimated 45 sec..2 min)! Output goes into stdout.txt in the executable's directory.
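In case anyone wants to replicate the priority business in their own rig, it boils down to a couple of Win32 calls. This is a minimal sketch, not the rig's actual code:

#include <windows.h>

// CAUTION: REALTIME_PRIORITY_CLASS starves everything else (GUI, disk I/O);
// only hold it for the duration of the timing runs.
static void enter_realtime_priority()
{
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
}

static void restore_normal_priority()
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_NORMAL);
    SetPriorityClass(GetCurrentProcess(), NORMAL_PRIORITY_CLASS);
}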

A 0ad teammate has already tested on P4; thanks, though! Numbers are as follows:
TINY TRANSFERS
---------------
      8:  Jan      : 674 (+0.0%)     VC7.1    : 683 (+1.3%)     std::copy: 709 (+5.2%)     intrinsic: 731 (+8.5%)
     12:  Jan      : 672 (+0.0%)     VC7.1    : 686 (+2.1%)     std::copy: 712 (+6.0%)     intrinsic: 720 (+7.1%)
     16:  Jan      : 674 (+0.0%)     VC7.1    : 676 (+0.3%)     std::copy: 717 (+6.4%)     intrinsic: 717 (+6.4%)
     24:  Jan      : 686 (+0.0%)     VC7.1    : 706 (+2.9%)     std::copy: 728 (+6.1%)     intrinsic: 721 (+5.1%)
     26:  Jan      : 737 (+0.7%)     VC7.1    : 739 (+1.0%)     std::copy: 732 (+0.0%)     intrinsic: 767 (+4.8%)
     32:  Jan      : 741 (+2.1%)     VC7.1    : 736 (+1.4%)     std::copy: 766 (+5.5%)     intrinsic: 726 (+0.0%)
     35:  Jan      : 762 (+0.0%)     VC7.1    : 771 (+1.2%)     std::copy: 796 (+4.5%)     intrinsic: 773 (+1.4%)
     37:  Jan      : 743 (+0.0%)     VC7.1    : 782 (+5.2%)     std::copy: 829 (+11.6%)    intrinsic: 798 (+7.4%)
     40:  Jan      : 737 (+0.0%)     VC7.1    : 817 (+10.9%)    std::copy: 836 (+13.4%)    intrinsic: 813 (+10.3%)
     41:  Jan      : 751 (+0.0%)     VC7.1    : 789 (+5.1%)     std::copy: 837 (+11.5%)    intrinsic: 845 (+12.5%)
     42:  Jan      : 742 (+0.0%)     VC7.1    : 788 (+6.2%)     std::copy: 838 (+12.9%)    intrinsic: 813 (+9.6%)
     43:  Jan      : 808 (+0.0%)     VC7.1    : 856 (+5.9%)     std::copy: 861 (+6.6%)     intrinsic: 819 (+1.4%)
     50:  Jan      : 812 (+0.0%)     VC7.1    : 814 (+0.2%)     std::copy: 826 (+1.7%)     intrinsic: 836 (+3.0%)
     60:  Jan      : 796 (+0.0%)     VC7.1    : 810 (+1.8%)     std::copy: 851 (+6.9%)     intrinsic: 843 (+5.9%)

MMX TRANSFERS
---------------
     64:  Jan      : 768 (+0.0%)     VC7.1    : 841 (+9.5%)     std::copy: 850 (+10.7%)    intrinsic: 872 (+13.5%)
    128:  Jan      : 853 (+0.0%)     VC7.1    : 966 (+13.2%)    std::copy: 998 (+17.0%)    intrinsic: 1034 (+21.2%)
    256:  Jan      : 998 (+0.0%)     VC7.1    : 1172 (+17.4%)   std::copy: 1236 (+23.8%)   intrinsic: 1210 (+21.2%)
    512:  Jan      : 1282 (+0.0%)    VC7.1    : 1490 (+16.2%)   std::copy: 1527 (+19.1%)   intrinsic: 1601 (+24.9%)
   1024:  Jan      : 1879 (+0.0%)    VC7.1    : 2198 (+17.0%)   std::copy: 2246 (+19.5%)   intrinsic: 2443 (+30.0%)
   2048:  Jan      : 3191 (+0.0%)    VC7.1    : 3517 (+10.2%)   std::copy: 3582 (+12.3%)   intrinsic: 4047 (+26.8%)
   4096:  Jan      : 5648 (+0.0%)    VC7.1    : 6184 (+9.5%)    std::copy: 6411 (+13.5%)   intrinsic: 7315 (+29.5%)

LARGE TRANSFERS
---------------
  65536:  Jan      : 63286 (+0.0%)    VC7.1    : 83653 (+32.2%)    std::copy: 83957 (+32.7%)    intrinsic: 98750 (+56.0%)
  98304:  Jan      : 78322 (+0.0%)    VC7.1    : 105671 (+34.9%)   std::copy: 103310 (+31.9%)   intrinsic: 141624 (+80.8%)
 131072:  Jan      : 105282 (+0.0%)   VC7.1    : 125678 (+19.4%)   std::copy: 125907 (+19.6%)   intrinsic: 179672 (+70.7%)
 196608:  Jan      : 166292 (+0.0%)   VC7.1    : 171921 (+3.4%)    std::copy: 168715 (+1.5%)    intrinsic: 256996 (+54.5%)
 262144:  Jan      : 230008 (+0.0%)   VC7.1    : 285204 (+24.0%)   std::copy: 278741 (+21.2%)   intrinsic: 391790 (+70.3%)
 393216:  Jan      : 337867 (+0.0%)   VC7.1    : 944732 (+179.6%)  std::copy: 992164 (+193.7%)  intrinsic: 1054227 (+212.0%)
 524288:  Jan      : 502399 (+0.0%)   VC7.1    : 1261112 (+151.0%) std::copy: 1258292 (+150.5%) intrinsic: 1411586 (+181.0%)
1048576:  Jan      : 1681793 (+0.0%)  VC7.1    : 2552749 (+51.8%)  std::copy: 2563882 (+52.4%)  intrinsic: 2863032 (+70.2%)

Gains are less marked than on Athlon and Pentium III, but the new code still comes out a good bit ahead.

Quote:I'm still quite interested in finding out how 16-byte SSE transfers perform, especially MOVAPS on SSE2-equipped machines. Besides, you'd get rid of the EMMS instruction, which I assume is a bottleneck for small(ish) transfers.

Yes, that's still on the TODO list. I would expect large gains, since the P4's L2 cache access is wider than MOVQ's 64-bit transfers.
Unfortunately I'm not in a position to write it myself, for lack of SSE2 hardware to test on (and testing must happen often during development). Any takers? Should be easy to integrate :)
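To give a volunteer a head start, here is roughly the kernel I have in mind - an untested sketch (I can't run it without SSE2 hardware), using MOVDQA rather than MOVAPS (same 16-byte width), and assuming 16-byte-aligned pointers and a size that's a multiple of 64:

#include <cstddef>
#include <emmintrin.h>   // SSE2 intrinsics

// Untested sketch: copies size bytes via 16-byte MOVDQA loads and
// MOVNTDQ non-temporal stores (i.e. the large-transfer path; smaller
// transfers would use plain _mm_store_si128 instead of streaming stores).
// Note: no EMMS needed - XMM registers don't alias the x87/MMX stack.
void sse2_copy(void* dst, const void* src, size_t size)
{
    __m128i* d = (__m128i*)dst;
    const __m128i* s = (const __m128i*)src;
    for (size_t i = 0; i < size/16; i += 4)
    {
        const __m128i x0 = _mm_load_si128(s+i+0);
        const __m128i x1 = _mm_load_si128(s+i+1);
        const __m128i x2 = _mm_load_si128(s+i+2);
        const __m128i x3 = _mm_load_si128(s+i+3);
        _mm_stream_si128(d+i+0, x0);
        _mm_stream_si128(d+i+1, x1);
        _mm_stream_si128(d+i+2, x2);
        _mm_stream_si128(d+i+3, x3);
    }
    _mm_sfence();   // flush the write-combining buffers
}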

Quote:Oh, and precompiled object files for the NASM-illiterate. I guess I could post some myself, but binaries are obviously more trustworthy coming from the author himself.

Ah, thanks for the reminder. Find it in the above package (ia32.asm.obj, which also contains some other stuff that we don't need).
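For those who'd rather assemble it themselves, it should just be a matter of (assuming NASM is installed):

nasm -f win32 -o ia32.asm.obj ia32.asm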


Quote:I would be interested to see how the various memcpy methods perform compared to a copy loop written in plain C++ (with maximum optimizations), and the same loop with some unrolling.

Yes, that'd be interesting. Of course such a loop has no hope of competing above 64 bytes; results for smaller transfers are:
  8:  Jan      : 96 (+0.0%)     VC7.1    : 103 (+7.3%)    loop     : 96 (+0.0%)     uloop    : 98 (+2.1%)
 12:  Jan      : 96 (+0.0%)     VC7.1    : 104 (+8.3%)    loop     : 98 (+2.1%)     uloop    : 105 (+9.4%)
 16:  Jan      : 99 (+0.0%)     VC7.1    : 105 (+6.1%)    loop     : 102 (+3.0%)    uloop    : 102 (+3.0%)
 24:  Jan      : 102 (+0.0%)    VC7.1    : 107 (+4.9%)    loop     : 107 (+4.9%)    uloop    : 106 (+3.9%)
 26:  Jan      : 104 (+0.0%)    VC7.1    : 112 (+7.7%)    loop     : 112 (+7.7%)    uloop    : 110 (+5.8%)
 32:  Jan      : 104 (+0.0%)    VC7.1    : 124 (+19.2%)   loop     : 124 (+19.2%)   uloop    : 110 (+5.8%)
 35:  Jan      : 109 (+0.0%)    VC7.1    : 131 (+20.2%)   loop     : 133 (+22.0%)   uloop    : 114 (+4.6%)
 37:  Jan      : 107 (+0.0%)    VC7.1    : 128 (+19.6%)   loop     : 131 (+22.4%)   uloop    : 118 (+10.3%)
 40:  Jan      : 106 (+0.0%)    VC7.1    : 129 (+21.7%)   loop     : 131 (+23.6%)   uloop    : 114 (+7.5%)
 41:  Jan      : 108 (+0.0%)    VC7.1    : 129 (+19.4%)   loop     : 133 (+23.1%)   uloop    : 116 (+7.4%)
 42:  Jan      : 109 (+0.0%)    VC7.1    : 132 (+21.1%)   loop     : 137 (+25.7%)   uloop    : 118 (+8.3%)
 43:  Jan      : 112 (+0.0%)    VC7.1    : 133 (+18.8%)   loop     : 140 (+25.0%)   uloop    : 120 (+7.1%)
 50:  Jan      : 111 (+0.0%)    VC7.1    : 134 (+20.7%)   loop     : 143 (+28.8%)   uloop    : 119 (+7.2%)
 60:  Jan      : 112 (+0.0%)    VC7.1    : 135 (+20.5%)   loop     : 145 (+29.5%)   uloop    : 127 (+13.4%)

We see that neither can really keep up, especially at unaligned/uneven byte counts (the tail jump table is unbeatable). They're not as far behind as one might expect, though, especially at smaller sizes; that is because they have less call overhead than ia32_memcpy and memcpy.
BTW, both compile down to loops containing dword MOV instructions, followed by a byte loop. uloop is unrolled 2x.
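For reference, here is roughly what they look like in source form (a paraphrased sketch, not the verbatim test code; copy_tail is a C approximation of what the asm jump table does for the leftover bytes):

#include <cstddef>

// "loop": VC7.1 compiles this to dword MOVs followed by a byte loop.
void copy_loop(char* dst, const char* src, size_t size)
{
    size_t i = 0;
    for (; i + 4 <= size; i += 4)
        *(unsigned*)(dst + i) = *(const unsigned*)(src + i);
    for (; i < size; ++i)           // byte tail: one test+branch per byte
        dst[i] = src[i];
}

// "uloop": the same, with the dword loop unrolled 2x.
void copy_uloop(char* dst, const char* src, size_t size)
{
    size_t i = 0;
    for (; i + 8 <= size; i += 8)
    {
        *(unsigned*)(dst + i)     = *(const unsigned*)(src + i);
        *(unsigned*)(dst + i + 4) = *(const unsigned*)(src + i + 4);
    }
    for (; i < size; ++i)
        dst[i] = src[i];
}

// What the asm tail jump table amounts to: one computed jump into
// straight-line MOVs, with zero per-byte loop overhead.
void copy_tail(char* dst, const char* src, size_t remainder)
{
    switch (remainder)
    {
    case 3: dst[2] = src[2];    // fall through
    case 2: dst[1] = src[1];    // fall through
    case 1: dst[0] = src[0];
    case 0: break;
    }
}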


Thanks again for the kind words - I am flattered :)
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Article has been submitted.

