Jan Wassenberg

New Mini-Article: Speeding up memcpy


Nice article.

I would be interested to see how the various memcpy methods perform compared to a copy loop written in plain C++ (with maximal optimizations), and to the same loop with some unrolling.
(Sorry if the benchmark already does this.)

Quote:
Original post by PaulCesar
Completely Brilliant, Makes me sick really :)


I second that. Jan, you're just too good :)

Quote:
A somewhat automated test package would be nice, I'm sure there'd be plenty of gamedev users (including me) willing to submit P4 results.

Sure, I can upload the test rig. It was ripped out of the 0ad codebase, so it carries a few dependencies (zlib and dbghelp) and some bloat, but hey. Be advised that it runs at REALTIME priority and thus completely locks up your system for the duration (roughly 45 seconds to 2 minutes)! Output goes into stdout.txt in the executable's directory.
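
For readers who want to reproduce the measurements without the rig, here is a minimal, portable sketch of a timing helper. It is not the 0ad test rig: the name `time_copy_ns`, the best-of-N strategy and the use of `std::chrono::steady_clock` are assumptions made for illustration, and it does not raise the process priority.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <limits>
#include <vector>

// Hypothetical helper (not part of the original rig): returns the best
// wall-clock time in nanoseconds over `repetitions` runs of `copy` for a
// transfer of `size` bytes, or -1 if the destination does not match the
// source afterwards.
template <typename CopyFn>
long long time_copy_ns(CopyFn copy, std::size_t size, int repetitions = 1000)
{
    std::vector<unsigned char> src(size, 0xAB);
    std::vector<unsigned char> dst(size, 0);
    long long best = std::numeric_limits<long long>::max();

    for (int i = 0; i < repetitions; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        copy(dst.data(), src.data(), size);
        const auto t1 = std::chrono::steady_clock::now();
        const long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        best = std::min(best, ns);
    }

    // Sanity check: a routine that copies nothing should not "win".
    if (std::memcmp(dst.data(), src.data(), size) != 0)
        return -1;
    return best;
}
```

It could be called as `time_copy_ns([](void* d, const void* s, std::size_t n) { std::memcpy(d, s, n); }, 4096)`. Taking the minimum over many runs filters out some scheduling noise, but the clock is probably too coarse for the tiny transfers; a cycle counter would be the better tool there.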

A 0ad teammate has already tested on P4; thanks, though! Numbers are as follows:

TINY TRANSFERS
---------------

8:
Jan : 674 (+0.0%)
VC7.1 : 683 (+1.3%)
std::copy: 709 (+5.2%)
intrinsic: 731 (+8.5%)

12:
Jan : 672 (+0.0%)
VC7.1 : 686 (+2.1%)
std::copy: 712 (+6.0%)
intrinsic: 720 (+7.1%)

16:
Jan : 674 (+0.0%)
VC7.1 : 676 (+0.3%)
std::copy: 717 (+6.4%)
intrinsic: 717 (+6.4%)

24:
Jan : 686 (+0.0%)
VC7.1 : 706 (+2.9%)
std::copy: 728 (+6.1%)
intrinsic: 721 (+5.1%)

26:
Jan : 737 (+0.7%)
VC7.1 : 739 (+1.0%)
std::copy: 732 (+0.0%)
intrinsic: 767 (+4.8%)

32:
Jan : 741 (+2.1%)
VC7.1 : 736 (+1.4%)
std::copy: 766 (+5.5%)
intrinsic: 726 (+0.0%)

35:
Jan : 762 (+0.0%)
VC7.1 : 771 (+1.2%)
std::copy: 796 (+4.5%)
intrinsic: 773 (+1.4%)

37:
Jan : 743 (+0.0%)
VC7.1 : 782 (+5.2%)
std::copy: 829 (+11.6%)
intrinsic: 798 (+7.4%)

40:
Jan : 737 (+0.0%)
VC7.1 : 817 (+10.9%)
std::copy: 836 (+13.4%)
intrinsic: 813 (+10.3%)

41:
Jan : 751 (+0.0%)
VC7.1 : 789 (+5.1%)
std::copy: 837 (+11.5%)
intrinsic: 845 (+12.5%)

42:
Jan : 742 (+0.0%)
VC7.1 : 788 (+6.2%)
std::copy: 838 (+12.9%)
intrinsic: 813 (+9.6%)

43:
Jan : 808 (+0.0%)
VC7.1 : 856 (+5.9%)
std::copy: 861 (+6.6%)
intrinsic: 819 (+1.4%)

50:
Jan : 812 (+0.0%)
VC7.1 : 814 (+0.2%)
std::copy: 826 (+1.7%)
intrinsic: 836 (+3.0%)

60:
Jan : 796 (+0.0%)
VC7.1 : 810 (+1.8%)
std::copy: 851 (+6.9%)
intrinsic: 843 (+5.9%)

MMX TRANSFERS
---------------

64:
Jan : 768 (+0.0%)
VC7.1 : 841 (+9.5%)
std::copy: 850 (+10.7%)
intrinsic: 872 (+13.5%)

128:
Jan : 853 (+0.0%)
VC7.1 : 966 (+13.2%)
std::copy: 998 (+17.0%)
intrinsic: 1034 (+21.2%)

256:
Jan : 998 (+0.0%)
VC7.1 : 1172 (+17.4%)
std::copy: 1236 (+23.8%)
intrinsic: 1210 (+21.2%)

512:
Jan : 1282 (+0.0%)
VC7.1 : 1490 (+16.2%)
std::copy: 1527 (+19.1%)
intrinsic: 1601 (+24.9%)

1024:
Jan : 1879 (+0.0%)
VC7.1 : 2198 (+17.0%)
std::copy: 2246 (+19.5%)
intrinsic: 2443 (+30.0%)

2048:
Jan : 3191 (+0.0%)
VC7.1 : 3517 (+10.2%)
std::copy: 3582 (+12.3%)
intrinsic: 4047 (+26.8%)

4096:
Jan : 5648 (+0.0%)
VC7.1 : 6184 (+9.5%)
std::copy: 6411 (+13.5%)
intrinsic: 7315 (+29.5%)

LARGE TRANSFERS
---------------

65536:
Jan : 63286 (+0.0%)
VC7.1 : 83653 (+32.2%)
std::copy: 83957 (+32.7%)
intrinsic: 98750 (+56.0%)

98304:
Jan : 78322 (+0.0%)
VC7.1 : 105671 (+34.9%)
std::copy: 103310 (+31.9%)
intrinsic: 141624 (+80.8%)

131072:
Jan : 105282 (+0.0%)
VC7.1 : 125678 (+19.4%)
std::copy: 125907 (+19.6%)
intrinsic: 179672 (+70.7%)

196608:
Jan : 166292 (+0.0%)
VC7.1 : 171921 (+3.4%)
std::copy: 168715 (+1.5%)
intrinsic: 256996 (+54.5%)

262144:
Jan : 230008 (+0.0%)
VC7.1 : 285204 (+24.0%)
std::copy: 278741 (+21.2%)
intrinsic: 391790 (+70.3%)

393216:
Jan : 337867 (+0.0%)
VC7.1 : 944732 (+179.6%)
std::copy: 992164 (+193.7%)
intrinsic: 1054227 (+212.0%)

524288:
Jan : 502399 (+0.0%)
VC7.1 : 1261112 (+151.0%)
std::copy: 1258292 (+150.5%)
intrinsic: 1411586 (+181.0%)

1048576:
Jan : 1681793 (+0.0%)
VC7.1 : 2552749 (+51.8%)
std::copy: 2563882 (+52.4%)
intrinsic: 2863032 (+70.2%)



Gains are less marked than on Athlon and Pentium III, but the new code still comes out a good bit ahead.
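
Judging from the numbers, the percentage next to each timing appears to be the slowdown relative to the fastest entry in its group (which is why the 26- and 32-byte groups show +0.0% on a routine other than "Jan"). A small sketch of that calculation, with the hypothetical name `relative_slowdown_percent`:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Slowdown of each timing relative to the fastest one in the group,
// in percent, rounded to one decimal place. The function name is
// illustrative; the original rig's code is not shown in the thread.
std::vector<double> relative_slowdown_percent(const std::vector<double>& timings)
{
    std::vector<double> result;
    if (timings.empty())
        return result;

    const double best = *std::min_element(timings.begin(), timings.end());
    result.reserve(timings.size());
    for (double t : timings) {
        const double percent = (t / best - 1.0) * 100.0;
        result.push_back(std::round(percent * 10.0) / 10.0);
    }
    return result;
}
```

Feeding it the 8-byte group (674, 683, 709, 731) gives 0.0, 1.3, 5.2 and 8.5, matching the listing.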

Quote:
I'm still quite interested in finding out how 16-byte SSE transfers perform, especially MOVAPS on SSE2 equipped machines. Besides, you'd get rid of the EMMS instruction which I assume is a bottleneck for small(ish) transfers.

Yes, that's still on the TODO list. I would expect large gains, since P4 L2 cache access is wider than the 64-bit MOVQ size.
Unfortunately I can't write this myself for lack of SSE2 hardware to test on (which must be done often during development). Any takers? It should be easy to integrate :)
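
As a starting point for anyone taking this up, here is a rough sketch of a 16-byte copy loop using SSE2 intrinsics instead of assembly. It is not the article's code: `sse2_copy` is a hypothetical name, and the sketch uses unaligned loads and stores (MOVDQU) so it is safe for any pointers. The aligned variants (`_mm_load_si128`/`_mm_store_si128`, or `_mm_load_ps`/`_mm_store_ps` for MOVAPS) require 16-byte-aligned addresses. On targets without SSE2 it simply falls back to `std::memcpy`.

```cpp
#include <cstddef>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Hypothetical sketch: copy `size` bytes in 16-byte blocks through an XMM
// register, then finish the remaining 0..15 bytes with memcpy.
void sse2_copy(void* dst, const void* src, std::size_t size)
{
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);

#if SKETCH_HAVE_SSE2
    while (size >= 16) {
        // Unaligned 128-bit load and store; no alignment assumptions.
        const __m128i block = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(d), block);
        s += 16;
        d += 16;
        size -= 16;
    }
#endif

    std::memcpy(d, s, size);  // tail (or everything, without SSE2)
}
```

Because the XMM registers do not alias the x87 stack the way MMX registers do, no EMMS is needed afterwards. Whether this actually beats the MOVQ path on a P4 would have to be measured.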

Quote:
Oh, and precompiled object files for the NASM-illiterate. I guess I could post some myself but binaries are obviously more trustworthy from the author himself.

Ah, thanks for the reminder. You'll find it in the above package (ia32.asm.obj, which also contains some other stuff that we don't need).


Quote:
I would be interested to see how the various memcpy methods perform compared to a copy loop written in plain C++ (with maximal optimizations), and to the same loop with some unrolling.

Yes, that'd be interesting. Of course such a loop has no hope of competing above 64 bytes; results for smaller transfers are:

8:
Jan : 96 (+0.0%)
VC7.1 : 103 (+7.3%)
loop : 96 (+0.0%)
uloop : 98 (+2.1%)

12:
Jan : 96 (+0.0%)
VC7.1 : 104 (+8.3%)
loop : 98 (+2.1%)
uloop : 105 (+9.4%)

16:
Jan : 99 (+0.0%)
VC7.1 : 105 (+6.1%)
loop : 102 (+3.0%)
uloop : 102 (+3.0%)

24:
Jan : 102 (+0.0%)
VC7.1 : 107 (+4.9%)
loop : 107 (+4.9%)
uloop : 106 (+3.9%)

26:
Jan : 104 (+0.0%)
VC7.1 : 112 (+7.7%)
loop : 112 (+7.7%)
uloop : 110 (+5.8%)

32:
Jan : 104 (+0.0%)
VC7.1 : 124 (+19.2%)
loop : 124 (+19.2%)
uloop : 110 (+5.8%)

35:
Jan : 109 (+0.0%)
VC7.1 : 131 (+20.2%)
loop : 133 (+22.0%)
uloop : 114 (+4.6%)

37:
Jan : 107 (+0.0%)
VC7.1 : 128 (+19.6%)
loop : 131 (+22.4%)
uloop : 118 (+10.3%)

40:
Jan : 106 (+0.0%)
VC7.1 : 129 (+21.7%)
loop : 131 (+23.6%)
uloop : 114 (+7.5%)

41:
Jan : 108 (+0.0%)
VC7.1 : 129 (+19.4%)
loop : 133 (+23.1%)
uloop : 116 (+7.4%)

42:
Jan : 109 (+0.0%)
VC7.1 : 132 (+21.1%)
loop : 137 (+25.7%)
uloop : 118 (+8.3%)

43:
Jan : 112 (+0.0%)
VC7.1 : 133 (+18.8%)
loop : 140 (+25.0%)
uloop : 120 (+7.1%)

50:
Jan : 111 (+0.0%)
VC7.1 : 134 (+20.7%)
loop : 143 (+28.8%)
uloop : 119 (+7.2%)

60:
Jan : 112 (+0.0%)
VC7.1 : 135 (+20.5%)
loop : 145 (+29.5%)
uloop : 127 (+13.4%)


We see that neither can really keep up, especially with unaligned/uneven byte counts (the tail jump table is unbeatable). They do better than might be expected, though, especially at the smallest sizes, because the loops have less overhead than ia32_memcpy and memcpy.
BTW, both compile down to loops containing dword MOV instructions, followed by a byte loop. uloop is unrolled 2x.
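
The source of the two loops isn't shown in this thread, so the following is only a guess at code with the same shape as the generated loops described above: a dword loop followed by a byte loop, and a variant unrolled 2x. The names `loop_copy`, `uloop_copy` and `copy_dword` are made up for the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy one 32-bit word. Going through memcpy avoids alignment and aliasing
// problems; optimizing compilers typically lower it to plain MOVs.
static inline void copy_dword(unsigned char* d, const unsigned char* s)
{
    std::uint32_t w;
    std::memcpy(&w, s, sizeof w);
    std::memcpy(d, &w, sizeof w);
}

// "loop": one dword per iteration, then a byte loop for the 0..3 byte tail.
void loop_copy(void* dst, const void* src, std::size_t size)
{
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (; size >= 4; size -= 4, d += 4, s += 4)
        copy_dword(d, s);
    for (; size > 0; --size)
        *d++ = *s++;
}

// "uloop": the same, with the dword loop unrolled 2x.
void uloop_copy(void* dst, const void* src, std::size_t size)
{
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (; size >= 8; size -= 8, d += 8, s += 8) {
        copy_dword(d, s);
        copy_dword(d + 4, s + 4);
    }
    for (; size >= 4; size -= 4, d += 4, s += 4)
        copy_dword(d, s);
    for (; size > 0; --size)
        *d++ = *s++;
}
```

The byte loop at the end is where sizes like 35, 37 or 43 lose time compared with a routine that dispatches the tail through a jump table.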


Thanks again for the kind words - I am flattered :)
