With that and one additional improvement (replacing the MOVSB table with a plain loop), the results are as follows:
SMALL TRANSFERS
---------------
Size     AMD                 VC7.1               Jan                 std::copy           intrinsic
64       90014 (+22.0%)      73765 (+0.0%)       74793 (+1.4%)       79794 (+8.2%)       85894 (+16.4%)
128      194942 (+18.6%)     166969 (+1.6%)      164314 (+0.0%)      177892 (+8.3%)      188542 (+14.7%)
256      426550 (+17.9%)     394309 (+9.0%)      361846 (+0.0%)      416561 (+15.1%)     443454 (+22.6%)
512      967694 (+16.1%)     1013445 (+21.5%)    833830 (+0.0%)      1053348 (+26.3%)    1151986 (+38.2%)
2048     6537103 (+9.2%)     9311588 (+55.6%)    5985845 (+0.0%)     9471943 (+58.2%)    10965431 (+83.2%)
8192     66095286 (+4.1%)    121148415 (+90.9%)  63470328 (+0.0%)    121783488 (+91.9%)  145574481 (+129.4%)

LARGE TRANSFERS
---------------
Size     AMD                 VC7.1               Jan                 std::copy           intrinsic
65536    62353 (+31.4%)      67971 (+43.2%)      47464 (+0.0%)       68006 (+43.3%)      101206 (+113.2%)
131072   124065 (+0.1%)      135725 (+9.5%)      123908 (+0.0%)      135772 (+9.6%)      202230 (+63.2%)
196608   203983 (+0.0%)      853990 (+318.7%)    205426 (+0.7%)      809232 (+296.7%)    862066 (+322.6%)
262144   493363 (+0.0%)      1097035 (+122.4%)   504223 (+2.2%)      1110072 (+125.0%)   1139694 (+131.0%)
393216   743397 (+0.0%)      1673832 (+125.2%)   760964 (+2.4%)      1661200 (+123.5%)   1720660 (+131.5%)
524288   1049521 (+0.0%)     2255720 (+114.9%)   1060874 (+1.1%)     2198060 (+109.4%)   2318947 (+121.0%)
1048576  2112846 (+0.0%)     4589070 (+117.2%)   2127714 (+0.7%)     4505785 (+113.3%)   4794404 (+126.9%)
Analysis: I see potential for improvement on transfers < 64 bytes: we could align the destination and combine that with the IC_MOVSD block. The new code wins handily everywhere except there and in the block-prefetch range (> 192 KB), where it trails the AMD implementation by 1..2%. That is probably due to some stall or screwup in my code; I'll do pipeline analysis later.
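To illustrate the alignment idea for the sub-64-byte case: a portable C++ sketch only (the real code is asm; `small_copy` is a hypothetical name, and the dword loop stands in for the IC_MOVSD block). Copy bytes until the destination is aligned, move dwords, then mop up the tail with a plain loop:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch of the small-transfer strategy: align the destination first,
// then move 32-bit chunks (the IC_MOVSD-style block), then a byte tail.
void* small_copy(void* dst, const void* src, std::size_t n)
{
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

    // head: copy bytes until the destination is 4-byte aligned
    while (n && (reinterpret_cast<std::uintptr_t>(d) & 3)) {
        *d++ = *s++;
        --n;
    }
    // body: dword-at-a-time moves; the source may still be misaligned,
    // so go through memcpy, which compilers lower to plain loads/stores
    while (n >= 4) {
        std::uint32_t w;
        std::memcpy(&w, s, 4);
        std::memcpy(d, &w, 4);
        d += 4; s += 4; n -= 4;
    }
    // tail: plain byte loop (the MOVSB-table replacement mentioned above)
    while (n--) *d++ = *s++;
    return dst;
}
```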
Unfortunately I'm running out of time; I'm heading out of town for the week, so updates will be less frequent.
Quick replies though:
Zahlman: (template specialization): hm, that looks like it'd work. However, it only avoids one well-predicted jump, so it's probably not worth the effort.
snk_kid:
Quote:I didn't quite get what you mean here, there shouldn't be a need to use pointers to functions internally in generic algorithms unless you where referring to your version of std::copy or something.
Yes, this was just an artifact of how the benchmark is calling std::copy.
255:
Quote:Any chance of this going into glibc and/or gcc?
That would be an honor, but I have no idea how to go about submitting it :)
AP:
Quote:So basically you rehashed a bunch of knowledge already known for about 5+ years now...
This is surprising to hear. I clearly stated in the article what is new, and that includes 5..20% gains over the old code. I wonder why you say this, and why you have to hide under cover of anonymity. Anyway, if you have nothing productive to add, sit down and shut up. Thank you.
AP2:
Quote:A few samples of what 'block sizes' correspond to these wildly varying speedups (7%..300%)????
I would assume the bigger the block, the better (if you're using SIMD moves).
Sure; see the numbers above. And yes, that's roughly how it plays out.
Promit:
Quote:Familiarity. That function functions as a drop-in replacement for std::copy when used with PoD types.
Gotcha. For me, it's the opposite :)
AP3:
Quote:Is there an .obj file available anywhere, or am I going to have to bite the bullet and install nasm to get this to work :)
Sorry, there is currently no .obj. The code is still under development (I thought up a few improvements today), but I'll post one once it has stabilized.