Why is std::copy faster than std::memcpy?

Started by
11 comments, last by RobTheBloke 8 years, 7 months ago

Hi,

Why is std::copy faster than std::memcpy?

Possible implementation of std::copy :


template<class InputIt, class OutputIt>
OutputIt copy(InputIt first, InputIt last, OutputIt d_first)
{
  while (first != last) {
    *d_first++ = *first++;
  }
  return d_first;
}

Thanks


First things first: That's not how most std::copy implementations work.

If you actually look at one, you will see that most of them use compile-time type deduction to reduce the amount of work they need to do. Furthermore, as they're templates and fairly small, they have a high chance of being inlined. In addition, because the compiler has exact information about the size of the objects, it can explicitly optimize the copy for your type. Lastly, in a lot of cases, if copy can deduce that the iterator is a pointer (to a trivially copyable type), it can simply delegate to memmove.

memcpy tends to be located in the msvcrt DLL, and as such will typically not be inlined (LTCG can do this, though). It is also quite a bit more complex, as it does not know at compile time what you're going to pass into it, and thus has to do more work to move things around efficiently.

All of that being said, in general the two should perform roughly the same.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.


That's not how most std::copy implementations work. If you actually look at it you will see...

That is actually a fun exercise.

Seriously, open up your implementation's version of std::copy. Find all the variations, since there are likely several with subtle differences in types.

Then look over how the different forms of copy's template parameter types are themselves template types with their own subtle variations and internal implementations. So a small number of std::copy() templates can be implemented in a large number of implementation details. Some implementation variations for random access iterators, for pointers to scalars, for forward iterators, for input iterators, for arbitrary other iterators, and so on.

Note that memcpy is (usually) forwarded to a compiler intrinsic (as in https://msdn.microsoft.com/en-us/library/tzkfha43.aspx and https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html).

Also see https://stackoverflow.com/questions/4707012/c-memcpy-vs-stdcopy, especially the "explanation of results" part. Fancy overloads of std::copy aren't (generally) relevant to the case where std::copy is super efficient.

The relevant bit is that the optimizer can see into the instantiation of templates and apply type-aware optimizations that C implementations of memcpy (and apparently most intrinsic versions) just can't.

This is akin to the same reason that C++ std::sort is faster than C's qsort, even if they implement the same sort algorithm internally. Or why C++ hashing functions can be faster than C versions. etc. Assuming you're okay with the increased compile times, C++ enables the compiler to work magic in a way that C doesn't.

Sean Middleditch – Game Systems Engineer – Join my team!

I wouldn't worry about differences in speed between two copy methods, but instead try to avoid having to copy things.

Seriously, if such speed differences make a significant difference, there is either something very wrong in the design, or you're working at the very edge of what the application or the system can handle, which means that if you make things a tad bigger, it dies anyway.


This is akin to the same reason that C++ std::sort is faster than C's qsort

Well, no.

std::sort is first and foremost faster because qsort is not only a non-inlineable library function, but one that calls back a user-supplied function (which needs to cast from void* and do whatever is needed as comparison, and for which the compiler cannot assume strict aliasing rules). That callback cannot possibly be inlined, nor can the compiler optimize across it. So assuming a pretty good sorting algorithm that needs exactly N comparisons, you have already added N non-inlined function calls.

Now of course std::sort has a comparison functor, too. So technically you have just as many function callbacks. But these can in practically every case be inlined, and the compiler is able to further optimize the whole "unit" of sort+functor, since it can see all the source.

Also, the comparison for qsort returns -1, 0, or 1 depending on the result, whereas comparators for std::sort return bool. This allows much simpler logic for std::sort (of course, on many architectures, the more complex logic can be optimized into one compare and three flag-dependent conditional jumps on the library side, but that is not guaranteed, and the added complexity needed to produce a tri-state on the user side remains).

Well, no.

std::sort is first and foremost faster because qsort is not only a non-inlineable library function, but one that that calls back a user-supplied function


Well, yes.

The function being inline-able into the algorithm is a consequence of how "the optimizer can see into the instantiation of templates", as I said. :)

Sean Middleditch – Game Systems Engineer – Join my team!

Why std::copy is faster than std::memcpy ?

Could you post your profiling scenario that made you observe this? We might speculate further then.

I read that on Stack Overflow, but I also tested it myself: copying a 4x4 identity matrix 1,000,000 times with memcpy vs. std::copy. The time difference is there:


memcpy = 0ms
copy = 9ms

The test was made in release mode.

Those numbers show memcpy being faster, not the other way around. They are also strongly suggestive of basic mistakes in setting up a microbenchmark.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

This topic is closed to new replies.
