Jan Wassenberg

New Mini-Article: Speeding up memcpy



Promit: interesting. Is the purpose familiarity/taste, or is there a pressing need for such a wrapper?

nmi:
Quote:
I'm wondering if it would be possible to determine the correct algorithm depending on size in cases where the size of the memory block is constant, i.e. known at compile time.
So if there are up to 64 bytes to copy, the mov instructions can be inlined directly. If there are more bytes to copy, or the size is only known at runtime, you generate a call to the other assembler functions, which check the CPU type etc.
This will reduce the overhead for small blocks of data and will produce the same code in the other cases.

hm. I don't see a way to generate such code at compile-time. Is this a "wouldn't it be nice if" kind of thing? :)
BTW, the overhead for small blocks is already quite low: the more complicated logic that selects a technique based on size only runs when size > 64. But yes, compiler-generated intrinsics are even better when available. For example, you could assign structs to each other rather than using memcpy; see the sketch below.
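A minimal sketch of the struct-assignment idea (Blob is a made-up example type): because the size is a compile-time constant, the compiler can inline the copy itself instead of calling memcpy.

struct Blob { char data[64]; };

void copy_blob(Blob& dst, const Blob& src) {
    dst = src;  // compiler-generated copy; typically inlined as a few mov instructions
}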

AP:
Quote:
Quote:
Another thing to remember is that the best method to use depends on whether you intend to use the data afterwards. If you use non-caching writes it will be slower to access immediately after than if you had used a standard memcpy.

Maybe you missed this line from the paper:
- 64 KB (L1 data cache size; beyond this, techniques that avoid polluting the cache are a big win)
So for large amounts of data you should bypass the cache, or chances are high you will end up with the wrong data in it.

Let me expand on that.
The thresholds for switching to the cache-bypassing techniques are chosen at the limit where cache thrashing would otherwise result (i.e. the cache size). Beyond that point they are obviously beneficial, but they are not used at all otherwise; the implementation always tries to keep your data in-cache where that makes sense (i.e. where it probably won't be evicted before being used). A sketch of the cache-bypassing idea follows.
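To make that concrete, here is a minimal sketch of a cache-bypassing copy using the SSE streaming-store intrinsic (MOVNTPS); this is an illustration, not the report's actual implementation, and it assumes 16-byte-aligned pointers and a size that is a multiple of 16:

#include <cstddef>      // size_t
#include <xmmintrin.h>  // SSE intrinsics: MOVNTPS, SFENCE

// Sketch only: copy 'size' bytes without polluting the cache.
// Assumes src/dst are 16-byte aligned and size is a multiple of 16.
void memcpy_nontemporal(void* dst, const void* src, size_t size) {
    const float* s = (const float*)src;
    float* d = (float*)dst;
    for (size_t i = 0; i < size / 16; ++i) {
        __m128 v = _mm_load_ps(s + i * 4);  // normal 16-byte load
        _mm_stream_ps(d + i * 4, v);        // non-temporal store, bypasses the cache
    }
    _mm_sfence();  // ensure the streaming stores are globally visible
}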

Doynax:
Quote:
I think he meant that a synthetic benchmark may not be ideal, since the optimum method may depend on what you're going to do with the data afterwards.

Yes, measuring performance in context is always better.

Quote:
Also, how does the new function compare to small (especially fixed size) compiler intrinsics?
And what about 16-byte SSE transfers with MOVAPS/MOVDQA?

Good questions. I'll investigate tomorrow; am dog-tired.
I suspect that on Athlons, at least, MOVAPS will not help: IIRC the internal datapaths are only 64 bits wide, so larger accesses are split up. I'll give it a try, though (the straightforward loop is sketched below). Unfortunately I lack SSE2 hardware on which to test MOVDQA.
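For reference, a minimal sketch of the 16-byte MOVAPS transfer being asked about (unbenchmarked; assumes 16-byte-aligned pointers and a size that is a multiple of 16):

#include <cstddef>      // size_t
#include <xmmintrin.h>  // SSE intrinsics: MOVAPS

// Sketch only: plain 16-byte aligned loads/stores (MOVAPS), staying in-cache.
void memcpy_movaps(void* dst, const void* src, size_t size) {
    const float* s = (const float*)src;
    float* d = (float*)dst;
    for (size_t i = 0; i < size / 16; ++i)
        _mm_store_ps(d + i * 4, _mm_load_ps(s + i * 4));
}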

Quote:
BTW, don't the original (pre-XP) Athlons support MOVNTQ too, through the extended MMX instruction set?

Yes! Thanks for pointing that out. I will update the init routine and then you will have seriously boosted performance on old Athlons :)

Quote:
Original post by Jan Wassenberg
hm. I don't see a way to generate such code at compile-time. Is this a "wouldn't it be nice if" kind of thing? :)


Templating, with recursion and partial specialization to create 'ranges' of memory sizes?


// function templates cannot be partially specialized, so dispatch
// on "small vs. large" through a class template instead
template <bool is_small>
struct my_memcpy_impl {
    static void copy(char* src, char* dest, int size) {
        my_largeblock_memcpy(src, dest, size);
    }
};

template <>
struct my_memcpy_impl<true> {
    static void copy(char* src, char* dest, int size) {
        my_smallblock_memcpy(src, dest, size);
    }
};

template <int size>
inline void my_memcpy(char* src, char* dest) {
    my_memcpy_impl<(size <= 64)>::copy(src, dest, size);
}

// array copy: the size is deduced from the array type at compile time
template <int size>
inline void my_array_memcpy(char (&src)[size], char (&dest)[size]) {
    my_memcpy<size>(src, dest);
}
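Hypothetical usage, assuming my_smallblock_memcpy / my_largeblock_memcpy exist:

char a[16], b[16];
my_array_memcpy(a, b);     // size 16 <= 64: resolves to my_smallblock_memcpy

char c[256], d[256];
my_array_memcpy(c, d);     // size 256 > 64: resolves to my_largeblock_memcpy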

Quote:
Original post by Jan Wassenberg
... std::copy wraps it in a function and calls via function pointer (same treatment as for the others).


I didn't quite get what you meant here; there shouldn't be a need to use function pointers internally in generic algorithms, unless you were referring to your version of std::copy or something.

Anyway, others should bear in mind that, just as with memcpy/memmove, you cannot use this code for arrays of non-POD types. In C++ you should be using std::copy, because it takes care of those "special cases": it handles non-POD types correctly while still dispatching to a fast memmove for POD types, all transparently and with no loss of efficiency.

If you would like to use Jan's version of memcpy with something similar to std::copy that handles these special cases transparently and efficiently, I've written some base code you can use. All you have to do is replace every occurrence of memmove (there's only one [wink]) with a call to Jan's code.

The code depends on the Boost type traits and concept checking libraries. The Boost type traits can easily be replaced with TR1 type traits (if your compiler supports TR1) or with your own, and the concept checking can be removed if required.


#include <cstring>   // std::memmove
#include <iterator>
#include <boost/type_traits/integral_constant.hpp>
#include <boost/type_traits/has_trivial_assign.hpp>
#include <boost/concept_check.hpp>

namespace custom_algorithm {

template< typename Tp >
struct is_iterator_wrapper : boost::false_type {
    typedef is_iterator_wrapper<Tp> type;
};

// a little trick for G++ ;-)
#ifdef __GNUG__
template< typename Iterator, typename Container >
struct is_iterator_wrapper< __gnu_cxx::__normal_iterator<Iterator, Container> >
    : boost::true_type {
    typedef is_iterator_wrapper< __gnu_cxx::__normal_iterator<Iterator, Container> > type;
};
#endif

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy__( InputIterator first, InputIterator last,
                              OutputIterator result, std::input_iterator_tag) {
    for (; first != last; ++result, ++first)
        *result = *first;
    return result;
}

template< typename RandomAccessIterator, typename OutputIterator >
inline OutputIterator copy__( RandomAccessIterator first, RandomAccessIterator last,
                              OutputIterator result, std::random_access_iterator_tag) {

    typedef typename std::iterator_traits<RandomAccessIterator>::difference_type
        Distance;

    for (Distance n = last - first; n > 0; --n) {
        *result = *first;
        ++first;
        ++result;
    }
    return result;
}

template< typename Tp >
inline Tp* copy_trivial(const Tp* first, const Tp* last, Tp* result) {
    std::memmove(result, first, sizeof(Tp) * (last - first));
    return result + (last - first);
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_aux2(InputIterator first, InputIterator last,
                                OutputIterator result, boost::false_type) {
    typedef typename std::iterator_traits<InputIterator>::iterator_category itr_cat;
    return copy__(first, last, result, itr_cat());
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_aux2(InputIterator first, InputIterator last,
                                OutputIterator result, boost::true_type) {
    typedef typename std::iterator_traits<InputIterator>::iterator_category itr_cat;
    return copy__(first, last, result, itr_cat());
}

template< typename Tp >
inline Tp* copy_aux2(Tp* first, Tp* last, Tp* result, boost::true_type)
{ return copy_trivial(first, last, result); }

template< typename Tp >
inline Tp* copy_aux2(const Tp* first, const Tp* last, Tp* result, boost::true_type)
{ return copy_trivial(first, last, result); }

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_ni2(InputIterator first, InputIterator last,
                               OutputIterator result, boost::true_type) {
    typedef typename std::iterator_traits<InputIterator>::value_type
        ValueType;
    typedef typename boost::has_trivial_assign<ValueType>::type
        Trivial;
    // output iterator is a wrapper: unwrap it, copy, then rewrap the result
    return OutputIterator(copy_aux2(first, last, result.base(), Trivial()));
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_ni2(InputIterator first, InputIterator last,
                               OutputIterator result, boost::false_type) {
    typedef typename std::iterator_traits<InputIterator>::value_type
        ValueType;
    typedef typename boost::has_trivial_assign<ValueType>::type
        Trivial;
    return copy_aux2(first, last, result, Trivial());
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_ni1(InputIterator first, InputIterator last,
                               OutputIterator result, boost::true_type) {
    // input iterator is a wrapper: unwrap it, then dispatch on the output iterator
    typedef typename is_iterator_wrapper<OutputIterator>::type is_wrapper;
    return copy_ni2(first.base(), last.base(), result, is_wrapper());
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy_ni1(InputIterator first, InputIterator last,
                               OutputIterator result, boost::false_type) {
    typedef typename is_iterator_wrapper<OutputIterator>::type is_wrapper;
    return copy_ni2(first, last, result, is_wrapper());
}

template< typename InputIterator, typename OutputIterator >
inline OutputIterator copy( InputIterator first, InputIterator last,
                            OutputIterator result) {
    using namespace boost;
    // concept requirements
    function_requires< InputIteratorConcept<InputIterator> >();
    function_requires< OutputIteratorConcept<OutputIterator,
        typename std::iterator_traits<InputIterator>::value_type> >();
    // dispatch on whether the *input* iterator is a wrapper
    typedef typename is_iterator_wrapper<InputIterator>::type is_wrapper;
    return copy_ni1(first, last, result, is_wrapper());
}

} // namespace custom_algorithm
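To illustrate (a hypothetical quick test, not part of the original code): with a POD element type the call below resolves to copy_trivial and thus a single memmove; a non-POD type falls back to element-wise assignment.

#include <vector>

int main() {
    int src[4] = { 1, 2, 3, 4 };
    std::vector<int> dst(4);
    // int has a trivial assign, so this dispatches to copy_trivial/memmove
    custom_algorithm::copy(src, src + 4, dst.begin());
    return 0;
}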

Guest Anonymous Poster
Quote:
Original post by Jan Wassenberg
Howdy. I've just written a Technical Report on speeding up memcpy.
It presents source code and the techniques behind an implementation that beats VC7.1's memcpy() by 7..300%, depending on transfer size. There are no special CPU requirements (it runs on 'all' CPUs from the original Pentium MMX on), and it can easily be dropped into other projects.
Hope it helps someone! Feedback is welcome.


So basically you rehashed a bunch of knowledge that has already been known for 5+ years now...

Guest Anonymous Poster
Quote:
Original post by Jan Wassenberg
Howdy. I've just written a Technical Report on speeding up memcpy.
It presents source code and the techniques behind an implementation that beats VC7.1's memcpy() by 7..300%, depending on transfer size. There are no special CPU requirements (it runs on 'all' CPUs from the original Pentium MMX on), and it can easily be dropped into other projects.
Hope it helps someone! Feedback is welcome.



A few samples of which block sizes correspond to these wildly varying speedups (7..300%)?

I would assume the bigger the block, the better (if you are using SIMD moves).

Quote:
Original post by Anonymous Poster
Quote:
Original post by Jan Wassenberg
Howdy. I've just written a Technical Report on speeding up memcpy.
It presents source code and the techniques behind an implementation that beats VC7.1's memcpy() by 7..300%, depending on transfer size. There are no special CPU requirements (it runs on 'all' CPUs from the original Pentium MMX on), and it can easily be dropped into other projects.
Hope it helps someone! Feedback is welcome.


So basically you rehashed a bunch of knowledge already known for about 5+ years now...


Obviously quite a few people enjoy it. What have you done for the community lately?

Quote:
Original post by Jan Wassenberg
Promit: interesting. Is the purpose familiarity/taste, or is there a pressing need for such a wrapper?
Familiarity. It serves as a drop-in replacement for std::copy when used with POD types.

Guest Anonymous Poster
Is there an .obj file available anywhere, or am I going to have to bite the bullet and install NASM to get this to work? :)
