Fast Approximation to memcpy()

29 comments, last by MikeWillHugYou 7 years, 4 months ago

Did somebody say templates?


#include <cstddef>
#include <iostream>

template <std::size_t I, std::size_t N>
struct memcpy_helper {
    static void
    do_(char* dst, char* src) {
        dst[I] = src[I];
        memcpy_helper<I + 1, N>::do_(dst, src);
    }
};

template <std::size_t N>
struct memcpy_helper<N, N> {
    static void
    do_(char*, char*) { }
};

template <std::size_t N, typename T>
void
memcpy(T* dst, T* src) {
    memcpy_helper<0, N * sizeof(T)>::do_(reinterpret_cast<char*>(dst), reinterpret_cast<char*>(src));
}

int
main() {
    int src[]{1, 2, 3};
    int dst[3];
    
    memcpy<3>(dst, src);
    
    for (int i : dst)
        std::cout << i << '\n';
}

It's fast because there is no run-time loop!


A coworker provided this extremely fast approximation.


void memcpy(void* dst, void* src, size_t size)
{
    if (size > 1 && dst != src)
    {
        // Copy the first byte only (void* cannot be dereferenced directly).
        *(char*)dst = *(const char*)src;
    }

    // The rest can't be that important
}
L. Spiro

That's a lol, unless the operator does:

TMemoryStream::Write(src, byte_amount); and that is:

System.move + C++

Wait, that's not pure C++. Seems like writing to a stream should do.

Either way, fancy or not, you have to keep it simple, or check everything.

On ARM CPUs, stmia and ldmia are usually better than single load/store instructions. Especially for memset, you can fill 8 registers with the value and write 32 bytes per instruction :) Or unroll it to 32 stmia instructions and get a whole KB per iteration. But dealing with unaligned addresses/non-multiple-of-32 size is a pain.
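For anyone who wants to see the shape of that in portable code, here is a minimal C++ sketch of the "32 bytes per store, plus the painful head and tail" structure. The name `fill32` is made up for this post, and whether the 32-byte block store actually turns into a single `stmia` is entirely up to the compiler and target; the sketch only shows the alignment and remainder handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the thread): fills n bytes at dst with value,
// storing 32 bytes (eight 32-bit words) per iteration of the main loop.
void fill32(void* dst, unsigned char value, std::size_t n) {
    assert(dst != nullptr || n == 0);
    unsigned char* p = static_cast<unsigned char*>(dst);

    // Head: single bytes until p is 4-byte aligned (or we run out of bytes).
    while (n > 0 && reinterpret_cast<std::uintptr_t>(p) % 4 != 0) {
        *p++ = value;
        --n;
    }

    // Body: the value replicated into eight 32-bit words, the portable
    // stand-in for "fill 8 registers". Copying a fixed 32-byte block with
    // memcpy avoids aliasing problems; the instruction chosen for it is the
    // compiler's decision.
    const std::uint32_t word = 0x01010101u * value;
    const std::uint32_t block[8] = {word, word, word, word, word, word, word, word};
    while (n >= sizeof block) {
        std::memcpy(p, block, sizeof block);
        p += sizeof block;
        n -= sizeof block;
    }

    // Tail: whatever is left over (fewer than 32 bytes).
    while (n > 0) {
        *p++ = value;
        --n;
    }
}
```

Calling `fill32(buffer + 3, 0x5A, 70)` would write 1 head byte, two 32-byte blocks, and 5 tail bytes if `buffer` itself is 4-byte aligned.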

Mostly I'd be looking at why you're copying so much data around every frame in the first place.

Everyone knows that null values lead to access violations. So why not just prevent those pesky nulls from being copied in the first place?


void memcpy(void *dest, void *src, size_t size)
{
  strncpy((char*)dest, (const char*)src, size);
}

Reminds me of what Nintendo did with the Wii (for context: they compared keys using strncmp instead of memcmp, so all you needed was to use a key that's all zeroes and the console would take it as a valid disc)
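A small sketch of why that goes wrong, for the curious. The key size, the byte values, and the function names are all invented for illustration; the only thing taken from the story is the `strncmp` versus `memcmp` mix-up. `strncmp` stops at the first zero byte, so two binary keys that both begin with `0x00` compare equal no matter what follows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical key size, chosen only for illustration.
constexpr std::size_t kKeySize = 8;

// Buggy: strncmp treats the keys as C strings and stops at the first '\0'.
bool keys_match_buggy(const unsigned char* a, const unsigned char* b) {
    return std::strncmp(reinterpret_cast<const char*>(a),
                        reinterpret_cast<const char*>(b), kKeySize) == 0;
}

// Correct: memcmp compares all kKeySize bytes, zeroes included.
bool keys_match(const unsigned char* a, const unsigned char* b) {
    return std::memcmp(a, b, kKeySize) == 0;
}

// Usage sketch: a forged all-zero key against a stored key that starts with 0x00.
void demo_forged_key() {
    const unsigned char stored[kKeySize] = {0x00, 0x9C, 0x41, 0x7E, 0x12, 0x34, 0x56, 0x78};
    const unsigned char forged[kKeySize] = {};
    assert(keys_match_buggy(stored, forged));  // accepted: comparison ended at byte 0
    assert(!keys_match(stored, forged));       // rejected: bytes 1..7 differ
}
```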

http://thedailywtf.com/articles/Anatomii-of-a-Hack

Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

This one actually works... within a given assumption :wink:


void memcpy(void *dest, void *src, size_t size)
{
  assert(dest == src);
}

Technically the memory pointed to by the src and dest pointers can't overlap, so a compiler could technically optimize "dest == src" to "size == 0" making your function very fast by virtue of accepting only zero-length inputs :lol:
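If you actually have to deal with ranges that might overlap, the boring answer is `std::memmove`. A minimal sketch, with `ranges_overlap` and `checked_copy` being names I made up here rather than anything standard:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>

// Hypothetical helper: true if [a, a+n) and [b, b+n) share at least one byte.
bool ranges_overlap(const void* a, const void* b, std::size_t n) {
    const char* pa = static_cast<const char*>(a);
    const char* pb = static_cast<const char*>(b);
    std::less<const char*> before;  // gives a total order even for unrelated pointers
    return n != 0 && before(pa, pb + n) && before(pb, pa + n);
}

// Hypothetical wrapper: memcpy when the ranges are disjoint, memmove otherwise.
void* checked_copy(void* dst, const void* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    return ranges_overlap(dst, src, n) ? std::memmove(dst, src, n)
                                       : std::memcpy(dst, src, n);
}
```

In practice you would just call `std::memmove` directly; the split is only there to make the no-overlap precondition of `memcpy` visible.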

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”

unsigned char * source;
unsigned char * dest;

After the copy you just cast:

[spoiler]



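inline void CopyToPCHAR(const std::string& str, int * len, char *& p)
{
    // DeleteIfPersist is the poster's own helper (not shown in the thread).
    DeleteIfPersist(len, p);

    (*len) = static_cast<int>(str.length());

    if ((*len) > 0)
    {
        p = new char[ (*len) + 1 ];
        strcpy(p, str.c_str());
        // strcpy already writes the terminator; kept for clarity.
        p[(*len)] = '\0';
    }
}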
 

[/spoiler]


void copy(unsigned char * s, unsigned char *& d, int len)
{
    // skip len check since we only copy when length > 0
    // d is taken by reference so the caller actually receives the new buffer
    d = new unsigned char[len];
    for (int i = 0; i < len; i++)
        d[i] = s[i];
}

and you use it like this:


my_data * tmp = (my_data*)d;

So how do you transfer it faster?


    d = new double[len];
    for (int i = 0; i < len; i++)
        d[i] = s[i];
}

??

Copy 16 bytes at a time via vmovntdqa and vmovntdq (don’t pollute the cache).
Prefetch well via __builtin_prefetch(). Unroll loops to make prefetch less costly and more useful.
Handle trailing bytes with rep movsq and rep movsb.

There are plenty of things you can do.
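A rough sketch of how those pieces could fit together, with several substitutions I should be upfront about: it uses the SSE2 intrinsic `_mm_stream_si128` for the non-temporal stores, plain unaligned loads instead of `vmovntdqa` (that load needs SSE4.1 and, as far as I know, only behaves differently on write-combining memory), and `std::memcpy` for the trailing bytes instead of `rep movsb`. The name `stream_copy` and the 256-byte prefetch distance are guesses for illustration, not tuned values. Without SSE2 the whole thing falls back to `std::memcpy`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#endif

// Hypothetical helper: copies n bytes from src to dst (ranges must not overlap),
// using cache-bypassing stores for the bulk of the data when SSE2 is available.
void* stream_copy(void* dst, const void* src, std::size_t n) {
    assert((dst != nullptr && src != nullptr) || n == 0);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

#if defined(__SSE2__)
    // Head: byte copies until the destination is 16-byte aligned,
    // which _mm_stream_si128 requires.
    while (n > 0 && reinterpret_cast<std::uintptr_t>(d) % 16 != 0) {
        *d++ = *s++;
        --n;
    }

    // Body: 64 bytes per iteration as four 16-byte moves. The fixed-count
    // inner loop is expected to be unrolled by the compiler.
    while (n >= 64) {
        if (n >= 64 + 256) {
            __builtin_prefetch(s + 256);  // hint only; distance is an assumption
        }
        for (int i = 0; i < 4; ++i) {
            const __m128i v =
                _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + 16 * i));
            _mm_stream_si128(reinterpret_cast<__m128i*>(d + 16 * i), v);
        }
        s += 64;
        d += 64;
        n -= 64;
    }
    _mm_sfence();  // make the streaming stores visible before returning
#endif

    // Tail (or the entire copy when SSE2 is unavailable).
    if (n > 0) {
        std::memcpy(d, s, n);
    }
    return dst;
}
```

Whether this beats the library `memcpy` depends heavily on buffer size and on whether the destination is read again soon, so measure before believing it.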


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Hey man, I don't think you can optimize 'memcpy()' or 'memset()'...

See the GNU implementation and the optimizations it actually implements: https://fossies.org/dox/glibc-2.24/string_2memcpy_8c_source.html

You may be able to approximate it by copying only 1 byte out of every 2, but the memory you skip will hold random data... is that really what you want?

However, if you know the exact size of the memory, you can implement something like this

(fewer comparisons are done => faster program)

(copying 8 bytes at a time => up to 8 times faster than byte by byte on a 64-bit system)


#include <stddef.h>
#include <stdint.h>

/** Assumes n is a multiple of 64 bytes (8 uint64_t) and that src and dst
    are 8-byte aligned. Note the argument order: src comes first here,
    unlike the standard memcpy(). */
void * memcpy(void * src, void * dst, size_t n) {
	uint64_t * s = (uint64_t *)src;
	size_t len = n / sizeof(uint64_t);
	uint64_t * ptr = (uint64_t *)dst;
	uint64_t * end = ptr + len; /* pointer arithmetic in uint64_t units */
	while (ptr < end) {
		*ptr++ = *s++; //1
		*ptr++ = *s++; //2
		*ptr++ = *s++; //...
		*ptr++ = *s++;
		*ptr++ = *s++;
		*ptr++ = *s++;
		*ptr++ = *s++;
		*ptr++ = *s++; //8
	}
	return dst;
}
Your link is broken.
Of course, approximating std::memcpy() was a joke, which is why it is in Coding Horrors.

In any case, I optimized it and wrote a faster-in-all-cases version for PlayStation 4.
I already have one that is 2 or 3 times faster on my Windows® machine, and will make one for Xbox One soon.


L. Spiro

This topic is closed to new replies.
