Fastest Way to Copy Two Buffers Of Different Types

Started by
22 comments, last by Prune 12 years, 4 months ago
Before I get ahead of myself, this is in C (with assembly being permissible).
I've got a need to copy two buffers. The destination buffer is an int32 buffer, and the input buffer is either a uint8 or a uint16 buffer. Right now I'm just doing a normal loop, i.e.:

int i;

for (i = 0; i < length; ++i)
{
int_buffer = uchar_buffer;
}

// or

for (i = 0; i < length; ++i)
{
int_buffer = ushort_buffer;
}


Is there any way to speed this up perhaps? I'm okay with resorting to this method if there aren't any faster ways, but I'm particularly curious if there's some trick in C that can be used, or perhaps a specific assembly instruction. Thanks!
[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]
Advertisement

Is there any way to speed this up perhaps? I'm okay with resorting to this method if there aren't any faster ways, but I'm particularly curious if there's some trick in C that can be used, or perhaps a specific assembly instruction. Thanks!


Duff's Device comes to mind ...
I almost wonder if you could abuse SIMD swizzling instructions for this, but unfortunately my SIMD is reaaaally rusty...

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

You could Duff it, yeah. If the thought of doing that makes you cringe, then just unrolling the loops a little helps a lot too. You'd do something like:
While (length is a multiple of 8)
{
dst[0] = src[0];
.
.
.
dst[7] = src[7];
dst += 8;
src += 8;
length -= 8;
}

Copy anything left over (regular for loop)

Experimenting with different multiples, I've found that 16 is quite good to use, giving slightly better performance (this was years ago, copying color-indexed data to a 32-bit direct draw surface via a palette lookup). My experience is that going higher slows things down again; your's may differ.

If you use 16, be aware that there may be one multiple of 8 left over, so unrolling that can help too (it will be very marginal though).

If you pad your arrays to always be the relevant multiple that can make things a lot simpler.

Either way it can get you a lot of code and if you need to do this more than once for different data types you might want to template it in C++.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

I wish I could use C++ in this (not that there's anything specific in C++ that would make it faster though), but it's strictly C. I thought about SIMD, but I've never actually used it so I don't want to take the time figure out how to do it (with the potential of doing it wrong) without knowing the kinds of benefits it would provide.

Anyway, thanks for reminding me of the existence of Duff's Device! I'll look into it and see if it helps.

One related question: I tried once to do this:

int i;
for (i = length; i--;)
{
*int_buffer++ = *uchar_buffer++;
}


But it actually ran slower like that. I didn't look at the disassembly, but I couldn't think of an explanation. Seems like it should at least be just as fast, if not (marginally) faster.
[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]
I've benchmarked these methods, and have generally found that *dst++ = *src++ is always slower than array indexing. The compiler is obviously doing something different, but I haven't yet dug into the underlying asm to find out what's going on.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Yay, let's profile...

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <iostream>

struct Statistics {
Statistics() : times(512) { times.clear(); }

void add(__int64 t) { times.push_back(t); }

void print() const {
__int64 total = 0;
__int64 minimum = std::numeric_limits<__int64>::max();
__int64 maximum = 0;

for (size_t i = 0; i < times.size(); i++)
{
total += times;
minimum = std::min(times, minimum);
maximum = std::max(times, maximum);
}

const int N_BUCKETS = 8;
int buckets[N_BUCKETS+1];
for (int i = 0; i < N_BUCKETS; i++) buckets = 0;

__int64 delta = maximum - minimum;
if (delta > 0) {
for (size_t i = 0; i < times.size(); i++)
{
int bucket = (int)(N_BUCKETS * (times-minimum) / delta);
buckets[bucket]++;
}

for (int i = 0; i < N_BUCKETS; i++) {
printf("(%7.4d - %7.4d) |",
(int)(minimum + 1.0 * delta * i / N_BUCKETS),
(int)(minimum + 1.0 * delta * (i+1) / N_BUCKETS)
);
for (int j = 0; j < buckets; j++) printf("#");
printf("\n");
}
}

printf("%.1f [%.1f - %.1f]\n", 1.0*total/times.size(), 1.0*minimum, 1.0*maximum);
}
private:
std::vector<__int64> times;
};


typedef unsigned char uint8;
typedef signed int int32;

template < class T >
T * alalloc(size_t n) {
return (T*)_aligned_malloc(n * sizeof(T), 16);
}

size_t getval(const char * s, int def) {
int v = atoi(s);
return (size_t)((v > 0) ? v : def);
}

void flush_data_cache() {
size_t n = 2 * 64 * 1024;
char * temp = alalloc<char>(n);
for (size_t i = 0; i < n; i++) temp = (char)(temp + rand() % 0xff);
_aligned_free(temp);
}

void verify(int32 * a, int32 * b, size_t n) {
for (size_t i = 0; i < n; i++) {
if (a != b) {
printf("error\n");
std::terminate();
}
}
}

void copy_ref(uint8 * p8, int32 * p32, size_t n)
{
for (size_t i = 0; i < n; i++) {
p32 = p8;
}
}

void copy_2(uint8 * p8, int32 * p32, size_t n)
{
size_t m = n /8;
size_t r = n - (m*8);
for (size_t i = 0; i < m; i++) {
p32[8*i+0] = p8[8*i+0];
p32[8*i+1] = p8[8*i+1];
p32[8*i+2] = p8[8*i+2];
p32[8*i+3] = p8[8*i+3];
p32[8*i+4] = p8[8*i+4];
p32[8*i+5] = p8[8*i+5];
p32[8*i+6] = p8[8*i+6];
p32[8*i+7] = p8[8*i+7];

}
for (size_t i = 0; i < r; i++) {
p32[8*m + i] = p8[8*m + i];
}
}

void copy_3(uint8 * p8, int32 * p32, size_t n)
{
size_t m = n/4;
size_t r = n - (m*4);
unsigned int * pi = (unsigned int *)p8;
for (size_t i = 0; i < m; i++) {
unsigned int tmp = *pi;
p32[4*i+0] = (tmp & 0x000000ff) >> 0;
p32[4*i+1] = (tmp & 0x0000ff00) >> 8;
p32[4*i+2] = (tmp & 0x00ff0000) >> 16;
p32[4*i+3] = (tmp & 0xff000000) >> 24;
pi++;
}
for (size_t i = 0; i < r; i++) {
p32[4*m + i] = p8[4*m + i];
}
}

void copy_4(uint8 * p8, int32 * p32, size_t n)
{
size_t m = n/8;
size_t r = n - (m*8);
unsigned __int64 * pi = (unsigned __int64 *)p8;
for (size_t i = 0; i < m; i++) {
unsigned __int64 tmp = *pi;
p32[8*i+0] = (tmp & 0x00000000000000ff) >> 0;
p32[8*i+1] = (tmp & 0x000000000000ff00) >> 8;
p32[8*i+2] = (tmp & 0x0000000000ff0000) >> 16;
p32[8*i+3] = (tmp & 0x00000000ff000000) >> 24;
p32[8*i+4] = (tmp & 0x000000ff00000000) >> 32;
p32[8*i+5] = (tmp & 0x0000ff0000000000) >> 40;
p32[8*i+6] = (tmp & 0x00ff000000000000) >> 48;
p32[8*i+7] = (tmp & 0xff00000000000000) >> 56;
pi++;
}
for (size_t i = 0; i < r; i++) {
p32[8*m + i] = p8[8*m + i];
}
}

void copy_5(uint8 * src, int32 * dst, size_t count)
{
int n = (count+7)/8;
switch (count % 8) {
case 0: do { *dst++ = *src++;
case 7: *dst++ = *src++;
case 6: *dst++ = *src++;
case 5: *dst++ = *src++;
case 4: *dst++ = *src++;
case 3: *dst++ = *src++;
case 2: *dst++ = *src++;
case 1: *dst++ = *src++;
} while (--n > 0);
}

}

template < class F >
void test(F f, uint8 * src, int32 * dst, int32 * dstref, size_t n, size_t reps)
{
int foo = 0;

Statistics stats;

for (size_t rep = 0; rep < reps; rep++) {
memset(dst, 0, n*sizeof(int32));
flush_data_cache();

unsigned __int64 t1;
unsigned __int64 t2;
t1 = __rdtsc();

f(src, dst, n);

t2 = __rdtsc();

stats.add(t2-t1);

verify(dst, dstref, n);
foo += dst[rand() % n];
}

stats.print();
std::cout << foo << "\n";
}

int main(int argc, char * argv[])
{
size_t n = 64;
if (argc > 1) n = getval(argv[1], n);

int reps = 100;

uint8 * src = alalloc<uint8>(n);
uint8 * srcref = alalloc<uint8>(n);
int32 * dst = alalloc<int32>(n);
int32 * dstref = alalloc<int32>(n);

for (size_t i = 0; i < n; i++) src = (uint8)(rand() % 0xff);

copy_ref(src, dstref, n);

test(copy_ref, src, dst, dstref, n, reps);
test(copy_2, src, dst, dstref, n, reps);
test(copy_3, src, dst, dstref, n, reps);
test(copy_4, src, dst, dstref, n, reps);
test(copy_5, src, dst, dstref, n, reps);

_aligned_free(src);
_aligned_free(srcref);
_aligned_free(dst);
_aligned_free(dstref);

getchar();

return 0;
}
Lots of code....

Tu run, due to rdtsc, use: "start /affinity 0x01 test.exe 1500". Or similar, it's important to force running on single core. Or find a better profiler, as long as it's one that can report individual cycles.
-------------

copy_ref is dumb reference implementation.
copy_2 is unrolled loop.
copy_3 is one read per 4 writes
copy_4 is one read per 8 writes
copy_5 is Duff's device.

-----
Results 1024 elements
(32-bit executable, VS):
copy_2 and copy_3 are tied, copy_2 takes slight lead. copy_4 is worse than naive, duff's device comes solid third.

64-bit executable
copy_4 wins, copy_3 is second, copy_2 third, then Duff's device and finally naive.

----
Conclusions:

For a 64-bit app, minimizing memory reads through a register-wide temporary wins.
For a 32-bit app, unrolling alone is good enough, but to remain consistent, minimizing reads is just as good.

So copy_3 or copy_4 would be fastest, copy_2 for 32 bit apps. May vary on memory performance and CPU. And compiler. And OS. And weather. And phase of moon.

Gotcha: copy_3 and copy_4 may depend on endianess, so they aren't portable. Use unrolled loop to be on the safe side and avoid the hassles, it looks like only 10% slower.
-----

Sample output (4096 elements, values are in CPU cycles):( 8759 - 9643) |##################################################################################################
( 9643 - 10528) |
( 10528 - 11412) |
( 11412 - 12297) |
( 12297 - 13181) |
( 13181 - 14066) |
( 14066 - 14950) |
( 14950 - 15835) |#
8947.0 [8759.0 - 15835.0]
11586
( 6194 - 7077) |###############################################################################################
( 7077 - 7960) |
( 7960 - 8844) |#
( 8844 - 9727) |
( 9727 - 10610) |##
( 10610 - 11494) |
( 11494 - 12377) |
( 12377 - 13261) |#
6479.6 [6194.0 - 13261.0]
11867
( 6343 - 7185) |#################################################################################################
( 7185 - 8027) |#
( 8027 - 8870) |
( 8870 - 9712) |
( 9712 - 10554) |#
( 10554 - 11397) |
( 11397 - 12239) |
( 12239 - 13082) |
6481.0 [6343.0 - 13082.0]
12603
( 5853 - 6752) |#################################################################################################
( 6752 - 7651) |
( 7651 - 8551) |
( 8551 - 9450) |#
( 9450 - 10349) |
( 10349 - 11249) |
( 11249 - 12148) |
( 12148 - 13048) |#
6052.2 [5853.0 - 13048.0]
12449
( 7422 - 8374) |#################################################################################################
( 8374 - 9327) |
( 9327 - 10279) |
( 10279 - 11232) |
( 11232 - 12184) |##
( 12184 - 13137) |
( 13137 - 14089) |
( 14089 - 15042) |
7679.8 [7422.0 - 15042.0]
13554



Phew....
Antheus... I love you...
[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]
memcpy() is probably your best bet. It checks if buffers are aligned and uses SSE when possible, therefore it's not likely you'll get better performance with own function. The following function does almost exactly same what standard memcpy() does in perfect situation:

void memcpySSE(void *_dst, void *_src, size_t _size) {
__asm {
mov edx, 0
mov eax, [_size]
mov ebx, 80h
div ebx
mov ecx, eax
mov esi, [_src]
mov edi, [_dst]
mov ebx, 0
copy:
movdqa xmm0, [esi + ebx + 00h]
movdqa xmm1, [esi + ebx + 10h]
movdqa xmm2, [esi + ebx + 20h]
movdqa xmm3, [esi + ebx + 30h]
movdqa xmm4, [esi + ebx + 40h]
movdqa xmm5, [esi + ebx + 50h]
movdqa xmm6, [esi + ebx + 60h]
movdqa xmm7, [esi + ebx + 70h]
movdqa [edi + ebx + 00h], xmm0
movdqa [edi + ebx + 10h], xmm1
movdqa [edi + ebx + 20h], xmm2
movdqa [edi + ebx + 30h], xmm3
movdqa [edi + ebx + 40h], xmm4
movdqa [edi + ebx + 50h], xmm5
movdqa [edi + ebx + 60h], xmm6
movdqa [edi + ebx + 70h], xmm7
add ebx, 80h
loop copy
};
}



On my system it copies 1 GB in about 160ms
Normally I'd just be using memcpy but the input and output buffers aren't the same type, so wouldn't it mess up the copy? Does your memcpySSE function work if the dst is a int* buffer and src is a uint8* buffer (or a uint16* buffer)? I need the values copied, not necessarily the bytes.
[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

This topic is closed to new replies.

Advertisement