Back to General and Gameplay Programming

Fastest Way to Copy Two Buffers Of Different Types

General and Gameplay Programming Programming

Started by Cornstalks November 30, 2011 09:36 PM

22 comments, last by Prune 12 years, 4 months ago

Cornstalks

7,033

Author

November 30, 2011 09:36 PM

Before I get ahead of myself, this is in C (with assembly being permissible).
I've got a need to copy two buffers. The destination buffer is an int32 buffer, and the input buffer is either a uint8 or a uint16 buffer. Right now I'm just doing a normal loop, i.e.:

int i;



for (i = 0; i < length; ++i)

{

    int_buffer = uchar_buffer;

}



// or



for (i = 0; i < length; ++i)

{

    int_buffer = ushort_buffer;

}

Is there any way to speed this up perhaps? I'm okay with resorting to this method if there aren't any faster ways, but I'm particularly curious if there's some trick in C that can be used, or perhaps a specific assembly instruction. Thanks!

[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

fastcall22

10,918

November 30, 2011 10:06 PM

Is there any way to speed this up perhaps? I'm okay with resorting to this method if there aren't any faster ways, but I'm particularly curious if there's some trick in C that can be used, or perhaps a specific assembly instruction. Thanks!

Duff's Device comes to mind ...

ApochPiQ

23,138

December 01, 2011 12:27 AM

I almost wonder if you could abuse SIMD swizzling instructions for this, but unfortunately my SIMD is reaaaally rusty...

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

21st Century Moose

13,459

December 01, 2011 01:07 AM

You could Duff it, yeah. If the thought of doing that makes you cringe, then just unrolling the loops a little helps a lot too. You'd do something like:

While (length is a multiple of 8)

{

   dst[0] = src[0];

   .

   .

   .

   dst[7] = src[7];

   dst += 8;

   src += 8;

   length -= 8;

}



Copy anything left over (regular for loop)

Experimenting with different multiples, I've found that 16 is quite good to use, giving slightly better performance (this was years ago, copying color-indexed data to a 32-bit direct draw surface via a palette lookup). My experience is that going higher slows things down again; your's may differ.

If you use 16, be aware that there may be one multiple of 8 left over, so unrolling that can help too (it will be very marginal though).

If you pad your arrays to always be the relevant multiple that can make things a lot simpler.

Either way it can get you a lot of code and if you need to do this more than once for different data types you might want to template it in C++.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Cornstalks

7,033

Author

December 01, 2011 03:55 AM

I wish I could use C++ in this (not that there's anything specific in C++ that would make it faster though), but it's strictly C. I thought about SIMD, but I've never actually used it so I don't want to take the time figure out how to do it (with the potential of doing it wrong) without knowing the kinds of benefits it would provide.

Anyway, thanks for reminding me of the existence of Duff's Device! I'll look into it and see if it helps.

One related question: I tried once to do this:

int i;

for (i = length; i--;)

{

    *int_buffer++ = *uchar_buffer++;

}

But it actually ran slower like that. I didn't look at the disassembly, but I couldn't think of an explanation. Seems like it should at least be just as fast, if not (marginally) faster.

[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

21st Century Moose

13,459

December 01, 2011 11:32 AM

I've benchmarked these methods, and have generally found that *dst++ = *src++ is always slower than array indexing. The compiler is obviously doing something different, but I haven't yet dug into the underlying asm to find out what's going on.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Antheus

2,410

December 01, 2011 05:41 PM

Yay, let's profile...

#include <iostream>

#include <string>

#include <vector>

#include <algorithm>

#include <iostream>



struct Statistics {

	Statistics() : times(512) {	times.clear(); }



	void add(__int64 t) { times.push_back(t); }



	void print() const {

    	__int64 total = 0;

    	__int64 minimum = std::numeric_limits<__int64>::max();

    	__int64 maximum = 0;



    	for (size_t i = 0; i < times.size(); i++) 

    	{

        	total += times;

        	minimum = std::min(times, minimum);

        	maximum = std::max(times, maximum);

    	}



    	const int N_BUCKETS = 8;

    	int buckets[N_BUCKETS+1];

    	for (int i = 0; i < N_BUCKETS; i++) buckets = 0;



    	__int64 delta = maximum - minimum;

    	if (delta > 0) {

        	for (size_t i = 0; i < times.size(); i++) 

        	{

            	int bucket = (int)(N_BUCKETS * (times-minimum) / delta);

            	buckets[bucket]++;

        	}



        	for (int i = 0; i < N_BUCKETS; i++) {

            	printf("(%7.4d - %7.4d) |", 

                	(int)(minimum + 1.0 * delta *  i / N_BUCKETS), 

                	(int)(minimum + 1.0 * delta * (i+1) / N_BUCKETS)

                	);

            	for (int j = 0; j < buckets; j++) printf("#");

            	printf("\n");

        	}

    	}



    	printf("%.1f [%.1f - %.1f]\n", 1.0*total/times.size(), 1.0*minimum, 1.0*maximum);

	}

private:

	std::vector<__int64> times;

};





typedef unsigned char uint8;

typedef signed int int32;



template < class T >

T * alalloc(size_t n) {

	return (T*)_aligned_malloc(n * sizeof(T), 16);

}



size_t getval(const char * s, int def) {

	int v = atoi(s);

	return (size_t)((v > 0) ? v : def);

}



void flush_data_cache() {

	size_t n = 2 * 64 * 1024;

	char * temp = alalloc<char>(n);

	for (size_t i = 0; i < n; i++) temp = (char)(temp + rand() % 0xff);

	_aligned_free(temp);

}



void verify(int32 * a, int32 * b, size_t n) {

	for (size_t i = 0; i < n; i++) {

    	if (a != b) {

        	printf("error\n");

        	std::terminate();

    	}

	}

}



void copy_ref(uint8 * p8, int32 * p32, size_t n)

{

	for (size_t i = 0; i < n; i++) {

    	p32 = p8;

	}

}



void copy_2(uint8 * p8, int32 * p32, size_t n)

{

	size_t m = n /8;

	size_t r = n - (m*8);

	for (size_t i = 0; i < m; i++) {

    	p32[8*i+0] = p8[8*i+0];

    	p32[8*i+1] = p8[8*i+1];

    	p32[8*i+2] = p8[8*i+2];

    	p32[8*i+3] = p8[8*i+3];

    	p32[8*i+4] = p8[8*i+4];

    	p32[8*i+5] = p8[8*i+5];

    	p32[8*i+6] = p8[8*i+6];

    	p32[8*i+7] = p8[8*i+7];



	}

	for (size_t i = 0; i < r; i++) {

    	p32[8*m + i] = p8[8*m + i];

	}

}



void copy_3(uint8 * p8, int32 * p32, size_t n)

{

	size_t m = n/4;

	size_t r = n - (m*4);

	unsigned int * pi = (unsigned int *)p8;

	for (size_t i = 0; i < m; i++) {

    	unsigned int tmp = *pi;

    	p32[4*i+0] = (tmp & 0x000000ff) >>  0;

    	p32[4*i+1] = (tmp & 0x0000ff00) >>  8;

    	p32[4*i+2] = (tmp & 0x00ff0000) >> 16;

    	p32[4*i+3] = (tmp & 0xff000000) >> 24;

    	pi++;

	}

	for (size_t i = 0; i < r; i++) {

    	p32[4*m + i] = p8[4*m + i];

	}

}



void copy_4(uint8 * p8, int32 * p32, size_t n)

{

	size_t m = n/8;

	size_t r = n - (m*8);

	unsigned __int64 * pi = (unsigned __int64 *)p8;

	for (size_t i = 0; i < m; i++) {

    	unsigned __int64 tmp = *pi;

    	p32[8*i+0] = (tmp & 0x00000000000000ff) >>  0;

    	p32[8*i+1] = (tmp & 0x000000000000ff00) >>  8;

    	p32[8*i+2] = (tmp & 0x0000000000ff0000) >> 16;

    	p32[8*i+3] = (tmp & 0x00000000ff000000) >> 24;

    	p32[8*i+4] = (tmp & 0x000000ff00000000) >> 32;

    	p32[8*i+5] = (tmp & 0x0000ff0000000000) >> 40;

    	p32[8*i+6] = (tmp & 0x00ff000000000000) >> 48;

    	p32[8*i+7] = (tmp & 0xff00000000000000) >> 56;

    	pi++;

	}

	for (size_t i = 0; i < r; i++) {

    	p32[8*m + i] = p8[8*m + i];

	}

}



void copy_5(uint8 * src, int32 * dst, size_t count)

{

	int n = (count+7)/8;

	switch (count % 8) {

	case 0: do { *dst++ = *src++;

	case 7:  	*dst++ = *src++;

	case 6:  	*dst++ = *src++;

	case 5:  	*dst++ = *src++;

	case 4:  	*dst++ = *src++;

	case 3:  	*dst++ = *src++;

	case 2:  	*dst++ = *src++;

	case 1:  	*dst++ = *src++;

        	} while (--n > 0);

	}



}



template < class F >

void test(F f, uint8 * src, int32 * dst, int32 * dstref, size_t n, size_t reps) 

{

	int foo = 0;



	Statistics stats;



	for (size_t rep = 0; rep < reps; rep++) {

    	memset(dst, 0, n*sizeof(int32));

    	flush_data_cache();



    	unsigned __int64 t1;

    	unsigned __int64 t2;

    	t1 = __rdtsc();



    	f(src, dst, n);



    	t2 = __rdtsc();



    	stats.add(t2-t1);



    	verify(dst, dstref, n);

    	foo += dst[rand() % n];

	}



	stats.print();

	std::cout << foo << "\n";

}



int main(int argc, char * argv[])

{

	size_t n = 64;

	if (argc > 1) n = getval(argv[1], n);



	int reps = 100;



	uint8 * src = alalloc<uint8>(n);

	uint8 * srcref = alalloc<uint8>(n);

	int32 * dst = alalloc<int32>(n);

	int32 * dstref = alalloc<int32>(n);



	for (size_t i = 0; i < n; i++) src = (uint8)(rand() % 0xff);



	copy_ref(src, dstref, n);



	test(copy_ref, src, dst, dstref, n, reps);

	test(copy_2, src, dst, dstref, n, reps);

	test(copy_3, src, dst, dstref, n, reps);

	test(copy_4, src, dst, dstref, n, reps);

	test(copy_5, src, dst, dstref, n, reps);



	_aligned_free(src);

	_aligned_free(srcref);

	_aligned_free(dst);

	_aligned_free(dstref);



	getchar();



	return 0;

}

Lots of code....

Tu run, due to rdtsc, use: "start /affinity 0x01 test.exe 1500". Or similar, it's important to force running on single core. Or find a better profiler, as long as it's one that can report individual cycles.
-------------

copy_ref is dumb reference implementation.
copy_2 is unrolled loop.
copy_3 is one read per 4 writes
copy_4 is one read per 8 writes
copy_5 is Duff's device.

-----
Results 1024 elements
(32-bit executable, VS):
copy_2 and copy_3 are tied, copy_2 takes slight lead. copy_4 is worse than naive, duff's device comes solid third.

64-bit executable
copy_4 wins, copy_3 is second, copy_2 third, then Duff's device and finally naive.

----
Conclusions:

For a 64-bit app, minimizing memory reads through a register-wide temporary wins.
For a 32-bit app, unrolling alone is good enough, but to remain consistent, minimizing reads is just as good.

So copy_3 or copy_4 would be fastest, copy_2 for 32 bit apps. May vary on memory performance and CPU. And compiler. And OS. And weather. And phase of moon.

Gotcha: copy_3 and copy_4 may depend on endianess, so they aren't portable. Use unrolled loop to be on the safe side and avoid the hassles, it looks like only 10% slower.
-----

Sample output (4096 elements, values are in CPU cycles):

(   8759 -	9643) |##################################################################################################

(   9643 -   10528) |

(  10528 -   11412) |

(  11412 -   12297) |

(  12297 -   13181) |

(  13181 -   14066) |

(  14066 -   14950) |

(  14950 -   15835) |#

8947.0 [8759.0 - 15835.0]

11586

(   6194 -	7077) |###############################################################################################

(   7077 -	7960) |

(   7960 -	8844) |#

(   8844 -	9727) |

(   9727 -   10610) |##

(  10610 -   11494) |

(  11494 -   12377) |

(  12377 -   13261) |#

6479.6 [6194.0 - 13261.0]

11867

(   6343 -	7185) |#################################################################################################

(   7185 -	8027) |#

(   8027 -	8870) |

(   8870 -	9712) |

(   9712 -   10554) |#

(  10554 -   11397) |

(  11397 -   12239) |

(  12239 -   13082) |

6481.0 [6343.0 - 13082.0]

12603

(   5853 -	6752) |#################################################################################################

(   6752 -	7651) |

(   7651 -	8551) |

(   8551 -	9450) |#

(   9450 -   10349) |

(  10349 -   11249) |

(  11249 -   12148) |

(  12148 -   13048) |#

6052.2 [5853.0 - 13048.0]

12449

(   7422 -	8374) |#################################################################################################

(   8374 -	9327) |

(   9327 -   10279) |

(  10279 -   11232) |

(  11232 -   12184) |##

(  12184 -   13137) |

(  13137 -   14089) |

(  14089 -   15042) |

7679.8 [7422.0 - 15042.0]

13554

Phew....

Cornstalks

7,033

Author

December 01, 2011 06:05 PM

Antheus... I love you...

[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

Ripiz

539

December 01, 2011 06:13 PM

memcpy() is probably your best bet. It checks if buffers are aligned and uses SSE when possible, therefore it's not likely you'll get better performance with own function. The following function does almost exactly same what standard memcpy() does in perfect situation:

void memcpySSE(void *_dst, void *_src, size_t _size) {

	__asm {

		mov edx, 0

		mov eax, [_size]

		mov ebx, 80h

		div ebx

		mov ecx, eax

		mov esi, [_src]

		mov edi, [_dst]

		mov ebx, 0

		copy:

		movdqa xmm0, [esi + ebx + 00h]

		movdqa xmm1, [esi + ebx + 10h]

		movdqa xmm2, [esi + ebx + 20h]

		movdqa xmm3, [esi + ebx + 30h]

		movdqa xmm4, [esi + ebx + 40h]

		movdqa xmm5, [esi + ebx + 50h]

		movdqa xmm6, [esi + ebx + 60h]

		movdqa xmm7, [esi + ebx + 70h]

		movdqa [edi + ebx + 00h], xmm0

		movdqa [edi + ebx + 10h], xmm1

		movdqa [edi + ebx + 20h], xmm2

		movdqa [edi + ebx + 30h], xmm3

		movdqa [edi + ebx + 40h], xmm4

		movdqa [edi + ebx + 50h], xmm5

		movdqa [edi + ebx + 60h], xmm6

		movdqa [edi + ebx + 70h], xmm7

		add ebx, 80h

		loop copy

	};

}

On my system it copies 1 GB in about 160ms

Cornstalks

7,033

Author

December 01, 2011 06:19 PM

Normally I'd just be using memcpy but the input and output buffers aren't the same type, so wouldn't it mess up the copy? Does your memcpySSE function work if the dst is a int* buffer and src is a uint8* buffer (or a uint16* buffer)? I need the values copied, not necessarily the bytes.

[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

Fastest Way to Copy Two Buffers Of Different Types

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Fastest Way to Copy Two Buffers Of Different Types

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines