Special case - faster than memcpy

I'm working on an imaging library, and upon profiling, one of the performance bottlenecks (more so in the application using the lib than in the lib itself) is copying images (for many bad, bad reasons, but I digress). Anyway, I'm developing specifically for Windows in VS2008, and we need to be compatible with 64-bit compiles (so no inline assembly)... so given those constraints, here is the memcpy I currently use. Under non-optimal circumstances it defaults to Microsoft's memcpy, which we all know has been optimized to death. I hope that this can either be useful to someone or that someone can give me some constructive feedback. For what it's worth, under optimal circumstances this runs in about 62% of the time that memcpy takes, and that is making a big difference for me at the moment (of course, we are working with massive amounts of data).

// If the memory source and destination addresses line up on a 16-byte boundary, or are both equally
// misaligned relative to a 16-byte boundary, we can do a nice super fast memcpy that even beats
// Microsoft's heavily optimized memcpy... this is a highly special case though, which is
// very useful if we can force it to happen :)
// (Needs <emmintrin.h> for the SSE2 intrinsics and <string.h> for the memcpy fallback.)
__forceinline void ImageMemCopy(void *Dest, const void *Src, UInt Size) {
	// Cast through size_t, not a 32-bit type, so the alignment test survives 64-bit builds.
	if (!__sse2_available || Size < 16 || ((size_t)Src & 15) != ((size_t)Dest & 15)) {
		memcpy(Dest, Src, Size);
	}
	else {
		const char *pSrc = (const char *)Src;	// don't cast away const
		char *pDest = (char *)Dest;

		// Copy single bytes until the pointers reach a 16-byte boundary.
		UInt32 Count = (16 - ((size_t)pSrc & 15)) & 15;
		for (UInt Index = Count; Index > 0; --Index) {
			*pDest++ = *pSrc++;
		}	
		Size -= Count;

		// Main loop: prefetch ahead, then move 128 bytes per iteration through eight XMM
		// registers, using non-temporal stores so the destination bypasses the cache.
		__m128i Reg0, Reg1, Reg2, Reg3, Reg4, Reg5, Reg6, Reg7;
		for (UInt Index = Size >> 7; Index > 0; --Index) {
			_mm_prefetch(pSrc + 256, _MM_HINT_NTA);
			_mm_prefetch(pSrc + 256 + 64, _MM_HINT_NTA);
			Reg0 = _mm_load_si128((__m128i *)(pSrc));
			Reg1 = _mm_load_si128((__m128i *)(pSrc + 16));
			Reg2 = _mm_load_si128((__m128i *)(pSrc + 32));
			Reg3 = _mm_load_si128((__m128i *)(pSrc + 48));
			Reg4 = _mm_load_si128((__m128i *)(pSrc + 64));
			Reg5 = _mm_load_si128((__m128i *)(pSrc + 80));
			Reg6 = _mm_load_si128((__m128i *)(pSrc + 96));
			Reg7 = _mm_load_si128((__m128i *)(pSrc + 112));
			_mm_stream_si128((__m128i *)(pDest), Reg0);
			_mm_stream_si128((__m128i *)(pDest + 16), Reg1);
			_mm_stream_si128((__m128i *)(pDest + 32), Reg2);
			_mm_stream_si128((__m128i *)(pDest + 48), Reg3);
			_mm_stream_si128((__m128i *)(pDest + 64), Reg4);
			_mm_stream_si128((__m128i *)(pDest + 80), Reg5);
			_mm_stream_si128((__m128i *)(pDest + 96), Reg6);
			_mm_stream_si128((__m128i *)(pDest + 112), Reg7);
			pSrc += 128;
			pDest += 128;
		}
		// Handle the remaining 16-byte chunks (up to 7). The cases in both switches
		// deliberately fall through, so all loads happen before all stores.
		const UInt SizeOn16 = Size & 0x70;
		switch (SizeOn16 >> 4) {
			case 7: Reg7 = _mm_load_si128((__m128i *)(pSrc + 96));
			case 6:	Reg6 = _mm_load_si128((__m128i *)(pSrc + 80));
			case 5:	Reg5 = _mm_load_si128((__m128i *)(pSrc + 64));
			case 4:	Reg4 = _mm_load_si128((__m128i *)(pSrc + 48));
			case 3:	Reg3 = _mm_load_si128((__m128i *)(pSrc + 32));
			case 2:	Reg2 = _mm_load_si128((__m128i *)(pSrc + 16));
			case 1:	Reg1 = _mm_load_si128((__m128i *)(pSrc));
		}
		switch (SizeOn16 >> 4) {
			case 7: _mm_stream_si128((__m128i *)(pDest + 96), Reg7);
			case 6: _mm_stream_si128((__m128i *)(pDest + 80), Reg6);
			case 5: _mm_stream_si128((__m128i *)(pDest + 64), Reg5);
			case 4: _mm_stream_si128((__m128i *)(pDest + 48), Reg4);
			case 3: _mm_stream_si128((__m128i *)(pDest + 32), Reg3);
			case 2: _mm_stream_si128((__m128i *)(pDest + 16), Reg2);
			case 1: _mm_stream_si128((__m128i *)(pDest), Reg1);
		}
		pSrc += SizeOn16;
		pDest += SizeOn16;
		// Copy the final 0-15 stray bytes.
		for (UInt Index = Size & 15; Index > 0; --Index) {
			*pDest++ = *pSrc++;
		}
		// Make the non-temporal stores visible before anyone reads the destination.
		_mm_sfence();
	}
}
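To actually hit the fast path you have to control your allocations, which is what I mean by forcing it to happen. Something like this (just a sketch; SrcPixels, DestPixels, and ImageBytes are made-up names for illustration) uses the CRT's _aligned_malloc from <malloc.h> so both buffers start on a 16-byte boundary and are trivially "equally aligned":

#include <malloc.h>	// _aligned_malloc / _aligned_free

// Both buffers start on a 16-byte boundary, so ImageMemCopy takes the SSE2 path.
char *SrcPixels = (char *)_aligned_malloc(ImageBytes, 16);
char *DestPixels = (char *)_aligned_malloc(ImageBytes, 16);
if (SrcPixels && DestPixels) {
	ImageMemCopy(DestPixels, SrcPixels, ImageBytes);
}
_aligned_free(DestPixels);
_aligned_free(SrcPixels);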
Also, this is my first time really trying to use SSE2 and having it actually come out faster than the original code. Note that the function is __forceinline for my own purposes; you may not need to do that. Cheers -Scott [edit] Slight changes to the function [Edited by - popsoftheyear on July 23, 2008 4:12:38 PM]
Sweet, Duff's Device.

I don't suppose you'd consider (a) putting the memcpy branch first, as a guard clause; (b) factoring out the common 'Size & 0x70' subexpression; (c) getting Jan Wassenberg to go over it ;)

But really, the fastest code is that which doesn't run; are the "bad, bad reasons" reasons that the copying is slow, or that the copying takes place? If the latter, you should almost certainly be addressing that first. Don't move things around if you don't need to.
Quote:Original post by Zahlman
Sweet, Duff's Device.
That's not Duff's device. There's a separate trailer switch taking care of the remaining words rather than that awesome loop-inside-switch construct which branches straight into the loop.
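For reference, the real thing looks something like this (a from-memory sketch; assumes Count is non-zero):

// Duff's device: the switch jumps straight into the middle of the unrolled
// do-while loop, so the first pass handles Count % 8 bytes and every pass
// after that handles eight.
void DuffCopy(char *Dest, const char *Src, unsigned Count) {
	unsigned Passes = (Count + 7) / 8;
	switch (Count % 8) {
		case 0: do { *Dest++ = *Src++;
		case 7:      *Dest++ = *Src++;
		case 6:      *Dest++ = *Src++;
		case 5:      *Dest++ = *Src++;
		case 4:      *Dest++ = *Src++;
		case 3:      *Dest++ = *Src++;
		case 2:      *Dest++ = *Src++;
		case 1:      *Dest++ = *Src++;
			} while (--Passes > 0);
	}
}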
Of course using Duff's device would make it impossible to separate the loads from the stores. I honestly don't know if that makes a difference but I do seem to recall AMD's optimization manual recommending two loads before store.

At any rate nice work. Though you really ought to consider other approaches than actual copying. If the application's design forces a lot of dummy copies and you don't want to change the interface then perhaps you should look into abusing the MMU to simply map the same pages twice, but set read-only protection on both and catch the write exceptions to actually copy the data (e.g. copy-on-write just like the OS does it). This has an additional benefit in that you don't have to use non-temporal stores since these smaller pages will fit in the cache and be used straight away.
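Something along these lines, say (a from-memory Win32 sketch, untested; it lazily copies one page at a time on the first write, uses globals for brevity, and assumes the buffer size is a multiple of the page size; real code would need per-image bookkeeping):

#include <windows.h>
#include <string.h>

static char *g_Shared;	// the buffer both logical copies point at, protected read-only
static char *g_Backup;	// private memory that receives the original page contents
static SIZE_T g_Size;	// assumed to be a multiple of the page size

static LONG CALLBACK CopyOnWriteHandler(PEXCEPTION_POINTERS Info) {
	EXCEPTION_RECORD *Rec = Info->ExceptionRecord;
	// Only handle write faults (ExceptionInformation[0] == 1) inside our buffer.
	if (Rec->ExceptionCode != EXCEPTION_ACCESS_VIOLATION || Rec->ExceptionInformation[0] != 1)
		return EXCEPTION_CONTINUE_SEARCH;
	char *Addr = (char *)Rec->ExceptionInformation[1];
	if (Addr < g_Shared || Addr >= g_Shared + g_Size)
		return EXCEPTION_CONTINUE_SEARCH;

	SYSTEM_INFO SysInfo;
	GetSystemInfo(&SysInfo);
	SIZE_T Page = SysInfo.dwPageSize;
	SIZE_T Offset = (SIZE_T)(Addr - g_Shared) & ~(SIZE_T)(Page - 1);

	// Preserve the original contents of just this page, then let the write go through.
	memcpy(g_Backup + Offset, g_Shared + Offset, Page);
	DWORD OldProtect;
	VirtualProtect(g_Shared + Offset, Page, PAGE_READWRITE, &OldProtect);
	return EXCEPTION_CONTINUE_EXECUTION;
}

// Setup, done once after allocating the buffers:
//	DWORD OldProtect;
//	VirtualProtect(g_Shared, g_Size, PAGE_READONLY, &OldProtect);
//	AddVectoredExceptionHandler(1, CopyOnWriteHandler);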
a) good idea
b) Hmmm... sure why not... but why? Clarity?
c) Wouldn't mind his input. I read about his article, but haven't looked at it yet... I'll have to check it out.

But really, the bad, bad reasons would require redesigning 20 years of work by multiple programmers, some of whom decided that convoluted_code == job_security + paycheck, and eventually left anyway. Not to mention practically no comments, techniques that still carry DOS-era remnants, and two 3rd-party imaging libraries mixed together by copying and pasting EVERYTHING instead of refactoring. etc. etc. (etc.)

With proper redesign, I'd say a good 60% of this application could end up being no code at all, but unfortunately I don't have the time (or permission) to get into it like that, although I've managed to land projects that forced me to redesign certain areas anyway. Over the next couple of years I intend to turn a lot of these bad, bad reasons into "no code at all", but in the meantime I needed to attack some things that came up while profiling (yay, I talked them into installing a profiler on my computer!)... and this was one of them.

Cheers!
-Scott
Quote:Original post by implicit
... Though you really ought to consider other approaches than actual copying. If the application's design forces a lot of dummy copies and you don't want to change the interface then perhaps you should look into abusing the MMU to simply map the same pages twice...


Unfortunately, part of the problem is the data can be (but isn't necessarily) changed after each copy, and the program preserves it at each step, for optimization purposes. By the time we get back to reading the data to copy and turn it into something else, it will often have exceeded even large L2 caches. Without really getting into details, there's just a lot of stuff here that would have been smart 10-15 years ago when it was created.

Cheers
-Scott
Quote:Original post by popsoftheyear
Unfortunately, part of the problem is the data can be (but isn't necessarily) changed after each copy, and the program preserves it at each step, for optimization purposes. By the time we get back to reading the data to copy and turn it into something else, it will often have exceeded even large L2 caches. Without really getting into details, there's just a lot of stuff here that would have been smart 10-15 years ago when it was created.
I still don't see why.
The point of virtual memory tricks is that attempted writes will be caught automatically by the MMU without any extra logic on your part (aside from the exception handler, that is), and if a page is never written then it's never copied either. Plus, since you'd then only copy the next 4k-64k page or so at a time per fault, the target would easily fit inside the cache.
Granted, there are still all kinds of issues with the solution: for one thing the image classes would probably have to be updated, and it wouldn't work with VRAM, images with a "pitch", etc. But it's a very useful hack in the cases where it can be made to work, such as for the data segment of a forked process.
That's really interesting. I'll have to attempt that eventually. Thanks for the tip.
