Truncated/padded memcpy

Started by
17 comments, last by Jan Wassenberg 15 years, 11 months ago
Doesn't anyone think that if this should be done, the implementation of memcpy should be looked at? Twist some knobs and you'd have what you need, I believe.
The implementation of memcpy() usually works out to something like:
	shr	ecx, 2
	rep	movsd
	mov	ecx, eax
	and	ecx, 3
	rep	movsb

What knobs are you going to twist on that?
> 2 GB images? That sounds like a good reason :)

Quote:I know they use MMX optimizations where possible

Yes, that is applicable here. (SSE2 mostly consists of the old MMX instructions widened to the 128-bit SSE registers.)

Quote:The implementation of memcpy() usually works out to something like:

That was true up to the 486, but on superscalar processors you are better off with a loop. In fact to reach peak DDR bandwidth, quite a bit more effort needs to be applied: Speeding Up Memory Copy.
Since a big part of the gains for large transfers involves using MMX/SSE, it's still difficult to just 'turn knobs'. But arithma is correct insofar as implementations lacking the modern memcpy techniques will max out at ~400 MB/s.
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Actually SiCrane is correct. MSVC generates pretty much the code template he specified with certain optimizations depending on how much is known about the source / target data. Whether it's optimal or not is another issue :P
Quote:Original post by Jan Wassenberg
Quote:The implementation of memcpy() usually works out to something like:

That was true up to the 486, but on superscalar processors you are better off with a loop.


I didn't say that's what the implementation should be, just what the implementation usually is. I would think you of all people would know how suboptimal default compiler implementations often turn out to be.
Quote:Actually SiCrane is correct. MSVC generates pretty much the code template he specified with certain optimizations depending on how much is known about the source / target data. Whether it's optimal or not is another issue :P

Please specify these "certain optimizations" more exactly. With VC8 SP1, nothing known at compile time about alignment/size, and cflags "/Ox /Oi /Os /Oy /Ob2 /LTCG /MD" (pretty normal settings but favoring intrinsics as much as possible), I see a call to the CRT's memcpy. In fact not even #pragma intrinsic(memcpy) is enough to sway the compiler to generate MOVS. What gives?

Quote:I didn't say that's what the implementation should be, just what the implementation usually is. I would think you of all people would know how suboptimal default compiler implementations often turn out to be.

heh. No argument on what the implementation *should* be; I'm saying that VC's CRT memcpy() has indeed been in form of a loop since the Pentium days. (Side note: the amusingly outdated U/V pipe comments have recently been removed in favor of an SSE2 implementation.)
#include <iostream>
#include <memory.h>

static char const mystr[1023] = "arg";
static char mydest[1023];

int __cdecl main()
{
    memcpy(mydest, mystr, sizeof(mydest));
    std::cout << "Result: " << mydest << std::endl;
}


Microsoft Visual Studio 2005 Professional, Version 8.0.50727.762, compiler function intrinsics on or off, optimize for speed, favor speed generates:

B9 FF 00 00 00          mov     ecx, 0FFh
BE C0 58 41 00          mov     esi, offset mystr ; "arg"
BF C0 54 41 00          mov     edi, offset mydest
F3 A5                   rep movsd
68 C0 54 41 00          push    offset mydest
51                      push    ecx
66 A5                   movsw


Might be because the array is of a known somewhat small size. It does cheat a bit thanks to it knowing the size at compile time.

*edit*
Yep, a bigger array causes a call to the memcpy function, which does the whole SSE2 shebang. Examining the memcpy disassembly, it also falls back to the same rep movs sequence if the array is less than 256 bytes in size. However, it uses a jump table to do the correct writes on the trailing bytes. Write it off as one of those "compiler writer knows best"? :)
Quote:Original post by asp_
Write it off as one of those "compiler writer knows best"? :)


They don't always. But not everyone can be, well, Jan Wassenberg. ;)

Anyway, none of that really helps, AFAIK, for truncated/expanded-per-element memory copies...
Quote:Might be because the array is of a known somewhat small size. It does cheat a bit thanks to it knowing the size at compile time.

Ah, indeed. It only seems to happen with known, small sizes. We now know to avoid a certain compiler pessimization via #pragma function(memcpy) or by using a different memcpy implementation.

Quote:However they use a jump table to do the correct write on the trailing bytes. Write it off as one of those "compiler writer knows best"? :)

A variant of that jump table approach is indeed fast: the fastest of all the trailing-byte-processing methods I could think of and evaluated when writing that paper.

Quote:Anyway, none of that really helps, AFAIK, for truncated/expanded-per-element memory copies...

heh, we are back to the point where the CRT memcpy can be used as the base recipe. That together with SSE unpacking or shuffling and a dash of block prefetching should be all the information that is needed to achieve good performance. For an additional gold star, see What Every Programmer Should Know About Memory.

This topic is closed to new replies.
