Truncated/padded memcpy

popsoftheyear    2194
Is there a reasonably fast way to copy N dwords from one memory location to another, truncated to N words? So say we copied 2048 dwords = 8192 bytes from memory location a; memory location b would then contain 2048 words = 4096 bytes, all the low 16 bits of each dword actually, even though I don't see that that in itself matters. And vice versa... copying 2048 words to 2048 dwords, padding with 0s. (The real numbers are more in the millions, but 2048 is easier to read, of course.) Thanks -Scott

SiCrane    11839
In what programming language? On what platform? Are you using traditional CS meanings of word/dword or an API typedef meaning?

popsoftheyear    2194
Oh... right, sorry. Win32 C++. And those were just examples. Basically I might have an array with elements of an arbitrary bit depth, 32 bits for example, as a source, and another array with elements of a different arbitrary bit depth, 16 bits for example, as a destination (the depths could also match, in which case your favorite memcpy method will do). When copying from the source to the destination, the 32-bit elements need to be truncated. Being arbitrary, it could be the other way around: the 16-bit array might be the source, in which case the 32-bit destination elements would be zero-padded. Regardless... is there a reasonably fast way to do the first case (larger elements to smaller elements with truncation)?

For bonus points, since I imagine they would be somewhat different, what about the second case (smaller to larger)?

Cheers
-Scott

A simple for() loop reading from and writing to pointers, casting int to short (or the other way around), is reasonably fast in C++.

If it is still not fast enough, and your processor supports SSE2, and you have both arrays aligned to 16 bytes, you can do two aligned fetches, two shuffles, and a non-temporal store to truncate 32 to 16, and you could use PUNPCKHWD/PUNPCKLWD instead of shuffles for the other way around. Also, don't forget prefetching on the read side.
This obviously isn't "real" C++ (nor very portable) any more, but it will be faster.
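
For illustration, something like this (an untested sketch; the SSE2 versions assume 16-byte-aligned pointers and an element count divisible by 8, and the function names are made up here, not from any particular library):

#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

// Plain C++: the cast keeps the low 16 bits of each element.
void TruncateCopy(const unsigned int *src, unsigned short *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = static_cast<unsigned short>(src[i]);
}

// SSE2 32->16 truncation: two aligned fetches, shuffles that gather the
// low word of every dword into the low 64 bits, one non-temporal store.
void TruncateCopySSE2(const unsigned int *src, unsigned short *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128i a = _mm_load_si128((const __m128i *)(src + i));
        __m128i b = _mm_load_si128((const __m128i *)(src + i + 4));
        a = _mm_shufflelo_epi16(a, _MM_SHUFFLE(3, 1, 2, 0));
        a = _mm_shufflehi_epi16(a, _MM_SHUFFLE(3, 1, 2, 0));
        a = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 1, 2, 0));
        b = _mm_shufflelo_epi16(b, _MM_SHUFFLE(3, 1, 2, 0));
        b = _mm_shufflehi_epi16(b, _MM_SHUFFLE(3, 1, 2, 0));
        b = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 1, 2, 0));
        _mm_stream_si128((__m128i *)(dst + i), _mm_unpacklo_epi64(a, b));
    }
    _mm_sfence(); // make the streaming stores visible
}

// SSE2 16->32 zero-padding: PUNPCKLWD/PUNPCKHWD against zero interleave a
// zero word above each source word, i.e. zero-extension.
void WidenCopySSE2(const unsigned short *src, unsigned int *dst, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_load_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(dst + i), _mm_unpacklo_epi16(v, zero));
        _mm_stream_si128((__m128i *)(dst + i + 4), _mm_unpackhi_epi16(v, zero));
    }
    _mm_sfence();
}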

popsoftheyear    2194
Interesting... yeah, it's Win32-specific software, and SSE2 is fine to require. Also, my arrays are 16-byte aligned because I knew I would be getting myself into SSE2 stuff eventually :) I just don't want to spend time on it except where it's most beneficial. This will have to be one of those cases though...

Sounds like some research, but I can probably google the topics you spoke of... I'm still open to any other thoughts, though! Thanks

-Scott

Jan Wassenberg    999
For the specific examples, you can use SSE2 to pack two registers of four i32 each into eight i16 and blast those out to memory in one instruction. [It is probably better to unroll 4x and thus fill the write-combine buffer.] The opposite is also easily possible.

Note the integer types, though - only very new (as in this year) CPUs have pack instructions that avoid the signed 16-bit saturation (SSE4.1's PACKUSDW saturates as unsigned instead).
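
On plain SSE2 you can still get true truncation out of the signed pack by masking and biasing - a sketch under the same alignment/count assumptions as above, with a made-up name:

#include <emmintrin.h>

// Truncating 32->16 copy via PACKSSDW despite its signed saturation:
// mask to the low 16 bits, subtract 0x8000 so every value lands in the
// signed 16-bit range (packing is then lossless), and flip the bias back
// with an XOR after the pack (x - 0x8000 == x ^ 0x8000 mod 2^16).
void PackTruncateSSE2(const unsigned int *src, unsigned short *dst, size_t n)
{
    const __m128i mask = _mm_set1_epi32(0xFFFF);
    const __m128i bias32 = _mm_set1_epi32(0x8000);
    const __m128i bias16 = _mm_set1_epi16((short)0x8000);
    for (size_t i = 0; i < n; i += 8) {
        __m128i a = _mm_load_si128((const __m128i *)(src + i));
        __m128i b = _mm_load_si128((const __m128i *)(src + i + 4));
        a = _mm_sub_epi32(_mm_and_si128(a, mask), bias32);
        b = _mm_sub_epi32(_mm_and_si128(b, mask), bias32);
        __m128i packed = _mm_packs_epi32(a, b);
        _mm_stream_si128((__m128i *)(dst + i), _mm_xor_si128(packed, bias16));
    }
    _mm_sfence();
}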

Anyway, you still haven't given enough information. Of course it matters WHICH and HOW MANY bits you want to truncate (e.g. applicability of the SSE pack instructions, whether saturation is OK). And what are the requirements on "reasonably fast"?

popsoftheyear    2194
I would keep the low bits. Let's assume everything is unsigned, too. So if one of the elements was 0xff003311, the output would be 0x3311, etc. Right now I've got a super-general copy just to get the algorithm working. I will probably need a few special-case algorithms... but this one always works. Since these Intel CPUs are little-endian, copying the leading (low) bytes of each element does the job:

void ImageBlitter::CopyGeneric_(UInt32 SrcX, UInt32 SrcY, UInt32 Width, UInt32 Height,
                                Image &DestImageObj, UInt32 DestX, UInt32 DestY) {
    // Get some byte counts
    UInt8 CopyBitCount = min(DestImageObj.GetChannelDepth(), ImageObj_->GetChannelDepth());
    UInt8 CopyByteCount = CopyBitCount / 8;
    UInt32 SkipByteCountIn = (ImageObj_->GetChannelDepth() - CopyBitCount) / 8;
    UInt32 SkipByteCountOut = (DestImageObj.GetChannelDepth() - CopyBitCount) / 8;
    UInt32 NextLineBytesIn = (ImageObj_->GetBPP() / 8) * (ImageObj_->GetWidth() - Width);
    UInt32 NextLineBytesOut = (DestImageObj.GetBPP() / 8) * (DestImageObj.GetWidth() - Width);

    // [Get our UInt8 *SrcPtr and UInt8 *DestPtr...]

    // Do the super-general, always-works, but pretty slow copy.
    for (UInt32 YI = SrcY; YI < Height + SrcY; YI++) {
        for (UInt32 XI = SrcX; XI < Width + SrcX; XI++) {
            for (UInt32 Bytes = 0; Bytes < CopyByteCount; Bytes++) {
                *DestPtr++ = *SrcPtr++;
            }
            SrcPtr += SkipByteCountIn;
            DestPtr += SkipByteCountOut;
        }
        SrcPtr += NextLineBytesIn;
        DestPtr += NextLineBytesOut;
    }
}

(EDIT - just copied the whole function for clarity's sake...)

So yeah... just looking for some ideas on the matter. I didn't want to waste time looking into one method if another was more appropriate. I realize most optimized versions will have to have special cases... but there aren't that many supported bit depths. This is one of the most important and widely used functions, though, so it must go faster than it currently does (meaning anything faster = reasonably fast).

Thanks
-Scott

Zahlman    1682
std::copy should be up to the task. If not, try std::transform, using a unary operation that performs a static_cast to the destination type.
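
Something along these lines (a minimal sketch; the element types are chosen only for illustration):

#include <algorithm>
#include <cstddef>

// Keeps the low 16 bits of each element.
unsigned short TruncateElement(unsigned int v)
{
    return static_cast<unsigned short>(v);
}

void CopyTruncating(const unsigned int *src, unsigned short *dst, std::size_t n)
{
    // std::copy would do the conversion implicitly (with a narrowing
    // warning); std::transform states the intent explicitly.
    std::transform(src, src + n, dst, TruncateElement);
}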

But.

Why do you assume your existing copy method is too slow for you?

And for that matter, why do you need to do this copying yourself? "Image blitting" sounds like a library task.

popsoftheyear    2194
:)
Matrox changed their licensing terms, and their lib doesn't contain some of the stuff we are starting to need anyway (like support for >2GB images on 32-bit hardware)... nor does LEADTOOLS.
I assume it's too slow because the Matrox lib does it faster... and that lib is like 10 years old or something. I know they use MMX optimizations where possible, and I figured there might be some good ideas around here before delving into it myself.

And yeah... I had wondered if the std library might suffice... I'll have to try it too.

Cheers
-Scott

arithma    226
Doesn't anyone think that, if this should be done, the implementation of memcpy is what should be looked at? Twist some knobs and you'd have what you need, I believe.

SiCrane    11839
The implementation of memcpy() usually works out to something like:

shr ecx, 2      ; dword count = byte count / 4
rep movsd       ; bulk copy, four bytes at a time
mov ecx, eax    ; eax still holds the original byte count
and ecx, 3      ; 0-3 trailing bytes
rep movsb       ; copy the tail

What knobs are you going to twist on that?

Jan Wassenberg    999
> 2 GB images? That sounds like a good reason :)

Quote:
I know they use MMX optimizations where possible

Yes, that is applicable here. (SSE2 mostly consists of the old MMX integer instructions widened to the 128-bit XMM registers.)

Quote:
The implementation of memcpy() usually works out to something like:

That was true up to the 486, but on superscalar processors you are better off with a loop. In fact to reach peak DDR bandwidth, quite a bit more effort needs to be applied: Speeding Up Memory Copy.
Since a big part of the gains for large transfers involves using MMX/SSE, it's still difficult to just 'turn knobs'. But arithma is correct insofar as implementations lacking the modern memcpy techniques will max out at ~400 MB/s.

asp_    172
Actually SiCrane is correct. MSVC generates pretty much the code template he specified with certain optimizations depending on how much is known about the source / target data. Whether it's optimal or not is another issue :P

SiCrane    11839
Quote:
Original post by Jan Wassenberg
Quote:
The implementation of memcpy() usually works out to something like:

That was true up to the 486, but on superscalar processors you are better off with a loop.


I didn't say that's what the implementation should be, just what the implementation usually is. I would think you of all people would know how suboptimal default compiler implementations often turn out to be.

Jan Wassenberg    999
Quote:
Actually SiCrane is correct. MSVC generates pretty much the code template he specified with certain optimizations depending on how much is known about the source / target data. Whether it's optimal or not is another issue :P

Please specify these "certain optimizations" more exactly. With VC8 SP1, nothing known at compile time about alignment/size, and cflags "/Ox /Oi /Os /Oy /Ob2 /LTCG /MD" (pretty normal settings but favoring intrinsics as much as possible), I see a call to the CRT's memcpy. In fact not even #pragma intrinsic(memcpy) is enough to sway the compiler to generate MOVS. What gives?

Quote:
I didn't say that's what the implementation should be, just what the implementation usually is. I would think you of all people would know how suboptimal default compiler implementations often turn out to be.

heh. No argument on what the implementation *should* be; I'm saying that VC's CRT memcpy() has indeed been in form of a loop since the Pentium days. (Side note: the amusingly outdated U/V pipe comments have recently been removed in favor of an SSE2 implementation.)

asp_    172

#include <iostream>
#include <memory.h>

static char const mystr[1023] = "arg";
static char mydest[1023];

int __cdecl main()
{
    memcpy(mydest, mystr, sizeof(mydest));
    std::cout << "Result: " << mydest << std::endl;
}


Microsoft Visual Studio 2005 Professional, Version 8.0.50727.762, with compiler function intrinsics on or off, optimizing for speed and favoring speed, generates:


B9 FF 00 00 00 mov ecx, 0FFh
BE C0 58 41 00 mov esi, offset mystr ; "arg"
BF C0 54 41 00 mov edi, offset mydest
F3 A5 rep movsd
68 C0 54 41 00 push offset mydest
51 push ecx
66 A5 movsw


Might be because the array is of a known somewhat small size. It does cheat a bit thanks to it knowing the size at compile time.

*edit*
Yep, a bigger array causes a call to the memcpy function, which does the whole SSE2 shebang. Examining the memcpy disassembly, they also fall back to the same code if the array is less than 256 bytes in size. However, they use a jump table to do the correct write on the trailing bytes. Write it off as one of those "compiler writer knows best" cases? :)
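
The trailing-byte handling boils down to an indexed jump on the remainder - roughly this (a sketch of the idea, not the CRT's actual code):

// Copy the 0-3 bytes left over after the dword loop; the switch compiles
// to an indexed jump rather than a chain of tests and branches.
static void CopyTail(const unsigned char *src, unsigned char *dst, unsigned tail)
{
    switch (tail) {
    case 3: dst[2] = src[2]; // fall through
    case 2: dst[1] = src[1]; // fall through
    case 1: dst[0] = src[0]; // fall through
    case 0: break;
    }
}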

Zahlman    1682
Quote:
Original post by asp_
Write it off as one of those "compiler writer knows best"? :)


They don't always. But not everyone can be, well, Jan Wassenberg. ;)

Anyway, none of that really helps, AFAIK, for truncated/expanded-per-element memory copies...

Jan Wassenberg    999
Quote:
Might be because the array is of a known somewhat small size. It does cheat a bit thanks to it knowing the size at compile time.

Ah, indeed. It only seems to happen with known, small sizes. We now know to avoid a certain compiler pessimization via #pragma function(memcpy) or by using a different memcpy implementation.

Quote:
However they use a jump table to do the correct write on the trailing bytes. Write it off as one of those "compiler writer knows best"? :)

A variant of that jump table approach is indeed fast - the fastest of all the trailing-byte-processing methods I could think of and evaluated when writing that paper.

Quote:
Anyway, none of that really helps, AFAIK, for truncated/expanded-per-element memory copies...

heh, we are back to the point where the CRT memcpy can be used as the base recipe. That together with SSE unpacking or shuffling and a dash of block prefetching should be all the information that is needed to achieve good performance. For an additional gold star, see What Every Programmer Should Know About Memory.
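
As a rough sketch of that recipe (block size, prefetch hint, and names are mine for illustration, not tuned values; assumes 16-byte alignment and n a multiple of 8):

#include <cstddef>
#include <emmintrin.h>

// Truncating 32->16 copy with block prefetching and streaming stores:
// walk the source in blocks, touch one cache line per 64 bytes ahead of
// time, then truncate with the mask/bias/pack trick and write around the
// cache.
void BlockTruncate(const unsigned int *src, unsigned short *dst, size_t n)
{
    const size_t kBlock = 1024; // elements per block = 4 KB of source
    const __m128i mask = _mm_set1_epi32(0xFFFF);
    const __m128i bias32 = _mm_set1_epi32(0x8000);
    const __m128i bias16 = _mm_set1_epi16((short)0x8000);
    for (size_t base = 0; base < n; base += kBlock) {
        size_t end = base + kBlock < n ? base + kBlock : n;
        for (size_t p = base; p < end; p += 16) // 16 dwords = 1 cache line
            _mm_prefetch((const char *)(src + p), _MM_HINT_NTA);
        for (size_t i = base; i < end; i += 8) {
            __m128i a = _mm_load_si128((const __m128i *)(src + i));
            __m128i b = _mm_load_si128((const __m128i *)(src + i + 4));
            a = _mm_sub_epi32(_mm_and_si128(a, mask), bias32);
            b = _mm_sub_epi32(_mm_and_si128(b, mask), bias32);
            _mm_stream_si128((__m128i *)(dst + i),
                             _mm_xor_si128(_mm_packs_epi32(a, b), bias16));
        }
    }
    _mm_sfence();
}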
