Fast memory copying

18 comments, last by Jan Wassenberg 18 years ago
Hello everyone. I need to copy a lot of memory fast. Often there will be large segments that are identical in the source and the destination. Say I have 1024 bytes in two buffers and I need to copy from buffer1 to buffer2; sometimes 128 bytes will already be equal in both buffers. So, should I:

1. Just use memcpy() and copy everything every time (how optimized is memcpy?)
2. Compare the buffers, say 64 bytes at a time, and if a change is detected, copy the remaining bytes of that chunk (see the sketch below)?
3. Compare all the way through and only write the differences?

We are talking several MB at a time here, so it is important that it's done as fast as possible. On some systems the read speed is 3-5x faster than the write speed; on mine it's about 3200 MB/s vs 1300 MB/s. So, what do you suggest?
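For concreteness, here is a minimal sketch of option 2, assuming 64-byte chunks and plain memcmp/memcpy; the function name and the way the tail is ignored are placeholders, not something I've benchmarked:

#include <cstddef>
#include <cstring>

// Option 2, roughly: compare chunk by chunk, copy only chunks that differ.
// Assumes size is a multiple of CHUNK; handling the tail is left out.
void copy_changed_chunks(char* dst, const char* src, std::size_t size)
{
    const std::size_t CHUNK = 64;
    for (std::size_t i = 0; i < size; i += CHUNK)
    {
        if (std::memcmp(dst + i, src + i, CHUNK) != 0)
            std::memcpy(dst + i, src + i, CHUNK);
    }
}

Whether this beats a single memcpy() depends entirely on how much of the data actually changes and on what the comparison itself costs in memory traffic.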
Interesting.

Where does the source buffer get its data?

Kuphryn
Any sort of comparison will slow you down.
It's generated by an independent module (sort of). In other words, it's CPU-generated and it's only RAM; no device I/O is involved. It's not very predictable, and parts of it are generated from scripts and plugin DLLs, which cannot interact with the copying code (for example, they can't report which areas have been modified).
I agree with the AP. You'd probably be best off just making one call to memcpy(). How many MB are we talking here? If it's less than 10 or 20, then the speed isn't going to be that bad.
Quote:Original post by Anonymous Poster
Any sort of comparison will slow you down.


He could pre-calculate some sort of checksum/modification flag for the blocks of memory and use that for the comparison. If the data is likely to remain unchanged, or to be modified only a little, not copying several megabytes in vain could be worth considering.
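For illustration, something along these lines; the 64-byte block size and the FNV-1a hash are arbitrary choices of mine, and a hash can in principle miss a change (collision), so treat it as a sketch of the idea rather than a drop-in solution:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Example hash (FNV-1a); any cheap per-block checksum would do.
static std::uint32_t fnv1a(const unsigned char* p, std::size_t n)
{
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < n; ++i)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

// Keep one checksum per 64-byte block; re-hash the source block and
// copy (and update the stored checksum) only when it has changed.
void copy_if_checksum_changed(unsigned char* dst, const unsigned char* src,
                              std::size_t size, std::vector<std::uint32_t>& sums)
{
    const std::size_t BLOCK = 64;
    sums.resize(size / BLOCK, 0u);
    for (std::size_t b = 0; b < size / BLOCK; ++b)
    {
        const std::uint32_t h = fnv1a(src + b * BLOCK, BLOCK);
        if (h != sums[b])
        {
            std::memcpy(dst + b * BLOCK, src + b * BLOCK, BLOCK);
            sums[b] = h;
        }
    }
}

Note that hashing still reads all of the source, so this mainly saves the destination reads and the writes for unchanged blocks; it pays off when most blocks don't change.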
[size="2"]I like the Walrus best.
Quote:We are talking several MB at a time here, so it is important that it's done as fast as possible. On some systems the read speed is 3-5x faster than the write speed; on mine it's about 3200 MB/s vs 1300 MB/s.

How are you doing the copying?
Techniques outlined in the technical report on speeding up memcpy might help.
Here's the deal: it's a pool, 20-64 MB in size, with a 6.25% memory overhead. The pool is divided into 64-byte chunks, and one allocation can use one or more of those chunks. There's no point in copying unused chunks. Some of the used chunks might not change (the first 4 MB or so will change very little, for example, and many of those chunks will never change). The remaining chunks are less likely to be in use at all (chunk #65537 will almost always be used, while #80000 is much less likely to be used).

Ideas?
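For illustration, a rough sketch of skipping unused chunks, assuming you keep (or can derive from the pool's metadata) one "in use" bit per 64-byte chunk; the bitmap layout here is invented for the example:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy only the 64-byte chunks marked as allocated.
// 'used_bits' is a hypothetical bitmap, one bit per chunk, LSB first.
void copy_used_chunks(unsigned char* dst, const unsigned char* src,
                      std::size_t pool_size, const std::uint8_t* used_bits)
{
    const std::size_t CHUNK = 64;
    const std::size_t chunks = pool_size / CHUNK;
    for (std::size_t c = 0; c < chunks; ++c)
    {
        if (used_bits[c / 8] & (1u << (c % 8)))   // chunk allocated?
            std::memcpy(dst + c * CHUNK, src + c * CHUNK, CHUNK);
    }
}

Coalescing runs of consecutive used chunks into one larger memcpy() call would almost certainly beat one call per 64-byte chunk, but the structure is the same.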
Can anyone help me build the code from the article? I'm not really into ASM, so I don't understand any of that code.

I get linker errors like this:
1>Memory.obj : error LNK2001: unresolved external symbol "void __cdecl ia32_asm_init(void)" (?ia32_asm_init@@YAXXZ)
1>Memory.obj : error LNK2001: unresolved external symbol "bool __cdecl ia32_cpuid(unsigned int,unsigned int *)" (?ia32_cpuid@@YA_NIPAI@Z)


Any ideas on how to fix?

I tried just creating a .asm file in my project, plus a .h and a .cpp, adding the relevant code to each of them, and then adding the custom build step for NASM. What else do I have to do?
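One likely cause, going by the mangled names in those errors (?ia32_asm_init@@YAXXZ is C++ name mangling): the prototypes are being compiled as C++ functions, while NASM emits plain C symbols. Wrapping the declarations in extern "C" usually fixes exactly this kind of mismatch; something along these lines, with the signatures taken from your error messages and the rest an assumption about how the header is set up:

// In the header that declares the assembly routines:
#ifdef __cplusplus
extern "C" {
#endif

void ia32_asm_init(void);
bool ia32_cpuid(unsigned int func, unsigned int* regs);

#ifdef __cplusplus
}
#endif

Also check that the NASM build step emits an object format your linker understands (e.g. -f win32 for 32-bit MSVC builds) and that the resulting .obj is actually being passed to the linker.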
Do you know memcpy() is a bottleneck? If not, use memcpy(). Anything else is premature optimisation.

Note that if I were a vaguely sane C library writer, and if writing only the data that differs were faster than always writing, my implementation of memcpy() would do that.

Some attempts at logic

Let's say writing is precisely three times slower than reading.

To do the comparison, you need to read from both the source and the destination. Let's say you can load a 128-byte block into a few SIMD registers and compare the bytes in parallel with a handful of instructions; I don't know if that's true. I'm guessing that the time taken to do the comparison is vanishingly small compared to the memory access time; I don't know if that's true either. If the comparison reveals that the block has changed, then we must also do the write.
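For what it's worth, comparing 16 bytes per instruction pair is straightforward with SSE2; a minimal sketch of a 128-byte block comparison (the function name, the fixed block size, and the 16-byte alignment requirement are my assumptions):

#include <emmintrin.h>   // SSE2 intrinsics

// Returns true if two 128-byte, 16-byte-aligned blocks are identical.
static bool blocks_equal_128(const void* a, const void* b)
{
    const __m128i* pa = static_cast<const __m128i*>(a);
    const __m128i* pb = static_cast<const __m128i*>(b);
    for (int i = 0; i < 8; ++i)                    // 8 x 16 bytes = 128 bytes
    {
        const __m128i eq = _mm_cmpeq_epi8(_mm_load_si128(pa + i),
                                          _mm_load_si128(pb + i));
        if (_mm_movemask_epi8(eq) != 0xFFFF)       // some byte differs
            return false;
    }
    return true;
}

The instructions themselves are cheap; as guessed above, the cost is dominated by pulling both blocks through the memory system.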

If all the data has changed, and the above paragraph is a realistic description of the timing, then compare-and-copy takes about 25% longer than a plain copy (because we've introduced an extra read of the destination that wasn't there before). If only half the data has changed it's already slightly faster, and roughly speaking you only win when less than about two thirds of the data has changed.
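Spelling that arithmetic out (my numbers, under the post's own read/write assumption), with a read costing 1 unit and a write 3 units per block, and f the fraction of blocks that changed:

cost(always copy)        = read(src) + write(dst)          = 1 + 3  = 4
cost(compare, then copy) = read(src) + read(dst) + f * 3   = 2 + 3f

break-even: 2 + 3f = 4  =>  f = 2/3

Under these assumptions the comparison loses at most 25% when everything has changed and wins whenever less than about two thirds has. Caches, write-allocate behaviour and prefetching will shift those numbers in practice, so measuring remains the only real answer.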

