DvDmanDT

Fast memory copying


Hello everyone.. I need to copy a lot of memory fast.. Often there'll be large segments that are equal in the dest and the source.. Say I have 1024 bytes in two buffers, and I need to copy from buffer1 to buffer2.. Sometimes there can be 128 bytes that are equal in both buffers.. So, should I:

1. Just use memcpy() and copy everything all the time (how optimized is memcpy()?),
2. Compare the buffers, say 64 bytes at a time, and if a change is detected, copy the remaining bytes of the chunk, or
3. Compare all the way and only write the differences?

We are talking several MB at a time here, so it's important that it's done the fastest way possible.. On some systems, the read speed is 3-5x faster than the write speed.. On my system it's 3200 MB/s vs 1300 MB/s.. So, what do you suggest?

Guest Anonymous Poster
Any sort of comparison will slow you down.

It's generated by an independent module (sort of).. In other words, it's CPU-generated, and it's only RAM, no device IO involved.. It's not very predictable, and parts of it are generated from scripts and plugin DLLs, which cannot interact with the copying code (for example, they can't say which areas are modified or the like)..

I agree with the AP. You'd probably be best just making one call to memcpy(). How many MB are we talking here? If it's less than 10 or 20, then the speed isn't going to be that bad.

Quote:
Original post by Anonymous Poster
Any sort of comparison will slow you down.


He could pre-calculate some sort of checksum/modification flag for the blocks of memory and use them for the comparison. If the data is likely to remain unchanged, or to be modified only a little, not copying several megabytes in vain could be worth taking into consideration.
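A sketch of that idea, assuming 64-byte blocks and an arbitrary hash (FNV-1a here; all names are illustrative). Note, per the discussion below, that a matching checksum doesn't strictly guarantee a block is unchanged:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// One checksum per 64-byte block of the source; recompute each pass and
// copy only blocks whose checksum changed since the last pass.
static uint64_t fnv1a(const unsigned char* p, size_t n) {
    uint64_t h = 1469598103934665603ull;  // FNV offset basis
    for (size_t i = 0; i < n; ++i) {
        h ^= p[i];
        h *= 1099511628211ull;            // FNV prime
    }
    return h;
}

void checked_copy(const unsigned char* src, unsigned char* dst,
                  size_t bytes, std::vector<uint64_t>& sums) {
    const size_t kBlock = 64;
    size_t blocks = bytes / kBlock;
    sums.resize(blocks, 0);
    for (size_t b = 0; b < blocks; ++b) {
        uint64_t h = fnv1a(src + b * kBlock, kBlock);
        if (h != sums[b]) {  // block (very likely) changed since last pass
            std::memcpy(dst + b * kBlock, src + b * kBlock, kBlock);
            sums[b] = h;
        }
    }
}
```

This trades one write per unchanged block for an extra read and a hash over every block, so it only pays off if hashing is cheap relative to writing.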

Here's the deal.. It's a pool.. It's 20-64 MB in size, and it has a 6.25% memory overhead.. The pool is divided into 64-byte chunks, and one allocation can use one or more of those chunks.. There's no point in copying unused chunks.. Some of the used chunks might not change (the first 4 MB or so will change very little, for example, and many of those chunks will never change).. The rest of the chunks are less likely to be used (chunk #65537 will almost always be used, while #80000 will be far less likely to be used)..

Ideas?

Can anyone help me build the code from the article? I'm not really into ASM, so I don't understand any of that code..

I get linker errors like this:
1>Memory.obj : error LNK2001: unresolved external symbol "void __cdecl ia32_asm_init(void)" (?ia32_asm_init@@YAXXZ)
1>Memory.obj : error LNK2001: unresolved external symbol "bool __cdecl ia32_cpuid(unsigned int,unsigned int *)" (?ia32_cpuid@@YA_NIPAI@Z)


Any ideas on how to fix?

I tried just creating a .asm file in my project, plus a .h and a .cpp, adding the relevant code to each of them, and then I added the custom build step for nasm.. What else do I have to do?

Do you know that memcpy() is a bottleneck? If not, use memcpy(). Anything else is premature optimisation.

Note that if I were a vaguely sane C library writer, and if writing only the data that differs were faster than always writing, my implementation of memcpy() would do that.

Some attempts at logic

Let's say writing is precisely three times slower than reading: a read costs 1 unit per block, a write costs 3.

To do the comparison, you need to read from both the source and the destination. Let's say you can load the 128 bytes into some SIMD registers and compare the bytes in parallel with a single instruction. I don't know if that's true. I'm guessing that the time taken to do the comparison is vanishingly small in comparison to memory access time. I don't know that either. If the comparison reveals that the block has changed, then we must still do the write.

Under those assumptions, a plain copy costs 1 (read the source) + 3 (write) = 4 units per block, while a checked copy costs 1 + 1 (read both buffers) + 3f, where f is the fraction of the data that has changed. If all the data has changed (f = 1), the checked copy costs 5 units, about 25% longer than the plain copy (because we've introduced an extra read that wasn't there before). If only half the data has changed, it costs 3.5 units, slightly less than the plain copy. The break-even point is 2 + 3f = 4, so you only win when less than two-thirds of the data has changed.

Quote:
Original post by owl
Quote:
Original post by Anonymous Poster
Any sort of comparison will slow you down.


He could pre-calculate some sort of checksum/modification flag for the blocks of memory and use them for the comparison. If the data is likely to remain unchanged, or to be modified only a little, not copying several megabytes in vain could be worth taking into consideration.


A clarification: when I said blocks of memory, I was talking about partitioning the buffer into chunks (dunno, some size) and copying only those that were modified.

Quote:
Original post by owl
Quote:
Original post by owl
Quote:
Original post by Anonymous Poster
Any sort of comparison will slow you down.


He could pre-calculate some sort of checksum/modification flag for the blocks of memory and use them for the comparison. If the data is likely to remain unchanged, or to be modified only a little, not copying several megabytes in vain could be worth taking into consideration.


A clarification: when I said blocks of memory, I was talking about partitioning the buffer into chunks (dunno, some size) and copying only those that were modified.

The problem with this is that a checksum can guarantee that a chunk has changed (when the checksums differ), but not that it hasn't changed (when they match), so you can't safely skip a copy on a matching checksum. Presumably, to be efficient, you couldn't be using every byte of the chunk, so some bytes wouldn't contribute to the checksum at all. Even if you do use every byte, the pigeonhole principle guarantees that N chunks share the same checksum, where N = 2^(chunk length in bits - checksum length in bits).

Quote:
Original post by Nathan Baum
Quote:
Original post by owl
Quote:
Original post by owl
Quote:
Original post by Anonymous Poster
Any sort of comparison will slow you down.


He could pre-calculate some sort of checksum/modification flag for the blocks of memory and use them for the comparison. If the data is likely to remain unchanged, or to be modified only a little, not copying several megabytes in vain could be worth taking into consideration.


A clarification: when I said blocks of memory, I was talking about partitioning the buffer into chunks (dunno, some size) and copying only those that were modified.

The problem with this is that a checksum can guarantee that a chunk has changed (when the checksums differ), but not that it hasn't changed (when they match), so you can't safely skip a copy on a matching checksum. Presumably, to be efficient, you couldn't be using every byte of the chunk, so some bytes wouldn't contribute to the checksum at all. Even if you do use every byte, the pigeonhole principle guarantees that N chunks share the same checksum, where N = 2^(chunk length in bits - checksum length in bits).


You could manage access to the blocks of memory and keep track, in an array of bools, of which ones were modified.
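That bookkeeping could be sketched like this, assuming every write can be routed through a wrapper (the class name, 64-byte block size, and layout are all illustrative, not from the thread):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

class TrackedBuffer {
public:
    static const size_t kBlock = 64;

    explicit TrackedBuffer(size_t bytes)
        : data_(bytes, 0), dirty_((bytes + kBlock - 1) / kBlock, false) {}

    // All writes go through here, flagging every touched block as dirty.
    void write(size_t offset, const void* src, size_t n) {
        std::memcpy(data_.data() + offset, src, n);
        for (size_t b = offset / kBlock; b <= (offset + n - 1) / kBlock; ++b)
            dirty_[b] = true;
    }

    // Copy only the dirty blocks into dst, then clear the flags.
    void flush_to(unsigned char* dst) {
        for (size_t b = 0; b < dirty_.size(); ++b) {
            if (!dirty_[b]) continue;
            size_t off = b * kBlock;
            size_t n = data_.size() - off < kBlock ? data_.size() - off : kBlock;
            std::memcpy(dst + off, data_.data() + off, n);
            dirty_[b] = false;
        }
    }

private:
    std::vector<unsigned char> data_;
    std::vector<bool> dirty_;
};
```

The catch, as noted below, is that the code doing the modifications has to cooperate by calling write().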

Note: I'm just brainstorming here :)

Yes, but DvDmanDT states that he doesn't have access to that kind of functionality; the function which does the modification "cannot interact with the copying code (for example, it can't say which areas are modified or the like)."

Quote:
Original post by Nathan Baum
Yes, but DvDmanDT states that he doesn't have access to that kind of functionality; the function which does the modification "cannot interact with the copying code (for example, it can't say which areas are modified or the like)."


That's true, I didn't read that part.

Well, in that case, the only way to avoid reading and copying everything all the time is to read everything all the time and keep track of the changes. That would require a copy of the buffer in question, and a process that traverses the source buffer once in a while, copies only those parts that suffered a modification, and at the same time updates the buffer used to look for changes. Since it was stated that reading is faster than writing, reading everything all the time ought to be faster than reading and writing everything all the time.

The only (maybe detrimental) issue is that you need to duplicate the memory used by the data, you need to traverse two buffers, and the writing part (which could be very big) has to be done twice.
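A sketch of that shadow-buffer approach, assuming 64-byte chunks (an arbitrary granularity; every name here is illustrative):

```cpp
#include <cstddef>
#include <cstring>

// Diff the live source against a private shadow copy; propagate only the
// changed chunks to the destination and refresh the shadow as we go.
// Note the double write the post mentions: once to shadow, once to dst.
void shadow_sync(const unsigned char* src, unsigned char* shadow,
                 unsigned char* dst, size_t bytes) {
    const size_t kChunk = 64;
    for (size_t off = 0; off < bytes; off += kChunk) {
        size_t n = bytes - off < kChunk ? bytes - off : kChunk;
        if (std::memcmp(src + off, shadow + off, n) != 0) {
            std::memcpy(shadow + off, src + off, n);  // refresh shadow
            std::memcpy(dst + off, src + off, n);     // propagate change
        }
    }
}
```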

Maybe you can use threading, i.e. one thread to check a 64kb block while another one is copying a 'dirty' one?

It probably won't bring much on single-core, but it might help on dual(/quad)-core.

You'd have to use some locking mechanism, though, to protect against reading and writing the same block (but since you have lots of 64kb blocks, that shouldn't be so problematic, I guess).
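A rough two-thread sketch of that pipeline, under the assumption that per-block atomic flags are enough "locking" for disjoint 64kb blocks (all names and sizes are illustrative):

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// One thread scans blocks and flags dirty ones; a second thread waits for
// each block's verdict and copies the dirty ones. The scanner only ever
// moves forward, so the copier never touches a block still being scanned.
void parallel_diff_copy(const unsigned char* src, unsigned char* dst,
                        size_t bytes) {
    const size_t kBlock = 64 * 1024;
    size_t blocks = (bytes + kBlock - 1) / kBlock;
    std::vector<std::atomic<int>> state(blocks);  // 0=unknown 1=dirty 2=clean
    for (auto& s : state) s.store(0);

    std::thread scanner([&] {
        for (size_t b = 0; b < blocks; ++b) {
            size_t off = b * kBlock;
            size_t n = bytes - off < kBlock ? bytes - off : kBlock;
            bool dirty = std::memcmp(src + off, dst + off, n) != 0;
            state[b].store(dirty ? 1 : 2);
        }
    });
    std::thread copier([&] {
        for (size_t b = 0; b < blocks; ++b) {
            int s;
            while ((s = state[b].load()) == 0) std::this_thread::yield();
            if (s == 1) {
                size_t off = b * kBlock;
                size_t n = bytes - off < kBlock ? bytes - off : kBlock;
                std::memcpy(dst + off, src + off, n);
            }
        }
    });
    scanner.join();
    copier.join();
}
```

Compile with -pthread on gcc/clang. The busy-wait is crude; a real version would queue dirty block indices instead.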

This'll be pretty theoretical, but oh well. On your machine you can read 3200 MB in the same time as you can write 1300 MB, right? So our relative times are:

read = 1.00
write = 2.46

So, assuming the compare is negligible (it won't be your bottleneck):

non-checked copy = 3.46
checked copy = 2.00 - 4.46

But that's assuming a fairly naive copy. So it might be worth checking if generally less than half your data is modified. What I'd probably do is pick a chunk size and then go through comparing. The moment you find something modified, you just copy the rest of the chunk without checking. That'd reduce your worst case a bit. I dunno, just a thought. If this is important to you, try several things and get some good performance measurements on them.
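The chunked compare-then-copy described above might look something like this (the 4096-byte chunk size is an arbitrary guess; memcmp already bails at the first mismatch, after which the whole chunk is copied without further checking):

```cpp
#include <cstddef>
#include <cstring>

// Walk fixed-size chunks; as soon as a chunk differs, copy it wholesale
// rather than continuing to compare byte by byte.
void diff_copy(const unsigned char* src, unsigned char* dst, size_t bytes) {
    const size_t kChunk = 4096;
    size_t off = 0;
    while (off < bytes) {
        size_t n = bytes - off < kChunk ? bytes - off : kChunk;
        if (std::memcmp(src + off, dst + off, n) != 0)
            std::memcpy(dst + off, src + off, n);  // chunk changed: copy it
        off += n;
    }
}
```

As the cost arguments above suggest, this only wins over plain memcpy() when enough chunks turn out to be unchanged; measure before committing to it.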

Quote:
I tried just creating a .asm file in my project, plus a .h and a .cpp, adding the relevant code to each of them, and then I added the custom build step for nasm.. What else do I have to do?

That's all you ought to have to do.
These linker errors indicate that the C++ compiler is mangling the names, but the asm code expects C linkage. In my project, the relevant declarations are wrapped in extern "C" {}. Try changing all extern declarations in the C++ code/header portions to extern "C".
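For reference, the header declarations would look like this with C linkage (the function names are taken from the linker errors above; the parameter names are illustrative):

```cpp
// Wrapping the declarations in extern "C" stops the C++ compiler from
// mangling the symbol names, so they match what the NASM code exports.
#ifdef __cplusplus
extern "C" {
#endif

void ia32_asm_init(void);
bool ia32_cpuid(unsigned int func, unsigned int* regs);

#ifdef __cplusplus
}
#endif
```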

Yes, sorry.. When I looked at your code, I saw extern void funcname, so I thought it was in order.. It worked great once I changed that to extern "C" void funcname.. And it did speed things up quite a bit.. I don't have numbers, but previously a copy of 32 MB was noticeable; now I can easily get away with 64 MB..

I keep track of which chunks are used (it's a pool, as I said).. I only need to copy the used ones.. It's a separate list of ints (one int per chunk), with one bit set if the chunk is used and the rest used for the size of the data in the chunk (actually, how many chunks belong to this block).. I have to copy that list, but after that, I could walk through it and see which chunks I need to copy.. In a worst case, I would lose quite some performance by doing that, but on the other hand I could skip lots of chunks.. Another thing I can do is keep track of the last chunk used.. That should be a fairly easy and fast thing to do..
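Walking that usage list could look roughly like this (the bit layout is assumed from the description above: one int per 64-byte chunk, low bit = used; everything here is illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy only the chunks whose usage entry has the "used" bit set.
void copy_used_chunks(const unsigned char* src, unsigned char* dst,
                      const uint32_t* info, size_t chunkCount) {
    const size_t kChunk = 64;
    for (size_t i = 0; i < chunkCount; ++i) {
        if (info[i] & 1u)  // low bit = chunk in use (assumed layout)
            std::memcpy(dst + i * kChunk, src + i * kChunk, kChunk);
    }
}
```

A real version might also merge runs of consecutive used chunks into one larger memcpy() to cut per-chunk overhead.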

Quote:
Yes, sorry.. When I looked at your code, I saw extern void funcname, so I thought it was in order.. It worked great once I changed that to extern "C" void funcname..

Indeed, that's what it should have been in the article. I've updated it; thanks!

Quote:
And it did speed things up quite a bit.. I don't have numbers, but previously a copy of 32 MB was noticeable; now I can easily get away with 64 MB..

Excellent! BTW, be sure to call ia32_init - otherwise the codepaths that provide really big gains will remain disabled.

