Most efficient way to write identical data to 2 memory locations simultaneously

5 comments, last by Nahrix 9 years, 11 months ago

I have a thread that writes to a struct, and it needs access to that struct all the time. I don't want to use a mutex to allow other threads to access this struct, because I don't want the first thread to ever have to wait for access to the data in the struct.

My solution is to write a copy of the struct every time the thread modifies the data, so other threads can access the copy of the (now potentially outdated, but that doesn't matter) struct data without interrupting the primary thread. If the primary thread needs to read the data, it references its own copy. If it needs to write the data, it writes to its own struct, and only updates the copy if that copy isn't currently being read from (otherwise it skips that step). I'd like to make this as efficient as possible, so I was wondering if there is a way to write to 2 identical structs that's more efficient than simply writing, for example:

struct1.data1 = newdata;

struct2.data1 = newdata;

assuming that you know the data is identical, and both are being written at the exact same time.

There are complex routines out there to make multiple copies of large amounts of data.

If you are talking about copies of kilobytes at a time, it won't be worth the effort of a fancier memory-to-memory process. If you are talking about copying many megabytes at a time, and doing it often enough that it justifies the investment in human time, it can become worth the cost.

You also write that it "doesn't matter" that your threads can have inconsistent values. Unfortunately for you, when it comes to hardware design, cache consistency and coherency are very important. The system has rules it will follow even if you don't care about them. It is a good thing, because without those automatic rules writing code on multiprocessing systems would be even harder than it already is.


For a small, "normal sized" structure, what you have will work well. It requires that the structure make a trip through memory, but small objects fit inside the cache, will probably already be resident in the CPU cache, and the processor's cache coherency system needs to know about the changes anyway. So doing it the naive way, just making direct copies, is also probably the best, both for performance and for programmer effort.
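To make the "naive direct copy" concrete, here is a minimal sketch; the struct, its fields, and the `snapshot` helper are made-up names for illustration, not from the thread:

```cpp
#include <cassert>

// Hypothetical small struct; the type and its fields are illustrative.
struct GameState {
    int health;
    float x, y;
};

// "Naive" copy: plain struct assignment/return. For a small
// trivially-copyable type the compiler emits a handful of loads and
// stores (or a short memcpy), which is about as fast as this can get.
GameState snapshot(const GameState& src) {
    return src;
}
```

A call like `GameState shared = snapshot(primary);` is all the copy step needs to be; there is no cheaper way to get the bytes into the second location for a struct this size.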

The direct copy in your example is nice because the hardware is designed with that kind of normal use in mind. Certain paths through the hardware are fast and convenient, others require trips through slower hardware, or across physical devices on the board, which can be incredibly slow relative to on-die operations.

There are some architectures that allow a direct memory-to-memory transfer without involving the CPU. It takes a bit of time and processing to set up, but for large enough data structures it can be worth it. Usually this is done between devices: copy a large texture from main memory to graphics memory, transfer to/from disk controller memory, transfer to/from network cards, transfer to/from audio cards, transfer to/from USB devices, and so on. If you do a DMA transfer from main memory to main memory there are some costs, like ensuring the CPU caches are invalidated as the copies are made.

It is lower-level programming, either in assembly or through an exposed library. It is not something you find exposed in any of the major programming languages. The technology exists on Cell processors and on Intel's I/OAT-supporting chips. There is also the x86 family's system bus, although these days bus transfers are usually relatively slow unless you are using one of the very specific transfer routes, like HyperTransport with its memory-to-device channels.

It is not a process you would use to copy a structure or a class, but it is something you would use to copy certain very large blocks from one location to another.

You also write that it "doesn't matter" that your threads can have inconsistent values. Unfortunately for you, when it comes to hardware design, cache consistency and coherency are very important. The system has rules it will follow even if you don't care about them. It is a good thing, because without those automatic rules writing code on multiprocessing systems would be even harder than it already is.

Sorry, I didn't mean it that way. Let me clarify with an example stripped to the bones:

Let's say you have 2 ints, int1 and int2, and 2 threads, thread1 and thread2. int1 and int2 are both 0 by default.

At some point, thread1 takes int1, adds 1 to it, and saves the value to both int1 and int2. int2 will also be accessed by thread2 at some point, so the write to int2 is done under a mutex.

Later, thread1 reads int1 (1), adds 1 to it (2), and saves it back to int1. Thread1 notices that int2 is being read from, so it skips the attempt to write to it (it doesn't wait on the mutex, it just moves on to whatever it wants to do next).

What I'm saying is that, specific to the code I'm writing, it logically doesn't matter to me that thread2 is now reading 1 from int2 while the most current version of int1 reads 2. I'm not talking about race conditions or anything; the logic in my code just doesn't care that the two threads see 2 different values at this point in time.
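That "publish only if no reader holds the lock" scheme can be sketched with std::mutex::try_lock; the names (Shared, step, value, published) are mine, not from the thread:

```cpp
#include <mutex>

// Sketch of the skip-the-publish-if-a-reader-is-busy idea.
struct Shared {
    int value = 0;       // thread1's private copy (int1): never contended
    int published = 0;   // the copy readers see (int2)
    std::mutex m;        // guards `published`
};

// Called only from the writer thread (thread1).
void step(Shared& s) {
    s.value += 1;                 // always update the private copy
    if (s.m.try_lock()) {         // publish only if no reader holds the lock
        s.published = s.value;
        s.m.unlock();
    }                             // otherwise skip: readers keep a stale value
}
```

try_lock never blocks, so the writer never waits; readers lock `m` normally around their reads of `published` and may simply observe a value one or more steps behind.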

The copy will be happening on large blocks of data hundreds of times per second, so this is an area I'd be willing to find an efficient solution to, even if it's something very complex.


The copy will be happening on large blocks of data hundreds of times per second

Define 'large'? Hundreds of times a second isn't very fast - we are potentially talking 10 milliseconds between copies, which is quite a lot of CPU time...

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

This sounds similar to the problem of drawing the screen bitmap in one thread and transferring that to the graphics card in another.

When you don't want either thread to have to wait for the other, then the general solution is to use triple-buffering.

Have a read of this and see if it might be what you are looking for:

https://en.wikipedia.org/wiki/Multiple_buffering

https://en.wikipedia.org/wiki/Triple_buffering#Triple_buffering

"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms


The copy will be happening on large blocks of data hundreds of times per second

Define 'large'? Hundreds of times a second isn't very fast - we are potentially talking 10 milliseconds between copies, which is quite a lot of CPU time...

About 10 megabytes.
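For scale (my numbers, not from the thread), 10 MB copied a couple of hundred times per second is already multi-GB/s of memory traffic, a sizeable fraction of typical DRAM bandwidth:

```cpp
// Back-of-envelope bandwidth estimate; the copy rate is an assumption
// chosen to match "hundreds of times per second".
constexpr double kBlockMB     = 10.0;   // size of each copy
constexpr double kCopiesPerSec = 200;   // assumed rate
constexpr double kGBPerSec = kBlockMB * kCopiesPerSec / 1000.0;  // GB/s written
// A copy also reads the source, so total memory traffic is roughly double this.
```

At these numbers the copy alone writes 2 GB/s (and reads another 2 GB/s), which is why avoiding redundant copies matters more here than micro-optimizing the copy loop itself.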

This sounds similar to the problem of drawing the screen bitmap in one thread and transferring that to the graphics card in another.

When you don't want either thread to have to wait for the other, then the general solution is to use triple-buffering.

Have a read of this and see if it might be what you are looking for:

https://en.wikipedia.org/wiki/Multiple_buffering

https://en.wikipedia.org/wiki/Triple_buffering#Triple_buffering

This analogy is close enough that the solution actually fits really well! I can just have 2 structs that the primary thread alternates writing to, and it can stick to one of them while the other thread is busy reading from the other. I only have to write a single time per loop, too! Thanks!
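The buffer-swapping idea from the links above can be sketched as a lock-free triple buffer; this is a sketch under assumptions (single writer, single reader; the class and member names are mine):

```cpp
#include <array>
#include <atomic>

// Lock-free triple buffer: the writer and the reader each own one buffer
// outright and exchange buffers through a shared "middle" slot, so neither
// thread ever blocks the other.
template <typename T>
class TripleBuffer {
    std::array<T, 3> bufs{};
    // Bits 0-1: index of the buffer sitting in the middle slot.
    // Bit 2 (FRESH): set when the middle buffer holds unread data.
    std::atomic<int> middle{1};
    int writeIdx = 0;                 // owned by the writer thread
    int readIdx  = 2;                 // owned by the reader thread
    static constexpr int FRESH = 4;

public:
    // Writer: fill this buffer, then call publish().
    T& writeSlot() { return bufs[writeIdx]; }

    void publish() {
        // Hand the just-written buffer to the middle slot; take whatever
        // was there (a buffer the reader has already abandoned).
        int old = middle.exchange(writeIdx | FRESH, std::memory_order_acq_rel);
        writeIdx = old & 3;
    }

    // Reader: returns the most recently published buffer, or the same
    // buffer as last time if nothing new has been published.
    const T& readSlot() {
        if (middle.load(std::memory_order_acquire) & FRESH) {
            int old = middle.exchange(readIdx, std::memory_order_acq_rel);
            readIdx = old & 3;
        }
        return bufs[readIdx];
    }
};
```

With three buffers (rather than the two described above) the writer can always publish immediately, even if the reader is mid-read; the exchange on `middle` is the only synchronization point, so for 10 MB blocks each side pays only its own copy cost.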

This topic is closed to new replies.
