There are complex routines out there to make multiple copies of large amounts of data.
If you are copying only kilobytes at a time, a faster memory-to-memory process won't be worth the effort. If you are copying many megabytes at a time, and doing it often enough to justify the investment in human time, it can become worth the cost.
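If you are not sure which side of that line you are on, measure it. Here is a rough sketch, not a proper benchmark: time_copy_us is a hypothetical helper name, it times a single std::memcpy, and a real measurement would repeat the copy many times and use your actual data.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies `bytes` bytes once with std::memcpy and
// returns the elapsed wall-clock time in microseconds.
double time_copy_us(std::size_t bytes) {
    if (bytes == 0) {
        return 0.0;  // nothing to copy, nothing to time
    }

    std::vector<unsigned char> src(bytes, 0xAB);
    std::vector<unsigned char> dst(bytes, 0x00);

    const auto start = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), bytes);
    const auto stop = std::chrono::steady_clock::now();

    // Read one copied byte so the compiler cannot discard the copy.
    volatile unsigned char sink = dst[bytes - 1];
    (void)sink;

    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

Call it with something like 4 * 1024 and then 64 * 1024 * 1024, and compare the results against how often you make the copy per frame. That tells you whether there is anything worth optimizing.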
You also write that it "doesn't matter" that your threads can have inconsistent values. Unfortunately for you, when it comes to hardware design, cache consistency and coherency are very important. The system has rules it will follow even if you don't care about them. That is a good thing, because without those automatic rules writing code on multiprocessing systems would be even harder than it already is.
For a small, "normal sized" structure, what you have will work well. The structure does have to make a trip through memory, but a small object fits inside the CPU cache and is probably already there, and the processor's cache coherency system needs to know about the changes anyway. So the naive approach of just making direct copies is probably also the best, both for performance and for programmer effort.
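As a minimal sketch of what "direct copy" means here: Settings is a made-up stand-in for your structure, and copy_direct and copy_memcpy are hypothetical names used only for illustration. For a trivially copyable type, plain assignment and std::memcpy do the same job, and compilers will typically turn either into a few register moves for something this small.

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical small structure standing in for the one in the question.
struct Settings {
    int   width;
    int   height;
    float gamma;
    bool  fullscreen;
};

// std::memcpy is only valid for trivially copyable types, so check it.
static_assert(std::is_trivially_copyable<Settings>::value,
              "Settings must be trivially copyable to use memcpy");

// The naive way: plain copy-initialization.
Settings copy_direct(const Settings& src) {
    Settings dst = src;
    return dst;
}

// The same copy spelled out as a raw byte copy.
Settings copy_memcpy(const Settings& src) {
    Settings dst;
    std::memcpy(&dst, &src, sizeof dst);
    return dst;
}
```

Neither version makes the copy atomic with respect to other threads. If another thread can write the source while you copy it, you still need a lock or some other synchronization.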
The direct copy in your example is nice because the hardware is designed with that kind of normal use in mind. Certain paths through the hardware are fast and convenient, others require trips through slower hardware, or across physical devices on the board, which can be incredibly slow relative to on-die operations.
There are some architectures that allow a direct memory-to-memory transfer, usually called DMA, without involving the CPU. It takes a bit of time and processing to set up, but for large enough data structures it can be worth it. Usually this is done between devices: copy a large texture from main memory to graphics memory, transfer to/from disk controller memory, transfer to/from network cards, transfer to/from audio cards, transfer to/from USB devices, and so on. If you do a DMA transfer from main memory to main memory there are some costs, like ensuring the CPU caches are invalidated as the copies are made.
This is lower-level programming, done either in assembly or through an exposed library. It is not something you find exposed in any of the major programming languages. The technology exists on Cell processors and on Intel's I/OAT-supporting chips. There is also the x86 family's system bus, although these days bus transfers are usually relatively slow unless you are using one of the very specific transfer routes like HyperTransport with its memory-to-device channels.
It is not a process you would use to copy a structure or a class, but it is something you would use to copy certain very large blocks from one location to another.
Edited by frob, 16 May 2014 - 02:33 PM.