Copying the array means one of two things: A) your array is so small that copying is cheap, but if it's that small, you probably shouldn't be multithreading it anyway; or B) the array is large, but you're taking the easy way out instead of understanding the problem (which is sometimes the economical thing to do, if you don't have time to waste doing the research).
Or the array is small and cheap to copy, but the computation you're performing is expensive and makes sense to spread over multiple threads...
That was mentioned previously; I conceded that point already. :wink:
Doing a quick read-up on cache lines, it appears that Intel architecture cache lines are 64 bytes. So if the data size is divisible by 64, like a 1024-byte array, that should avoid the performance hit, shouldn't it?
Again, noob here. From what I understand, whatever size the array is, regardless of whether it's divisible by cachelines or not, as long as two threads aren't accessing the same cacheline around the same time (with at least one of them writing), and as long as you aren't shuffling the same cacheline back and forth between the two threads, you should be fine in avoiding the speed trap I mentioned.
Basically, if you are giving halves of your array to different threads, you probably want your dividing point to be aligned to a cache line, even if it means giving one thread a few more elements of data (and of course, without cutting an element in half). So if your array happens to be (roughly) 3 cachelines, give two to one thread and one to the other, rather than 1.5 to each. (I say 'probably' because yes, if you are operating on a small amount of data but doing a large amount of processing, that certainly changes how the code ought to be written.)
If whatever chunks you give your threads don't contain overlapping cacheline segments that they are simultaneously trying to access, and other unrelated threads in your program aren't trying to access the same cachelines, you're fine. By "access": if every thread is only reading, you're fine, but if even one thread is writing, the reads of the other threads need to re-fetch the written-to values (which means re-copying the entire cacheline up and then down the caches).
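To make that concrete, here's a rough sketch of what I mean by putting the dividing point on a cache-line boundary. This assumes 64-byte lines and a float array; `process` is just a hypothetical stand-in for whatever per-element work you're actually doing:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64; // 64-byte lines assumed, per the Intel note above

// Hypothetical per-element work; stand-in for whatever you're really doing.
void process(float* begin, float* end)
{
    for (float* p = begin; p != end; ++p)
        *p *= 2.0f;
}

void split_across_two_threads(std::vector<float>& data)
{
    constexpr std::size_t elemsPerLine = kCacheLine / sizeof(float); // 16 floats per line

    // Round the midpoint up to a whole cacheline's worth of elements: one
    // thread gets a few extra elements, but the two halves never touch the
    // same line (assuming the buffer itself starts on a line boundary).
    std::size_t mid = (data.size() / 2 + elemsPerLine - 1) / elemsPerLine * elemsPerLine;
    if (mid > data.size()) mid = data.size();

    std::thread a(process, data.data(), data.data() + mid);
    std::thread b(process, data.data() + mid, data.data() + data.size());
    a.join();
    b.join();
}
```

The only real point is the rounding of `mid`: one thread gets a slightly bigger chunk so that no cacheline straddles the split.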
But if you ping-pong the cacheline, that can be slow too. Even if ThreadA and ThreadB aren't accessing it at the same time, if they alternate who accesses it, each alternation (if a write occurred) requires syncing the memory between their different L1 caches (if they are using different L1 caches).
ThreadA writes to [line of bytes]
ThreadB writes to [line of bytes] //Forces re-syncing the memory up and then down the caches
ThreadA writes to [line of bytes] //Forces re-syncing the memory up and then down the caches
ThreadB writes to [line of bytes] //Forces re-syncing the memory up and then down the caches
It's (conceptually) faster to do:
ThreadA writes to [line of bytes]
ThreadA writes to [line of bytes]
ThreadB writes to [line of bytes] //Forces re-syncing the memory up and then down the caches
ThreadB writes to [line of bytes]
But if a lot of other memory is being accessed in between those accesses, it doesn't matter, since the line has likely been pushed out of the L1 or even L2 cache anyway.
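If you want to see that ping-pong for yourself, something like this should show it. This is just my own sketch (the 64-byte line size and the iteration count are assumptions): time `hammer` once with the packed struct and once with the padded one and compare.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Two counters packed into one cacheline: every write by one thread
// invalidates the other core's copy of that line (the ping-pong above).
struct SharedLine {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Same counters, but each forced onto its own (assumed 64-byte) line,
// so the two threads stop invalidating each other.
struct PaddedLines {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Hammer both counters from two threads; the iteration count is arbitrary.
template <typename Counters>
void hammer(Counters& c)
{
    std::thread ta([&] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread tb([&] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    ta.join();
    tb.join();
}
```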
(which is where the obligatory "profile profile profile" comes in)
tldr:
A) Each time a thread wants to access [cacheline #23424], if it's not already in the local L1 cache, it has to go out to the L2 cache, and then to main RAM (or to L3 if it exists, then main RAM). So accessing a cacheline a second time while it's still in the L1 cache is faster than accessing it when it's not in the L1 cache (which is the point of the cache).
You already know (A). What I'm trying to add is:
B) If you have the same cacheline copied into two different L1 caches, and one of those copies gets written to, then the other copy has to get sync'd before your next read or write goes through, and that can be slow. Once or twice is no biggie, but I'd suspect that if it's happening once after divvying up tasks for threads, it's likely happening more than once, and it'd be better to guarantee cacheline exclusivity.
So my inexperienced suggestion would be: when dividing up your thread *workloads* (regardless of whether you copy the memory or not), make sure the workloads fall on cache-line boundaries, and let those threads have exclusive access to those cachelines until they complete their work. This doesn't necessarily affect the allocated size of your array itself, but padding out the beginning/end of the array to ensure other threads aren't using that memory wouldn't be a bad idea.
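Something along these lines is what I have in mind for the per-thread workloads themselves; the struct name and fields are made up, and 64 is again the assumed line size:

```cpp
#include <cstddef>

// One descriptor per worker thread. The alignas(64) (assumed line size; in
// C++17 there's also std::hardware_destructive_interference_size in <new>)
// pads each workload onto its own cacheline, so a thread writing its own
// partialSum never dirties a line another thread is using.
struct alignas(64) ThreadWorkload {
    std::size_t begin = 0;   // first element this thread owns
    std::size_t end = 0;     // one past the last element it owns
    double partialSum = 0.0; // written only by the owning thread
};

ThreadWorkload workloads[4]; // each slot lands on its own 64-byte boundary
```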