
## Recommended Posts

Hi guys,

If I have a large array that goes through a heavy computational process, is it safe to have threads work on different sections of this array without locking it?

For example have one thread work on the top half and another work on the bottom half. As long as I can indicate to the main program that both threads have completed processing, there shouldn't be any need to lock the array. Right?
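For reference, the pattern being described looks something like this in C++ (the names and the per-element "computation" here are made up for illustration):

```cpp
#include <thread>
#include <vector>
#include <functional>
#include <cstddef>

// Stand-in for the heavy per-element computation: square each element.
void process_range(std::vector<float>& data, size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i)
        data[i] *= data[i];
}

// Two threads, disjoint halves -- no element is ever touched by both.
void process_in_halves(std::vector<float>& data) {
    size_t half = data.size() / 2;
    std::thread top(process_range, std::ref(data), size_t(0), half);
    std::thread bottom(process_range, std::ref(data), half, data.size());
    // join() is the "indicate to the main program that both threads have
    // completed" part: once both return, every write the workers made is
    // visible to the joining thread.
    top.join();
    bottom.join();
}
```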

##### Share on other sites

Interesting.

So you'd be better off copying the array to separate arrays to ensure this doesn't happen?

Something like this?

```
|               MainArray[1024]               |
|  SubArrayA[0..511]   |  SubArrayB[512..1023] |
```


And then merge the two after calculation has completed again?

Edited by DarkRonin

##### Share on other sites

Copying the array means either one of two things:
A) Your array is so small that copying is cheap; if it's that small, you shouldn't be multithreading it anyway.
or B) The array is large, but you're taking the easy way out instead of understanding the problem (which sometimes is the economic thing to do, if you don't have time to waste doing the research).

If you are going to take the easy way out, the easiest way is leaving it singlethreaded. But if you are wanting to multithread it, you can do so without splitting up the array.

Another easy and safe way to split up the workload is to use a single thread and simply do half the array on odd frames and the other half on even frames, depending on the nature of your problem.
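A minimal sketch of that frame-splitting idea (the names and the per-element work are made up for illustration):

```cpp
#include <vector>
#include <cstddef>

// Single-threaded alternative: spread the work across frames instead of
// cores -- even frames update the first half, odd frames the second.
// "times 2" is a stand-in for the real per-element computation.
void update_half(std::vector<float>& data, unsigned frame) {
    size_t half  = data.size() / 2;
    size_t begin = (frame % 2 == 0) ? 0 : half;
    size_t end   = (frame % 2 == 0) ? half : data.size();
    for (size_t i = begin; i < end; ++i)
        data[i] *= 2.0f;
}
```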

Generally, yes it will be ok... but technically it's unsafe, and you need to be aware of the actual compiler/hardware behavior to justify it as being safe to do.

To get into specifics... if you're compiling for a PC (x86 / x86-64), then the CPU can atomically write 4-byte sized variables.
If the elements within your array have sizeof/alignof >= 4, then you'll be ok.

If the elements within your array have sizeof < 4, then you're in trouble. For example, say you've got an array of six unsigned shorts, shared by two threads:

```
| thread 1  | thread 2  |
|u16|u16|u16|u16|u16|u16|
|block 1|block 2|block 3|
```
As far as the CPU is concerned, it can only read or write to these 4-byte blocks, labelled block 1/2/3 above.
So if thread#1 tries to write to index [2] and thread 2 tries to write to index [3] at the same time, then both threads will be performing a read-modify-write operation on "block 2", which is a race condition.
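One way to sidestep that, assuming the array itself starts on a block boundary and the element size divides the block size, is to round the split point up to a block boundary (a sketch; the function name is made up -- pass 4 for the x86 atomic-write blocks above, or 64 to also dodge cache-line sharing):

```cpp
#include <cstddef>

// Pick a split index near the midpoint such that the two halves never
// share the same naturally-aligned block of `block_size` bytes.
// Assumes elem_size divides block_size, and that the array's first
// element sits on a block boundary.
size_t split_index(size_t count, size_t elem_size, size_t block_size) {
    size_t elems_per_block = block_size / elem_size;
    size_t mid = count / 2;
    // round the midpoint up to the next block boundary
    size_t split = ((mid + elems_per_block - 1) / elems_per_block) * elems_per_block;
    return split > count ? count : split;
}
```

With the six-u16 example above, this splits at index 4, so thread 1 gets blocks 1 and 2 entirely to itself and thread 2 gets block 3.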

You're talking about making each variable word-aligned. But isn't another issue ensuring that two threads aren't operating on the same cache-line at the same time as well?

I'm a noob when it comes to multi-threading (i.e. I read about it but have never played with it), so correct me if I'm wrong here. Shouldn't each workload be cache-aligned if there is a chance that one of the two threads will write to the same cache line the other thread wants to read from? Otherwise the memory has to be repeatedly shuffled from CoreA's L1 up to L2, L3, or even main RAM, and then back down to CoreB's L1 (assuming the hardware architecture means they aren't sharing the same cache, e.g. the two threads aren't hyperthreads on the same core).

Wouldn't that shoot your performance in the foot? i.e. if your elements aren't word-aligned, you can have undefined behavior; but even with word-aligned elements, if your thread workloads aren't cache-aligned, you might have unexpected and hard-to-detect slowdowns. I think.  :ph34r:

##### Share on other sites

Copying the array means either one of two things:
A) Your array is so small that copying is cheap; if it's that small, you shouldn't be multithreading it anyway.

Seems a bit general. Wouldn't this depend on what the calculations are that I am doing on the data?

##### Share on other sites

You're talking about making each variable word-aligned. But isn't another issue ensuring that two threads aren't operating on the same cache-line at the same time as well?

Wouldn't that shoot your performance in the foot?

Yes

- making the data word-aligned affects program correctness (avoiding race conditions).

- making the data cache-line-aligned affects program performance (false-sharing could be a performance pitfall).

I didn't mention it because it's a performance optimization issue, not an "is it safe" issue :)

Generally, if your array is large enough and both threads are advancing linearly, by the time that thread#1 gets to the end of its range, thread#2 will also be near the end of its range, so it's very unlikely that both threads will write to this shared cache line at the same time anyway.

##### Share on other sites

Copying the array means either one of two things:
A) Your array is so small that copying is cheap; if it's that small, you shouldn't be multithreading it anyway.

Seems a bit general. Wouldn't this depend on what the calculations are that I am doing on the data?

True; I was assuming small element sizes since we were already discussing word-alignment.  :)

##### Share on other sites
Doing a quick read up on cache lines, it appears that Intel architecture cache lines are 64 bytes. So if the data size is divisible by 64, like a 1024-byte array, this should avoid performance hits, shouldn't it?

##### Share on other sites

Copying the array means either one of two things: A) Your array is so small that copying is cheap; if it's that small, you shouldn't be multithreading it anyway. or B) The array is large, but you're taking the easy way out instead of understanding the problem (which sometimes is the economic thing to do, if you don't have time to waste doing the research).

Or the array is small and cheap to copy, but the computation you're performing is expensive and makes sense to spread over multiple threads...

##### Share on other sites

Doing a quick read up on cache lines, it appears that Intel architecture cache lines are 64 byte. So if the data is divisible by 64 like a 1024 byte array, this should avoid performance hits shouldn't it?

Under the assumption that your data actually starts at a cache-line boundary, I think so.

```cpp
struct Holder { char x;  char manyx[1024]; };  // manyx starts at offset 1,
                                               // so it can't begin on a cache-line boundary
```
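A self-contained way to check that claim (the struct and function names are illustrative):

```cpp
#include <cstddef>

struct Holder { char x; char manyx[1024]; };

// manyx sits at offset 1: even if the struct itself is allocated on a
// 64-byte boundary, the array inside it cannot be cache-line-aligned.
size_t manyx_offset() { return offsetof(Holder, manyx); }
```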

##### Share on other sites

Doing a quick read up on cache lines, it appears that Intel architecture cache lines are 64 byte. So if the data is divisible by 64 like a 1024 byte array, this should avoid performance hits shouldn't it?

Under the assumption that your data actually starts at a cache-line boundary, I think so.

```cpp
struct Holder { char x;  char manyx[1024]; };
```

Is there a way to ensure that the data starts at the cache line?

Can a 64 byte char array start in the middle of a cache line and overlap the next?

##### Share on other sites

Copying the array means either one of two things: A) Your array is so small that copying is cheap; if it's that small, you shouldn't be multithreading it anyway. or B) The array is large, but you're taking the easy way out instead of understanding the problem (which sometimes is the economic thing to do, if you don't have time to waste doing the research).

Or the array is small and cheap to copy, but the computation you're performing is expensive and makes sense to spread over multiple threads...

That was mentioned previously; I conceded that point already.  :wink:

Doing a quick read up on cache lines, it appears that Intel architecture cache lines are 64 byte. So if the data is divisible by 64 like a 1024 byte array, this should avoid performance hits shouldn't it?

Again, noob here. From what I understand, whatever size the array is, regardless of whether it's divisible into cachelines or not, you should be fine in avoiding the speed-trap I mentioned as long as two threads aren't accessing the same cacheline (with at least one of them writing) at around the same time, and as long as you aren't shuffling the same cacheline back and forth between the two threads.

Basically, if you are giving half of your array to different threads, you probably want your dividing point to be aligned to a cache-line, even if it means giving one thread a few more elements of data (and of course, without cutting an element in half). So if your array happens to be (roughly) 3 cachelines, give two to one thread and one to the other, rather than 1.5 to both. (I say 'probably' because, yes, if you are operating on a small amount of data but doing a large amount of processing, that certainly changes how the code ought to be written.)

If whatever chunks you give your threads don't contain overlapping cacheline segments that they are simultaneously trying to access, and other unrelated threads in your program aren't trying to access the same cachelines, you're fine. By "access": if every thread is only reading, you're fine, but if even one thread is writing, the reads of the other threads need to re-cache the written-to values (which means re-copying the entire cache-line up and then down the caches).

But if you ping-pong the cache, that can be slow too. Even if ThreadA and ThreadB aren't accessing the cache at the same time, if they alternate who accesses it, each alternation (if a write occurred) requires syncing the memory between their different L1 caches (if they are using different L1 caches).

```
ThreadA writes to [line of bytes]
ThreadB writes to [line of bytes] // forces re-syncing the memory up and then down the caches
ThreadA writes to [line of bytes] // forces re-syncing the memory up and then down the caches
ThreadB writes to [line of bytes] // forces re-syncing the memory up and then down the caches
```

It's (conceptually) faster to do:

```
ThreadA writes to [line of bytes]
ThreadA writes to [line of bytes]
ThreadB writes to [line of bytes] // forces re-syncing the memory up and then down the caches
ThreadB writes to [line of bytes]
```

But if a lot of other memory is being accessed in between accesses, it doesn't matter, since the line has likely been pushed out of the L1 or even L2 caches anyway.
(which is where the obligatory "profile profile profile" comes in)

tldr:
A) Each time a thread wants to access [cacheline #23424], if it's not already in the local L1 cache, it has to go out to the L2 cache, and then main RAM (or L3 if it exists, then main RAM). So accessing a cacheline a second time while it's still in the L1 cache is faster than accessing it when it's not (which is the point of the cache).

B) If you have the same cacheline copied into two different L1 caches, and one of those copies gets written to, then the other copy will have to get sync'd before your next read or write goes through, and that can be slow. Once or twice is no biggy, but I'd suspect that if it's happening once after divvying up tasks for threads, it's likely happening more than once, and it'd be better to guarantee cacheline exclusivity.

##### Share on other sites

Can a 64 byte char array start in the middle of a cache line and overlap the next?

Sure. The compiler usually word-aligns variables for you, in structs and so on, but doesn't cache-align it unless you ask.
(with char arrays, it may be a little different, since chars are often used as raw bytes)

Is there a way to ensure that the data starts at the cache line?

If you have your own allocator, you can align it to whatever you want (even bad alignments that the CPU chokes on).

But also, look at alignas().
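For example, a sketch using alignas on a wrapper type (the type name is made up; note that this is only guaranteed for stack/static storage in current C++, not for plain new):

```cpp
#include <cstddef>

// alignas() on the type guarantees its start address is a multiple of 64,
// matching the packed-floats use case discussed above.
struct alignas(64) FloatChunk {
    float values[16]; // 16 floats * 4 bytes = exactly one 64-byte line
};

static_assert(alignof(FloatChunk) == 64, "type is cache-line aligned");
static_assert(sizeof(FloatChunk) == 64, "fills the line with no padding");
```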

##### Share on other sites
Thanks for that SotL.

How do you ensure that your data is in fact starting at the start of a cache line? Or is this just an automatic thing? What if other unrelated variables are in that cache? Does the array we want to work with append to the end? Or is the cache flushed and the array start at the beginning of the cache line regardless?

Edit

You just ninja'd me with your second post. I'll take a look at alignas(). :)

In my case I am dealing with packed floats. It will be pretty easy to ensure that I am dealing with 64 byte chunks, so alignment won't be too much of an issue in this particular case.

Just more worried to ensure that this 64 byte chunk is actually starting at the start of a cache line. Which I'm not sure how to guarantee, yet.

##### Share on other sites

In this particular case it will be a huge array that is going to have three or four sets of matrix calculations per element performed on it. So it will be quite intensive.

##### Share on other sites
The classical way to ensure alignment when using a simpler allocator is to over-allocate and manually align.

e.g., to allocate on a 64-byte boundary, allocate an additional 63 bytes and then align your allocation forward as needed.

```cpp
#include <cstdlib>
#include <cstdint>

static const size_t CACHELINE_SIZE = 64; // must be a power of two!
static const size_t CACHELINE_MASK = CACHELINE_SIZE - 1;

void* allocate_aligned(size_t size) {
    // allocate excess space for alignment
    char* block = static_cast<char*>(malloc(size + CACHELINE_MASK));
    if (!block)
        return nullptr;

    // align the allocation forward within the padded space, if necessary
    uintptr_t address = reinterpret_cast<uintptr_t>(block);
    if ((address & CACHELINE_MASK) != 0)
        block += CACHELINE_SIZE - (address & CACHELINE_MASK);
    return block;
}

void deallocate_aligned(void* block) {
    // WARNING: broken as-is! free() must be given the exact pointer that
    // malloc returned, but we handed out the aligned pointer and didn't
    // keep the original. A real version stores the original pointer (or
    // the offset) alongside the allocation -- see malloc16 below for one
    // way to do that.
    free(block);
}
```


There are also platform-specific functions to do this for you, e.g. posix_memalign or _aligned_malloc. Be aware that for portability you need to ensure that any memory you allocate with an aligned allocate is deallocated with the corresponding deallocator (e.g. _aligned_free).
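A hedged sketch of the POSIX route (the helper name is made up; the alignment must be a power of two and a multiple of sizeof(void*)):

```cpp
#include <stdlib.h>
#include <cstddef>

// POSIX-only: posix_memalign returns memory whose address is a multiple
// of the requested alignment. Unlike _aligned_malloc on Windows, memory
// from posix_memalign is released with plain free().
float* alloc_floats_aligned(size_t count) {
    void* p = nullptr;
    if (posix_memalign(&p, 64, count * sizeof(float)) != 0)
        return nullptr; // allocation failed or bad alignment
    return static_cast<float*>(p);
}
```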

The alignas attribute _only_ works for stack-allocated values in the current version of C++. The next revision will support over-aligned types with new/delete via some new overloads, but that is not yet supported by any released compiler iirc.

##### Share on other sites

Arrays are generally aligned through means of an allocator, usually on the heap upon some kind of initialization. Like already discussed here you want threads touching data on different cache lines to avoid false sharing. With thread pool implementations it can also be smart to align the thread context/ID to separate cache lines as well.

Another one for funsies (16 byte alignment):

```cpp
#include <cstdlib>
#include <cstddef>

void* malloc16( size_t size )
{
    void* p = malloc( size + 16 );
    if ( !p ) return 0;
    unsigned char offset = (size_t)p & 15;
    // round forward to the next 16-byte boundary (advances 1 to 16 bytes)
    p = (char*)(((size_t)p + 16) & ~(size_t)15);
    // stash the adjustment in the byte just before the returned block
    *((unsigned char*)p - 1) = 16 - offset;
    return p;
}

void free16( void* p )
{
    if ( !p ) return;
    // recover the original malloc'd pointer from the stashed offset
    free( (char*)p - (size_t)*((unsigned char*)p - 1) );
}
```

Edited by Randy Gaul

##### Share on other sites

You don't need to align to cache-line sizes; the memory system is quite happy handling one core writing to a byte of a cache line "simultaneously" with a core writing to a different byte in the same line -- the intercore hyperbus will just transfer cacheline ownership around, introducing wait-states as appropriate.

(You can't guarantee the order of the writes, so they can't overlap, but they will both happen).

You DO want to avoid doing this all the time. (Because of the overhead of the hyperbus communications mentioned above).

So just don't thread finely in memory -- if you have four threads, don't send x[0] to one thread and x[1] to the next. Partition your array and send x[0-99] to one thread and x[100-199] to the next.

Don't over-worry about the cacheline sharing at the borders -- they're a small percentage of the total problem and cacheline ownership changes are designed to be very fast.

The rule to apply here is "don't thread finely in memory all the time".
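That coarse partitioning can be sketched like so (the names are illustrative); only the single cache line straddling each border can ever be shared between two threads:

```cpp
#include <vector>
#include <cstddef>

struct Range { size_t begin, end; };

// One contiguous range per thread: x[0-99] to one, x[100-199] to the next,
// rather than interleaving neighboring elements across threads.
std::vector<Range> partition(size_t count, size_t num_threads) {
    std::vector<Range> ranges;
    size_t chunk = (count + num_threads - 1) / num_threads; // round up
    for (size_t t = 0; t < num_threads; ++t) {
        size_t begin = t * chunk;
        size_t end = (begin + chunk < count) ? begin + chunk : count;
        if (begin < count)
            ranges.push_back({begin, end});
    }
    return ranges;
}
```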

##### Share on other sites

In this particular case it will be a huge array that is going to have three or four sets of matrix calculations per element performed on it. So it will be quite intensive.

I will try to weigh the performance benefits of your case against the possible trade-offs that could cancel them out.

1- The large array is processed in place: it gets written to at the same locations it is being read from (the most important benefit).

2- Instead of a single thread behaving this very friendly way, two threads behave the same way.

Considering option 2, along with the size of the array and the expense of the operation, you may benefit from literally copying two separate sub-arrays, letting two threads process them, and then writing them both back on your main thread. This somewhat more expensive copy-in/copy-out approach gives your main thread immediate independence, whether it is polling for the expected arrays or even still safely reading from the old array that is about to be updated.

Whether that is the most appropriate choice depends on the situation. Handing a thread a pointer into the shared array (a sub-array) is the other option, if you are sure there are no synchronization hazards (and you accept more issues on the main-thread side), but this operation is not trivial either: memory will still get copied around at least per-CPU in the L2 caches, RAM reads and writes must be kept from racing, and so on. The program's shared memory must satisfy certain invariants across the entire process for you to be safe.

Done well, the second option outperforms the first, though usually not by a large margin, while imposing the difficulty of walking on abstract thin ice throughout the application's development: its data, logic, and run-time.

You can try both approaches and measure the difference to decide which one to go for.

##### Share on other sites

Thanks for the advice so far, guys!

Getting some great performance by doing this.

Spreading the load across 8 cores is giving me around 6.5x more calculations (when timed for a one second duration).

Very nice!

##### Share on other sites

Copying the array means either one of two things: A) Your array is so small that copying is cheap; if it's that small, you shouldn't be multithreading it anyway. or B) The array is large, but you're taking the easy way out instead of understanding the problem (which sometimes is the economic thing to do, if you don't have time to waste doing the research).

Or the array is small and cheap to copy, but the computation you're performing is expensive and makes sense to spread over multiple threads...

No reasonable computation is that complex.
e.g. What could you possibly be doing that is worse than an FFT?

Edited by Shannon Barber