Basic Multithreaded Question


Doing a quick read-up on cache lines, it appears that Intel architecture cache lines are 64 bytes. So if the data size is divisible by 64, like a 1024-byte array, this should avoid performance hits, shouldn't it?

Under the assumption that your data actually starts at a cache-line boundary, I think so.


struct Example { char x; char manyx[1024]; }; // manyx starts 1 byte in, so it won't be cache-line aligned

Is there a way to ensure that the data starts at the cache line?

Can a 64 byte char array start in the middle of a cache line and overlap the next?


Copying the array means one of two things: A) your array is so small that copying is cheap; if it's that small, you shouldn't be multithreading it anyway. Or B) the array is large, but you're taking the easy way out instead of understanding the problem (which sometimes is the economical thing to do, if you don't have time to waste doing the research).


Or the array is small and cheap to copy, but the computation you're performing is expensive and makes sense to spread over multiple threads...


That was mentioned previously; I conceded that point already. :wink:

Doing a quick read-up on cache lines, it appears that Intel architecture cache lines are 64 bytes. So if the data size is divisible by 64, like a 1024-byte array, this should avoid performance hits, shouldn't it?


Again, noob here. From what I understand, whatever size the array is, regardless of whether it's divisible into whole cache lines or not, you should be fine in avoiding the speed trap I mentioned as long as no two threads access the same cache line (with at least one of them writing) around the same time, and as long as you aren't shuffling the same cache line back and forth between the two threads.

Basically, if you are giving half of your array to different threads, you probably want to make your dividing point aligned to a cache line, even if it means giving one thread a few elements more of data (and of course, without cutting an element in half). So if your array happens to be (roughly) three cache lines, give two to one thread and one to the other, rather than 1.5 to each. (I say 'probably' because, yes, if you are operating on a small amount of data but doing a large amount of processing, that certainly changes how the code ought to be written.)
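
Something like this rough sketch of what I mean (just an illustration; the helper name is made up, and it assumes 64-byte lines and that the array itself starts on a line boundary):


#include <cstddef>

static const size_t CACHELINE_SIZE = 64;

// Hypothetical helper: pick a split index so the second thread's half starts
// on a cache-line boundary (one thread ends up with a few extra elements).
size_t cache_aligned_split(size_t element_count, size_t element_size)
{
    size_t elements_per_line = CACHELINE_SIZE / element_size;
    size_t split = element_count / 2;
    split = (split + elements_per_line - 1) / elements_per_line * elements_per_line;
    return split < element_count ? split : element_count;
}

// ThreadA then processes [0, split) and ThreadB processes [split, element_count).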

If whatever chunks you give your threads don't contain overlapping cache-line segments that they are simultaneously trying to access, and other unrelated threads in your program aren't trying to access the same cache lines, you're fine. By "access": if every thread is only reading, you're fine, but if even one thread is writing, the reads of the other threads need to re-cache the written-to values (which means re-copying the entire cache line up and then down the caches).

But if you ping-pong the cache line, that can be slow too. Even if ThreadA and ThreadB aren't accessing it at the same time, if they alternate who accesses it, each alternation (if a write occurred) requires syncing the memory between their different L1 caches (if they are using different L1 caches).


ThreadA writes to [line of bytes]
ThreadB writes to [line of bytes] //Forces re-syncing the memory up and then down the caches
ThreadA writes to [line of bytes] //Forces re-syncing the memory up and then down the caches
ThreadB writes to [line of bytes] //Forces re-syncing the memory up and then down the caches

It's (conceptually) faster to do:


ThreadA writes to [line of bytes]
ThreadA writes to [line of bytes] 
ThreadB writes to [line of bytes] //Forces re-syncing the memory up and then down the caches
ThreadB writes to [line of bytes]

But if a lot of other memory is being accessed in between accesses, it doesn't matter much, since the line has likely been pushed out of the L1 or even L2 cache anyway.
(which is where the obligatory "profile profile profile" comes in)

tldr:
A) Each time a thread wants to access [cacheline #23424], if it's not already in the local L1 cache, the read has to go out to the L2 cache and then to main RAM (or to L3 if it exists, and then to main RAM). So accessing a cache line a second time while it's still in the L1 cache is faster than accessing it when it's not in the L1 cache (which is the point of the cache).

You already know (A). What I'm trying to add is:
B) If you have the same cache line copied to two different L1 caches, and one of those copies gets written to, then the other copy will have to get synced before your next read or write goes through, and that can be slow. Once or twice is no biggie, but I'd suspect that if it's happening once after divvying up tasks for threads, it's likely happening more than once, and it'd be better to guarantee cache-line exclusivity.

So my inexperienced suggestion would be: when dividing up your thread *workloads* (regardless of whether you copy the memory or not), make sure the workloads fall on cache-line boundaries, and let those threads have exclusive access to those cache lines until they complete their work. This doesn't necessarily affect the allocated size of your array itself, but padding out the beginning/end of the array to ensure other threads aren't using that memory wouldn't be a bad idea.

Can a 64 byte char array start in the middle of a cache line and overlap the next?

Sure. The compiler usually word-aligns variables for you, in structs and so on, but doesn't cache-align them unless you ask.
(With char arrays it may be a little different, since chars are often used as raw bytes.)

Is there a way to ensure that the data starts at the cache line?

If you have your own allocator, you can align it to whatever you want (even bad alignments that the CPU chokes on).

But also, look at alignas().
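
For example (just a sketch, assuming a C++11 compiler and 64-byte cache lines; the names are made up):


// alignas(64) makes each of these start on its own cache line, so two threads
// writing to different WorkBuffers never touch the same line.
struct alignas(64) WorkBuffer {
    float values[16]; // 16 floats = 64 bytes = exactly one cache line
};

alignas(64) static float samples[1024]; // a cache-line-aligned array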

Thanks for that SotL.

How do you ensure that your data is in fact starting at the start of a cache line? Or is this just an automatic thing? What if other unrelated variables are already on that cache line? Does the array we want to work with get appended after them? Or is the cache flushed and the array started at the beginning of a cache line regardless?

Edit

You just ninja'd me with your second post. I'll take a look at alignas(). :)

In my case I am dealing with packed floats. It will be pretty easy to ensure that I am dealing with 64 byte chunks, so alignment won't be too much of an issue in this particular case.

I'm just more worried about ensuring that each 64-byte chunk actually starts at the start of a cache line, which I'm not sure how to guarantee yet.

Coming into this late, but there are three items which should be asked explicitly and were only implied in various replies along the way. First: is the data in the array mutable? Second: how variable is the execution time of the work: short on some items, longer on others, 2x, 3x, more variation? Third: what amount of work are we talking about? These three questions should preface any discussion of multithreading since each has very important implications.

Taking the first item. If the data is mutable, then make it immutable. That solves the issues of alignment and cache-line false sharing on the input side. So, regarding the first response suggesting making copies: don't do that. Just have all threads read from the initial array, but 'store' the results to a second array. When everything is done, the results are in the second array; if the arrays happen to be behind pointers, just swap the pointers and away you go. False sharing in the 'output' array is still possible, but with partitioning of any significant amount of work, the threads should never be looking at data in close proximity, so it should never show up as a problem. I.e. thread 1 would have to be nearly done with its work before it got near the work thread 2 is looking at, and unless thread 2 was REALLY slow, it would have moved out of the area of contention by then, so there is no overlap.
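
To sketch what that might look like (hypothetical names, assuming each thread writes only to its own disjoint slice of the output):


#include <thread>
#include <vector>

// Every thread reads from the same immutable input and writes only to its own
// [begin, end) slice of the output, so the input side has no sharing to worry about.
void process_range(const std::vector<float>& input, std::vector<float>& output,
                   size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i)
        output[i] = input[i] * 2.0f; // stand-in for the real per-element work
}

void run(std::vector<float>& data)
{
    std::vector<float> results(data.size());
    size_t mid = data.size() / 2;

    std::thread worker(process_range, std::cref(data), std::ref(results),
                       mid, data.size());
    process_range(data, results, 0, mid); // main thread handles the first half
    worker.join();

    data.swap(results); // "swap the pointers": results become the new data
}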

Second item: variable execution time can destroy performance in any partitioning scheme. If one thread gets unlucky working on heavier work items, the other threads will be stuck waiting till that one gets done. Most partitioning schemes are meant for nearly fixed cost computations where no one thread gets stuck doing unexpectedly large amounts of work.

Third item: as suggested, you should only do this on significantly large amounts of data. Something I did not see mentioned at all is that threads don't wake up from mutex/semaphore/etc. immediately, so right off the top you lose a significant amount of performance due to slow wakeup on the worker threads. Unless you have the threads in a busy wait and the work is significant, you may never see a benefit.

Hope I didn't restate too much, but I didn't see much of this asked specifically.

In this particular case it will be a huge array that is going to have three or four sets of matrix calculations per element performed on it. So it will be quite intensive.

The classical way to ensure alignment when using a simpler allocator is to over-allocate and manually align.

e.g., to allocate on a 64-byte boundary, allocate an additional 63 bytes and then align your allocation forward as needed.


#include <cstdlib>
#include <cstdint>

static const size_t CACHELINE_SIZE = 64; // must be power of two!
static const size_t CACHELINE_MASK = CACHELINE_SIZE - 1;

void* allocate_aligned(size_t size) {
  // allocate excess space for alignment
  char* block = static_cast<char*>(malloc(size + CACHELINE_MASK));

  // align allocation within the padded space, if necessary
  // (cast to an integer type first; you can't do bitwise math on a pointer)
  if ((reinterpret_cast<uintptr_t>(block) & CACHELINE_MASK) != 0)
    block += CACHELINE_SIZE - (reinterpret_cast<uintptr_t>(block) & CACHELINE_MASK);
  return block;
}

void deallocate_aligned(void* block) {
  // WARNING: not portable! free() expects the original pointer returned by
  // malloc(), which this simplified sketch doesn't keep track of.
  free(block);
}

There are also platform-specific functions to do this for you, e.g. posix_memalign or _aligned_malloc. Be aware that for portability you need to ensure that any memory you allocate with an aligned allocate is deallocated with the corresponding deallocator (e.g. _aligned_free).
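
For what it's worth, a sketch of using those (assuming 64-byte lines; the wrapper name is made up):


#include <cstdlib>
#if defined(_MSC_VER)
#include <malloc.h>
#endif

float* allocate_floats_cacheline_aligned(size_t count)
{
#if defined(_MSC_VER)
    // Windows: pair _aligned_malloc with _aligned_free
    return static_cast<float*>(_aligned_malloc(count * sizeof(float), 64));
#else
    // POSIX: memory from posix_memalign is released with plain free()
    void* block = nullptr;
    if (posix_memalign(&block, 64, count * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float*>(block);
#endif
}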

The alignas attribute _only_ works for stack- and statically-allocated values in the current version of C++; it doesn't affect heap allocations. The next revision will support over-aligned types with new/delete via some new overloads, but that is not yet supported by any released compiler, iirc.

Sean Middleditch – Game Systems Engineer – Join my team!

Arrays are generally aligned by means of an allocator, usually on the heap during some kind of initialization. As already discussed here, you want threads touching data on different cache lines to avoid false sharing. With thread pool implementations it can also be smart to align the thread contexts/IDs to separate cache lines as well.
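
For example (a sketch; the struct is made up, but it shows the idea of giving each worker's context its own line):


#include <cstddef>

// alignas(64) forces each context to start on (and, since sizeof becomes a
// multiple of the alignment, to occupy at least) its own cache line, so one
// worker updating its counters never invalidates another worker's line.
struct alignas(64) WorkerContext {
    int    id;
    size_t items_processed;
};

static WorkerContext contexts[4]; // one per pool thread, no false sharing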

Another one for funsies (16 byte alignment):


#include <cstdlib>

void* malloc16( size_t size )
{
	void* p = malloc( size + 16 );
	if ( !p ) return 0;
	unsigned char offset = (size_t)p & 15;
	// round up to the next 16-byte boundary, staying within the 16 extra bytes allocated
	p = (char*)(((size_t)p + 16) & ~(size_t)15);
	// stash the adjustment just before the returned pointer so free16 can undo it
	*((char*)p - 1) = 16 - offset;
	return p;
}

void free16( void* p )
{
	if ( !p ) return;
	free( (char*)p - (size_t)*((char*)p - 1) );
}

You don't need to align to cache-line sizes; the memory system is quite happy handling one core writing to a byte of a cache line "simultaneously" with a core writing to a different byte in the same line -- the intercore hyperbus will just transfer cacheline ownership around, introducing wait-states as appropriate.

(You can't guarantee the order of the writes, so they can't overlap, but they will both happen).

You DO want to avoid doing this all the time. (Because of the overhead of the hyperbus communications mentioned above).

So just don't thread finely in memory -- if you have four threads, don't send x[0] to one thread and x[1] to the next. Partition your array and send x[0-99] to one thread and x[100-199] to the next.
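
As a sketch of that coarse partitioning (assuming std::thread and a simple per-element operation; the names are made up):


#include <algorithm>
#include <thread>
#include <vector>

// Blocked partitioning: each thread owns one contiguous quarter of the array,
// so at worst threads share only the single cache line at each boundary.
void scale_all(std::vector<float>& x, float factor)
{
    const size_t thread_count = 4;
    const size_t chunk = (x.size() + thread_count - 1) / thread_count;

    std::vector<std::thread> workers;
    for (size_t t = 0; t < thread_count; ++t) {
        size_t begin = t * chunk;
        size_t end = std::min(begin + chunk, x.size());
        workers.emplace_back([&x, factor, begin, end] {
            for (size_t i = begin; i < end; ++i)
                x[i] *= factor;
        });
    }
    for (auto& w : workers)
        w.join();
}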

Don't over-worry about the cacheline sharing at the borders -- they're a small percentage of the total problem and cacheline ownership changes are designed to be very fast.

The rule to apply here is "don't thread finely in memory all the time".

In this particular case it will be a huge array that is going to have three or four sets of matrix calculations per element performed on it. So it will be quite intensive.

I will try to point out the performance benefits in your case, against the possible trade-offs that could cancel them out.

1- The large array is processed in place: it gets written to at the same point it is being read from (the most important benefit).

2- Instead of a single thread working in this very cache-friendly way, have two threads work the same way.

Considering option 2 along with the size of the array and the expense of the operation, you may benefit from literally copying two separate sub-arrays, letting two threads process them, and then writing them both back to your main thread (you can achieve this with socket exchange operations, for example). This somewhat expensive copy-in/copy-out approach gives your main thread immediate independence, whether it is polling for the expected arrays or still safely reading from the old array that is about to be updated.

Whether that is most appropriate depends on the situation. Handing a thread a pointer into the main array (a sub-array) is the other option, if you are sure there are no synchronization consequences (and no further issues on your main thread's side), but this operation is not trivial either: at the very least the memory will end up copied into separate per-CPU L2 caches, reads and writes to RAM must be kept from racing, and the program's shared memory must satisfy certain invariants across the entire process for you to be safe.

Done well, the second option outperforms the first, though not by a large margin, and on the other hand it puts the development of the application, its data, logic, and run-time on thinner ice.

You can try both approaches, and see the difference to decide which one you go for.

This topic is closed to new replies.
