
How to set Cache Line Alignment Properly


Recommended Posts

Hi, suppose I wish to align my data to cache line boundaries to prevent cache thrashing when it is accessed by multiple threads. How should I do this easily? What is the difference between padding the structure manually and using compiler keywords such as __declspec(align(#)) (MSVC) or __attribute__((aligned(#))) (GCC)?


[code]
// E.g. 1: manual padding
struct Data
{
    int a;
    char padding_0[CACHE_LINE_SIZE - sizeof(int)];

    double b;
    char padding_1[CACHE_LINE_SIZE - sizeof(double)];
};

// E.g. 2: compiler alignment keyword (MSVC syntax)
__declspec(align(CACHE_LINE_SIZE)) struct Data
{
    __declspec(align(CACHE_LINE_SIZE)) int a;
    __declspec(align(CACHE_LINE_SIZE)) double b;
};
[/code]

Is there any difference between the above two examples?

I know of ways to get the CACHE_LINE_SIZE programmatically, e.g. via GetLogicalProcessorInformation. But is there any way to know it at compile time?

It seems pointless to get the CACHE_LINE_SIZE at runtime, since I can't change the padding size while the program is running.
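For what it's worth, standard C++11 later added alignas, which is portable across both compilers, and C++17 added std::hardware_destructive_interference_size in <new> as a compile-time hint for exactly this value. A minimal sketch, assuming a 64-byte line (typical for current x86 parts; the constant name kCacheLine is made up here):

```cpp
#include <cstddef>

// 64 is an assumption; where the standard library provides it,
// C++17's std::hardware_destructive_interference_size can be used
// instead as a compile-time estimate of the line size.
constexpr std::size_t kCacheLine = 64;

struct alignas(kCacheLine) Data
{
    alignas(kCacheLine) int    a; // a and b land on separate lines
    alignas(kCacheLine) double b;
};

static_assert(alignof(Data) == kCacheLine, "struct starts on a line boundary");
static_assert(sizeof(Data) == 2 * kCacheLine, "a and b occupy separate lines");
```

Unlike the manually padded version, the compiler computes the padding itself, and the struct as a whole is also line-aligned, which manual member padding alone does not guarantee.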

regards

Remembering that this is the "For Beginners" forum....

This falls into the bucket of "If you have to ask, you aren't ready for it".

What you describe will NOT fix your problem of cache thrashing. In fact, it will more likely make your data structure larger, further increasing the problem.


You are correct that you are trying to use information at compile time that is only available at runtime. The information is not available yet. There are ways to do what you described by abusing pointers to structures, pointers to functions, and layout-compatible structures, but it is an advanced subject not suitable for the For Beginners forum. (I'd argue it isn't suitable even for experts because of the side effects it causes, but the simple fact is that they do exist out there.)



The correct way to avoid the problem is to use appropriate locking and synchronization tools.



As a general rule, posts in the For Beginners forum are encouraged to stay far away from multiple threads.

Multiprocessing (including multithreading) is an extremely complex topic. Even programmers with years of professional experience will push back on multiprocessing if they can. It adds a very large additional level of complexity to a design: everything that touches or crosses a thread boundary requires significant additional engineering effort. It introduces incredibly difficult bugs into systems; some can take months or even years of effort to track down, and some are impossible to fix without significant system-wide overhauls.


Even innocent-sounding things, like the simple Windows Mutex, can bring a system to a grinding halt. Many beginners see that it has the easiest-to-understand documentation of the lot, and it appears simple to use, so they try it out. (Hint: A Windows Mutex is the worst-case synchronization object. It does ensure proper access, but painfully, effectively stopping every process on the system and doing a comprehensive shakedown, repeated every time it is acquired or released. It would be like installing a tire ripper at the end of the service pits of an F1 racing track to stop each car so a TSA inspector can search it for a single explosive, thereby forcing it around the track again only to end up right back in the queue; only the cars are every program on your computer. Sadly, many novices reach for it as their synchronization system of choice.)




If you are sure you want to continue experimenting with this, and you understand that trying multithreading too early is a great way to permanently injure your mental health, I suggest you look at [url="http://www.google.com/search?q=interlocked+variables"]interlocked variables[/url]. That would be one of the easiest and safest tools to share the values between threads. If that doesn't do enough for you, I'd suggest instead going over to [url="http://www.boost.org/doc/libs/1_47_0/doc/html/interprocess/synchronization_mechanisms.html"]Boost.Threads[/url] rather than trying to implement threading yourself, or move to a language with better support for it like C#.

Thanks for the detailed reply :)

I guess it is true that my example did not reflect a problem of cache thrashing (or maybe I used the wrong word :D ).

I did use interlocked variables, but have since replaced them with std::atomic.

The exact nature of my problem is that I have an array of data that can be accessed by multiple threads. For simplicity, say something like:
[code]
struct Data {};
Data values[thread_count];

//in Thread One
//do something like read/write to values[thread_id]

//in Thread Two
//do something like read/write to values[thread_id]
[/code]

But as adjacent elements of the array might lie on the same cache line, I think this might cause false sharing?
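One common way to keep the per-thread slots apart (a sketch assuming a 64-byte line and C++11 alignas; the names PaddedSlot, kCacheLine, and the thread count are made up for illustration):

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine   = 64; // assumed line size
constexpr std::size_t kThreadCount = 4;  // placeholder thread count

struct Data { int value; };

// Wrapping each per-thread slot in a line-aligned struct forces every
// slot onto its own cache line, so writes to values[0] by one thread
// never invalidate the line that values[1]'s thread is working on.
struct alignas(kCacheLine) PaddedSlot
{
    Data d;
};

PaddedSlot values[kThreadCount];

static_assert(sizeof(PaddedSlot) == kCacheLine, "one slot per line");
```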

regards

I am already using Boost.Thread and interlocked/atomic variables. But I think neither Boost.Thread nor interlocked variables can really solve the problem of false sharing?

[quote name='littlekid' timestamp='1312701449' post='4845696']
but as each element of data might lie on the same cache line. I think this might causes potential false sharing?
[/quote]

It all depends on the size of the "Data" structs and how your threads are accessing the data.

If each data element is smaller than a cache line and each thread accesses the elements in the way you suggest, then yes, you will get false sharing, and after each write the CPU will need to resynchronize the cache line between cores.

The correct way to avoid this is not to make your data structures bigger but to ensure your threads access data at least one cache line apart, which avoids the problem altogether. This also wins you better cache usage, as each thread can work through a run of data from the cache before having to refetch from memory.

Bear in mind that you don't know how big the cache lines are on the CPU the code will run on.

Instead of interleaving your thread accesses, divide the work the other way: if you're processing 20 objects in the array, instead of giving one thread the even-numbered ones and the other the odd-numbered ones, cut the array in half and give one thread the first ten and the other thread the next ten.

You'll not only minimise the cache line fighting, but you'll also gain more from any prefetching that can be done.


But 999 times out of a thousand, you don't have a problem which needs solving at this level. Frob is completely right about this -- when you get to dealing with the complexity of things like line invalidation, you're deep, deep, deep into performance code. It's the last resort before you decide that the problem can't be solved with this decade's computers.
