Hello,
I am working on some libraries that use OpenMP intensively. In a few books about OpenMP I've read that it is good practice to leave at least one thread unused by the program, so that the OS can use it, etc.
I would like to ask you about your experience with this. Is it really faster to use serial algorithms on a two-CPU machine? Or is it still better to evaluate an expression on both cores using OpenMP?
For example, if I have two CPUs available, should I rather use the serial versions of the algorithms? And if I have three, should I use two threads... or just set the number of threads equal to the number of cores?
Thanks for any suggestions :]
Optimal number of OpenMP threads
It depends. The answer depends on your parallel algorithm and it depends on your parallel hardware.
It depends on your algorithm. Some parallel algorithms (such as searching) benefit from non-linear speedups and are not compute intensive, but they are memory intensive. Depending on your memory and cache situation, it MAY make sense to have more than one thread per core. Other parallel algorithms are very communication intensive; in those situations it MAY make sense to have exactly one thread per core. Other parallel algorithms are more compute intensive and may be best served as you described, leaving a processing core free for other purposes. Those parallel algorithms MAY benefit from fewer threads than cores.
It depends on your processor configuration. Differently configured hardware, such as dual-dual or dual-quad or quad-quad, will have different performance characteristics than a single quad-core or single 6-core or single 8-core or single 12-core chip, which will be different again from dual-6 or dual-8 or dual-12, etc. Communication between cores on the same die will be much faster than communication between different chips. If your threads must communicate, the physical configuration becomes important.
It depends on your memory performance. Memory-intensive parallel algorithms may start to slow down when too many threads run at once, because the threads simply starve each other for memory bandwidth.
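To make the memory-bandwidth point concrete, here is a minimal sketch of a loop that tends to be limited by memory traffic rather than arithmetic (a STREAM-style "triad"). The function name `triad` and its `threads` parameter are made up for illustration; the thread count only takes effect when the file is compiled with OpenMP enabled (for example `-fopenmp` with GCC or Clang), otherwise the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// STREAM-style triad: a[i] = b[i] + s * c[i].
// Each iteration does two loads and one store for a single multiply-add,
// so on large arrays the loop is usually limited by memory bandwidth.
// `threads` is a hypothetical knob for experimenting with thread counts.
void triad(std::vector<double>& a, const std::vector<double>& b,
           const std::vector<double>& c, double s, int threads) {
    assert(a.size() == b.size() && a.size() == c.size());
    if (threads < 1) threads = 1;  // num_threads() requires a positive value
    const long n = static_cast<long>(a.size());
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (long i = 0; i < n; ++i) {
        a[i] = b[i] + s * c[i];
    }
}
```

On arrays much larger than the last-level cache, timing this with 1, 2, 4, ... threads will often show the speedup flattening out before the thread count reaches the core count, but that is something to measure on your own machine rather than assume.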
It depends on your cache usage and cache performance. Parallel algorithms with data that live entirely within cache may run better when the on-die cache is only partially used; this is another resource starvation issue.
In short, it depends on your algorithm and it depends on your hardware.
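Since the answer depends on both the algorithm and the machine, one way to find out is to time the same kernel at several thread counts. The following is only a sketch under assumptions: `Timing`, `max_threads`, `sum_squares` and `sweep` are names invented for this example, and `sum_squares` stands in for whatever kernel your library actually runs. The `omp.h` include is guarded so the code also builds (serially) without OpenMP.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

struct Timing {
    int threads;     // thread count used for this measurement
    double seconds;  // best wall-clock time over all repeats
};

// Upper bound for the sweep: what OpenMP would use by default, or 1 without OpenMP.
int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Stand-in kernel (hypothetical): sum of squares with an OpenMP reduction.
double sum_squares(const std::vector<double>& x, int threads) {
    double sum = 0.0;
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for num_threads(threads) reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += x[i] * x[i];
    }
    return sum;
}

// Run `kernel` with 1..limit threads and record the best of `repeats` runs for each.
std::vector<Timing> sweep(const std::function<void(int)>& kernel,
                          int limit, int repeats) {
    assert(limit >= 1 && repeats >= 1);
    std::vector<Timing> results;
    for (int t = 1; t <= limit; ++t) {
        double best = -1.0;
        for (int r = 0; r < repeats; ++r) {
            const auto start = std::chrono::steady_clock::now();
            kernel(t);
            const std::chrono::duration<double> elapsed =
                std::chrono::steady_clock::now() - start;
            if (best < 0.0 || elapsed.count() < best) best = elapsed.count();
        }
        results.push_back(Timing{t, best});
    }
    return results;
}
```

A typical call would be `sweep([&](int t) { sink = sum_squares(data, t); }, max_threads(), 5)`, where `sink` is a `volatile double` that keeps the compiler from discarding the result. Taking the best of several repeats reduces noise from other processes; it does not remove it, so run the sweep on an otherwise idle machine if you can.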
Most of the time the number of logical cores is a reasonable starting point... but you have to try and measure to actually find out. And as far as "leaving a thread for the OS" etc. is concerned, this should be configurable by the library user, imho.
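One possible shape for making this configurable, as a rough sketch only: the `ThreadConfig` struct and the functions `available_processors` and `resolve_thread_count` are hypothetical names, not part of OpenMP. The only OpenMP call used is `omp_get_num_procs()`, guarded so the code still builds without OpenMP. The policy assumed here is that an explicit request from the user wins, and otherwise the default is the processor count minus any reserved cores, never less than one.

```cpp
#include <algorithm>
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical user-facing options for a library built on OpenMP.
struct ThreadConfig {
    int requested = 0;  // > 0: use exactly this many threads; 0: pick a default
    int reserved = 0;   // cores to leave free for the OS or other work
};

// Number of processors OpenMP reports, or 1 when built without OpenMP.
int available_processors() {
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}

// Assumed policy: an explicit request wins; otherwise use all available
// processors minus the reserved ones, but never fewer than one thread.
int resolve_thread_count(const ThreadConfig& cfg, int available) {
    assert(available >= 1);
    if (cfg.requested > 0) return cfg.requested;
    return std::max(available - std::max(cfg.reserved, 0), 1);
}
```

The resolved value can then be passed to a `num_threads(...)` clause on each parallel region, which keeps the setting local to the library instead of changing the global default with `omp_set_num_threads`.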