gcc vs. VC++ code optimization

51 comments, last by phresnel 13 years, 10 months ago
Quote:Original post by RobTheBloke
by TTB? do you mean TBB?


Yes, sorry, bad typo was bad [sad]

Quote:
I've never used TTB (unless of course you mean TBB) or ConcRT, but I'm surprised to hear anyone advocate anything other than OpenMP? Certainly OpenMP under VC is a touch shaky (and not available in most editions), but under the Intel compiler it craps over the other solutions I've seen (i.e., Thread Building Blocks) by a noticeable amount.....

Hell, even Intel recommend that you always try to use OpenMP unless the compiler doesn't support it (and then you have TBB....). Imho, TBB only seems useful for serial applications that are retrofitting multi-core support....


Hmmm, do you have a link/source for that information? The only thing I was able to dig up with a couple of minutes googling was an article by Intel from July 2009 (clicky) which doesn't strongly recommend one over the other.

In fact, if anything it seems to recommend OpenMP for cases when "parallelism is primarily for bounded loops over built-in types, or if it is flat do-loop centric parallelism."

If anything, my personal view of OpenMP was that it was best suited for quickly adding parallel sections to serial code, whereas TBB and MS's ConcRT are more for C++ applications where you build in threading from the ground up (although they can be used for converting serial code to parallel via the parallel_for algorithms).
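To illustrate the "quickly adding parallel sections to serial code" point: a hypothetical dot-product loop (not from the thread) where a single pragma is the entire retrofit. The `_OPENMP` guard is defensive; if OpenMP isn't enabled (`-fopenmp` in gcc, `/openmp` in VC++), the loop simply runs serially.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical example: a serial dot product parallelized with one
// pragma and no restructuring. With OpenMP disabled, the pragma is
// skipped and the function behaves identically, just single-threaded.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    const long n = static_cast<long>(a.size());
#ifdef _OPENMP
    #pragma omp parallel for reduction(+:sum)
#endif
    for (long i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The `reduction(+:sum)` clause is what makes this safe: each thread accumulates into a private copy and OpenMP combines them at the end, so no locking is needed.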

Personally, my only experience with OpenMP was quickly throwing it at some code I had to do at uni, and the setup/teardown time for the various parallel blocks was pretty painful in that example, certainly when trying to do nested parallelism.

In contrast, when working on a very simple particle system, I could use TBB's parallel_for to very quickly iterate over blocks of particles, with my own partition scheme, and update 1 million 2D particles in ~4ms.
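The post doesn't show the actual code, but the pattern it describes — TBB's parallel_for handing each worker a contiguous block of particles — can be sketched with nothing but the standard library (std::thread standing in for TBB's scheduler, and the Particle layout invented for the example):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical 2D particle; the post does not give the real layout.
struct Particle { float x = 0, y = 0, vx = 0, vy = 0; };

// The same block-iteration pattern parallel_for(blocked_range<...>)
// provides in TBB, sketched with std::thread so no extra library is
// needed: each worker owns one contiguous block, so threads never
// touch the same particle and no locking is required.
void update(std::vector<Particle>& ps, float dt, unsigned nthreads) {
    std::vector<std::thread> workers;
    const std::size_t chunk = (ps.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&ps, dt, chunk, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(ps.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) {
                ps[i].x += ps[i].vx * dt;   // simple Euler step
                ps[i].y += ps[i].vy * dt;
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

What TBB adds over this sketch is work stealing and grain-size tuning: blocked_range lets the scheduler split ranges recursively rather than fixing one chunk per thread up front.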
OpenMP goes far beyond simple "#pragma omp parallel for" use cases. I am basically just utilizing its parallel region and lock support (more convenient than native APIs) and building everything else on top.
Such an approach can take you further than TBB, e.g. by using knowledge of shared caches or NUMA nodes. However, a good amount of infrastructure is required: detecting cache/NUMA topology, splitting up your tasks so that CPUs don't have to cross a cache-line boundary, reducing/broadcasting results, etc.
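The thread doesn't include the poster's actual infrastructure, but one piece of it — laying out per-thread results so CPUs never contend on the same cache line — can be shown in a small standard-C++ sketch (the `PaddedSum`/`parallel_sum` names and the 64-byte line size are assumptions for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Pad each thread's partial result out to a full cache line so
// neighbouring threads never write to the same line (no false
// sharing). 64 bytes is the common x86 cache-line size; real code
// would detect this from the topology, as the post describes.
struct alignas(64) PaddedSum {
    double value = 0.0;
};

double parallel_sum(const std::vector<double>& data, unsigned nthreads) {
    std::vector<PaddedSum> partial(nthreads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + nthreads - 1) / nthreads;

    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&data, &partial, chunk, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                partial[t].value += data[i];   // each thread owns its line
        });
    }
    for (auto& w : workers) w.join();

    double total = 0.0;
    for (const auto& p : partial) total += p.value;  // serial reduce
    return total;
}
```

Without the `alignas(64)`, adjacent `partial` entries would share a cache line and every write would invalidate the neighbouring core's copy — exactly the kind of hidden cost this topology-aware approach is meant to avoid.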

Nested parallelism is something to avoid (e.g. by telling client libraries not to do their own threading), but I have not otherwise noticed any overhead.
(As a minor update on my status of testing picogen: I must have bogofied the source code, as the binary outputs black pixels only, for both builds; must investigate further.)

This topic is closed to new replies.
