On Windows, it performs as expected i.e. very well in low contention cases, terribly under high contention. But on OS X, I noticed my high contention stress tests were completing much faster for benaphores than they were for my pthread_mutex wrappers and I'm hoping someone might be able to shed any light on the reasons why.
I understand that pthread mutexes will be inherently slower due to additional features and capabilities, but given that we're talking over an order of magnitude's performance difference under high contention scenarios (in the opposite direction to that which I'd expect), I'd like to understand what's going on.
The forum says I'm not permitted to attach a .cpp file (really?!), so I've put the benchmark code in my DropBox Public folder for now. It's a single C++ file, essentially equivalent to one of my stress test cases. The semaphore I've used here is dispatch_semaphore_t from libdispatch. A number of threads are started, each of which increments a shared variable a large number of times, acquiring/releasing the lock before/after each increment. Here's the body of each thread:
template<typename Mutex>
void *thread_body(void *opaque_arg)
{
CHECK( opaque_arg != 0 );
shared_stuff<Mutex> &stuff = *static_cast<shared_stuff<Mutex> *>(opaque_arg);
for (uint32_t i = 0; i != stuff.increments; ++i)
{
stuff.mtx.acquire();
++stuff.total;
stuff.mtx.release();
}
return 0;
}
And here are the results on my quad core i5 iMac running OS X 10.7, built without error checking code.
(Sorry about the use of an image, I fiddled with formatting for 30 minutes but the forum always insisted on mangling it to some degree).
Even in the case of 2 threads we're looking at 2.458 vs 86.479 seconds (!). I can happily accept that the libdispatch semaphore is more efficient than a pthread_mutex, but given the degree of improvement in the benaphore case, I'm inclined to believe that I've misconfigured/misunderstood something. Any ideas as to an explanation?






