Multithreading - C++11 vs. OpenMP

Started by
5 comments, last by JoeJ 7 years, 10 months ago

I'm trying to find some 'best performance practices' for multithreading before looking at implementing more advanced job systems.

But I don't understand the results - either OpenMP or C++11 is a lot faster.

In my tests it's always possible to get a 4x speedup on my i7, so it's surely not a hardware limit.

Example one: the C++11 speedup is slightly above 4, while the OpenMP speedup is only 2.

C++11:

void MyClass::ProcessMT (const int tID, std::atomic<int> *work, const int iterations)
{
    for (;;) // each virtual core steals a batch of the given work iterations and processes it until nothing is left
    {
        const int workStart = work->fetch_add(iterations);
        const int workEnd = std::min (workStart + iterations, myTotalWorkAmount);

        for (int i=workStart; i<workEnd; i++)
            DoTheWork (i);

        if (workEnd == myTotalWorkAmount) break;
    }
}
 
void MyClass::Update()
{
    int num_threads = (int)std::thread::hardware_concurrency(); // returns 8 on an i7
    num_threads = std::min (64, std::max (4, num_threads));
    std::thread threads[64];

    std::atomic<int> work {0}; // brace init: atomics are not copy-constructible
    for (int tID=1; tID<num_threads; tID++) threads[tID] = std::thread (&MyClass::ProcessMT, this, tID, &work, 64);
    ProcessMT (0, &work, 64); // the main thread joins in as worker 0
    for (int tID=1; tID<num_threads; tID++) threads[tID].join();
}

OpenMP for the same thing is simply:


#pragma omp parallel for
            for (int i=0; i<myTotalWorkAmount; i++)
               DoTheWork (i);
 

I have used this method on various tasks.

There are also cases where OpenMP is 4 times faster and my C++11 method only 2 times - exactly the opposite behaviour.

I tried to tune the iteration count and made sure not to create any threads that would have no work - only minor improvements.

The tasks themselves are nothing special - no dynamic allocations or anything like that.

My guess is that the only solid method is to create persistent worker threads and keep them alive

instead of creating new threads for each new kind of task.

But this alone cannot explain the behaviour I see.
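For what it's worth, such a persistent pool can be sketched with a condition variable so the threads sleep between batches instead of being created and joined per task. Everything below (WorkerPool, RunBatch) is a hypothetical illustration under that assumption, not code from this thread:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Sketch of a persistent worker pool: threads are created once and sleep on
// a condition variable between batches. 'WorkerPool' and 'RunBatch' are
// made-up names for illustration.
class WorkerPool
{
public:
    explicit WorkerPool (unsigned threadCount) : numThreads (threadCount)
    {
        for (unsigned t = 0; t < numThreads; ++t)
            workers.emplace_back ([this] { WorkerLoop(); });
    }

    ~WorkerPool ()
    {
        {
            std::lock_guard<std::mutex> lock (mutex);
            quit = true;
        }
        wakeWorkers.notify_all();
        for (auto& w : workers) w.join();
    }

    // Run job(0..count-1) across all workers; blocks until everything is done.
    void RunBatch (std::function<void(int)> jobIn, int count)
    {
        std::unique_lock<std::mutex> lock (mutex);
        job = std::move (jobIn);
        workCount = count;
        workIndex = 0;
        workersFinished = 0;
        ++generation; // flips the workers' wait predicate
        wakeWorkers.notify_all();
        batchDone.wait (lock, [this] { return workersFinished == (int)numThreads; });
    }

private:
    void WorkerLoop ()
    {
        unsigned myGeneration = 0;
        for (;;)
        {
            std::function<void(int)> localJob;
            int count;
            {
                std::unique_lock<std::mutex> lock (mutex);
                wakeWorkers.wait (lock, [&] { return quit || generation != myGeneration; });
                if (quit) return;
                myGeneration = generation;
                localJob = job;
                count = workCount;
            }
            for (;;) // same stealing idea as above, one index at a time
            {
                const int i = workIndex.fetch_add (1);
                if (i >= count) break;
                localJob (i);
            }
            {
                std::lock_guard<std::mutex> lock (mutex);
                if (++workersFinished == (int)numThreads)
                    batchDone.notify_all();
            }
        }
    }

    const unsigned numThreads;
    std::vector<std::thread> workers;
    std::mutex mutex;
    std::condition_variable wakeWorkers, batchDone;
    std::function<void(int)> job;
    std::atomic<int> workIndex {0};
    int workCount = 0;
    int workersFinished = 0;
    unsigned generation = 0;
    bool quit = false;
};
```

The generation counter keeps a straggler from a previous batch out of the next one; whether that bookkeeping beats per-task thread creation depends on how long the jobs actually run.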

Maybe C++11 (VC2013, Win10) is not that ready yet, similar to how OpenMP is not really usable for games?

I'd really like to use C++11 instead of looking at libraries or the OS, but not as long as I'm not on par with OpenMP everywhere. <_<

Maybe you can share some experience, have some thoughts or suggest a better method...


But I don't understand the results - either OpenMP or C++11 is a lot faster.

Not necessarily faster, but different from each other and designed for different purposes.

As a car analogy, a fancy sports car is very fast on the highway compared to a front end loader, but if you're going over rough terrain with large rocks and tree stumps the front end loader is better than a sports car. Or a different analogy, a race horse may be great for a sprint around the track but would make a terrible draft animal in the field. Different situations, different purposes, different results.

Maybe C++11 (VC2013, Win10) is not that ready yet, similar to how OpenMP is not really usable for games?

So you've mentioned three sets of libraries. There are the OS platform libraries, the C++ libraries, and the OpenMP libraries.

I'll take them one at a time.

OS Platform Libraries with compiler support.

The OS platform libraries for multithreading allow the most control. They are also the least portable. If you are developing for multiple platforms you will need to take platform-specific steps for using them. Often it is hard to wrap these up as libraries since elements like memory barriers are part of the compiler and don't abstract away into functions.

They offer an amazing amount of control, and if you are programming on only a single platform you can use them to great effect. However, with that control comes the burden of implementing the code behind common multithreading and parallel idioms yourself. The developer is responsible for ensuring all the platform-specific rules are met: that memory barriers, cache synchronization, and other fun stuff are all correct.

OS platform libraries have the most functionality and most control, but come with a high implementation cost. If you need the control then this is often the only way to get it.

C++ multithreading libraries

The C++11 multithreading libraries are general purpose. They are getting a little better over time and will be extended in the near future (proposed changes are already incorporated into many compilers). Most game companies already have their own libraries of platform-specific multithreading code. There is an installed user base with much momentum, so if a group has a code base using a platform's libraries or their own implementations, the group is unlikely to change.

Like any abstraction, the abstractions in the standard library reduce control relative to the platform-specific versions. The libraries were created mostly for people who are willing to forgo a little control in favor of portable abstractions. The C++ libraries work: they generally require less implementation effort, at the cost of losing some control over the features. If you need the platform-specific thing, use the platform-specific libraries.
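As a concrete illustration of that trade-off, a loop can be forked and joined with nothing but the standard library in a few lines; ParallelFor here is a made-up helper for the sketch, not a standard facility, and note it offers no affinity, priority, or scheduling control:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Portable C++11 fork/join over a loop with a static split. Compiles
// unchanged on every platform, but gives no control over affinity or
// scheduling. 'ParallelFor' is a hypothetical helper name.
template <class Func>
void ParallelFor (int count, Func func)
{
    unsigned numThreads = std::thread::hardware_concurrency();
    if (numThreads == 0) numThreads = 4; // hardware_concurrency() may return 0

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t)
    {
        // Thread t handles the half-open range [begin, end).
        const int begin = (int)((long long)count * t / numThreads);
        const int end   = (int)((long long)count * (t + 1) / numThreads);
        threads.emplace_back ([=] { for (int i = begin; i < end; ++i) func (i); });
    }
    for (auto& th : threads) th.join();
}
```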

OpenMP multithreading

OpenMP is a compiler extension designed not to impact your code if you are working on a compiler that doesn't support it. It gives far less control than either of the methods mentioned above, but in many situations is trivially easy to use: just drop in a #pragma before big loops. The nature of the system means that you are fairly limited in how you use the OpenMP libraries. They are wonderful if your problem set matches the system's design, but nearly useless if it doesn't.

OpenMP is mature and works well, but is designed for a different set of problems than games usually address. It is based around a minimally-invasive loop processing model. When you are processing a very large array (on the order of many megabytes or many gigabytes), or doing significant processing on a smaller array, OpenMP can have one thread per processor standing by to split up the work. There is some overhead in splitting up the work, but it effectively sets each of the processors to work at processing different segments of the array.
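The overhead of splitting up the work can be tuned with OpenMP's schedule clause. The default static schedule divides iterations evenly up front; schedule(dynamic, 64) hands out batches of 64 on demand, much like the fetch_add loop above. A small sketch (ProcessAll and the doubled values are made up for illustration):

```cpp
#include <cassert>
#include <vector>

// schedule(dynamic, 64) hands out batches of 64 iterations on demand,
// which compensates for uneven per-iteration cost. If the compiler does
// not support OpenMP, the pragma is ignored and the loop runs serially
// with the same result.
std::vector<int> ProcessAll (int total)
{
    std::vector<int> result (total);
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < total; i++)
        result[i] = i * 2; // stand-in for DoTheWork(i)
    return result;
}
```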

While games do have some array processing, usually the patterns followed don't map well to OpenMP's way of doing things. Some patterns in games do process large arrays, such as updating a very large particle system, but they are the exception rather than the rule. Arrays of several kilobytes or occasionally a megabyte or so are common in games, but gigabyte lengths are rare.

OpenMP does have some support for task-based parallelism, but the control is not fine grained. It might be a good match for some games, but I've only seen a few places here and there where it could be leveraged.

My general recommendation is to avoid multiprocessing if you can; it introduces amazing new classes of bugs and defects. Using system-asynchronous calls is typically the easiest route to go, and the least error prone. After that, your options depend on the real reasons you need to use parallelism. Different problems need different solutions. Task-based parallelism is fairly common in games but doesn't take much advantage of multiprocessing's power; it is just doing more tasks at once. Algorithmic parallelism takes much more computer-science work but can solve many problems far more effectively than serial processing. Loop-based processing like OpenMP is somewhere in the middle; some workloads get much better results than others.

In every situation there is nuance and detail. There are costs to set up and configure your multiprocessing system. There are costs to distribute work and bring it together again. You need to understand both the nature of the work and the nature of the parallel systems to make good decisions. I've seen people implement parallel processing systems that were wonderful. I've seen ones where tasks were too fine grained and the overhead far exceeded the benefits. I've seen systems where people didn't understand the intercommunication between subsystems, and the frequent communication, locks, and blocking between computations made performance plummet rather than improve. You need to understand both the problem you are solving and the nature of your systems in order to get good results. There is no universal answer for what and where to parallelize.

Thanks!

I'll stick with C++11 - enough functionality to learn for now; maybe I'll look into OS APIs (much) later :)

I found the reason for the performance differences.

In the cases where my C++11 approach was slow, it was because threads were created and joined for each iteration of an outer loop.

Even though the outer loop has only 11 iterations and the single-threaded runtime is 27 ms, this caused the slowdown.

I've fixed this by removing the outer loop, but it's still necessary to finish all work in order.

Simple problem, but it already requires ugly and hard-to-read code like the below - maybe I can improve it.

However - with this code I can't measure a performance difference between C++11 and OpenMP anymore :)


void ProcessMT (std::atomic<int> *workIndex, std::atomic<int> *workDone, const int iterations)
    {
        const int endIndex = levelLists.size;
        for (;;)
        {
            int workStart = workIndex->fetch_add(iterations);
            int workEnd = std::min (workStart + iterations, endIndex);

            int level = levelLists.GetLevel(workStart);
            int levelDownAtIndex = (level <= 0 ? INT_MAX : levelLists.GetFirstIndex (level-1) );
 
            while (levelLists.GetLevel(workDone->load()) > level+1) std::this_thread::yield(); // don't skip over a whole level

            int i=workStart;
            while (i<workEnd)
            {
                int safeEnd = std::min (levelDownAtIndex, workEnd);
                int progress = safeEnd - i;

                for (; i<safeEnd; i++)
                    DoTheWork (i);
 
                workDone->fetch_add(progress);

                if (i==levelDownAtIndex)
                {
                    while (workDone->load() < workStart) std::this_thread::yield(); // wait until current level has been processed by all other threads
                        
                    level--;
                    levelDownAtIndex = (level <= 0 ? INT_MAX : levelLists.GetFirstIndex (level-1) );
                }
            }

            if (workEnd == endIndex) break;
        }
    }
 

27 ms sounds low; you could try one thread with 5 and one thread with 6 of the (different) outer iterations.

In general, longer thread execution time reduces the startup/shutdown overhead of a thread.

However, I would not risk running into multithreading bugs for a gain of about 0.15 seconds (11 * 27 ms is about 0.3 seconds; 2 threads at 0.15 seconds each means roughly 0.15 seconds gained).

Hehe, I've already discovered that my code does not work... always :)

But I know these problems from GPGPU, and usually after a first day of headache things get better and 'thinking multithreaded' becomes easier.

The loop I'm talking about is basically a breadth-first tree traversal with 11 tree levels - total runtime is 27 ms.

The first level has 1 node, the second 2, the fourth 8 and so forth - so I process the first levels with a single thread while the others are wasted on waiting.

The bottom tree level has 60000 nodes, so multithreading is a big win at the end (8 ms).

The waiting could be avoided with a more advanced job system, so the threads can grab other independent work.

Using this job system everywhere should, over time, reduce the risk of an undiscovered bug and the bad gut feeling to zero.

Do the first 3 levels in one thread, then fork 4 threads for the 4 alternatives, and compute min() afterwards.

You can also fork 4 threads immediately, and let each of them do the first 2 levels redundantly rather than computing them only once.

The first level has 1 node, the second 2, the fourth 8 and so forth - so I process the first levels with a single thread while the others are wasted on waiting.

The bottom tree level has 60000 nodes, so multithreading is a big win at the end (8 ms).

Huh? 1st level is 2**0, then 11th level is 2**10 == 1024.

It's not a binary tree I'm using; I just used that as a worst-case example.

In case someone is interested, here's the correct code I'm using now. After two days of using it, I'm sure this time :)

It serves well enough as a minimal job system, and I was able to speed up my application by the number of cores, even for stuff where I would have expected bandwidth limits.

Usage example is:

int num_threads = (int)std::thread::hardware_concurrency();
num_threads = std::min (64, std::max (4, num_threads));
std::thread threads[64];
 
ThreadsForLevelLists workList;
workList.AddLevel (0, 4); // first execute nodes 0-3
workList.AddLevel (4, 6); // then the next 6 nodes (4-9), ensuring the previous level is done
//...
 
workList.SetJobFunction (ComputeBoundingBox, this);
for (int tID=1; tID<num_threads; tID++) threads[tID] = std::thread (ThreadsForLevelLists::sProcessDependent, &workList);
workList.ProcessDependent();
for (int tID=1; tID<num_threads; tID++) threads[tID].join();

Instead of dividing each level's work by the number of cores, it uses work stealing of small batches.

The advantage is that this compensates for different runtimes of the job function.
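Stripped of the level bookkeeping, that batch-stealing core can be shown in isolation. This is a sketch with made-up names (StealBatches, the sum accumulator standing in for the job function), not the struct from the post:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>

// The batch-stealing idiom in isolation: every thread bumps a shared
// counter by 'batch' and owns the index range it got back, so faster
// threads simply grab more batches than slower ones.
void StealBatches (std::atomic<int>* counter, int total, int batch,
                   std::atomic<long long>* sum)
{
    for (;;)
    {
        const int begin = counter->fetch_add (batch);
        if (begin >= total) break;
        const int end = std::min (begin + batch, total);
        for (int i = begin; i < end; i++)
            sum->fetch_add (i); // stand-in for jobFunction(i)
    }
}
```

Because fetch_add hands out each range exactly once, every index is claimed by exactly one thread no matter how unevenly the threads are scheduled.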

struct ThreadsForLevelLists
{
    // Calls jobFunction in level order:
    // for (int level=0; level<numLevels; level++)
    // {
    //     for (int i=0; i<levelSize; i++) jobFunction (levelStartIndex + i);
    //     // barrier in case of ProcessDependent() to ensure the previous level has been completed
    // }

    enum
    {
        MAX_LEVELS = 32,
    };

    int firstIteration[MAX_LEVELS];
    unsigned int firstIndex[MAX_LEVELS+1];

    int numLevels;
    void (*jobFunction)(const int, void*);
    void* data;

    std::atomic<int> workIndex;
    std::atomic<int> workDone;
    int iterations;

    ThreadsForLevelLists ()
    {
        numLevels = 0;
        firstIndex[0] = 0;
        workIndex = 0;
        workDone = 0;
    }

    void Reset ()
    {
        workIndex = 0;
        workDone = 0;
    }

    void SetJobFunction (void (*jobFunction)(const int, void*), void *data, int iterations = 64)
    {
        Reset ();
        this->jobFunction = jobFunction;
        this->data = data;
        this->iterations = iterations;
    }

    void AddLevel (const int levelStartIndex, const unsigned int size)
    {
assert (numLevels < MAX_LEVELS);
        firstIteration[numLevels] = levelStartIndex;
        firstIndex[numLevels+1] = firstIndex[numLevels] + size;
        numLevels++;
    }

    void ProcessDependent ()
    {
        const unsigned int wEnd = firstIndex[numLevels];
        int level = 0;
        int levelReady = 0;

        for (;;)
        {
            int wI = workIndex.fetch_add (iterations);        
            if (wI >= wEnd) break;

            int wTarget = std::min (wI + iterations, (int)wEnd);
            while (wI != wTarget)
            {
                while (wI >= firstIndex[level+1]) level++;

                int wMax = std::min (wTarget, (int)firstIndex[level+1]);
                int numProcessed = wMax - wI;

                for (;;)
                {
                    int dI = workDone.load();        
                    while (dI >= firstIndex[levelReady+1]) levelReady++;
                    if (levelReady >= level) break;
                    std::this_thread::yield();

                    // todo: optionally store a pointer to another ThreadsForLevelLists and process it instead of yielding
                }

                int indexOffset = firstIteration[level] - firstIndex[level];
                for (; wI < wMax; wI++)
                    jobFunction (indexOffset + wI, data);

                workDone.fetch_add (numProcessed);
            }
        }
    }

    void ProcessIndependent ()
    {
        const unsigned int wEnd = firstIndex[numLevels];
        int level = 0;

        for (;;)
        {
            int wI = workIndex.fetch_add (iterations);        
            if (wI >= wEnd) break;

            int wTarget = std::min (wI + iterations, (int)wEnd);
            while (wI != wTarget)
            {
                while (wI >= firstIndex[level+1]) level++;

                int wMax = std::min (wTarget, (int)firstIndex[level+1]);
                
                int indexOffset = firstIteration[level] - firstIndex[level];
                for (; wI < wMax; wI++)
                    jobFunction (indexOffset + wI, data);
            }
        }
    }

    static void sProcessDependent (ThreadsForLevelLists *ptr) // todo: move Process() to cpp file to avoid the need for a static function
    {
        ptr->ProcessDependent();
    }
    static void sProcessIndependent (ThreadsForLevelLists *ptr) // todo: move Process() to cpp file to avoid the need for a static function
    {
        ptr->ProcessIndependent();
    }
};
