JoeJ

Multithreading - C++11 vs. OpenMP


Recommended Posts

I'm trying to find some 'best performance practices' for multithreading before looking at implementing more advanced job systems.

But I don't understand my results - depending on the test, either OpenMP or C++11 is a lot faster than the other.

A 4x speedup is always possible in my tests on my i7, so it's surely no hardware limit.

 

Example one: the C++11 speedup is slightly above 4, the OpenMP speedup only 2:

 

C++11:

void MyClass::ProcessMT (const int tID, std::atomic<int> *work, const int iterations)
{
    // Each virtual core steals a batch of 'iterations' work items and processes them until nothing is left.
    for (;;)
    {
        const int workStart = work->fetch_add (iterations);
        const int workEnd = std::min (workStart + iterations, myTotalWorkAmount);

        for (int i=workStart; i<workEnd; i++)
            DoTheWork (i);

        if (workEnd == myTotalWorkAmount) break; // was 'size', which is not defined here
    }
}

void MyClass::Update()
{
    int num_threads = (int) std::thread::hardware_concurrency(); // returns 8 on my i7
    num_threads = std::min (64, std::max (4, num_threads));
    std::thread threads[64];

    std::atomic<int> work (0); // note: 'std::atomic<int> work = 0' does not compile in C++11 - the copy constructor is deleted
    for (int tID=1; tID<num_threads; tID++)
        threads[tID] = std::thread (&MyClass::ProcessMT, this, tID, &work, 64);
    ProcessMT (0, &work, 64); // the main thread helps too
    for (int tID=1; tID<num_threads; tID++)
        threads[tID].join();
}

 

OpenMP for the same thing is simply:

#pragma omp parallel for
for (int i=0; i<myTotalWorkAmount; i++)
    DoTheWork (i);
 

 

I have used this method on various different tasks.

There are also cases where OpenMP gives a 4x speedup and my C++11 method only 2x - exactly the opposite behaviour.

 

I tried to tune the iteration count and made sure not to create any threads that would have no work - only minor improvements.

The tasks themselves are nothing special - no dynamic allocations or anything like that.

 

My guess is that the only solid method is to create persistent worker threads and keep them alive,

instead of creating new threads for each new kind of task.

But this alone cannot explain the behaviour I see.
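For reference, a minimal sketch of that idea - workers created once, parked on a condition variable, and reusing the same fetch_add batch stealing as above. All names here are invented for illustration, and this is not production-grade (e.g. the job state is only safely published to workers through the generation counter, so overlapping Run calls from multiple threads are not supported):

```cpp
#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Persistent worker pool: threads are created once and parked on a condition
// variable, so starting a parallel loop costs a wakeup instead of a thread create/join.
class WorkerPool
{
public:
    explicit WorkerPool (unsigned numThreads)
    {
        for (unsigned t = 0; t < numThreads; ++t)
            threads.emplace_back ([this] { WorkerLoop(); });
    }

    ~WorkerPool ()
    {
        {
            std::lock_guard<std::mutex> lock (mutex);
            quit = true;
        }
        wakeWorkers.notify_all();
        for (auto& t : threads) t.join();
    }

    // Run job(i) for every i in [0, count). Blocks until all work is done.
    void Run (std::function<void(int)> jobFn, int count, int batch = 64)
    {
        {
            std::lock_guard<std::mutex> lock (mutex);
            job = std::move (jobFn);
            workEnd = count;
            batchSize = batch;
            workIndex.store (0);
            workDone.store (0);
            ++generation; // publishes the new job to the workers
        }
        wakeWorkers.notify_all();
        Process(); // the calling thread helps, like ProcessMT (0, ...) above
        std::unique_lock<std::mutex> lock (mutex);
        allDone.wait (lock, [this] { return workDone.load() >= workEnd; });
    }

private:
    void WorkerLoop ()
    {
        long long seenGeneration = 0;
        for (;;)
        {
            std::unique_lock<std::mutex> lock (mutex);
            wakeWorkers.wait (lock, [&] { return quit || generation != seenGeneration; });
            if (quit) return;
            seenGeneration = generation;
            lock.unlock();
            Process();
        }
    }

    void Process ()
    {
        for (;;)
        {
            const int start = workIndex.fetch_add (batchSize);
            if (start >= workEnd) break;
            const int end = std::min (start + batchSize, workEnd);
            for (int i = start; i < end; ++i) job (i);
            if (workDone.fetch_add (end - start) + (end - start) >= workEnd)
            {
                std::lock_guard<std::mutex> lock (mutex); // lock before notify to avoid a missed wakeup
                allDone.notify_one();
            }
        }
    }

    std::vector<std::thread> threads;
    std::mutex mutex;
    std::condition_variable wakeWorkers, allDone;
    std::function<void(int)> job;
    std::atomic<int> workIndex {0}, workDone {0};
    int workEnd = 0, batchSize = 64;
    long long generation = 0;
    bool quit = false;
};
```

Usage would be `pool.Run([&](int i){ DoTheWork(i); }, myTotalWorkAmount);`, paying the thread-creation cost once per pool instead of once per dispatch.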

 

Maybe C++11 threading (VC2013, Win10) is not that mature yet, similar to how OpenMP is not really usable for games?

I'd really like to use C++11 instead of looking at libraries or OS APIs, but not as long as I can't get on par with OpenMP everywhere? <_<

 

Maybe you can share some experience, have some thoughts or suggest a better method...

 

 

 


Thanks!

I'll stick with C++11 - enough functionality to learn; maybe I'll look into OS-level APIs (much) later :)

 

I found the reason for the performance differences.

In the cases where my C++11 approach was slow, threads were created and joined for each iteration of an outer loop.

Even though the outer loop has only 11 iterations and the single-threaded runtime is 27 ms, this caused the slowdown.

 

I've fixed this by removing the outer loop, but it's still necessary to finish all work in order.

A simple problem, but it already requires ugly, hard-to-read code like the snippet below - maybe I can improve it.

 

However - with this code I can't measure a performance difference between C++11 and OpenMP anymore :)

void ProcessMT (std::atomic<int> *workIndex, std::atomic<int> *workDone, const int iterations)
{
    const int endIndex = levelLists.size;
    for (;;)
    {
        int workStart = workIndex->fetch_add (iterations);
        int workEnd = std::min (workStart + iterations, endIndex);

        int level = levelLists.GetLevel (workStart);
        int levelDownAtIndex = (level <= 0 ? INT_MAX : levelLists.GetFirstIndex (level-1));

        while (levelLists.GetLevel (workDone->load()) > level+1) std::this_thread::yield(); // don't skip over a whole level

        int i = workStart;
        while (i < workEnd)
        {
            int safeEnd = std::min (levelDownAtIndex, workEnd);
            int progress = safeEnd - i;

            for (; i<safeEnd; i++)
                DoTheWork (i);

            workDone->fetch_add (progress);

            if (i == levelDownAtIndex)
            {
                while (workDone->load() < workStart) std::this_thread::yield(); // wait until the current level has been processed by all other threads

                level--;
                levelDownAtIndex = (level <= 0 ? INT_MAX : levelLists.GetFirstIndex (level-1));
            }
        }

        if (workEnd == endIndex) break;
    }
}
 
Edited by JoeJ


27 ms sounds low. You could try one thread with 5 and one thread with 6 (different) outer iterations.

 

In general, a longer thread execution time amortizes the startup/shutdown overhead of the thread.

 

 

However, I would not risk running into multi-threading bugs for a gain of about 0.15 seconds (11 × 27 ms ≈ 0.3 seconds; split across 2 threads that's 0.15 seconds each, so roughly 0.15 seconds saved).


Hehe, already discovered that my code does not work... always :)

 

But I know these problems from GPGPU - usually, after a first day of headache, things get better and 'thinking multi-threaded' becomes easier.

 

The loop I'm talking about is basically a breadth-first tree traversal with 11 tree levels - total runtime is 27 ms.

The first level has 1 node, the second 2, the fourth 8 and so forth - so I process the first levels with a single thread while the other threads are wasted on waiting.

The bottom tree level has 60000 nodes, so multithreading is a big win at the end (8 ms).

 

The waiting could be avoided with a more advanced job system, so idle threads can grab other, independent work.

Using such a job system everywhere should, over time, reduce the risk of undiscovered bugs and the uneasy feeling to zero.


Do the first 3 in one thread, then fork 4 threads for the 4 alternatives, compute min() afterwards.

You can also fork 4 threads immediately, and do the first 2 levels a bit more often than once.

 

> First iteration has 1 nodes, second 2, forth 8 and so forth - so i process the first levels with a single thread while the others are wasted with waiting.

> Bottom tree level has 60000 nodes so multi threading is a big win at the end (8ms).

Huh?  1st level is 2**0, then 11th level is 2**10 == 1024.


It's not a binary tree I'm using - I just used one as a worst-case example.

 

In case someone is interested, here's the corrected code I'm using now. After two days of using it, I'm sure this time :)

It serves well enough as a minimal job system, and I was able to speed up my application by the number of cores - even stuff where I would have expected bandwidth limits.

 

Usage example is:

int num_threads = (int) std::thread::hardware_concurrency();
num_threads = std::min (64, std::max (4, num_threads));
std::thread threads[64];

ThreadsForLevelLists workList;
workList.AddLevel (0, 4); // first execute nodes 0-3
workList.AddLevel (4, 6); // then nodes 4-9, ensuring the previous level is done first
//...

workList.SetJobFunction (ComputeBoundingBox, this);
for (int tID=1; tID<num_threads; tID++)
    threads[tID] = std::thread (ThreadsForLevelLists::sProcessDependent, &workList);
workList.ProcessDependent(); // was 'workList_LevelCounters_TopDown' - a leftover name from my codebase
for (int tID=1; tID<num_threads; tID++)
    threads[tID].join();

 

Instead of dividing each level's work by the number of cores, it uses work stealing of small batches.

The advantage is that this compensates for differing runtimes of the job function.

 

 

 

struct ThreadsForLevelLists
{
    // Calls jobFunction in order:
    // for (int level=0; level<numLevels; level++)
    // {
    //     for (int i=0; i<levelSize; i++) jobFunction (levelStartIndex + i, data);
    //     // barrier in case of ProcessDependent() to ensure the previous level has been completed
    // }

    enum
    {
        MAX_LEVELS = 32,
    };

    int firstIteration[MAX_LEVELS];
    unsigned int firstIndex[MAX_LEVELS+1];

    int numLevels;
    void (*jobFunction)(const int, void*);
    void* data;

    std::atomic<int> workIndex;
    std::atomic<int> workDone;
    int iterations;

    ThreadsForLevelLists ()
    {
        numLevels = 0;
        firstIndex[0] = 0;
        workIndex = 0;
        workDone = 0;
    }

    void Reset ()
    {
        workIndex = 0;
        workDone = 0;
    }

    void SetJobFunction (void (*jobFunction)(const int, void*), void *data, int iterations = 64)
    {
        Reset ();
        this->jobFunction = jobFunction;
        this->data = data;
        this->iterations = iterations;
    }

    void AddLevel (const int levelStartIndex, const unsigned int size)
    {
        assert (numLevels < MAX_LEVELS);
        firstIteration[numLevels] = levelStartIndex;
        firstIndex[numLevels+1] = firstIndex[numLevels] + size;
        numLevels++;
    }

    void ProcessDependent ()
    {
        const int wEnd = (int) firstIndex[numLevels]; // cast avoids a signed/unsigned comparison with wI below
        int level = 0;
        int levelReady = 0;

        for (;;)
        {
            int wI = workIndex.fetch_add (iterations);        
            if (wI >= wEnd) break;

            int wTarget = min (wI + iterations, wEnd);
            while (wI != wTarget)
            {
                while (wI >= firstIndex[level+1]) level++;

                int wMax = min (wTarget, firstIndex[level+1]);            
                int numProcessed = wMax - wI;

                for (;;)
                {
                    int dI = workDone.load();        
                    while (dI >= firstIndex[levelReady+1]) levelReady++;
                    if (levelReady >= level) break;
                    std::this_thread::yield();

                    // todo: optionally store a pointer to another ThreadsForLevelLists and process it instead of yielding
                }

                int indexOffset = firstIteration[level] - firstIndex[level];
                for (; wI < wMax; wI++)
                    jobFunction (indexOffset + wI, data);

                workDone.fetch_add (numProcessed);
            }
        }
    }

    void ProcessIndependent ()
    {
        const int wEnd = (int) firstIndex[numLevels]; // cast avoids a signed/unsigned comparison with wI below
        int level = 0;

        for (;;)
        {
            int wI = workIndex.fetch_add (iterations);        
            if (wI >= wEnd) break;

            int wTarget = min (wI + iterations, wEnd);
            while (wI != wTarget)
            {
                while (wI >= firstIndex[level+1]) level++;

                int wMax = min (wTarget, firstIndex[level+1]);            
                
                int indexOffset = firstIteration[level] - firstIndex[level];
                for (; wI < wMax; wI++)
                    jobFunction (indexOffset + wI, data);
            }
        }
    }

    static void sProcessDependent (ThreadsForLevelLists *ptr) // todo: move Process() to cpp file to avoid the need for a static function
    {
        ptr->ProcessDependent();
    }
    static void sProcessIndependent (ThreadsForLevelLists *ptr) // todo: move Process() to cpp file to avoid the need for a static function
    {
        ptr->ProcessIndependent();
    }
};
