Member Since 30 Aug 2006

#5301228 External Level Editor?

Posted by JoeJ on 18 July 2016 - 10:58 AM



Looks good. (Haven't tried myself yet)

#5297248 Is there any reason to prefer procedural programming over OOP

Posted by JoeJ on 19 June 2016 - 02:33 PM

I've been steadily moving away from OOP over the years. The idea of inheritance never made much sense to me - it just complicates things and forces you to make decisions about software design up front. To me that's just blah blah, and I prefer to spend this time on solving real problems.


So I ended up with a 'C with classes' style, but I moved away from that too, mainly because of this:


Class member functions hide some of the data they use, because you don't know which member variables they access without looking at the implementation.

This makes it hard to see data complexity, which is important for optimizing / refactoring.

Often I ended up making member functions static, forcing me to pass all data through the function parameters - just to see how many there are (ALWAYS more than you would expect).


Next I realized that static member functions can be used from anywhere - how practical.

So why was I still using classes?

My answer was simply: to group related functions together by 'topic', so I can find them somehow.


But there is something better for this: namespaces.

With namespaces it's possible to group stuff in hierarchies without any of the restrictions or problems known from inheritance.


Today I create classes very rarely, using them only as an interface to a large system that is implemented mostly procedurally under the hood.

But I still use a lot of small structs with member functions for trivial functionality like indexing arrays or un/packing.

#5297100 Is C++11's library a magic bullet for ALL multicore/multithread programming?

Posted by JoeJ on 17 June 2016 - 11:37 PM

Let's say we have 3 tasks to do:

Software occlusion culling (front to back dependency -> serial algorithm -> not good to parallelize)

Animating 100 characters

Physics simulation (100 islands of rigid bodies in contact)


The easy route would be to use one thread per task - maybe suboptimal, but good enough if your speedup is about the number of cores.

The hard and error-prone route would be trying to parallelize the occlusion culling - ending up with a small speedup for a lot of work and debugging time.


The practical route would be: one thread for occlusion culling, while the other threads run a job system processing all characters, and after that all physics islands.

If a single character is very fast to process, we would choose to process e.g. 4 characters per job to hide the synchronization costs.


std::async and other high-level functionality can be used to achieve this; my approach using atomics is more the low-level kind.

Looking at http://en.cppreference.com/w/cpp/thread/async I tend to think: there is no control over the creation of threads (which is expensive), and there is no guarantee multithreading is used at all. So I'll probably never use it, but maybe it's just a matter of personal preference.

#5297002 Is C++11's library a magic bullet for ALL multicore/multithread programming?

Posted by JoeJ on 17 June 2016 - 01:18 PM

I've had a similar thread a while back: http://www.gamedev.net/topic/679252-multithreading-c11-vs-openmp/


The little job system I made there may have some advantages over your current approach:

You divide the work into four 'groups' - but what if one group finishes early and another takes very long? The job system balances itself here (as long as the jobs are small enough).

Also, if those groups do very different things, they need to access different memory. With the job system it's easy to parallelize a single workload, which utilizes the cache better.

#5296100 Multithreading - C++11 vs. OpenMP

Posted by JoeJ on 11 June 2016 - 11:43 AM

It's not a binary tree I'm using - I just used that as a worst case example.


In case someone is interested, here's the corrected code I'm using now. After two days of using it, I'm sure this time :)

It serves well enough as a minimal job system, and I was able to speed up my application by the number of cores - even for stuff where I would have expected bandwidth limits.


Usage example is:

int num_threads = (int) std::thread::hardware_concurrency();
num_threads = min (64, max (4, num_threads));
std::thread threads[64];
ThreadsForLevelLists workList;
workList.AddLevel (0, 4); // first execute nodes 0-3
workList.AddLevel (4, 3); // then nodes 4-6, ensuring the previous level is done
workList.SetJobFunction (ComputeBoundingBox, this);
for (int tID=1; tID<num_threads; tID++) threads[tID] = std::thread (ThreadsForLevelLists::sProcessDependent, &workList);
ThreadsForLevelLists::sProcessDependent (&workList); // the main thread participates as well
for (int tID=1; tID<num_threads; tID++) threads[tID].join();


Instead of dividing each level's work by the number of cores, it uses work stealing of small batches.

The advantage is that this compensates for different runtimes of the job function.




#include <atomic>
#include <cassert>
#include <thread>

struct ThreadsForLevelLists
{
    // Calls jobFunction for each level in order:
    // for (int level=0; level<numLevels; level++)
    // {
    //     for (int i=0; i<levelSize; i++) jobFunction (levelStartIndex + i, data);
    //     // barrier in case of ProcessDependent() to ensure the previous level has been completed
    // }

    enum { MAX_LEVELS = 32 };

    int firstIteration[MAX_LEVELS];
    unsigned int firstIndex[MAX_LEVELS+1];

    int numLevels;
    void (*jobFunction)(const int, void*);
    void* data;

    std::atomic<int> workIndex;
    std::atomic<int> workDone;
    int iterations;

    ThreadsForLevelLists ()
    {
        numLevels = 0;
        firstIndex[0] = 0;
        workIndex = 0;
        workDone = 0;
    }

    void Reset ()
    {
        workIndex = 0;
        workDone = 0;
    }

    void SetJobFunction (void (*jobFunction)(const int, void*), void *data, int iterations = 64)
    {
        Reset ();
        this->jobFunction = jobFunction;
        this->data = data;
        this->iterations = iterations;
    }

    void AddLevel (const int levelStartIndex, const unsigned int size)
    {
        assert (numLevels < MAX_LEVELS);
        firstIteration[numLevels] = levelStartIndex;
        firstIndex[numLevels+1] = firstIndex[numLevels] + size;
        numLevels++;
    }

    void ProcessDependent ()
    {
        const int wEnd = (int) firstIndex[numLevels];
        int level = 0;
        int levelReady = 0;

        for (;;)
        {
            int wI = workIndex.fetch_add (iterations);
            if (wI >= wEnd) break;

            int wTarget = std::min (wI + iterations, wEnd);
            while (wI != wTarget)
            {
                while (wI >= (int) firstIndex[level+1]) level++;

                int wMax = std::min (wTarget, (int) firstIndex[level+1]);
                int numProcessed = wMax - wI;

                for (;;) // wait until the previous level has been completed
                {
                    int dI = workDone.load ();
                    while (dI >= (int) firstIndex[levelReady+1]) levelReady++;
                    if (levelReady >= level) break;
                    std::this_thread::yield ();
                    // todo: optionally store a pointer to another ThreadsForLevelLists and process it instead of yielding
                }

                int indexOffset = firstIteration[level] - (int) firstIndex[level];
                for (; wI < wMax; wI++)
                    jobFunction (indexOffset + wI, data);

                workDone.fetch_add (numProcessed);
            }
        }
    }

    void ProcessIndependent ()
    {
        const int wEnd = (int) firstIndex[numLevels];
        int level = 0;

        for (;;)
        {
            int wI = workIndex.fetch_add (iterations);
            if (wI >= wEnd) break;

            int wTarget = std::min (wI + iterations, wEnd);
            while (wI != wTarget)
            {
                while (wI >= (int) firstIndex[level+1]) level++;

                int wMax = std::min (wTarget, (int) firstIndex[level+1]);
                int indexOffset = firstIteration[level] - (int) firstIndex[level];
                for (; wI < wMax; wI++)
                    jobFunction (indexOffset + wI, data);
            }
        }
    }

    // todo: move Process() to a cpp file to avoid the need for static wrapper functions
    static void sProcessDependent (ThreadsForLevelLists *ptr) { ptr->ProcessDependent (); }
    static void sProcessIndependent (ThreadsForLevelLists *ptr) { ptr->ProcessIndependent (); }
};

#5295752 Multithreading - C++11 vs. OpenMP

Posted by JoeJ on 09 June 2016 - 04:09 AM


I'll stick with C++11 - enough functionality to learn; maybe I'll look into OS APIs (much) later :)


I found the reason for the performance differences.

In the cases where my C++11 approach was slow, it's because threads were created and joined for each iteration of an outer loop.

Even though the outer loop has only 11 iterations and the single-threaded runtime is 27ms, this caused the slowdown.


I've fixed this by removing the outer loop, but it's still necessary to finish all work in order.

It's a simple problem, but it already requires ugly and hard-to-read code like below - maybe I can improve it.


However - with this code I can't measure a performance difference between C++11 and OpenMP anymore :)

void ProcessMT (std::atomic<int> *workIndex, std::atomic<int> *workDone, const int iterations)
{
    const int endIndex = levelLists.size;
    for (;;)
    {
        int workStart = workIndex->fetch_add (iterations);
        int workEnd = min (workStart + iterations, endIndex);

        int level = levelLists.GetLevel (workStart);
        int levelDownAtIndex = (level <= 0 ? INT_MAX : levelLists.GetFirstIndex (level-1));
        while (levelLists.GetLevel (workDone->load()) > level+1) std::this_thread::yield(); // don't skip over a whole level

        int i = workStart;
        while (i < workEnd)
        {
            int safeEnd = min (levelDownAtIndex, workEnd);
            int progress = safeEnd - i;

            for (; i < safeEnd; i++)
                DoTheWork (i);

            workDone->fetch_add (progress); // publish completed work

            if (i == levelDownAtIndex)
            {
                while (workDone->load() < workStart) std::this_thread::yield(); // wait until the current level has been processed by all other threads
                level--; // move down one level before recomputing the boundary
                levelDownAtIndex = (level <= 0 ? INT_MAX : levelLists.GetFirstIndex (level-1));
            }
        }

        if (workEnd == endIndex) break;
    }
}

#5295677 Multithreading - C++11 vs. OpenMP

Posted by JoeJ on 08 June 2016 - 02:31 PM

I'm trying to find some 'best performance practices' for multithreading before looking at implementing more advanced job systems.

But I don't understand the results - depending on the case, either OpenMP or C++11 is a lot faster.

It's always possible in my tests to get a 4x speedup on my i7, so it's surely no hardware limit.


Example one - the C++11 speedup is slightly above 4, the OpenMP speedup only 2:



void MyClass::ProcessMT (const int tID, std::atomic<int> *work, const int iterations)
{
    for (;;) // each virtual core steals a batch of the given work iterations and processes them until nothing is left
    {
        const int workStart = work->fetch_add (iterations);
        const int workEnd = min (workStart + iterations, myTotalWorkAmount);

        for (int i=workStart; i<workEnd; i++)
            DoTheWork (i);

        if (workEnd == myTotalWorkAmount) break;
    }
}

void MyClass::Update ()
{
    int num_threads = (int) std::thread::hardware_concurrency(); // returns 8 on i7
    num_threads = min (64, max (4, num_threads));
    std::thread threads[64];
    std::atomic<int> work (0);
    for (int tID=1; tID<num_threads; tID++) threads[tID] = std::thread (&MyClass::ProcessMT, this, tID, &work, 64);
    ProcessMT (0, &work, 64); // the main thread participates as well
    for (int tID=1; tID<num_threads; tID++) threads[tID].join();
}


OpenMP for the same thing is simply:

#pragma omp parallel for
for (int i=0; i<myTotalWorkAmount; i++)
    DoTheWork (i);


I have used this method on various different tasks.

There are also cases where OpenMP is 4 times faster and my C++11 method only 2 times - exactly the opposite behaviour.


I tried to tune the iteration count and ensured not to create any threads that would have no work - only minor improvements.

The tasks themselves are nothing special - no dynamic allocations or anything like that.


My guess is that the only solid method is to create persistent worker threads and keep them alive,

instead of creating new threads for each new kind of task.

But this alone cannot explain the behaviour I see.


Maybe C++11 (VC2013, Win10) is not that mature yet, similar to how OpenMP is not really usable for games?

I'd really like to use C++11 instead of looking at libraries or the OS - but not as long as I'm not on par with OpenMP everywhere. <_<


Maybe you can share some experience, have some thoughts or suggest a better method...




#5294227 GCN Shader Extensions

Posted by JoeJ on 30 May 2016 - 02:31 PM

I think OpenCL will stay - it has most of its applications outside games / graphics. Or are there any games using it?

Personally I liked OpenCL a lot more than GLSL for compute: easier to use, more solid programming - less 'just a shader' - and it was always faster on any hardware I've tested.


SPIR-V already draws a line between OpenCL and Vulkan compute:

"Execution models include Vertex, GLCompute, etc. (one for each graphical stage), as well as Kernel for OpenCL kernels."

I don't know if there are technical or business reasons behind that.

Because there are no plans for OpenCL <-> Vulkan data sharing, we have no choice anyway.


But those extensions are exactly what I want, so there should be no big reason to look back at OpenCL.

#5294089 GCN Shader Extensions

Posted by JoeJ on 29 May 2016 - 06:12 PM



Just in case someone else missed this.

No need to wait for SM 6.0 :D


#5293817 Atomic Add for Float Type

Posted by JoeJ on 27 May 2016 - 10:13 AM

Often it's best to avoid atomics; e.g. to sum up all the numbers in an array, a parallel reduction as used in prefix scan methods should be much faster.

OpenCL sample:


accSum[lID] = lacc; // lID = thread index; each thread puts its value into the array, and we want the sum of all values
barrier (CLK_LOCAL_MEM_FENCE);

uint step = 0;
while (true)
{
    uint add = (1<<step);
    uint index = lID + add; // neighbour array index
    if ((lID & ((~1)<<step)) == lID) accSum[lID] += accSum[index]; // add neighbour value to own value
    barrier (CLK_LOCAL_MEM_FENCE);
    if (add == (wgS>>1)) break; // wgS = total number of threads, i.e. the number of array entries
    step++;
}
// accSum[0] contains the sum now


Illustrated, this works like:


2 2 2 2
4   4
8


So for an array of 256 entries you need only 8 loop iterations, without any atomic conflicts.


To make this fast you need compute shaders, where all threads have access to the array stored in LDS memory.


If that's new to you look it up! ;)

#5293568 Are Third Party Game Engines the Future

Posted by JoeJ on 26 May 2016 - 06:56 AM

"As technology improves and third party tools improve, do you think that the bigger AAA game studios that have internal engines will eventually switch to using third party engines or will the industry continue as is for the foreseeable future?"


What a nightmare - the end of game "development", and the rise of the game "maker" era.


Fortunately this will not happen any time soon. If you look closely, UE4, CryEngine and Unity are mainly indie engines today.

(Nothing against that - it's very welcome.)


For me it would be MUCH more work to tweak those engines to my needs than to write a new one from scratch.

The same goes for AAA companies - a few people are enough to do this, a fraction of what is necessary for content creation.

It makes more sense to pay for something specialized like Umbra, Natural Motion, Simplygon etc.


The only downside of an in-house engine is the missing public reward. Lots of people out there fell in love with UE4 demos or still think Crysis is the best graphics ever.

They run around screaming 'downgrade!' and 'upscaled!', knowing nothing about the work they criticize or the limitations of their favorite engines.

At least that's my impression after reading sites related to PC gaming... there's something going wrong here.

#5293499 Cracks between patches with same the LOD level.

Posted by JoeJ on 25 May 2016 - 11:28 PM

An indexing bug? Instead of using the same height twice for both borders, you might be using height(n) for one patch and height(n+1) for the other.

#5293496 what good are cores?

Posted by JoeJ on 25 May 2016 - 11:15 PM

Reminds me of the learning process I've gone through with GPU compute and LDS.

Since then I've really wished I had something like control over the CPU cache.


But I also noticed most people would not want this. They don't want to code close to the metal,

they just want the metal (or the compiler) to be clever enough to run their code efficiently.


Probably they are afraid of the additional work.

Personally I think the more control you have, the less work is necessary - less trial and error, guessing, hoping and profiling.


On the other hand, on the GPU the LDS size limit became a big influence on which algorithms I choose.

E.g. if it grew twice as large for a new generation of GPUs, I'd need to change huge amounts of code in drastic ways to get the best performance again.


So - in the long run - maybe those other people are right? Man should rule the machine - not the other way around?

#5293239 what good are cores?

Posted by JoeJ on 24 May 2016 - 11:37 AM


Honestly, the fact you think this means you are a good 15 years behind the curve right now - threads have been a pretty big deal for some time and far from 'bling bling'.

Threads and cores are two different things imho; having hundreds of threads doesn't imply you need many cores.


I believe it's actually pretty hard to usefully use more than 1-2 cores full time.



I have observed this situation quite often in a preprocessing tool using OpenMP:


Do a very simple thing like building a mip-map level in parallel: speedup is 1.5... very disappointing.

Do a complex thing like ray tracing: speedup is 4... yep - that's the number of cores.


My conclusion is that memory bandwidth limits hurt the mip-map generation.

I assume it would be faster to do the mips and the tracing at the same time, so the memory limit is hidden behind the tracing calculations.


Are there any known approaches where a job system tries to do this automatically, using some info like job.m_bandWidthCost?

I've never heard of something like that.

#5292783 Hybrid Frustum Traced Shadows

Posted by JoeJ on 21 May 2016 - 02:36 PM

Instead of rasterizing potential light-blocking triangles and doing the shadow test based on the rasterization result,

they do the shadow test directly on the triangles themselves, testing each texel against the planes built from the triangle edges and the light source position.

To ensure no edge-intersecting texel is missed, conservative rasterization is necessary.


The pro is robustness leading to pixel-perfect shadows (reminds me of shadow volumes).


The con is trashing the entire idea behind the efficiency of shadow maps.

There should be better ways to spend the PC vs. console performance advantage... just my opinion :)