
JoeJ

Member Since 30 Aug 2006
Online Last Active Today, 01:33 PM

#5308602 Seriously weird bug in opencl

Posted by JoeJ on Today, 12:49 AM

"I would tell you that learning OpenCL is currently a waste of time when it comes to game programming."

 

I think this is very bad advice.

 

OpenCL is the easiest way to do and learn GPU compute - you need one line of code to upload data to GPU memory, while with Vulkan you need about 50.

The ratio is about the same for something as simple as executing a kernel.

 

OpenGL compute shaders are also a lot more cumbersome to use than OpenCL,

and at least in the past OpenCL was twice(!) as fast on Nvidia, and a bit faster on AMD.

You would not make such a statement if you had taken the time to test this yourself.

OpenCL is not popular in game dev, but that's simply our own fault.

 

AFAIK SPIR-V is core with OpenCL 2.1; NV is at 1.2, AMD is at 2.0.

Only Vulkan really uses it now, but no one writes SPIR-V directly - it's just intermediate bytecode emitted by compilers, so it's not relevant for learning.

Feature-wise, the OpenCL 1.2 language and GLSL are very similar - anything you learn in one can easily be adapted to the other.
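
For illustration, the OpenCL host side really is just a handful of calls - a minimal sketch, assuming a cl_context, cl_command_queue and cl_kernel already exist (error checking omitted):

#include <CL/cl.h>
#include <vector>

// Upload data to GPU memory and run a kernel over it (sketch, no error checks).
void RunKernelOnce (cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                    const std::vector<float> &input)
{
    // One call creates the buffer and copies the host data to the device.
    cl_mem buf = clCreateBuffer (ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                 input.size() * sizeof(float), (void*)input.data(), nullptr);

    // Bind the buffer and launch one work item per element.
    clSetKernelArg (kernel, 0, sizeof(cl_mem), &buf);
    size_t globalSize = input.size();
    clEnqueueNDRangeKernel (queue, kernel, 1, nullptr, &globalSize, nullptr, 0, nullptr, nullptr);
    clFinish (queue);

    clReleaseMemObject (buf);
}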




#5308096 Bullet - position a collision capsule for an animated character

Posted by JoeJ on 26 August 2016 - 02:59 PM

I'll try to give an example. Say the AI wants the initially stationary character to walk a straight line from A to B with a constant velocity of 5.

We then calculate the force necessary to change the capsule velocity from zero to 5.

 

But we don't take ground friction into account, so we reach only a velocity of 3,

and because of that the capsule position ends up between the predicted animation target and its initial position.

 

Let's call this the position error and assume this error is smaller than our threshold,

so we render the character at the predicted position and use the animation to set our next simulation target.

 

After some simulation steps the error becomes smaller and smaller and everything is fine.

So this is the compensation part of the example.

 

Then something bad happens: there is a big, heavy crate in the way and the AI was not aware of it - the character tries to walk through it, which is impossible.

The capsule gets stuck, the difference between capsule and animation grows larger each step, and when it exceeds the threshold (10 cm or so),

it's time to change the animation (keep a max distance of 10 cm from the rendered character to the capsule), notify the AI of the obstacle so it stops walking, and so on.

 

 

This should work for the moving platform example as you expect (although the trick might filter out some of the cool sliding you're after, similar to how a low-pass filter removes detail).

The force should always be clamped, also to prevent the capsule from moving crates that are too heavy and to keep the simulation stable.

 

I think you can get some robust mechanics from that, but it will not help with lifelike animations - combining animation and simulation always makes this harder :)
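
To make the idea concrete, here is a rough sketch of that threshold check. All helper names are illustrative (not Bullet API), and 10 cm is just the example threshold:

// Sketch of the error threshold logic described above (illustrative helper names).
void UpdateCharacterFollowing (float maxPositionError /* e.g. 0.1f for 10 cm */)
{
    sVec3 animPos    = GetPredictedAnimationPosition ();  // where the animation wants the character
    sVec3 capsulePos = GetCapsulePosition ();             // where physics actually put the capsule
    sVec3 error      = animPos - capsulePos;

    if (error.Length() < maxPositionError)
    {
        // Small error: keep the animation, render at the predicted position and
        // let the next physics step compensate the remaining error.
        RenderCharacterAt (animPos);
        SetNextSimulationTarget (GetNextAnimationTarget ());
    }
    else
    {
        // Large error (e.g. capsule stuck at a crate): adapt the animation so the rendered
        // character stays within the threshold of the capsule, and notify the AI.
        RenderCharacterAt (capsulePos + error * (maxPositionError / error.Length()));
        NotifyAIAboutObstacle ();
    }
}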




#5307979 Bullet - position a collision capsule for an animated character

Posted by JoeJ on 26 August 2016 - 12:39 AM

I think you should not precompute over whole animations - things will drift apart not only due to friction, but also because of integration errors.

Instead I'd make the body follow the animation and allow an error threshold.

As long as the body / animation difference is smaller than the threshold, keep the animation and compensate the error in physics during the next timestep;

otherwise change the animation so the difference is not larger than the threshold (but keep trying to compensate the physics error).

 

Setting force instead of velocity, as you say, should work better (the engine can fix bad input like trying to push something into a wall).

It depends on the physics engine and maybe the selected solver, but you should see improvements using forces, such as less jitter in the wall example.

 

Here's some code to get the force from a target velocity.

currentVel is the actual body velocity.

Finally, you should clamp the resulting force to a maximum magnitude to prevent supermen and physics blow-ups.

If you set this maximum large enough, the capsule should follow even with ground friction, and it will push lightweight obstacles out of the way.

E.g. in the friction case it will automatically calculate a larger force, because the actual measured velocity will initially be low.

There will be some lag, oscillations or even jitter, but the error threshold discussed above should be able to hide those things.

 

 

// targetVel: the velocity the animation wants; currentVel: the actual body velocity.
// Returns the force needed to reach targetVel within one timestep.
inline sVec3 ConvertLinVelToForce (const sVec3 &targetVel, const sVec3 &currentVel, float timestep, float mass)
{
    sVec3 force ((targetVel - currentVel) * (mass / timestep));
    return force;
}
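
A possible clamping step could look like this (assuming sVec3 provides Length() and scalar multiplication - adapt to your vector type):

// Limit the force magnitude while keeping its direction (sketch).
inline sVec3 ClampForce (const sVec3 &force, float maxMagnitude)
{
    float len = force.Length();
    if (len <= maxMagnitude)
        return force;
    return force * (maxMagnitude / len);
}

// Usage: clamp the result of ConvertLinVelToForce before applying it to the body, e.g.
// sVec3 force = ClampForce (ConvertLinVelToForce (targetVel, currentVel, timestep, mass), maxForce);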



#5307819 Bullet - position a collision capsule for an animated character

Posted by JoeJ on 25 August 2016 - 05:19 AM

Havok has animated ragdolls built in (at least it did 10 years ago).

You could take a look at how their stuff works (how it reacts to obstacles, etc.).

 

Personally, I do it the hard way using the Newton physics engine: full simulation of the whole skeleton, a balancing controller, etc.

Newton allows creating powered joints that are stable enough for a walking character (but I need at least 90 Hz).

There are plans for a built-in simulated character feature in the (near?) future.

 

Because you are only interested in animation, it's a lot easier and should work with Bullet too (although in my experience it's the worst engine I know when it comes to stability).

The main problem is the question of how simulation should affect things - how to handle feedback from physics with procedural animation.

Your question about friction may be only the first of an infinite number...

 

 

EDIT:

Oh sorry - you're talking about a SINGLE capsule, not one capsule per bone as I thought :)

 

Have you tried not moving the capsule by the character, but the other way around?

I made a simple capsule character controller by attaching an up-vector constraint to keep it upright despite friction, and by applying forces to move it at the target speed.

You can easily tweak this to your needs and then use the capsule velocity to calculate a matching animation speed.
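
Roughly, the per-step update of such a controller could look like this. All helper names are illustrative (not Newton or Bullet API), and it reuses the ConvertLinVelToForce idea from the post above plus a clamp on the force:

// Per-step capsule controller update (sketch).
void UpdateCapsuleController (float timestep)
{
    sVec3 targetVel  = GetDesiredMoveDirection () * targetSpeed;   // from input or AI
    sVec3 currentVel = GetCapsuleLinearVelocity ();                // from the physics body

    // Drive the capsule toward the target speed; the up-vector constraint keeps it upright.
    sVec3 force = ClampForce (ConvertLinVelToForce (targetVel, currentVel, timestep, capsuleMass), maxForce);
    ApplyForceToCapsule (force);

    // Derive the animation playback speed from the velocity the capsule actually reached.
    sVec3 groundVel (currentVel[0], 0.0f, currentVel[2]);          // horizontal part only (illustrative indexing)
    SetWalkAnimationSpeed (groundVel.Length() / animationReferenceSpeed);
}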




#5305981 render huge amount of objects

Posted by JoeJ on 15 August 2016 - 09:14 AM

Storing trees in linear arrays is always good, at least if the tree structure remains mostly static (e.g. a character).

That does not mean you have to process the whole tree even if there are only a few changes.

The advantage is cache-friendly linear memory access. You get this for partial updates too, if you use a nice memory order (typically sorted by tree level as Hodgman said, with all children of any node stored contiguously).

 

However, 100 is a small number and I can't imagine tree traversal or transformation causing such low fps.

Do you upload each transform individually to the GPU? It seems you map / unmap buffers for each object - that's slow and is probably the reason.

Use a single buffer containing all transforms instead so you have only one upload per frame.

Also make sure all render data (transforms and vertices) are in GPU memory and not in host memory.
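
For example with OpenGL (just an assumption about the API here - the idea is the same elsewhere), all transforms can go into one buffer with a single update call per frame:

// Gather all object transforms into one array and upload it in a single call,
// instead of mapping / unmapping a buffer per object.
// Sketch: Matrix4, objects and transformBuffer are illustrative placeholders.
std::vector<Matrix4> transforms (objectCount);
for (size_t i = 0; i < objectCount; ++i)
    transforms[i] = objects[i].worldTransform;

glBindBuffer (GL_UNIFORM_BUFFER, transformBuffer);   // buffer created once, lives in GPU memory
glBufferSubData (GL_UNIFORM_BUFFER, 0,
                 transforms.size() * sizeof(Matrix4), transforms.data());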




#5305794 Blade Runner-ish city mood not working, could use some direction

Posted by JoeJ on 14 August 2016 - 02:22 PM

In addition, I miss fog and city density.

Your city looks sparse and still too bright; Blade Runner is a dark, dense, never-ending city - no horizon, just more buildings everywhere.




#5305651 need a algorithm to update skin mesh global AABB

Posted by JoeJ on 13 August 2016 - 01:06 PM

A more accurate but still fast method is to have a bounding capsule for each skeleton bone, and per frame calculate the bounding box from all capsules.

It's fast if the CPU already knows the bone transforms, nearly as good as going per vertex, and worthwhile for nearby / expensive characters.
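
A sketch of that per-frame computation (Capsule, Aabb, Matrix4 and their helpers are illustrative types, not from a specific engine):

struct Capsule { sVec3 a, b; float radius; };   // segment endpoints in bone space, plus radius

// Per frame: transform each bone's capsule endpoints and grow the box by the radius.
Aabb ComputeSkinnedAabb (const Capsule *capsules, const Matrix4 *boneTransforms, int numBones)
{
    Aabb box = Aabb::Empty ();
    for (int i = 0; i < numBones; ++i)
    {
        sVec3 a = boneTransforms[i].TransformPoint (capsules[i].a);
        sVec3 b = boneTransforms[i].TransformPoint (capsules[i].b);
        box.Extend (a, capsules[i].radius);     // grow mins / maxs by the point +- radius
        box.Extend (b, capsules[i].radius);
    }
    return box;
}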




#5303147 How Do You Deal With Errors On Gpus? Do You At All?

Posted by JoeJ on 30 July 2016 - 02:24 AM

Anyone with time and a Fiji GPU is welcome to try the test case I've sent to AMD: https://github.com/JoeJGit/OpenCL_Fiji_Bug_Report

It includes the project and, if you dare, binaries (zipped only).

The bug is reproducible and happens only with the 32-bit version (the log output should show increasing numbers, but I get chaos).




#5301228 External Level Editor?

Posted by JoeJ on 18 July 2016 - 10:58 AM

https://wiki.blender.org/index.php/Extensions:2.6/Py/Scripts/UV/TextureAtlas

 

Looks good. (I haven't tried it myself yet.)




#5297248 Is there any reason to prefer procedural programming over OOP

Posted by JoeJ on 19 June 2016 - 02:33 PM

I've been steadily moving away from OOP over the years. The idea of inheritance never made much sense to me - it just complicates things and forces you to make decisions about software design. To me that's just blah blah, and I prefer to spend that time solving real problems.

 

So I ended up using a 'C with classes' style, but I moved away from that too, mainly because of this:

 

Class member functions hide some of the data they use because you don't know what member variables they access without looking at the implementation.

This makes it hard to see data complexity, which is important for optimizing / refactoring.

Often I ended up making member functions static, forcing me to pass all data through the function parameters - just to see how many there are (ALWAYS more than you would expect).

 

Next I realized that static member functions can be used from anywhere - how practical.

So why was I still using classes?

My answer was simply: to group related functions together by 'topic', so I can find them somehow.

 

But there is something better for this: namespaces.

With namespaces it's possible to group stuff in hierarchies without any of the restrictions or problems known from inheritance.

 

Today I create classes very rarely, using them only as an interface to a large system which is implemented mostly procedurally under the hood.

But I still use a lot of small structs with member functions for trivial functionality like indexing arrays or un/packing.
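
As a small illustration of that style (all names made up): grouping happens by topic via namespaces, and every input and output is visible in the signature.

namespace Geometry
{
    namespace MeshTools
    {
        void  ComputeNormals (sVec3 *normals, const sVec3 *positions, const int *indices, int numTris);
        float ComputeArea    (const sVec3 *positions, const int *indices, int numTris);
    }
}

// The call site reads like a class, but without hidden member state:
// Geometry::MeshTools::ComputeNormals (normals, positions, indices, numTris);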




#5297100 Is C++11's library a magic bullet for ALL multicore/multithread programming?

Posted by JoeJ on 17 June 2016 - 11:37 PM

Let's say we have 3 tasks to do:

Software occlusion culling (front to back dependency -> serial algorithm -> not good to parallelize)

Animating 100 characters

Physics simulation (100 islands of rigid bodies in contact)

 

The easy route would be to use one thread per task - maybe suboptimal, but good enough if your speedup is close to the number of cores.

The hard and error-prone route would be trying to parallelize the occlusion culling - ending up with a small speedup for a lot of work and debugging time.

 

The practical route would be: one thread for occlusion culling, while the others are free to run a job system processing all characters and, after that, all physics islands.

If a single character is very fast, we would choose to process e.g. 4 characters per job to hide the synchronization costs.

 

std::async and other high-level functionality can be used to achieve this; my approach using atomics is more the low-level kind.

Looking at http://en.cppreference.com/w/cpp/thread/async I tend to think: there is no control over the creation of threads (which is expensive), and there is no guarantee multithreading is used at all. So I'll probably never use it, but maybe it's just a matter of personal preference.
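
For completeness, a sketch of the high-level route with std::async, batching e.g. 4 characters per task (Character and AnimateCharacter are placeholders). Note that with std::launch::async this typically still spawns a thread per call, which is exactly the overhead mentioned above:

#include <algorithm>
#include <future>
#include <vector>

// Animate all characters in batches using std::async (sketch).
void AnimateAllCharacters (std::vector<Character> &characters)
{
    const size_t batchSize = 4;                    // batch work to hide per-task overhead
    std::vector<std::future<void>> jobs;

    for (size_t start = 0; start < characters.size(); start += batchSize)
    {
        size_t end = std::min (start + batchSize, characters.size());
        jobs.push_back (std::async (std::launch::async, [&characters, start, end] ()
        {
            for (size_t i = start; i < end; ++i)
                AnimateCharacter (characters[i]);
        }));
    }

    for (auto &job : jobs)
        job.get ();                                // wait for all batches to finish
}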




#5297002 Is C++11's library a magic bullet for ALL multicore/multithread programming?

Posted by JoeJ on 17 June 2016 - 01:18 PM

I had a similar thread a while back: http://www.gamedev.net/topic/679252-multithreading-c11-vs-openmp/

 

The little job system I made there may have some advantages over your current approach:

You divide the work into four 'groups' - but what if one group finishes early and another takes very long? The job system balances itself here (if the jobs are small enough).

Also, if those groups do very different things, they need to access different memory. With the job system it's easy to parallelize a single workload to utilize the cache better.




#5296100 Multithreading - C++11 vs. OpenMP

Posted by JoeJ on 11 June 2016 - 11:43 AM

It's not a binary tree I'm using; I just used it as a worst-case example.

 

In case someone is interested, here's the correct code I'm using now. After two days of using it, I'm sure this time :)

It serves well enough as a minimal job system, and I was able to speed up my application by the number of cores, even for stuff where I would have expected bandwidth limits.

 

A usage example:

int num_threads = std::thread::hardware_concurrency();
num_threads = min (64, max (4, num_threads));
std::thread threads[64];
 
ThreadsForLevelLists workList;
workList.AddLevel (0, 4); // first execute nodes 0-3
workList.AddLevel (4, 6); // then nodes 4-9, ensuring the previous level is done
//...
 
workList.SetJobFunction (ComputeBoundingBox, this);
for (int tID=1; tID<num_threads; tID++) threads[tID] = std::thread (ThreadsForLevelLists::sProcessDependent, &workList);
workList.ProcessDependent(); // the main thread works too instead of just waiting
for (int tID=1; tID<num_threads; tID++) threads[tID].join();

 

Instead of dividing each level's work by the number of cores, it uses work stealing of small batches.

The advantage is that this compensates for different runtimes of the job function.

 

 

 

#include <algorithm> // std::min
#include <atomic>
#include <cassert>
#include <thread>    // std::this_thread::yield

struct ThreadsForLevelLists
{
    // Executes all levels in order:
    // for (int level=0; level<numLevels; level++)
    // {
    //     for (int i=0; i<levelSize; i++) jobFunction (levelStartIndex + i);
    //     barrier in case of ProcessDependent() to ensure the previous level has been completed
    // }

    enum
    {
        MAX_LEVELS = 32,
    };

    int firstIteration[MAX_LEVELS];
    int firstIndex[MAX_LEVELS+1]; // prefix sums of level sizes

    int numLevels;
    void (*jobFunction)(const int, void*);
    void* data;

    std::atomic<int> workIndex;
    std::atomic<int> workDone;
    int iterations;

    ThreadsForLevelLists ()
    {
        numLevels = 0;
        firstIndex[0] = 0;
        workIndex = 0;
        workDone = 0;
    }

    void Reset ()
    {
        workIndex = 0;
        workDone = 0;
    }

    void SetJobFunction (void (*jobFunction)(const int, void*), void *data, int iterations = 64)
    {
        Reset ();
        this->jobFunction = jobFunction;
        this->data = data;
        this->iterations = iterations;
    }

    void AddLevel (const int levelStartIndex, const int size)
    {
        assert (numLevels < MAX_LEVELS);
        firstIteration[numLevels] = levelStartIndex;
        firstIndex[numLevels+1] = firstIndex[numLevels] + size;
        numLevels++;
    }

    void ProcessDependent ()
    {
        const int wEnd = firstIndex[numLevels];
        int level = 0;
        int levelReady = 0;

        for (;;)
        {
            int wI = workIndex.fetch_add (iterations);        
            if (wI >= wEnd) break;

            int wTarget = std::min (wI + iterations, wEnd);
            while (wI != wTarget)
            {
                while (wI >= firstIndex[level+1]) level++;

                int wMax = std::min (wTarget, firstIndex[level+1]);
                int numProcessed = wMax - wI;

                for (;;)
                {
                    int dI = workDone.load();        
                    while (dI >= firstIndex[levelReady+1]) levelReady++;
                    if (levelReady >= level) break;
                    std::this_thread::yield();

                    // todo: optionally store a pointer to another ThreadsForLevelLists and process it instead of yielding
                }

                int indexOffset = firstIteration[level] - firstIndex[level];
                for (; wI < wMax; wI++)
                    jobFunction (indexOffset + wI, data);

                workDone.fetch_add (numProcessed);
            }
        }
    }

    void ProcessIndependent ()
    {
        const int wEnd = firstIndex[numLevels];
        int level = 0;

        for (;;)
        {
            int wI = workIndex.fetch_add (iterations);        
            if (wI >= wEnd) break;

            int wTarget = std::min (wI + iterations, wEnd);
            while (wI != wTarget)
            {
                while (wI >= firstIndex[level+1]) level++;

                int wMax = std::min (wTarget, firstIndex[level+1]);
                
                int indexOffset = firstIteration[level] - firstIndex[level];
                for (; wI < wMax; wI++)
                    jobFunction (indexOffset + wI, data);
            }
        }
    }

    static void sProcessDependent (ThreadsForLevelLists *ptr) // todo: move Process() to cpp file to avoid the need for a static function
    {
        ptr->ProcessDependent();
    }
    static void sProcessIndependent (ThreadsForLevelLists *ptr) // todo: move Process() to cpp file to avoid the need for a static function
    {
        ptr->ProcessIndependent();
    }
};



#5295752 Multithreading - C++11 vs. OpenMP

Posted by JoeJ on 09 June 2016 - 04:09 AM

Thanks!

I'll stick with C++11 - enough functionality to learn; maybe I'll look into OS-level APIs (much) later :)

 

I found the reason for the performance differences.

In cases where my C++11 approach was slow, it's because threads were created and joined for each iteration of an outer loop.

Even though the outer loop has only 11 iterations and the single-threaded runtime is 27 ms, this caused the slowdown.

 

I've fixed this by removing the outer loop, but it's still necessary to finish all work in order.

A simple problem, but it already requires ugly and hard-to-read code like the snippet below - maybe I can improve it.

 

However - with this code I can't measure a performance difference between C++11 and OpenMP anymore :)

void ProcessMT (std::atomic<int> *workIndex, std::atomic<int> *workDone, const int iterations)
    {
        const int endIndex = levelLists.size;
        for (;;)
        {
            int workStart = workIndex->fetch_add(iterations);
            int workEnd = min (workStart + iterations, endIndex);

            int level = levelLists.GetLevel(workStart);
            int levelDownAtIndex = (level <= 0 ? INT_MAX : levelLists.GetFirstIndex (level-1) );
 
            while (levelLists.GetLevel(workDone->load()) > level+1) std::this_thread::yield(); // don't skip over a whole level

            int i=workStart;
            while (i<workEnd)
            {
                int safeEnd = min(levelDownAtIndex, workEnd);
                int progress = safeEnd - i;

                for (; i<safeEnd; i++)
                    DoTheWork (i);
 
                workDone->fetch_add(progress);

                if (i==levelDownAtIndex)
                {
                    while (workDone->load() < workStart) std::this_thread::yield(); // wait until current level has been processed by all other threads
                        
                    level--;
                    levelDownAtIndex = (level <= 0 ? INT_MAX : levelLists.GetFirstIndex (level-1) );
                }
            }

            if (workEnd == endIndex) break;
        }
    }
 



#5295677 Multithreading - C++11 vs. OpenMP

Posted by JoeJ on 08 June 2016 - 02:31 PM

I'm trying to find some 'best performance practices' for multithreading before looking at implementing more advanced job systems.

But I don't understand the results - sometimes OpenMP is a lot faster, sometimes C++11 is.

In my tests it's always possible to get a 4x speedup on my i7, so it's surely not a hardware limit.

 

Example one: the C++11 speedup is slightly above 4, the OpenMP speedup only 2:

 

C++11:

void MyClass::ProcessMT (const int tID, std::atomic<int> *work, const int iterations)
    {
        for (;;) // each virtual core steals a batch of given work iterations and processes them until nothing is left.
        {
            const int workStart = work->fetch_add(iterations);
            const int workEnd = min (workStart + iterations, myTotalWorkAmount);

            for (int i=workStart; i<workEnd; i++)
                DoTheWork (i);

            if (workEnd == myTotalWorkAmount) break;
        }
    }
 
void MyClass::Update()
{
unsigned int num_threads = std::thread::hardware_concurrency(); // returning 8 on i7
num_threads = min (64, max (4, num_threads));
std::thread threads[64];
 
std::atomic<int> work (0); // direct-init: std::atomic is not copyable
for (int tID=1; tID<num_threads; tID++) threads[tID] = std::thread (&MyClass::ProcessMT, this, tID, &work, 64);
ProcessMT (0, &work, 64);
for (int tID=1; tID<num_threads; tID++) threads[tID].join();
}

 

OpenMP for the same thing is simply:

#pragma omp parallel for
for (int i=0; i<myTotalWorkAmount; i++)
    DoTheWork (i);
 

 

I have used this method on various different tasks.

There are also cases where OpenMP is 4 times faster and my C++11 method only 2 times - exactly the opposite behaviour.

 

I tried to tune the iteration count and ensured not to create any threads that would have no work - only minor improvements.

The tasks themselves are nothing special - no dynamic allocations or anything like that.

 

My guess is that the only solid method is to create persistent worker threads and keep them alive

instead of creating new threads for each new kind of task.

But this alone cannot explain the behaviour I see.
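
To illustrate what I mean, here is a minimal, untested sketch of such persistent workers - created once, then woken per batch of work. It is only meant to show the idea, not a production implementation:

#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Persistent worker pool sketch: threads are created once and reused for every Run() call.
class PersistentWorkers
{
public:
    explicit PersistentWorkers (unsigned count)
    {
        for (unsigned i = 0; i < count; ++i)
            workers.emplace_back ([this] { WorkerLoop (); });
    }

    ~PersistentWorkers ()
    {
        { std::lock_guard<std::mutex> lock (m); quit = true; }
        wake.notify_all ();
        for (auto &t : workers) t.join ();
    }

    // Run job(i) for all i in [0, iterations) on the pool and block until everything is done.
    void Run (const std::function<void(int)> &jobFn, int iterations)
    {
        {
            std::lock_guard<std::mutex> lock (m);
            job = jobFn;
            workIndex.store (0);
            workEnd = iterations;
            busy = (int)workers.size ();
            ++generation;                                  // wakes every worker exactly once
        }
        wake.notify_all ();
        std::unique_lock<std::mutex> lock (m);
        idle.wait (lock, [this] { return busy == 0; });    // the main thread could also help instead of just waiting
    }

private:
    void WorkerLoop ()
    {
        unsigned long long seen = 0;
        for (;;)
        {
            {
                std::unique_lock<std::mutex> lock (m);
                wake.wait (lock, [&] { return quit || generation != seen; });
                if (quit) return;
                seen = generation;
            }
            for (;;)                                       // steal one index at a time (could also batch)
            {
                int i = workIndex.fetch_add (1);
                if (i >= workEnd) break;
                job (i);
            }
            std::lock_guard<std::mutex> lock (m);
            if (--busy == 0) idle.notify_one ();           // last worker signals Run()
        }
    }

    std::vector<std::thread> workers;
    std::mutex m;
    std::condition_variable wake, idle;
    std::function<void(int)> job;
    std::atomic<int> workIndex { 0 };
    int workEnd = 0;
    int busy = 0;
    unsigned long long generation = 0;
    bool quit = false;
};

// Usage: create the pool once, then per task e.g.
// pool.Run ([&](int i) { DoTheWork (i); }, myTotalWorkAmount);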

 

Maybe C++11 (VC2013, Win10) is just not ready yet, similar to how OpenMP is not really usable for games?

I'd really like to use C++11 instead of looking at libraries or the OS, but not as long as I'm not on par with OpenMP everywhere <_<

 

Maybe you can share some experience, have some thoughts or suggest a better method...

 

 

 





