[Flocking] How to improve performance?

Started by
17 comments, last by noisecrime 9 years, 11 months ago

Also, as far as I'm aware, dynamic memory allocation in C# is fast (I read it's just a pointer increment; not sure how it works).

Dynamic memory allocation is fast in .NET 4+ and in newer versions of Mono. Unfortunately Unity still uses Mono 2.6, which has a pretty bad garbage collector, so you have to be careful with memory allocation, and especially with deallocation, if you need good performance. I recommend you read the following article series: http://www.gamasutra.com/blogs/WendelinReich/20131109/203841/C_Memory_Management_for_Unity_Developers_part_1_of_3.php

But I still think your best solution would be to slice up your algorithm using coroutines. Heavy calculations are best spread over multiple frames if you don't need precise real-time adjustments. You don't need to recalculate the position of every agent 30+ times per second.

If you don't know about Unity's coroutines, I'll gladly provide a few articles about them.
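To make the idea concrete, here is a minimal sketch of that kind of coroutine-based slicing. The `FlockUpdater` class, the `boids` list and `UpdateBoid` are placeholders invented for this example, not names from the original code:

```csharp
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class FlockUpdater : MonoBehaviour
{
    // Hypothetical boid list and batch size; not from the original code.
    public List<Transform> boids = new List<Transform>();
    public int boidsPerFrame = 100;

    void Start()
    {
        StartCoroutine(UpdateFlock());
    }

    IEnumerator UpdateFlock()
    {
        while (true)
        {
            // Spread the expensive steering update over several frames:
            // yield after every batch of boidsPerFrame boids.
            for (int i = 0; i < boids.Count; i++)
            {
                UpdateBoid(boids[i]);
                if ((i + 1) % boidsPerFrame == 0)
                    yield return null; // resume on the next frame
            }
            yield return null; // at least one frame between full passes
        }
    }

    void UpdateBoid(Transform boid)
    {
        // placeholder for the actual per-boid steering calculation
    }
}
```

Each `yield return null` hands control back to Unity until the next frame, so a full pass over 500 boids at 100 per frame takes 5 frames instead of blocking one.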


When using the following steering behaviors I had the results below - using 500 boids.
Time to build K-D tree: 2.34ms. So this is not the problem.

Time to update boid steerings (all 500 boids, average time of N frames)
Behaviors used: Cohesion, Separation, VelocityMatch, ObstacleAvoidance => 46.77ms
Behaviors used: Cohesion, Separation, VelocityMatch, ObstacleAvoidance => 39.55ms
Behaviors used: Cohesion, Separation, VelocityMatch, ObstacleAvoidance => 24.44ms

2.34ms is 14% of a 60Hz frame -- quite a large percentage of time! If every task in your game took this long, you could only have 7 different tasks making up your entire game logic (and still run at 60fps).

A good way to get some perspective when profiling these numbers is to pick a framerate that you'd ideally like your game to run at (e.g. 60fps or 30fps) and then convert the frame times (ms) into percentages (e.g. a 60Hz frame gives you a budget of 16.667ms per frame):


Task                              Time      60Hz%  30Hz%
K-D Tree                       => 2.34ms  / 14%  / 7%
VelocityMatch                  => 7.22ms  / 43%  / 22%
Cohesion                       => 22.33ms / 134% / 67%
Separation + ObstacleAvoidance => 17.22ms / 103% / 52%
                                  49.11ms / 294% / 148% < totals

I wouldn't be surprised if it were possible to reimplement your entire boids system to process those 500 objects in under 2ms total.


Hodge is quite correct, that's a pretty horrible update per frame to be hitting. I also don't see a performance time for:


int neighbors = (maxNeighbors == 0 ? 0 : alglib.kdtreequeryaknn(kdt, point, maxNeighbors, aknnApproxVal));
alglib.kdtreequeryresultsxy(kdt, ref results);

Tree searches are expensive, even with kd-trees. The first step of the search walks down the tree looking for the closest node, then it has to search back up and out for the neighbors. If nothing else, this burns huge amounts of memory bandwidth and is most likely the bulk of your per-boid update cost. Doing this within every boid update is almost guaranteed to be O(log n) or worse per query.

My suggestion for this would be to move the creation of the list of neighbors out to a system designed to linearize the cost of maintaining such information. You can generally google for "Spatial Awareness" with things like "c++ library" and get some starting points. I wrote a library which I tested with flocking a while back, and even did a video capture of it.

It's ugly, but that's 2000 boids flocking, and the most costly thing is the rendering. Without rendering, I pushed that system to 10000 boids and it still ran at a steady 60fps while barely touching the CPU. It's designed for AI usage, which brings a somewhat different set of requirements, but the basic "aware of" list is exactly what flocking requires, so the overall system was well tested by flocking.

Anyway, you are highly unlikely to be able to optimize a per-object solution such as this beyond a certain point. If you follow all the current suggestions, you might drop 5-10ms off the frame times, but you'll still have all the same problems, just pushed out a bit further. Try some Google searches for "AI Spatial Awareness" together with "c++ library" and other qualifiers and you might find some ideas. You'll be looking for systems that supply "aware of" lists. They can be implemented in many ways, though I favor sweep and prune because it has the fewest edge cases and the most consistent per-frame processing time no matter what is going on.

Alright, I moved the array allocations into the Flock class's constructor, and in the rebuildKDTree function I no longer create the array every frame.

The time dropped from ~46 ms to ~36 ms. That was a pretty big win. By the way, I had forgotten that Unity uses Mono and not .NET.

About coroutines: it seems I don't have access to StartCoroutine because the class that handles the updates does not derive from MonoBehaviour. However, I did it myself: I manually update only n boids every frame (basically, I set a flag in each behavior that indicates whether it will reuse its old value or calculate a new one).

This is my update function:


public override void Update(float dt)
{
    if (Time.frameCount == currentFrame) return;
    currentFrame = Time.frameCount;

    //------------
    kdBuildTimer.Start();

    flock.RebuildKdTree();
    anchorFlock.RebuildKdTree();

    kdBuildTimer.Stop();
    //------------

    updateTimer.Start();

    // First mark all boids to reuse their cached steering values...
    for (int i = 0; i < entries.Count; ++i)
    {
        var blended = (entries[i].behavior as PrioritySteering).Groups[1] as BlendedSteering;
        (blended.Behaviors[0].behaviour as Separation).useOldValues = true;
        (blended.Behaviors[1].behaviour as Cohesion).useOldValues = true;
        (blended.Behaviors[2].behaviour as VelocityMatch).useOldValues = true;
    }

    // ...then let only the next slice of 100 boids recalculate this frame.
    int limit = (current + 100 < entries.Count ? current + 100 : entries.Count);
    for (; current < limit; ++current)
    {
        var blended = (entries[current].behavior as PrioritySteering).Groups[1] as BlendedSteering;
        (blended.Behaviors[0].behaviour as Separation).useOldValues = false;
        (blended.Behaviors[1].behaviour as Cohesion).useOldValues = false;
        (blended.Behaviors[2].behaviour as VelocityMatch).useOldValues = false;
    }

    if (current == entries.Count)
        current = 0;

    // update birds and calculate the center of the flock
    for (int i = 0; i < entries.Count; ++i)
        UpdateSteering(dt, entries[i]);

    updateTimer.Stop();
    ++frameCount;
    Debug.Log("Time to build K-D tree: " + kdBuildTimer.ElapsedMilliseconds / (float)frameCount + ", Time to update: " + updateTimer.ElapsedMilliseconds / (float)frameCount);
}


The k-d tree is still built every frame, however. I guess I can do something about that too: rebuild it every x seconds (where x < 1.0, of course).

So now, with 500 boids and updating 100 every frame, I get 25 fps (update time about 16 ms) instead of 15 fps (and 34 ms to update).

But the motion of the boids is not the same when I update a different number of boids per frame. I guess that's expected, though. By the way, isn't this "cheating", not updating all of them every frame? Or is it a standard thing to do in AI?

Another thing: if I use more boids (5000, for example), I still get very long update times even though I still update only 100 boids per frame. This is because the boids that do get updated still run queries on a huge k-d tree... Can I do something about this? What I was thinking of was something that "remembers" where the closest neighbors were, so that next time they won't be far away. But how can I know this without recalculating everything?

It's pretty standard to not run expensive stuff every frame, and AI is often expensive. =)


About coroutines: it seems I don't have access to StartCoroutine because the class that handles the updates does not derive from MonoBehaviour. However, I did it myself: I manually update only n boids every frame (basically, I set a flag in each behavior that indicates whether it will reuse its old value or calculate a new one).

Well, you can start a coroutine from another class that is a MonoBehaviour. The method just needs to return IEnumerator, that's all. There are some tricks to start coroutines from non-MonoBehaviour classes as well, but that is pretty deep stuff and I don't think it would help at all in your situation. If you have found a way to spread your calculation over multiple frames - good.
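A sketch of that idea: a plain class hands its IEnumerator to some MonoBehaviour in the scene that acts as a runner. The `CoroutineRunner` and `FlockManager` names are made up for this example:

```csharp
using System.Collections;
using UnityEngine;

// An empty MonoBehaviour somewhere in the scene that runs
// coroutines on behalf of plain (non-MonoBehaviour) classes.
public class CoroutineRunner : MonoBehaviour { }

public class FlockManager // plain class, no MonoBehaviour
{
    readonly CoroutineRunner runner;

    public FlockManager(CoroutineRunner runner) { this.runner = runner; }

    public void BeginSlicedUpdate()
    {
        // StartCoroutine only needs an IEnumerator; it can come from any class.
        runner.StartCoroutine(SlicedUpdate());
    }

    IEnumerator SlicedUpdate()
    {
        // ... do part of the heavy work here ...
        yield return null; // continue on the next frame
    }
}
```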


Another thing: If I set more boids (5000 for example ) I still get very long update times even when I still update 100 boids per frame. This is because those that get updaded still do queries on a huge k-d tree... Can I do something about this? What I was thinking was having something that "remembers" where the closest neighbors were so that next time they won't be far away. But how can I know this without re-calculating everything?

And for that, you can use a heuristic that takes the last known position and velocity vector of an agent and tries to "guess" where it will be after a few frames. Of course, the further ahead you "guess", the more error you accumulate, so you need to balance your update intervals well so that you still get good flocking behaviour.
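A sketch of that kind of dead-reckoning guess, assuming you store each agent's last sampled position and velocity (the `Vec3` struct and `DeadReckoning` class here are illustrative, not the original code):

```csharp
using System;

struct Vec3
{
    public double x, y, z;
    public Vec3(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
}

static class DeadReckoning
{
    // Extrapolate where an agent probably is, 'elapsed' seconds after its
    // position/velocity were last sampled. Error grows with 'elapsed', so
    // cached neighbor sets built from this need to be refreshed regularly.
    public static Vec3 Predict(Vec3 lastPos, Vec3 lastVel, double elapsed)
    {
        return new Vec3(
            lastPos.x + lastVel.x * elapsed,
            lastPos.y + lastVel.y * elapsed,
            lastPos.z + lastVel.z * elapsed);
    }
}
```

For example, an agent last seen at the origin moving at 1 unit/s along x is predicted at x = 0.5 half a second later.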

Performance-wise, two bottlenecks come to mind:

1. updating the structure which holds the boids (kd-tree)

2. scanning for boids for the steering behaviour.

1. Using a kd-tree (with dynamic allocation) could be an issue. I would recommend sweep and prune too, which is basically a fixed array (one per axis) that can be kept sorted (insertion sort or timsort) in almost O(n), because the frame-to-frame coherence of the moving boids is really high.

2. With sweep and prune, you can scan the surroundings in O(log n) using a modified binary search, which sums up to O(n log n) overall. To improve this further, you can cache the scan result for a few frames: query with an increased radius to get a starting set of neighbor boids, then hold that set and update it each frame. The maximum movement of a boid gives you a threshold for when the cache becomes invalid (at that point, the probability that an unknown boid has entered the steering context is quite high). If you don't need 100% perfect context information every frame, a cache can really save a lot of time.

With these improvements you can get your time complexity down from O(n*n) to almost O(n).
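A rough sketch of the sweep half of that idea on a single axis: keep the boid indices sorted by x with an insertion sort (almost O(n) on nearly-sorted data), then walk outward from each boid in the sorted order to collect candidate neighbors. This is a simplified one-axis version, not a full three-axis sweep-and-prune, and all names are made up for the example:

```csharp
using System;
using System.Collections.Generic;

static class SweepAxis
{
    // Re-sort boid ids by x. Insertion sort is almost O(n) here because
    // boids barely move between frames, so the array stays nearly sorted.
    public static void Resort(List<int> order, float[] xs)
    {
        for (int i = 1; i < order.Count; i++)
        {
            int id = order[i];
            int j = i - 1;
            while (j >= 0 && xs[order[j]] > xs[id])
            {
                order[j + 1] = order[j];
                j--;
            }
            order[j + 1] = id;
        }
    }

    // Candidate neighbors of the boid at position selfPos in the sorted order:
    // walk outward until the x-distance alone exceeds the radius. A full
    // 3D distance test would then refine these candidates.
    public static List<int> Candidates(List<int> order, float[] xs, int selfPos, float radius)
    {
        var result = new List<int>();
        for (int j = selfPos - 1; j >= 0 && xs[order[selfPos]] - xs[order[j]] <= radius; j--)
            result.Add(order[j]);
        for (int j = selfPos + 1; j < order.Count && xs[order[j]] - xs[order[selfPos]] <= radius; j++)
            result.Add(order[j]);
        return result;
    }
}
```

Because each boid only inspects a narrow sorted window instead of the whole population, the per-frame cost stays close to linear.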

Might it be possible to create "flock" lists (lists of unit members within a defined flock entity - I'd do it with a linked-list pointer chain in the unit object itself), where for a short time the current grouping subset is used for the interacting flocking/spacing comparisons, and then, at a sufficiently frequent interval, the spatial tree is rebuilt/adjusted and the flock lists are rebuilt to be used again for some period?

Cluster Flocking ...

The main idea is to eliminate a lot of repeated heavyweight processing overhead (particularly at high frame rates, where all the units make only small relative movement increments per frame). The spatial tree is modified less frequently than it is used... The flock subsets might only have to be checked brute force within their own members; 1/2 N^2 comparisons when N is fairly small might give a "good enough" approximation.

Possibly some "smart" inter-flock calculations might be needed to detect flocks in the same vicinity - possibly forcing a variable interval for a subset recalculation using the full spatial processing.

You'd observe and test to tune the right intervals, and to keep the subset groups small enough for this to have any advantage over the simpler constant approach.


Whilst everyone is correct that addressing memory allocations and re-engineering your overall algorithms will give the best improvements, I thought it odd that no-one had pointed out that you are using a standard distance check instead of a squared distance check. I.e. instead of using Vector3.Distance(), use the squared magnitude ((a - b).sqrMagnitude - note it's a property, not a method) and provide your MaxDistance as a squared value. These days I've no idea of the relative performance benefit of doing this - probably not as much as it used to be - but still worth doing IMHO.

I'd also maybe look into where you convert the KD-tree results into Unity types (Vector3). For example, you convert vel, but it's only used if several conditional checks pass; would it not be more prudent to do the conversion only when you actually need to add it to avgVeL?

Indeed, I would question why you convert to Unity types at all within the function. Leave the vectors as doubles (unless there is a tangible benefit to casting to and working with floats) and only convert avgCenter and avgVeL to Unity Vector3 at the end of the function, or in the calling function. It would mean writing your own sqrDistance and dot-product methods, but they are easy enough; just don't forget to convert your character.pos from Vector3 to a double array in the calling function. In other words, it's probably far better to keep all your data in one format/type and convert at the end, instead of converting on the fly, especially when some of the data may not even be used.
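Those helpers really are short; something along these lines, keeping everything as double[3] the way the kd-tree returns it (hypothetical names, sketch only):

```csharp
static class VecMath
{
    // Squared distance between two points stored as double[3].
    // Compare against a squared threshold to avoid the square root entirely.
    public static double SqrDistance(double[] a, double[] b)
    {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        return dx * dx + dy * dy + dz * dz;
    }

    // Dot product of two double[3] vectors.
    public static double Dot(double[] a, double[] b)
    {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }
}
```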

However, as I said at the beginning, you'll probably find better gains by changing your approach or algorithms, but the few optimisations above are pretty straightforward and simple to implement.

For repeating an action infrequently, like updating your KD-tree, you could use InvokeRepeating() in Unity.
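For example (a sketch; `RebuildKdTree` here stands in for whatever method actually rebuilds the tree, and the 0.25 s interval is just an illustrative choice):

```csharp
using UnityEngine;

public class TreeRebuilder : MonoBehaviour
{
    void Start()
    {
        // Call RebuildKdTree every 0.25 s, starting immediately,
        // instead of rebuilding the tree every frame.
        InvokeRepeating("RebuildKdTree", 0f, 0.25f);
    }

    void RebuildKdTree()
    {
        // e.g. flock.RebuildKdTree(); anchorFlock.RebuildKdTree();
    }
}
```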

It's not clear to me, but are you calling GetNeighborhoodInfo() for each boid for each of Cohesion, Separation and VelocityMatch? If so, shouldn't you simply share the results between all three behaviours and calculate it once per AI tick, using the highest K value you need?
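That sharing could be as simple as computing the neighborhood once per tick and handing the same result to every behaviour. The types below (`NeighborhoodInfo`, `ISteeringBehaviour`, `SteeringPipeline`) are made up for this sketch; the query delegate stands in for the single kd-tree call:

```csharp
using System.Collections.Generic;

// Hypothetical shared result, computed once per boid per AI tick.
class NeighborhoodInfo
{
    public List<int> NeighborIds = new List<int>();
}

interface ISteeringBehaviour
{
    void Apply(NeighborhoodInfo info);
}

class SteeringPipeline
{
    readonly List<ISteeringBehaviour> behaviours;
    readonly System.Func<int, int, NeighborhoodInfo> query; // wraps the kd-tree lookup

    public SteeringPipeline(List<ISteeringBehaviour> behaviours,
                            System.Func<int, int, NeighborhoodInfo> query)
    {
        this.behaviours = behaviours;
        this.query = query;
    }

    // Query once with the largest K any behaviour needs, then reuse the
    // same result for Cohesion, Separation and VelocityMatch.
    public void Tick(int boidId, int maxK)
    {
        NeighborhoodInfo info = query(boidId, maxK);
        foreach (var b in behaviours)
            b.Apply(info);
    }
}
```

With three behaviours this cuts the kd-tree queries per boid per tick from three to one.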

Finally, a few points about Unity. If you have a spare email address, you should be able to sign up for a 30-day demo of Pro, which gets you access to the profiler. Alternatively, you could email Unity Technologies and ask for a Pro demo key, explaining that you are using it for your thesis and supplying evidence; they are quite a nice company and might be willing to help you out - at least it doesn't hurt to try.

This topic is closed to new replies.
