# Parallelizing animation blending


## Recommended Posts

[Posting this here, because I believe the topic isn't strictly related to animation blending.]

Recently I have been playing with skeletal animation, and now I want to parallelize the step that creates the final matrices sent to the graphics card. That is, I want to blend several animations at the same time. At first I thought it would be pretty straightforward, but my approach isn't scaling very well.

I have many "actors", each with their own set of bone matrices. I have one animation, with different keyframes, which is shared by all actors (read-only). When I process two actors concurrently I notice a certain speed boost, but the processing of a single actor already takes longer than before, suggesting that they get in each other's way. Three actors concurrently yields practically no further benefit, as every single one now takes a lot of time.

This is a very vague and unspecific description of my problem, but it would be great if you could brainstorm with me. Is the scenario less suited to parallelism than I thought? Could it be a problem that all threads read from the same source? Looking forward to any hint!

P.S.: Regarding the "read from the same source" problem: I tried to give each actor its own individual copy of the animation (instead of all pointing to the same one) and that made performance noticeably worse. Maybe a cache-related thing.

P.P.S.: std::map::find() is called extensively from all threads. That should be ok, right?

##### Share on other sites
Your issue could be more about where you write the output to. If lots of threads are writing output to the same cache line (cache lines are usually 64 or 128 bytes) it can cause cache thrashing as the memory gets synced up.

Reading from the same cache line on multiple threads isn't a problem.

Look at the 6th example here for details.
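To illustrate the fix for that kind of false sharing, here is a minimal sketch (all names are hypothetical, and a 64-byte cache line is assumed; compile as C++17 for over-aligned allocation) that gives each thread its own cache-line-sized output slot, so concurrent writes never land on the same line:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Pad each thread's output slot to a full cache line (64 bytes assumed)
// so that writes from different threads never share a line.
struct alignas(64) PaddedCounter {
    long value = 0;
};

long sum_with_padding(std::size_t num_threads, long per_thread_work)
{
    std::vector<PaddedCounter> partial(num_threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, t, per_thread_work] {
            for (long i = 0; i < per_thread_work; ++i)
                partial[t].value += 1; // each thread writes only its own line
        });
    }
    for (auto &w : workers)
        w.join();

    long total = 0;
    for (const auto &p : partial)
        total += p.value;
    return total;
}
```

Without the `alignas(64)`, adjacent counters would sit on the same cache line and the threads would invalidate each other's caches on every write.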

##### Share on other sites
Thanks for the answer and the link.

In my current setup a thread takes an actor and starts animating it. Then it takes another one, and so on. There is a critical section around the "taking".

I then modified it so that a thread takes a bunch of actors and starts animating them. The idea was to give each thread actors that are perhaps closer together in memory. It didn't change much at all.

If I indeed have a problem with writing to the same cache line, how could I solve it?

##### Share on other sites
Quote:
 Original post by B_old
 If I indeed have a problem with writing to the same cache line, how could I solve it?

Have each thread write to data that is sufficiently far apart. This might involve having each thread write to its own buffer and then combining them afterwards (which might also be doable in parallel).

Quote:
 P.P.S.: std::map::find() is called extensively from all threads. That should be ok, right?

There are no guarantees, but it's probably ok as long as you don't have other threads modifying the map (which of course includes the use of operator[]).
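For example, concurrent find() calls on a shared map are fine as long as no thread mutates it while the readers are running. A small sketch (names are hypothetical, not from the original code):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Concurrent read-only lookups in a shared std::map: safe as long as
// no thread inserts, erases, or uses operator[] while readers run.
int count_hits(const std::map<std::string, int> &table,
               const std::vector<std::string> &keys, unsigned num_threads)
{
    std::vector<int> hits(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread checks a strided subset of the keys.
            for (std::size_t i = t; i < keys.size(); i += num_threads)
                if (table.find(keys[i]) != table.end()) // read-only access
                    ++hits[t];
        });
    }
    for (auto &w : workers)
        w.join();

    int total = 0;
    for (int h : hits)
        total += h;
    return total;
}
```

The key point is the `const` map: every thread only ever calls `find()` on it, so no synchronization is needed for the map itself.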

Quote:
 In my current setup a thread takes an actor and starts animation. Then it takes another one and so on. There is a critical section around the "taking".

Can you spell this out a bit more explicitly? Do you have a multi-consumer queue of actors? Or does the taking happen on the level of tasks? i.e.

parallel_for_each(threadPool, actors.begin(), actors.end(), do_some_animation_stuff());

##### Share on other sites
Quote:
 Original post by the_edd
 Have each thread write to data that is sufficiently far apart. This might involve having each thread write to its own buffer and then combining them afterwards (which might also be doable in parallel).

That's what I tried by giving each thread several actors. Of course, there could still be adjacent memory sections.

Quote:
 Original post by the_edd
 Can you spell this out a bit more explicitly? Do you have a multi-consumer queue of actors? Or does the taking happen on the level of tasks? i.e.

Hmm. What is the difference between the two?
The program running for each thread looks like this:
```cpp
while (true)
{
    m_common->m_lock.enter();

    // get a new task
    void *subtask = m_common->m_job->getSubtask();
    if (!subtask)
    {
        m_common->m_lock.leave();
        m_threadFinishedEvent->signal();
        break;
    }
    m_common->m_lock.leave();

    // process the task
    m_common->m_job->processSubtask(subtask);
}
```

I always thought about it as tasks, but it sounds like the multi-consumer-queue you mentioned, doesn't it?

##### Share on other sites
You should probably try using a profiler to work out what's going on. For example, Intel's VTune has a free trial available and can profile multi-threaded code.

Here are some other possibilities for the lack of speedup with more threads:

- You've hit memory bandwidth limits, instead of processing power limits.

- Getting a new task is taking a long time compared to the time to process a task. For example if it takes about the same time to get a new task as it does to process one then any more than two threads will bottleneck on the task acquisition lock.

- Any other locks could also make extra threads less effective. Memory allocation will often use a global lock for example.

- If the OS is trying to save power by cutting down the clock speed of inactive CPU cores, then that can hit performance where some cores are mostly idle.

- Threads are sharing CPU cores either due to HyperThreading or the cores being busy with other tasks.

##### Share on other sites
How are you handling the bone->world transformation for the individual bones?

That problem is a reduction, and can be solved in parallel easily with most task-parallel libraries. The only requirement is that the reduction operation be associative, which matrix multiplication is. How it works isn't really all that important, although you can imagine it like this: if you are multiplying 12 matrices together and you have 4 cores, you put a separator every 3 matrices. The first core multiplies the first 3, the second core multiplies the second 3, etc. The end result of this is 4 matrices. Now the first core multiplies the first two, the second core multiplies the second two, and you end up with 2 matrices. Finally, a single core multiplies them to get the result.
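That tree-style reduction can be sketched as follows, using 2x2 matrices as a stand-in for bone matrices (all names hypothetical): each thread multiplies one contiguous run of the chain, and the partial products are then combined in order. Only associativity is relied on; matrix multiplication is not commutative, so the order of the runs must be preserved.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <thread>
#include <vector>

using Mat2 = std::array<double, 4>; // row-major 2x2, stand-in for a bone matrix

Mat2 mul(const Mat2 &a, const Mat2 &b)
{
    return { a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
             a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3] };
}

// Parallel reduction over a chain of matrix multiplications:
// split the chain into one contiguous run per thread, reduce each run
// locally, then combine the partial products in their original order.
Mat2 parallel_product(const std::vector<Mat2> &ms, unsigned num_threads)
{
    const Mat2 identity = {1, 0, 0, 1};
    std::vector<Mat2> partial(num_threads, identity);
    std::vector<std::thread> workers;
    const std::size_t chunk = (ms.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t lo = t * chunk;
            const std::size_t hi = std::min(ms.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i)
                partial[t] = mul(partial[t], ms[i]);
        });
    }
    for (auto &w : workers)
        w.join();

    Mat2 result = identity;
    for (const auto &p : partial)
        result = mul(result, p); // combine in order: associativity only
    return result;
}
```

A library like TBB's parallel_reduce does the same thing with a proper work-stealing scheduler instead of the naive one-thread-per-chunk split shown here.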

Task-based parallelism libraries like Intel TBB and Microsoft Concurrency Runtime can automatically parallelize a reduce for you without you needing to do anything special.

This will get you parallel computation of bone -> world transformations.

If you're not noticing much speedup on the blend, try swapping the order of the for loops. I assume you have some nested for loop and you're parallelizing on the outer loop. Swap the outer and inner loops and then try parallelizing on the outer loop.

##### Share on other sites
Using a profiler is probably a good idea. I already did my own measurements and, as I noted before, they suggest that processing one animation takes more time the more threads I have going.
I don't think getting a task takes a long time.
HyperThreading is impossible (AMD), but it could still be that threads share a core, of course. How would I know?

BTW, I'm really curious about the difference between that multi-consumer-queue and tasks.

##### Share on other sites
Modern profilers can show you a visualization of the concurrent execution per core. For example, Intel Parallel Studio, Visual Studio 2010 Concurrency Visualizer, etc. Maybe AMD PerfStudio, I'm not sure since I haven't used it. They basically show you something like a horizontal bar graph where each bar is a core. But it's not really a bar, it's really just a set of intervals, where a certain color means "something is running at this time". Then, you can click on that interval and get a callstack. It's really useful for visualizing how much execution overlap there actually is. I'm pretty sure they can show you things like false sharing etc as well.

##### Share on other sites
Quote:
 Original post by B_old
 Hmm. What is the difference between the two?

The parallel_for_each code doesn't require any locks (assuming the actors are stored in something like a std::vector where random access is both available and thread-safe).

Of course, there might still be locks in the thread-pool, but it has a better chance at avoiding bottlenecks when spreading the load (through work stealing and having distinct queues for each thread, for example).

Also, the scoped locking idiom is highly recommended.

##### Share on other sites
Thanks for the replies, guys!

Quote:
 Original post by the_edd
 The parallel_for_each code doesn't require any locks (assuming the actors are stored in something like a std::vector where random access is both available and thread-safe). Of course, there might still be locks in the thread-pool, but it has a better chance at avoiding bottlenecks when spreading the load (through work stealing and having distinct queues for each thread, for example). Also, the scoped locking idiom is highly recommended.

Actually, I'm trying to make a thread-pool that can be re-used for different purposes. I need to lock when I get a new actor. How could it possibly work without that?

The best thing to do probably is to consult a profiler and maybe try to use TBB for comparison.

##### Share on other sites
Just a small update:

I increased the number of animations that are blended per actor, without increasing the number of actors. That way the difference between 2 and 3 threads became clearly visible, although it's still not as big a jump as 1 to 2. But I guess it could mean that I am hitting another bottleneck here.

##### Share on other sites
Quote:
 Original post by B_old
 Actually, I'm trying to make a thread-pool that can be re-used for different purposes. I need to lock when I get a new actor. How could it possibly work without that?

Get a new actor from where? What do you mean?

Quote:
 The best thing to do probably is to consult a profiler and maybe try to use TBB for comparison.

A good plan.

##### Share on other sites
Quote:
 Original post by the_edd
 Get a new actor from where? What do you mean?

What I meant is that I have a std::vector of actors and enter a critical section when I pick an actor from the list and increment the iterator. Then I process that actor's animation, while other threads are free to get an actor.

##### Share on other sites
That isn't a great way to do it; ideally each thread in your thread group is allocated a non-overlapping range from that vector to work on, so that you don't have to lock and unlock things in order to get at the next chunk of data.

So, if you had 4 threads and 100 actors in your vector, then each thread would work on a range of 25 actors, so they can all access the vector directly without worrying about locking or unlocking.

It gets more complex when you introduce work stealing, but the above removes the need for locks.
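That range split can be sketched as follows (a hypothetical helper, not from the original code; it also handles counts that don't divide evenly by giving the first few threads one extra element):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Divide [0, count) into one contiguous, non-overlapping half-open
// range per thread, so each thread can index the actor vector directly
// without any locking.
std::vector<std::pair<std::size_t, std::size_t>>
split_ranges(std::size_t count, std::size_t num_threads)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t base = count / num_threads;
    const std::size_t extra = count % num_threads; // first `extra` threads get one more
    std::size_t begin = 0;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t len = base + (t < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}
```

For 4 threads and 100 actors this yields the four ranges [0,25), [25,50), [50,75), [75,100) described above.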

##### Share on other sites
Quote:
 Original post by phantom
 That isn't a great way to do it; ideally each thread in your thread group is allocated a non-overlapping range from that vector to work on, so that you don't have to lock and unlock things in order to get at the next chunk of data. So, if you had 4 threads and 100 actors in your vector, then each thread would work on a range of 25 actors, so they can all access the vector directly without worrying about locking or unlocking. It gets more complex when you introduce work stealing, but the above removes the need for locks.

Hmm, I think I understand.

How does one handle cases where the actors are stored in a non-random-access container (std::list maybe)?

One advantage of acquiring one actor at a time is that the load is more or less automatically spread over the threads (if there are a few particularly "hefty" actors around...). I guess that is when work stealing comes into play, but I have yet to look that term up.

##### Share on other sites
Quote:
 Original post by B_old
 How does one handle cases where the actors are stored in a non-random-access container (std::list maybe)?

With more difficulty :)

It depends on the relative cost of incrementing the iterator to processing the task associated with each element. There's no one hard-and-fast rule for this, unfortunately.

If you have bidirectional iterators it can sometimes even be worth creating a vector of pointers to the elements so that you have a random access range. EDIT: actually, forward iterators should work too.

Quote:
 One advantage of acquiring one actor at a time is that the load is more or less automatically spread over the threads (if there are a few particularly "hefty" actors around...).

You also get this when you split the range into N chunks for the N threads in your thread pool, so there's no advantage to doing this one element at a time for a random access range.

Actually, if the range is particularly large and you're not using work stealing, it's often a good idea to split the range into k*N pieces instead, which can spread the load more evenly if some tasks are much heftier than others.

I think perhaps you have this idea in your head that your container of actors should itself act as a queue into which threads can reach to grab stuff to work on. If that's the case, I don't think it's a particularly good idea. A queue may not be the best way to think of the data. A pool is more appropriate in this specific case, I think.
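One way to get the k*N-pieces behaviour without a lock around every single element is to hand out chunk indices from an atomic counter (a sketch with hypothetical names; each thread grabs the next unclaimed chunk, so hefty chunks self-balance across threads):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Dynamic chunked scheduling: threads repeatedly claim the next chunk
// index from an atomic counter; no lock is needed per element, and a
// thread stuck on a hefty chunk doesn't block the others.
void process_in_chunks(std::vector<int> &items, std::size_t chunk_size,
                       unsigned num_threads)
{
    std::atomic<std::size_t> next_chunk{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                const std::size_t c = next_chunk.fetch_add(1);
                const std::size_t lo = c * chunk_size;
                if (lo >= items.size())
                    break; // all chunks claimed
                const std::size_t hi = std::min(items.size(), lo + chunk_size);
                for (std::size_t i = lo; i < hi; ++i)
                    items[i] *= 2; // stand-in for animating one actor
            }
        });
    }
    for (auto &w : workers)
        w.join();
}
```

Choosing chunk_size so that there are roughly k*N chunks (for some small k) trades a little scheduling overhead for better load balance.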

[Edited by - the_edd on February 20, 2010 6:35:20 AM]

##### Share on other sites
Quote:
Original post by the_edd
Quote:
 Original post by B_old
 One advantage of acquiring one actor at a time is that the load is more or less automatically spread over the threads (if there are a few particularly "hefty" actors around...).

You also get this when you split the range in to N chunks for the N threads in your thread pool, so there's no advantage to doing this one element at a time for a random access range.

If the first (or any) thread gets all the heftiest tasks, the other threads will have to wait for it?

Quote:
 Original post by the_edd
 I think perhaps you have this idea in your head that your container of actors should itself act as a queue into which threads can reach to grab stuff to work on. If that's the case, I don't think it's a particularly good idea. A queue may not be the best way to think of the data. A pool is more appropriate in this specific case, I think.

My idea is this: I have an object that provides an interface to fetch tasks. I pass that object to the thread pool and the threads start fetching tasks and processing them. Is that what you mean by the idea in my head? :)
I like the idea of a pool as well, I just needed a way to ensure that different threads never fetch the same task.

##### Share on other sites
Quote:
 Original post by B_old
 If the first (or any) thread gets all the heftiest tasks, the other threads will have to wait for it?

Yes, that's definitely a worry. But see my comment about k*N chunks. Also, now's a good time to look up work stealing :)

It all comes down to balancing the costs of:

- sequence iteration
- processing each task
- synchronization when handing out work

Different strategies will work better for different kinds of work. If you know that some tasks will take much more time than others, then small range-chunk sizes are certainly something to try.

Quote:
 My idea is this: I have an object that provides an interface to fetch tasks. I pass that object to the thread pool and the threads start fetching tasks and processing them. Is that what you mean by the idea in my head? :)

More or less :)

Quote:
 I like the idea of a pool as well, I just needed a way to ensure that different threads never fetch the same task.

Express it as an algorithm and let it worry about that, e.g. parallel_for_each().

The parallel_for_each function we use at work looks something like this (pseudo code):

```cpp
// Ignoring any special cases for non-random-access iterators
template< /* blah */ >
void parallel_for_each(thread_pool threads, RAIter begin, RAIter end,
                       const UnaryFunctor &f,
                       std::size_t granularity = deduce_granularity)
{
    task_pool tasks;
    range_chunker rc(begin, end, threads.size(), granularity);

    while ((chunk = rc.next()))
        tasks.add(make_task(chunk.begin, chunk.end, f));

    future<void> completion = threads.take(tasks);
    completion.check();
}
```

The algorithm itself ensures that tasks are distributed correctly. Surely you don't want to have to do that manually every time?

##### Share on other sites
Thanks for the reply!

My approach goes in the general direction you outlined. Apart from being less general/flexible, and the locking, that is...

I'd say I wasn't that far off the mark, and you guys gave me some useful inspiration. Thanks!