B_old

Parallelizing animation blending


[Posting this here because I believe the topic isn't strictly related to animation blending.]

Recently I have been playing with skeletal animation and now I want to parallelize the step that creates the final matrices sent to the graphics card. That is, I want to blend several animations at the same time. At first I thought it would be pretty straightforward, but my approach isn't scaling very well.

I have many "actors", each with their own set of bone matrices. I have one animation, with different keyframes, which is shared by all actors (read-only). When I process two actors concurrently I notice a certain speed boost, but processing a single actor already takes longer than before, suggesting that they get in each other's way. Three actors concurrently yields practically no further benefit, as every single one now takes a lot of time.

This is a very vague and unspecific description of my problem, but it would be great if you could brainstorm with me. Is the scenario not as well suited to parallelism as I thought? Could it be a problem that all threads read from the same source? Looking forward to any hint!

P.S.: Regarding the "read from the same source" problem: I tried to give each actor its own copy of the animation (instead of all pointing to the same one) and that made performance noticeably worse. Maybe a cache-related thing.

P.P.S.: std::map::find() is called extensively from all threads. That should be OK, right?

Your issue could be more about where you write the output to. If lots of threads are writing output to the same cache line (cache lines are usually 64 or 128 bytes) it can cause cache thrashing as the memory gets synced up.

Reading from the same cache line on multiple threads isn't a problem.

Look at the 6th example here for details.

Thanks for the answer and the link.

In my current setup a thread takes an actor and starts animating it. Then it takes another one, and so on. There is a critical section around the "taking".

I then modified it so that a thread takes a bunch of actors and animates them together. The idea was to give each thread actors that are perhaps closer together in memory. It didn't change much at all.

If I indeed have a problem with writing to the same cache line, how could I solve it?

Quote:
Original post by B_old
If I indeed have a problem with writing to the same cache line, how could I solve it?


Have each thread write to data that is sufficiently far apart. This might involve having each thread write to its own buffer and then combining them afterwards (which might also be doable in parallel).
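As a minimal sketch of "sufficiently far apart", assuming a 64-byte cache line (the struct and array names here are made up for illustration):

```cpp
#include <cstddef>

// One output slot per worker thread, aligned to an assumed 64-byte cache
// line so that writes from different threads never touch the same line
// (which would cause false sharing). alignas also rounds sizeof up to a
// multiple of the alignment, so adjacent array elements land on distinct
// cache lines.
struct alignas(64) PerThreadOutput
{
    float matrices[16]; // e.g. room for one 4x4 result matrix
};

PerThreadOutput g_outputs[8]; // one slot per worker thread
```

Whether 64 bytes is the right figure depends on the CPU; some platforms use 128-byte lines.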

Quote:

P.P.S.: std::map::find() is called extensively from all threads. That should be ok, right?


There are no guarantees, but it's probably ok as long as you don't have other threads modifying the map (which of course includes the use of operator[]).

Quote:

In my current setup a thread takes an actor and starts animation. Then it takes another one and so on. There is a critical section around the "taking".


Can you spell this out a bit more explicitly? Do you have a multi-consumer queue of actors? Or does the taking happen at the level of tasks? i.e.


parallel_for_each(threadPool, actors.begin(), actors.end(), do_some_animation_stuff());

Quote:
Original post by the_edd
Have each thread write to data that is sufficiently far apart. This might involve having each thread write to its own buffer and then combining them afterwards (which might also be doable in parallel).

That's what I tried by giving each thread several actors. Of course, there could still be adjacent memory sections.

Quote:
Original post by the_edd
Can you spell this out a bit more explicitly? Do you have a multi-consumer queue of actors? Or does the taking happen at the level of tasks? i.e.

Hmm. What is the difference between the two?
The program running for each thread looks like this:

while (true)
{
    m_common->m_lock.enter();

    // get a new task
    void *subtask = m_common->m_job->getSubtask();

    if (!subtask)
    {
        m_common->m_lock.leave();
        m_threadFinishedEvent->signal();
        break;
    }

    m_common->m_lock.leave();

    // process the task (outside the lock)
    m_common->m_job->processSubtask(subtask);
}


I always thought about it as tasks, but it sounds like the multi-consumer-queue you mentioned, doesn't it?

You should probably try using a profiler to work out what's going on. For example Intel's VTune has a free trial available, and can profile multi threaded code.

Here are some other possibilities for the lack of speedup with more threads:

- You've hit memory bandwidth limits, instead of processing power limits.

- Getting a new task is taking a long time compared to the time to process a task. For example if it takes about the same time to get a new task as it does to process one then any more than two threads will bottleneck on the task acquisition lock.

- Any other locks could also make extra threads less effective. Memory allocation will often use a global lock for example.

- If the OS is trying to save power by cutting down the clock speed of inactive CPU cores, then that can hit performance where some cores are mostly idle.

- Threads are sharing CPU cores either due to HyperThreading or the cores being busy with other tasks.

How are you handling the bone->world transformation for the individual bones?

That problem is a reduction, and can be solved in parallel easily with most task-parallel libraries. The only requirement is that the reduction operation be associative, which matrix multiplication is. How it works isn't really all that important, although you can picture it like this: if you are multiplying 12 matrices together and you have 4 cores, you put a separator every 3 matrices. The first core multiplies the first 3, the second core multiplies the next 3, etc. The end result is 4 matrices. Now the first core multiplies the first two, the second core multiplies the second two, and you end up with 2 matrices. Finally, a single core multiplies those to get the result.

Task-based parallelism libraries like Intel TBB and Microsoft Concurrency Runtime can automatically parallelize a reduce for you without you needing to do anything special.

This will get you parallel computation of bone -> world transformations.
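A concrete sketch of that tree-shaped reduction, using tiny 2x2 matrices and std::async in place of a real task library (the names Mat2, mul, and parallel_product are made up for illustration):

```cpp
#include <array>
#include <cstddef>
#include <future>
#include <vector>

// Minimal row-major 2x2 matrix, standing in for a bone transform.
using Mat2 = std::array<float, 4>;

Mat2 mul(const Mat2 &a, const Mat2 &b)
{
    return { a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
             a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3] };
}

// Tree-shaped reduction: each pass multiplies adjacent pairs in parallel,
// halving the number of matrices until one remains. Correct because matrix
// multiplication is associative; order is preserved, so it need not be
// commutative. Assumes at least one input matrix.
Mat2 parallel_product(std::vector<Mat2> ms)
{
    while (ms.size() > 1)
    {
        std::vector<std::future<Mat2>> futures;
        for (std::size_t i = 0; i + 1 < ms.size(); i += 2)
            futures.push_back(std::async(std::launch::async, mul, ms[i], ms[i + 1]));

        std::vector<Mat2> next;
        for (auto &f : futures)
            next.push_back(f.get());
        if (ms.size() % 2)          // odd element carries over to the next pass
            next.push_back(ms.back());
        ms = std::move(next);
    }
    return ms.front();
}
```

A real task library would reuse worker threads instead of spawning one per pair, but the shape of the computation is the same.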


If you're not noticing much speedup on the blend, try swapping the order of the for loops. I assume you have some nested for loop and you're parallelizing the outer loop. Swap the outer and inner loops and then try parallelizing the new outer loop.

Using a profiler is probably a good idea. I already did my own measurements and, as I noted before, they suggest that processing one animation takes more time the more threads I have running.
I don't think getting a task takes a long time.
HyperThreading is impossible (it's an AMD CPU), but threads could still be sharing a core, of course. How would I know?

BTW, I'm really curious about the difference between that multi-consumer-queue and tasks.

Modern profilers can show you a visualization of the concurrent execution per core. For example, Intel Parallel Studio, Visual Studio 2010 Concurrency Visualizer, etc. Maybe AMD PerfStudio, I'm not sure since I haven't used it. They basically show you something like a horizontal bar graph where each bar is a core. But it's not really a bar, it's really just a set of intervals, where a certain color means "something is running at this time". Then, you can click on that interval and get a callstack. It's really useful for visualizing how much execution overlap there actually is. I'm pretty sure they can show you things like false sharing etc as well.

Quote:
Original post by B_old
Hmm. What is the difference between the two?

The parallel_for_each code doesn't require any locks (assuming the actors are stored in something like a std::vector where random access is both available and thread-safe).

Of course, there might still be locks in the thread-pool, but it has a better chance at avoiding bottlenecks when spreading the load (through work stealing and having distinct queues for each thread, for example).

Also, the scoped locking idiom is highly recommended.

Thanks for the replies, guys!

Quote:
Original post by the_edd
The parallel_for_each code doesn't require any locks (assuming the actors are stored in something like a std::vector where random access is both available and thread-safe).

Of course, there might still be locks in the thread-pool, but it has a better chance at avoiding bottlenecks when spreading the load (through work stealing and having distinct queues for each thread, for example).

Also, the scoped locking idiom is highly recommended.

Actually, I'm trying to make a thread-pool that can be re-used for different purposes. I need to lock when I get a new actor. How could it possibly work without one?

The best thing to do probably is to consult a profiler and maybe try to use TBB for comparison.

Just a small update:

I increased the number of animations that are blended per actor, without increasing the number of actors. That way the difference between 2 and 3 threads became clearly visible, although it's still not as big a jump as from 1 to 2. I guess it could mean that I am hitting another bottleneck here.

Quote:
Original post by B_old
Actually, I'm trying to make a thread-pool that can be re-used for different purposes. I need to lock when I get a new actor. How could it possibly work without one?

Get a new actor from where? What do you mean?

Quote:
The best thing to do probably is to consult a profiler and maybe try to use TBB for comparison.

A good plan.

Quote:
Original post by the_edd
Get a new actor from where? What do you mean?

What I meant is that I have a std::vector of actors and enter a critical section when I pick an actor from the list and increment the iterator. Then I process that actor's animation, while other threads are free to get actors of their own.

That isn't a great way to do it; ideally each thread in your thread group is allocated a non-overlapping range of that vector to work on, so that you don't have to lock and unlock things in order to get at the next chunk of data.

So, if you had 4 threads and 100 actors in your vector, then each thread would work on a range of 25 actors, and they can access the vector directly without worrying about locking or unlocking.

It gets more complex when you introduce work stealing, but the above removes the need for locks.
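A sketch of that static partitioning (the function name split_range is made up; it returns [begin, end) index pairs, one per thread):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, total) into one contiguous, non-overlapping chunk per thread.
// The first (total % threads) chunks get one extra element so the work is
// spread as evenly as possible. Assumes threads > 0.
std::vector<std::pair<std::size_t, std::size_t>>
split_range(std::size_t total, std::size_t threads)
{
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    std::size_t base = total / threads, extra = total % threads, begin = 0;
    for (std::size_t t = 0; t < threads; ++t)
    {
        std::size_t len = base + (t < extra ? 1 : 0);
        chunks.emplace_back(begin, begin + len);
        begin += len;
    }
    return chunks;
}
```

Each thread then iterates over its own chunk with no locking at all, e.g. split_range(100, 4) gives the four 25-actor ranges from the example above.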

Quote:
Original post by phantom
That isn't a great way to do it; ideally each thread in your thread group is allocated a non-overlapping range of that vector to work on, so that you don't have to lock and unlock things in order to get at the next chunk of data.

So, if you had 4 threads and 100 actors in your vector, then each thread would work on a range of 25 actors, and they can access the vector directly without worrying about locking or unlocking.

It gets more complex when you introduce work stealing, but the above removes the need for locks.

Hmm, I think I understand.

How does one handle cases where the actors are stored in a non-random-access container (std::list maybe)?

One advantage of acquiring one actor at a time is that the load is more or less automatically spread over the threads (if there are a few particularly "hefty" actors around...). I guess that is when work stealing comes into play, but I have yet to look that term up.

Quote:
Original post by B_old
How does one handle cases where the actors are stored in a non-random-access container (std::list maybe)?

With more difficulty :)

It depends on the relative cost of incrementing the iterator to processing the task associated with each element. There's no one hard-and-fast rule for this, unfortunately.

If you have bidirectional iterators it can sometimes even be worth creating a vector of pointers to the elements, so that you have a random-access range. EDIT: actually, forward iterators should work too.

Quote:

One advantage of acquiring one actor at a time is that the load is more or less automatically spread over the threads (if there are a few particularly "hefty" actors around...).

You also get this when you split the range into N chunks for the N threads in your thread pool, so there's no advantage to doing this one element at a time for a random-access range.

Actually, if the range is particularly large and you're not using work stealing, it's often a good idea to split the range into k*N pieces instead, which can spread the load more evenly if some tasks are much heftier than others.

I think perhaps you have this idea in your head that your container of actors should itself act as a queue from which threads grab stuff to work on. If that's the case, I don't think it's a particularly good idea. A queue may not be the best way to think of the data. A pool is more appropriate in this specific case, I think.

[Edited by - the_edd on February 20, 2010 6:35:20 AM]

Quote:
Original post by the_edd
Quote:
Original post by B_old
One advantage of acquiring one actor at a time is that the load is more or less automatically spread over the threads (if there are a few particularly "hefty" actors around...).

You also get this when you split the range into N chunks for the N threads in your thread pool, so there's no advantage to doing this one element at a time for a random-access range.

If the first (or any) thread gets all the heftiest tasks, won't the other threads have to wait for it?

Quote:
Original post by the_edd
I think perhaps you have this idea in your head that your container of actors should itself act as a queue from which threads grab stuff to work on. If that's the case, I don't think it's a particularly good idea. A queue may not be the best way to think of the data. A pool is more appropriate in this specific case, I think.

My idea is this: I have an object that provides an interface to fetch tasks. I pass that object to the thread pool and the threads start fetching tasks and processing them. Is that what you mean by the idea in my head? :)
I like the idea of a pool as well, I just needed a way to ensure that different threads never fetch the same task.

Quote:
Original post by B_old
If the first (or any) thread gets all the heftiest tasks, won't the other threads have to wait for it?

Yes, that's definitely a worry. But see my comment about k*N chunks. Also, now's a good time to look up work stealing :)

It all comes down to balancing the costs of:

- task creation
- sequence iteration
- task submission
- task execution
- per-task processing overhead

Different strategies will work better for different kinds of work. If you know that some tasks will take much more time than others, then small range-chunk sizes are certainly something to try.

Quote:

My idea is this: I have an object that provides an interface to fetch tasks. I pass that object to the thread pool and the threads start fetching tasks and processing them. Is that what you mean by the idea in my head? :)

More or less :)

Quote:
I like the idea of a pool as well, I just needed a way to ensure that different threads never fetch the same task.

Express it as an algorithm, e.g. parallel_for_each(), and let the algorithm worry about that.

The parallel_for_each function we use at work looks something like this (pseudo code):


// Ignoring any special cases for non-random-access iterators
template< /* blah */ >
void parallel_for_each(thread_pool threads, RAIter begin, RAIter end,
                       const UnaryFunctor &f,
                       std::size_t granularity = deduce_granularity)
{
    task_pool tasks;
    range_chunker rc(begin, end, threads.size(), granularity);
    while (chunk = rc.next())
        tasks.add(make_task(chunk.begin, chunk.end, f));

    future<void> completion = threads.take(tasks);
    completion.check();
}


The algorithm itself ensures that tasks are distributed correctly. Surely you don't want to have to do that manually every time?
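For comparison, here is a minimal runnable version of the same idea using plain std::thread, a sketch without the work stealing, granularity control, or futures of the pseudocode above:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Minimal parallel_for_each over a random-access range: split the range into
// one contiguous chunk per thread, run each chunk on its own std::thread,
// and join. No locks are needed because the chunks never overlap.
template <typename RAIter, typename UnaryFunctor>
void parallel_for_each(RAIter begin, RAIter end, UnaryFunctor f,
                       std::size_t num_threads = 4)
{
    const std::size_t total = static_cast<std::size_t>(end - begin);
    const std::size_t chunk = (total + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t)
    {
        RAIter first = begin + std::min(t * chunk, total);
        RAIter last  = begin + std::min(t * chunk + chunk, total);
        workers.emplace_back([first, last, f] {
            for (RAIter it = first; it != last; ++it)
                f(*it); // f must be safe to call concurrently on distinct elements
        });
    }
    for (auto &w : workers)
        w.join();
}
```

A real thread pool would reuse its workers instead of spawning threads per call, but the chunking logic is the part that removes the need for a lock around "taking" work.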

Thanks for the reply!

My approach goes in the general direction you outlined, apart from being less general/flexible, and the locking, that is...

I'd say I wasn't that far off the mark, and you guys gave me some useful inspiration. Thanks!

