B_old

Parallelizing animation blending


[Posting this here, because I believe the topic isn't strictly related to animation blending.]

Recently I have been playing with skeletal animation, and now I want to parallelize the step that creates the final matrices sent to the graphics card. That is, I want to blend several animations at the same time. At first I thought it would be pretty straightforward, but my approach isn't scaling very well.

I have many "actors", each with their own set of bone matrices. I have one animation, with different keyframes, which is shared by all actors (read only). When I process two actors concurrently I notice a certain speed boost, but the processing of one actor already takes longer than before, suggesting that they get in each other's way. Three actors concurrently yield practically no further benefit, as every single one now takes a lot of time.

This is a very vague and unspecific description of my problem, but it would be great if you could brainstorm with me. Is the scenario less suited to parallelism than I thought? Could it be a problem that all threads read from the same source? Looking forward to any hint!

P.S.: Regarding the "read from the same source" problem: I tried to give each actor its own copy of the animation (instead of all pointing to the same one), and that made performance noticeably worse. Maybe a cache-related thing.

P.P.S.: std::map::find() is called extensively from all threads. That should be OK, right?

Your issue could be more about where you write the output to. If lots of threads are writing output to the same cache line (cache lines are usually 64 or 128 bytes) it can cause cache thrashing as the memory gets synced up.

Reading from the same cache line on multiple threads isn't a problem.

Look at the 6th example here for details.

Thanks for the answer and the link.

In my current setup a thread takes an actor and starts animating it. Then it takes another one, and so on. There is a critical section around the "taking".

I then modified it so that a thread takes a bunch of actors and animates them. The idea was to give each thread actors that may be closer together in memory. It didn't change much at all.

If I indeed have a problem with writing to the same cache line, how could I solve it?

Quote:
Original post by B_old
If I indeed have a problem with writing to the same cache line, how could I solve it?


Have each thread write to data that is sufficiently far apart. This might involve having each thread write to its own buffer and then combining them afterwards (which might also be doable in parallel).

Quote:

P.P.S.: std::map::find() is called extensively from all threads. That should be ok, right?


There are no guarantees, but it's probably ok as long as you don't have other threads modifying the map (which of course includes the use of operator[]).

Quote:

In my current setup a thread takes an actor and starts animation. Then it takes another one and so on. There is a critical section around the "taking".


Can you spell this out a bit more explicitly? Do you have a multi-consumer queue of actors? Or does the taking happen on the level of tasks? i.e.


parallel_for_each(threadPool, actors.begin(), actors.end(), do_some_animation_stuff());

Quote:
Original post by the_edd
Have each thread write to data that is sufficiently far apart. This might involve having each thread write to it's own buffer and then combining them afterwards (which might also be do-able in parallel).

That's what I tried by giving each thread several actors. Of course, there could still be adjacent memory sections.

Quote:
Original post by the_edd
Can you spell this out a bit more explicitly. Do you have a multi-consumer queue of actors? Or does the taking happen on the level of tasks? i.e.

Hmm. What is the difference between the two?
The program running for each thread looks like this:

while (true)
{
    m_common->m_lock.enter();

    //get a new task
    void *subtask = m_common->m_job->getSubtask();

    if (!subtask)
    {
        m_common->m_lock.leave();
        m_threadFinishedEvent->signal();
        break;
    }

    m_common->m_lock.leave();

    //process the task
    m_common->m_job->processSubtask(subtask);
}


I always thought of it as tasks, but it sounds like the multi-consumer queue you mentioned, doesn't it?

You should probably try using a profiler to work out what's going on. For example Intel's VTune has a free trial available, and can profile multi threaded code.

Here are some other possibilities for the lack of speedup with more threads:

- You've hit memory bandwidth limits, instead of processing power limits.

- Getting a new task is taking a long time compared to the time to process a task. For example if it takes about the same time to get a new task as it does to process one then any more than two threads will bottleneck on the task acquisition lock.

- Any other locks could also make extra threads less effective. Memory allocation will often use a global lock for example.

- If the OS is trying to save power by cutting down the clock speed of inactive CPU cores, then that can hit performance where some cores are mostly idle.

- Threads are sharing CPU cores either due to HyperThreading or the cores being busy with other tasks.

How are you handling the bone->world transformation for the individual bones?

That problem is a reduction, and can be solved in parallel easily with most task-parallel libraries. The only requirement is that the reduction operation be associative, which matrix multiplication is. How it works isn't really all that important, although you can imagine it like this: if you are multiplying 12 matrices together and you have 4 cores, you put a separator every 3 matrices. The first core multiplies the first 3, the second core multiplies the next 3, etc. The end result of this is 4 matrices. Now the first core multiplies the first two, the second core multiplies the second two, and you end up with 2 matrices. Finally, a single core multiplies those to get the result.

Task-based parallelism libraries like Intel TBB and Microsoft Concurrency Runtime can automatically parallelize a reduce for you without you needing to do anything special.

This will get you parallel computation of bone -> world transformations.


If you're not noticing much speedup on the blend, try swapping the order of the for loops. I assume you have some nested for loop and you're parallelizing on the outer loop. Swap the outer and inner loops and then try parallelizing on the outer loop.

Using a profiler is probably a good idea. I already did my own measurements and, as I noted before, they suggest that processing a single animation takes more time the more threads I have running.
I don't think getting a task takes very long.
HyperThreading is impossible (AMD), but it could still be that threads share a core, of course. How would I know?

BTW, I'm really curious about the difference between that multi-consumer-queue and tasks.

Modern profilers can show you a visualization of the concurrent execution per core. For example, Intel Parallel Studio, Visual Studio 2010 Concurrency Visualizer, etc. Maybe AMD PerfStudio, I'm not sure since I haven't used it. They basically show you something like a horizontal bar graph where each bar is a core. But it's not really a bar, it's really just a set of intervals, where a certain color means "something is running at this time". Then, you can click on that interval and get a callstack. It's really useful for visualizing how much execution overlap there actually is. I'm pretty sure they can show you things like false sharing etc as well.

Quote:
Original post by B_old
Hmm. What is the difference between the two?

The parallel_for_each code doesn't require any locks (assuming the actors are stored in something like a std::vector where random access is both available and thread-safe).

Of course, there might still be locks in the thread-pool, but it has a better chance at avoiding bottlenecks when spreading the load (through work stealing and having distinct queues for each thread, for example).

Also, the scoped locking idiom is highly recommended.
