Alundra

Animation threading

Alundra    2316

Hi all,

I have an Animator class that is responsible for animating one mesh.

Each animated mesh in the scene has its own animator.

Animation costs a lot of performance, so threading needs to be added.

Since I'm new to threading, my idea is to have an AnimatorManager through which every Animator instance is created.

The AnimatorManager holds an array of Animator* and an array of threads sized to the number of CPU cores.

The AnimatorManager will have an Update function that updates every Animator* in the array using those threads.

Is this a good approach, or is there a better way?

Thanks for the help
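
For illustration, the manager described above could look roughly like the following. This is only a sketch under assumptions: the `Animator` body is a stand-in, `Create` and the chunking scheme are hypothetical, and it assumes animators share no mutable state with each other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for the Animator class from the post.
class Animator {
public:
    void Update(float dt) { time_ += dt; }  // real code would evaluate the skeleton here
    float Time() const { return time_; }
private:
    float time_ = 0.0f;
};

class AnimatorManager {
public:
    // hardware_concurrency() may return 0, so clamp to at least one thread.
    explicit AnimatorManager(unsigned threadCount = std::thread::hardware_concurrency())
        : threadCount_(std::max(1u, threadCount)) {}

    // Every Animator is created through the manager, which keeps ownership.
    Animator* Create() {
        animators_.push_back(std::make_unique<Animator>());
        return animators_.back().get();
    }

    // Splits the array into contiguous chunks, one chunk per thread.
    // Each Animator is touched by exactly one thread, so no locking is needed.
    void Update(float dt) {
        assert(threadCount_ > 0);
        const std::size_t count = animators_.size();
        const std::size_t chunk = (count + threadCount_ - 1) / threadCount_;
        std::vector<std::thread> workers;
        for (std::size_t begin = 0; begin < count; begin += chunk) {
            const std::size_t end = std::min(count, begin + chunk);
            workers.emplace_back([this, begin, end, dt] {
                for (std::size_t i = begin; i < end; ++i) animators_[i]->Update(dt);
            });
        }
        for (std::thread& w : workers) w.join();  // wait before the frame continues
    }

private:
    unsigned threadCount_;
    std::vector<std::unique_ptr<Animator>> animators_;
};
```

Spawning threads every frame has its own cost; a persistent worker pool or job system would avoid it, but the partitioning idea stays the same.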

TheComet    3896

What graphics library are you using?

 

I would recommend implementing GPU accelerated animation rather than messing around with CPU threading. That way the per-vertex animation transforms are offloaded to the GPU through vertex programs, which improves performance immensely.

 

Ogre3D supports this, for instance: http://www.ogre3d.org/docs/manual/manual_76.html

Alundra    2316

I already do GPU skinning in the vertex shader (HLSL and GLSL).

The animation system is layer-based, and each AnimatedMesh component contains an animator.

GPU skinning helps performance a lot, but it's not enough; that's why threading is needed, to split up all the updates.

I'm new to threading, so I don't know much about it. My idea is just to have a manager and update the animators from it using threads.

L. Spiro    25621

Perhaps you are not utilizing the GPU properly. Normally the GPU is plenty fast enough to handle skinning, but there is more than one way to lose performance.

Beating the GPU by using threads is non-trivial and can only happen if you really know what you are doing; it is not a good first exercise in threading.

Make sure you are actually using the GPU correctly before you try to use threading to gain performance.  It may help to post your vertex shader(s).

 

 

L. Spiro

Alundra    2316

Here is the vertex shader I use for meshes that need skinning:

struct VS_INPUT
{
  float4 Position : POSITION;
  float3 Normal   : NORMAL;
  float2 TexCoord : TEXCOORD0;
  float4 Tangent  : TANGENT;
  float4 Weights  : WEIGHTS;
  uint4  Indices  : BONEINDICES;
};

struct VS_OUTPUT
{
  float4 Position : SV_POSITION;
  float3 Normal   : NORMAL;
  float4 Tangent  : TANGENT;
  float2 TexCoord : TEXCOORD0;
  float4 PosVS    : TEXCOORD1;
};

cbuffer WVP_WVIT_CBUFFER : register( b0 )
{
  float4x4 WorldView;
  float4x4 Projection;
  float4x4 WorldViewInverseTranspose;
};

cbuffer MESH_SKINNED_CBUFFER : register( b1 )
{
  float4x4 BoneMatrices[ 96 ];
};

VS_OUTPUT main( in VS_INPUT Input )
{
  VS_OUTPUT Output = (VS_OUTPUT)0;
  float4 SkinnedPos = float4( 0.0f, 0.0f, 0.0f, 0.0f );
  float3 SkinnedNormal = float3( 0.0f, 0.0f, 0.0f );
  float3 SkinnedTangent = float3( 0.0f, 0.0f, 0.0f );
  for( int i = 0; i < 4; ++i )
  {
    if( Input.Weights[ i ] > 0.0f )
    {
      SkinnedPos += mul( Input.Position, BoneMatrices[ Input.Indices[ i ] ] ) * Input.Weights[ i ];
      SkinnedNormal += mul( Input.Normal, (float3x3)BoneMatrices[ Input.Indices[ i ] ] ) * Input.Weights[ i ];
      SkinnedTangent += mul( Input.Tangent.xyz, (float3x3)BoneMatrices[ Input.Indices[ i ] ] ) * Input.Weights[ i ];
    }
  }
  float4 ViewPosition = mul( SkinnedPos, WorldView );
  Output.Position = mul( ViewPosition, Projection );
  Output.Normal = mul( SkinnedNormal, (float3x3)WorldViewInverseTranspose );
  Output.Tangent = float4( mul( SkinnedTangent, (float3x3)WorldView ), Input.Tangent.w );
  Output.TexCoord = Input.TexCoord;
  Output.PosVS = ViewPosition;
  return Output;
}

AgentC    2352

What graphics library are you using?

 

I would recommend implementing GPU accelerated animation rather than messing around with CPU threading. That way the per-vertex animation transforms are offloaded to the GPU through vertex programs, which improves performance immensely.

 

Ogre3D supports this, for instance: http://www.ogre3d.org/docs/manual/manual_76.html

 

Even if the GPU handles actual vertex skinning according to the bone matrices it is given, the calculations to yield the final bone matrices (sample and blend between keyframes, blend if multiple animations, transform local-space bone transforms into the world space, multiply with inverse bind pose) can easily be a per-frame CPU hotspot if there are many (50+) characters onscreen, and thus will benefit from threading.
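
To make the last two CPU steps concrete, here is a rough sketch of turning already sampled and blended local bone transforms into the matrices uploaded for skinning. `Mat4`, `Bone` and `BuildSkinningMatrices` are hypothetical names, the row-vector convention is assumed to match `mul( v, M )` in the shader posted above, and bones are assumed to be sorted so that parents come before children.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal row-major 4x4 matrix using the row-vector convention (v * M).
struct Mat4 {
    std::array<float, 16> m{};
    static Mat4 Identity() {
        Mat4 r;
        r.m[0] = r.m[5] = r.m[10] = r.m[15] = 1.0f;
        return r;
    }
    static Mat4 Translation(float x, float y, float z) {
        Mat4 r = Identity();
        r.m[12] = x; r.m[13] = y; r.m[14] = z;
        return r;
    }
};

Mat4 Mul(const Mat4& a, const Mat4& b) {
    Mat4 r;  // starts zeroed
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                r.m[row * 4 + col] += a.m[row * 4 + k] * b.m[k * 4 + col];
    return r;
}

struct Bone {
    int parent;            // index of the parent bone, -1 for the root
    Mat4 inverseBindPose;  // model space -> bone space at bind time
};

// local[i] is the sampled and blended local transform of bone i for this frame.
std::vector<Mat4> BuildSkinningMatrices(const std::vector<Bone>& bones,
                                        const std::vector<Mat4>& local) {
    assert(bones.size() == local.size());
    std::vector<Mat4> model(bones.size());
    std::vector<Mat4> skinning(bones.size());
    for (std::size_t i = 0; i < bones.size(); ++i) {
        const int parent = bones[i].parent;
        assert(parent < static_cast<int>(i));  // parents must come first
        // Local space -> model space by walking down the hierarchy.
        model[i] = (parent < 0) ? local[i] : Mul(local[i], model[parent]);
        // Undo the bind pose, then apply the animated pose.
        skinning[i] = Mul(bones[i].inverseBindPose, model[i]);
    }
    return skinning;
}
```

That is two matrix multiplies per bone per character per frame, before any keyframe sampling, which is why the cost scales with the number of characters.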

 

In addition to just threading the animation work, workload can also be reduced:

- Make sure you're not calculating animation for characters outside view frustum

- When characters are far away, you can get away with not updating the animation every frame (a primitive form of LOD)
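
Both reductions can be folded into a single per-character check. The sketch below is only an illustration: the distance thresholds, the update intervals and the function names are made up, and `inFrustum` is assumed to come from whatever visibility test the renderer already performs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical thresholds; tune per game.
constexpr float kNearDistance = 20.0f;  // update every frame inside this range
constexpr float kMidDistance  = 60.0f;  // update every 2nd frame inside this range

// Returns how many frames may pass between two animation updates.
inline std::uint32_t UpdateInterval(float distanceToCamera) {
    assert(distanceToCamera >= 0.0f);
    if (distanceToCamera < kNearDistance) return 1;
    if (distanceToCamera < kMidDistance)  return 2;
    return 4;
}

// characterId staggers the updates so that distant characters
// do not all land on the same frame.
inline bool ShouldUpdateAnimation(bool inFrustum, float distanceToCamera,
                                  std::uint32_t frameIndex, std::uint32_t characterId) {
    if (!inFrustum) return false;  // never animate what cannot be seen
    const std::uint32_t interval = UpdateInterval(distanceToCamera);
    return (frameIndex + characterId) % interval == 0;
}
```

When an update is skipped, the next one has to advance the animation by the accumulated time rather than a single frame's delta, otherwise distant characters would play in slow motion.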

 

However if we're talking about only a few or a few tens of characters the CPU side of animation shouldn't be a significant hotspot.

Edited by AgentC

L. Spiro    25621

Firstly, remove the branch from the for loop.  Iterate over all 4 weights regardless of them being 0 or not.  Negative weights should not be allowed by the CPU end.

 

Secondly, you only need to upload as many bones as are referenced by the part of the model you are rendering.  For instance, a mech-machine will likely be broken into 1 mesh for each leg, 1 or 2 or so for the body, some for the weapons, etc.
You aren’t rendering the entire model all in one pass, but in multiple passes in which smaller parts of the model are rendered at a time.  If you are rendering the front-left leg, there is no reason to send bone information for the back-right leg.  Reducing the number of bones you send reduces bandwidth heavily and will be one of the largest gains in performance you will see.

 

 

The rest of my suggestions may be exactly the same performance or may be faster, so you would have to test.  The shader compiler will likely be smart enough not to perform array look-ups every time, but you can be sure by storing Input.Weights[ i ] to a temporary and using that instead of repeated array access.  Same thing with Input.Indices[ i ] and possibly even BoneMatrices[ Input.Indices[ i ] ].

Try various combinations of storing these to temporaries, benchmark, and repeat.

 

 

L. Spiro

mhagain    13430

You should also look at your bone matrix multiplication and upload code; it's possible that you may have bottlenecks there that are solvable without even having to consider threading as an option.

Edited by mhagain

Alundra    2316

Firstly, remove the branch from the for loop.  Iterate over all 4 weights regardless of them being 0 or not.  Negative weights should not be allowed by the CPU end.

I thought branching was now fast enough to be worth it for skipping the matrix multiplications; thanks for letting me know it's still better to do the multiplications than to branch.

Is it the same for a diffuse texture: bind a white texture and always sample it instead of branching?

Secondly, you only need to upload as many bones as are referenced by the part of the model you are rendering.

Yes, I already do that: the mesh is split into geometries, each with its own bone array, and each geometry is split by material.

You should also look at your bone matrix multiplication and upload code; it's possible that you may have bottlenecks there that are solvable without even having to consider threading as an option.

I already skip updating the final bone matrix array when no animation is playing.

Edited by Alundra

L. Spiro    25621


I thought branching was now fast enough to be worth it for skipping the matrix multiplications; thanks for letting me know it's still better to do the multiplications than to branch.
Is it the same for a diffuse texture: bind a white texture and always sample it instead of branching?

It is very much worth testing.

 

On the CPU side, are you using SSE2 (at minimum) for matrix multiplication?
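
As a point of comparison, a 4x4 multiply with SSE intrinsics can look roughly like this. It is a sketch rather than anyone's actual `CMatrix4` code: the function name and the row-major 16-float layout are assumptions, unaligned loads are used so that no alignment is required, and a scalar fallback is kept for targets without SSE2.

```cpp
#include <cassert>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_USE_SSE2 1
#endif

// Row-major 4x4 matrices stored as 16 contiguous floats.
// out must not alias a or b.
inline void MultiplyMatrix4(const float* a, const float* b, float* out) {
    assert(out != a && out != b);
#ifdef SKETCH_USE_SSE2
    const __m128 b0 = _mm_loadu_ps(b + 0);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);
    for (int row = 0; row < 4; ++row) {
        // out.row = a[row][0]*b.row0 + a[row][1]*b.row1 + a[row][2]*b.row2 + a[row][3]*b.row3
        __m128 r = _mm_mul_ps(_mm_set1_ps(a[row * 4 + 0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[row * 4 + 1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[row * 4 + 2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[row * 4 + 3]), b3));
        _mm_storeu_ps(out + row * 4, r);
    }
#else
    // Scalar fallback with the same result.
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) sum += a[row * 4 + k] * b[k * 4 + col];
            out[row * 4 + col] = sum;
        }
    }
#endif
}
```

Writing into a caller-provided output also avoids returning a 64-byte matrix by value, although compilers often elide that copy anyway, so it is worth measuring both.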

 

 

L. Spiro

phil_t    8084

Can you clarify if you're CPU-bound or GPU-bound?

It sounds like you are CPU-bound (calculating the bone matrices?) - but some folks are offering shader optimizations, which won't make a difference if you're being limited on the CPU.

Alundra    2316

I profiled with Very Sleepy for 1 minute, and the matrix operator* is the heaviest function on the list:

CMatrix4::operator* = 2.43s (exclusive)

The second on the list is QuaternionSlerp:

QuaternionSlerp = 1.57s (exclusive)

I think QuaternionSlerp could be replaced entirely by QuaternionNLerp. Inside it I already check whether to fall back to an NLerp:

const float CosPhi = QuaternionDot( q1, NewQ2 );
if( CosPhi > ( 1.0f - 0.001f ) )

but since the angle is small most of the time, this check could be removed and QuaternionNLerp used directly.
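
For reference, a normalized lerp can be sketched as follows. The `Quaternion` struct and both function bodies are assumptions written for this example, not the engine's actual code; only the names `QuaternionDot` and `QuaternionNLerp` come from the post.

```cpp
#include <cassert>
#include <cmath>

struct Quaternion { float x, y, z, w; };

inline float QuaternionDot(const Quaternion& a, const Quaternion& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Normalized linear interpolation between two unit quaternions.
inline Quaternion QuaternionNLerp(const Quaternion& q1, const Quaternion& q2, float t) {
    // q and -q encode the same rotation; flip q2 if needed to take the shorter arc.
    const float sign = QuaternionDot(q1, q2) < 0.0f ? -1.0f : 1.0f;
    const float s = 1.0f - t;
    const Quaternion r{ s * q1.x + t * sign * q2.x,
                        s * q1.y + t * sign * q2.y,
                        s * q1.z + t * sign * q2.z,
                        s * q1.w + t * sign * q2.w };
    // Renormalize, since a straight lerp cuts through the unit sphere.
    const float length = std::sqrt(QuaternionDot(r, r));
    assert(length > 0.0f);
    const float inv = 1.0f / length;
    return { r.x * inv, r.y * inv, r.z * inv, r.w * inv };
}
```

Unlike slerp, nlerp does not rotate at a constant angular speed between the two keys, but for the small angles between neighbouring keyframes the difference is usually hard to notice, and it avoids the trigonometric calls.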

My current performance on a map of 250 000 triangles with 10 animated characters of 50 bones each, with textures and directional lighting, is 530 FPS.

Edited by Alundra

Khatharr    8812

How many calls were there to CMatrix4::operator*?

 

It's taking up 4.05% of your CPU time, so it almost sounds like... *drumroll, please* you should be offloading a lot of that work to the GPU through the use of vertex shaders.

 

Can you walk us step-by-step through the process you use to render a single entity once?

Alundra    2316

I don't think the heaviest code is here; the way I send a skinned mesh's data to the GPU is nothing fancy:

1) Bind VertexBuffer/IndexBuffer

2) Bind VertexShader

3) Update the constant buffer; this is where I compute InverseBindPose*FinalTransform[ i ].

4) Loop over the material subsets (steps 5 to 8 run once per subset)

5) Check if we have a material

6) Bind PixelShader

7) Update constant buffer/textures

8) Draw the subset

phil_t    8084

My current performance on a map of 250 000 triangles with 10 animated characters of 50 bones each, with textures and directional lighting, is 530 FPS.

 

530FPS! What makes you think you have a performance problem? What are your performance goals?

 

The matrix operations you listed are taking up 4% of your CPU time. Assuming that's all in the bone calculations, and you were able to successfully divide the work onto 4 cores, it would now be taking 1% of your CPU time. So the CPU time for a frame is now 97% of what it used to be (i.e. of questionable benefit considering the added complexity).
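
The arithmetic above is Amdahl's law with a perfectly parallel 4% slice. A tiny helper makes it easy to try other numbers; the function name is made up for this example.

```cpp
#include <cassert>

// Fraction of the original frame time that remains after spreading
// `parallelFraction` of the work perfectly across `cores` cores.
// Real threading adds overhead, so this is a best case.
inline double RemainingFrameTime(double parallelFraction, int cores) {
    assert(cores > 0);
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0);
    return (1.0 - parallelFraction) + parallelFraction / cores;
}
```

With the numbers from this thread, `RemainingFrameTime(0.04, 4)` gives 0.96 + 0.01 = 0.97. Under the same assumption, 530 FPS would become roughly 530 / 0.97, or about 546 FPS.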

 

If you're really worried your CPU bone matrix calculations are a performance issue, make a build where you can turn them off at will (i.e. just not update them each frame). Does it affect performance?

Edited by phil_t

Alundra    2316

I just wanted to try some threading to see how it works, so I proposed an approach, but I don't know where the best place to apply threading in an animation system is.

Apart from threading, using SSE2 and replacing operator* with a function that takes pointers to the matrices could gain performance too.

I should also mention that operator* is used in Actor::Update, so a boolean needs to be added to skip the transform update when it isn't needed.
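
The boolean mentioned for Actor::Update is usually called a dirty flag. A stripped-down sketch is shown below, with a plain position standing in for the full matrix; `ActorTransform`, `Vec3` and every member name are hypothetical, and the rebuild counter only exists to make the effect visible.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

class ActorTransform {
public:
    // Any change marks the cached world transform as stale.
    void SetLocalPosition(const Vec3& p)  { local_ = p;  dirty_ = true; }
    void SetParentPosition(const Vec3& p) { parent_ = p; dirty_ = true; }

    // Called once per frame; does the expensive work only after a change.
    void Update() {
        if (!dirty_) return;
        // Stands in for the matrix multiply done by operator*.
        world_ = { parent_.x + local_.x, parent_.y + local_.y, parent_.z + local_.z };
        ++rebuildCount_;
        dirty_ = false;
    }

    const Vec3& World() const { return world_; }
    int RebuildCount() const { return rebuildCount_; }

private:
    Vec3 local_{0.0f, 0.0f, 0.0f};
    Vec3 parent_{0.0f, 0.0f, 0.0f};
    Vec3 world_{0.0f, 0.0f, 0.0f};
    bool dirty_ = true;  // force one rebuild on the first Update
    int rebuildCount_ = 0;
};
```

In a real hierarchy, a parent that changes also has to mark its children dirty, otherwise their cached world transforms go stale.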

Edited by Alundra
