Making skeletal animation more efficient

Hello. I have at last, after over half a year of learning OpenGL and C++, created a basic skeletal animation system, which I just got working today. =D The problem is that just one skinned model takes up about 40-50% of my CPU, and games often have many characters on the screen at once. I'm creating this topic because I need some help making the code more efficient. Most of it isn't OpenGL or Cg specific, so I posted it here.

There are 3 parts to the code: calculating the animation pose, calculating the matrices that offset the vertices from the skeleton and the animation matrices, and drawing. There are 3 parts to the animation system: skeletons (a set of "local" matrices), animations (a set of "local" matrices to replace the skeleton's for each keyframe), and models.

When I say "local" I mean a local transformation, relative to that bone's origin (as opposed to the model's origin). "Up the chain" in my weird slang means multiplying the local matrices of a skeleton from the root up in order to get the matrices relative to the model's origin. I am using CML (a math library) for the matrices on the CPU.

Basically how it works is: each vertex has bone index and weight data (duh). The skeletons consist of a bunch of "local" matrices for bones, with parent and child bones. The animations are like a big set of skeletons, one for each frame, except stored as quats and vectors, not matrices, for interpolation. AnimatedPose[256] is the "active" set of "local" matrices for the current frame in the animation. VertexTransform[256] is the final set of matrices passed to the GPU to transform vertices.

When the skeleton is loaded, the matrices are multiplied "up the chain" and then inverted to find the "inverse bind pose". This is stored in the skeleton object.

When an animation is calculated, the first thing that happens is that the current time into the animation is determined. Then the two "surrounding" frames (consisting of the "local" quats and vectors) in the animation data are interpolated to match that time (using SLERP and LERP). Next, the result is converted to matrices. Now there are "local" matrices representing that time into the animation. These are stored in AnimatedPose. To "combine" animations (for example, to overlay an arm movement animation on a running animation), just call the SetAnimation function with multiple animations.

When the current animation is all set up, the matrices in AnimatedPose are multiplied "up the chain" and then multiplied with the skeleton's inverse bind pose (which was calculated when the skeleton was loaded) in order to find the VertexTransform pose (the set of matrices that the vertices are multiplied with). Then the VertexTransform pose is passed to the GPU and the model is drawn using OpenGL's VBOs.

Here is my code. (You don't need to read it all, of course, because it is fairly long, but if you want to check how I did something, for example interpolation, I posted it all here.)
struct Bone
{
cml::matrix44f_c local, invbind; // local matrix and inverse bind matrix
BYTE parent; // parent bone
BYTE childcount; // num of children
BYTE child[4]; // child bone IDs (each bone is ID'd 0-255)
Bone();
};

class Skeleton
{
private:
BYTE BoneCount;
Bone * Bn; // the bone data
void CalcInvBindIter( BYTE bone ); // part of the function to calculate inverse bind pose for all bones in the skeleton

public:
void CalcInvBind();// function to calculate inverse bind pose for all bones in the skeleton
Skeleton();
~Skeleton();

void SetData( BYTE p_BoneCount, Bone * p_Bn );

UINT GetBoneCount();
friend void SetSkeleton( int SkelID );
friend void SetSkeletonIter( BYTE bone, Skeleton * skel );
};

struct BoneFrame // structure holding the matrices for each frame in the animation for a bone
{
BYTE boneid; // the bone's id
cml::quaternionf_p * Rotations; // the transformations aren't actually matrices,
cml::vector3f * Translations, * Scalings; // they are actually quats and vectors
BoneFrame();
~BoneFrame();
};

class Animation
{
private:
UINT FrameCount, AnimLen, * FrameTime; // FrameCount = number of frames, AnimLen = animation length in MS, *FrameTime = an array of UINTS, one element for each frame, telling the time in MS at that frame
BYTE BoneCount; // # of bones in that animation (for example, the animation may only include 3 bones in the arm, so this number is 3)
BoneFrame * frm; // the frame data
public:
Animation();
~Animation();

UINT GetAnimLength();
void SetData( UINT p_FrameCount, UINT p_AnimLen, UINT * p_FrameTime, BYTE p_BoneCount, BoneFrame * p_frm );
friend void SetAnimation( int AnimID, UINT AnimTime );
};

class DynamicModel // animated model (don't ask me why I called it "dynamic", I have no idea)
{
private:
UINT VertCount, NormCount, TexCoordCount, TriCount;
UINT mod_ID; // ID for the VBO
DVertex * Vert; // vertex data
Normal * Norm; // normal "
TextureCoord * TexCoord; // texcoord "
Triangle * Tri; // triangle "
float xmin, ymin, zmin, xmax, ymax, zmax, radius; // bounding box stuff
bool generated;
friend class Sorter;

public:
DynamicModel();
~DynamicModel();

void SetData( int p_VertCount, int p_NormCount, int p_TexCoordCount, int p_TriCount,
UINT p_mod_ID, DVertex * p_Vert, Normal * p_Norm, TextureCoord * p_TexCoord, Triangle * p_Tri,
float p_xmin, float p_ymin, float p_zmin, float p_xmax, float p_ymax, float p_zmax, float p_radius );

UINT GetVertexCount(); // stuff...
UINT GetNormalCount();
UINT GetTextureCoordCount();
UINT GetTriangleCount();
float GetXMin();
float GetYMin();
float GetZMin();
float GetXMax();
float GetYMax();
float GetZMax();
UINT GetModID();
};

class State_Manager
{
...
public:
cml::matrix44f_c AnimatedPose[256]; // the animated "local" bone pose
cml::matrix44f_c VertexTransform[256]; // the final vertex transform matrices
...
};
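
As a rough illustration of how the "up the chain" multiply and the inverse bind pose fit together, here is a minimal sketch that does not use CML. Everything in it (`Mat4`, `Joint`, `translation`, `multiply`, `buildSkinMatrices`) is a hypothetical stand-in, not the engine's code, and it assumes the bones are stored so that every parent comes before its children, which the thread does not state. Under that assumption a single loop can replace the recursive walk.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical row-major 4x4 matrix used with column vectors (translation in
// elements 3, 7 and 11). This is a stand-in for cml::matrix44f_c, not CML.
using Mat4 = std::array<float, 16>;

// Builds a pure translation matrix.
Mat4 translation(float x, float y, float z) {
    return {1.0f, 0.0f, 0.0f, x,
            0.0f, 1.0f, 0.0f, y,
            0.0f, 0.0f, 1.0f, z,
            0.0f, 0.0f, 0.0f, 1.0f};
}

// Plain 4x4 matrix product a * b.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};  // value-initialised to all zeros
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                r[row * 4 + col] += a[row * 4 + k] * b[k * 4 + col];
    return r;
}

struct Joint {
    int parent;    // index of the parent joint, -1 for the root (assumed < own index)
    Mat4 invBind;  // inverse of the joint's model-space bind matrix
};

// local[i] is the animated local matrix of joint i. The result is the set of
// matrices the vertices are multiplied with (VertexTransform in the post).
std::vector<Mat4> buildSkinMatrices(const std::vector<Joint>& joints,
                                    const std::vector<Mat4>& local) {
    std::vector<Mat4> model(joints.size());
    std::vector<Mat4> skin(joints.size());
    for (std::size_t i = 0; i < joints.size(); ++i) {
        // "Up the chain": parent's model-space matrix times this joint's local matrix.
        model[i] = joints[i].parent < 0
                       ? local[i]
                       : multiply(model[joints[i].parent], local[i]);
        // Cancel out the bind pose so an unanimated skeleton leaves vertices in place.
        skin[i] = multiply(model[i], joints[i].invBind);
    }
    return skin;
}
```

If the animated local matrices equal the bind-pose local matrices, every result is the identity, which is a handy sanity check when debugging a skeleton loader.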


Here is my code to calculate the matrices for the current frame. The SetAnimation function is called before the SetSkeleton function:
void SetSkeleton( int SkelID )
{
Skeleton * skel = engine::skeleton_manager->GetElement( SkelID );
if (skel == NULL)
return;

SetSkeletonIter( 0, skel ); // multiply "up the chain"

for (int i = 0; i < skel->BoneCount; i++) // multiply by inverse bind to get the "canceled out" translation
engine::state_manager->VertexTransform[i] *= skel->Bn[i].invbind;

for (int i = 0; i < engine::state_manager->active_bones && i < MAX_BONES; i++) // send the bones to the shader
{
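cml::matrix44f_c transposed = cml::transpose( engine::state_manager->VertexTransform[i] ); // named copy of the transpose (because Cg can apparently only do row major matrices or something)
const float * mat = transposed.data(); // point into the named copy rather than into a temporary that is destroyed at the end of the statement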
for (int p = 0; p < 3; p++) // send over each row of the matrix - only need to send a 3x4 matrix, not 4x4
{
CGparameter param = cgGetArrayParameter( engine::state_manager->SP.bones, i*3+p );
cgGLSetParameter4f( param, mat[p*4], mat[p*4+1], mat[p*4+2], mat[p*4+3] );
}
}

engine::state_manager->active_bones = skel->GetBoneCount();
}

#define _sm_vtx_tf	engine::state_manager->VertexTransform // I wouldn't normally do this, but honestly...
#define _sm_anim	engine::state_manager->AnimatedPose

void SetSkeletonIter( BYTE bone, Skeleton * skel )
{
if (bone == 0) // calculate the "final transform" up the chain of animated local bones
_sm_vtx_tf[bone] = _sm_anim[bone];
else
_sm_vtx_tf[bone] = _sm_vtx_tf[skel->Bn[bone].parent] * _sm_anim[bone];
for (BYTE b = 0; b < skel->Bn[bone].childcount; b++)
SetSkeletonIter( skel->Bn[bone].child[b], skel );
}

#undef _sm_vtx_tf
#undef _sm_anim

void SetAnimation( int AnimID, UINT AnimTime )
{
Animation * anim = engine::animation_manager->GetElement( AnimID );
if (anim == NULL)
return;

UINT i; // find the surrounding frames
for (i = 0; i < anim->FrameCount; i++)
{
if (AnimTime < anim->FrameTime[i])
{
i--;
break;
}
if (AnimTime == anim->FrameTime[i])
break;
}

float percent = (float)(AnimTime - anim->FrameTime[i]) / (float)(anim->FrameTime[i+1] - anim->FrameTime[i]); // get the % in between the frames

cml::quaternionf_p rot; // these are for calculating the interpolation
cml::vector3f scale, trans;
cml::matrix44f_c * mat;

for (BYTE b = 0; b < anim->BoneCount; b++) // for each bone in the animation, apply interpolation
{
mat = &engine::state_manager->AnimatedPose[anim->frm[b].boneid]; // pointer to the matrix to edit

rot = cml::slerp( anim->frm[b].Rotations[i], anim->frm[b].Rotations[i+1], percent ); // slerp the quats
scale = cml::lerp( anim->frm[b].Scalings[i], anim->frm[b].Scalings[i+1], percent ); // lerp the scale
trans = cml::lerp( anim->frm[b].Translations[i], anim->frm[b].Translations[i+1], percent ); // lerp the translation

cml::identity( *mat ); // set the matrix to identity
cml::matrix_scale( *mat, scale ); // then add in the transformations
cml::matrix_rotation_quaternion( *mat, rot );
cml::matrix_set_translation( *mat, trans );
}
}
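
The frame search at the top of SetAnimation appears to run off either end of FrameTime when AnimTime falls before the first keyframe or after the last one. The sketch below shows one way to find the two surrounding frames and the interpolation fraction with clamping at both ends. `FrameSpan` and `findFrameSpan` are hypothetical names, the clamping behaviour is an assumption rather than what the engine does, and the keyframe times are assumed to be sorted in ascending order and non-empty.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type: the two keyframes to blend and the blend factor.
struct FrameSpan {
    std::size_t first;   // keyframe at or before the requested time
    std::size_t second;  // keyframe after it (equal to first when clamped)
    float t;             // 0 at first, approaching 1 at second
};

// frameTimes: keyframe times in milliseconds, ascending, at least one entry.
FrameSpan findFrameSpan(const std::vector<unsigned>& frameTimes, unsigned animTime) {
    const std::size_t last = frameTimes.size() - 1;

    // Clamp times outside the animation to the nearest end (an assumption).
    if (animTime <= frameTimes.front()) return {0, 0, 0.0f};
    if (animTime >= frameTimes.back()) return {last, last, 0.0f};

    // Advance while the next keyframe is still at or before the requested time.
    std::size_t i = 0;
    while (frameTimes[i + 1] <= animTime) ++i;

    // Here frameTimes[i] <= animTime < frameTimes[i + 1], so the span is non-zero.
    const float span = static_cast<float>(frameTimes[i + 1] - frameTimes[i]);
    const float t = static_cast<float>(animTime - frameTimes[i]) / span;
    return {i, i + 1, t};
}
```

With keyframes at 0, 100 and 200 ms, a request for 150 ms yields frames 1 and 2 with a fraction of 0.5. For long animations a binary search over the same array would be a natural next step.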


Finally my drawing code (not really needed but I'll post it anyway):
void DrawDynamicModel( int ID, int TexID, float x, float y, float z, float scale, float xvec, float yvec, float zvec, float angle, int color )
{
...
glPushMatrix();
glTranslatef( x, y, z );
glRotatef( angle, xvec, yvec, zvec );
glScalef( scale, scale, scale );

glColor4f( (float)GetRed( color )/255.0f, (float)GetGreen( color )/255.0f, (float)GetBlue( color )/255.0f, (float)GetAlpha( color )/255.0f );

engine::state_manager->ActivateShader(); // this is the correct order... remember to set matrix after transformations!
cgGLSetStateMatrixParameter( engine::state_manager->SP.modelViewProj, CG_GL_MODELVIEW_PROJECTION_MATRIX, CG_GL_MATRIX_IDENTITY );
...
glEnableClientState( GL_VERTEX_ARRAY ); // position : POSITION
glEnableClientState( GL_NORMAL_ARRAY ); // normals : NORMAL
glEnableClientState( GL_TEXTURE_COORD_ARRAY ); // texcoord : TEXCOORD0
cgGLEnableClientState( engine::state_manager->SP.indices ); // indices
cgGLEnableClientState( engine::state_manager->SP.weights ); // weights

glBindBufferARB( GL_ARRAY_BUFFER, mod->GetModID() ); // this must go BEFORE setting the pointers
glVertexPointer( 3, GL_FLOAT, sizeof_VBODVertex, BUFFER_OFFSET( 0 ) );
glNormalPointer( GL_FLOAT, sizeof_VBODVertex, BUFFER_OFFSET( 12 ) );
glTexCoordPointer( 2, GL_FLOAT, sizeof_VBODVertex, BUFFER_OFFSET( 24 ) );
cgGLSetParameterPointer( engine::state_manager->SP.indices, 4, GL_FLOAT, sizeof_VBODVertex, BUFFER_OFFSET( 32 ) );
cgGLSetParameterPointer( engine::state_manager->SP.weights, 4, GL_FLOAT, sizeof_VBODVertex, BUFFER_OFFSET( 48 ) );

glDrawArrays( GL_TRIANGLES, 0, mod->GetTriangleCount()*3 );

glDisableClientState( GL_VERTEX_ARRAY );
glDisableClientState( GL_NORMAL_ARRAY );
glDisableClientState( GL_TEXTURE_COORD_ARRAY );
cgGLDisableClientState( engine::state_manager->SP.indices );
cgGLDisableClientState( engine::state_manager->SP.weights );

glPopMatrix();
}

//and here is the cg shader

struct input // vertex model
{
float3 position	: POSITION;
float3 normal	: NORMAL;
float4 color	: COLOR0;
float2 texcoord	: TEXCOORD0;
float4 indices;
float4 weights;
};

struct output
{
float4 position	: POSITION;
float2 texcoord	: TEXCOORD0;
float4 color	: COLOR0;
};

output main( input IN,							// vertex
uniform float4x4 modelViewProj,	// modelview matrix
uniform float4 bones[90] )			// bones (30*3)
{
output OUT;

float3 pos = float3( 0.0f, 0.0f, 0.0f );
//	float3 norm = float3( 0.0f, 0.0f, 0.0f );

for (int i = 0; i < 4; i++)
{
if (IN.weights[i] > 0.0f)
{
float3x4 matrix = float3x4( bones[IN.indices[i]*3],
bones[IN.indices[i]*3+1],
bones[IN.indices[i]*3+2] );
pos += IN.weights[i] * mul( matrix, float4( IN.position, 1.0f ) );
//			norm += IN.weights[i] * mul( matrix, float4( IN.normal, 0.0f ) ); // for now, no normals calculations
}
}

OUT.position = mul( modelViewProj, float4( pos, 1.0f ) );
OUT.color = IN.color;
//	OUT.normal = norm.normalize();
OUT.texcoord = IN.texcoord;

return OUT;
}


I know that's a lot of code, you don't need to look through it all, I just thought I'd post everything I had so that it is all there. Anyway, I need techniques on optimizing the performance so I can get many characters on the screen animating at once. Also, I have to leave right now but when I return I will edit to make it more readable and explain what the parts of the code do better. [Edited by - Gumgo on May 26, 2008 4:39:19 PM]

Share on other sites
"40-50% CPU usage" isn't a very meaningful metric, particularly if you're rendering with vsync disabled. How many more milliseconds does it take to draw a frame with skeletal animation than without?

Share on other sites
Firstly, congrats on a working skeletal animation system.

Can you tell us a bit more about the test you've performed?

40-50% of the CPU doesn't mean much without a bit of context (at least to me).
If the only thing you're doing is animating and skinning one model and you're not vsynced, then yes, it will take up most of your frame, because it's all you're doing.

Better might be to time the execution of the operations. Then work that out as a percentage of your dream frame rate.

Also test variations of data.
Number of bones, number of weights per vert.
Chart your performance so you know how much you will be able to handle.

My $0.02

Share on other sites
Thanks for the responses!

I'll try to find a high precision timer to do tests with. Also, for now I will leave the actual drawing part and just time the matrix calculations (the drawing is slow also but I'll focus on that next). I'll add the results tomorrow or something.
Edit:
Okay here is a bit of information. For a 22 bone model (times are in microseconds):
SetAnimation | SetSkeleton
809          | 1103
695          | 1120
694          | 1087
693          | 1108
701          | 1281
690          | 1119
691          | 1072
704          | 1083
909          | 1073
719          | 1070

So it looks like around 700 for SetAnimation and 1100 for SetSkeleton. So nearly 2ms per model, and that is not including drawing them. Definitely need to improve it.

[Edited by - Gumgo on May 26, 2008 1:35:42 AM]

Share on other sites
Here is the process for each bone:
SetAnimation
| 1 quaternion slerp (rotation)
| 2 vector lerps (scaling, translation)
| 1 matrix "scaling" operation (vector)
| 1 matrix "rotating" operation (quat)
| 1 matrix "translation" operation (vector)
SetSkeleton
| 2 matrix multiplications ("up the chain" and inverse bind)
| 1 transpose (to pass to shader)

So total I have: 1 slerp, 2 lerp, ~3 matrix multiplications, 1 transpose, per bone per frame.

One thing I can think of that may help a bit is possibly reducing the scaling, translating, and rotating operations into one matrix multiplication operation. How would I do this?

Other optimizations?

Share on other sites
Even with this extra info, it's hard to judge whether your library even needs optimization at all. The 40-50% CPU time would mean it monopolizes one core on a dual core machine. This may sound like a problem, but if you're running your app without VSyncing it may be pushing out hundreds of frames per sec, so it's only natural it'll top out at 50% CPU usage.

As Armand said:

Quote:
 Better might be to time the execution of the operations. Then work that out as a percentage of your dream frame rate.

So when running at 50 FPS, your frametime would be 20 ms and the setup time of 2 ms per model would indeed seem a bit expensive. However, I think these numbers may be a bit inaccurate since you're only measuring them for one model, which may include various overheads (like shader switching) that don't apply when rendering further models. Have you tried simply rendering multiple models and seeing how that works out?

Quote:
 One thing I can think of that may help a bit is possibly reducing the scaling, translating, and rotating operations into one matrix multiplication operation. How would I do this?

I think the easiest way to do this would be to convert all transforms to matrices (assuming you have your rotation in quaternions) and multiply the matrices together into one bone matrix. However, I also think multiplying the matrices to obtain this combined matrix would actually be more expensive than simply applying the transforms to a vector consecutively (please correct me if I'm wrong though).

Share on other sites
Quote:
 Original post by Gumgo
 So total I have: 1 slerp, 2 lerp, ~3 matrix multiplications, 1 transpose, per bone per frame.

There's no way that amount of math takes 2 milliseconds. Are you profiling in release mode? Your profiling code should basically do the skeleton setup hundreds of thousands of times between the start and stop timings.
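
A minimal sketch of that kind of timing loop, using the standard `std::chrono::steady_clock`, is below. `averageMicroseconds` is a hypothetical helper, not part of the engine. The callable passed in would wrap whatever is being measured, for example the SetAnimation and SetSkeleton calls.

```cpp
#include <chrono>
#include <cstddef>

// Runs work() the given number of times and returns the mean cost of one call
// in microseconds. Timing many iterations between a single start/stop pair
// averages out timer resolution and one-off overheads.
template <typename Work>
double averageMicroseconds(Work&& work, std::size_t iterations) {
    using clock = std::chrono::steady_clock;  // monotonic, suited to measuring intervals

    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        work();
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Something like `averageMicroseconds([&] { SetAnimation( animId, time ); SetSkeleton( skelId ); }, 100000)` would be a typical call, where `animId`, `time` and `skelId` are placeholders. In an optimized build, make sure the work has an observable effect (such as writing to the pose arrays), or the compiler may remove it.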

Share on other sites
Without even rendering, just calculating the matrices for 10 models without changing any shader settings or anything, I'm getting numbers like this:
17185 17406 17117 17634 17441 17766 21869 17385 20700 17021

17-20ms just for 10 model calculations.
Quote:
 There's no way that amount of math takes 2 milliseconds. Are you profiling in release mode? Your profiling code should basically do the skeleton setup hundreds of thousands of times between the start and stop timings.

The 2ms applies to a 22-bone model, but still, 2ms is not at all what I want. And I'm not in release mode... didn't think of that though, I'll try that.

EDIT:
[wow]
Release mode times for 10 22-bone models:
570 559 557 558 552 533 524 536 532 529

I had no idea release mode made such a huuuuge difference! What makes it run so much faster? (30x!?!?)
Anyway, looks like that problem is taken care of, unless there's anything else that can be done. Looks like it is time to address the rendering part of the issue now...

EDIT: Okay, the drawing is going slow because of the vertex shader. With the matrix multiplication code in the shader, it is taking about 5.5ms to draw the character. It has 244 triangles (732 verts) and most joints only have 2 bones at most, usually 1 though. What happens in the shader (Cg) is for each bone affecting the vertex, first, if its weight is 0, the loop continues to the next bone. Then if its weight isn't 0, a float3x4 matrix is created by combining 3 float4 vectors. The matrix is multiplied with the weight and with the vertex's position and added to the "final position" float3 for the vertex.
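
For reference, the per-vertex work the shader does can be written as a small CPU-side function, which is handy for checking the shader's output on a few vertices. This is a sketch, not the engine's code: `Vec3`, `Mat3x4`, `transformPoint` and `skinPosition` are hypothetical names, and the bone indices are integers here whereas the shader receives them as floats.

```cpp
#include <array>
#include <vector>

struct Vec3 { float x, y, z; };

// Three rows of four floats, matching the three float4 rows sent per bone.
using Mat3x4 = std::array<float, 12>;

// Transforms a point (implicit w = 1) by a 3x4 matrix.
Vec3 transformPoint(const Mat3x4& m, const Vec3& p) {
    return { m[0] * p.x + m[1] * p.y + m[2]  * p.z + m[3],
             m[4] * p.x + m[5] * p.y + m[6]  * p.z + m[7],
             m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11] };
}

// Linear blend skinning of one position: the weighted sum of the position
// transformed by each influencing bone. Zero-weight influences are skipped,
// as in the shader.
Vec3 skinPosition(const Vec3& position,
                  const std::array<int, 4>& indices,
                  const std::array<float, 4>& weights,
                  const std::vector<Mat3x4>& bones) {
    Vec3 result{0.0f, 0.0f, 0.0f};
    for (int i = 0; i < 4; ++i) {
        if (weights[i] <= 0.0f) continue;
        const Vec3 moved = transformPoint(bones[indices[i]], position);
        result.x += weights[i] * moved.x;
        result.y += weights[i] * moved.y;
        result.z += weights[i] * moved.z;
    }
    return result;
}
```

The weights are assumed to sum to 1 for each vertex. If they do not, the skinned position is scaled toward or away from the origin.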

[Edited by - Gumgo on May 26, 2008 7:48:06 PM]

Share on other sites
Quote:
 Original post by Gumgo
 And I'm not in release mode... didn't think of that though, I'll try that.

Always profile in release mode with full optimizations. Debug timings are pretty much useless.

Share on other sites
Quote:
 Original post by Gumgo
 I had no idea release mode made such a huuuuge difference! What makes it run so much faster? (30x!?!?)

The fact that your debugger isn't running?
