Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

I recently started optimising my engine. Using the VS2012 performance analysis tool, I quickly found the biggest bottlenecks.

By far the biggest one is my animation code. I already managed to go from 100 FPS with 1 simple animated model to 150 FPS with 50. While that is not bad, I still need it to get better.

I fixed everything I could, but I don't know how to continue. The major performance-eaters are

D3DXQuaternionInverse

and

D3DXQuaternionMultiply

which are an essential part of the computation that bone-based animations need, right?

Is there a way to work around those, or are there cheaper ways to compute bone-based animations?

Those two take around 50%; the other 50% is being lost to stuff like

(All CPU percentages are percent of the total time in an uncapped application.)

With 5% of the complete CPU time each. I don't understand why those are so very expensive.

Stuff like sqrt, sin and maybe even / are considered expensive, but why are the basic math operations also a problem?

I am adding the full code of the function in question below, with the CPU % appended to every line that takes more than 0.1%.

I tried to make it as readable as possible, but the Copy-paste into this forum still screws it up, I'm sorry.

Spoiler

uint numVCurrentSubset;
Vertex *tempVert;
Weight* tempWeight;
joint* tempJoint;
D3DXQUATERNION tempJointOrientation, tempWeightPos, tempJointOrientationConjugate, tempWeightNormal;
D3DXVECTOR3 rotatedPoint;
D3DXQUATERNION temp1, temp2;
D3DXVECTOR3 rotatedPoint2;
D3DXQUATERNION temp3, temp4;
ModelSubset *currentSubset;
for ( int k = 0; k < numSubsets; k++){
    currentSubset = subsets[k];
    numVCurrentSubset = currentSubset->vertices.size();
    for ( uint i = 0; i < numVCurrentSubset; ++i ){
        tempVert = currentSubset->vertices[i]; // ---- 0.3%
        tempVert->pos.x = 0.0f; tempVert->pos.y = 0.0f; tempVert->pos.z = 0.0f;
        tempVert->normal.x = 0.0f; tempVert->normal.y = 0.0f; tempVert->normal.z = 0.0f;
        // Sum up the joints and weights information to get vertex's position and normal
        for ( int j = 0; j < tempVert->WeightCount; ++j ){ // ---- 0.6% (only this line, not whole loop)
            tempWeight = currentSubset->weights[tempVert->StartWeight + j]; // ---- 0.5%
            tempJoint = interpolatedSkeleton[tempWeight->jointID]; // ---- 0.5%
            // Convert joint orientation and weight pos to vectors for easier computation
            tempJointOrientation.x = tempJoint->orientation.x;
            tempJointOrientation.y = tempJoint->orientation.y;
            tempJointOrientation.z = tempJoint->orientation.z;
            tempJointOrientation.w = tempJoint->orientation.w;
            tempWeightPos.x = tempWeight->pos.x;
            tempWeightPos.y = tempWeight->pos.y;
            tempWeightPos.z = tempWeight->pos.z;
            tempWeightPos.w = 0.0f;
            // We will need to use the conjugate of the joint orientation quaternion
            D3DXQuaternionInverse(&tempJointOrientationConjugate, &tempJointOrientation); // ---- 20.0%
            // Calculate vertex position (in joint space, eg. rotate the point around (0,0,0)) for this weight using the joint orientation quaternion and its conjugate
            // We can rotate a point using a quaternion with the equation "rotatedPoint = quaternion * point * quaternionConjugate"
            D3DXQuaternionMultiply(&temp1, &tempJointOrientation, &tempWeightPos); // ---- 3.5%
            D3DXQuaternionMultiply(&temp2, &temp1, &tempJointOrientationConjugate); // ---- 3.5%
            rotatedPoint.x = temp2.x; rotatedPoint.y = temp2.y; rotatedPoint.z = temp2.z;
            // Now move the vertex position from joint space (0,0,0) to the joint's position in world space, taking the weight's bias into account
            tempVert->pos.x += ( tempJoint->pos.x + rotatedPoint.x ) * tempWeight->bias; // ---- 5.1 % (???)
            tempVert->pos.y += ( tempJoint->pos.y + rotatedPoint.y ) * tempWeight->bias;
            tempVert->pos.z += ( tempJoint->pos.z + rotatedPoint.z ) * tempWeight->bias;
            // Compute the normals for this frame's skeleton using the weight normals from before
            // We can compute the normals the same way we compute the vertex position, only we don't have to translate them (just rotate)
            tempWeightNormal.x = tempWeight->normal.x;
            tempWeightNormal.y = tempWeight->normal.y;
            tempWeightNormal.z = tempWeight->normal.z;
            tempWeightNormal.w = 0.0f;
            // Rotate the normal
            D3DXQuaternionMultiply(&temp3, &tempJointOrientation, &tempWeightNormal); // ---- 4.6 %
            D3DXQuaternionMultiply(&temp4, &temp3, &tempJointOrientationConjugate); // ---- 6.2 %
            rotatedPoint2.x = temp4.x; rotatedPoint2.y = temp4.y; rotatedPoint2.z = temp4.z;
            // Add to the vertex normal and take the weight bias into account
            tempVert->normal.x -= rotatedPoint2.x * tempWeight->bias; // ---- 4.9 %
            tempVert->normal.y -= rotatedPoint2.y * tempWeight->bias;
            tempVert->normal.z -= rotatedPoint2.z * tempWeight->bias;
        }
        currentSubset->vertices[i]->pos = tempVert->pos; // -- 0.3%
        currentSubset->vertices[i]->normal = -tempVert->normal; // --- 6.1 %
        //D3DXVec3Normalize(&currentSubset->vertices[i]->normal, &-tempVert->normal); // can be done on GPU, otherwise ---- 15.0%
    }
    // Put the positions into the vertices for this subset
    for(uint i = 0; i < numVCurrentSubset; i++){ // ---- altogether 5%, but I can work around this one
        currentSubset->verts[i].pos = currentSubset->vertices[i]->pos;
        currentSubset->verts[i].normal = currentSubset->vertices[i]->normal;
        currentSubset->verts[i].texcoord = currentSubset->vertices[i]->texCoord;
    }
    // Copy to the GPU
    D3D11_MAPPED_SUBRESOURCE mappedVertBuff;
    d3dev->devCon->Map(currentSubset->vertBuff, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedVertBuff);
    VertexPosNormalTex *updatedV; updatedV = (VertexPosNormalTex *)mappedVertBuff.pData;
    memcpy(updatedV, currentSubset->verts, numVCurrentSubset*sizeof(VertexPosNormalTex));
    d3dev->devCon->Unmap(currentSubset->vertBuff, 0);
}

I am new to profiling and optimising, so any tips are welcome!

Stuff like sqrt, sin and maybe even / are considered expensive, but why are the basic math operations also a problem?

Probably because you're doing TONS of those basic math operations.

Why are you doing all the per vertex skinning on the CPU? This is something that can easily be done entirely on the GPU (which will do many verts in parallel, taking advantage of the easy parallelism of the problem).

I suppose that would be a good idea. I never did that because I somehow assumed that pushing "all that stuff" to the GPU would create as much traffic as the CPU time it saves.

So, how do I go about it? I update only the bones on the CPU, then push those to the GPU (per model, in a constant buffer) and just add weight information to each vertex.

Can I do the computation in the vertex shader, or should I use a compute shader?

I suppose that would be a good idea. I never did that because I somehow assumed that pushing "all that stuff" to the GPU would create as much traffic as the CPU time it saves.

All you have to do is send the bone transformations to the GPU, which is going to be significantly less data than having to send an entire mesh worth of skinned verts each frame (assuming your meshes have more than a small handful of verts).
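A back-of-the-envelope comparison of the two upload sizes, as a sketch only. The vertex layout is an assumption (8 floats for position, normal and texcoord; the real `VertexPosNormalTex` may differ), and the 5,000-vertex / 40-bone figures are made up for illustration.

```cpp
#include <cstddef>

// Assumed layout for a position/normal/texcoord vertex: 8 floats, 32 bytes.
struct VertexPosNormalTex {
    float pos[3];
    float normal[3];
    float texcoord[2];
};

// Bytes uploaded per frame when the CPU skins and re-sends every vertex.
constexpr std::size_t cpuSkinnedBytes(std::size_t vertexCount) {
    return vertexCount * sizeof(VertexPosNormalTex);
}

// Bytes uploaded per frame when only one 4x4 float matrix per bone is sent.
constexpr std::size_t gpuSkinnedBytes(std::size_t boneCount) {
    return boneCount * 16 * sizeof(float);
}

// Illustrative numbers only: a 5,000-vertex mesh driven by 40 bones.
static_assert(cpuSkinnedBytes(5000) == 160000, "32 bytes per vertex");
static_assert(gpuSkinnedBytes(40) == 2560, "64 bytes per bone");
```

With these made-up numbers the bone upload is roughly 60 times smaller, and the gap grows with vertex count because the bone data does not.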

So, how do I go about it? I update only the bones on the CPU, then push those to the GPU (per model, in a constant buffer) and just add weight information to each vertex.
Can I do the computation in the vertex shader, or should I use a compute shader?

That's exactly right. Add weight/index info to each vert, then send bone information to the vertex shader via a constant buffer. It'll be FAR more natural to do this in the vertex shader, rather than a compute shader. Do your bone transformations on each vert, then do the regular model-view-projection transformation just like normal.

Briefly, a track modifies a single property of a single element of a matrix. POSITION.Y = 1 track. ROTATION.X = 1 track. ROTATION.XYZ = 3 tracks.

After a track has been updated and its interpolated value used to modify a single property, the matrix is rebuilt. See links above. This is faster and more accurate than quaternion math.

Matrices are built for each bone.
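A rough sketch of how I read the track model described above, not the actual implementation from the linked posts. Everything here is an assumption: the names `Key`, `Track` and `Bone`, linear interpolation between keys, Euler angles in radians applied in X, then Y, then Z order, and a row-major matrix for column vectors.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Key { float time; float value; };

// One track drives exactly one float property (for example position.y).
struct Track {
    float* target;          // the single property this track writes
    std::vector<Key> keys;  // sorted by time, must hold at least one key

    void update(float t) const {
        if (t <= keys.front().time) { *target = keys.front().value; return; }
        if (t >= keys.back().time)  { *target = keys.back().value;  return; }
        std::size_t i = 1;
        while (keys[i].time < t) ++i;  // linear search, good enough for a sketch
        const Key& a = keys[i - 1];
        const Key& b = keys[i];
        const float f = (t - a.time) / (b.time - a.time);
        *target = a.value + (b.value - a.value) * f;
    }
};

struct Bone {
    float pos[3] = { 0.0f, 0.0f, 0.0f };
    float rot[3] = { 0.0f, 0.0f, 0.0f };  // Euler angles in radians (assumed X, Y, Z order)
    float m[4][4] = {};

    // Rebuilds the local matrix from the current properties: Rz * Ry * Rx
    // in the upper 3x3, translation in the last column.
    void rebuild() {
        const float cx = std::cos(rot[0]), sx = std::sin(rot[0]);
        const float cy = std::cos(rot[1]), sy = std::sin(rot[1]);
        const float cz = std::cos(rot[2]), sz = std::sin(rot[2]);
        const float r[4][4] = {
            { cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx, pos[0] },
            { sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx, pos[1] },
            { -sy,     cy * sx,                cy * cx,                pos[2] },
            { 0.0f,    0.0f,                   0.0f,                   1.0f   },
        };
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                m[i][j] = r[i][j];
    }
};
```

In this sketch a bone would only need `rebuild()` on frames where one of its tracks actually changed a property.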

For each draw call, the bone matrices that affect the current mesh are uploaded to the shader and skinning is performed in the vertex shader. The max number of influences per vertex is usually 4.
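One possible way to prepare that per-draw upload, as a sketch. `buildPalette`, `Mat4` and the limit `kMaxBonesPerDraw` are hypothetical; the limit would have to match whatever array size the shader declares. The vertex bone indices are then assumed to index into this palette rather than into the full skeleton.

```cpp
#include <cstddef>
#include <vector>

struct Mat4 { float m[16]; };

// Hypothetical limit; it has to match the array size declared in the shader.
constexpr std::size_t kMaxBonesPerDraw = 64;

// Gathers the matrices referenced by one mesh into a contiguous palette.
// Returns false if the mesh needs more bones than the shader array can hold
// or references a bone index outside the skeleton.
bool buildPalette(const std::vector<Mat4>& skeleton,
                  const std::vector<std::size_t>& meshBones,
                  std::vector<Mat4>& palette) {
    palette.clear();
    if (meshBones.size() > kMaxBonesPerDraw) return false;
    palette.reserve(meshBones.size());
    for (std::size_t bone : meshBones) {
        if (bone >= skeleton.size()) { palette.clear(); return false; }
        palette.push_back(skeleton[bone]);
    }
    return true;
}
```

The resulting array can be copied into a dynamic constant buffer with the same Map / memcpy / Unmap pattern the original code uses for its vertex buffer.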

Between dropping the use of quaternions entirely and moving skinning to the shaders, you basically have to rewrite your animation code from scratch. You can find shader samples for this in the DirectX samples that ship with the SDK.

Thanks a bunch, Spiro, that is extremely helpful! I'll rewrite everything using the model you mention in your other posts.

There is a good number of tutorials that use quaternions or suggest them.

This is not the first time something like this has happened. I have found a lot of tutorials that use the worst method possible. Really gotta be careful about who I listen to :/
