GPU Skinning and frame interpolation

Started by theagentd. 14 comments, last by theagentd 12 years, 3 months ago.
Hello. I've been trying to do skinning, and so far I think I've kind of figured out the math behind it. I've successfully loaded an MD5 mesh and I'm working on implementing CPU skinning for a small animation following this guide: http://3dgep.com/?p=1053. I think I can get software skinning working quite easily, but I definitely want to do the skinning in a vertex shader. I found another tutorial on the same website that covers this: http://3dgep.com/?p=1356. The "problem" is that it seems to do the bone interpolation on the CPU and then submit the data to the GPU each frame. I know that just transforming 100 bones shouldn't be that CPU heavy, but I might potentially have a thousand instances of the same mesh, each in a different animation. Therefore I also want to use instancing to reduce the CPU load. Since my game is an RTS, the GPU is severely underused even though I've offloaded fog of war rendering, so trading GPU cycles for CPU cycles is a good thing in my case.

I could just multi-thread the bone interpolation for an almost linear increase in performance, but I still thought it should be possible to offload almost everything onto the GPU. That's when I stumbled upon this white paper: http://developer.dow...gWhitePaper.pdf. It seems to do exactly what I want by storing bone matrices in a texture, which was something I had thought of doing. However, the implementation in the white paper does not seem to do any kind of bone interpolation; it simply rounds to the closest frame (though this isn't written anywhere). I'm 99.9% sure they don't re-upload the bone matrices each frame, since they seem to keep the bone data per animation and frame, not per individual instance. Losing interpolation seems like a huge step backwards, so I would definitely not implement GPU skinning if that turned out to be the cost.

I figured I could just upload even "rawer" bone data to my animation texture, meaning I'd keep a 3D vector and a quaternion per bone instead of a matrix, and then do the interpolation between the two surrounding frames in the vertex shader. The amount of data sampled would only increase by about 33%:

1 matrix per weight = RGBA 32-bit float x 3
2 vectors and 2 quaternions = RGBA 32-bit float x 4

I would also have to upload additional static data for the weights (a position for each weight). The problem is the additional logic needed to transform each vertex, since the interpolation would have to be redone for each vertex. I think this additional cost will be almost unnoticeable though, since the vertex shader should be bandwidth/texture limited anyway. If there happen to be built-in functions to do slerp (I've found mix(...), but that's a plain lerp, so I'm not sure it's the right one), I think the additional logic cost would be negligible.

In short, I'd port this exact function to a GLSL vertex shader:

for (int i = 0; i < m.vertices.size(); i++) {
    Vertex v = m.vertices.get(i);
    float x = 0, y = 0, z = 0;
    for (int k = 0; k < v.weightCount; k++) {
        Weight w = m.weights.get(v.startWeight + k); // v.startWeight = index into the list of weights
        Joint j = bindPoseJoints.get(w.joint); // Joint contains a position and an orientation. I'd be using the animated joints, not the bind-pose joints, of course.
        rot(j.orientation, w.position, temp); // Quaternion rotation of the weight position; temp is a temporary Vec3.
        Vector3f.add(temp, j.position, temp); // Add the joint position.
        temp.scale(w.bias); // Weight the result by the bias.
        x += temp.x;
        y += temp.y;
        z += temp.z;
    }
    vertexData[mesh].putFloat(x).putFloat(y).putFloat(z); // Write the result into a buffer to send to OpenGL.
}
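
As a very rough idea of what that port might look like, here's an untested GLSL sketch. All the names (u_boneTex, a_weightPos0 and so on) are made up, and it assumes the animation texture stores, per frame (one row of texels per frame), one RGBA32F texel for each joint's position and one for its orientation quaternion:

    #version 130

    uniform sampler2D u_boneTex; // x = bone index * 2 (+0: position, +1: quaternion), y = frame
    uniform mat4 u_mvp;
    uniform float u_frame;       // fractional frame number; per-instance with instancing

    in vec4 a_boneIndex;         // up to 4 joint indices
    in vec4 a_weightBias;        // per-weight bias, 0 for unused weights
    in vec3 a_weightPos0;        // the static per-weight positions mentioned above
    in vec3 a_weightPos1;
    in vec3 a_weightPos2;
    in vec3 a_weightPos3;

    vec3 rotate(vec4 q, vec3 v) {
        // Rotate v by the unit quaternion q (xyz = vector part, w = scalar part).
        return v + 2.0 * cross(q.xyz, cross(q.xyz, v) + q.w * v);
    }

    void main() {
        vec3 weightPos[4] = vec3[4](a_weightPos0, a_weightPos1, a_weightPos2, a_weightPos3);
        int f0 = int(floor(u_frame));     // (clamping/wrapping at the last frame omitted)
        float t = u_frame - float(f0);
        vec3 skinned = vec3(0.0);
        for (int k = 0; k < 4; k++) {
            int b = int(a_boneIndex[k]) * 2;
            vec3 p0 = texelFetch(u_boneTex, ivec2(b,     f0    ), 0).xyz;
            vec4 q0 = texelFetch(u_boneTex, ivec2(b + 1, f0    ), 0);
            vec3 p1 = texelFetch(u_boneTex, ivec2(b,     f0 + 1), 0).xyz;
            vec4 q1 = texelFetch(u_boneTex, ivec2(b + 1, f0 + 1), 0);
            if (dot(q0, q1) < 0.0) q1 = -q1;    // flip to take the shorter arc
            vec4 q = normalize(mix(q0, q1, t)); // lerp + normalize, not a true slerp
            vec3 p = mix(p0, p1, t);
            skinned += (rotate(q, weightPos[k]) + p) * a_weightBias[k];
        }
        gl_Position = u_mvp * vec4(skinned, 1.0);
    }

Between two adjacent frames, lerp+normalize should stay very close to a true slerp, so I wouldn't expect the missing built-in to matter.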


I'm pretty much a skinning n00b, but these are my thoughts on it. The main problem is the amount of (static) data needed per vertex (4 x vec3 per vertex for the weight position), but if the cost is acceptable I strongly suspect that this will have better performance than doing the interpolation on the CPU, at least in my case.
I think it's better to bind a texture containing already-interpolated matrices. Interpolate on the CPU using an appropriate structure (most likely 3D vectors for translation and scaling, and a quaternion for rotation).

With each vertex, pass your weights as attributes (you probably want to cap them at a reasonable number; I think most people use a maximum of 4 weights per vertex), and use the bone indices as look-up values into the texture (see the sketch below).

I have never done GPU skinning myself, but I plan to try it pretty soon, and I think that's how I'm going to do it. It sounds reasonable for both the CPU and the GPU, each doing the task it's suited for.
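
The texture lookup itself ought to be something like this (untested sketch; it assumes each bone's interpolated matrix is stored row-major in three consecutive RGBA32F texels, and u_matrixTex is a made-up name):

    uniform sampler2D u_matrixTex; // interpolated bone matrices, 3 texels per bone

    mat4 fetchBoneMatrix(int bone) {
        // The three texels are the three rows of a 3x4 bone transform (row-major);
        // the fourth row of the mat4 is the usual (0, 0, 0, 1).
        vec4 r0 = texelFetch(u_matrixTex, ivec2(bone * 3 + 0, 0), 0);
        vec4 r1 = texelFetch(u_matrixTex, ivec2(bone * 3 + 1, 0), 0);
        vec4 r2 = texelFetch(u_matrixTex, ivec2(bone * 3 + 2, 0), 0);
        return transpose(mat4(r0, r1, r2, vec4(0.0, 0.0, 0.0, 1.0)));
    }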
I realize that the GPU would have to interpolate 1 to 4 bones per vertex. With an average of 2.5 weights per vertex and around 1,000 vertices, that's about 2,500 bone interpolations per instance. A CPU implementation would only have to process each bone once, which in the case of my test model means only 33 interpolations per instance.

In the end I think I'll just implement both, to see whether and by how much the GPU version is slower...

EDIT: Oh, and the function in my first post does not contain any frame interpolation. Doh!
Things that don't change: weights and bone indices. So it's pretty obvious to keep those on the GPU. Also keep the initial vertex positions (the bind pose). Take your for loop and put it on the GPU. For each object, compute the current frame's matrices and send all those bones down for that object. If two instances of the same character model are used, one running and one walking, then you have to send the walk bone matrices for the one and the run bone matrices for the other.
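
In shader terms that's the classic uniform-array setup, roughly like this (just a sketch; the palette size and all the names are placeholders):

    #version 130

    uniform mat4 u_bones[64]; // this object's bone palette, re-uploaded each frame
    uniform mat4 u_mvp;

    in vec4 a_position;       // bind-pose position
    in vec4 a_boneIndex;
    in vec4 a_weightBias;     // a bias of 0 disables unused weights

    void main() {
        vec4 p = vec4(0.0);
        for (int k = 0; k < 4; k++) {
            p += (u_bones[int(a_boneIndex[k])] * a_position) * a_weightBias[k];
        }
        gl_Position = u_mvp * p;
    }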

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal


[quote name='theagentd']
I realize that the GPU would have to interpolate 1 to 4 bones per vertex. With an average of 2.5 weights per vertex and around 1,000 vertices, that's about 2,500 bone interpolations per instance. A CPU implementation would only have to process each bone once, which in the case of my test model means only 33 interpolations per instance.

In the end I think I'll just implement both, to see whether and by how much the GPU version is slower...

EDIT: Oh, and the function in my first post does not contain any frame interpolation. Doh!
[/quote]


Having just done that, I can tell you that by running all four weights per vertex on the GPU (33 bones, you say? You're animating Bob, aren't you? :) ), you'll be looking at around a 30-100 fold increase in speed compared to smart per-weight skinning on the CPU. The relevant timings for me in debug mode are ~1-1.5 milliseconds for skinning on the CPU and 0.025-0.1 milliseconds on the GPU (this is on a 3-year-old mobile GF 240GT). This assumes the bones are animated/interpolated on the CPU and passed to the shader via a uniform.

[quote]
Having just done that, I can tell you that by running all four weights per vertex on the GPU, you'll be looking at around a 30-100 fold increase in speed compared to smart per-weight skinning on the CPU. This assumes the bones are animated/interpolated on the CPU and passed to the shader via a uniform.
[/quote]
Yes, of course the GPU is faster than the CPU, but the real fight is between the hybrid CPU/GPU solution and a pure GPU solution. You're talking about the hybrid one, where the bone interpolation is done on the CPU and all per-vertex calculations are done on the GPU. In my first post I suggested offloading the interpolation to the GPU too, which would save a little more CPU time and bandwidth in exchange for a possibly big GPU hit. You're comparing the CPU/GPU solution to a pure CPU solution. And yes, I hope to get a few Bobs swinging their lanterns around soon. =D

[quote name='theagentd']
In my first post I suggested offloading the interpolation to the GPU too, which would save a little more CPU time and bandwidth in exchange for a possibly big GPU hit. You're comparing the CPU/GPU solution to a pure CPU solution. And yes, I hope to get a few Bobs swinging their lanterns around soon. =D
[/quote]


Don't bother. Precompute your skeleton at whatever framerate your target is (say 40-50) for all animations, and round to the closest precomputed frame when rendering (see the snippet below). You'll end up with zero computation time and the memory footprint is negligible.
It's also good to note that not much animated stuff actually gets rendered in any game other than an RTS. In that case you could worry about doing other things.
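
Picking the precomputed frame at render time is then just a rounding, something like this (sketch; u_animTex, u_time and BAKE_FPS are made-up names):

    uniform sampler2D u_animTex; // baked bone data, one row of texels per frame
    uniform float u_time;        // seconds into the animation
    const float BAKE_FPS = 48.0; // whatever rate the frames were baked at

    vec4 fetchBoneTexel(int texelX) {
        int frame = int(u_time * BAKE_FPS + 0.5); // round to the nearest baked frame
        return texelFetch(u_animTex, ivec2(texelX, frame), 0);
    }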

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal


[quote name='dpadam450']
Don't bother. Precompute your skeleton at whatever framerate your target is (say 40-50) for all animations, and round to the closest precomputed frame when rendering. You'll end up with zero computation time and the memory footprint is negligible.
[/quote]
But what if I want a super slow-motion effect that effectively drops the game speed to 1/10th? That would force me to precompute waaay too many frames for each animation. Also, I'd prefer the accuracy of real-time bone interpolation. It's a good suggestion, though. Maybe precomputing a few extra frames to triple each animation's framerate, and then using a simple lerp+normalize on the quaternion in a vertex shader instead of a slerp, would produce accurate enough motion. You could even precompute with cubic interpolation.
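
The lerp+normalize part would be tiny in GLSL, something like this (untested sketch):

    vec4 nlerp(vec4 q0, vec4 q1, float t) {
        if (dot(q0, q1) < 0.0) q1 = -q1;  // flip to take the shorter arc
        return normalize(mix(q0, q1, t)); // lerp + normalize instead of a true slerp
    }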

Each frame requires about a kilobyte of data, and each animation may be shared between different meshes if they have the same skeleton layout, so the number of animations won't be that high. Let's say 20 units with different animation sets, each with 10 different 2-second animations. That's 20 x 10 x (2 x 24) = 9,600 frames of raw data, and each frame is 33 bones x 7 floats x 4 bytes per float = 924 bytes, so 20 x 10 x 2 x 24 x 33 x 7 x 4 bytes = about 8.45MB of data. Multiplying the number of frames by 4 is still only around 34MB, so precomputing is a very possible solution. Nice idea!



[quote name='dpadam450']
It's also good to note that not much animated stuff actually gets rendered in any game other than an RTS. In that case you could worry about doing other things.
[/quote]

I don't really understand what you're saying...? ._.

[quote name='theagentd']
But what if I want a super slow-motion effect that effectively drops the game speed to 1/10th? That would force me to precompute waaay too many frames for each animation. Also, I'd prefer the accuracy of real-time bone interpolation. Maybe precomputing a few extra frames to triple each animation's framerate, and then using a simple lerp+normalize on the quaternion in a vertex shader instead of a slerp, would produce accurate enough motion.
[/quote]


Question: is the "what if" actually a realistic expectation? Will you be implementing bullet time?

If the answer is yes, then you could do some dynamic branching: if the required animation speed drops below a certain threshold (say, the precomputed framerate), interpolate the skeleton for that model on the fly. Note that bone interpolation is cheap for an average model; 30-60 bones isn't that big of a deal, and it's okay to do it when needed.

The real optimization you should be looking into here is the actual number of skeleton updates per second. If you're running hundreds of models at 100 FPS, then updating them each frame is going to affect said FPS. If you update each skeleton every other frame, you won't be compromising any visual quality and you'll have effectively halved the workload. What is really expensive here is the skinning, which is done on the GPU. Note that the Bob model has a relatively low poly count (around 800 vertices); take a proper model with 10k vertices and 50 bones and you'll be looking at roughly the same animation load on the CPU, but a roughly 15 times more expensive skinning phase on the GPU. Vertex counts ramp up as time goes on; there's really no need to ramp up the number of bones in most cases.


[quote name='theagentd']
Each frame requires about a kilobyte of data, and each animation may be shared between different meshes if they have the same skeleton layout. Let's say 20 units with different animation sets, each with 10 different 2-second animations: 20 x 10 x 2 x 24 x 33 x 7 x 4 bytes = about 8.45MB of data. Multiplying the number of frames by 4 is still only around 34MB, so precomputing is a very possible solution.
[/quote]


It's likelier that you'll want to store your animation as a list of matrices, not 7 floats per bone, for faster access. Also, think about it realistically: if your animation comes at 24 FPS, then doubling that will be enough for any human. 100 keyframes per second would give you smooth slow motion, which you very likely won't be needing.


[quote name='dpadam450']
It's also good to note that not much animated stuff actually gets rendered in any game other than an RTS. In that case you could worry about doing other things.
[/quote]

[quote name='theagentd']
I don't really understand what you're saying...? ._.
[/quote]

What dpadam450 means is that an RTS is about the only genre in which you will realistically encounter a large number of models that need to be animated individually. In the general case you'll have 10 models tops running around at one time, which is a breeze to animate on the CPU.

This topic is closed to new replies.
