GPU Skinning and frame interpolation

theagentd · 2012-01-21T05:29:41

Hello. I've been trying to do skinning, and so far I think I've kind of figured out the math behind it. I've successfully loaded an MD5 mesh and I'm working on implementing CPU skinning for a small animation following this guide: http://3dgep.com/?p=1053. I think I can get software skinning working quite easily, but I definitely want to do the skinning in a vertex shader. I found another tutorial on the same website that covers this: http://3dgep.com/?p=1356. The "problem" is that it seems to do the bone interpolation on the CPU and then submit the data each frame to the GPU. I think just transforming 100 bones shouldn't be that CPU heavy, but I might potentially have a thousand instances of the same mesh in different animations. Therefore I also want to use instancing to reduce the CPU load. Since my game is an RTS game the GPU is severely underused even though I offloaded fog of war rendering, so trading GPU cycles for CPU cycles is a good thing in my case. I could just multi-thread the bone interpolation for an almost linear increase in performance, but I still thought that it should be possible to off-load almost everything onto the GPU.
That's when I stumbled over this white paper: http://developer.dow...gWhitePaper.pdf. It seems to do exactly what I want by storing bone matrices in a texture, which was something I had thought of doing. However, the implementation in the white paper does not seem to have any kind of bone interpolation, and simply rounds the frame to the closest frame (though this isn't written anywhere). I'm 99.9% sure they don't re-upload the bone matrices each frame since they seem to keep the bone data for each animation and frame, not for each individual instance. Losing interpolation seems like a huge step backwards, so I would definitely not implement skinning if that turned out to be the cost.
I figured I could just upload even "rawer" bone data to my animation texture, meaning I'd keep a 3D vector and a quaternion per bone instead of a matrix and then do the interpolation between the two frames in the vertex shader. The amount of data sampled would only increase by about 33%:

1 matrix per weight = RGBA 32-bit float x 3
2 vectors and 2 quaternions = RGBA 32-bit float x 4

I would also have to upload additional static data to the weights (a position for each weight). The problem is the additional logic needed to transform each vertex since the interpolation would have to be redone for each vertex. I think this additional cost will be almost unnoticeable though since the vertex shader should be bandwidth / texture limited anyway. If there happen to be built-in functions to do slerp (I've found mix(...), but I'm not sure if it's the right one) I think the additional logic cost would be negligible. In short, I'd port this exact function to a GLSL vertex shader:

    for (int i = 0; i < m.vertices.size(); i++) {
        Vertex v = m.vertices.get(i);
        float x = 0, y = 0, z = 0;
        for (int k = 0; k < v.weightCount; k++) {
            Weight w = m.weights.get(v.startWeight + k); //v.startWeight = index in list of weights
            Joint j = bindPoseJoints.get(w.joint); //Joint contains position and orientation. I'd be using the animation joints, not the bind pose joints of course.
            rot(j.orientation, w.position, temp); //Quaternion rotation of weight position, temp is a temporary Vec3.
            Vector3f.add(temp, j.position, temp); //Add joint position
            temp.scale(w.bias);
            x += temp.x;
            y += temp.y;
            z += temp.z;
        }
        vertexData[mesh].putFloat(x).putFloat(y).putFloat(z); //Load data into an array to send it to OpenGL
    }

I'm pretty much a skinning n00b, but these are my thoughts on it.
The main problem is the amount of (static) data needed per vertex (4 x vec3 per vertex for the weight position), but if the cost is acceptable I strongly suspect that this will have better performance than doing the interpolation on the CPU, at least in my case.
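
For reference, the per-weight work described above (blend two keyframes of a joint, rotate the weight position, add the joint position, scale by the bias) can be sketched on the CPU in Java before porting it to GLSL. Note that GLSL's mix() is a plain linear interpolation and there is no built-in slerp, so the quaternion blend would have to be written by hand; the sketch uses a normalized lerp as a cheap stand-in for slerp, which is an assumption and not something the linked tutorials prescribe. The class and method names are hypothetical, and quaternions are assumed to be stored as {x, y, z, w}.

```java
// Hypothetical helper, not from the linked tutorials. Quaternions are {x, y, z, w}.
final class JointBlend {
    // Component-wise linear interpolation; this is what GLSL mix() computes.
    static float[] lerp(float[] a, float[] b, float t) {
        float[] r = new float[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] + (b[i] - a[i]) * t;
        return r;
    }

    // Normalized lerp of two unit quaternions along the shorter arc (approximates slerp).
    static float[] nlerp(float[] q0, float[] q1, float t) {
        float dot = q0[0] * q1[0] + q0[1] * q1[1] + q0[2] * q1[2] + q0[3] * q1[3];
        float sign = dot < 0f ? -1f : 1f; // flip q1 if the quaternions point apart
        float[] r = new float[4];
        float lenSq = 0f;
        for (int i = 0; i < 4; i++) {
            r[i] = q0[i] + (sign * q1[i] - q0[i]) * t;
            lenSq += r[i] * r[i];
        }
        float inv = (float) (1.0 / Math.sqrt(lenSq));
        for (int i = 0; i < 4; i++) r[i] *= inv;
        return r;
    }

    // Rotate v by unit quaternion q: v' = v + w*t + cross(u, t), with u = q.xyz, t = 2*cross(u, v).
    static float[] rotate(float[] q, float[] v) {
        float tx = 2f * (q[1] * v[2] - q[2] * v[1]);
        float ty = 2f * (q[2] * v[0] - q[0] * v[2]);
        float tz = 2f * (q[0] * v[1] - q[1] * v[0]);
        return new float[] {
            v[0] + q[3] * tx + (q[1] * tz - q[2] * ty),
            v[1] + q[3] * ty + (q[2] * tx - q[0] * tz),
            v[2] + q[3] * tz + (q[0] * ty - q[1] * tx)
        };
    }

    // Contribution of one weight to a vertex, using a joint blended between two keyframes.
    static float[] weightContribution(float[] pos0, float[] rot0, float[] pos1, float[] rot1,
                                      float t, float[] weightPos, float bias) {
        float[] p = lerp(pos0, pos1, t);
        float[] q = nlerp(rot0, rot1, t);
        float[] r = rotate(q, weightPos);
        return new float[] { (r[0] + p[0]) * bias, (r[1] + p[1]) * bias, (r[2] + p[2]) * bias };
    }
}
```

Summing weightContribution over a vertex's weights gives the skinned position, which mirrors the inner loop of the function above with the bind pose joint replaced by a blended animation joint.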


Started by theagentd January 19, 2012 09:47 AM

14 comments, last by theagentd 12 years, 3 months ago

Hodgman

52,717

January 20, 2012 11:30 AM

Regarding the static vertex data, most of the implementations I've seen use a UByte*4 for the associated bone indices and a UByte*4 for the weights for those indices. This limits each vertex to being associated with only 4 bones, and if a vertex is associated with fewer bones, then it still performs the same math as if it were associated with 4 but uses weights of 0.0 for the extra bones.
I've usually seen the dynamic/animated bone data represented as a 4x3 (or 3x4) matrix containing rotation/scale/translation transforms relative to the bind-pose.
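
A minimal Java sketch of that layout, assuming a row-major 3x4 matrix per bone stored as 12 consecutive floats and weights normalized by dividing by 255 (the class name and storage order are illustrative assumptions, not taken from any particular engine):

```java
// Hypothetical matrix-palette skinning for one vertex with 4 byte indices and 4 byte weights.
final class PaletteSkinning {
    // palette: boneCount * 12 floats, each bone a row-major 3x4 matrix relative to the bind pose.
    // bindPos: vertex position in the bind pose. Returns the skinned position.
    static float[] skin(float[] palette, float[] bindPos, byte[] boneIdx, byte[] weights) {
        float x = 0f, y = 0f, z = 0f;
        for (int k = 0; k < 4; k++) {
            // Java bytes are signed, so mask to read them as unsigned 0..255.
            float w = (weights[k] & 0xFF) / 255f;
            int m = (boneIdx[k] & 0xFF) * 12;
            x += w * (palette[m]     * bindPos[0] + palette[m + 1] * bindPos[1] + palette[m + 2]  * bindPos[2] + palette[m + 3]);
            y += w * (palette[m + 4] * bindPos[0] + palette[m + 5] * bindPos[1] + palette[m + 6]  * bindPos[2] + palette[m + 7]);
            z += w * (palette[m + 8] * bindPos[0] + palette[m + 9] * bindPos[1] + palette[m + 10] * bindPos[2] + palette[m + 11]);
        }
        return new float[] { x, y, z };
    }
}
```

Unused slots simply carry a weight of 0, so the loop always runs 4 times, which matches the fixed-cost behaviour described above.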

[quote name='irreversible' timestamp='1327056930' post='4904544']Also, think about it realistically - if your model comes at 24 FPS, then multiplying it by 2 will be enough for any human. 100 keyframes per second will give you smooth slow motion, which you very likely won't be needing.[/quote]

Where does the magic number 24 (or 48) come from? ;P

[quote]What dpadam450 means is that the only genre in which you will realistically encounter a large number of models that need to be animated individually is an RTS game. In a general case you'll have 10 models tops running around at one time, which is a breeze to animate on the CPU.[/quote]
Modern FPS games often have ~50 characters on-screen at once.
I'm doing a sports game at the moment with 30 characters, each with 60 bones, and who all have multiple different animation sources blended together unpredictably and IK applied on top -- the whole skeletal update part is still fairly cheap and only takes up a few milliseconds.

I'd personally just implement it in a way that is easily understood first (especially if I was fairly new to skinned animation, which admittedly, I am) and work on writing a more optimal version after I got the basic one working if it actually turns out to be performing badly.

. 22 Racing Series .

theagentd

990

Author

January 20, 2012 02:20 PM

@Irreversible
To be honest I probably won't be implementing any bullet-time effects, but I will have changeable game speed, which could drop the game speed to a very low value. I still think doing the interpolation in real-time is more accurate, since even if the animation speed matches the game FPS it would still be more accurate to do the interpolation for the exact time. Maybe it really is an unnoticeable difference in 99.999% of all cases. I might not be able to afford the additional cost of lots of slerps each frame even with multithreaded joint interpolation, so getting rid of it and just keeping the precomputed bone matrices in GPU memory might be the best choice anyway. Memory is something I can afford to use more of, so precomputing to about 60-120 frames per second should give enough smoothness in all possible cases. Now I know what the animation quality setting does in games... >_>
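
To put rough numbers on the precomputed approach, here is a small Java sketch. It assumes a looping animation, nearest-frame lookup, and 3 RGBA 32-bit float texels per bone matrix as in the first post; the class and method names are made up for illustration.

```java
// Hypothetical bookkeeping for an animation baked to a fixed frame rate.
final class BakedAnimation {
    // Index of the baked frame closest to the given time, wrapping around for a looping animation.
    static int nearestFrame(float timeSeconds, float bakedFps, int frameCount) {
        int f = Math.round(timeSeconds * bakedFps);
        return Math.floorMod(f, frameCount);
    }

    // Texture memory in bytes: 3 RGBA32F texels (16 bytes each) per bone per baked frame.
    static long textureBytes(int bones, int frames) {
        return (long) bones * frames * 3 * 16;
    }

    // Worst-case timing error of a nearest-frame lookup: half a baked frame.
    static float maxTimeError(float bakedFps) {
        return 0.5f / bakedFps;
    }
}
```

For example, textureBytes(100, 120) is 576000, so one second of animation for a 100-bone skeleton baked at 120 frames per second costs a bit over half a megabyte, and the pose is never off by more than about 4 ms of animation time.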

I am actually making a real-time strategy game, so I might be having about 100 units on the screen at the same time.

@Hodgman
I've read up quite a lot on GPU skinning and I have more than enough experience with shaders to implement this. Storing the joint translation and orientation in a matrix is probably the best idea as it eliminates the weight positions that would have to be stored per vertex otherwise. I'm loading MD5 meshes and animations, so the maximum number of weights per vertex that format supports is 4, so I'll just stick with that. It also doesn't support joint scales, so that simplifies it further. If using MD5 is a bad idea for some reason, please stop me now!!!
24 frames per second comes from the specific model I'm animating.
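
A sketch of how such a matrix could be built from MD5-style joints (a position plus a unit quaternion, no scale): compose the animated joint transform with the inverse of the bind pose joint transform. The names are hypothetical and matrices are row-major 3x4 with an implicit last row of 0 0 0 1. One assumption to flag: this only reproduces the weight-position method exactly if every weight of a vertex maps to the same bind pose position, which is not guaranteed for every MD5 file.

```java
// Hypothetical construction of a bind-pose-relative skinning matrix. Quaternions are {x, y, z, w}.
final class SkinMatrix {
    // Row-major 3x4 matrix from a unit quaternion and a translation.
    static float[] fromJoint(float[] q, float[] t) {
        float x = q[0], y = q[1], z = q[2], w = q[3];
        return new float[] {
            1 - 2 * (y * y + z * z), 2 * (x * y - z * w),     2 * (x * z + y * w),     t[0],
            2 * (x * y + z * w),     1 - 2 * (x * x + z * z), 2 * (y * z - x * w),     t[1],
            2 * (x * z - y * w),     2 * (y * z + x * w),     1 - 2 * (x * x + y * y), t[2]
        };
    }

    // Inverse of a rigid transform: rotation R^T, translation -R^T * t.
    static float[] invertRigid(float[] m) {
        float tx = m[3], ty = m[7], tz = m[11];
        return new float[] {
            m[0], m[4], m[8],  -(m[0] * tx + m[4] * ty + m[8] * tz),
            m[1], m[5], m[9],  -(m[1] * tx + m[5] * ty + m[9] * tz),
            m[2], m[6], m[10], -(m[2] * tx + m[6] * ty + m[10] * tz)
        };
    }

    // Product a * b of two 3x4 matrices, treating both as 4x4 with last row 0 0 0 1.
    static float[] mul(float[] a, float[] b) {
        float[] r = new float[12];
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 4; col++) {
                float s = a[row * 4] * b[col] + a[row * 4 + 1] * b[4 + col] + a[row * 4 + 2] * b[8 + col];
                if (col == 3) s += a[row * 4 + 3]; // translation column picks up a's translation
                r[row * 4 + col] = s;
            }
        }
        return r;
    }

    // Matrix that takes a bind pose position to its animated position for one joint.
    static float[] skinMatrix(float[] animQ, float[] animT, float[] bindQ, float[] bindT) {
        return mul(fromJoint(animQ, animT), invertRigid(fromJoint(bindQ, bindT)));
    }
}
```

With this, the vertex buffer only needs the bind pose position per vertex, and the inverse bind matrices can be computed once at load time.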

In other news, I just managed to get my software skinning working, so Bob is (happily?) waving his lantern around. FPS dropped from 83 FPS to 14 due to the skinning being done on the CPU (well, with 1000 instances though xD). Next I'll move the skinning to a vertex shader but keep joint interpolation on the CPU, which is the standard approach, right? Lastly I'll try a pure GPU solution with precomputed joints stored in a texture.

EDIT: My software implementation is obviously bottlenecked by the skinning. Skinning takes about 65% of the frame time at the moment, possibly a lot more if you count methods that are shared with other parts of the game.

irreversible

2,900

January 20, 2012 02:29 PM

[quote name='Hodgman']Regarding the static vertex data, most of the implementations I've seen use a UByte*4 for the associated bone indices and a UByte*4 for the weights for those indices. This limits each vertex to being associated with only 4 bones, and if a vertex is associated with fewer bones, then it still performs the same math as if it were associated with 4 but uses weights of 0.0 for the extra bones.
I've usually seen the dynamic/animated bone data represented as a 4x3 (or 3x4) matrix containing rotation/scale/translation transforms relative to the bind-pose.[/quote]

Incidentally, I don't have this working yet, but I'm packing indices with a ratio of 3:1 into float vectors while maintaining 8-bit precision (I haven't done the actual math as to what the maximum practical precision is, but the packing is the same as RGB2Float), limiting the model to 255 bones, which should be enough in even the most fringe cases, but it enables more concurrently influencing bones without increasing storage. As for packing weights into byte values, that results in a precision of 0.0039. I'm actually fairly curious as to whether this is enough (if it is, I'll definitely want to pack my weights as well). I'm also limiting myself to 4 concurrent data streams since I'm using transform feedback to do the skinning, which supports 4 bones at most for now as the largest vector stream that can be passed to TF is vec4, which limits the number of weights that can be blended.
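
One way such a 3:1 packing could work, sketched in Java: three 8-bit indices fill 24 bits, and a 32-bit float represents every integer up to 2^24 exactly, so the round trip is lossless. This is an assumed bit layout and not necessarily the RGB2Float packing mentioned above; the names are hypothetical. The weight helpers show where the 0.0039 figure comes from (one step is 1/255), with a worst-case rounding error of half a step.

```java
// Hypothetical packing helpers; the bit layout is an assumption for illustration.
final class IndexPacking {
    // Pack three indices in 0..255 into one float; the value is at most 2^24 - 1, which a float holds exactly.
    static float pack3(int a, int b, int c) {
        return (float) ((a << 16) | (b << 8) | c);
    }

    // Recover the three indices from a packed float.
    static int[] unpack3(float packed) {
        int v = (int) packed;
        return new int[] { (v >> 16) & 0xFF, (v >> 8) & 0xFF, v & 0xFF };
    }

    // Quantize a weight in [0, 1] to 0..255.
    static int quantizeWeight(float w) {
        return Math.round(w * 255f);
    }

    // Convert a quantized weight back to [0, 1].
    static float dequantizeWeight(int q) {
        return q / 255f;
    }
}
```

Since quantized weights will generally no longer sum to exactly 1, renormalizing them after unpacking is a common precaution.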

[quote name='irreversible' timestamp='1327056930' post='4904544']Also, think about it realistically - if your model comes at 24 FPS, then multiplying it by 2 will be enough for any human. 100 keyframes per second will give you smooth slow motion, which you very likely won't be needing.[/quote]

[quote name='Hodgman']Where does the magic number 24 (or 48) come from? ;P[/quote]

Oh, that's from the Bob model discussed above

[quote name='Hodgman']Modern FPS games often have ~50 characters on-screen at once.
I'm doing a sports game at the moment with 30 characters, each with 60 bones, and who all have multiple different animation sources blended together unpredictably and IK applied on top -- the whole skeletal update part is still fairly cheap and only takes up a few milliseconds.[/quote]

A fair point, but it really boils down to what the game is about. I'm personally targeting a non-kinematic solution (which, admittedly, begs the question why would one need skeletal animation anyway?).

dpadam450

2,403

January 20, 2012 06:41 PM

What I'm saying is two things, which someone on gamedev who is a moderator apparently doesn't understand, so they rate it down.

Don't over-optimize something that doesn't need it. Whatever method you use will probably be fine, unless you are really drawing a massive or even moderate amount of animated stuff. Unless you have an artist to make 50 models for an FPS (which I find way too high a statistic anyway), don't worry too much about a bottleneck that may or may not exist for your specific game. In most cases you just take, on the CPU, all the bones from the last keyframe and the next keyframe, blend those bones into new ones and send them down to the GPU.

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal

dpadam450

2,403

January 20, 2012 06:43 PM

[quote name='irreversible']I'm personally targeting a non-kinematic solution (which, admittedly, begs the question why would one need skeletal animation anyway?).[/quote]
Kinematics is moving, so you're probably thinking of inverse kinematics, or the inverse of movement. If you have an animated character, it has bones created from art in order to make frames of animation. Any 3d object has a skeleton.
