
GPU Skinning and frame interpolation


15 replies to this topic

#1 theagentd   Members   -  Reputation: 602


Posted 19 January 2012 - 03:47 AM

Hello. I've been trying to do skinning, and so far I think I've kind of figured out the math behind it. I've successfully loaded an MD5 mesh and I'm working on implementing CPU skinning for a small animation following this guide: http://3dgep.com/?p=1053. I think I can get software skinning working quite easily, but I definitely want to do the skinning in a vertex shader. I found another tutorial on the same website that covers this: http://3dgep.com/?p=1356. The "problem" is that it seems to do the bone interpolation on the CPU and then submit the data to the GPU each frame. I know just transforming 100 bones shouldn't be that CPU-heavy, but I might potentially have a thousand instances of the same mesh, each in a different animation. Therefore I also want to use instancing to reduce the CPU load. Since my game is an RTS, the GPU is severely underused even though I offloaded fog-of-war rendering to it, so trading GPU cycles for CPU cycles is a good thing in my case.

I could just multi-thread the bone interpolation for an almost linear increase in performance, but I still thought it should be possible to offload almost everything onto the GPU. That's when I stumbled upon this white paper: http://developer.dow...gWhitePaper.pdf. It seems to do exactly what I want by storing bone matrices in a texture, which was something I had thought of doing. However, the implementation in the white paper does not seem to have any kind of bone interpolation, and simply rounds to the closest frame (though this isn't written anywhere). I'm 99.9% sure they don't re-upload the bone matrices each frame, since they seem to keep the bone data per animation and frame, not per individual instance. Losing interpolation seems like a huge step backwards, so I would definitely not implement GPU skinning this way if that turned out to be the cost.

I figured I could just upload even "rawer" bone data to my animation texture, meaning I'd keep a 3D vector and a quaternion per bone instead of a matrix, and then do the interpolation between the two frames in the vertex shader. The amount of data sampled would only increase by about 33%:

1 matrix per weight       = 3 x RGBA 32-bit float
2 vectors + 2 quaternions = 4 x RGBA 32-bit float

I would also have to upload additional static data for the weights (a position for each weight). The problem is the additional logic needed to transform each vertex, since the interpolation would have to be redone for each vertex. I think this additional cost will be almost unnoticeable though, since the vertex shader should be bandwidth/texture limited anyway. If there happens to be a built-in function to do slerp (I've found mix(...), but I'm not sure if it's the right one), I think the additional logic cost would be negligible.
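For reference, GLSL's mix() is a plain component-wise linear interpolation, not a slerp. The usual shader-friendly compromise is nlerp: lerp the two quaternions (flipping one if they lie on opposite hemispheres) and renormalize. A minimal Java sketch of the per-bone math the shader would run; the method name and (x, y, z, w) array layout are my own, not from any library:

```java
public class QuatInterp {
    // nlerp: linear blend of two unit quaternions followed by
    // normalization -- the cheap stand-in for slerp that mix() plus
    // normalize() gives you in a vertex shader.
    static float[] nlerp(float[] a, float[] b, float t) {
        // Flip b's sign if the rotations are on opposite hemispheres,
        // so the blend follows the shorter arc.
        float dot = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
        float sign = dot < 0f ? -1f : 1f;
        float[] r = new float[4];
        for (int i = 0; i < 4; i++) {
            r[i] = a[i] * (1f - t) + sign * b[i] * t;
        }
        float len = (float) Math.sqrt(r[0]*r[0] + r[1]*r[1] + r[2]*r[2] + r[3]*r[3]);
        for (int i = 0; i < 4; i++) r[i] /= len;
        return r;
    }

    public static void main(String[] args) {
        // Halfway between identity and a 90-degree rotation about z.
        float[] m = nlerp(new float[]{0f, 0f, 0f, 1f},
                          new float[]{0f, 0f, 0.7071068f, 0.7071068f}, 0.5f);
        System.out.println(m[2] + " " + m[3]);
    }
}
```

Unlike slerp, nlerp is not constant-velocity, but for closely spaced keyframes the difference is usually invisible.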

In short, I'd port this exact function to a GLSL vertex shader:
for (int i = 0; i < m.vertices.size(); i++) {
    Vertex v = m.vertices.get(i);
    float x = 0, y = 0, z = 0;
    for (int k = 0; k < v.weightCount; k++) {
        Weight w = m.weights.get(v.startWeight + k); // v.startWeight = index into the weight list
        Joint j = bindPoseJoints.get(w.joint);       // Joint holds position and orientation; I'd be using the animated joints, not the bind pose joints, of course.
        rot(j.orientation, w.position, temp);        // Rotate the weight position by the joint's quaternion; temp is a temporary Vec3.
        Vector3f.add(temp, j.position, temp);        // Add the joint position.
        temp.scale(w.bias);                          // Scale by the weight's bias.
        x += temp.x;
        y += temp.y;
        z += temp.z;
    }
    vertexData[mesh].putFloat(x).putFloat(y).putFloat(z); // Write the skinned position into the buffer sent to OpenGL.
}

I'm pretty much a skinning n00b, but these are my thoughts on it. The main problem is the amount of static data needed per vertex (4 x vec3 per vertex for the weight positions), but if that cost is acceptable, I strongly suspect this will have better performance than doing the interpolation on the CPU, at least in my case.


#2 wolfscaptain   Members   -  Reputation: 200


Posted 19 January 2012 - 03:19 PM

I think it's better to bind a texture containing already-interpolated matrices.
Interpolate on the CPU using an appropriate structure (most likely 3D vectors for translation and scaling, and a quaternion for rotation).

With each vertex attribute, pass your weights (you probably want to cap them at a reasonable number; I think most people use a maximum of 4 weights per vertex), and use them as look-up values into the texture.

I haven't done GPU skinning yet, but I plan to do it pretty soon, and I think that's how I am going to do it. Sounds reasonable on both the CPU and GPU, with each doing the appropriate task.

#3 theagentd   Members   -  Reputation: 602


Posted 19 January 2012 - 06:48 PM

I realize that the GPU would have to interpolate 1 to 4 bones per vertex. With an average of 2.5 weights per vertex and around 1,000 vertices, that's about 2,500 bone interpolations per instance. A CPU implementation would only have to process each bone once, which in the case of my test model means only 33 interpolations per instance.

In the end I think I'll just implement both just to see if and how much slower the GPU version is...

EDIT: Oh, and the function in my first post does not contain any frame interpolation. Doh!
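For what it's worth, the missing step is small: given the animation time, find the two keyframes that bracket it and a blend factor, then interpolate each joint between those keyframes before running the skinning loop. A hedged Java sketch (the names and return layout are mine, not from the MD5 loader):

```java
public class FrameBlend {
    // Given an animation time in seconds and the keyframe rate, find the
    // two bracketing frames and the blend factor between them.
    // Returns { firstFrame, secondFrame, blend } packed in a float array.
    static float[] frameLerp(float timeSec, float frameRate, int frameCount) {
        float f = timeSec * frameRate;
        int f0 = (int) Math.floor(f) % frameCount;
        int f1 = (f0 + 1) % frameCount;          // wrap around for looping clips
        float t = f - (float) Math.floor(f);     // blend factor in [0, 1)
        return new float[] { f0, f1, t };
    }

    public static void main(String[] args) {
        // 24 fps clip, 48 frames, sampled at t = 0.0625 s (frame 1.5).
        float[] r = frameLerp(0.0625f, 24f, 48);
        System.out.println((int) r[0] + " " + (int) r[1] + " " + r[2]);
    }
}
```

Each joint is then blended with lerp (positions) and slerp/nlerp (orientations) using that blend factor.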

#4 dpadam450   Members   -  Reputation: 934


Posted 19 January 2012 - 06:54 PM

Things that don't change: weights and bone indices. So it's pretty obvious to keep those on the GPU. Also keep the initial vertex positions (bind pose). Take your for loop and put it on the GPU. For each object, compute the current frame's matrices and send all those bones down for that object. If 2 of the same character model are used, with 1 running and 1 walking, then you have to send the walk bones for one and the run bone matrices for the other.

#5 irreversible   Crossbones+   -  Reputation: 1382


Posted 19 January 2012 - 07:45 PM

I realize that the GPU would have to interpolate 1 to 4 bones per vertex. With an average of 2.5 weights per vertex and around 1,000 vertices, that's about 2,500 bone interpolations per instance. A CPU implementation would only have to process each bone once, which in the case of my test model means only 33 interpolations per instance.

In the end I think I'll just implement both just to see if and how much slower the GPU version is...

EDIT: Oh, and the function in my first post does not contain any frame interpolation. Doh!


Having just done that, I can tell you that with running all four weights per vertex (33 bones, you say? You're animating Bob, aren't you? :) ) on the GPU, you'll be looking at around a 30-100-fold increase in speed compared to smart per-weight skinning on the CPU. The relevant timings for me in debug mode are ~1-1.5 milliseconds for skinning on the CPU and 0.025-0.1 milliseconds on the GPU (this is using a 3-year-old mobile version of the GF 240GT). This assumes bones are animated/interpolated on the CPU and passed to the shader via a uniform.

#6 theagentd   Members   -  Reputation: 602


Posted 19 January 2012 - 08:04 PM


I realize that the GPU would have to interpolate 1 to 4 bones per vertex. With an average of 2.5 weights per vertex and around 1,000 vertices, that's about 2,500 bone interpolations per instance. A CPU implementation would only have to process each bone once, which in the case of my test model means only 33 interpolations per instance.

In the end I think I'll just implement both just to see if and how much slower the GPU version is...

EDIT: Oh, and the function in my first post does not contain any frame interpolation. Doh!


Having just done that, I can tell you that with running all four weights per vertex (33 bones, you say? You're animating Bob, aren't you? :) ) on the GPU, you'll be looking at around a 30-100-fold increase in speed compared to smart per-weight skinning on the CPU. The relevant timings for me in debug mode are ~1-1.5 milliseconds for skinning on the CPU and 0.025-0.1 milliseconds on the GPU (this is using a 3-year-old mobile version of the GF 240GT). This assumes bones are animated/interpolated on the CPU and passed to the shader via a uniform.

Yes, of course the GPU is faster than the CPU, but the real fight is between the hybrid CPU/GPU solution and a pure GPU solution. You're talking about the hybrid one, where the bone interpolation is done on the CPU and all per-vertex calculations are done on the GPU. In my first post I suggested offloading the interpolation to the GPU too, which would slightly lessen the amount of CPU cycles and bandwidth needed, in exchange for a possibly big GPU hit. You're comparing the CPU/GPU solution to a pure CPU solution. And yes, I hope to get a few Bobs swinging their lanterns around soon. =D

#7 irreversible   Crossbones+   -  Reputation: 1382


Posted 19 January 2012 - 08:29 PM

In my first post I suggested offloading the interpolation to the GPU too, which would slightly lessen the amount of CPU cycles and bandwidth needed, in exchange for a possibly big GPU hit. You're comparing the CPU/GPU solution to a pure CPU solution. And yes, I hope to get a few Bobs swinging their lanterns around soon. =D


Don't bother. Precompute your skeleton at whatever framerate your target is (say 40-50) for all animations, and round to the closest precomputed frame when rendering. You'll end up with zero computation time, and the memory footprint is negligible.
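The lookup this describes is essentially a one-liner; a sketch, assuming poses were baked at bakeRate frames per second (the names are mine):

```java
public class BakedPose {
    // Pick the nearest baked pose for the current animation time.
    // bakeRate is the precomputation framerate (e.g. 48 poses per second);
    // the modulo wraps looping animations around.
    static int nearestFrame(float timeSec, float bakeRate, int frameCount) {
        return Math.round(timeSec * bakeRate) % frameCount;
    }

    public static void main(String[] args) {
        // At t = 0.49 s with 48 baked poses/sec, the nearest pose is #24.
        System.out.println(nearestFrame(0.49f, 48f, 100));
    }
}
```

At render time the chosen frame index just selects a row (or offset) in the bone texture, with no interpolation at all.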

#8 dpadam450   Members   -  Reputation: 934

-1 Likes

Posted 19 January 2012 - 08:51 PM

It's also good to note that not much animated stuff is actually rendered in any game other than an RTS. In that case you could worry about doing other things.

#9 theagentd   Members   -  Reputation: 602


Posted 19 January 2012 - 10:07 PM


In my first post I suggested offloading the interpolation to the GPU too, which would slightly lessen the amount of CPU cycles and bandwidth needed, in exchange for a possibly big GPU hit. You're comparing the CPU/GPU solution to a pure CPU solution. And yes, I hope to get a few Bobs swinging their lanterns around soon. =D


Don't bother. Precompute your skeleton at whatever framerate that your target is (say 40-50) for all animations and round to the closest precomputed frame when rendering. You'll end up with zero computation time and the memory footprint is negligible.

But what if I want a super slow-motion effect that effectively drops the game speed to 1/10th? That would force me to precompute waaay too many frames for each animation. Also, I'd prefer the accuracy of real-time bone interpolation. That's a good suggestion though. Maybe precomputing a few extra frames to increase the framerate of each animation to 3x, and then not using slerp but simple lerp+normalizing on the quaternion in a vertex shader would produce accurate enough motion. You could even precompute with bicubic interpolation too.

Each frame requires about a kilobyte of data. Each animation may be shared between different meshes if they have the same skeleton layout, so the number of animations won't be that many. Let's say 20 units with different animation sets, each with 10 different 2-second animations. That'd be 20 x 10 x 2x24 frames of raw data --> each frame is 33 bones x 7 floats x 4 bytes per float --> 20x10x2x24x33x7x4 = 8.45 MB of data. Multiplying the number of frames by 4 is still just around 35 MB of data, so precomputing is a very possible solution. Nice idea!
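The arithmetic above checks out; here is the same calculation in code, using only the figures from the post (20 unit types x 10 animations x 2 s at 24 fps, 33 bones, 7 floats of 4 bytes each):

```java
public class AnimMemory {
    // Raw size of precomputed animation data: one position (3 floats) and
    // one quaternion (4 floats) per bone per frame, 4 bytes per float.
    static long animationBytes(int unitTypes, int animsPerUnit, float seconds,
                               int fps, int bones, int floatsPerBone) {
        long frames = (long) (seconds * fps);
        return (long) unitTypes * animsPerUnit * frames * bones * floatsPerBone * 4L;
    }

    public static void main(String[] args) {
        // 20 x 10 x 48 x 33 x 7 x 4 = 8,870,400 bytes, about 8.45 MB.
        System.out.println(animationBytes(20, 10, 2f, 24, 33, 7));
    }
}
```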


It's also good to note that not much animated stuff is actually rendered in any game other than an RTS. In that case you could worry about doing other things.

I don't really understand what you're saying...? ._.

#10 irreversible   Crossbones+   -  Reputation: 1382


Posted 20 January 2012 - 04:55 AM

But what if I want a super slow-motion effect that effectively drops the game speed to 1/10th? That would force me to precompute waaay too many frames for each animation. Also, I'd prefer the accuracy of real-time bone interpolation. That's a good suggestion though. Maybe precomputing a few extra frames to increase the framerate of each animation to 3x, and then not using slerp but simple lerp+normalizing on the quaternion in a vertex shader would produce accurate enough motion. You could even precompute with bicubic interpolation too.


Question: is the "what if" actually a realistic expectation? Will you be implementing bullet time?

If the answer is yes, then you could do some dynamic branching: if the required animation speed drops below a certain threshold (say, the precomputed framerate), then interpolate the skeleton for that model on the fly. Note that bone interpolation is cheap for an average model; 30-60 bones isn't that big of a deal really, and it's okay to do it when needed.

The real optimization you should be looking into here is the actual number of updates per second. If you're running hundreds of models at 100 FPS, then updating them each frame is going to affect said FPS. If you update each skeleton every other frame, you won't be compromising any visual quality and you'll have effectively halved the workload.

What is really expensive here is the skinning, which is done on the GPU. Note that the Bob model actually has a relatively low poly count (around 800 vertices); take a proper model with 10k vertices and 50 bones and you'll be looking at roughly the same animation load on the CPU, but a roughly 15 times more expensive skinning phase on the GPU. Vertex counts ramp up as time goes on; there's really no need to ramp up the number of bones in most cases.

Each frame requires about a kilobyte of data. Each animation may be shared between different meshes if they have the same skeleton layout, so the number of animations won't be that many. Let's say 20 units with different animation sets, each with 10 different 2-second animations. That'd be 20 x 10 x 2x24 frames of raw data --> each frame is 33 bones x 7 floats x 4 bytes per float --> 20x10x2x24x33x7x4 = 8.45 MB of data. Multiplying the number of frames by 4 is still just around 35 MB of data, so precomputing is a very possible solution. Nice idea!


It's likelier that you'll want to store your animation as a list of matrices, not 7 floats, for faster access. Also, think about it realistically: if your model comes at 24 FPS, then multiplying that by 2 will be enough for any human. 100 keyframes per second will give you smooth slow motion, which you very likely won't be needing.

It's also good to note that not much animated stuff is actually rendered in any game other than an RTS. In that case you could worry about doing other things.


I don't really understand what you're saying...? ._.


What dpadam450 means is that the only genre in which you will realistically encounter a large number of models that need to be animated individually is an RTS game. In the general case you'll have 10 models tops running around at one time, which is a breeze to animate on the CPU.

#11 Hodgman   Moderators   -  Reputation: 31173


Posted 20 January 2012 - 05:30 AM

Regarding the static vertex data, most of the implementations I've seen use a UByte*4 for the associated bone indices and a UByte*4 for the weights for those indices. This limits each vertex to being associated with only 4 bones; if a vertex is associated with fewer bones, it still performs the same math as if it were associated with 4, but uses weights of 0.0 for the extra bones.
I've usually seen the dynamic/animated bone data represented as a 4x3 (or 3x4) matrix containing rotation/scale/translation transforms relative to the bind-pose.
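That layout amounts to 8 bytes of static skinning data per vertex. A sketch of the packing, quantizing so the four byte weights sum to 255 (i.e. 1.0 after the shader normalizes); the helper is hypothetical, not from any particular engine, and assumes the influences arrive sorted largest-weight-first:

```java
public class SkinVertex {
    // Pack up to four bone influences into 8 bytes: four index bytes
    // followed by four weight bytes. Unused slots carry weight 0, so the
    // shader always blends exactly four bones.
    static byte[] pack(int[] bones, float[] weights) {
        byte[] out = new byte[8];
        int[] q = new int[4];
        int sum = 0;
        for (int i = 0; i < 4; i++) {
            q[i] = Math.round(weights[i] * 255f);
            sum += q[i];
        }
        // Push any quantization error into slot 0 (the largest weight),
        // so the four bytes always sum to exactly 255.
        q[0] += 255 - sum;
        for (int i = 0; i < 4; i++) {
            out[i] = (byte) bones[i];
            out[4 + i] = (byte) q[i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] p = pack(new int[]{1, 2, 0, 0}, new float[]{0.6f, 0.4f, 0f, 0f});
        System.out.println((p[4] & 0xFF) + " " + (p[5] & 0xFF));
    }
}
```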

Also, think about it realistically - if your model comes at 24 FPS, then multiplying it by 2 will be enough for any human. 100 keyframes per second will give you smooth slow motion, which you very likely won't be needing.

Where does the magic number 24 (or 48) come from? ;P

What dpadam450 means is that the only genre in which you will realistically encounter a large number of models that need to be animated individually is an RTS game. In the general case you'll have 10 models tops running around at one time, which is a breeze to animate on the CPU.

Modern FPS games often have ~50 characters on-screen at once.
I'm doing a sports game at the moment with 30 characters, each with 60 bones, who all have multiple different animation sources blended together unpredictably and IK applied on top -- the whole skeletal update part is still fairly cheap and only takes up a few milliseconds.

I'd personally just implement it in a way that is easily understood first (especially if I was fairly new to skinned animation, which, admittedly, I am) and work on writing a more optimal version after I got the basic one working, if it actually turns out to be performing badly.

#12 theagentd   Members   -  Reputation: 602


Posted 20 January 2012 - 08:20 AM

@Irreversible
To be honest I probably won't be implementing any bullet-time effects, but I will have a changeable game speed, which could drop to a very low value. I still think doing the interpolation in real-time is more accurate: even if the animation framerate matches the game FPS, it would still be better to interpolate for the exact time. Maybe it really is an unnoticeable difference in 99.999% of all cases. I might not be able to afford the additional cost of lots of slerps each frame even with multithreaded joint interpolation, so getting rid of it and just keeping precomputed bone matrices in GPU memory might be the best choice anyway. Memory is something I can afford to use more of, so precomputing at about 60-120 frames per second should give enough smoothness in all possible cases. Now I know what the animation quality setting does in games... >_>

I am actually making a real-time strategy game, so I might be having about 100 units on the screen at the same time.


@Hodgman
I've read up quite a lot on GPU skinning and I have more than enough experience with shaders to implement this. Storing the joint translation and orientation in a matrix is probably the best idea, as it eliminates the weight positions that would otherwise have to be stored per vertex. I'm loading MD5 meshes and animations, and I'll just stick with a maximum of 4 weights per vertex. The format also doesn't support joint scales, which simplifies things further. If using MD5 is a bad idea for some reason, please stop me now!!!
24 frames per second comes from the specific model I'm animating.


In other news, I just managed to get my software skinning working, so Bob is (happily?) waving his lantern around. FPS dropped from 83 to 14 due to the skinning being done on the CPU (well, with 1000 instances though xD). Next I'll move the skinning to a vertex shader but keep joint interpolation on the CPU, which is the standard approach, right? Lastly I'll try a pure GPU solution with precomputed joints stored in a texture.

EDIT: My software implementation is obviously bottlenecked by the skinning. Skinning takes about 65% of the frame time at the moment, possibly a lot more if you count methods that are shared with other parts of the game.

#13 irreversible   Crossbones+   -  Reputation: 1382


Posted 20 January 2012 - 08:29 AM

Regarding the static vertex data, most of the implementations I've seen use a UByte*4 for the associated bone indices and a UByte*4 for the weights for those indices. This limits each vertex to being associated with only 4 bones; if a vertex is associated with fewer bones, it still performs the same math as if it were associated with 4, but uses weights of 0.0 for the extra bones.
I've usually seen the dynamic/animated bone data represented as a 4x3 (or 3x4) matrix containing rotation/scale/translation transforms relative to the bind-pose.


Incidentally, I don't have this working yet, but I'm packing indexes at a ratio of 3:1 into float vectors while maintaining 8-bit precision (I haven't done the actual math as to what the maximum practical precision is, but the packing is the same as RGB2Float). This limits the model to 255 bones, which should be enough in even the most fringe cases, and it enables more concurrently influencing bones without increasing storage. As for packing weights into byte values, that results in a precision of 0.0039; I'm actually fairly curious as to whether this is enough (if it is, I'll definitely want to pack my weights as well). I'm also limiting myself to 4 concurrent data streams, since I'm using transform feedback to do the skinning: the largest vector stream that can be passed to TF is a vec4, which for now limits the number of weights that can be blended to 4.
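The exact scheme isn't shown, but an RGB2Float-style packing of three 0-255 indices into one float could look like this sketch (my own round-trip, not irreversible's code). Three 8-bit indices fit exactly in a float's 24-bit mantissa as long as each value stays below 256:

```java
public class IndexPack {
    // Pack three 8-bit bone indices into one float:
    // i0 + i1/256 + i2/65536. Each index must be in 0..255.
    static float pack(int i0, int i1, int i2) {
        return i0 + i1 / 256f + i2 / 65536f;
    }

    // Recover the three indices. The subtractions are exact because the
    // packed value never needs more than 24 significant bits.
    static int[] unpack(float f) {
        int i0 = (int) f;
        f = (f - i0) * 256f;
        int i1 = (int) f;
        int i2 = Math.round((f - i1) * 256f);
        return new int[] { i0, i1, i2 };
    }

    public static void main(String[] args) {
        int[] r = unpack(pack(7, 200, 255));
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```

In GLSL the unpack side would use floor() and fract() on the attribute value before indexing the bone texture.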

Also, think about it realistically - if your model comes at 24 FPS, then multiplying it by 2 will be enough for any human. 100 keyframes per second will give you smooth slow motion, which you very likely won't be needing.

Where does the magic number 24 (or 48) come from? ;P


Oh, that's from the Bob model discussed above. :)

Modern FPS games often have ~50 characters on-screen at once.
I'm doing a sports game at the moment with 30 characters, each with 60 bones, who all have multiple different animation sources blended together unpredictably and IK applied on top -- the whole skeletal update part is still fairly cheap and only takes up a few milliseconds.


A fair point, but it really boils down to what the game is about. I'm personally targeting a non-kinematic solution (which, admittedly, begs the question of why one would need skeletal animation anyway).

#14 dpadam450   Members   -  Reputation: 934


Posted 20 January 2012 - 12:41 PM

What I'm saying is two things, which a moderator on gamedev apparently doesn't understand, so they rated it down.

Don't over-optimize something that doesn't need it. Whatever method you use will probably be fine, unless you are really drawing a massive, or even moderate, amount of animated stuff. Unless you have an artist to make 50 models for an FPS (which I find way too high a statistic anyway), don't worry too much about a bottleneck that may or may not exist for your specific game. In most cases you just take, on the CPU, all the bones between the last keyframe and the next, blend them into new ones, and send them down to the GPU.

#15 dpadam450   Members   -  Reputation: 934


Posted 20 January 2012 - 12:43 PM

I'm personally targeting a non-kinematic solution (which, admittedly, begs the question why would one need skeletal animation anyway?).

Kinematics means movement, so you're probably thinking of inverse kinematics, i.e. the inverse of movement. If you have an animated character, it has bones created by an artist in order to make frames of animation. Any animated 3D object has a skeleton.

#16 theagentd   Members   -  Reputation: 602


Posted 20 January 2012 - 11:29 PM

Thanks for the responses, everyone! I got some really interesting responses, so I'll probably be busy for a while now. =P



