I could just multi-thread the bone interpolation for an almost linear increase in performance, but I still thought it should be possible to off-load almost everything onto the GPU. That's when I stumbled over this white paper: http://developer.dow...gWhitePaper.pdf. It seems to do exactly what I want by storing bone matrices in a texture, which was something I had thought of doing. However, the implementation in the white paper does not seem to do any kind of bone interpolation; it simply rounds to the closest key frame (though this isn't written anywhere). I'm 99.9% sure they don't re-upload the bone matrices each frame, since they seem to keep the bone data per animation and frame, not per individual instance. Losing interpolation seems like a huge step backwards, so I would definitely not implement GPU skinning this way if that turned out to be the cost.
I figured I could just upload even "rawer" bone data to my animation texture, meaning I'd keep a 3D vector and a quaternion per bone per frame instead of a matrix, and then do the interpolation between the two surrounding frames in the vertex shader. The amount of data sampled would only increase by about 33%:
1 matrix per weight = RGBA 32-bit float x 3
2 vectors and 2 quaternions = RGBA 32-bit float x 4
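To spell out the arithmetic behind that ~33% figure (my own back-of-the-envelope numbers, assuming a 4x3 bone matrix and RGBA 32-bit float texels):

```java
public class TexelCount {
    public static void main(String[] args) {
        // Matrix path: a 4x3 bone matrix is 12 floats = 3 RGBA texels.
        int matrixFloats = 4 * 3;
        int matrixTexels = matrixFloats / 4;          // 3

        // Raw path: 2 positions (vec3) + 2 quaternions (vec4)
        // = 6 + 8 = 14 floats, which needs 4 RGBA texels (16 floats, 2 unused).
        int rawFloats = 2 * 3 + 2 * 4;
        int rawTexels = (rawFloats + 3) / 4;          // 4, rounded up

        System.out.println(matrixTexels + " -> " + rawTexels
            + " texels per weight, ~"
            + (100 * (rawTexels - matrixTexels) / matrixTexels) + "% more");
        // prints "3 -> 4 texels per weight, ~33% more"
    }
}
```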
I would also have to upload additional static data for the weights (a position for each weight). The problem is the additional logic needed to transform each vertex, since the interpolation would have to be redone for each vertex. I think this additional cost will be almost unnoticeable, though, since the vertex shader should be bandwidth / texture limited anyway. GLSL's built-in mix(...) is a plain linear interpolation, so it works fine for the positions, but quaternions need a slerp (or at least a normalized lerp, i.e. mix() followed by normalize()), for which there is no built-in. Even so, I think the additional ALU cost would be negligible.
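For reference, the quaternion interpolation I'd need boils down to this normalized-lerp sketch (my own helper, quaternion stored as float[4] = {x, y, z, w}); nlerp is generally considered close enough to slerp for adjacent key frames:

```java
// Normalized lerp between two unit quaternions a and b at parameter t.
// This is exactly what mix(a, b, t) followed by normalize() would do in GLSL.
static float[] nlerp(float[] a, float[] b, float t) {
    // Take the shortest arc: if the dot product is negative, negate one input.
    float dot = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
    float sign = dot < 0 ? -1f : 1f;
    float[] q = new float[4];
    for (int i = 0; i < 4; i++)
        q[i] = a[i] + t * (sign * b[i] - a[i]);   // the mix(a, b, t) part
    // Renormalize so the result is a valid rotation again.
    float len = (float) Math.sqrt(q[0]*q[0] + q[1]*q[1] + q[2]*q[2] + q[3]*q[3]);
    for (int i = 0; i < 4; i++)
        q[i] /= len;
    return q;
}
```

Halfway between the identity and a 90-degree rotation this yields the exact 45-degree quaternion, which is why nlerp is usually fine when the frames are close together.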
In short, I'd port this exact function to a GLSL vertex shader:
for (int i = 0; i < m.vertices.size(); i++) {
    Vertex v = m.vertices.get(i);
    float x = 0, y = 0, z = 0;
    for (int k = 0; k < v.weightCount; k++) {
        Weight w = m.weights.get(v.startWeight + k); // v.startWeight = index into the weight list
        Joint j = bindPoseJoints.get(w.joint);       // Joint holds position and orientation. I'd be using the animated joints, not the bind pose joints, of course.
        rot(j.orientation, w.position, temp);        // quaternion rotation of the weight position; temp is a temporary Vec3
        Vector3f.add(temp, j.position, temp);        // add the joint position
        temp.scale(w.bias);                          // scale by this weight's bias
        x += temp.x;
        y += temp.y;
        z += temp.z;
    }
    vertexData[mesh].putFloat(x).putFloat(y).putFloat(z); // pack the skinned position to send to OpenGL
}
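The piece the shader would have to redo per vertex is the two-frame interpolation that currently produces the animated joints on the CPU. A sketch of that step, with my own minimal Joint layout matching the "raw" texture data above (vec3 position + quaternion orientation per bone per frame):

```java
// Interpolated pose between two key frames at parameter t in [0, 1].
// On the CPU this runs once per bone; moved to the shader it runs per vertex.
class Joint {
    float px, py, pz;       // position
    float qx, qy, qz, qw;   // orientation quaternion
}

static Joint interpolate(Joint a, Joint b, float t) {
    Joint out = new Joint();
    // Positions interpolate linearly (plain mix() in GLSL)...
    out.px = a.px + t * (b.px - a.px);
    out.py = a.py + t * (b.py - a.py);
    out.pz = a.pz + t * (b.pz - a.pz);
    // ...orientations use a normalized lerp to stay a unit quaternion.
    float dot = a.qx*b.qx + a.qy*b.qy + a.qz*b.qz + a.qw*b.qw;
    float s = dot < 0 ? -1f : 1f;   // shortest-arc fix-up
    out.qx = a.qx + t * (s*b.qx - a.qx);
    out.qy = a.qy + t * (s*b.qy - a.qy);
    out.qz = a.qz + t * (s*b.qz - a.qz);
    out.qw = a.qw + t * (s*b.qw - a.qw);
    float len = (float) Math.sqrt(out.qx*out.qx + out.qy*out.qy
                                + out.qz*out.qz + out.qw*out.qw);
    out.qx /= len; out.qy /= len; out.qz /= len; out.qw /= len;
    return out;
}
```

The skinning loop above would then read two texels per bone instead of one and call this before the rot/add/scale part.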
I'm pretty much a skinning n00b, but these are my thoughts on it. The main problem is the amount of (static) data needed per vertex (4 x vec3 per vertex for the weight positions, assuming 4 weights per vertex), but if that cost is acceptable I strongly suspect this will perform better than doing the interpolation on the CPU, at least in my case.