Also, I think you should consider encoding the matrix palette transforms as a quaternion rotation + translation so they take up half as many registers.
That's a tradeoff: you'll need to convert it back to a matrix for each bone in your vertex shader, so you could end up with less storage but at the cost of significant extra per-vertex computation (if hardware supported quaternion transforms it would be different, of course).
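vec4 orientation = u_quaternionArray[index]; // xyzw quaternion, does not have to be normalized
float invS = 2.0 / dot(orientation, orientation); // 2 / |q|^2
vec3 s = orientation.xyz * invS;
vec3 w = orientation.w * s; // (wx, wy, wz) * 2 / |q|^2
vec3 x = orientation.x * s; // (xx, xy, xz) * 2 / |q|^2
vec3 y = orientation.yyz * s.yzz; // (yy, yz, zz) * 2 / |q|^2
vec4 posAndScale = u_positionAndScaleArray[index]; // xyz = translation, w = uniform scale
// mat4(...) takes its arguments as columns. The three rotation vectors below are the
// rows of the textbook quaternion rotation matrix, so objectToWorld * v rotates by the
// conjugate of orientation; flip the signs of the w terms if that is not what your data expects.
mat4 objectToWorld = mat4(
    posAndScale.w * vec4(1.0 - (y.x + y.z), x.y - w.z, x.z + w.y, 0.0),
    posAndScale.w * vec4(x.y + w.z, 1.0 - (x.x + y.z), y.y - w.x, 0.0),
    posAndScale.w * vec4(x.z - w.y, y.y + w.x, 1.0 - (x.x + y.x), 0.0),
    vec4(posAndScale.xyz, 1.0));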
The additional cost per vertex isn't that big after all. I didn't even notice a performance drop on an iPhone 4S. It's quite GPU-friendly ALU code.
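If you want to sanity-check the packed data before uploading it, here is a rough CPU-side mirror of the shader snippet in C++. It is only a sketch: `PackedBone`, `Mat4` and `rebuildObjectToWorld` are made-up names for illustration, not part of any engine, and the matrix is stored as `m[column][row]` on the assumption that you want it laid out like a GLSL `mat4`. It repeats the shader arithmetic term for term, so it also lets you see which way a given quaternion ends up rotating.

```cpp
#include <array>

// Hypothetical CPU-side layout: two vec4-sized slots per bone instead of four.
struct PackedBone {
    std::array<float, 4> orientation;  // quaternion, xyzw
    std::array<float, 4> posAndScale;  // xyz = translation, w = uniform scale
};

// Column-major 4x4, indexed m[column][row] like a GLSL mat4.
using Mat4 = std::array<std::array<float, 4>, 4>;

// Mirrors the shader arithmetic above term for term.
Mat4 rebuildObjectToWorld(const PackedBone& b) {
    const auto& q = b.orientation;
    // 2 / |q|^2, so the quaternion does not have to be unit length.
    const float invS = 2.0f / (q[0]*q[0] + q[1]*q[1] + q[2]*q[2] + q[3]*q[3]);
    const float sx = q[0] * invS, sy = q[1] * invS, sz = q[2] * invS;
    const float wx = q[3] * sx, wy = q[3] * sy, wz = q[3] * sz;
    const float xx = q[0] * sx, xy = q[0] * sy, xz = q[0] * sz;
    const float yy = q[1] * sy, yz = q[1] * sz, zz = q[2] * sz;
    const float k = b.posAndScale[3];  // uniform scale

    Mat4 m;
    m[0] = {k * (1.0f - (yy + zz)), k * (xy - wz), k * (xz + wy), 0.0f};
    m[1] = {k * (xy + wz), k * (1.0f - (xx + zz)), k * (yz - wx), 0.0f};
    m[2] = {k * (xz - wy), k * (yz + wx), k * (1.0f - (xx + yy)), 0.0f};
    m[3] = {b.posAndScale[0], b.posAndScale[1], b.posAndScale[2], 1.0f};
    return m;
}
```

Feeding it an identity quaternion with scale 2 should give a diagonal of 2 plus the translation in the last column, and a `PackedBone` is half the size of a `Mat4`, which is where the register saving comes from.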