Normals can also be compressed into 4 bytes with no visible quality loss -- 11_11_10 is ok, 16_16 octahedral is better.
That means you can store normal + tangent in 8 bytes.
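To make the 16_16 octahedral idea concrete, here's a CPU-side sketch of the encode/decode (function and type names are my own, not from any particular engine): project the unit normal onto the octahedron, fold the lower hemisphere over the upper one, then quantize the two coordinates to 16 bits each.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Two 16-bit values = 4 bytes per normal (hypothetical helper type).
struct Oct16 { uint16_t x, y; };

static float sign_not_zero(float v) { return v >= 0.0f ? 1.0f : -1.0f; }

Oct16 encode_normal_oct16(float nx, float ny, float nz)
{
    // Project onto the octahedron: divide by the L1 norm.
    float l1 = std::fabs(nx) + std::fabs(ny) + std::fabs(nz);
    float ox = nx / l1, oy = ny / l1;
    // Fold the lower hemisphere across the diagonals.
    if (nz < 0.0f) {
        float tx = (1.0f - std::fabs(oy)) * sign_not_zero(ox);
        float ty = (1.0f - std::fabs(ox)) * sign_not_zero(oy);
        ox = tx; oy = ty;
    }
    // Quantize [-1,1] to 16 bits.
    auto q = [](float v) { return (uint16_t)std::lround((v * 0.5f + 0.5f) * 65535.0f); };
    return { q(ox), q(oy) };
}

void decode_normal_oct16(Oct16 o, float& nx, float& ny, float& nz)
{
    float ox = (o.x / 65535.0f) * 2.0f - 1.0f;
    float oy = (o.y / 65535.0f) * 2.0f - 1.0f;
    nz = 1.0f - std::fabs(ox) - std::fabs(oy);
    if (nz < 0.0f) {  // undo the lower-hemisphere fold
        float tx = (1.0f - std::fabs(oy)) * sign_not_zero(ox);
        float ty = (1.0f - std::fabs(ox)) * sign_not_zero(oy);
        ox = tx; oy = ty;
    }
    nx = ox; ny = oy;
    float len = std::sqrt(nx*nx + ny*ny + nz*nz);
    nx /= len; ny /= len; nz /= len;  // renormalize after quantization
}
```

In practice the decode half lives in the vertex shader; the roundtrip error at 16 bits per component is far below anything lighting will show.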
You probably don't need full 32-bit float precision for UVs, so you could put them in 16-bit half-floats, which reduces them from 8 bytes to 4 bytes total.
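If you want to see where that precision goes, here's a minimal float32 -> float16 bit conversion, simplified to normal-range values only (no NaN/Inf/denormal/rounding handling) -- real code would use F16C intrinsics or a library, this is just to show the mechanics:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

uint16_t float_to_half(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t sign = (bits >> 16) & 0x8000u;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFFu) - 127 + 15; // rebias 8-bit exp to 5-bit
    uint32_t mant = (bits >> 13) & 0x3FFu;                      // keep top 10 mantissa bits
    if (exp <= 0)  return (uint16_t)sign;                       // flush tiny values to zero
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u);           // clamp huge values to infinity
    return (uint16_t)(sign | ((uint32_t)exp << 10) | mant);
}

float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    int32_t  exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits = sign;                                       // exp==0 decodes as signed zero here
    if (exp != 0)
        bits |= ((uint32_t)(exp - 15 + 127) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

With 10 mantissa bits you get roughly 3 decimal digits of precision across [0,1] -- plenty for UVs on typical texture sizes, though huge texture atlases may want to keep full floats.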
If colour is not HDR, then it can be stored at 8 bits per channel, reducing it from 16 bytes to 4 bytes.
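Packing four [0,1] channels into one uint is the easy one (a sketch -- real pipelines usually also do an sRGB conversion at this point):

```cpp
#include <cassert>
#include <cstdint>
#include <cmath>
#include <algorithm>

// Pack a non-HDR RGBA colour into one uint, 8 bits per channel.
uint32_t pack_rgba8(float r, float g, float b, float a)
{
    auto q = [](float v) {
        v = std::min(std::max(v, 0.0f), 1.0f);       // clamp to [0,1]
        return (uint32_t)std::lround(v * 255.0f);    // quantize to 8 bits
    };
    return q(r) | (q(g) << 8) | (q(b) << 16) | (q(a) << 24);
}
```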
Bone ID and weight can often be 8-bit, but we've got space left over in the structure, so let's say they need to be 16-bit each.
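At 16 bits each, the bone index and a normalized [0,1] weight pack into one uint like so (a sketch matching the layout below, with a name I made up):

```cpp
#include <cassert>
#include <cstdint>
#include <cmath>

// 16-bit bone index in the high half, 16-bit quantized weight in the low half.
uint32_t pack_bone_id_weight(uint16_t bone_id, float weight)
{
    uint32_t w = (uint32_t)std::lround(weight * 65535.0f); // quantize [0,1] to 16 bits
    return ((uint32_t)bone_id << 16) | (w & 0xFFFFu);
}
```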
This gives you a 32-byte vertex -- a 2.5x improvement!
struct Vertex
{
    float3 position;        // 0  + 12 = 12 bytes
    uint   normal;          // 12 + 4  = 16 bytes
    uint   tangent;         // 16 + 4  = 20 bytes
    half2  uv;              // 20 + 4  = 24 bytes
    uint   color;           // 24 + 4  = 28 bytes
    uint   bone_id_weight;  // 28 + 4  = 32 bytes
};
You also need one bit to tell you whether the binormal is cross(normal,tangent) or -cross(normal,tangent), but you could squeeze that into one of the bits in the bone_id_weight field -- 15 bits for weight is still way more than required.
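That bit-stealing trick might look like this (a sketch -- the field split and names are my own): take the top bit of the 16-bit weight for the sign, leaving 15 bits of weight precision.

```cpp
#include <cassert>
#include <cstdint>
#include <cmath>

// Bone index in the high 16 bits, binormal sign in bit 15, 15-bit weight below it.
uint32_t pack_bone_weight_sign(uint16_t bone_id, float weight, bool binormal_negative)
{
    uint32_t w = (uint32_t)std::lround(weight * 32767.0f) & 0x7FFFu; // 15-bit weight
    uint32_t s = binormal_negative ? 0x8000u : 0u;                   // stolen sign bit
    return ((uint32_t)bone_id << 16) | s | w;
}

// Returns +1 for binormal = cross(normal, tangent), -1 for the flipped case.
float unpack_binormal_sign(uint32_t packed)
{
    return (packed & 0x8000u) ? -1.0f : 1.0f;
}
```

In the shader you'd then reconstruct `binormal = sign * cross(normal, tangent)` after decoding the normal and tangent.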
Is that sort of stuff worth it though? Reducing memory usage at the cost of increasing computation?
Computers double in computation power every two years, but only double in memory bandwidth every 10 years... Over a decade that's 2^5 = 32x the FLOPS but only 2x the bandwidth, which is another way to say that every decade the "bytes transferred per FLOP" performance of computers gets 16 times WORSE! :o (on the flipside, "FLOPS available per transferred byte" gets 16 times better each decade! :D )
Optimizing for memory bandwidth has always been important for GPUs, and is getting more important with every new generation of hardware :(