I haven't yet tried any of this because I want to get as much info as possible so I can "do it right". If I used your suggestion, how would I change the float3 into some type of rotation matrix? Would it be a float3x3 with [0,0] for the x axis, [1,1] for the y axis, and [2,2] for the z axis? I don't actually know how to set it up.
And the question remains, would I be sacrificing speed for a smaller size?
Look up quaternion <-> matrix conversions, or how to set up transformations directly as quaternions. It's easier to start with the full float4 quaternion and apply the float3 optimization later.
As for speed, it depends on your bottleneck. If you've got ALU to spare but bandwidth or interpolators are a problem, you win from the conversion. This is probably the case for instancing.