Yes, I've seen that behavior as well when using compute shaders. My workaround was to use a StructuredBuffer instead and do the 4 loads myself. In terms of the compiled assembly this isn't really any less efficient, since a float4x4 load gets split into 4 loads anyway when compiled to assembly (you can only load 4 DWORDs at a time from structured buffers).
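For reference, here is a rough sketch of what the 4-load workaround can look like. It is only one possible reading of the approach, and it makes some assumptions: the buffer name `BoneRows`, the helper `LoadBoneMatrix`, the `t0` register slot, and the layout of 4 consecutive float4 elements per matrix are all made up for illustration.

```hlsl
// Hypothetical buffer: each matrix is stored as 4 consecutive float4 elements.
StructuredBuffer<float4> BoneRows : register(t0);

// Hypothetical helper: builds a float4x4 from 4 explicit float4 loads.
float4x4 LoadBoneMatrix(uint boneIndex)
{
    uint first = boneIndex * 4;

    // The float4x4 constructor treats each float4 argument as one row.
    return float4x4(BoneRows[first + 0],
                    BoneRows[first + 1],
                    BoneRows[first + 2],
                    BoneRows[first + 3]);
}
```

Whether those 4 vectors are really the rows or the columns of the matrix depends on how the CPU side wrote them, so the order of the arguments to mul() (or an extra transpose()) has to match that layout.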
While searching for an answer I stumbled upon your blog, which is just about the only place where this packing issue is mentioned (bad MSDN...). I actually took your advice and loaded the 4 vectors myself, only to find out that the skinning shader went from 46 instructions with column-major to 176 instructions with row-major.
I knew that row-major is less efficient than column-major, but nearly 4x the instructions is just too much.