Ah that worked, thanks! Is there any reason why the DirectX math library uses row major and then decides to switch in HLSL to column?
Because DirectXMath is a C/C++ library, and these languages, like most others, use row-major as the "natural" layout: the elements of a row are consecutive in memory, assuming that the first index of `m[i][j]` is the row and the second is the column.
But the HLSL compiler by default assumes that one constant register (4 floats) contains one column of the matrix (column-major packing). Assuming row vectors, the order of multiplication is this:
v * M
Here the compiler can generate very efficient code, because the multiplication is just 4 dp4 (dot product) instructions. However, the C++ code packed the matrix "wrong": setting the shader constants puts a row into each register, not a column, so the calculation yields nonsense.
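To make the mismatch concrete, here is a minimal C++ sketch (plain std::array, no DirectX; the function names such as shaderWrongPacking are mine) that emulates the dp4 code path when a row-major matrix is uploaded as-is:

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>; // row-major, as in C/C++: m[row][col]

// One dp4: a single 4-component dot product, as the GPU executes it.
float dp4(const Vec4& a, const Vec4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Reference result: v * M with a row vector v and row-major M.
Vec4 mulRowVector(const Vec4& v, const Mat4& m) {
    Vec4 r{0, 0, 0, 0};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            r[col] += v[row] * m[row][col];
    return r;
}

// What the shader computes when the row-major matrix is uploaded unchanged:
// register k receives 4 consecutive floats, i.e. ROW k, but the generated
// dp4 code treats register k as COLUMN k -- so it computes v * transpose(M).
Vec4 shaderWrongPacking(const Vec4& v, const Mat4& regs) {
    return {dp4(v, regs[0]), dp4(v, regs[1]), dp4(v, regs[2]), dp4(v, regs[3])};
}
```

For the matrix with rows (1,2,3,4), (5,6,7,8), (9,10,11,12), (13,14,15,16) and v = (1,2,3,4), the intended v * M is (90, 100, 110, 120), while the mis-packed path returns (30, 70, 110, 150) — exactly v * Mᵀ.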
You have 3 options:
1. Change the order of multiplication, like already mentioned. This makes the compiler assume a column vector:
M * v
So you effectively "cheat" by performing an implicit matrix transpose.
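A sketch of why this works (plain C++ emulating the generated code, with my own helper names): under the column-major assumption, M * v compiles to a linear combination of the registers; since the registers actually hold rows, that combination is exactly v * M.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>; // row-major: m[row][col]

// Reference: v * M with a row vector and row-major storage.
Vec4 mulRowVector(const Vec4& v, const Mat4& m) {
    Vec4 r{0, 0, 0, 0};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            r[col] += v[row] * m[row][col];
    return r;
}

// M * v as compiled under the column-major assumption: the result is the
// linear combination v.x*reg0 + v.y*reg1 + v.z*reg2 + v.w*reg3. Because our
// registers actually hold ROWS, this sums v[k] * row_k, which is exactly v*M.
Vec4 shaderMulMV(const Mat4& regs, const Vec4& v) {
    Vec4 r{0, 0, 0, 0};
    for (int k = 0; k < 4; ++k)      // one register per iteration
        for (int i = 0; i < 4; ++i)
            r[i] += v[k] * regs[k][i];
    return r;
}
```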
2. Use the already mentioned compiler option (/Zpr on the fxc command line, D3DCOMPILE_PACK_MATRIX_ROW_MAJOR in code, or the row_major keyword on individual matrices in HLSL) so the compiler assumes that a register contains a row, not a column.
However, the compiler must then generate less efficient code (four vector-times-scalar multiplies plus adds); this needs 3 instructions more.
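The code generated for this case can be emulated like so (again a plain C++ sketch with made-up helper names; scale and add stand in for the per-register multiply and add instructions):

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>; // row-major: m[row][col]

// Stand-ins for the vector*scalar and vector-add instructions.
Vec4 scale(const Vec4& a, float s) {
    return {a[0] * s, a[1] * s, a[2] * s, a[3] * s};
}
Vec4 add(const Vec4& a, const Vec4& b) {
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}

// v * M when the compiler knows each register holds a ROW: it cannot emit
// one dp4 per register; instead it combines the rows, one vector*scalar
// multiply (plus an add) per register.
Vec4 shaderRowMajorVM(const Vec4& v, const Mat4& regs) {
    Vec4 r = scale(regs[0], v[0]);    // multiply
    r = add(r, scale(regs[1], v[1])); // multiply + add
    r = add(r, scale(regs[2], v[2])); // multiply + add
    r = add(r, scale(regs[3], v[3])); // multiply + add
    return r;                         // correct v * M, but more instructions
}
```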
3. Transpose the matrix on the CPU before setting it as a shader constant. Now each register really does hold a column, the multiplication order stays the same as in your C++ code, and the GPU code is optimal.
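A sketch of this option in plain C++ (transpose and shaderDp4 are illustrative names, not DirectXMath calls — in real code you would use XMMatrixTranspose before uploading):

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>; // row-major: m[row][col]

float dp4(const Vec4& a, const Vec4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// CPU-side transpose before setting the shader constants.
Mat4 transpose(const Mat4& m) {
    Mat4 t{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            t[c][r] = m[r][c];
    return t;
}

// The default code path: one dp4 per register. After the transpose,
// register k really holds column k, so this yields the correct v * M.
Vec4 shaderDp4(const Vec4& v, const Mat4& regs) {
    return {dp4(v, regs[0]), dp4(v, regs[1]), dp4(v, regs[2]), dp4(v, regs[3])};
}
```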