public static Vector4 Add(Vector4 left, Vector4 right)
{
    Vector4 vector;
    vector.X = left.X + right.X;
    vector.Y = left.Y + right.Y;
    vector.Z = left.Z + right.Z;
    vector.W = left.W + right.W;
    return vector;
}
led me to believe (from experience with VC++/CLI) that the packed-single versions of the SSE instructions aren't used, and that it instead does one scalar addition at a time. So I decided to get an actual x86 disassembly, and I was disappointed to find this:
It's not using SSE instructions AT ALL!
I have a library that uses packed-single SSE instructions for the most common ops, and I wrote it using VC++ intrinsic functions. It is, of course, native. The problem is that switching from .NET to native code incurs overhead (as I've seen in ANTS Profiler and in my own benchmarks).
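For reference, a minimal sketch of what a packed-single add looks like with the SSE intrinsics. The Vec4 struct and the add function are hypothetical names used only for illustration (this is not the actual library code), and the sketch assumes the data is 16-byte aligned:

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, ...

// Hypothetical 4-float vector, forced to 16-byte alignment so that
// aligned loads and stores (MOVAPS) are legal.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Adds all four lanes with one packed ADDPS instead of four scalar adds.
inline Vec4 add(const Vec4& left, const Vec4& right) {
    // MOVAPS faults on unaligned addresses, so check in debug builds.
    assert(reinterpret_cast<std::uintptr_t>(&left) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&right) % 16 == 0);

    __m128 a = _mm_load_ps(&left.x);   // aligned 16-byte load
    __m128 b = _mm_load_ps(&right.x);  // aligned 16-byte load

    Vec4 result;
    _mm_store_ps(&result.x, _mm_add_ps(a, b));  // packed add, aligned store
    return result;
}
```

Called from managed code, a function this small would probably be dominated by the managed-to-native transition, which is the overhead described above.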
I've heard of SlimGen, and know that it was made by some of the same guys that work on SlimDX. I'm wondering, why doesn't SlimDX use SlimGen to inject (or however it works) SSE packed-single versions of the vector operations? I noticed SlimGen is for .NET 2.0. Is it not possible with .NET 4.0?
AND the last question! Unfortunately, most vectors I see in use aren't vec4, but vec3. I can't load/store a whole vec3 into an XMM register with one instruction (or can I?) without reading or writing past its bounds (MOVAPS is gonna move 16 bytes, but we only have 12). I'd rather avoid using a bunch of instructions to load and shuffle individual (or pairs of) floating-point values. How can I best solve this problem? Convert everything to vec4? Just use scalar instructions?
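One possible way to handle a vec3 without touching the 4 bytes past it, sketched under the assumption of a tightly packed 12-byte struct: load x and y with one 8-byte MOVLPS, load z with MOVSS, and merge the two halves. Vec3, loadVec3, storeVec3 and add are hypothetical names for illustration only. It still costs two loads plus a move per operand, so it may or may not beat padding everything to vec4:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical tightly packed 3-float vector (12 bytes, no padding).
struct Vec3 {
    float x, y, z;
};
static_assert(sizeof(Vec3) == 12, "sketch assumes a 12-byte Vec3");

// Loads exactly 12 bytes into an XMM register as (x, y, z, 0).
inline __m128 loadVec3(const Vec3& v) {
    // MOVLPS: 8-byte load of x, y into the low half -> (x, y, 0, 0)
    __m128 xy = _mm_loadl_pi(_mm_setzero_ps(),
                             reinterpret_cast<const __m64*>(&v.x));
    // MOVSS: 4-byte load of z into lane 0 -> (z, 0, 0, 0)
    __m128 z = _mm_load_ss(&v.z);
    // MOVLHPS: low half of xy, then low half of z -> (x, y, z, 0)
    return _mm_movelh_ps(xy, z);
}

// Stores only the first three lanes, writing exactly 12 bytes.
inline void storeVec3(Vec3& out, __m128 v) {
    _mm_storel_pi(reinterpret_cast<__m64*>(&out.x), v);  // writes x, y
    _mm_store_ss(&out.z, _mm_movehl_ps(v, v));           // lane 2 -> z
}

// Packed add of two Vec3 values using the helpers above.
inline Vec3 add(const Vec3& a, const Vec3& b) {
    Vec3 r;
    storeVec3(r, _mm_add_ps(loadVec3(a), loadVec3(b)));
    return r;
}
```

Neither load requires 16-byte alignment, which matters for arrays of 12-byte elements where only every fourth element would be aligned.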
Thanks in advance