More questions!
Recently I've been working on making a high-performance vector library for use in my game engine. Specifically I'm trying to make a bunch of functions use SIMD instructions in order to speed things up. Intel's reference manuals (and Agner Fog's invaluable guide) have been of great assistance here, and my question is in relation to something I found in there. I ran across the following code snippet in the Optimization Reference Manual:
movaps xmm0, [eax]
mulps xmm0, [eax+16]
movhlps xmm1, xmm0
addps xmm0, xmm1
pshufd xmm1, xmm0, 1
addss xmm0, xmm1
movss [ecx], xmm0
for SSE/SSE2, and the following:
movaps xmm0, [eax]
mulps xmm0, [eax+16]
haddps xmm0, xmm0
movaps xmm1, xmm0
psrlq xmm0, 32
addss xmm0, xmm1
movss [eax], xmm0
for SSE3.
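For reference, here is a minimal sketch of how the SSE/SSE2 listing might map onto compiler intrinsics instead of __asm{}. As far as I can tell, both listings compute a four-component dot product. The function name dot4_sse2 and the two-pointer signature are my own assumptions (the manual's code reads both vectors back to back from [eax] and [eax+16]). I also used unaligned loads so the sketch does not fault on arbitrary float arrays, whereas movaps requires 16-byte alignment.

```c
#include <emmintrin.h> /* SSE2 intrinsics; also pulls in the SSE header */

/* Hypothetical helper: dot product of two 4-float vectors.
   Mirrors the SSE/SSE2 listing step by step. */
static float dot4_sse2(const float *a, const float *b)
{
    /* movaps + mulps: the listing uses aligned loads (_mm_load_ps);
       _mm_loadu_ps is used here so unaligned pointers are also safe. */
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* movhlps: bring the upper two products down to lanes 0 and 1. */
    __m128 high = _mm_movehl_ps(prod, prod);

    /* addps: lane 0 = p0 + p2, lane 1 = p1 + p3. */
    __m128 pair = _mm_add_ps(prod, high);

    /* pshufd with immediate 1: move lane 1 into lane 0. */
    __m128 lane1 = _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(0, 0, 0, 1));

    /* addss: lane 0 = (p0 + p2) + (p1 + p3). */
    __m128 total = _mm_add_ss(pair, lane1);

    /* movss: extract lane 0 as a plain float return value. */
    return _mm_cvtss_f32(total);
}
```

Called as dot4_sse2(a, b) with a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, this should return 70. Returning a plain float leaves the choice of return register to the compiler and the calling convention, so there is nothing to set up by hand. An SSE3 variant would presumably swap the movhlps/addps/pshufd steps for _mm_hadd_ps from pmmintrin.h, but I have left that out because it needs SSE3 enabled at compile time on some compilers.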
What I'd like to be able to do is take these assembly snippets and turn them into nice, standard C/C++ functions, presumably using inline __asm{} blocks. I went ahead and tried this with some __m128 variables, replacing the eax memory operands with references to them...
...and it blew up in my face. When I didn't get access violations I got some random number that wasn't at all what I had expected. If there are any x86 assembly gurus out there, would you mind explaining the correct way to set these up? I possess some familiarity with assembly and looked over the article here on SSE2, but it hasn't helped much. Additionally, I'd like for the assembly code to just drop the return value in whichever register it ultimately goes to and leave, but I haven't found anything on how to do this on the Interwebs. Help please?