Inline assembly, C++, Visual Studio

5 comments, last by Christian Weis 13 years, 8 months ago
So, I want to do these operations using inline assembly. I'm using Visual Studio 2008 and I'm getting weird results towards the end. I think it has to do with the FPU register stack: norm2 comes out as garbage, while c, b and a are what they are supposed to be. Anyway, this is my C code, and my asm is beneath it.
    double a = 1.0, b = 2.0, norm2 = 0.0, two = 2.0, x = 3.0, y = 4.0, c;
    c = a*a - b*b + x;       // 1 - 4 + 3 = 0
    b = 2.0*a*b + y;         // 4 + 4 = 8
    a = c;                   // 0
    norm2 = a*a + b*b;       // 0 + 64 = 64


and this is my assembly
    __asm
    {
        fld b               ; st: b
        fmul st(0), st(0)   ; st: b * b
        fld a               ; st: a, b * b
        fmul st(0), st(0)   ; st: a * a, b * b
        fsub st(0), st(1)   ; st: a * a - b * b, b * b
        fld x               ; st: x, a * a - b * b, b * b
        fadd st(0), st(1)   ; st: x + a * a - b * b, a * a - b * b, b * b
        fstp c              ; our new c  st: a * a - b * b, b * b
        fstp st(0)
        fstp st(0)          ; st: (empty)

        fld b               ; st: b
        fld a               ; st: a, b
        fmul st(0), st(1)   ; a * b, b
        fld two             ; 2, a * b, b
        fmul st(0), st(1)   ; a * b * 2, a * b, b
        fld y               ; y, a * b * 2, a * b, b
        fadd st(0), st(1)
        fstp b
        fld c
        fst a
        fmul st(0), st(0)
        fld b
        fmul st(0), st(0)
        fadd st(0), st(1)
        fstp norm2
    }


Should I be popping the stack regularly? Does it slow down the app?

Thank you
Umm... why not just use registers?
st(0) - st(7) are the floating point registers, which is what I'm using to store the doubles.
Quote: Umm... why not just use registers?

heh, the FPU _register_ stack was before your time? :D

Quote: Should I be popping the stack regularly? Does it slow down the app?

There is no need to pop the stack "regularly" (in fact it helps to have intermediate results up there that you might be able to reuse via FXCH), but you must avoid overflowing it.
Your code has a massive imbalance (missing lots of pops at the end): every pass through the block leaves values behind on the stack. If there's a preceding inline asm block with similar behavior, you might indeed see an overflow. (IIRC, the FPU status word has a top-of-stack (TOP) field you could check.)
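To make the imbalance concrete, here is a minimal sketch that only counts stack depth for the instruction sequence posted in the question. It is a simplified model, not an x87 emulator: it assumes each fld pushes one value, each fstp pops one, and every other instruction in the block leaves the depth unchanged. The names kDepthDelta, leftover_per_pass and first_overflowing_pass are made up for this illustration.

```cpp
#include <cstddef>

// Net change in x87 stack depth for each instruction of the posted block,
// in order: +1 for a load (fld), -1 for a popping store (fstp), 0 otherwise.
static const int kDepthDelta[] = {
    +1, 0, +1, 0, 0, +1, 0, -1, -1, -1,  // fld b ... fstp st(0): computes c
    +1, +1, 0, +1, 0, +1, 0, -1,         // fld b ... fstp b: computes b
    +1, 0, 0, +1, 0, 0, -1               // fld c ... fstp norm2
};
static const std::size_t kCount = sizeof(kDepthDelta) / sizeof(kDepthDelta[0]);
static const int kStackSize = 8;  // st(0) .. st(7)

// Number of values left on the stack after one pass from an empty stack.
int leftover_per_pass() {
    int depth = 0;
    for (std::size_t i = 0; i < kCount; ++i) depth += kDepthDelta[i];
    return depth;
}

// Runs the block repeatedly without any cleanup in between. Returns the
// 1-based pass in which a load first needs a ninth register, or 0 if that
// never happens within max_passes. If overflow_index is non-null, it
// receives the 0-based index of the offending instruction.
int first_overflowing_pass(int max_passes, int* overflow_index) {
    int depth = 0;
    for (int pass = 1; pass <= max_passes; ++pass) {
        for (std::size_t i = 0; i < kCount; ++i) {
            depth += kDepthDelta[i];
            if (depth > kStackSize) {
                if (overflow_index) *overflow_index = static_cast<int>(i);
                return pass;
            }
        }
    }
    return 0;
}
```

Under this model one pass leaves four values behind, and a second pass overflows at the final fld b, the load that feeds norm2. If the block runs more than once (which the question does not state), that would be consistent with norm2 being garbage while a, b and c still look right.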
Incidentally, why write FPU code when SSE is pretty much universally available?
And if you must, a+a is faster than a*2, and there are instructions that pop the stack twice (FCOMPP).
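A balanced version along the lines of this reply might look like the sketch below. It uses the popping forms (fsubp, faddp, fstp) so the stack is empty on exit, and a + a in place of the multiply by two. The function name step_balanced and the local names la, lb and ln are made up for illustration. The __asm path is only compiled by 32-bit MSVC and has not been benchmarked; any other compiler falls back to plain C++ so the snippet still builds.

```cpp
// One update step, written so that the x87 stack is empty when it returns.
void step_balanced(double& a, double& b, double x, double y, double& norm2) {
#if defined(_MSC_VER) && defined(_M_IX86)
    // Copy to locals: inline asm needs plain variables, not references.
    double la = a, lb = b, ln = 0.0;
    __asm {
        fld   qword ptr la      ; st: a
        fmul  st(0), st(0)      ; st: a*a
        fld   qword ptr lb      ; st: b, a*a
        fmul  st(0), st(0)      ; st: b*b, a*a
        fsubp st(1), st(0)      ; st: a*a - b*b
        fadd  qword ptr x       ; st: c
        fld   qword ptr la      ; st: a, c
        fadd  st(0), st(0)      ; st: 2a, c
        fmul  qword ptr lb      ; st: 2ab, c
        fadd  qword ptr y       ; st: new b, c
        fst   qword ptr lb      ; store new b, keep it on the stack
        fmul  st(0), st(0)      ; st: b*b, c
        fxch  st(1)             ; st: c, b*b
        fst   qword ptr la      ; a = c
        fmul  st(0), st(0)      ; st: c*c, b*b
        faddp st(1), st(0)      ; st: c*c + b*b
        fstp  qword ptr ln      ; st: (empty)
    }
    a = la;
    b = lb;
    norm2 = ln;
#else
    // Portable fallback with the same arithmetic.
    const double c = a * a - b * b + x;
    b = (a + a) * b + y;
    a = c;
    norm2 = a * a + b * b;
#endif
}
```

The asm path performs three loads and three pops, and never holds more than two values at once, so repeated calls cannot accumulate leftovers.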

package up 2 at a time for the win ;)

    #include <emmintrin.h>

    void func(double a[2], double b[2], const double x[2], const double y[2], double norm[2])
    {
      // load args
      __m128d a_ = _mm_loadu_pd(a);
      __m128d b_ = _mm_loadu_pd(b);
      __m128d x_ = _mm_loadu_pd(x);
      __m128d y_ = _mm_loadu_pd(y);

      // compute
      __m128d aa = _mm_mul_pd(a_, a_);
      __m128d bb = _mm_mul_pd(b_, b_);
      __m128d ab = _mm_mul_pd(a_, b_);

      // c = a*a - b*b + x;
      __m128d c = _mm_add_pd(x_, _mm_sub_pd(aa, bb));

      // b = 2.0*a*b + y;
      b_ = _mm_add_pd(y_, _mm_add_pd(ab, ab));

      // a = c;
      a_ = c;

      // norm2 = a*a + b*b;
      __m128d norm2_ = _mm_add_pd(_mm_mul_pd(a_, a_), _mm_mul_pd(b_, b_));

      // store (the output parameter is named norm, not norm2)
      _mm_storeu_pd(a, a_);
      _mm_storeu_pd(b, b_);
      _mm_storeu_pd(norm, norm2_);
    }
It might be instructive to see what kind of assembly the compiler itself would generate for your code. For MSVC 2008 you can use the /FA family of switches. I turned your code into a function that takes a, b, x and y as arguments and returns norm2; feeding that to MSVC 2008 in a release build gets it to spit out:
    _TEXT   SEGMENT
    _a$ = 8                         ; size = 8
    _b$ = 16                        ; size = 8
    _x$ = 24                        ; size = 8
    _y$ = 32                        ; size = 8
    ?func@@YANNNNN@Z PROC           ; func, COMDAT
        push    ebp
        mov     ebp, esp
        fld     QWORD PTR _a$[ebp]
        fmul    QWORD PTR _a$[ebp]
        fld     QWORD PTR _b$[ebp]
        fmul    QWORD PTR _b$[ebp]
        fsubp   ST(1), ST(0)
        fadd    QWORD PTR _x$[ebp]
        fld     QWORD PTR _a$[ebp]
        fadd    ST(0), ST(0)
        fmul    QWORD PTR _b$[ebp]
        fadd    QWORD PTR _y$[ebp]
        fld     ST(1)
        fmulp   ST(2), ST(0)
        fld     ST(0)
        fmulp   ST(1), ST(0)
        faddp   ST(1), ST(0)
        pop     ebp
        ret     0

Some blank comment lines were removed for clarity.
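The source that produced this listing was not posted. A plausible reconstruction, assuming the signature implied by the mangled name ?func@@YANNNNN@Z (a __cdecl function taking four doubles and returning a double), is sketched below. The body is simply the code from the question, and the command line in the comment is an assumption about how such a listing could be generated, not what was actually run.

```cpp
// Hypothetical reconstruction of the function behind the listing.
// Possible way to get a listing (assumed): cl /c /O2 /FA func.cpp
double func(double a, double b, double x, double y) {
    double c = a * a - b * b + x;  // new real part
    b = 2.0 * a * b + y;           // new imaginary part
    a = c;
    return a * a + b * b;          // norm2
}
```

Tracing the listing, the compiler never keeps more than three values on the FPU stack, and the only value left at ret is the result in ST(0), which is how a double is returned on 32-bit x86.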
I don't want to be harsh, but writing FPU-optimized assembly code is almost always a waste of time. The FPU is so slow that hand-written inline assembly won't give you any noticeable benefit. If this function is that time-critical, I recommend using SSE intrinsics as proposed by RobTheBloke.

