
Inline assembly, c++, visual studio

General and Gameplay Programming

Started by zodiacbrave August 05, 2010 04:13 PM

5 comments, last by Christian Weis 13 years, 8 months ago

zodiacbrave

122

Author

August 05, 2010 04:13 PM

I want to do these operations using inline assembly. I'm using Visual Studio 2008 and I'm getting weird results towards the end. I think it has to do with the FPU register stack: norm2 comes out as garbage, while c, b and a are what they're supposed to be. My C code is below, with my asm beneath it.

    double a = 1.0, b = 2.0, norm2 = 0.0, two = 2.0, x = 3.0, y = 4.0, c;

    c = a*a - b*b + x;
    b = 2.0*a*b + y;
    a = c;
    norm2 = a*a + b*b;

and this is my assembly

    __asm
    {
        fld b               ; st: b
        fmul st(0), st(0)   ; st: b * b
        fld a               ; st: a, b * b
        fmul st(0), st(0)   ; st: a * a, b * b
        fsub st(0), st(1)   ; st: a * a - b * b, b * b
        fld x               ; st: x, a * a - b * b, b * b
        fadd st(0), st(1)   ; st: x + a * a - b * b, a * a - b * b, b * b
        fstp c              ; our new c  st: a * a - b * b, b * b
        fstp st(0)
        fstp st(0)          ; st:

        fld b               ; st: b
        fld a               ; st: a, b
        fmul st(0), st(1)   ; a * b, b
        fld two             ; 2, a * b, b
        fmul st(0), st(1)   ; a * b * 2, a * b, b
        fld y               ; y, a * b * 2, a * b, b
        fadd st(0), st(1)
        fstp b

        fld c
        fst a
        fmul st(0), st(0)
        fld b
        fmul st(0), st(0)
        fadd st(0), st(1)
        fstp norm2
    }

Should I be popping the stack regularly? Does it slow down the app?

Thank you

Zahlman

1,682

August 06, 2010 01:00 AM

Umm... why not just use registers?

zodiacbrave

122

Author

August 06, 2010 01:51 AM

st(0) - st(7) are the floating point registers, which is what I'm using to store the doubles.

Jan Wassenberg

1,000

August 06, 2010 02:00 AM

Quote: Umm... why not just use registers?

heh, the FPU _register_ stack was before your time? :D

Quote: Should I be popping the stack regularly? Does it slow down the app?

There is no need to pop the stack 'regularly' (in fact it helps to keep intermediate results up there that you might be able to reuse via FXCH). You must, however, avoid overflowing it.
Your code has a massive imbalance (lots of missing pops at the end). If there's a preceding inline asm block with similar behavior, you might indeed see an overflow. (IIRC, the FPU status word has a TOP (top-of-stack) field you could check.)
Incidentally, why write FPU code when SSE is pretty much universally available?
And if you must, a+a is faster than a*2, and there are instructions that pop the stack twice (FCOMPP).
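
To make the imbalance concrete, here is a rough sketch that just counts the pushes (fld) and pops (fstp) in the posted block. It is a toy model, not an emulator: X87Depth and run_posted_block are made-up names, values are ignored entirely, and it assumes the register stack is empty on entry and that nothing else touches it between runs.

```cpp
// Toy model of x87 register-stack occupancy: tracks depth only, not values.
struct X87Depth {
    int depth = 0;            // number of occupied registers, 0..8
    bool overflowed = false;  // set once a push finds all 8 registers in use

    void push() {             // models fld
        if (depth == 8) overflowed = true;  // on hardware: stack fault, loaded value is invalid
        else ++depth;
    }
    void pop() {              // models fstp
        if (depth > 0) --depth;
    }
};

// Replays the pushes and pops of the posted __asm block, in order.
// Instructions that neither push nor pop (fmul, fsub, fadd, fst) are left out.
void run_posted_block(X87Depth& s) {
    s.push(); s.push(); s.push();            // fld b, fld a, fld x
    s.pop();  s.pop();  s.pop();             // fstp c, fstp st(0), fstp st(0)
    s.push(); s.push(); s.push(); s.push();  // fld b, fld a, fld two, fld y
    s.pop();                                 // fstp b
    s.push();                                // fld c
    s.push();                                // fld b
    s.pop();                                 // fstp norm2
}
```

Under those assumptions, calling run_posted_block once leaves depth at 4, and a second call sets overflowed at the final fld b. That would be consistent with a, b and c looking fine while norm2 comes out as garbage.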


RobTheBloke

2,553

August 06, 2010 05:36 AM

package up 2 at a time for the win ;)

    #include <emmintrin.h>

    void func(double a[2], double b[2], const double x[2], const double y[2], double norm[2])
    {
      // load args (two independent values per register)
      __m128d a_ = _mm_loadu_pd(a);
      __m128d b_ = _mm_loadu_pd(b);
      __m128d x_ = _mm_loadu_pd(x);
      __m128d y_ = _mm_loadu_pd(y);

      // compute
      __m128d aa = _mm_mul_pd(a_, a_);
      __m128d bb = _mm_mul_pd(b_, b_);
      __m128d ab = _mm_mul_pd(a_, b_);

      //  c = a*a - b*b + x;
      __m128d c = _mm_add_pd(x_, _mm_sub_pd(aa, bb));

      //  b = 2.0*a*b + y;
      b_ = _mm_add_pd(y_, _mm_add_pd(ab, ab));

      // a = c;
      a_ = c;

      // norm2 = a*a + b*b;
      __m128d norm2_ = _mm_add_pd(_mm_mul_pd(a_, a_), _mm_mul_pd(b_, b_));

      // store (the output parameter is named norm, not norm2)
      _mm_storeu_pd(a, a_);
      _mm_storeu_pd(b, b_);
      _mm_storeu_pd(norm, norm2_);
    }
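
One way to sanity-check a packed routine like this is to run it next to the plain scalar code and compare lane by lane. The sketch below is only an illustration: step_scalar, step_packed and packed_matches_scalar are hypothetical names, step_packed is a condensed variant of the routine above rather than the poster's exact code, and the second lane's inputs are arbitrary values picked so every intermediate result is exactly representable. It falls back to the scalar path when SSE2 is not available at compile time.

```cpp
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Scalar reference: one step for a single (a, b, x, y) set; returns norm2.
double step_scalar(double& a, double& b, double x, double y) {
    const double c = a * a - b * b + x;
    b = 2.0 * a * b + y;
    a = c;
    return a * a + b * b;
}

// Packed step over two independent sets; updates a and b, writes norm2.
void step_packed(double a[2], double b[2], const double x[2], const double y[2],
                 double norm2[2]) {
#if HAVE_SSE2
    __m128d a_ = _mm_loadu_pd(a);
    __m128d b_ = _mm_loadu_pd(b);
    const __m128d ab = _mm_mul_pd(a_, b_);
    // c = a*a - b*b + x
    const __m128d c = _mm_add_pd(
        _mm_sub_pd(_mm_mul_pd(a_, a_), _mm_mul_pd(b_, b_)), _mm_loadu_pd(x));
    // b = 2*a*b + y, with the doubling done as ab + ab
    b_ = _mm_add_pd(_mm_add_pd(ab, ab), _mm_loadu_pd(y));
    a_ = c;
    _mm_storeu_pd(a, a_);
    _mm_storeu_pd(b, b_);
    _mm_storeu_pd(norm2, _mm_add_pd(_mm_mul_pd(a_, a_), _mm_mul_pd(b_, b_)));
#else
    // No SSE2 at compile time: process the two lanes with the scalar code.
    for (int i = 0; i < 2; ++i) norm2[i] = step_scalar(a[i], b[i], x[i], y[i]);
#endif
}

// Returns true when both lanes of the packed step match the scalar reference.
bool packed_matches_scalar() {
    // Lane 0 uses the values from the original post; lane 1 is arbitrary.
    double a[2] = {1.0, 0.5}, b[2] = {2.0, -0.25};
    const double x[2] = {3.0, 0.125}, y[2] = {4.0, 0.75};
    double ra[2] = {a[0], a[1]}, rb[2] = {b[0], b[1]};
    double norm2[2] = {0.0, 0.0};

    step_packed(a, b, x, y, norm2);
    for (int i = 0; i < 2; ++i) {
        const double expected = step_scalar(ra[i], rb[i], x[i], y[i]);
        if (a[i] != ra[i] || b[i] != rb[i] || norm2[i] != expected) return false;
    }
    return true;
}
```

With the posted inputs (a = 1, b = 2, x = 3, y = 4) in lane 0, both paths should give a = 0, b = 8 and norm2 = 64.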

SiCrane

11,840

August 06, 2010 07:55 AM

It might be instructive to see what kind of assembly the compiler itself would generate for your code. For MSVC 2008 you can use the /FA family of switches. I turned your code into a function, with a, b, x and y as arguments and norm2 as the return value, and fed it to MSVC 2008 in a release build. It spits out:

    _TEXT	SEGMENT
    _a$ = 8						; size = 8
    _b$ = 16					; size = 8
    _x$ = 24					; size = 8
    _y$ = 32					; size = 8
    ?func@@YANNNNN@Z PROC				; func, COMDAT
    	push	ebp
    	mov	ebp, esp
    	fld	QWORD PTR _a$[ebp]
    	fmul	QWORD PTR _a$[ebp]
    	fld	QWORD PTR _b$[ebp]
    	fmul	QWORD PTR _b$[ebp]
    	fsubp	ST(1), ST(0)
    	fadd	QWORD PTR _x$[ebp]
    	fld	QWORD PTR _a$[ebp]
    	fadd	ST(0), ST(0)
    	fmul	QWORD PTR _b$[ebp]
    	fadd	QWORD PTR _y$[ebp]
    	fld	ST(1)
    	fmulp	ST(2), ST(0)
    	fld	ST(0)
    	fmulp	ST(1), ST(0)
    	faddp	ST(1), ST(0)
    	pop	ebp
    	ret	0

Some blank comment lines were removed for clarity.
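
For reference, the source that produces a listing like this was presumably something along the following lines. This is a reconstruction, not the code that was actually compiled: the signature is inferred from the mangled name ?func@@YANNNNN@Z and the four 8-byte argument slots, and the body is simply the original poster's C code.

```cpp
// Hypothetical reconstruction: four doubles in, norm2 out.
double func(double a, double b, double x, double y) {
    double c = a * a - b * b + x;
    b = 2.0 * a * b + y;
    a = c;
    return a * a + b * b;  // norm2
}
```

With the values from the original post, func(1.0, 2.0, 3.0, 4.0) works out to c = 0, b = 8 and a return value of 64.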

Christian Weis

August 07, 2010 08:10 PM

I don't want to be harsh, but writing FPU-optimized assembly code is almost always a waste of time. The FPU is slow enough that hand-written inline assembly won't give you any noticeable benefit. If this function is that time-critical, I recommend using the SSE intrinsics proposed by RobTheBloke.

This topic is closed to new replies.
