# How fast really is the VC++ inline assembler?

OklyDokly    122
Hi, I'm in the process of learning assembler to help speed up my maths library, and I'm having some problems getting the VC++ inline assembler up to full speed (VC++ .NET 7). I'm using the FPU and I'm finding that this:
float result = 0.0f;
float f = 0.0f;
for (int i = 0; i < 10000; i ++)
{
result = (float)sqrt( (double)f );
f++;
}

is about 1.5 times faster than this:
for (int i = 0; i < 10000; i ++)
{
__asm
{
fmul ST(1), ST(0);
}
}

Does anyone understand why? Are there any flags I have to set in the project variables to optimise __asm somehow? Thanks

##### Share on other sites
petewood    819
Maybe it's being optimised away, if result isn't being used.
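
To check whether the loop really is being thrown away, one option is to make its result observable. Here is a minimal sketch in standard C++, assuming a C++11 compiler with `<chrono>` (newer than the VC++ .NET 7 mentioned above). `g_sink` and `time_sqrt_loop` are hypothetical names, not from the original post:

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Hypothetical global sink: a write to a volatile object is an observable
// side effect, so the compiler cannot discard the value stored here.
volatile float g_sink = 0.0f;

// Runs the sqrt loop from the original post and returns the elapsed seconds.
// The running sum is written to g_sink so the loop body stays live.
double time_sqrt_loop(int iterations)
{
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();

    float f = 0.0f;
    float sum = 0.0f;
    for (int i = 0; i < iterations; ++i)
    {
        sum += static_cast<float>(std::sqrt(static_cast<double>(f)));
        f += 1.0f;
    }
    g_sink = sum; // publish the result so the work cannot be dropped

    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Timing the `__asm` loop the same way gives a fairer comparison. A sufficiently clever compiler could still precompute the sum, so looking at the generated assembly is the only sure test.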

##### Share on other sites
OklyDokly    122
Can I turn off that optimization somehow?

##### Share on other sites
Guest Anonymous Poster
I don't quite understand how you're comparing the code - they do completely different things!

The C code takes the square root of a number that is increased each iteration. (Incidentally, a good optimiser could probably optimise this away: if it treats sqrt as an intrinsic and knows it is a one-to-one mapping, it could probably calculate the whole loop in one go. I've had MSVS.NET 2003 do this kind of thing for me.) BTW: someone remind me, is ++ a valid operation on a double? Logically I only really like it on discrete things, e.g. integers or iterators. For floating point stuff I tend to prefer x += 1.0; and I'm sure the optimiser can take care of that in the most optimal way.

The assembly code multiplies together two FPU registers that do not appear to be loaded with any content! IIRC you should get an FPU stack underflow exception, as nothing has been pushed onto the FPU stack!!

The x86 asm code that is equivalent to what you wrote in the C code's loop is
__asm {
    fld    QWORD PTR [f]
    fld1
    fld    ST(1)
    fsqrt
    fstp   DWORD PTR [result]
    fadd
    fstp   QWORD PTR [f]
}
// is equivalent to... (given float result & double f)
result = (float) sqrt( f );
f += 1.0;

BTW: you don't need to put a semicolon after each statement in asm. I believe the instruction delimiter is a new line; a semicolon indicates the start of a comment in asm.

##### Share on other sites
Ryan_001    3475
Using assembly to do simple math operations won't get you much speed-up over the standard C++ optimizer. Compiler developers are pretty smart, and compilers do most of the obvious optimizations for simple things. Understanding the FPU instruction set is good to know, but don't expect to speed up simple things like z = x + y (or x*x in this case). In fact, most likely your best implementation will be slower than the optimized version your compiler will spit out. On the other hand, real speed increases will come from using the MMX, SSE, SSE2 (and soon SSE3) instruction sets (I've regularly gotten 10x - 100x speed increases with properly written MMX and SSE assembly).

Also, simple comparisons are often misleading, due in part to the aggressive nature of optimizing compilers. Memory alignment and cache hits / misses also contribute a lot to the overall speed of execution. So testing assembly code in a theoretical case is usually useless.

The best way to write good assembly is to code an algorithm as best as possible using standard C++ (or C if that's what you're using). Once you've tweaked the algorithm out, try replacing it with a simplified assembly version. Once the assembly version produces correct results (can take a few tries), start to tweak that until it flies. The speed is there if you need it, you just have to know where to look.

##### Share on other sites
OklyDokly    122
Thanks for the information. I was originally trying to put John Carmack's InvSqrt() code into assembly, as I was curious to see if I could speed it up this way. I got halfway through writing this when I realised I had made it 3-4 times as slow as performing 1 / sqrt(). The original C version of the InvSqrt code was a lot faster than when I converted half of it into assembly.

So I developed a quick test where I multiplied two empty floating point registers, as I was wondering about the difference in speed. Performing the assembly loop at the top of this post actually seems about the same speed as the InvSqrt function, which surely shouldn't be the case. I understand now why sqrt() would be faster, due to MMX and SSE instructions, but I'm still wondering why my InvSqrt function slows down when I try converting it to assembly.

##### Share on other sites
Guest Anonymous Poster
Hmm, as far as I remember there's a post with the code & all you need to do is copy & paste the code - I assume that's where you got the idea from.

The 'carmack' variant of the Newton-Raphson method relies on the fact that everything is out of the FPU and done with the normal integer instructions... it is also highly dependent on using floats rather than doubles (it could probably be re-written to deal with doubles though).

I don't quite understand how your test would help, to be honest; to test against the carmack routine you should test it against
float result, f;
// assign values as you will...
result = 1.0f/sqrt( (double)f );

That is the equivalent code to the carmack routine & AFAIK should be slower.
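
For reference, here is a sketch of the routine usually circulated under that name, next to the plain version it should be timed against. It uses `memcpy` rather than the usual pointer cast so the bit reinterpretation is well-defined C++. The constant `0x5f3759df` is the one from the published Quake III Arena source. The function names are hypothetical, and the sketch assumes 32-bit IEEE 754 floats and positive, finite input:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Approximates 1/sqrt(x). The initial guess is built purely with integer
// operations on the float's bit pattern, then refined with one
// Newton-Raphson step.
float fast_inv_sqrt(float x)
{
    assert(x > 0.0f);
    const float half_x = 0.5f * x;

    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // view the float's bits as an integer
    bits = 0x5f3759dfu - (bits >> 1);      // integer-only initial guess
    float y;
    std::memcpy(&y, &bits, sizeof y);      // back to a float

    y = y * (1.5f - half_x * y * y);       // one Newton-Raphson iteration
    return y;
}

// The straightforward version to compare against.
float reference_inv_sqrt(float x)
{
    assert(x > 0.0f);
    return 1.0f / std::sqrt(x);
}
```

The approximation trades accuracy for speed, so any benchmark should also compare the results of the two functions, not just the timings.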

##### Share on other sites
amag    152
quote:
I was originally trying to put John Carmack's InvSqrt() code into assembly, as I was curious to see if I could speed it up this way.

Uh, I don't mean to be rude or anything, but do you really believe that yourself?
A) Carmack is after all not your average coder.
B) You're unlikely to write faster asm than the optimizer generates in any modern compiler.

Again, I don't want to take this away from you, but on modern CPUs you're unlikely to write asm that's faster than what the optimizer can generate.
Learning asm is a good thing™ but that it will "speed up your maths library" is very unlikely; more likely it will slow it down.

##### Share on other sites
Promit    13246
To hammer the point in, realize that Carmack has been coding ASM for a long, long time, particularly in the realm of extremely fast graphics calculations. He did almost nothing else during the DOS era. He knows how to optimize code, and he's really ****ing good at it.

##### Share on other sites
alnite    3436
quote:
Original post by amag
Learning asm is a good thing™ but that it will "speed up your maths library" is very unlikely, more likely it will slow it down.
[sarcasm]The same case as when you first learn VB and you will complain how slow it is. When you first learn ASM you will notice how slow it is compared to your C++ code, but since this is the ASM, you won't complain that it's slow.[/sarcasm]

In essence, every language has its own way of optimization.

[edited by - alnite on March 30, 2004 6:49:50 PM]

##### Share on other sites
OklyDokly    122
To tell you the truth, I wasn't really hoping to get much optimization out of the ASM code, just around the same speed or a tiny bit faster.

It just seemed like a good place to start, so I can later get some nice optimisations into my matrix, vector and quaternion classes, which I'm sure can be heavily optimised and will be very heavily used.

I'll have to dig out that Carmack post again and see if there's any asm in it.

##### Share on other sites
TerrorFLOP    100

I too am in the process of speeding up my math engine and have found that some of my more complex math routines can be sped up by around 1000% (10x)!!!

However, here is the gotcha:

Using the VC++ inline asm is POINTLESS for small routines... Like adding a handful of integers or floats...

Let me explain:

Consider the following code, which computes the addition of two vectors... each consisting of four 32-bit floating point numbers:

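////////////////////////////////////////////////////////////
__declspec (align (16)) union float4
{
    __m128 m;
    float v[4];
    struct
    {
        float x;
        float y;
        float z;
        float w;
    };
}; // end of union float4

inline float4 VectorAdd (float4 A, float4 B)
{
    // Component-wise add. Members are assigned one by one so the code does
    // not rely on how a brace initializer maps onto the union's first member.
    float4 result;
    result.x = A.x + B.x;
    result.y = A.y + B.y;
    result.z = A.z + B.z;
    result.w = A.w + B.w;
    return result;
}
////////////////////////////////////////////////////////////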

Now, don't worry if you don't understand some of the details... But:

__declspec (align (16))

allows data in MS VC++ 6.0 to be aligned on a 16-byte boundary, which may allow for faster computation by the processor.

And the data type '__m128' is a 128-bit data type, which basically holds four 32-bit floating point numbers, used by the SSE (Streaming 'Single Instruction Multiple Data' Extensions) on your Pentium III+ processor.

Okay, now the above code executes on my computer in around 1 nanosecond... So the VC++ compiler has done a pretty damn good job at optimization.

The equivalent asm code uses SSE, which actually allows FOUR floating point operations to be executed AT ONCE!!! I.e.:

/////////////////////////////////////////////////////////////
inline __m128 VectorAdd (__m128 A, __m128 B)
{
    __m128 result;
    __asm
    {
        movaps xmm0, A        ; load the four floats of A
        addps  xmm0, B        ; add the four floats of B in one instruction
        movaps result, xmm0   ; store the four sums
    }
    return result;
}
/////////////////////////////////////////////////////////////

Operates at EXACTLY the same speed... (Actually, it may even be a bit slower. I had to use an Intel asm 'intrinsic' function to match VC++ 6.0!!!).

So, using the inline asm for SIMPLE tasks really is pointless.
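
For comparison, the intrinsic route mentioned above could look roughly like this. It is a sketch only, assuming an x86 compiler that ships `<xmmintrin.h>` and supports C++11 `alignas` (on VC++ 6/7 you would use `__declspec(align(16))` instead). `Vec4`, `make_vec4` and `vector_add` are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h> // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps

// Four floats, aligned on a 16-byte boundary as the aligned SSE
// load/store intrinsics require.
struct Vec4
{
    alignas(16) float v[4];
};

// Small helper to build a Vec4 from four components.
Vec4 make_vec4(float x, float y, float z, float w)
{
    Vec4 r = {{x, y, z, w}};
    return r;
}

// Adds all four components with a single SSE add.
// Arguments are taken by reference so their alignment is preserved.
Vec4 vector_add(const Vec4& a, const Vec4& b)
{
    // Document the alignment precondition of _mm_load_ps.
    assert(reinterpret_cast<std::uintptr_t>(a.v) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b.v) % 16 == 0);

    Vec4 result;
    _mm_store_ps(result.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
    return result;
}
```

Unlike an `__asm` block, intrinsics leave register allocation and scheduling to the compiler, which may be why they can keep up with the compiler's own output where hand-written inline asm does not.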

However, complex math routines definitely benefit from inline asm... Especially if you use SSE, which totally craps on the FPU.

The FPU (or FPUs on some machines) IS good if you want more accuracy... but for speed... it's gotta be SSE...

I'm willing to put money on it that most games and time critical apps these days are written with SSE, SSE+ instructions wrapped up in time critical routines.

Hope this helps.

P.S. BTW... How da hell do you do CODE DUMPS on this thing???!!!

##### Share on other sites
Evil Steve    2017
<Off topic>
[ source ]
Source code here
[ /source ]
</Off topic>

##### Share on other sites
quote:
"Premature optimization is the root of all evil"
-D. Knuth

Finish your library, use it in a real app, profile the real app, find bottlenecks, do high level optimizations, if spikes are still there, maybe start thinking about asm : as others have already said, chances are you can't beat your compiler, even if it was VC6

That said, if you're doing it just for fun, and not for work, well, then have fun

[edited by - NicolasQuijano on April 1, 2004 12:34:10 PM]

##### Share on other sites
This is a bit off-topic, but to do a 32-bit square root (in other words, with floats instead of doubles) you use sqrtf() like so:

float result = sqrtf( f );

That way you don't have to do any dumb casting, and I'm sure this function is faster than the 64-bit (double) sqrt.

Likewise, there are 32-bit functions cosf, sinf, floorf, ceilf, etc.

~CGameProgrammer( );


##### Share on other sites
TerrorFLOP    100
Mmm CGameProgrammer...

Perhaps I have my compiler on a high level for errors and
warnings, but I just KNOW my compiler would complain about
the following code:

(Time to check out this code dump thingy...)

[ source ]
float result = sqrtf( f );
[ /source ]

I'd get a warning saying 'possible loss of data, converting
double to float' or something like that.

I think using a C style cast or 'static_cast' in C++ keeps the
compiler happy.

##### Share on other sites
quote:
Original post by TerrorFLOP
Mmm CGameProgrammer...

Perhaps I have my compiler on a high level for errors and
warnings, but I just KNOW my compiler would complain about
the following code:

(Time to check out this code dump thingy...)

[ source ]
float result = sqrtf( f );
[ /source ]

I'd get a warning saying 'possible loss of data, converting
double to float' or something like that.

I think using a C style cast or 'static_cast' in C++ keeps the
compiler happy.

No you won't. sqrtf takes a 32-bit float and returns a 32-bit float. Try it. That's what the "f" stands for. sqrt, on the other hand, takes a double and returns a double.

~CGameProgrammer( );
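
The point about types can be checked at compile time. This is a sketch assuming a C++11 compiler; `root32` and `root64_narrowed` are hypothetical helper names:

```cpp
#include <cassert>
#include <math.h>       // declares sqrt and sqrtf in the global namespace
#include <type_traits>

// sqrtf takes a float and returns a float, so no double is involved
// and there is nothing to narrow.
static_assert(std::is_same<decltype(sqrtf(2.0f)), float>::value,
              "sqrtf returns float");

// sqrt called with a double returns a double.
static_assert(std::is_same<decltype(sqrt(2.0)), double>::value,
              "sqrt(double) returns double");

// 32-bit square root: no casts needed.
float root32(float f)
{
    assert(f >= 0.0f);
    return sqrtf(f);
}

// 64-bit square root narrowed back to float: the explicit cast is what
// silences the "possible loss of data" warning.
float root64_narrowed(float f)
{
    assert(f >= 0.0f);
    return static_cast<float>(sqrt(static_cast<double>(f)));
}
```

Whether `sqrtf` is actually faster depends on the compiler and CPU, so that part still needs measuring.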


##### Share on other sites
OklyDokly    122
It does actually seem that operations like matrix multiplication are good ones to optimise, and everything else is best left well alone.

By the way, if you put ASM modules in your code and assembled them separately before the link, would this be faster than using the inline assembler in general, or would it be around the same speed?