How fast really is the VC++ inline assembler?

Started by
17 comments, last by OklyDokly 20 years ago
Hi I'm in the process of learning assembler to help speed up my maths library, and I'm having some problems getting the VC++ inline assembler up to full speed (VC++.NET 7). I'm using the FPU and I'm finding that this:

float result = 0.0f;
float f = 0.0f;
for (int i = 0; i < 10000; i++)
{
    result = (float)sqrt( (double)f );
    f++;
}
is about 1.5 times faster than this:

for (int i = 0; i < 10000; i++)
{
    __asm
    {
        fmul ST(1), ST(0)
    }
}
Does anyone understand why? Are there any flags I have to set in the project variables to optimise __asm somehow? Thanks
Maybe it's being optimised away, if result isn't being used.
Can I turn off that optimization somehow?
I don't quite understand how you're comparing the code - they do completely different things!

The C code takes the square root of a number that is increased per iteration. (Incidentally, a good optimiser could probably optimise this away if it is aware that sqrt {it'd probably have to deal with sqrt as an intrinsic} is a unique operation [as in a 1-to-1 mapping always]; if it did that, it could probably calculate the whole loop in one go. I've had MSVS.NET 2003 do this kind of thing for me.) BTW: someone remind me, is ++ a valid operation on a double? Logically I only really like it on discrete things, e.g. integers or iterators; for floating-point stuff I tend to prefer x += 1.0; the optimiser should be able to take care of that in the most optimal way, I'm sure.

The assembly code multiplies together 2 FPU registers that do not appear to be loaded with any content! IIRC you should get a hardware underflow exception, as the FPU has no registers allocated on the FPU stack!!

The x86 asm code that is equivalent to what you wrote in the C code's loop is

__asm
{
    fld    QWORD PTR [f]
    fld1
    fld    ST(1)
    fsqrt
    fstp   DWORD PTR [result]
    fadd
    fstp   QWORD PTR [f]
}
// is equivalent to... (given float result & double f)
result = (float) sqrt( f );
f += 1.0;


BTW: you don't need to put a semicolon after each statement in asm. I believe the instruction delimiter is a new line; a semicolon indicates the start of a comment in asm.
Using assembly to do simple math operations won't get you much speed-up over the standard C++ optimizer. Compiler developers are pretty smart and the compilers do most of the obvious optimizations for simple things. Understanding the FPU instruction set is good to know, but don't expect to speed up simple things like z = x + y (or x*x in this case). In fact, most likely your best implementation will be slower than the optimized version your compiler will spit out.

On the other hand, real speed increases will come from using the MMX, SSE, SSE2 (and soon SSE3) instruction sets (I've regularly gotten 10x - 100x speed increases with properly written MMX and SSE assembly). Also, simple comparisons are often misleading, due in part to the aggressive nature of optimizing compilers. As well, memory alignment and cache hits / misses contribute a lot to the overall speed of execution, so testing assembly code in a theoretical case is usually useless.

The best way to write good assembly is to code an algorithm as best as possible using standard C++ (or C if that's what you're using). Once you've tweaked the algorithm out, then try replacing it with a simplified assembly version. Once the assembly version produces correct results (can take a few tries), then start to tweak that until it flies. The speed is there if you need it, you just have to know where to look.
Thanks for the information. I was originally trying to put John Carmack's InvSqrt() code into assembly, as I was curious to see if I could speed it up this way. I got halfway through writing this when I realised I had made it 3-4 times as slow as performing 1 / sqrt(). The original C version of the InvSqrt code was a lot faster than when I converted half of it into assembly.

So I developed a quick test where I multiplied two empty floating-point registers, as I was wondering about the difference in speed. Performing the assembly loop at the top of this post actually seems about the same speed as the InvSqrt function, which surely shouldn't be the case. I understand now why sqrt() would be faster, due to MMX and SSE instructions, but I'm still wondering why my InvSqrt function slows down when I try converting it to assembly.
Hmm, as far as I remember there's a post with the code & all you need to do is copy & paste the code - I assume that's where you got the idea from.

The 'Carmack' variant of the Newton-Raphson method relies on the fact that everything is kept out of the FPU and done with the normal integer instructions... it is also highly dependent on using floats rather than doubles (it could probably be re-written to deal with doubles though).

I don't quite understand how your test would help, to be honest; to test against the Carmack routine you should test it against
float result, f;
// assign values as you will...
result = 1.0f / sqrt( (double)f );


That is the equivalent code to the Carmack routine & AFAIK should be slower.
quote:I was originally trying to put John Carmack's InvSqrt() code into assembly, as I was curious to see if I could speed it up this way.

Uh, I don't mean to be rude or anything, but do you really believe that yourself?
A) Carmack is after all not your average coder.
B) You're unlikely to write faster asm than the optimizer generates in any modern compiler.

Again I don't want to take this away from you, but on modern CPUs you're unlikely to write asm that's faster than what the optimizer can generate.
Learning asm is a good thing™ but that it will "speed up your maths library" is very unlikely, more likely it will slow it down.
To hammer the point in, realize that Carmack has been coding ASM for a long, long time, particularly in the realm of extremely fast graphics calculations. He did almost nothing else during the DOS era. He knows how to optimize code, and he's really ****ing good at it.
quote:Original post by amag
Learning asm is a good thing™ but that it will "speed up your maths library" is very unlikely, more likely it will slow it down.
[sarcasm]The same case as when you first learn VB and you will complain how slow it is. When you first learn ASM you will notice how slow it is compared to your C++ code, but since this is the ASM, you won't complain that it's slow.[/sarcasm]

In essence, every language has its own way of optimization.

[edited by - alnite on March 30, 2004 6:49:50 PM]

