How fast really is the VC++ inline assembler?

Started by
17 comments, last by OklyDokly 20 years ago
To tell you the truth, I wasn't really hoping to get much optimization out of the ASM code, just around the same speed or a tiny bit faster.

It just seemed like good ground to start on, so I can later get some nice optimisations into my matrix, vector and quaternion classes, which I'm sure can be heavily optimised and will be very heavily used.

I'll have to dig out that Carmack post again and see if there's any asm in it.
Please continue your pursuits on asm...

I too am in the process of speeding up my math engine, and I have found that some of my more complex math routines can be sped up by roughly 10x!!!

However, here is the gotcha:

Using the VC++ inline asm is POINTLESS for small routines... Like adding a handful of integers or floats...

Let me explain:

Consider the following code, which computes the addition of two vectors... each consisting of four 32-bit floating point numbers:

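////////////////////////////////////////////////////////////
#include <xmmintrin.h>   // declares __m128

// 16-byte aligned union: the same four floats viewed as an SSE
// value, as an array, or as named components.
__declspec (align (16)) union float4
{
    __m128 m;
    float v[4];
    struct
    {
        float x;
        float y;
        float z;
        float w;
    };
};

// Plain C++ add, one component at a time.
inline float4 VectorAdd (float4 A, float4 B)
{
    // Assign member by member: brace-initializing a union
    // would target its first member, m.
    float4 result;
    result.x = A.x + B.x;
    result.y = A.y + B.y;
    result.z = A.z + B.z;
    result.w = A.w + B.w;
    return result;
}
////////////////////////////////////////////////////////////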

Now, don't worry if you don't understand some of the details... But:

__declspec (align (16))

allows data in MS VC++ 6.0 to be aligned on a 16-byte boundary, which may allow for faster computation by the processor. (Aligned SSE moves such as movaps also require it for their memory operands.)
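As an aside on the alignment point, here is a minimal sketch of how you could check it yourself. It assumes a C++11 compiler and uses the standard `alignas(16)` in place of `__declspec (align (16))`; the names `AlignedFloat4` and `IsAligned16` are made up for illustration and are not from the code above.

```cpp
#include <cstdint>

// Standard C++11 spelling of the same request: 16-byte alignment.
struct alignas(16) AlignedFloat4
{
    float v[4];
};

// Returns true when p sits on a 16-byte boundary.
inline bool IsAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Calling `IsAligned16(&someAlignedFloat4)` on any object of that type should return true, wherever the compiler places it.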

And the data type '__m128' is a 128-bit data type, which basically holds four 32-bit floating point numbers, used by SSE (Streaming SIMD Extensions, where SIMD means 'Single Instruction, Multiple Data') on your Pentium III+ processor.

Okay, now the VectorAdd code above executes on my computer in around 1 nanosecond... So the VC++ compiler has done a pretty damn good job at optimization.

Here is the equivalent asm code, using SSE, which actually allows for FOUR floating point operations to be executed AT ONCE!!! I.e.:

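/////////////////////////////////////////////////////////////
#include <xmmintrin.h>   // declares __m128

inline __m128 VectorAdd (__m128 A, __m128 B)
{
    __m128 result;
    __asm
    {
        movaps xmm0, A        // load the four floats of A
        addps  xmm0, B        // add the four floats of B in one instruction
        movaps result, xmm0   // store the four sums
    }
    return result;
}
/////////////////////////////////////////////////////////////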

Operates at EXACTLY the same speed... (Actually, it may even be a bit slower. I had to use an Intel asm 'intrinsic' function to match VC++ 6.0!!!).

So, using the inline asm for SIMPLE tasks really is pointless.
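For reference, the intrinsic route mentioned above might look something like the sketch below. It assumes an x86 compiler that ships `<xmmintrin.h>`. `_mm_add_ps`, `_mm_loadu_ps` and `_mm_storeu_ps` are the standard SSE intrinsics, while `VectorAddIntrinsic` and `AddFloat4` are hypothetical names chosen for this example, not the code that was actually timed.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_loadu_ps, _mm_storeu_ps

// Adds four packed floats at once. The compiler picks the registers
// itself, so it can inline this like ordinary C++ code.
inline __m128 VectorAddIntrinsic(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}

// Hypothetical helper: adds two float[4] arrays and writes the sums to out.
// Unaligned loads and stores are used so the arrays need no special alignment.
inline void AddFloat4(const float* a, const float* b, float* out)
{
    const __m128 va = _mm_loadu_ps(a);
    const __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, VectorAddIntrinsic(va, vb));
}
```

The usual argument for intrinsics over an `__asm` block is that the optimizer can still see through them, which would fit with the intrinsic version being the one that matched the compiler's own output.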

However, complex math routines definitely benefit from inline asm... Especially if you use SSE, which totally craps on the FPU.

The FPU (or FPUs on some machines) IS good if you want more accuracy... but for speed... it's gotta be SSE...

I'm willing to put money on it that most games and time-critical apps these days have their time-critical routines written with SSE or later SSE instructions.

Hope this helps.

P.S. BTW... How da hell do you do CODE DUMPS on this thing???!!!
<Off topic>
[ source ]
Source code here
[ /source ]
</Off topic>
quote:"Premature optimization is the root of all evil"
-D. Knuth


Finish your library, use it in a real app, profile the real app, find the bottlenecks, and do high-level optimizations. If spikes are still there, maybe start thinking about asm: as others have already said, chances are you can't beat your compiler, even if it is VC6.

That said, if you're doing it just for fun, and not for work, well, then have fun.
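If you do end up timing an individual routine, a small harness along these lines is one way to go about it. This is only a sketch assuming a C++11 compiler with `<chrono>`; `AverageNanoseconds` is a made-up helper, and a micro-benchmark like this complements profiling the real app rather than replacing it.

```cpp
#include <chrono>

// Hypothetical helper: runs fn `iterations` times and returns the
// average wall-clock time per call in nanoseconds.
template <typename Fn>
double AverageNanoseconds(Fn fn, int iterations)
{
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        fn();
    }
    const Clock::time_point stop = Clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Keep in mind that the optimizer may remove work whose result is never used, so the function under test should write to something observable (a `volatile` variable, for instance), and figures around a nanosecond are close to the timer's own noise.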

[edited by - NicolasQuijano on April 1, 2004 12:34:10 PM]
This is a bit off-topic, but to do a 32-bit square root (in other words, with floats instead of doubles) you use sqrtf() like so:

float result = sqrtf( f );

That way you don't have to do any dumb casting, and I'm sure this function is faster than the 64-bit (double) sqrt.

Likewise, there are 32-bit functions cosf, sinf, floorf, ceilf, etc.

~CGameProgrammer( );

Mmm CGameProgrammer...

Perhaps I have my compiler on a high level for errors and warnings, but I just KNOW my compiler would complain about the following code:

(Time to check out this code dump thingy...)

[ source ]
float result = sqrtf( f );
[ /source ]

I'd get a warning saying 'possible loss of data, converting double to float' or something like that.

I think using a C-style cast or 'static_cast' in C++ keeps the compiler happy.
quote:Original post by TerrorFLOP
Mmm CGameProgrammer...

Perhaps I have my compiler on a high level for errors and warnings, but I just KNOW my compiler would complain about the following code:

(Time to check out this code dump thingy...)

[ source ]
float result = sqrtf( f );
[ /source ]

I'd get a warning saying 'possible loss of data, converting double to float' or something like that.

I think using a C-style cast or 'static_cast' in C++ keeps the compiler happy.

No, you won't. sqrtf takes a 32-bit float and returns a 32-bit float. Try it. That's what the "f" stands for. sqrt, on the other hand, takes a double and returns a double.
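If you'd rather have the compiler settle it, something like this sketch checks the return types at compile time. It assumes a C++11 compiler (for `static_assert` and `decltype`), and `Length2D` is just a made-up example function.

```cpp
#include <math.h>       // sqrtf, sqrt
#include <type_traits>  // std::is_same

// Checked at compile time: sqrtf returns float, sqrt on a double returns double.
static_assert(std::is_same<decltype(sqrtf(2.0f)), float>::value,
              "sqrtf returns float");
static_assert(std::is_same<decltype(sqrt(2.0)), double>::value,
              "sqrt(double) returns double");

// No cast is needed: argument and result are both float.
inline float Length2D(float x, float y)
{
    return sqrtf(x * x + y * y);
}
```

The 'possible loss of data' warning would be expected if the double-returning `sqrt` result were assigned to a float, which is probably where the confusion comes from.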

~CGameProgrammer( );

It does actually seem that operations like matrix multiplication are the good ones to optimise, and everything else is best left well alone.
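To give an idea of what that kind of routine might look like, here is a rough sketch of a 4x4 matrix times vector product using SSE intrinsics rather than inline asm. The `Mat4` type, its column-major layout and the `Transform` name are all assumptions made for this example, not anybody's actual engine code; the `_mm_*` calls are the standard ones from `<xmmintrin.h>`.

```cpp
#include <xmmintrin.h>  // __m128, _mm_mul_ps, _mm_add_ps, _mm_set1_ps

// Hypothetical layout: column-major 4x4 matrix, col[c] holds column c.
struct Mat4
{
    __m128 col[4];
};

// Computes M * v as v[0]*col0 + v[1]*col1 + v[2]*col2 + v[3]*col3,
// using four packed multiplies and three packed adds.
inline __m128 Transform(const Mat4& m, const float v[4])
{
    __m128 r = _mm_mul_ps(m.col[0], _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[1], _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[2], _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[3], _mm_set1_ps(v[3])));
    return r;
}
```

A scalar version of the same product needs sixteen multiplies and twelve adds, which is why this sort of routine tends to be a better candidate than a single vector add.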

By the way, if you put ASM modules in your code and assembled them separately before the link, would this be faster than using the inline assembler in general, or would it be around the same speed?
AFAIK the VC++ inline assembler doesn't touch your assembly code; it simply pushes it right on forward to the linker. A real assembler, on the other hand, may be able to apply optimizations, whereas VC++ will not even try.

Other than that, there shouldn't be any difference.

This topic is closed to new replies.
