
How fast really is the VC++ inline assembler?

OklyDokly · 2004-04-02T08:03:28

Hi, I'm in the process of learning assembler to help speed up my maths library, and I'm having some problems getting the VC++ inline assembler up to full speed (VC++ .NET 7). I'm using the FPU and I'm finding that this:

float f = 0.0f;
for (int i = 0; i < 10000; i++)
{
    result = (float)sqrt( (double)f );
    f++;
}

is about 1.5 times faster than this:

for (int i = 0; i < 10000; i++)
{
    __asm
    {
        fmul ST(1), ST(0);
    }
}

Does anyone understand why? Are there any flags I have to set in the project variables to optimise __asm somehow? Thanks


Started by OklyDokly March 30, 2004 08:13 AM

17 comments, last by OklyDokly 20 years ago

OklyDokly

122

Author

March 31, 2004 04:08 AM

To tell you the truth, I wasn't really hoping to get much optimization out of the ASM code, just around the same speed or a tiny bit faster.

It just seemed like good ground to start on, so I can later get some nice optimisations into my matrix, vector and quaternion classes, which I'm sure can be heavily optimised and will be very heavily used.

I'll have to dig out that Carmack post again and see if there's any asm in it.

TerrorFLOP

100

March 31, 2004 09:42 PM

Please continue your pursuits on asm...

I too am in the process of speeding up my math engine and have found that some of my more complex math routines can be sped up by around 10x!!!

However, here is the gotcha:

Using the VC++ inline asm is POINTLESS for small routines... Like adding a handful of integers or floats...

Let me explain:

Consider the following code, which computes the addition of two vectors, each consisting of four 32-bit floating-point numbers:

////////////////////////////////////////////////////////////
#include <xmmintrin.h> // declares __m128

// Four packed floats, 16-byte aligned as aligned SSE moves (movaps) require
__declspec (align (16)) union float4
{
    __m128 m;
    float v[4];
    struct
    {
        float x;
        float y;
        float z;
        float w;
    };
};

// Plain C++ version: four scalar additions
inline float4 VectorAdd (float4 A, float4 B)
{
    float4 result =
    { A.x + B.x, A.y + B.y, A.z + B.z, A.w + B.w };
    return result;
}
////////////////////////////////////////////////////////////

Now, don't worry if you don't understand some of the details, but:

__declspec (align (16))

allows data in MS VC++ 6.0 to be aligned on a 16-byte boundary, which may allow for faster computation by the processor.

And the data type '__m128' is a 128-bit data type, which basically holds four 32-bit floating-point numbers, used by the SSE (Streaming 'Single Instruction Multiple Data' Extensions) on your Pentium III+ processor.

Okay, now the above code executes on my computer in around 1 nanosecond... So the VC++ compiler has done a pretty damn good job at optimization.

The equivalent asm code uses SSE, which actually allows FOUR floating-point operations to be executed AT ONCE!!! I.e.:

/////////////////////////////////////////////////////////////
inline __m128 VectorAdd (__m128 A, __m128 B)
{
__m128 result;
__asm
{
movaps xmm0, A
addps xmm0, B
movaps result, xmm0
}
return result;
}
/////////////////////////////////////////////////////////////

Operates at EXACTLY the same speed... (Actually, it may even be a bit slower. I had to use an Intel asm 'intrinsic' function to match VC++ 6.0!!!).
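For anyone following along, the intrinsic route mentioned above might look something like the sketch below. It assumes a compiler that ships <xmmintrin.h>; the names VectorAddSSE and VectorAdd4 are made up for illustration and are not from the post. Unlike an __asm block, the intrinsic leaves register allocation and inlining to the compiler.

```cpp
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_add_ps, _mm_loadu_ps, _mm_storeu_ps

// Hypothetical name: adds four packed floats with a single SSE add.
inline __m128 VectorAddSSE(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}

// Hypothetical wrapper for plain float arrays of length 4.
// The unaligned load/store intrinsics are used, so the arrays
// do not have to sit on a 16-byte boundary.
inline void VectorAdd4(const float* a, const float* b, float* out)
{
    const __m128 va = _mm_loadu_ps(a);
    const __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, VectorAddSSE(va, vb));
}
```

Calling VectorAdd4 on two float[4] arrays writes the four element-wise sums to the output array. Whether this beats the plain C++ version is something to measure rather than assume.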

So, using the inline asm for SIMPLE tasks really is pointless.

However, complex math routines definitely benefit from inline asm... Especially if you use SSE, which totally craps on the FPU.

The FPU (or FPUs on some machines) IS good if you want more accuracy... but for speed... it's gotta be SSE...

I'm willing to bet that most games and time-critical apps these days use SSE (and later SSE versions) in their time-critical routines.

Hope this helps.

P.S. BTW... How da hell do you do CODE DUMPS on this thing???!!!

Evil Steve

2,021

April 01, 2004 11:03 AM

<Off topic>
[ source ]
Source code here
[ /source ]
</Off topic>

NicolasQuijano

157

April 01, 2004 11:33 AM

quote:"Premature optimization is the root of all evil"
-D. Knuth

Finish your library, use it in a real app, profile the real app, find bottlenecks, do high level optimizations, if spikes are still there, maybe start thinking about asm : as others have already said, chances are you can't beat your compiler, even if it was VC6
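A minimal timing helper for the "measure first" step might look like the sketch below. It assumes a C++11 compiler with <chrono> (much newer than the VC++ versions discussed in this thread), and AverageNanoseconds is a made-up name. A real profiler gives a far better picture; this only times one callable in isolation.

```cpp
#include <chrono>

// Hypothetical helper: calls func(i) for i in [0, iterations) and
// returns the average wall-clock time per call in nanoseconds.
template <typename Func>
double AverageNanoseconds(Func func, int iterations)
{
    using Clock = std::chrono::steady_clock;

    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i)
        func(i);
    const Clock::time_point stop = Clock::now();

    // Convert the elapsed interval to fractional nanoseconds.
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

When timing something as small as a square root, store each result in a volatile variable inside the callable, otherwise an optimising compiler may remove the work entirely and the loop measures nothing.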

That said, if you're doing it just for fun, and not for work, well, then have fun

[edited by - NicolasQuijano on April 1, 2004 12:34:10 PM]

CGameProgrammer

640

April 01, 2004 03:19 PM

This is a bit off-topic, but to do a 32-bit square root (in other words, with floats instead of doubles) you use sqrtf() like so:

float result = sqrtf( f );

That way you don't have to do any dumb casting, and I'm sure this function is faster than the 64-bit (double) sqrt.

Likewise, there are 32-bit functions cosf, sinf, floorf, ceilf, etc.

~CGameProgrammer( );


TerrorFLOP

100

April 01, 2004 09:56 PM

Mmm CGameProgrammer...

Perhaps I have my compiler on a high level for errors and
warnings, but I just KNOW my compiler would complain about
the following code:

(Time to check out this code dump thingy...)

[ source ]
float result = sqrtf( f );
[ /source ]

I'd get a warning saying 'possible loss of data, converting
double to float' or something like that.

I think using a C style cast or 'static_cast' in C++ keeps the
compiler happy.

CGameProgrammer

640

April 02, 2004 05:30 AM

quote:Original post by TerrorFLOP
Mmm CGameProgrammer...

Perhaps I have my compiler on a high level for errors and
warnings, but I just KNOW my compiler would complain about
the following code:

(Time to check out this code dump thingy...)

[ source ]
float result = sqrtf( f );
[ /source ]

I'd get a warning saying 'possible loss of data, converting
double to float' or something like that.

I think using a C style cast or 'static_cast' in C++ keeps the
compiler happy.

No you won't. sqrtf takes a 32-bit float and returns a 32-bit float. Try it. That's what the "f" stands for. sqrt, on the other hand, takes a double and returns a double.
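The return types can be checked at compile time. The sketch below assumes a C++11 compiler (for static_assert and decltype, which the compilers in this thread predate), and RootOf is a made-up name.

```cpp
#include <math.h>       // sqrt (double) and sqrtf (float)
#include <type_traits>  // std::is_same

// sqrtf takes a float and returns a float, so no cast is needed
// and there is no double-to-float conversion to warn about.
static_assert(std::is_same<decltype(sqrtf(2.0f)), float>::value,
              "sqrtf returns float");

// sqrt called with a double returns a double.
static_assert(std::is_same<decltype(sqrt(2.0)), double>::value,
              "sqrt(double) returns double");

// Hypothetical wrapper: float in, float out, no casts.
inline float RootOf(float f)
{
    return sqrtf(f);
}
```

If either assertion were false the file would fail to compile, so a clean build confirms the types.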

~CGameProgrammer( );


OklyDokly

122

Author

April 02, 2004 06:35 AM

It does actually seem that operations like matrix multiplication are the good ones to optimise, and everything else is best left well alone.
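As a rough idea of what a larger routine could look like with SSE, here is a sketch of a 4x4 matrix times 4-component vector product using intrinsics. Everything in it is an assumption for illustration: the Matrix4 type, the column-major layout, and the function name MatrixTimesVector are not from anyone's library in this thread.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical 4x4 matrix, column-major: element (row r, column c) is m[c * 4 + r].
struct Matrix4
{
    float m[16];
};

// Computes out = mat * v as a weighted sum of the matrix columns:
// out = col0 * v[0] + col1 * v[1] + col2 * v[2] + col3 * v[3].
// Each column is processed four floats at a time.
inline void MatrixTimesVector(const Matrix4& mat, const float* v, float* out)
{
    __m128 acc = _mm_mul_ps(_mm_loadu_ps(mat.m + 0), _mm_set1_ps(v[0]));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(mat.m + 4), _mm_set1_ps(v[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(mat.m + 8), _mm_set1_ps(v[2])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(mat.m + 12), _mm_set1_ps(v[3])));
    _mm_storeu_ps(out, acc);
}
```

The column-major layout is what lets each column be loaded with one instruction. With a row-major matrix the same idea needs a transpose or a different accumulation scheme.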

By the way, if you put ASM modules in your code and assembled them separately before the link, would this be faster than using the inline assembler in general, or would it be around the same speed?

Promit

13,404

April 02, 2004 08:03 AM

AFAIK the VC++ inline assembler doesn't touch your assembly code; it simply pushes it right on forward to the linker. A real assembler, on the other hand, may be able to apply optimizations, whereas VC++ will not even try.

Other than that, there shouldn't be any difference.

