Integrating C++ and assembly code

8 comments, last by RDragon1 15 years, 5 months ago
More questions! Recently I've been working on making a high-performance vector library for use in my game engine. Specifically I'm trying to make a bunch of functions use SIMD instructions in order to speed things up. Intel's reference manuals (and Agner Fog's invaluable guide) have been of great assistance here, and my question is in relation to something I found in there. I ran across the following code snippet in the Optimization Reference Manual:
movaps xmm0, [eax]
mulps xmm0, [eax+16]
movhlps xmm1, xmm0
addps xmm0, xmm1
pshufd xmm1, xmm0, 1
addss xmm0, xmm1
movss [ecx], xmm0
for SSE/SSE2, and the following:
movaps xmm0, [eax]
mulps xmm0, [eax+16]
haddps xmm0, xmm0
movaps xmm1, xmm0
psrlq xmm0, 32
addss xmm0, xmm1
movss [eax], xmm0
for SSE3. What I'd like to be able to do is take these assembly snippets and turn them into nice, standard C/C++ functions, presumably with the use of the __asm{} stuff. I went ahead and tried this using some __m128 intrinsic types and replaced the eax bits with references to these... ...and it blew up in my face. When I didn't get access violations I got some random number that wasn't at all what I had expected.

If there are any x86 assembly gurus out there, would you mind explaining the correct way to set these up? I possess some familiarity with assembly and looked over the article here on SSE2, but it hasn't helped much.

Additionally, I'd like for the assembly code to just drop the return value in whichever register it ultimately goes to and leave, but I haven't found anything on how to do this on the Interwebs. Help please?
clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.
I don't want to discourage you; keep at it if this is for kicks and for knowledge, or if you intend to invest years in the topic of optimisation :)

If not, please read further:

There is no real advantage to using inline assembly over xmmintrin.h. Besides, writing a high-performance vector library (are you talking about something like a Vector3d, or more like a vector<>, i.e. a container?) is nearly impossible unless you're a guru on the topic.

Have a look at this gem-thread: http://www.gamedev.net/community/forums/topic.asp?topic_id=437740
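
To make the xmmintrin.h suggestion concrete, here is a minimal sketch of the SSE/SSE2 sequence from the first post written with intrinsics. The name dot4 is hypothetical, and the shuffle uses shufps (via _mm_shuffle_ps) instead of the pshufd in the manual's listing; the arithmetic is the same. Treat it as an illustration under those assumptions, not a tuned implementation.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_mul_ps, _mm_movehl_ps, _mm_add_ss, ...

// Hypothetical helper: returns a0*b0 + a1*b1 + a2*b2 + a3*b3.
inline float dot4(__m128 a, __m128 b)
{
    __m128 prod  = _mm_mul_ps(a, b);             // mulps:   p0 p1 p2 p3
    __m128 high  = _mm_movehl_ps(prod, prod);    // movhlps: p2 p3 p2 p3
    __m128 sum2  = _mm_add_ps(prod, high);       // addps:   lane 0 = p0+p2, lane 1 = p1+p3
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2,    // copy lane 1 into every lane
                                  _MM_SHUFFLE(1, 1, 1, 1));
    __m128 total = _mm_add_ss(sum2, lane1);      // addss:   lane 0 = p0+p1+p2+p3
    return _mm_cvtss_f32(total);                 // extract lane 0 as a float
}
```

Because the result comes back through _mm_cvtss_f32, the compiler chooses the register that carries the return value, so there is no need to write to a register and issue a RET by hand.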
Quote:Original post by InvalidPointer

...and it blew up in my face. When I didn't get access violations I got some random number that wasn't at all what I had expected.


Is your data properly aligned?

Quote:Additionally, I'd like for the assembly code to just drop the return value in whichever register it ultimately goes to and leave, but I haven't found anything on how to do this on the Interwebs.


Did you read the calling conventions documentation and related compiler-specific documents?

Also note that using an assembly version for individual operations is very likely to be considerably slower due to function call or data conversion overhead.
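
On the alignment question: movaps, and mulps with a memory operand, fault if the address is not a multiple of 16. A small sketch of how one might declare aligned data and check a pointer before using the aligned load follows. Float4, isAligned16 and loadFloat4 are hypothetical names made up for this illustration.

```cpp
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical 16-byte-aligned vector type, for illustration only.
struct alignas(16) Float4 { float x, y, z, w; };

// True when p is a multiple of 16, i.e. safe for movaps / _mm_load_ps.
inline bool isAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Uses the aligned load when possible, otherwise the unaligned one (movups).
// The caller must guarantee that four floats are readable starting at p.
inline __m128 loadFloat4(const float* p)
{
    return isAligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}
```

A pointer to a Float4 always passes the check, but a pointer into the middle of a float array generally does not, which is one common source of the access violations described above.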
Quote:Original post by phresnel
I don't want to discourage you; keep at it if this is for kicks and for knowledge, or if you intend to invest years in the topic of optimisation :)

If not, please read further:

There is no real advantage to using inline assembly over xmmintrin.h. Besides, writing a high-performance vector library (are you talking about something like a Vector3d, or more like a vector<>, i.e. a container?) is nearly impossible unless you're a guru on the topic.

Have a look at this gem-thread: http://www.gamedev.net/community/forums/topic.asp?topic_id=437740

This is sort of an experimental foray into assembly-level optimization on my part. It's a highly-specialized vector class used for vertices, etc. Thanks for the link, though, I've checked it out and learned quite a bit.

Quote:Original post by Antheus
Quote:Original post by InvalidPointer

...and it blew up in my face. When I didn't get access violations I got some random number that wasn't at all what I had expected.


Is your data properly aligned?

Quote:Additionally, I'd like for the assembly code to just drop the return value in whichever register it ultimately goes to and leave, but I haven't found anything on how to do this on the Interwebs.


Did you read the calling conventions documentation and related compiler-specific documents?

Also note that using an assembly version for individual operations is very likely to be considerably slower due to function call or data conversion overhead.

Yes. The __m128 type is guaranteed to be 16-byte aligned. Calling conventions, etc. are something I've been learning about recently in an effort to get a better handle on optimization. I did run across something about C-style calls-- seeing as this gets used for a decent number of assembly functions in the examples I've looked at, should I consider messing with this?

And I was under the impression that using the __asm command automatically inlines whatever I write? In that case, function call overhead shouldn't be a problem, should it?
Most compilers will *NOT* inline a function that contains inline ASM.

Yes, the inline ASM is inline, but the function it's in will not be inlined.

Some advice..

With the exception of GCC, compilers are not only very poor at optimizing around inline asm blocks, they are downright paranoid about side effects within inline asm blocks to the point that they assume the entire register state has been trashed after the asm block and that the asm block might access any memory location.

An effective way to use inline asm is to replace entire loops with asm versions. Replacing just a part of the loop is counter-productive due to the above paranoia... and, because entire loops are being replaced, there is no reason not to use an external library containing them (no point inlining at all.)

Also, Microsoft's 64-bit compilers DO NOT SUPPORT inline asm at all.
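
As a rough sketch of the "replace the whole loop" idea, the routine below handles an entire array of dot products in one call, so any call overhead is paid once per batch instead of once per vector. It is written with intrinsics rather than inline asm (a deliberate swap, so it also builds on 64-bit targets), and dot4Batch plus its memory layout are assumptions made for this example.

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical batch routine: out[i] = dot(a[4i..4i+3], b[4i..4i+3]).
// Assumes a and b are 16-byte aligned and each holds 4 * count floats.
void dot4Batch(const float* a, const float* b, float* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        __m128 prod  = _mm_mul_ps(_mm_load_ps(a + 4 * i), _mm_load_ps(b + 4 * i));
        __m128 sum2  = _mm_add_ps(prod, _mm_movehl_ps(prod, prod));  // p0+p2, p1+p3
        __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
        _mm_store_ss(out + i, _mm_add_ss(sum2, lane1));              // write lane 0
    }
}
```

A routine shaped like this could equally live in a separate assembly file or library, since nothing about it depends on being inlined into the caller.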
Quote:Original post by Rockoon1
Most compilers will *NOT* inline a function that contains inline ASM.

Yes, the inline ASM is inline, but the function it's in will not be inlined.

Some advice..

With the exception of GCC, compilers are not only very poor at optimizing around inline asm blocks, they are downright paranoid about side effects within inline asm blocks to the point that they assume the entire register state has been trashed after the asm block and that the asm block might access any memory location.

An effective way to use inline asm is to replace entire loops with asm versions. Replacing just a part of the loop is counter-productive due to the above paranoia... and, because entire loops are being replaced, there is no reason not to use an external library containing them (no point inlining at all.)

Also, Microsoft's 64-bit compilers DO NOT SUPPORT inline asm at all.


Would the __forceinline convention be able to remedy some of this? As mentioned I'm just trying to tinker around with optimization, etc. and figure out how I can make stuff go fast. Compiler hints also seem like they can fix some of the paranoia issues you mentioned.

EDIT: I got it to work and shaved ~3 cycles off my previous implementation (11-12 vs 14-15). For any interested parties:
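GMForceInline float __fastcall dot3( const float4* v0, const float4* v2)
{
	register float retval;
	__asm
	{
		movaps xmm0, [ecx]      ; load *v0 (first __fastcall argument, 16-byte aligned)
		mulps xmm0, [edx]       ; multiply by *v2 (second __fastcall argument)
		movhlps xmm1, xmm0      ; copy the upper two products into the low half of xmm1
		addps xmm0, xmm1        ; lane 0 = p0+p2, lane 1 = p1+p3
		pshufd xmm1, xmm0, 1    ; move lane 1 into lane 0 of xmm1
		addss xmm0, xmm1        ; lane 0 = sum of all four products
		movss retval, xmm0      ; store the scalar result
	}
	return retval;
}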
GMForceInline is a macro that expands to __forceinline on compiler/OS configurations that support it, __inline with different configs, and inline as a fallback. I'll try and get something up for the __fastcall once I get to that phase in development. I'd still like to get that return value to be cleaner than it is right now, but I get those damned access violations whenever I try and write to eax and issue a RET instruction.

[Edited by - InvalidPointer on November 15, 2008 11:07:31 AM]
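
For readers wondering what a macro like GMForceInline might look like, here is one possible shape. The preprocessor conditions are my guess at "configurations that support it" and are not the poster's actual definition; squared is a throwaway function added only to show the macro in use.

```cpp
// Assumed selection logic: __forceinline on MSVC, __inline where the compiler
// is known to accept it (GCC-compatible compilers), plain inline otherwise.
#if defined(_MSC_VER)
    #define GMForceInline __forceinline
#elif defined(__GNUC__)
    #define GMForceInline __inline
#else
    #define GMForceInline inline
#endif

// Throwaway example function using the macro.
GMForceInline float squared(float x)
{
    return x * x;
}
```

Whichever keyword the macro picks, it remains a request to the compiler and does not guarantee that a given call is actually inlined.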
Quote:Original post by InvalidPointer
Would the __forceinline convention be able to remedy some of this?
Nope, it won't help. Force inline still only inlines functions that can be considered as inlining candidates in the first place, which a function with inline assembly isn't.
Quote: As mentioned I'm just trying to tinker around with optimization, etc. and figure out how I can make stuff go fast. Compiler hints also seem like they can fix some of the paranoia issues you mentioned.
The compiler hint is called intrinsics.

Quote:Original post by Promit
Quote:Original post by InvalidPointer
Would the __forceinline convention be able to remedy some of this?
Nope, it won't help. Force inline still only inlines functions that can be considered as inlining candidates in the first place, which a function with inline assembly isn't.
Quote: As mentioned I'm just trying to tinker around with optimization, etc. and figure out how I can make stuff go fast. Compiler hints also seem like they can fix some of the paranoia issues you mentioned.
The compiler hint is called intrinsics.

See edit. The keyword certainly did have an effect :S
Did I break anything?

EDIT: And I should probably mention that the old, 14.5-cycle function was written using intrinsics, albeit in a somewhat less efficient way than the current sequence of assembly I have here. Yay shuffles! :D
Quote:Original post by InvalidPointer
Would the __forceinline convention be able to remedy some of this?


Even in cases where it does, you cannot rely on it. Inlining has always been optional, even with the term "force" tucked in there.

I would recommend using the SSE intrinsics and letting the compiler do the register allocation. It will be able to inline the functions they're in, too.

Also, you'll likely get more gains out of things like SSE if you do 4 operations at a time (i.e. 4 dot products at once).
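
A minimal sketch of the "4 dot products at once" idea, assuming the data is rearranged so each register holds one component of four different vectors (structure-of-arrays). Vec3x4 and dot3x4 are hypothetical names for this illustration.

```cpp
#include <xmmintrin.h>

// Hypothetical structure-of-arrays layout: lane i of x, y and z together
// form the i-th 3D vector, so one Vec3x4 holds four vectors.
struct Vec3x4 { __m128 x, y, z; };

// Returns four dot products, one per lane. Only vertical multiplies and adds
// are needed; no shuffles or horizontal adds.
inline __m128 dot3x4(const Vec3x4& a, const Vec3x4& b)
{
    __m128 xx = _mm_mul_ps(a.x, b.x);
    __m128 yy = _mm_mul_ps(a.y, b.y);
    __m128 zz = _mm_mul_ps(a.z, b.z);
    return _mm_add_ps(_mm_add_ps(xx, yy), zz);
}
```

Compared with the single dot product earlier in the thread, every lane does useful work here and the movhlps/pshufd reduction steps disappear, which is where most of the gain would be expected to come from.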

This topic is closed to new replies.
