using inline assembly for msvc AND gcc


11 replies to this topic

#1 TiPiou   Members   -  Reputation: 161


Posted 02 April 2012 - 03:28 AM

G'day everyone,

I wish to make my project runnable under Windows (DX client, server) and Linux (server only).
Target architectures are x86, x86-64 recommended.
I have a few functions, mostly related to fixed-point arithmetic, which would benefit from inline assembly.

But MSVC and GCC declare an asm block very differently, not to mention that the asm syntax itself differs (Intel syntax for MSVC, AT&T syntax by default for GCC).
I was only able to think of the following workaround:

1) declare each asm instruction to be used with a #define, probably one for each combination of operand size/type
2) use those macros one by one in the c++ code
3) pray

macro declaration example :

// COMPILER_MSVC or COMPILER_GCC are defined in some other header...

#if (defined COMPILER_MSVC)

#define ASM86_MOV_R32toR32(dst, src)	__asm { mov dst, src }
#define ASM86_MOV_R32toV32(dst, src)	__asm { mov dst, src }
#define ASM86_MOV_V32toR32(dst, src)	__asm { mov dst, src }

// etc.

#elif (defined COMPILER_GCC)

// '#' stringizes a macro parameter; adjacent string literals are then concatenated.
#define ASM86_MOV_R32toR32(dst, src)	asm("mov %%" #src ", %%" #dst ";" : : : "%" #dst)
#define ASM86_MOV_R32toV32(dst, src)	asm("mov %%" #src ", %0;" : "=r"(dst) : : )
#define ASM86_MOV_V32toR32(dst, src)	asm("mov %0, %%" #dst ";" : : "r"(src) : "%" #dst)

// etc.

#endif

usage example :

// int64 and uint64 are typedef'd in some other header...

// Returns the encoding nr of a value vr such that : vr=va/vb,
//   given na and nb, respectively the encodings of two values va and vb.
//   'Scalar' type is 64b fixed point with 30b after point.
//   => vr = nr.2^-30 ; va = na.2^-30 ; vb = nb.2^-30
inline int64 ScalarFixedImpl::div(int64 na, int64 nb) {
	bool negA = na<0;
	bool negB = nb<0;

#ifdef ARCH_64

	uint64 absna = negA? static_cast<uint64>(-na) : static_cast<uint64>(na);
	uint64 absnb = negB? static_cast<uint64>(-nb) : static_cast<uint64>(nb);
	uint64 absResult;
	
	ASM8664_XORQ_R64(rdx, rdx);	 	 	 // zeroes rdx
	ASM8664_MOVQ_V64toR64(rax, absna);	 	 // puts na in rax => rdx:rax = 0:na
	ASM8664_DIVQ_V64(absnb);	 	 	 // rdx = rdx:rax mod nb = na mod nb, i.e the first rest (r)
	 	 	 	 	 	 	 // rax = rdx:rax div nb = na div nb, i.e the first quotient (q1)
	ASM8664_SHLQ_R64(rax, 30);	 	 	 // rax = q1<<30
	ASM8664_MOVQ_R64toV64(absResult, rax);	 	 // stores q1<<30 in absResult
	ASM8664_XORQ_R64(rax, rax);	 	 	 // zeroes rax => rdx:rax = r:0
	ASM8664_DIVQ_V64(absnb);	 	 	 // rdx = rdx:rax mod nb = (r<<64) mod nb, i.e the second rest (discarded)
	 	 	 	 	 	 	 // rax = rdx:rax div nb = (r<<64) div nb, i.e the second quotient (q2)
	ASM8664_SHRQ_R64(rax, 34);	 	 	 // rax = q2>>34 --- logical shift to leave zeroes in left bits
	ASM8664_ORQ_R64toV64(absResult, rax);	 	 // absResult = absResult|rax = q1<<30 | q2>>34 = (uint64) q1.q2 with a 30b fixed point.

	return (negA^negB)? -static_cast<int64>(absResult) : static_cast<int64>(absResult);
	
#else

	// here comes the pain...
	// assert takes a single expression, so the message is folded into it
	assert(false && "ScalarFixedImpl::div : not yet implemented for 32b");
	return 0;
	
#endif
	
}
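
One caveat with the one-macro-per-instruction approach on GCC: the compiler does not promise that rax or rdx keep their values between two separate asm statements, so a sequence like the one above can break once optimization is on. A more robust pattern is a single extended-asm statement whose constraints tell the compiler which registers are used. Below is a minimal sketch under stated assumptions: GCC or Clang, x86-64 for the asm path, and `div128by64` / `fixedDiv30` are hypothetical names, not from the code above. It also swaps the two-step division for a single 128-by-64 division of (na << 30), which should give the same truncated result.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: divides the 128-bit value hi:lo by d.
// Precondition: hi < d, otherwise the quotient does not fit in 64 bits
// (and the x86-64 DIV instruction would raise a divide error).
inline uint64_t div128by64(uint64_t hi, uint64_t lo, uint64_t d, uint64_t& rem) {
    assert(d != 0 && hi < d);
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t q;
    // One statement: inputs arrive in rax/rdx, outputs leave in rax/rdx.
    asm("divq %[divisor]"
        : "=a"(q), "=d"(rem)
        : "0"(lo), "1"(hi), [divisor] "rm"(d)
        : "cc");
    return q;
#else
    // Fallback for other GCC/Clang 64-bit targets (assumes unsigned __int128).
    unsigned __int128 n = (static_cast<unsigned __int128>(hi) << 64) | lo;
    rem = static_cast<uint64_t>(n % d);
    return static_cast<uint64_t>(n / d);
#endif
}

// Fixed-point division with 30 fractional bits: returns (na << 30) / nb,
// truncated toward zero. Assumes the result fits in 63 bits.
inline int64_t fixedDiv30(int64_t na, int64_t nb) {
    const bool negative = (na < 0) != (nb < 0);
    // Unsigned negation avoids overflow on the most negative value.
    const uint64_t a = na < 0 ? 0 - static_cast<uint64_t>(na) : static_cast<uint64_t>(na);
    const uint64_t b = nb < 0 ? 0 - static_cast<uint64_t>(nb) : static_cast<uint64_t>(nb);
    uint64_t rem = 0;
    // (a << 30) as a 128-bit value: high word a >> 34, low word a << 30.
    const uint64_t q = div128by64(a >> 34, a << 30, b, rem);
    return negative ? -static_cast<int64_t>(q) : static_cast<int64_t>(q);
}
```

For example, dividing the encoding of 3.0 by the encoding of 2.0 should give the encoding of 1.5, i.e. `fixedDiv30(3LL << 30, 2LL << 30) == (3LL << 29)`.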

So, here are my questions ^^ :

- Will that work?
- Will both compilers be able to optimize, say, the use of the input variables (not the asm itself, of course)?
- Can you think of another solution?
I guess a few people out there won't be able to refrain from answering the unspoken question: "Is it worth the pain?"
As I am a very kind person, here you are:
- Is it worth the pain?

Thanks in advance!


#2 turch   Members   -  Reputation: 581


Posted 02 April 2012 - 07:04 AM

- Will that work ?


Yes.

- Will both compilers be able to optimize, say, the use of the input variables (not the asm itself, of course) ?


Using the preprocessor will not affect the compilers' ability to optimize. The preprocessor is run before any actual compilation / optimization.

- Do you think of another solution ?
- Is it worth the pain ?


There are more complex solutions, but they should be left to higher level, bigger differences - such as having one rendering interface with different implementations for OGL / DX9 / DX11. The way you describe is fairly standard for "smaller" stuff like asm, timing, async file io, and a bunch of other stuff that is effectively the same but differs primarily in semantics.

#3 Evil Steve   Moderators   -  Reputation: 1918


Posted 02 April 2012 - 07:45 AM

I'd put all of the assembly code in one file, and #include that file based on the compiler. E.g.:
#if defined(_MSC_VER)
#include "inline_asm_msvc"
#elif defined(__GNUC__) // GCC's predefined macro (Clang defines it too)
#include "inline_asm_gcc"
#else
#error Unsupported compiler!
#endif

Steve Macpherson
Senior programmer, Firebrand Games
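
The compiler switch above can be centralized in one small header. This is a rough sketch, assuming the `COMPILER_MSVC`, `COMPILER_GCC` and `ARCH_64` names from the first post; `targetDescription` is a hypothetical helper added only so the result can be inspected. `_MSC_VER`, `__GNUC__`, `_M_X64` and `__x86_64__` are the compilers' own predefined macros.

```cpp
#include <cassert>

// Compiler detection: exactly one COMPILER_* macro is defined when supported.
#if defined(_MSC_VER)
#  define COMPILER_MSVC 1
#elif defined(__GNUC__)
#  define COMPILER_GCC 1
#endif

// Architecture detection for 64-bit x86.
#if defined(_M_X64) || defined(__x86_64__)
#  define ARCH_64 1
#endif

// Hypothetical helper: names the configuration that was detected.
inline const char* targetDescription() {
#if defined(COMPILER_MSVC) && defined(ARCH_64)
    return "MSVC x64 (no inline asm: use intrinsics or a separate .asm file)";
#elif defined(COMPILER_MSVC)
    return "MSVC x86 (__asm blocks available)";
#elif defined(COMPILER_GCC) && defined(ARCH_64)
    return "GCC-compatible x86-64 (extended asm available)";
#elif defined(COMPILER_GCC)
    return "GCC-compatible, other architecture";
#else
    return "unknown compiler: use the plain C++ path";
#endif
}
```

Code that needs a per-compiler variant can then test `COMPILER_MSVC` / `COMPILER_GCC` instead of repeating the raw checks.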


#4 TiPiou   Members   -  Reputation: 161


Posted 02 April 2012 - 08:16 AM

thanks for feedback :)

Using the preprocessor will not affect the compilers' ability to optimize. The preprocessor is run before any actual compilation / optimization.

Ah, yes. I guess my question was misleading. What I meant is, the MSVC syntax of "__asm mov dst, src" is left open enough, imho, for the compiler to optimize where to put/retrieve any C-style argument effectively. I wasn't so sure about the whole messy gcc alternative.

The way you describe is fairly standard for "smaller" stuff like asm [...]

Well, I'm relieved if this is standard stuff then, I wasn't able to google any hit about gcc vs msvc asm instruction-by-instruction macro ^^

@EvilSteve : I guess this is a matter of taste, but I may very well end up with something like what u described ;) Thanks !

#5 turch   Members   -  Reputation: 581


Posted 02 April 2012 - 08:24 AM


Using the preprocessor will not affect the compilers' ability to optimize. The preprocessor is run before any actual compilation / optimization.

Ah, yes. I guess my question was misleading. What I meant is, the MSVC syntax of "__asm mov dst, src" is left open enough, imho, for the compiler to optimize where to put/retrieve any C-style argument effectively. I wasn't so sure about the whole messy gcc alternative.


Ah, well I wouldn't be able to comment on that, I've never really used gcc for any assembly :)


The way you describe is fairly standard for "smaller" stuff like asm [...]

Well, I'm relieved if this is standard stuff then, I wasn't able to google any hit about gcc vs msvc asm instruction-by-instruction macro ^^


I just say it's standard based on my own experience of having worked with a fairly decent number of multiplatform projects which do this (and of course using it myself).

@EvilSteve : I guess this is a matter of taste, but I may very well end up with something like what u described ;) Thanks !


Yeah, taste and the case at hand. If I have a decently sized class with one or two functions optimized with assembly, I use in-file defines. If it's something where every function or most of the functions are different, then I do a conditional include the way EvilSteve posted.

#6 Cornstalks   GDNet+   -  Reputation: 5613


Posted 02 April 2012 - 08:26 AM

@EvilSteve : I guess this is a matter of taste, but I may very well end up with something like what u described ;) Thanks !

One benefit of doing it EvilSteve's way is that if you compile it for another architecture or with another compiler you don't support, you can include and use general-purpose C++ code to accomplish your goal. (Well, I should say you could still support all these compilers and architectures without using separate files, but it'd be a huge mess.)

Something like:
#if defined(_MSC_VER)
#include "inline_asm_msvc"
#elif defined(__GNUC__) // GCC's predefined macro (Clang defines it too)
#include "inline_asm_gcc"
#else
#include "c++_version"
#endif

Or you could mix the two solutions:

asm_file
#include "asm_macro_defs"
        uint64 absna = negA? static_cast<uint64>(-na) : static_cast<uint64>(na);
        uint64 absnb = negB? static_cast<uint64>(-nb) : static_cast<uint64>(nb);
        uint64 absResult;

        ASM8664_XORQ_R64(rdx, rdx);                      // zeroes rdx
        ASM8664_MOVQ_V64toR64(rax, absna);               // puts na in rax => rdx:rax = 0:na
        ASM8664_DIVQ_V64(absnb);                         // rdx = rdx:rax mod nb = na mod nb, i.e the first rest (r)
                                                         // rax = rdx:rax div nb = na div nb, i.e the first quotient (q1)
        ASM8664_SHLQ_R64(rax, 30);                       // rax = q1<<30
        ASM8664_MOVQ_R64toV64(absResult, rax);           // stores q1<<30 in absResult
        ASM8664_XORQ_R64(rax, rax);                      // zeroes rax => rdx:rax = r:0
        ASM8664_DIVQ_V64(absnb);                         // rdx = rdx:rax mod nb = (r<<64) mod nb, i.e the second rest (discarded)
                                                         // rax = rdx:rax div nb = (r<<64) div nb, i.e the second quotient (q2)
        ASM8664_SHRQ_R64(rax, 34);                       // rax = q2>>34 --- logical shift to leave zeroes in left bits
        ASM8664_ORQ_R64toV64(absResult, rax);            // absResult = absResult|rax = q1<<30 | q2>>34 = (uint64) q1.q2 with a 30b fixed point.

        return (negA^negB)? -static_cast<int64>(absResult) : static_cast<int64>(absResult);
#include "asm_macro_undefs"

And then:
#if (SUPPORTED_ARCH)
#include "asm_file"
#else
#include "c++_replacement"
#endif
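
As an illustration of what the "c++_replacement" path might contain, here is a sketch of a portable 128-by-64 division using plain shift-and-subtract long division. It is slower than a hardware DIV but needs no inline assembly, no intrinsics and no 128-bit integer type. `div128by64Portable` is a hypothetical name, and the algorithm is a generic one rather than anything posted in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portable fallback: divides the 128-bit value hi:lo by d,
// one bit per iteration. Precondition: hi < d, so the quotient fits in 64 bits.
inline uint64_t div128by64Portable(uint64_t hi, uint64_t lo, uint64_t d, uint64_t& rem) {
    assert(d != 0 && hi < d);
    uint64_t q = 0;
    uint64_t r = hi;  // running remainder, always < d at the top of the loop
    for (int i = 63; i >= 0; --i) {
        // Bit shifted out of r: if set, the true remainder exceeds 64 bits
        // and is therefore certainly >= d.
        const bool carry = (r >> 63) != 0;
        r = (r << 1) | ((lo >> i) & 1u);
        q <<= 1;
        if (carry || r >= d) {
            r -= d;  // unsigned wraparound yields the correct value when carry is set
            q |= 1u;
        }
    }
    rem = r;
    return q;
}
```

With this in place, a fixed-point divide with 30 fractional bits could be written as `div128by64Portable(a >> 34, a << 30, b, rem)` on the absolute values, with the sign applied afterwards.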


#7 Adam_42   Members   -  Reputation: 1454


Posted 02 April 2012 - 09:31 AM

There's another option - use yasm to compile your assembly - it will output both MSVC and gcc friendly object files.

It also solves the problem that MSVC doesn't support inline assembly at all in x64 code.

However I'd also recommend that you don't bother with assembly at all and use plain C++ code instead. The optimizer usually doesn't generate code that is so slow it's worth hand optimizing using assembly. It can be worthwhile using intrinsics though.
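
To make the intrinsics suggestion concrete, here is a rough sketch of a 64x64-to-128-bit multiply followed by a 30-bit shift, which is the core of a fixed-point multiply with 30 fractional bits. It assumes MSVC targeting x64 (where `_umul128` and `__shiftright128` from `<intrin.h>` are available) or a GCC/Clang 64-bit target with `unsigned __int128`. The names `mulShift30` and `fixedMul30` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
// MSVC x64: full 128-bit product via intrinsics, no inline asm needed.
inline uint64_t mulShift30(uint64_t a, uint64_t b) {
    uint64_t hi = 0;
    const uint64_t lo = _umul128(a, b, &hi);
    return __shiftright128(lo, hi, 30);  // bits 30..93 of the product
}
#else
// GCC/Clang on 64-bit targets: the compiler's 128-bit integer type.
inline uint64_t mulShift30(uint64_t a, uint64_t b) {
    const unsigned __int128 product = static_cast<unsigned __int128>(a) * b;
    return static_cast<uint64_t>(product >> 30);
}
#endif

// Signed fixed-point multiply, 30 fractional bits, truncating toward zero.
// Assumes the result fits in 63 bits.
inline int64_t fixedMul30(int64_t na, int64_t nb) {
    const bool negative = (na < 0) != (nb < 0);
    const uint64_t a = na < 0 ? 0 - static_cast<uint64_t>(na) : static_cast<uint64_t>(na);
    const uint64_t b = nb < 0 ? 0 - static_cast<uint64_t>(nb) : static_cast<uint64_t>(nb);
    const uint64_t magnitude = mulShift30(a, b);
    assert(magnitude <= static_cast<uint64_t>(INT64_MAX));
    return negative ? -static_cast<int64_t>(magnitude) : static_cast<int64_t>(magnitude);
}
```

For example, `fixedMul30(3LL << 29, 1LL << 29)` multiplies the encodings of 1.5 and 0.5 and should return `3LL << 28`, the encoding of 0.75.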

#8 TiPiou   Members   -  Reputation: 161


Posted 02 April 2012 - 10:01 AM

There's another option - use yasm to compile your assembly - it will output both MSVC and gcc friendly object files.

I did not wish to use external compiled asm, thus avoiding the call :
1 - for performance reasons.
2 - if I'm correct, gcc and msvc don't have the exact same protocol as to where to put a function argument, especially in 64b, and this seemed a pain for me to work around.

It also solves the problem that MSVC doesn't support inline assembly at all in x64 code.

Seriously? ... Okay, this is bad news for me.

However I'd also recommend that you don't bother with assembly at all and use plain C++ code instead. The optimizer usually doesn't generate code that is so slow it's worth hand optimizing using assembly. It can be worthwhile using intrinsics though.

I am not trying to optimize code that can be written in C. Assembly was the only way I could think of to initialize the high-part register used by the div (resp. divq) asm instruction. It's also easier to get the remainder from div, or the high part of a mul result, directly. Although for that last point, I am aware that some (compiler-dependent) implementations exist already.

#9 Adam_42   Members   -  Reputation: 1454


Posted 02 April 2012 - 11:15 AM

The compiler will usually do something sensible if you simply use a 64-bit integer to do the calculations. I'd try that first.

For MSVC you can use Int32x32To64() to get the full result from the multiply (there's also a UInt version). That will get inlined and should be similar performance to assembly. There's also MulDiv() which may help.

#10 TiPiou   Members   -  Reputation: 161


Posted 02 April 2012 - 12:10 PM

I'd try that first.

Oh, I did ^^
I even did more than that, as I came up with working C algorithms, which a merely "sensible" compiler would not have. The trouble with fixed-point arithmetic is that, to get things done, the compiler would need to be more than just sensible. It would need to know my goal in advance and have it ready for me.
Which maybe we'll get in compilers and computers in general by 2051... but then I'll be hobbyless and unemployed :(

Well. I mean, to multiply fixed-point numbers, you can't just mul the integers. You need at least a shift, and if you don't use the high register of the mul result, you need more than that.
And div is harder.

So, I have a working C solution (for mul), which could be improved, and a working but slow and imprecise C solution (for div), which could be greatly improved. I even profiled them :P (and yes, my div would indeed greatly benefit from a speedup, not to mention the precision boost)

Btw, operands are 64b each. Multiplication outputs and division inputs should then be 128b (as provided by x86-64 asm) to allow the most efficient algorithms.

But I'm by no means an asm (or fixed-point, for that matter) expert, so if someone has a solution to that problem, apart from blackmailing a C committee guy into adding native support for an x86-64 feature (hey, to hell with SPARC after all :P), then I'm all ears :)

#11 Adam_42   Members   -  Reputation: 1454


Posted 02 April 2012 - 01:15 PM

I was assuming your fixed point values were 32-bit, in which case a 64-bit multiply followed by a shift is all that's needed.
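
For the 32-bit case described here, a sketch in plain C++ could look like the following. It assumes a 16.16 format purely for illustration (the thread does not state a 32-bit layout), and `fixedMul16` / `fixedDiv16` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for this sketch: 32-bit value with 16 fractional bits.
constexpr int kFracBits = 16;

// Multiply: widen to 64 bits, multiply, then shift the extra fraction away.
// Note: right-shifting a negative value is arithmetic on the mainstream
// compilers, but only guaranteed by the standard from C++20 onward.
inline int32_t fixedMul16(int32_t a, int32_t b) {
    const int64_t wide = static_cast<int64_t>(a) * b;
    return static_cast<int32_t>(wide >> kFracBits);
}

// Divide: pre-scale the dividend in 64 bits so the quotient keeps its fraction.
inline int32_t fixedDiv16(int32_t a, int32_t b) {
    assert(b != 0);
    const int64_t scaled = static_cast<int64_t>(a) * (int64_t{1} << kFracBits);
    return static_cast<int32_t>(scaled / b);
}
```

The same idea applied to 64-bit operands is what calls for a 128-bit intermediate, which is where the compilers start to differ.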

gcc supports 128-bit integers on 64-bit targets, which should make 64-bit fixed point straightforward there. However, MSVC doesn't support them at all.

You could use a library like GMP to handle it for you.
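
On the gcc side, the 128-bit route could be sketched as below. This assumes GCC or Clang on a 64-bit target, where `__int128` is available; it will not compile with MSVC. `fixedDiv30Wide` is a hypothetical name. One thing to be aware of: the compiler typically lowers a 128-bit division to a runtime library call rather than a single DIV instruction, so it is worth profiling before relying on it.

```cpp
#include <cassert>
#include <cstdint>

// Fixed-point division with 30 fractional bits using the compiler's
// 128-bit integer type. Returns (na << 30) / nb, truncated toward zero.
// Assumes the result fits in an int64_t.
inline int64_t fixedDiv30Wide(int64_t na, int64_t nb) {
    assert(nb != 0);
    // Multiply instead of shifting so negative dividends stay well defined.
    const __int128 numerator =
        static_cast<__int128>(na) * (static_cast<__int128>(1) << 30);
    return static_cast<int64_t>(numerator / nb);
}
```

For instance, `fixedDiv30Wide(3LL << 30, 2LL << 30)` divides 3.0 by 2.0 and should return `3LL << 29`, the encoding of 1.5.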

#12 TiPiou   Members   -  Reputation: 161


Posted 02 April 2012 - 01:43 PM

I just google-hit a bunch of questions related to this same issue...

One of the answers was to hard-code the corresponding machine code as data and call it through a brutal cast to a function pointer, using __fastcall macros and the like... this seems quite nasty to me ^^
Well, if I use this, I guess I'm in for #ifdef nonetheless, as gcc won't have the same calling-convention macros or put its arguments in the same registers... but it's a start.
For the curious (or the insane) : http://stackoverflow...sic-in-visual-c
But then, I fail to see the difference with Adam_42's advice to assemble it first and call it as extern... which may very well be my final choice :x

@Adam_42 : I guess there are some 128-bit oddities in msvc, but they won't work with operators such as div; I dunno if it's the same with gcc? Maybe not, but even so, there is a risk that "128b / 128b" expands to a lot of unnecessarily complex code (dividing by something larger than one register is no trivial matter)... when I only need a native x86-64 "128b / 64b". (As an example of this, dividing a "long long" by a "long" on 32-bit architectures is treated as "long long" div "long long", which is a horrid operation.)
Thanks for your answers anyway!

[Edited for readability]



