# using inline assembly for msvc AND gcc

This topic is 2179 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

g'day everyone

I wish to make my project runnable under Windows (DX client, server) and Linux (server only).
Target architectures are x86, x86-64 recommended.
I have a few functions, mostly those related to fixed-point arithmetic, which would benefit from inline assembly.

But MSVC and GCC are very different in how an asm block is declared, not to mention the asm syntax itself...
I was only able to think of the following workaround :

1) declare each asm instruction to be used with a #define, probably one for each combination of operand sizes/types
2) use those macros one by one in the c++ code
3) pray

macro declaration example :

    // COMPILER_MSVC or COMPILER_GCC are defined in some other header...
    #if (defined COMPILER_MSVC)
    #define ASM86_MOV_R32toR32(dst, src) __asm { mov dst, src }
    #define ASM86_MOV_R32toV32(dst, src) __asm { mov dst, src }
    #define ASM86_MOV_V32toR32(dst, src) __asm { mov dst, src }
    // etc.
    #elif (defined COMPILER_GCC)
    // #param stringizes a macro argument; adjacent string literals are then concatenated
    #define ASM86_MOV_R32toR32(dst, src) asm("mov %%" #src ", %%" #dst ";" : : : "%" #dst)
    #define ASM86_MOV_R32toV32(dst, src) asm("mov %%" #src ", %0;" : "=r"(dst) : : )
    #define ASM86_MOV_V32toR32(dst, src) asm("mov %0, %%" #dst ";" : : "r"(src) : "%" #dst)
    // etc.
    #endif

usage example :

    // int64 and uint64 are typedef'd in some other header...

    // Returns the encoding nr of a value vr such as : vr=va/vb,
    // given na and nb, respectively the encodings of two values va and vb.
    // 'Scalar' type is 64b fixed point with 30b after point.
    // => vr = nr.2^-30 ; va = na.2^-30 ; vb = nb.2^-30
    inline int64 ScalarFixedImpl::div(int64 na, int64 nb)
    {
        bool negA = na<0;
        bool negB = nb<0;
    #ifdef ARCH_64
        uint64 absna = negA? static_cast<uint64>(-na) : static_cast<uint64>(na);
        uint64 absnb = negB? static_cast<uint64>(-nb) : static_cast<uint64>(nb);
        uint64 absResult;
        ASM8664_XORQ_R64(rdx, rdx);             // zeroes rdx
        ASM8664_MOVQ_V64toR64(rax, absna);      // puts na in rax => rdx:rax = 0:na
        ASM8664_DIVQ_V64(absnb);                // rdx = rdx:rax mod nb = na mod nb, i.e the first rest (r)
                                                // rax = rdx:rax div nb = na div nb, i.e the first quotient (q1)
        ASM8664_SHLQ_R64(rax, 30);              // rax = q1<<30
        ASM8664_MOVQ_R64toV64(absResult, rax);  // stores q1<<30 in absResult
        ASM8664_XORQ_R64(rax, rax);             // zeroes rax => rdx:rax = r:0
        ASM8664_DIVQ_V64(absnb);                // rdx = rdx:rax mod nb = (r<<64) mod nb, i.e the second rest (discarded)
                                                // rax = rdx:rax div nb = (r<<64) div nb, i.e the second quotient (q2)
        ASM8664_SHRQ_R64(rax, 34);              // rax = q2>>34 --- logical shift to leave zeroes in left bits
        ASM8664_ORQ_R64toV64(absResult, rax);   // absResult = absResult|rax = q1<<30 | q2>>34 = (uint64) q1.q2 with a 30b fixed point.
        return (negA^negB)? -static_cast<int64>(absResult) : static_cast<int64>(absResult);
    #else
        // here comes the pain...
        assert(false, "ScalarFixedImpl::div : not yet implemented for 32b");
    #endif
    }

So, here are my questions ^^ :

- Will that work ?
- Will both compilers be able to optimize, say, the use of the input variables (not the asm itself, of course) ?
- Do you think of another solution ?
I guess a few guys out there won't be able to refrain from answering the unspoken question : "Is it worth the pain ?"
As I am a very kind person, here you are :
- Is it worth the pain ?

##### Share on other sites

- Will that work ?

Yes.

- Will both compilers be able to optimize, say, the use of the input variables (not the asm itself, of course) ?

Using the preprocessor will not affect the compilers' ability to optimize. The preprocessor is run before any actual compilation / optimization.

- Do you think of another solution ?
- Is it worth the pain ?

There are more complex solutions, but those are better kept for higher-level, bigger differences, such as having one rendering interface with different implementations for OGL / DX9 / DX11. The way you describe is fairly standard for "smaller" stuff like asm, timing, async file I/O, and a bunch of other things that are effectively the same but differ primarily in semantics.

##### Share on other sites
I'd put all of the assembly code in one file, and #include that file based on the compiler. E.g.:
    #if _MSC_VER
    #include "inline_asm_msvc"
    #elif GCC_VER // Or whatever #define GCC uses
    #include "inline_asm_gcc"
    #else
    #error Unsupported compiler!
    #endif

##### Share on other sites
thanks for feedback

Using the preprocessor will not affect the compilers' ability to optimize. The preprocessor is run before any actual compilation / optimization.

Ah. yes. I guess my question was misleading. What I meant is, MSVC syntax of "__asm mov dst, src" is left open enough, imho, to be optimizable by the compiler as to where to put/retrieve any c-style argument effectively. I wasn't so sure about the whole messy gcc alternative.

The way you describe is fairly standard for "smaller" stuff like asm [...]

Well, I'm relieved if this is standard stuff then, I wasn't able to google any hit about gcc vs msvc asm instruction-by-instruction macro ^^

@EvilSteve : I guess this is a matter of taste, but I may very well end up with something like what u described ;) Thanks !

##### Share on other sites

[quote name='turch' timestamp='1333371854' post='4927469']
Using the preprocessor will not affect the compilers' ability to optimize. The preprocessor is run before any actual compilation / optimization.

Ah. yes. I guess my question was misleading. What I meant is, MSVC syntax of "__asm mov dst, src" is left open enough, imho, to be optimizable by the compiler as to where to put/retrieve any c-style argument effectively. I wasn't so sure about the whole messy gcc alternative.
[/quote]

Ah, well I wouldn't be able to comment on that, I've never really used gcc for any assembly

[quote name='turch' timestamp='1333371854' post='4927469']
The way you describe is fairly standard for "smaller" stuff like asm [...]

Well, I'm relieved if this is standard stuff then, I wasn't able to google any hit about gcc vs msvc asm instruction-by-instruction macro ^^
[/quote]

I just say it's standard based on my own experience of having worked with a fairly decent number of multiplatform projects which do this (and of course using it myself).

@EvilSteve : I guess this is a matter of taste, but I may very well end up with something like what u described ;) Thanks !

Yeah, taste and the case at hand. If I have a decently sized class with one or two functions optimized with assembly, I use in-file defines. If it's something where every function or most of the functions are different, then I do a conditional include the way EvilSteve posted.

##### Share on other sites

@EvilSteve : I guess this is a matter of taste, but I may very well end up with something like what u described ;) Thanks !

One benefit of doing it EvilSteve's way is that if you compile it for another architecture or with another compiler you don't support, you can include and use general purpose C++ code to accomplish your goal. (Well, I should say you could still support all these compilers and architectures without using separate files, but it'd be a huge mess.)

Something like:
    #if _MSC_VER
    #include "inline_asm_msvc"
    #elif GCC_VER // Or whatever #define GCC uses
    #include "inline_asm_gcc"
    #else
    #include "c++_version"
    #endif

Or you could mix the two solutions:

asm_file
    #include "asm_macro_defs"

    uint64 absna = negA? static_cast<uint64>(-na) : static_cast<uint64>(na);
    uint64 absnb = negB? static_cast<uint64>(-nb) : static_cast<uint64>(nb);
    uint64 absResult;
    ASM8664_XORQ_R64(rdx, rdx);             // zeroes rdx
    ASM8664_MOVQ_V64toR64(rax, absna);      // puts na in rax => rdx:rax = 0:na
    ASM8664_DIVQ_V64(absnb);                // rdx = rdx:rax mod nb = na mod nb, i.e the first rest (r)
                                            // rax = rdx:rax div nb = na div nb, i.e the first quotient (q1)
    ASM8664_SHLQ_R64(rax, 30);              // rax = q1<<30
    ASM8664_MOVQ_R64toV64(absResult, rax);  // stores q1<<30 in absResult
    ASM8664_XORQ_R64(rax, rax);             // zeroes rax => rdx:rax = r:0
    ASM8664_DIVQ_V64(absnb);                // rdx = rdx:rax mod nb = (r<<64) mod nb, i.e the second rest (discarded)
                                            // rax = rdx:rax div nb = (r<<64) div nb, i.e the second quotient (q2)
    ASM8664_SHRQ_R64(rax, 34);              // rax = q2>>34 --- logical shift to leave zeroes in left bits
    ASM8664_ORQ_R64toV64(absResult, rax);   // absResult = absResult|rax = q1<<30 | q2>>34 = (uint64) q1.q2 with a 30b fixed point.
    return (negA^negB)? -static_cast<int64>(absResult) : static_cast<int64>(absResult);

    #include "asm_macro_undefs"

And then:
    #if (SUPPORTED_ARCH)
    #include "asm_file"
    #else
    #include "c++_replacement"
    #endif

##### Share on other sites
There's another option - use yasm to compile your assembly - it will output both MSVC and gcc friendly object files.

It also solves the problem that MSVC doesn't support inline assembly at all in x64 code.

However I'd also recommend that you don't bother with assembly at all and use plain C++ code instead. The optimizer usually doesn't generate code that is so slow it's worth hand optimizing using assembly. It can be worthwhile using intrinsics though.

##### Share on other sites

There's another option - use yasm to compile your assembly - it will output both MSVC and gcc friendly object files.

I did not wish to use external compiled asm, thus avoiding the call :
1 - for performance reasons.
2 - if I'm correct, gcc and msvc don't have the exact same protocol as to where to put a function argument, especially in 64b, and this seemed a pain for me to work around.

It also solves the problem that MSVC doesn't support inline assembly at all in x64 code.

seriously ? ... okay, this is bad news for me.

However I'd also recommend that you don't bother with assembly at all and use plain C++ code instead. The optimizer usually doesn't generate code that is so slow it's worth hand optimizing using assembly. It can be worthwhile using intrinsics though.

I am not trying to optimize C-writable code. Assembly was the only way I could think of to initialize the high-part register used by the div (resp. divq) instruction. It is also easier to read the remainder directly from div, or the high part of a mul result. For that last point, though, I am aware that some (compiler-dependent) implementations already exist.
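On the GCC side, setting up the high half for divq can be done in one extended asm statement, which matters because GCC does not promise to keep rax/rdx untouched between separate asm statements. Here is a rough sketch, not a drop-in replacement: `divFixedU` is a made-up name, it handles only unsigned values (signs would be dealt with around it as in the first post), and it assumes `nb != 0` and `(na >> 34) < nb` so the quotient fits in 64 bits; otherwise divq raises a divide error.

```cpp
#include <cstdint>

// Unsigned fixed-point division with 30 fractional bits: returns (na << 30) / nb.
// Hypothetical helper; preconditions: nb != 0 and (na >> 34) < nb.
inline std::uint64_t divFixedU(std::uint64_t na, std::uint64_t nb)
{
#if defined(__GNUC__) && defined(__x86_64__)
    std::uint64_t hi = na >> 34;   // high 64 bits of the 128-bit value (na << 30)
    std::uint64_t lo = na << 30;   // low 64 bits of the 128-bit value (na << 30)
    // divq divides rdx:rax by its operand: quotient -> rax, remainder -> rdx.
    // "+a" / "+d" tell the compiler both registers are read and overwritten,
    // so it can load and spill them itself around the instruction.
    __asm__("divq %[divisor]"
            : "+a"(lo), "+d"(hi)
            : [divisor] "r"(nb)
            : "cc");
    return lo;                     // quotient (hi now holds the remainder)
#elif defined(__SIZEOF_INT128__)
    // Fallback for other targets with the GCC/Clang __int128 extension.
    unsigned __int128 wide = static_cast<unsigned __int128>(na) << 30;
    return static_cast<std::uint64_t>(wide / nb);
#else
#error "this sketch needs GCC/Clang on x86-64 or a compiler with __int128"
#endif
}
```

Because the operands are passed through constraints rather than hard-coded mov instructions, the compiler stays free to decide where `na` and `nb` live before and after the statement.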

##### Share on other sites
The compiler will usually do something sensible if you simply use a 64-bit integer to do the calculations. I'd try that first.

For MSVC you can use Int32x32To64() to get the full result from the multiply (there's also a UInt version). That will get inlined and should be similar performance to assembly. There's also MulDiv() which may help.
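Int32x32To64() covers 32-bit operands; for 64-bit operands on x64, MSVC also ships the `_umul128` and `__shiftright128` intrinsics in `<intrin.h>`. A minimal sketch of an unsigned fixed-point multiply with 30 fractional bits, assuming the shifted result fits in 64 bits; `mulFixedU` is a made-up name, and the `__int128` branch is only there so the same code builds with GCC/Clang:

```cpp
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>   // _umul128, __shiftright128 (MSVC, x64 only)
#endif

// Unsigned fixed-point multiply: returns (na * nb) >> 30 using the full
// 128-bit product. Hypothetical helper; the result must fit in 64 bits.
inline std::uint64_t mulFixedU(std::uint64_t na, std::uint64_t nb)
{
#if defined(_MSC_VER) && defined(_M_X64)
    unsigned __int64 hi;
    unsigned __int64 lo = _umul128(na, nb, &hi);   // lo = low half, hi = high half
    return __shiftright128(lo, hi, 30);            // low 64 bits of (hi:lo) >> 30
#elif defined(__SIZEOF_INT128__)
    unsigned __int128 wide = static_cast<unsigned __int128>(na) * nb;
    return static_cast<std::uint64_t>(wide >> 30);
#else
#error "no 128-bit multiply available in this sketch"
#endif
}
```

For example, multiplying the encodings of 1.5 (`3 << 29`) and 2.0 (`2 << 30`) gives the encoding of 3.0 (`3 << 30`).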

##### Share on other sites

I'd try that first.

Oh, I did ^^
I even did more than that, as I came up with working C algorithms, which a "sensible" compiler would not have. The trouble with fixed-point arithmetic is that, to get things done, the compiler would need to be more than just sensible. It would need to know my goal in advance and have it ready for me.
Which maybe we'll get in compilers and computers in general by 2051... but then I'll be hobbyless and unemployed.

Well. I mean, to multiply fixed points, you can't just mul the integers. You need at least a shift, and if not using the higher register of the mul result, you need more than that.
And div is harder.
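Spelled out with the encoding from the first post (a value v is stored as the integer n with v = n · 2^-30), the shifts come from:

```latex
\begin{align*}
v_a v_b &= (n_a 2^{-30})(n_b 2^{-30}) = \frac{n_a n_b}{2^{30}} \cdot 2^{-30}
  &&\Rightarrow\quad n_{\mathrm{mul}} = (n_a n_b) \gg 30 \\
\frac{v_a}{v_b} &= \frac{n_a 2^{-30}}{n_b 2^{-30}} = \frac{n_a 2^{30}}{n_b} \cdot 2^{-30}
  &&\Rightarrow\quad n_{\mathrm{div}} = (n_a \ll 30) / n_b
\end{align*}
```

With 64-bit encodings, the intermediate product n_a n_b can need up to 128 bits and n_a << 30 up to 94 bits, which is why plain 64-bit integer code has to split the work.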

So, I have a working C solution (for mul), which could be improved, and a working but slow and imprecise C solution (for div) which could be greatly improved. I even profiled them (and yes, my div would indeed greatly benefit from a speedup, not to mention the precision boost).

Btw, operands are 64b each. Multiplication outputs and division inputs should then be 128b (as provided by x86-64 asm) to be able to get the most efficient algorithms.

But I'm by no means an asm (or fixed point, for that matter) expert, so if someone has a solution to that problem, apart from blackmailing a C committee guy and asking him for native support of an x86-64 feature (hey, to hell with SPARC after all), then I'm all ears.
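One option that needs no committee: GCC and Clang offer a non-standard `__int128` type on 64-bit targets, which gives the 128-bit intermediate directly in C++. A minimal sketch under that assumption (`FRAC_BITS`, `mulFixed` and `divFixed` are made-up names; MSVC does not provide this type):

```cpp
#include <cstdint>

// Fixed point with 30 fractional bits: value = raw * 2^-30.
// FRAC_BITS, mulFixed and divFixed are hypothetical names for this sketch.
constexpr int FRAC_BITS = 30;

// Signed multiply through a 128-bit product (GCC/Clang __int128 extension).
// The shifted result is assumed to fit in 64 bits.
inline std::int64_t mulFixed(std::int64_t na, std::int64_t nb)
{
    __int128 wide = static_cast<__int128>(na) * nb;        // full 128-bit product
    // Arithmetic right shift on GCC/Clang: rounds toward negative infinity.
    return static_cast<std::int64_t>(wide >> FRAC_BITS);
}

// Signed divide through a 128-bit dividend.
// nb must be nonzero and the quotient is assumed to fit in 64 bits.
inline std::int64_t divFixed(std::int64_t na, std::int64_t nb)
{
    // Multiply instead of shifting so negative values are handled portably.
    __int128 wide = static_cast<__int128>(na) * (static_cast<__int128>(1) << FRAC_BITS);
    return static_cast<std::int64_t>(wide / nb);           // truncates toward zero
}
```

As far as I know, on x86-64 the multiply usually compiles down to a single wide multiply, while the 128-bit divide tends to go through a runtime helper rather than a bare divq, so it is worth profiling against the asm version.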