using inline assembly for msvc AND gcc

This topic is 2084 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


G'day everyone :)

I want my project to run under Windows (DX client, server) and Linux (server only).
The target architectures are x86 and x86-64, with x86-64 recommended.
I have a few functions, mostly related to fixed-point arithmetic, which would benefit from inline assembly.

But MSVC and GCC differ considerably in how an asm block is declared, not to mention in the asm syntax itself...
The only workaround I could think of is the following :

1) declare each asm instruction to be used with a #define, probably one for each combination of operand size/type
2) use those macros one by one in the C++ code
3) pray

macro declaration example :

[code]
// COMPILER_MSVC or COMPILER_GCC are defined in some other header...

#if (defined COMPILER_MSVC)

#define ASM86_MOV_R32toR32(dst, src) __asm { mov dst, src }
#define ASM86_MOV_R32toV32(dst, src) __asm { mov dst, src }
#define ASM86_MOV_V32toR32(dst, src) __asm { mov dst, src }

// etc.

#elif (defined COMPILER_GCC)

// Note: GCC does not guarantee that register contents survive between separate asm statements.
#define ASM86_MOV_R32toR32(dst, src) asm("mov %%" #src ", %%" #dst ";" : : : "%" #dst)
#define ASM86_MOV_R32toV32(dst, src) asm("mov %%" #src ", %0;" : "=r"(dst) : : )
#define ASM86_MOV_V32toR32(dst, src) asm("mov %0, %%" #dst ";" : : "r"(src) : "%" #dst)

// etc.

#endif
[/code]

usage example :

[code]
// int64 and uint64 are typedef'd in some other header...

// Returns the encoding nr of a value vr such that vr = va/vb,
// given na and nb, the encodings of two values va and vb respectively.
// The 'Scalar' type is 64-bit fixed point with 30 bits after the point.
// => vr = nr * 2^-30 ; va = na * 2^-30 ; vb = nb * 2^-30
inline int64 ScalarFixedImpl::div(int64 na, int64 nb) {
bool negA = na<0;
bool negB = nb<0;

#ifdef ARCH_64

uint64 absna = negA? static_cast<uint64>(-na) : static_cast<uint64>(na);
uint64 absnb = negB? static_cast<uint64>(-nb) : static_cast<uint64>(nb);
uint64 absResult;

ASM8664_XORQ_R64(rdx, rdx); // zeroes rdx
ASM8664_MOVQ_V64toR64(rax, absna); // puts na in rax => rdx:rax = 0:na
ASM8664_DIVQ_V64(absnb); // rdx = rdx:rax mod nb = na mod nb, i.e. the first remainder (r)
// rax = rdx:rax div nb = na div nb, i.e. the first quotient (q1)
ASM8664_SHLQ_R64(rax, 30); // rax = q1<<30
ASM8664_MOVQ_R64toV64(absResult, rax); // stores q1<<30 in absResult
ASM8664_XORQ_R64(rax, rax); // zeroes rax => rdx:rax = r:0
ASM8664_DIVQ_V64(absnb); // rdx = rdx:rax mod nb = (r<<64) mod nb, i.e. the second remainder (discarded)
// rax = rdx:rax div nb = (r<<64) div nb, i.e. the second quotient (q2)
ASM8664_SHRQ_R64(rax, 34); // rax = q2>>34 --- logical shift to leave zeroes in left bits
ASM8664_ORQ_R64toV64(absResult, rax); // absResult = absResult|rax = q1<<30 | q2>>34 = (uint64) q1.q2 with a 30b fixed point.

return (negA^negB)? -static_cast<int64>(absResult) : static_cast<int64>(absResult);

#else

// here comes the pain...
assert(false && "ScalarFixedImpl::div : not yet implemented for 32b");

#endif

}
[/code]

So, here are my questions ^^ :

- Will that work ?
- Will both compilers be able to optimize, say, the use of the input variables (not the asm itself, of course) ?
- Can you think of another solution ?
I guess a few people out there won't be able to refrain from answering the unspoken question : "Is it worth the pain ?"
As I am a very kind person, here you are :
- Is it worth the pain ?

Thanks in advance :)

Share this post


Link to post
Share on other sites
[quote name='TiPiou' timestamp='1333358894' post='4927416']
- Will that work ?
[/quote]

Yes.

[quote name='TiPiou' timestamp='1333358894' post='4927416']
- Will both compilers be able to optimize, say, the use of the input variables (not the asm itself, of course) ?
[/quote]

Using the preprocessor will not affect the compilers' ability to optimize. The preprocessor is run before any actual compilation / optimization.

[quote name='TiPiou' timestamp='1333358894' post='4927416']
- Can you think of another solution ?
- Is it worth the pain ?
[/quote]

There are more complex solutions, but they should be left to higher level, bigger differences - such as having one rendering interface with different implementations for OGL / DX9 / DX11. The way you describe is fairly standard for "smaller" stuff like asm, timing, async file io, and a bunch of other stuff that is effectively the same but differs primarily in semantics.

Share this post


Link to post
Share on other sites
I'd put all of the assembly code in one file, and #include that file based on the compiler. E.g.:
[code]
#if defined(_MSC_VER)
#include "inline_asm_msvc"
#elif defined(__GNUC__) // GCC defines __GNUC__
#include "inline_asm_gcc"
#else
#error Unsupported compiler!
#endif
[/code]

Share this post


Link to post
Share on other sites
thanks for feedback :)

[quote name='turch' timestamp='1333371854' post='4927469']
Using the preprocessor will not affect the compilers' ability to optimize. The preprocessor is run before any actual compilation / optimization.
[/quote]
Ah, yes. I guess my question was misleading. What I meant is that the MSVC syntax "__asm mov dst, src" is open enough, IMHO, for the compiler to optimize where it puts or retrieves any C-style argument. I wasn't so sure about the messy GCC alternative.

[quote name='turch' timestamp='1333371854' post='4927469']
The way you describe is fairly standard for "smaller" stuff like asm [...]
[/quote]
Well, I'm relieved if this is standard stuff then. I wasn't able to google any hit about GCC vs MSVC instruction-by-instruction asm macros ^^

@EvilSteve : I guess this is a matter of taste, but I may very well end up with something like what you described ;) Thanks !

Share this post


Link to post
Share on other sites
[quote name='TiPiou' timestamp='1333376206' post='4927501']
[quote name='turch' timestamp='1333371854' post='4927469']
Using the preprocessor will not affect the compilers' ability to optimize. The preprocessor is run before any actual compilation / optimization.
[/quote]
Ah, yes. I guess my question was misleading. What I meant is that the MSVC syntax "__asm mov dst, src" is open enough, IMHO, for the compiler to optimize where it puts or retrieves any C-style argument. I wasn't so sure about the messy GCC alternative.
[/quote]

Ah, well I wouldn't be able to comment on that, I've never really used gcc for any assembly :)

[quote name='TiPiou' timestamp='1333376206' post='4927501']
[quote name='turch' timestamp='1333371854' post='4927469']
The way you describe is fairly standard for "smaller" stuff like asm [...]
[/quote]
Well, I'm relieved if this is standard stuff then. I wasn't able to google any hit about GCC vs MSVC instruction-by-instruction asm macros ^^
[/quote]

I just say it's standard based on my own experience of having worked with a fairly decent number of multiplatform projects which do this (and of course using it myself).

[quote name='TiPiou' timestamp='1333376206' post='4927501']
@EvilSteve : I guess this is a matter of taste, but I may very well end up with something like what you described ;) Thanks !
[/quote]

Yeah, taste and the case at hand. If I have a decently sized class with one or two functions optimized with assembly, I use in-file defines. If it's something where every function or most of the functions are different, then I do a conditional include the way EvilSteve posted.

Share this post


Link to post
Share on other sites
[quote name='TiPiou' timestamp='1333376206' post='4927501']
@EvilSteve : I guess this is a matter of taste, but I may very well end up with something like what you described ;) Thanks !
[/quote]
One benefit of doing it EvilSteve's way is that if you compile it for another architecture or with another compiler you don't support, you can include and use general purpose C++ code to accomplish your goal. (Well, I should say you could still support all these compilers and architectures without using separate files, but it'd be a huge mess.)

Something like:
[code]
#if defined(_MSC_VER)
#include "inline_asm_msvc"
#elif defined(__GNUC__) // GCC defines __GNUC__
#include "inline_asm_gcc"
#else
#include "c++_version"
#endif
[/code]

Or you could mix the two solutions:

asm_file
[code]
#include "asm_macro_defs"
uint64 absna = negA? static_cast<uint64>(-na) : static_cast<uint64>(na);
uint64 absnb = negB? static_cast<uint64>(-nb) : static_cast<uint64>(nb);
uint64 absResult;

ASM8664_XORQ_R64(rdx, rdx); // zeroes rdx
ASM8664_MOVQ_V64toR64(rax, absna); // puts na in rax => rdx:rax = 0:na
ASM8664_DIVQ_V64(absnb); // rdx = rdx:rax mod nb = na mod nb, i.e. the first remainder (r)
// rax = rdx:rax div nb = na div nb, i.e. the first quotient (q1)
ASM8664_SHLQ_R64(rax, 30); // rax = q1<<30
ASM8664_MOVQ_R64toV64(absResult, rax); // stores q1<<30 in absResult
ASM8664_XORQ_R64(rax, rax); // zeroes rax => rdx:rax = r:0
ASM8664_DIVQ_V64(absnb); // rdx = rdx:rax mod nb = (r<<64) mod nb, i.e. the second remainder (discarded)
// rax = rdx:rax div nb = (r<<64) div nb, i.e. the second quotient (q2)
ASM8664_SHRQ_R64(rax, 34); // rax = q2>>34 --- logical shift to leave zeroes in left bits
ASM8664_ORQ_R64toV64(absResult, rax); // absResult = absResult|rax = q1<<30 | q2>>34 = (uint64) q1.q2 with a 30b fixed point.

return (negA^negB)? -static_cast<int64>(absResult) : static_cast<int64>(absResult);
#include "asm_macro_undefs"
[/code]

And then:
[code]
#if (SUPPORTED_ARCH)
#include "asm_file"
#else
#include "c++_replacement"
#endif
[/code]
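
A file like "c++_replacement" could hold a plain C++ fallback for the division. The following is only a sketch under assumptions: the name divFixedPortable is made up for illustration, it works on the unsigned magnitudes (absna, absnb in the code above), it assumes nb is non-zero and that the result fits in 64 bits, and it replaces the two divq instructions with one integer division followed by 30 shift-and-subtract steps.

```cpp
#include <cstdint>

// Hypothetical portable fallback: returns floor((na << 30) / nb) for
// unsigned magnitudes without any 128-bit type or assembly.
// Assumes nb != 0 and that the result fits in 64 bits.
inline std::uint64_t divFixedPortable(std::uint64_t na, std::uint64_t nb) {
    const std::uint64_t quotient = na / nb;  // integer part (q1)
    std::uint64_t rem = na % nb;             // remainder, always < nb
    std::uint64_t frac = 0;                  // the 30 fractional bits

    // Classic binary long division: one result bit per iteration.
    for (int i = 0; i < 30; ++i) {
        const bool carry = (rem >> 63) != 0; // bit lost by the shift below
        rem <<= 1;
        frac <<= 1;
        if (carry || rem >= nb) {
            rem -= nb;  // wraps modulo 2^64, which is correct when carry is set
            frac |= 1;
        }
    }
    return (quotient << 30) | frac;
}
```

The sign handling from the original function (negA, negB) would stay exactly as it is around this call. It is slower than a hardware 128-by-64 division, but it gives the same truncated result.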

Share this post


Link to post
Share on other sites
There's another option - use [url="http://yasm.tortall.net/"]yasm[/url] to compile your assembly - it will output both MSVC and gcc friendly object files.

It also solves the problem that MSVC doesn't support inline assembly at all in x64 code.

However I'd also recommend that you don't bother with assembly at all and use plain C++ code instead. The optimizer usually doesn't generate code that is so slow it's worth hand optimizing using assembly. It can be worthwhile using intrinsics though.

Share this post


Link to post
Share on other sites
[quote name='Adam_42' timestamp='1333380703' post='4927522']
There's another option - use [url="http://yasm.tortall.net/"]yasm[/url] to compile your assembly - it will output both MSVC and gcc friendly object files.
[/quote]
I did not want to use externally compiled asm, so as to avoid the call :
1 - for performance reasons.
2 - if I'm correct, GCC and MSVC don't use the exact same convention for where to put function arguments, especially in 64-bit, and this seemed a pain for me to work around.

[quote name='Adam_42' timestamp='1333380703' post='4927522']
It also solves the problem that MSVC doesn't support inline assembly at all in x64 code.
[/quote]
Seriously ? -_- ... okay, this is bad news for me.

[quote name='Adam_42' timestamp='1333380703' post='4927522']
However I'd also recommend that you don't bother with assembly at all and use plain C++ code instead. The optimizer usually doesn't generate code that is so slow it's worth hand optimizing using assembly. It can be worthwhile using intrinsics though.
[/quote]
I am not trying to optimize code that can be written in C. Assembly was the only way I could think of to initialize the high-part register used by the div (resp. divq) instruction. It's also easier to directly get the remainder from div, or the high part of a mul result. Although for that last point, I am aware that some (compiler-dependent) implementations already exist.
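
For the GCC side, the high-part register can be set through operand constraints in a single extended asm statement, rather than through one macro per instruction. The sketch below is an assumption-laden illustration, not a drop-in replacement: the names div128by64 and divFixedAbs are made up, it only covers GCC-compatible compilers, and it uses one divq on the 128-bit value (na << 30) instead of the two divisions in my code above. The caller must guarantee hi < d, otherwise divq raises a divide error.

```cpp
#include <cstdint>

// Hypothetical helper: divides the 128-bit value hi:lo by d.
// Precondition: hi < d (so the quotient fits in 64 bits).
inline std::uint64_t div128by64(std::uint64_t hi, std::uint64_t lo,
                                std::uint64_t d, std::uint64_t* rem) {
#if defined(__GNUC__) && defined(__x86_64__)
    std::uint64_t q, r;
    // divq divides rdx:rax by its operand; quotient -> rax, remainder -> rdx.
    // The "a" and "d" constraints let the compiler load the registers itself.
    __asm__("divq %[d]"
            : "=a"(q), "=d"(r)
            : "a"(lo), "d"(hi), [d] "rm"(d)
            : "cc");
    *rem = r;
    return q;
#elif defined(__SIZEOF_INT128__)
    // Non-x86-64 fallback, assuming the compiler provides unsigned __int128.
    const unsigned __int128 n = (static_cast<unsigned __int128>(hi) << 64) | lo;
    *rem = static_cast<std::uint64_t>(n % d);
    return static_cast<std::uint64_t>(n / d);
#else
#error "div128by64: no implementation for this compiler in this sketch"
#endif
}

// floor((na << 30) / nb) on unsigned magnitudes, 30 fractional bits.
inline std::uint64_t divFixedAbs(std::uint64_t na, std::uint64_t nb) {
    const std::uint64_t hi = na >> 34;  // top 30 bits of na
    const std::uint64_t lo = na << 30;  // remaining bits, shifted up
    std::uint64_t rem;
    return div128by64(hi, lo, nb, &rem);
}
```

Because the whole operation lives in one asm statement with declared inputs and outputs, the compiler remains free to choose where absna and absnb come from, and nothing relies on registers keeping their values between statements.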

Share this post


Link to post
Share on other sites
The compiler will usually do something sensible if you simply use a 64-bit integer to do the calculations. I'd try that first.

For MSVC you can use [url="http://msdn.microsoft.com/en-us/library/aa383703(v=vs.85).aspx"]Int32x32To64()[/url] to get the full result from the multiply (there's also a UInt version). That will get inlined and should be similar performance to assembly. There's also [url="http://msdn.microsoft.com/en-us/library/aa383718(v=vs.85).aspx"]MulDiv()[/url] which may help.

Share this post


Link to post
Share on other sites
[quote name='Adam_42' timestamp='1333386914' post='4927558']
I'd try that first.
[/quote]
Oh, I did ^^
I even did more than that: I came up with working C algorithms, which a merely "sensible" compiler would not have. The trouble with fixed-point arithmetic is that, to get things done, the compiler would need to be more than just sensible. It would need to know my goal in advance and have it ready for me.
Which maybe we'll get in compilers and computers in general by 2051... but then I'll be hobbyless and unemployed :(

Well. I mean, to multiply fixed-point values, you can't just mul the integers. You need at least a shift, and if you are not using the high register of the mul result, you need more than that.
And div is harder.

So, I have a working C solution (for mul), which could be improved, and a working but slow and imprecise C solution (for div) which could be greatly improved. I even profiled them :P (and yes, my div would indeed greatly benefit from a speedup, not to mention the precision boost)

Btw, operands are 64 bits each. Multiplication outputs and division inputs should then be 128 bits (as provided by x86-64 asm) to allow the most efficient algorithms.
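
To make the 128-bit intermediate concrete, here is a rough sketch of the multiply on unsigned magnitudes. The name mulFixedU64 and the constant kMulFracBits are made up for illustration. The MSVC branch assumes the x64 intrinsics _umul128 and __shiftright128 from <intrin.h>; the other branch assumes a GCC-compatible compiler that provides unsigned __int128.

```cpp
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#endif

// Number of bits after the point in the 'Scalar' encoding.
constexpr int kMulFracBits = 30;

// Hypothetical helper: returns (na * nb) >> 30 using the full 128-bit
// product, so the high half of the multiply is not lost.
inline std::uint64_t mulFixedU64(std::uint64_t na, std::uint64_t nb) {
#if defined(_MSC_VER) && defined(_M_X64)
    std::uint64_t hi;
    const std::uint64_t lo = _umul128(na, nb, &hi);   // 64x64 -> 128
    return __shiftright128(lo, hi, kMulFracBits);     // low 64 bits of (hi:lo >> 30)
#elif defined(__SIZEOF_INT128__)
    const unsigned __int128 wide = static_cast<unsigned __int128>(na) * nb;
    return static_cast<std::uint64_t>(wide >> kMulFracBits);
#else
#error "mulFixedU64: no 128-bit multiply available in this sketch"
#endif
}
```

Signs would be handled around it the same way as in the div function (multiply the magnitudes, then negate if exactly one operand was negative).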

But I'm by no means an asm (or fixed-point, for that matter) expert, so if someone has a solution to that problem, apart from blackmailing a C committee guy and asking him for native support of an x86-64 feature (hey, to hell with SPARC after all :P), then I'm all ears :)

Share this post


Link to post
Share on other sites
I was assuming your fixed point values were 32-bit, in which case a 64-bit multiply followed by a shift is all that's needed.
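
A minimal sketch of that 32-bit case, assuming a hypothetical format with 30 fractional bits in a 32-bit integer (the name mulFixed32 is made up for illustration):

```cpp
#include <cstdint>

// Hypothetical 32-bit fixed-point format with 30 bits after the point.
constexpr int kFracBits32 = 30;

// Widen to 64 bits, multiply, then shift the product back down.
// Right-shifting a negative value is an arithmetic shift on mainstream
// compilers (and guaranteed since C++20), so this rounds towards
// negative infinity.
inline std::int32_t mulFixed32(std::int32_t na, std::int32_t nb) {
    const std::int64_t wide = static_cast<std::int64_t>(na) * nb;
    return static_cast<std::int32_t>(wide >> kFracBits32);
}
```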

[url="http://stackoverflow.com/questions/3329541/does-gcc-support-128-bit-int-on-amd64"]gcc supports 128-bit integers[/url] on 64-bit targets, which should make 64-bit fixed point straightforward there. However, MSVC doesn't support them at all.
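
As a rough sketch of what that could look like for the division on gcc (the names divFixedU128 and kDivFracBits are made up for illustration, and it works on unsigned magnitudes with nb non-zero):

```cpp
#include <cstdint>

#if defined(__SIZEOF_INT128__)
// Number of bits after the point in the fixed-point encoding.
constexpr int kDivFracBits = 30;

// Hypothetical helper: floor((na << 30) / nb) via a 128-bit intermediate.
// The result is truncated to 64 bits, so it is only meaningful when the
// true quotient fits.
inline std::uint64_t divFixedU128(std::uint64_t na, std::uint64_t nb) {
    const unsigned __int128 wide =
        static_cast<unsigned __int128>(na) << kDivFracBits;
    return static_cast<std::uint64_t>(wide / nb);
}
#endif
```

Note that the compiler may lower the 128-bit division to a call into its runtime library rather than a single div instruction, so it is worth checking the generated code and profiling before relying on it.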

You could use a library like [url="http://gmplib.org/"]GMP[/url] to handle it for you.

Share this post


Link to post
Share on other sites
I just googled a bunch of questions related to this same issue...

One of the answers was to embed the corresponding machine code as hard-coded data and call it through a brutal cast to a function pointer, using __fastcall macros and the like... this seems quite nasty to me ^^
Well, if I use this, I guess I'm in for #ifdef nonetheless, as GCC won't have the same calling-convention macros or put its arguments in the same registers... but it's a start.
For the curious (or the insane) : [url="http://stackoverflow.com/questions/8453146/128-bit-division-intrinsic-in-visual-c"]http://stackoverflow...sic-in-visual-c[/url]
But then, I fail to see the difference from Adam_42's advice to compile it first and call it as extern... which may very well be my final choice :x

@Adam_42 : I guess there are some 128-bit oddities in MSVC, but they won't work with operators such as div. I don't know if it's the same with GCC ? Maybe not, but even so, there is a risk that "128b / 128b" expands to a lot of unnecessarily complex code (dividing by something larger than one register is no trivial matter)... when I only need a native x86-64 "128b / 64b". (As an example of this, dividing a "long long" by a "long" on 32-bit architectures is interpreted as "long long" div "long long" and is a horrid operation)
Thanks for your answers anyway :)

[Edited for readability]
