Assembler with gcc/g++

I have been writing an engine for my new game in Dev-C++ (5) and so far I have been able to get it all working OK. The problem starts when I want to optimize my engine with SIMD (3DNow! to be specific). I want to write an assembler function and I get stuck passing the function's parameters. I have read that the GCC/G++ compilers can't see a local variable inside an asm("") block.

Question: How can I use the function parameters in an asm block?

P.S. I have tried the pop instruction and I get an error message from Windows! And sorry about the bad spelling, but I don't speak English very well.

You might want to start with the gcc documentation. You can find the docs for the various versions here: GCC online documentation - and you can find the docs in a more convenient .chm file form here: gcc-3.3.2.chm (694k). Look for section "6.35. Assembler Instructions with C Expression Operands" - or see this link: Assembler Instructions with C Expression Operands. You'll probably want to download the complete documentation package.

Unfortunately, the documentation isn't as clearly written as it could be - it's not so easy to understand even for native English speakers.

There are several things to consider when accessing function parameters with inline assembly. For example - what is the calling convention used by the function, are optimizations turned on, should the function be declared "naked" - and so on. Going over the background of each of these questions is rather involved. You can find a pretty good description of calling conventions here: fubi - although the assembly syntax used is Intel rather than AT&T which is what gcc uses.

Assuming that a function has a prolog - that is instructions to set up a stack frame, for example:

pushl %ebp
movl %esp,%ebp

the first parameter is found at 8(%ebp), so it can be pushed with

pushl 8(%ebp)

the second is at 12(%ebp)

pushl 12(%ebp)

and so on (assuming that each parameter is 32 bits).
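
For comparison, here is a minimal sketch of the route described in the gcc manual section mentioned above, where the parameters are handed to the asm block as operands instead of being read off the stack by hand. The function name add_ints is made up for illustration, and the plain C branch is only there so the file still builds on non-x86 targets.

```c
/* Hypothetical helper: adds two ints inside an inline asm block. */
int add_ints(int a, int b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int result;
    /* "=r"(result): %0, output in any general-purpose register.
       "0"(a):       %1, input that starts in the same register as %0.
       "r"(b):       %2, input in any general-purpose register. */
    __asm__ ("addl %2, %0"
             : "=r" (result)
             : "0" (a), "r" (b)
             : "cc");               /* addl modifies the flags */
    return result;
#else
    return a + b;                   /* portable fallback, no asm */
#endif
}
```

Written this way the compiler decides where a and b live, so the code does not depend on the calling convention, on the presence of a frame pointer, or on the optimization level.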

That you inserted a pop and got an error tells me that you probably need to learn more about assembly language before attempting to optimize your engine with it. Hopefully, the above links will help.




Thanks man!
That worked!!!

What I meant to say with pop is that I have been trying to do the opposite of
pushl %ebp
movl %esp,%ebp

with
pop %ebp (why did it push it in the first place?)

But your code works, so it is OK.

And I don't know a lot of assembler, but I do know all the SIMD instructions, like 3DNow! and SSE, that I need.
And for the rest of the assembler code, that's why GameDev has the forums, so I can ask for help!

As for the AT&T assembler language, I use
.intel_syntax noprefix

and at the end of the asm block
.att_syntax
(I get a lot of errors from the compiler if I don't put this)

Thank you again

quote:
Original post by Red Drake
What I meant to say with pop is that I have been trying to do the opposite of
pushl %ebp
movl %esp,%ebp

with
pop %ebp (why did it push it in the first place?)



The counterpart to the prologue is called the epilogue. At its most basic it looks like this:

movl %ebp,%esp # restore the stack pointer
popl %ebp # restore the old frame pointer
ret # return to caller; the caller cleans the stack

This is for a function using the cdecl calling convention. A function using the stdcall calling convention readjusts the stack itself to account for the function parameters, like so:

ret $N

where N is the number of bytes of arguments, typically a multiple of 4. You should get yourself an assembler reference - like "The Art of Assembly" - as well as a copy of the Intel manuals.

ebp is pushed at the beginning of a function to save whatever value was in it before the function was called. This is by convention - an agreement regarding how calling and called functions relate to each other. By this convention, there are certain registers that a called function must not overwrite (ebp is one of them) and likewise, registers that it can overwrite. That means that if the registers that can be overwritten contain important values, a calling function must save them before calling another function. A typical method of saving these values is to push them on the stack before making the call and pop them back from the stack afterwards.

ebp is pushed to save its contents, and then it's used to store the contents of esp. This makes it easier to reference function arguments and local variables. Arguments remain at fixed positive offsets from ebp and locals at fixed negative offsets. It's not absolutely necessary to use this approach, but it's easier. The alternative is to use esp directly, but that takes more work because esp changes with every push and every pop - and that means the offsets to arguments and local variables change as well.
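
The offset arithmetic described here can be written down as a small sketch. It only models the 32-bit layout discussed in this thread (4-byte slots, frame set up with pushl %ebp / movl %esp,%ebp); the names SLOT_BYTES, arg_offset and stdcall_ret_bytes are invented for illustration.

```c
/* Model of a 32-bit x86 frame after "pushl %ebp; movl %esp,%ebp":
     0(%ebp)   saved ebp
     4(%ebp)   return address
     8(%ebp)   first argument
    12(%ebp)   second argument, and so on */
enum { SLOT_BYTES = 4 };

/* Byte offset from ebp of argument number `index` (0-based),
   assuming every argument occupies one 32-bit slot. */
int arg_offset(int index)
{
    return 2 * SLOT_BYTES + index * SLOT_BYTES;
}

/* The N in "ret $N" for a stdcall function that takes `count`
   32-bit arguments. */
int stdcall_ret_bytes(int count)
{
    return count * SLOT_BYTES;
}
```

With these definitions arg_offset(0) is 8 and arg_offset(1) is 12, matching the 8(%ebp) and 12(%ebp) operands used earlier in the thread.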



Red Drake, knowing the SIMD sets is one thing. But your efforts at writing asm will be hindered if you do not master the various compiler settings, the calling conventions and the scheduling of the asm instructions.

Right Lessbread. I can add my own experience about it: there are pros and cons concerning esp and ebp. I would say not using ebp for the stack frame is good for small functions (20-40 clock cycles).

- esp: frees ebp. Avoids an AGI (address generation interlock) in the function prologue. Most routines feel the register pressure and stall when ebp cannot be used. The x86 only has 6 general-purpose registers without ebp. When you are short of registers it creates unnecessary traffic on the stack, AGIs, load/store dependencies, etc.

- ebp: makes shorter code; mov eax,[ebp+offset] is shorter than mov eax,[esp+offset], which is code-cache friendly. It's also easier to debug.

When the functions are really small, for instance if you redefine simple vector operators like +, use the C intrinsics of the Intel lib (_m....) or those of the latest GNU C (__builtin_...).

Now I'd also like some feedback. I am still working on my cathedral work: an ultra-high-performance, normalized and compatible math lib. And I am experiencing problems with gcc/Dev-C++ and the intrinsics. It works, but when I look at the assembly code generated, it is awful. The compiler only uses mm0 (for my 3DNow! code) despite all the settings I have tried.

Currently I am testing various compilers and hardware on a routine I have chosen as a reference test. It's the quaternion multiplication (since it's non-trivial, with some swizzling). Currently the best I can achieve is 22 cycles on an Athlon (3DNowExt) with Visual 6. Inlined intrinsics do not help (33 cycles at best). An inline function with inline asm is buggy. The best I have found for Visual is using __fastcall for a true function containing highly scheduled inline asm.

I hope gcc can make it work with truly inlined asm. In 3DNow!, without the register fill/restore and call overhead, the quaternion multiplication should take around 16 cycles, since it can be highly parallelized. Around 12 in SSE. The register name abstraction is really a feature I am eager to test. But if someone could give me some advice about the calling conventions and inline asm in gcc, that'd be great.
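
As a rough sketch of the intrinsics route for a small vector operator, here is a 4-float add. It is an assumption-laden illustration: it uses the SSE intrinsics from xmmintrin.h rather than 3DNow! ones, the vec4 type and the vec4_add name are made up, and a scalar loop stands in when SSE is not enabled.

```c
#if defined(__SSE__)
#include <xmmintrin.h>      /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical 4-float vector type, for illustration only. */
typedef struct { float v[4]; } vec4;

/* Component-wise a + b. */
vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
#if defined(__SSE__)
    /* Unaligned loads/stores, since vec4 makes no alignment promise. */
    __m128 va = _mm_loadu_ps(a.v);
    __m128 vb = _mm_loadu_ps(b.v);
    _mm_storeu_ps(r.v, _mm_add_ps(va, vb));   /* four adds at once */
#else
    int i;
    for (i = 0; i < 4; ++i)                   /* scalar fallback */
        r.v[i] = a.v[i] + b.v[i];
#endif
    return r;
}
```

In a real math lib the vectors would probably be kept 16-byte aligned so that the aligned load and store intrinsics can be used instead.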


[edited by - Charles B on May 23, 2004 9:07:57 PM]

Red Drake, I'll give you my "sensitive secrets" on how to code inline asm under gcc:

This will be faster than using the builtins. It shows the full power of inline asm under gcc, where register names are abstract, which lets the compiler optimize the way it allocates data into registers. An equivalent result is impossible to achieve under Visual C++:

typedef unsigned long long my64 __attribute__((__aligned__(8)));

/* %0 = r (output, any MMX register), "0"(r) = r again as input in that same register, %2 = s */
#define _xor_64(r, s) asm volatile("pxor %2, %0" : "=y"(r) : "0"(r), "y"(s))

static inline my64 xor_64(const my64 s, const my64 t){
    my64 r = s; _xor_64(r, t); return (r);
}

This is how I define my own intrinsics. You asked me in another thread. I did it because the builtins force the result (r) into mm0 and the source (s) into mm1, but here %2 can be any MMX register from mm0-mm7. This lets the compiler optimize much better. For instance:

my64 a,b;
...
a=..; b=..;
a = xor_64(a, b);
b = xor_64(b, b);

Suppose that in the surrounding context the compiler already uses mm0, mm1, mm2, mm3 and mm4 to hold some temporary MMX data.

With builtins the compiler would generate something like this (pseudo-assembly; push/pop stand for a spill to the stack and a reload):

movq mm5, ... ; temp var a
movq mm6, ... ; temp var b
push mm0
movq mm0, mm5
movq mm1, mm6
pxor mm0, mm1
movq mm5, mm0
movq mm0, mm6
pxor mm0, mm1
movq mm6, mm0
pop mm0
...
Very heavy and cumbersome!

With my intrinsic it becomes:
movq mm5, ...
movq mm6, ...
pxor mm5, mm6
pxor mm6, mm6
...

I'll let you read the inline asm section of the gcc docs to understand what the asm code really means here.
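
For readers who want the operand syntax spelled out, here is a self-contained sketch of the same idea with each constraint annotated. The names my64 and xor_64 come from this post; the emms handling and the plain C fallback are additions for illustration, not part of the original code.

```c
typedef unsigned long long my64;

my64 xor_64(my64 s, my64 t)
{
#if defined(__GNUC__) && defined(__MMX__) && \
    (defined(__i386__) || defined(__x86_64__))
    my64 r;
    /* Outputs come after the first colon, inputs after the second.
       "=y"(r): %0, write-only output in any MMX register.
       "0"(s):  %1, input that must start in the same register as %0.
       "y"(t):  %2, input in any MMX register. */
    __asm__ volatile ("pxor %2, %0"
                      : "=y" (r)
                      : "0" (s), "y" (t));
    /* emms resets the x87 state after MMX use. Passing r as a memory
       operand makes the compiler copy it out of the MMX register first.
       Real code would do this once per batch, not once per call. */
    __asm__ volatile ("emms" : "+m" (r));
    return r;
#else
    return s ^ t;       /* fallback for targets without MMX */
#endif
}
```

Because the constraints only say "any MMX register", the compiler is free to pick whichever of mm0-mm7 is convenient in the surrounding code, which is the point made above.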


[edited by - Charles B on June 9, 2004 8:21:01 AM]



This is far above my current knowledge, but I will put an effort into understanding what you just told me (I wrote above that I was beginning to learn assembly).
I am nearly a beginner and I see that you probably have years of experience with assembler.
What else can I say?
I bow to your wisdom

P.S.
I don't use inline assembly. I use a linker to link my OBJ files built by NASM. I know I said the opposite of this at the beginning, but it has proven to be a better choice than inline asm


[edited by - Red Drake on June 9, 2004 10:12:06 AM]

I have read that the GCC/G++ compilers can't see a local variable inside an asm("") block.

This is totally false. Just read my code sample, you'll see that local variables can be accessed through the %0 system. And it's the most powerful way of doing things, because these locals are then passed as registers to your inline asm code. Thus inline asm performs far better than linked asm files. There is no calling overhead. You don't need to push/pop and call. It's much faster.

If you want I can send you a small test project under Dev-C++ that shows what I told you before. You can test it; it outputs benchmarks in the console.
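
A minimal sketch of that %0 mechanism applied to ordinary local variables, using general-purpose registers rather than MMX. The function name swap_and_subtract is hypothetical, and the C branch only exists so that the file builds on non-x86 targets.

```c
/* Hypothetical demo: swaps two locals inside an asm block,
   then returns first - second. */
int swap_and_subtract(int x, int y)
{
    int first = x;      /* ordinary local variables */
    int second = y;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+r" marks each local as read and written by the asm, held in a
       general-purpose register of the compiler's choosing (%0 and %1). */
    __asm__ ("xchgl %0, %1"
             : "+r" (first), "+r" (second));
#else
    int tmp = first;    /* portable fallback, no asm */
    first = second;
    second = tmp;
#endif
    return first - second;
}
```

Calling swap_and_subtract(10, 3) swaps the locals to 3 and 10 and returns -7, with no push, pop or call in the asm itself.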




This would be nice.
Please send it to my mail
"rafael.munitic@ri.hinet.hr" - and not the one from my profile.

Guest Anonymous Poster
Charles B, I'd be glad to receive your code sample...

You can send it to me at:
phiaphednrwl@jetable.com

Thanks a lot.

Thanks, and only one question. I have said this a lot of times to you today, but I keep thinking of things that I don't know about assembly and that I am sure you will know.
So is 3DNow! actually slower than SSE? Not in code usage (because you only have 2 registers) but in a technical/architectural way: let's say you use the 3DNow! sqrt and the SSE sqrt, which one would be faster? And what about similar instructions, or are they the same?

quote:
Original post by Charles B
K but I''ll do that, wait a bit. My dev PC is not this one, not connected. None can hack it


Did you already send me the e-mail? Because I received one that was totally empty (I don't want to display the address because of your privacy).
My Outlook is totally fu**d by spam: "buy viagra .... Cheap WinXP license ..."

I am e-mailing right now. I have cleaned a few personal things out of the files and transferred them to this PC. So I'll e-mail right after this post, and then I go swimming

SSE is basically faster because its registers are 128 bits wide, that is 4 floats per instruction, against 2 floats in a 64-bit 3DNow! register. Now download the vendors' (AMD, Intel) tech papers to compare the latencies. If you want to use a SIMD sqrt to accelerate float sqrtf(float), I can't tell. This depends on the latencies.
