zoggo

Asm/C Performance

The following assembly code is generated by MS Visual C++ 6:

    push ebp
    mov ebp,esp
    push ebx
    push esi
    push edi
    fld dword [ ds:0x4080D0 ]
    pop edi
    pop esi
    pop ebx
    pop ebp
    ret

for this function:

    float dotf( vec3 v1, vec3 v2 )
    {
        return 0;
    }

However, when I assemble the code using NASM (exact same calling conventions etc.) the code performs consistently slower than the compiler-generated code. Does anybody know why? I assume NASM is generating slightly different machine code, but I don't know what the difference is or how to change it.

It is not being inlined; I can see this from a debugger. It is not dependent on where it is called from. Also, it is not a timing issue, as various profilers and hand-coded timer systems all show the function as being slower.

Thanks in advance for any help received.

James

It's not being inlined because I have turned inlining off. What I want to know is why two pieces of seemingly identical ASM are performing differently.

The idea is that I have two functions doing the same thing, one in C and one in assembly. I don't want either to be inlined; I just want the asm one to run at the same speed as the C one.

Sorry I didn't make that clear.

Thanks anyway

I suspect it's a data access problem, but check the listing files to see if they both generated the same machine code. I don't remember the option for VC, but in NASM, use the -l [listing file] option.

I guess my suggestion would be to link both in under different function names and compare the disassembly of the functions directly from the exe with a disassembler. How are you determining that one is slower than the other?

Thanks to all those who replied.

I have figured it out now. One, I was being stupid and looking at the wrong disassembly. Two, and more interestingly, VC generates a 6 byte instruction for FLD but NASM generates a 7 byte instruction. If anyone knows why, I would love to know.

However, having forced NASM to use the 6 byte version by means of many a DB, the performance difference is negligible.

Just out of interest, does the NASM (7 byte) version differ from the 6 byte version by only a 0x3e at the start, i.e. the 7 byte code = "0x3e d1 d2 d3 d4 d5 d6" and the 6 byte version = "d1 d2 d3 d4 d5 d6"?

Skizz

Guest Anonymous Poster
Just wondering why, if you're returning zero, you're loading the constant from memory. AFAIK there's an FLDZ instruction on the FPU to handle precisely this case.

I'm not using FLDZ because I wanted to replicate the VC++ asm code exactly in NASM. In their infinite wisdom, the MS compiler writers decided not to use FLDZ and instead load 0 constants from memory.

Guest Anonymous Poster
Another performance hit could be caused by function alignment.

The absolute worst case is when a function (A) is not aligned on a 4-byte boundary and (B) straddles two different cache lines.

MASM's "align 16" eliminates (A) and reduces the chance of (B). Find the equivalent of "align 16" in NASM and slap that baby in front of your function.

- Rockoon
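
The two conditions above can be checked with a little address arithmetic. The following is only a sketch in C: the helper names `padding_to_align` and `straddles_cache_line` are made up for illustration, and the cache line size is a parameter because it varies between CPUs. For reference, NASM's standard macros also accept an `align 16` line.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes of padding needed to move `addr` up to the
   next multiple of `alignment`. `alignment` must be a power of two
   (16 for "align 16"). */
uintptr_t padding_to_align(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (alignment - (addr & (alignment - 1))) & (alignment - 1);
}

/* Hypothetical helper: returns 1 if the byte range [addr, addr + size)
   crosses a cache line boundary, 0 otherwise. `line_size` is an assumed
   power-of-two line size, not a value taken from the thread. */
int straddles_cache_line(uintptr_t addr, uintptr_t size, uintptr_t line_size)
{
    assert(size > 0);
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    /* Compare the line index of the first and last byte of the range. */
    return (addr / line_size) != ((addr + size - 1) / line_size);
}
```

The function in the first post comes to 17 bytes with the 6 byte FLD, so under an assumed 32 byte line it would straddle two lines when placed at an address ending in 0x10 but not at one ending in 0x00. Which of those you get depends on whatever precedes the function in the image.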

Yes.

The 0x3e is a segment override prefix. In this case, it's overriding the default segment used by the opcode with the DS segment. The default segment for the FLD instruction's memory operand is already DS, so the CPU is doing unnecessary work decoding an override that changes nothing (which implies the CPU doesn't look ahead to check whether the override is really necessary).

To get NASM to generate the opcode without the prefix, drop the explicit DS: from the operand. MASM uses the ASSUME directive to learn what the segment registers are pointing to, so it can insert overrides only when necessary. NASM has no ASSUME directive and doesn't track what DS currently is, so it assumes you know best and puts a DS override in whenever you write one (since you've put DS:addr in the instruction). If you write the operand as just [addr], the instruction's default segment, DS, is used and no override is emitted.
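
To make the 6 byte versus 7 byte difference concrete, here is a small C sketch that compares the two encodings byte by byte. The byte values follow the standard encoding of FLD with a 32 bit absolute address (opcode D9, ModRM 05, then the address in little-endian order). The array names and the `differs_only_by_ds_prefix` helper are made up for illustration, and the address is the one from the first post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Segment override prefix byte that selects DS. */
#define PREFIX_DS 0x3E

/* fld dword [0x4080D0] with no prefix: opcode D9, ModRM 05 (/0 with a
   32 bit displacement), then the address in little-endian order. */
const unsigned char fld_plain[] = { 0xD9, 0x05, 0xD0, 0x80, 0x40, 0x00 };

/* The same instruction with an explicit DS override in front. */
const unsigned char fld_ds[] = { 0x3E, 0xD9, 0x05, 0xD0, 0x80, 0x40, 0x00 };

/* Hypothetical helper: returns 1 if `longer` is exactly `shorter` with a
   single DS override byte in front of it, 0 otherwise. */
int differs_only_by_ds_prefix(const unsigned char *longer, size_t longer_len,
                              const unsigned char *shorter, size_t shorter_len)
{
    assert(longer != NULL && shorter != NULL);
    if (longer_len != shorter_len + 1)
        return 0;
    if (longer[0] != PREFIX_DS)
        return 0;
    return memcmp(longer + 1, shorter, shorter_len) == 0;
}
```

Passing the bytes from each assembler's listing file to a check like this is one way to confirm that the prefix is the only difference between the two versions.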

There's only one place a DS segment override is actually useful: whenever ESP or EBP is used as a base address, the SS segment is used by default, so an override is needed to reach data through DS instead.

There are, incidentally, up to four prefixes that can be added to an instruction: a lock or repeat prefix, a segment override or branch hint, an address size prefix and an operand (data) size prefix. The latter two are interesting as they add to the code size and execution time when addressing or using data of a size that doesn't match the default set by the segment's descriptor (e.g. accessing 16 bit data in a 32 bit segment).
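
The four prefix groups can be sketched as a small classifier in C. This assumes 32 bit mode and covers only the legacy prefix bytes. The enum and function names are made up for illustration, and the branch hints share byte values with the CS and DS overrides, so they land in the segment group here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the four legacy prefix groups. */
enum prefix_group {
    PREFIX_NONE = 0,
    PREFIX_LOCK_REP,      /* F0 lock, F2 repne, F3 rep */
    PREFIX_SEGMENT,       /* 26 ES, 2E CS, 36 SS, 3E DS, 64 FS, 65 GS */
    PREFIX_OPERAND_SIZE,  /* 66 */
    PREFIX_ADDRESS_SIZE   /* 67 */
};

/* Maps a single byte to its prefix group, or PREFIX_NONE if the byte
   is not a legacy prefix. */
enum prefix_group classify_prefix(unsigned char byte)
{
    switch (byte) {
    case 0xF0: case 0xF2: case 0xF3:
        return PREFIX_LOCK_REP;
    case 0x26: case 0x2E: case 0x36: case 0x3E: case 0x64: case 0x65:
        return PREFIX_SEGMENT;
    case 0x66:
        return PREFIX_OPERAND_SIZE;
    case 0x67:
        return PREFIX_ADDRESS_SIZE;
    default:
        return PREFIX_NONE;
    }
}

/* Counts the prefix bytes at the start of an instruction's encoding. */
size_t count_prefixes(const unsigned char *code, size_t len)
{
    size_t n = 0;
    assert(code != NULL);
    while (n < len && classify_prefix(code[n]) != PREFIX_NONE)
        n++;
    return n;
}
```

Run over the 7 byte FLD encoding discussed above, `count_prefixes` would report one prefix, the 0x3e, before the D9 opcode.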

Skizz

I think we have got to the bottom of this one then. eee, grand.

FLDZ is quite a lot faster than loading a constant from memory; however, VC++ isn't generating this instruction. With a few command line args it could probably be convinced to, although I am using the standard edition, which is missing most of the optimization features. However, I am pretty sure this is not the cause of my issue.

Thanks to Skizz for pointing out the instruction prefix. I knew about the lock, repeat and even the segment override prefix. However, clearly I was too stupid to connect the two, eh? Hmm, I guess you learn best from mistakes.

I have a feeling the anon poster who mentioned alignment has hit the cause though. The functions change speed when you change strings within the program; it's quite a large change, in fact. The only thing this can be doing is shifting the position of the code. Modern 32 bit x86 CPUs have cache lines on 16 byte boundaries, don't they? (Or is it 32 byte, but they do dual fetches? I dunno.) Oh well, time to hunt through the NASM docs.

Many thanks to all those who replied.
