
Asm/C Performance

zoggo · 2004-04-23T11:53:12

The following assembly code is generated by MS Visual C++ 6

    push ebp
    mov ebp,esp
    push ebx
    push esi
    push edi
    fld dword [ ds:0x4080D0 ]
    pop edi
    pop esi
    pop ebx
    pop ebp
    ret

for this function:

    float dotf( vec3 v1, vec3 v2 )
    {
        return 0;
    }

However, when I assemble the code using NASM (exact same calling conventions etc.) the code performs consistently slower than the compiler generated code. Does anybody know why? I assume NASM is generating slightly different machine code but I don't know what the difference is or how to change it.

It is not being inlined; I can see this from a debugger. It is not dependent on where it is called from. Also, it is not a timing issue, as various profilers and hand coded timer systems all show up the function as being slower.

Thanks in advance for any help received.

James


Started by zoggo April 20, 2004 01:46 PM

12 comments, last by zoggo 19 years, 12 months ago

quasar3d

814

April 21, 2004 03:48 PM

but it probably will when optimization is turned on. I'm pretty sure fldz will be faster than loading a constant from memory.


Anonymous

April 21, 2004 04:16 PM

Another performance hit could be caused by function alignment

The absolute worst case is when a function (A) is not aligned on a 4-byte boundary and (B) straddles two different cache lines.

MASM's "align 16" eliminates (A) and reduces the chance of (B) - find the equivalent of "align 16" in NASM and slap that baby in front of your function.

- Rockoon

Skizz

794

April 21, 2004 04:34 PM

Yes.

The 0x3e is a segment override prefix. In this case, it's overriding the default segment used by the opcode with the DS segment. The default segment for the FLD instruction is already DS, so the CPU is doing a lot of unnecessary work overriding the instruction's default segment (which implies the CPU doesn't look ahead to check the override is really necessary).

To get NASM to generate the opcode without the prefix, drop the explicit DS: from the operand and write just [addr]. NASM has no ASSUME directive; that belongs to MASM and TASM, where it tells the assembler what the segment registers are pointing to so it can insert overrides only when necessary. NASM doesn't track what DS currently is, so it assumes you know best and puts a DS override in whenever you've put DS:addr in the instruction. Leave the DS: out and the instruction uses its default segment, which is already DS, so no override is emitted.

There's only one place a DS segment override is actually useful: whenever ESP or EBP is used as a base address, the SS segment is used by default.

There are, incidentally, up to four prefixes that can be added to an instruction: a lock or repeat prefix, a segment override or branch hint, an address size prefix and an operand (data) size prefix. The latter two are interesting as they will add to the code size and execution time when accessing or using data of a size that doesn't match the segment's descriptor (i.e. accessing 16 bit data in a 32 bit segment).

Skizz

zoggo

194

Author

April 23, 2004 11:53 AM

I think we have got to the bottom of this one then. eee, grand.

FLDZ is quite a lot faster than loading a const from memory; however, VC++ isn't generating this instruction. That said, with a few command line args it could probably be convinced to, although I am using the Standard edition, which is missing most of the optimization features. However, I am pretty sure this is not the cause of my issue.

Thanks to Skizz for pointing out the instruction prefix. I knew about the lock, repeat and even the segment override prefix. However, clearly I was too stupid to connect the two, eh? Hmm, I guess you learn best from mistakes.

I have a feeling the anon poster who mentioned alignment has hit the cause though. The functions change speed when you change strings within the program; it's quite a large change in fact. The only thing this can be doing is shifting the position of the code. Modern 32 bit x86 CPUs have cache lines on 16 byte boundaries, don't they? (Or is it 32 byte, but do dual fetches? I dunno) Oh well, time to hunt through the NASM docs.

Many thanks to all those who replied.
