Asm/C Performance

12 comments, last by zoggo 19 years, 12 months ago
but it probably will when optimization is turned on. I'm pretty sure fldz will be faster than loading a constant from memory.

Another performance hit could be caused by function alignment.

The absolute worst case is when a function (A) is not aligned on a 4-byte boundary and (B) straddles two different cache lines.

Masm's "align 16" eliminates (A) and reduces the chance of (B) - find the equivalent of "align 16" in nasm and slap that baby in front of your function.

- Rockoon
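
To make conditions (A) and (B) above concrete, here is a minimal C sketch of the address arithmetic involved. The helper names is_aligned and lines_touched are hypothetical, and the cache line size is a parameter because it depends on the CPU; nothing here measures real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when addr is a multiple of alignment.
   Assumption: alignment is a non-zero power of two. */
static bool is_aligned(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}

/* Number of cache lines covered by `size` bytes starting at addr.
   Assumption: line_size is a non-zero power of two chosen by the caller. */
static size_t lines_touched(uintptr_t addr, size_t size, uintptr_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    if (size == 0)
        return 0;
    uintptr_t first = addr / line_size;              /* line holding the first byte */
    uintptr_t last  = (addr + size - 1) / line_size; /* line holding the last byte  */
    return (size_t)(last - first + 1);
}
```

With an assumed 32-byte line, an 8-byte routine at 0x40101E touches two lines, while the same routine moved to 0x401020 touches only one. That is the kind of shift a few bytes of extra data or code in front of a function can cause.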
Yes.

The 0x3e is a segment override prefix. In this case, it's overriding the default segment used by the opcode with the DS segment. The default segment for the FLD instruction is already DS, so the CPU is doing a lot of unnecessary work overriding the instruction's default segment (which implies the CPU doesn't look ahead to check the override is really necessary).

To get NASM to generate the opcode without the prefix, drop the explicit DS: from the operand. The ASSUME directive is how MASM and TASM are told what the segment registers point to so they can insert overrides only when necessary, but NASM has no ASSUME and doesn't track what DS currently is. It never generates a segment override on its own, so it assumes you know best and puts a DS override in only because you've put DS:addr in the instruction. Write the operand as plain addr and the instruction's default segment, which is already DS, is used with no override.

There's only one place a DS segment override is actually useful: whenever ESP or EBP is used as the base register, the SS segment is used by default, so an override is needed to address the data segment instead.

There are, incidentally, up to four prefixes that can be added to an instruction: a lock or repeat prefix, a segment override or branch hint, an address-size prefix (0x67) and an operand-size prefix (0x66). The latter two are interesting as they add to the code size and execution time when accessing or using data of a size that doesn't match the default set by the segment's descriptor (e.g. accessing 16-bit data in 32-bit code).

Skizz
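
As a rough illustration of the four prefix groups described above, the following C sketch counts the legacy prefix bytes at the start of an instruction. The names prefix_group, classify_prefix and count_prefixes are hypothetical, and the sketch assumes 32-bit code only (64-bit mode adds REX prefixes, which are not handled here). It is not a full decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The four legacy prefix groups; PFX_NONE means "not a prefix byte". */
enum prefix_group {
    PFX_NONE = 0,
    PFX_LOCK_REP,      /* 0xF0 LOCK, 0xF2 REPNE, 0xF3 REP               */
    PFX_SEGMENT,       /* ES/CS/SS/DS/FS/GS overrides (2E/3E also hints) */
    PFX_OPERAND_SIZE,  /* 0x66 */
    PFX_ADDRESS_SIZE   /* 0x67 */
};

static enum prefix_group classify_prefix(uint8_t b)
{
    switch (b) {
    case 0xF0: case 0xF2: case 0xF3:
        return PFX_LOCK_REP;
    case 0x26: case 0x2E: case 0x36: case 0x3E: case 0x64: case 0x65:
        return PFX_SEGMENT;
    case 0x66:
        return PFX_OPERAND_SIZE;
    case 0x67:
        return PFX_ADDRESS_SIZE;
    default:
        return PFX_NONE;
    }
}

/* Count the prefix bytes that precede the opcode in code[0..len). */
static size_t count_prefixes(const uint8_t *code, size_t len)
{
    assert(code != NULL || len == 0);
    size_t n = 0;
    while (n < len && classify_prefix(code[n]) != PFX_NONE)
        n++;
    return n;
}
```

Fed the bytes 3E D9 05 followed by a 32-bit address (an FLD from memory with the redundant DS override), count_prefixes reports one prefix byte; the same instruction without the 3E reports none and is one byte shorter.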
I think we have got to the bottom of this one then. eee, grand.

FLDZ is quite a lot faster than loading a constant from memory; however, VC++ isn't generating this instruction. That said, with a few command line args it could probably be convinced to, although I am using the Standard edition, which is missing most of the optimization features. In any case, I am pretty sure this is not the cause of my issue.

Thanks to Skizz for pointing out the instruction prefix. I knew about the lock, repeat and even the segment override prefix. However, clearly I was too stupid to connect the two, eh? Hmm, I guess you learn best from mistakes.

I have a feeling the anon poster who mentioned alignment has hit the cause though. The functions change speed when you change strings within the program, and it's quite a large change in fact. The only thing this can be doing is shifting the position of the code. Modern 32-bit x86 CPUs have cache lines on 16-byte boundaries, don't they? (Or is it 32-byte, but with dual fetches? I dunno.) Oh well, time to hunt through the nasm docs.

Many thanks to all those who replied.

