FADD vs FMUL time

4 comments, last by Hodgman 11 years ago

Hi,

I've been doing some comparisons in C between the four basic arithmetic operations (+, -, *, /), and surprisingly (for me), add and multiply take the same amount of time. I ran the tests using both int and double data types and got the same result.

Analyzing the disassembly generated by gcc (with the -S flag), I noted that the opcodes used are fadd and fmul. According to Wikipedia, the x87 FPU in the Athlon 64 takes the same time to process both opcodes.
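For reference, a minimal loop of the sort I described (not the exact test code, just an illustration of what produces fadd/fmul in the -S output; names and the iteration count are made up) would be something like:

#include <stdio.h>
#include <time.h>

int main(void) {
    /* volatile keeps gcc from folding the loops away entirely */
    volatile double a = 1.000001, acc_add = 0.0, acc_mul = 1.0;
    clock_t t0, t1;
    long i;

    t0 = clock();
    for (i = 0; i < 100000000L; ++i)
        acc_add = acc_add + a;   /* x87 code gen: fadd; SSE2 code gen: addsd */
    t1 = clock();
    printf("add: %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (i = 0; i < 100000000L; ++i)
        acc_mul = acc_mul * a;   /* x87 code gen: fmul; SSE2 code gen: mulsd */
    t1 = clock();
    printf("mul: %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return 0;
}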

I'd like to know the reason behind this curiosity.

Thanks.

A lot of time and effort (and die space) has been spent on optimizing fmul.

Are you asking for the specifics of the floating point ALU? That's getting pretty deep, man. I'd be curious to see it if anyone has access to that info.
void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.
Any operation on a floating point number is complicated -- both addition and multiplication basically require steps that add and multiply or shift the component parts of the float.
There's fixed-function hardware that's hard-wired to perform each of these operations, and it turns out they can be implemented with similar time constraints. A lot of operations can be hard wired to complete in a single clock cycle, if you throw enough transistors at it.
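To make the "component parts" idea concrete, here's a rough C sketch that splits doubles into mantissa and exponent with frexp/ldexp and multiplies them piecewise. It's only an illustration of the decomposition (the function name is made up, and the mantissa product here still goes through the FPU); real hardware works directly on the significand and exponent bit fields.

#include <math.h>
#include <stdio.h>

/* Multiply two doubles "by parts": multiply the mantissas, add the exponents.
   This mirrors the structure of the hardware algorithm, not its implementation. */
double mul_by_parts(double x, double y) {
    int ex, ey;
    double mx = frexp(x, &ex);       /* x = mx * 2^ex, with 0.5 <= |mx| < 1 */
    double my = frexp(y, &ey);       /* y = my * 2^ey */
    return ldexp(mx * my, ex + ey);  /* (mx*my) * 2^(ex+ey) */
}

int main(void) {
    printf("%g vs %g\n", mul_by_parts(3.5, -2.25), 3.5 * -2.25);  /* both -7.875 */
    return 0;
}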

Are you asking for the specifics of the floating point ALU? That's getting pretty deep, man. I'd be curious to see it if anyone has access to that info.

Maybe: my interest lies in the hardware/algorithmic aspects behind the add and mul operations, regardless of whether those operations are performed in the FPU or not.


A lot of operations can be hard wired to complete in a single clock cycle, if you throw enough transistors at it.

The FDIV operation is still much slower than FADD/FMUL. Does that mean FDIV would require too many transistors to approach FADD/FMUL times?

Yes, it's more efficiently implemented with an iterative algorithm, where each clock cycle performs one iteration.
[edit] internally, the algorithm of course has to use integer division
http://stackoverflow.com/questions/8401194/the-integer-division-algorithm-of-x86-processors
[/edit]
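For illustration, here's the textbook restoring-division loop in C, which produces one quotient bit per iteration. It's just a sketch of the "one iteration per cycle" idea (function name made up, divisor assumed non-zero), not the actual x86 microcode:

#include <stdint.h>
#include <stdio.h>

/* Restoring division: one quotient bit per iteration, which is roughly why
   a divide takes many more cycles than an add or multiply. Assumes d != 0. */
uint32_t divide_restoring(uint32_t n, uint32_t d, uint32_t *rem) {
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; --i) {
        r = (r << 1) | ((n >> i) & 1);  /* bring down the next dividend bit */
        if (r >= d) {                   /* does the divisor fit? */
            r -= d;
            q |= 1u << i;
        }
    }
    *rem = r;
    return q;
}

int main(void) {
    uint32_t r;
    printf("%u rem %u\n", divide_restoring(100, 7, &r), r);  /* 14 rem 2 */
    return 0;
}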

Note that CPUs often have some kind of RCP op, which very quickly computes an approximation to 1/x, rather than y/x. Sometimes "close enough" results are OK (e.g. in graphics), where you'd use y*rcp(x) instead of y/x.
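As a rough sketch of that (assuming x86 with SSE, and a made-up wrapper name): _mm_rcp_ss gives only about 12 bits of precision, so a Newton-Raphson step is often tacked on when a bit more accuracy is wanted.

#include <xmmintrin.h>
#include <stdio.h>

/* Approximate y/x as y * rcp(x). RCPSS is fast but low precision;
   one Newton-Raphson step r = r*(2 - x*r) roughly doubles the precision. */
float fast_div(float y, float x) {
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
    r = r * (2.0f - x * r);   /* optional refinement step */
    return y * r;
}

int main(void) {
    printf("%f vs %f\n", fast_div(10.0f, 3.0f), 10.0f / 3.0f);
    return 0;
}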

You can find the human-readable algorithms by searching for "floating point multiplication", etc., and the format's layout is on Wikipedia. I'm not sure where you'd find details of what the logic-gate/transistor diagrams would look like... the most advanced thing I've drawn in hardware diagrams is an integer adder ;-)
Maybe the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" document would be illuminating?
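For what it's worth, that integer adder can be expressed in C as a ripple-carry chain of full adders. This is just a software sketch of the hardware structure (function name made up); the point is that each bit's carry feeds the next, which is why real CPUs use carry-lookahead and similar tricks to finish an add in one cycle.

#include <stdint.h>
#include <stdio.h>

/* 32-bit ripple-carry adder built from full adders, using logic ops only. */
uint32_t add_gates(uint32_t a, uint32_t b) {
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; ++i) {
        uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        uint32_t s = ai ^ bi ^ carry;            /* sum bit */
        carry = (ai & bi) | (carry & (ai ^ bi)); /* carry out to the next bit */
        sum |= s << i;
    }
    return sum;
}

int main(void) {
    printf("%u\n", add_gates(1234, 5678));  /* 6912 */
    return 0;
}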

