question about multiplication

Started by
7 comments, last by baddrizz 21 years, 8 months ago
I'm sitting here reading The Art of Assembly Language to learn assembly, because I've been programming for about two years and have been a so-so programmer, so I figured: let's learn asm to know more about how computers work, make fast code, etc. In the last week I've learned more about the inner workings of a computer than I had since I started programming. But anyway, on to my question: everywhere I go people say multiplication is slow, so would it be faster to use a function you wrote that adds the number you're multiplying however many times, like this?

int a = 2, b = 3, c = 4;
a = b * c;      // supposed to be slow
a = mul(b, c);  // maybe faster??

inline int mul(int a, int b)
{
    int result = 0;
    for (int i = 0; i < b; i++)
        result = result + a;
    return result;
}

I would really like to hear any thoughts or comments on this idea. Someone has probably thought of this already, but oh well. Thanks for any replies.
Oh yeah, just thought I'd say I quickly thought of that code, so I don't know if it works, but you get the idea, I think.
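For anyone who wants to try it out, here's a minimal compilable sketch of that repeated-addition idea (it mirrors the corrected mul from the first post; the test values are just for illustration):

#include <cstdio>

// repeated-addition multiply, same idea as the post above
inline int mul(int a, int b)
{
    int result = 0;
    for (int i = 0; i < b; i++)   // assumes b >= 0
        result += a;
    return result;
}

int main()
{
    printf("3 * 4     = %d\n", 3 * 4);
    printf("mul(3, 4) = %d\n", mul(3, 4));   // should print the same value
    return 0;
}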
I don't know, but I would guess that the CPU uses something similar to your mul(a,b) algorithm internally to calculate multiplication, and does it much faster than your mul(a,b).
OK, mul is slow, BUT the function you've written is SLOWER.
When you hear "mul is slow", it means that mul is slower than other processor instructions, not that mul is "absolutely slow". I think it's impossible to optimize mul by rewriting it.
I know only one way to optimize mul, and it works only when you multiply by a number that is a power of two:

a * 2 is equivalent to a << 1 (shift left)
a * 4 == a << 2
etc...
Yep, shifting and bitmasking can take the place of some arithmetic, and in some cases can be faster (like on 286-486 processors, generally):

x << y = x * ( 2 ^ y )
So if you want to multiply an integer by 8...
x << 3 = x * 8

x >> y = x / ( 2 ^ y ) (for unsigned or non-negative values)
Same as with multiplication; if you want to divide an integer by 16...
x >> 4 = x / 16

x & y = x % ( y + 1 ) (only when y + 1 is a power of two, i.e. y is a mask of low 1-bits)
Does modulo using bitwise operators...
x & 0x03 = x % 4
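To make those identities concrete, here's a tiny sketch (illustrative only; the value of x is arbitrary):

#include <cassert>

int main()
{
    unsigned int x = 100;

    assert((x << 3) == x * 8);    // multiply by 2^3
    assert((x >> 4) == x / 16);   // divide by 2^4
    assert((x & 0x03) == x % 4);  // modulo by a power of two (mask = 4 - 1)

    return 0;
}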

Now on a Pentium processor, you will probably want to simply stick with the usual multiplication, division, and modulo operators, as they can be done on either pipeline and so may be optimized better than any of these other techniques.

Summary: Do these things in the simplest manner possible to start with. Then test with some of these easy optimizations (_AFTER_ you've tried algorithmic optimizations).
quote:
I know only one way to optimize mul, and it works only when you multiply by a number that is a power of two:

a * 2 is equivalent to a << 1 (shift left)
a * 4 == a << 2
etc...


Two other ways:
1) the above can be expanded to do more complex sums (there's a quick sketch after point 2 below), e.g.

a * 10 == (a << 1) + (a << 3)
a * 15 == (a << 4) - a
a * 89 == (a << 6 ) + (a << 5) - (a << 3) + a

2) Use floats. FP multiplication is usually a lot quicker than integer. Some processors have multiply-add instructions that are as quick as plain multiply, making FP even quicker for calculations involving both products and sums.
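A quick sketch checking the shift sums from 1) and the multiply-add idea from 2) (illustrative only; std::fma comes from <cmath> and maps to a hardware fused multiply-add where one exists):

#include <cassert>
#include <cmath>
#include <cstdio>

int main()
{
    int a = 7;   // arbitrary test value

    // the shift sums from 1)
    assert(a * 10 == (a << 1) + (a << 3));                 // 2a + 8a
    assert(a * 15 == (a << 4) - a);                        // 16a - a
    assert(a * 89 == (a << 6) + (a << 5) - (a << 3) + a);  // 64a + 32a - 8a + a

    // the multiply-add idea from 2): x * y + z in one call
    float r = std::fma(2.0f, 3.0f, 1.0f);                  // 2 * 3 + 1
    printf("%f\n", r);                                     // prints 7.000000

    return 0;
}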
John Blackburne
Programmer, The Pitbull Syndicate
Yes, hidden pointed out how to make use of shifts in special cases to get a very fast "mul" by a power of 2. The same works for div by a power of 2, but you shift right instead of left.

And back to your question: mul is slow because it takes more CPU cycles than add or shift (shl, shr).
On a Pentium CPU it takes 10-11 cycles, while simple instructions like add take 1 cycle.

Your mul function (a*b=c) uses add and jump instructions b times (in your loop)... now b is a 32-bit value, so it can be, for example, 100,000... so what about 100,000 add operations and 100,000 jump operations?

Add to that things like branch prediction for the jump (if it doesn't predict right, the penalty is roughly the length of the pipeline... for modern CPUs that's 12 cycles on the Athlon XP and 20 on the P4), plus additional waiting time because every add uses the result of the previous add (a long stall every time you do a=a+d, since before executing another a=a+d the CPU has to wait for the earlier result to be calculated).

All in all, your loop is probably slower than mul for b greater than 1 or so...

PS. Do some testing and you will see - it's very good practice to time your functions and see how they perform.
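For example, a rough way to time the two versions might look like this (just a sketch using the standard clock; a real benchmark needs more care so the compiler can't optimize the work away):

#include <cstdio>
#include <ctime>

volatile int sink;   // volatile so the compiler can't drop the loops entirely

int main()
{
    const int iterations = 10000000;

    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        sink = i * 41;                        // built-in multiply
    clock_t middle = clock();

    for (int i = 0; i < iterations; i++)      // repeated-addition version
    {
        int result = 0;
        for (int j = 0; j < 41; j++)
            result += i;
        sink = result;
    }
    clock_t end = clock();

    printf("mul operator: %ld ticks\n", (long)(middle - start));
    printf("add loop:     %ld ticks\n", (long)(end - middle));
    return 0;
}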

PPS.
You would get a much better mul using lots of shift instructions.

Theory
a*5=a*4+a*1
or
a*132=a*128+a*4

we can get the power of 2 multiplication by shifting values.

we use the AND instruction to check whether specific bits of the b value (from a*b=c) are set (1) or not.

if a bit is set in b, we add a << bitnr to our result (c) (so a shifted left by bitnr bits, where bitnr is the number of the bit we check in b)

OK, that was a pretty bad explanation, so let's try to use it - it's easier to understand in practice

(we will use 8-bit numbers for simplicity and a 16-bit result; it works the same way with more bits)

a=01001001 // it's in binary format, equal to 73 decimal
b=00101001 // it's in binary format, equal to 41
73*41=2993 (101110110001 in binary) (hope I haven't made a typo somewhere)

our mul works in a few simple steps (a*b=c)
1. c=0 (result)
2. for (bitnr=0; bitnr<8; bitnr++)
{
3.     if ((b & (1 << bitnr)) != 0) c += a << bitnr;
}
4. the result is waiting in c, use it
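Written out as a complete function, the same steps might look like this (a sketch only; the function name is made up, and it assumes b is non-negative and fits in 8 bits, like the example below):

#include <cstdio>

// shift-and-add multiplication, following steps 1-4 above;
// handles an 8-bit b (widen the loop for bigger values)
unsigned int shift_add_mul(unsigned int a, unsigned int b)
{
    unsigned int c = 0;                        // 1. result starts at 0
    for (int bitnr = 0; bitnr < 8; bitnr++)    // 2. walk the 8 bits of b
    {
        if ((b & (1u << bitnr)) != 0)          // 3. is this bit of b set?
            c += a << bitnr;                   //    then add a * 2^bitnr
    }
    return c;                                  // 4. result is waiting in c
}

int main()
{
    printf("%u\n", shift_add_mul(73, 41));     // prints 2993, same as 73 * 41
    return 0;
}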

from this code it's already obvious that, at the cost of one extra if, one AND and two shifts per iteration, we loop over only 8 bits for every mul (8 times, not more, not less; with your mul, multiplying two 8-bit numbers could loop anywhere from 0 to 255 times)

now what we do:
1. We set the result to 0 (that's obvious)
2. We make a loop over the number of bits the b value is made of (b is an 8-bit value, so we loop with bitnr going from 0 to 7)
3. Here we do all the work. First we check whether a particular bit in b is set, and if it is, we add one multiplication (we use a shift, as it's faster and works for factors that are a power of 2)

in our example it looks like this (I will emulate the CPU):
a=01001001
b=00101001

1.c=0

2.bitnr=0

3. if ((b&(1 << bitnr))!=0)
b=00101001
1 << bitnr = 1 << 0 = 1
00101001 (b) and (&)
00000001 (1) =
00000001 (result 1 , 1!=0 - true)

4.because 1!=0 we perform c+=a << bitnr; so c+=a << 0; so c+=a;
in short c=a now.

5.bitnr is increased by 1 because of for loop- bitnr=1 , which is <8 so we continue our loop

6.if ((b&(1 << bitnr))!=0)
b=00101001
1 << bitnr = 1 << 1 = 2 (binary 10)
00101001 (b) and (&)
00000010 (2) =
00000000 (result 0 , 0!=0 - false)

7.bitnr is increased by 1 because of for loop- bitnr=2 , which is <8 so we continue our loop

8.if ((b&(1 << bitnr))!=0)
b=00101001
1 << bitnr = 1 << 2 = 4 (binary 100)
00101001 (b) and (&)
00000100 (4) =
00000000 (result 0 , 0!=0 - false)

9.bitnr is increased by 1 because of for loop- bitnr=3 , which is <8 so we continue our loop

10.if ((b&(1 << bitnr))!=0)
b=00101001
1 << bitnr = 1 << 3 = 8 (binary 1000)
00101001 (b) and (&)
00001000 (8) =
00001000 (result 8 , 8!=0 - true) (00001000 is 8 decimal)

11.because 8!=0 we perform c+=a << bitnr; so c+=a << 3; so c+=a*8;
in short c=a+a*8 now.

12.bitnr is increased by 1 because of for loop- bitnr=4 , which is <8 so we continue our loop

13.if ((b&(1 << bitnr))!=0)
b=00101001
1 << bitnr = 1 << 4 = 16 (binary 10000)
00101001 (b) and (&)
00010000 (16) =
00000000 (result 0 , 0!=0 - false)

14.bitnr is increased by 1 because of for loop- bitnr=5 , which is <8 so we continue our loop

15.if ((b&(1 << bitnr))!=0)
b=00101001
1 << bitnr = 1 << 5 = 32 (binary 100000)
00101001 (b) and (&)
00100000 (32) =
00100000 (result 32 , 32!=0 - true) (00100000 is 32 decimal)

16.because 32!=0 we perform c+=a << bitnr; so c+=a << 5; so c+=a*32;
in short c=a+a*8+a*32 now.

... (the higher bits in the b value are 0, so we won't add anything to c anymore, and we leave the loop once bitnr>7... I'm just tired and won't copy/paste those few lines)

all in all, we finish with the result c=a+a*8+a*32, which is equal to c=a*41. So instead of running a loop 41 times we did it 8 times and still got the same result... although each iteration of this loop executes slower than an iteration of your loop, decreasing the number of repeats makes my procedure much faster. It's also better for CPU design, as we ALWAYS repeat it 8 times, so designers know exactly how many clock cycles will be spent on each mul etc.

But wait, there's more... there are a few optimizations possible.
The most obvious one is to check, each time around the loop, whether b >> bitnr > 0, and exit the for loop if it isn't.
The reason is simple: if b=00000010 we don't have to loop for bitnr=2, bitnr=3... bitnr=7, as it won't change the final result.
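In code, that early exit might look like this (a sketch only, added to the loop from earlier):

// same loop as before, but bail out once no set bits remain in b
unsigned int shift_add_mul_early_exit(unsigned int a, unsigned int b)
{
    unsigned int c = 0;
    for (int bitnr = 0; bitnr < 8; bitnr++)
    {
        if ((b >> bitnr) == 0)         // no 1-bits left at or above this position
            break;                     // so nothing more would be added to c
        if ((b & (1u << bitnr)) != 0)
            c += a << bitnr;
    }
    return c;
}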




That was long! Hope it helps someone...


[edited by - MirekCz on August 29, 2002 4:54:30 PM]
With best regards, Mirek Czerwiński
Pretty funny stuff.
Most modern CPUs can do multiplication as fast as addition.
quote:Original post by mossmoss
Pretty funny stuff.
Most modern CPUs can do multiplication as fast as addition.


ehh, where did you read that?
I have just checked the AMD docs about it to be 100% sure... multiplication takes 4-18 cycles on the Athlon XP (mostly in the 4-9 range)
So there's still a decent speed drawback to using mul compared to the add instruction

Of course our self-made mul function doesn't have a chance against the hardware implementation... it's just for fun.

With best regards, Mirek Czerwiński

This topic is closed to new replies.
