Jump to content

  • Log In with Google      Sign In   
  • Create Account


Karatsuba algorithm


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

  • You cannot reply to this topic
12 replies to this topic

#1 Rattrap   Members   -  Reputation: 1606

Like
0Likes
Like

Posted 13 May 2014 - 01:16 PM

I've been implementing my own big integer class in C++.  I've been trying to implement the Karatsuba algorithm as an alternate method of doing multiplication once the bit length reaches a certain threshold.  My current implementation seems to get the first 3 or 4 elements correct, but after that the values are incorrect.

 

I've tried reproducing the algorithm using C# and .Net's BigInteger, and I've been getting the same results. So, I believe that my actual math functions are correct (Multiply, Add, Subtract, LeftShift), I'm guessing I'm doing something incorrectly in the algorithm itself.  I'm hoping some other sets of eyes might see something I missed.

 

My big integer class stores binary values in vectors of unsigned integers, the lowest order bits at the beginning of the array.  All values are stored in the positive, along with a signed bit.  Since the signs don't actually affect multiplication, the function is not concerned about them.

 

typedef std::vector<std::uint32_t> storage_t
 
inline static void AlternateMultiplication(const storage_t& Lhs, const storage_t& Rhs, storage_t& Product)
{
/*Get the size of each input*/
const auto LhsSize(Lhs.size());
const auto RhsSize(Rhs.size());
/*If one of the inputs is a single element long, use regular multiplication*/
if((LhsSize <= 1) || (RhsSize <= 1))
{
/* Multiplication ignoring the signs.  First two parameters are inputs by const reference, 3rd parameter is the output by reference */
AbsoluteMultiplication(Lhs, Rhs, Product);
return;
}
 
/*Find the length of the longest input*/
const auto LongestSize(std::max(LhsSize, RhsSize));
/*Find half of the longest input. The +1 helps it round up*/
const auto HalfLongestSize((LongestSize + 1) / 2);
/*Hold over from messing with the algorithm at the moment*/
const auto HalfLongestSizeSafe(HalfLongestSize);
 
/*Vector's to copy the split values*/
storage_t LhsHigh(HalfLongestSizeSafe);
storage_t LhsLow(HalfLongestSizeSafe);
storage_t RhsHigh(HalfLongestSizeSafe);
storage_t RhsLow(HalfLongestSizeSafe);
 
/*Iterators marking where the splits will occur*/
const auto LhsBegin(std::begin(Lhs));
const auto LhsEnd(std::end(Lhs));
const auto LhsMid(LhsSize >= HalfLongestSize ? LhsBegin + HalfLongestSize : LhsEnd);
const auto RhsBegin(std::begin(Rhs));
const auto RhsEnd(std::end(Rhs));
const auto RhsMid(RhsSize >= HalfLongestSize ? RhsBegin + HalfLongestSize : RhsEnd);
 
/*Split the inputs*/
storage_t LhsLow(LhsBegin, LhsMid);
Truncate(LhsLow);
storage_t LhsHigh(LhsMid, LhsEnd);
Truncate(LhsHigh);
storage_t RhsLow(RhsBegin, RhsMid);
Truncate(RhsLow);
storage_t RhsHigh(RhsMid, RhsEnd);
Truncate(RhsHigh);
 
storage_t Z0;
AlternateMultiplication(LhsLow, RhsLow, Z0);
 
storage_t Z2;
AlternateMultiplication(LhsHigh, RhsHigh, Z2);
 
storage_t Z1;
AbsoluteAddition(LhsHigh, LhsLow, LhsHigh);
AbsoluteAddition(RhsHigh, RhsLow, RhsHigh);
AlternateMultiplication(LhsHigh, RhsHigh, Z1);
 
/*Just to be safe, tracking the sign of Z1, since subtraction is involved*/
bool Z1Sign(true);
Subtraction(Z1Sign, Z1, true, Z2, Z1Sign, Z1);
Subtraction(Z1Sign, Z1, true, Z0, Z1Sign, Z1);
/*Left shift doesn't affect the sign. Convert elements length of number of bits to shift*/
AbsoluteLeftShift(Z1, (HalfLongestSize * (sizeof(T) * CHAR_BIT)), Z1);
 
bool Z2Sign(true);
AbsoluteLeftShift(Z2, (LongestSize * (sizeof(T) * CHAR_BIT)), Z2);
Addition(Z2Sign, Z2, Z1Sign, Z1, Z2Sign, Z2);
bool ProductSign(true);
Addition(Z2Sign, Z2, true, Z0, ProductSign, Product);
}

 

In the pseudocode on Wikipedia, the example is in base 10, where my code is in binary.  If I had to guess, the issue is in the left shifting or that I have to keep breaking down a single element down into a single bytes or smaller, since this is base 2 math.

 

storage_t Input = {0x2fad3fd6,0xb14e9f81,0x00000006};
storage_t Output;
 
AlternateMultiplication(Input, Input, Output);

 

Output - 0xeb2706e4,0x1d2f3c5b,0x7627584b,0xca7d4ae8,0x00000008

Correct - 0xeb2706e4,0x1d2f3c5b,0x7627584b,0xca7d4ac4,0x0000002c

 

As you can see, the first 3 elements are correct, as well as all but the highest order byte in the 4th.  This is similar if the inputs get bigger.  If the inputs are smaller, the answer comes out correct.

 

I'm trying to avoid posting the entire big integer class, since it is massive.  Any insight or suggestions would be helpful.



Sponsor:

#2 plainoldcj   Members   -  Reputation: 765

Like
1Likes
Like

Posted 13 May 2014 - 02:35 PM

Hey,

could it be that you override memory? You could try
changing func(A, B, A) to func(A, B, C) and see what happens then.

A note on bignums: you mentioned that you use base 2. I think the typical
choice is base 2^n, where n is the number of bits in an int-value.
This simplifies things as you can add two digits using ordinary + operator
etc.

#3 Rattrap   Members   -  Reputation: 1606

Like
0Likes
Like

Posted 13 May 2014 - 02:40 PM


A note on bignums: you mentioned that you use base 2. I think the typical
choice is base 2^n, where n is the number of bits in an int-value.
This simplifies things as you can add two digits using ordinary + operator
etc.

 

That is how it works.  Sorry, I meant base 2 as in I wasn't storing the number in base 10 form (like I've seen some big int libraries do), but as binary number that use the entire element.  The actual class is template based allowing any unsigned base integer to be used as a vector element.

 


could it be that you override memory? You could try

changing func(A, B, A) to func(A, B, C) and see what happens then.

 

The places where It is overriding old values is where those values aren't used again (which is why I do Z0 and Z2 first), but I'll give it a shot. 



#4 Rattrap   Members   -  Reputation: 1606

Like
1Likes
Like

Posted 14 May 2014 - 11:56 AM

There was a link at the bottom of the Wikipedia page for a c++ example that worked very similar to mine.  After rigging up some debugging code, I found that the issue involved the splitting and the shifting.  I added 1 before dividing the size by 2 in order to make the lower half bigger.  I assumed this would be the place to split it and trailing zeros in the upper half would "balance" it out.  Based on the other code, it looks like you want the higher order to be the larger of the two if it doesn't split perfectly.  This then caused me to use to large of a shift for Z2, since it should have been shifting by twice the "half".  I assumed that twice the half would always be the longest length, which isn't true when the number is odd.



#5 iMalc   Crossbones+   -  Reputation: 2299

Like
1Likes
Like

Posted 15 May 2014 - 01:41 AM

I implemented that algorithm , but found that the numbers had to be pretty big before it was a win. About 700 bytes IIRC. It would be good to hear how well it works for you.

In case you want to compare notes, my three variations of bigint libraries are here: http://homepages.ihug.co.nz/~aurora76/Malc/Useful_Classes.htm


"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

#6 Rattrap   Members   -  Reputation: 1606

Like
2Likes
Like

Posted 15 May 2014 - 06:15 AM

Thanks.  This has been one of those pet projects of mine that has been written and rewritten several times (bigint not Karatsuba).  I've been playing with implementing different cryptographic functions and needed it to get RSA to work.

 

Based on Wikipedia, it was saying in the 300-600 bit range should be where it beats the long multiplication method.  My default multiuplication uses the shift and add method instead of long multiplication, which tends to be faster than long multiplication as well.  So my base comparison is a little skewed to begin with.

 

I've been comparing runtimes between my Karatsuba and my shift and add based multiplication and currently Karatsuba is way slower using QueryPerformanceCounter as the metric.  Doing a test of 50 multiplies for each algorithm, even at the 25000 bit range (I was really looking for the point Karatsuba would overtake the other), it was still twice as slow.  The shift and add version only needs to two temporary vectors. My current Karatsuba implementation currently has to allocate 4 temporary vectors for each level of recursion, so I think the memory allocations are killing it.  The two vectors in shift and add grow so there are multiple allocations there too, but no where near as bad.  Plus the needed space can be pre-allocated.  I'm going to try allocating a large pool before starting the recursion and passing iterators through to see if that will help speed it up.



#7 Bacterius   Crossbones+   -  Reputation: 8523

Like
3Likes
Like

Posted 15 May 2014 - 06:51 AM


I've been comparing runtimes between my Karatsuba and my shift and add based multiplication and currently Karatsuba is way slower using QueryPerformanceCounter as the metric.  Doing a test of 50 multiplies for each algorithm, even at the 25000 bit range (I was really looking for the point Karatsuba would overtake the other), it was still twice as slow.  The shift and add version only needs to two temporary vectors. My current Karatsuba implementation currently has to allocate 4 temporary vectors for each level of recursion, so I think the memory allocations are killing it.  The two vectors in shift and add grow so there are multiple allocations there too, but no where near as bad.  Plus the needed space can be pre-allocated.  I'm going to try allocating a large pool before starting the recursion and passing iterators through to see if that will help speed it up.

 

There's a lot of tweaking you can do to lower the overhead. Not making unnecessary copies, using an iterative version rather than recursion, switching over to the standard multiplication algorithm at some point during the process instead of recursing down to the last digit, and, yes, allocations inside primitive functions is definitely a mistake - in many cases you know exactly how big the output is going to be at most, so you should try and preallocate whenever possible. In any case, until both versions have been equally optimized, a performance comparison between the two is not too useful.

 

I should note though that if you are doing this specifically for RSA, the main bottleneck in RSA is not multiplication, but modular multiplication/exponentiation. There are faster ways to do this, such as Montgomery reduction, and furthermore decryption can be made faster by using number-theoretic algorithms (but those are probably beyond the scope of what you're trying to achieve). Anyway I thought I'd mention it, many bigint libraries do provide very optimized algorithms for "a^b mod m" type calculations, but a Karatsuba multiplication algorithm is always useful for the general case.


The slowsort algorithm is a perfect illustration of the multiply and surrender paradigm, which is perhaps the single most important paradigm in the development of reluctant algorithms. The basic multiply and surrender strategy consists in replacing the problem at hand by two or more subproblems, each slightly simpler than the original, and continue multiplying subproblems and subsubproblems recursively in this fashion as long as possible. At some point the subproblems will all become so simple that their solution can no longer be postponed, and we will have to surrender. Experience shows that, in most cases, by the time this point is reached the total work will be substantially higher than what could have been wasted by a more direct approach.

 

- Pessimal Algorithms and Simplexity Analysis


#8 Rattrap   Members   -  Reputation: 1606

Like
0Likes
Like

Posted 15 May 2014 - 07:21 AM


I should note though that if you are doing this specifically for RSA, the main bottleneck in RSA is not multiplication, but modular multiplication/exponentiation. There are faster ways to do this, such as Montgomery reduction, and furthermore decryption can be made faster by using number-theoretic algorithms (but those are probably beyond the scope of what you're trying to achieve). Anyway I thought I'd mention it, many bigint libraries do provide very optimized algorithms for "a^b mod m" type calculations, but a Karatsuba multiplication algorithm is always useful for the general case.

 

RSA was just one of the motivations I had for trying to implement a big int library.  I've just recently been researching faster multiplication methods on big integers.  As far as modular exponentiation, I'm currently using the method described on the Wikipedia entry using the Right-to-left binary method optimized with my big int class.

 

This has been a real iterative process.  I've been trying to just get things working, then work on trying to get things faster.  My goal has been to try and keep it C++11 compliant, not being dependent on platform optimizations.  I was just ecstatic when my Karatsuba implementation worked, so I felt like comparing my two implementations to see how they compared at the moment.  I was hoping to build a crossover point into the multiplication function, but until I further optimize the Karatsuba method, it can't surpass it.



#9 iMalc   Crossbones+   -  Reputation: 2299

Like
0Likes
Like

Posted 16 May 2014 - 02:45 AM

Always cool to hear of other people messing with this stuff. It sounds like we have fairly similar goals.

The three main bigint classes I have are:

One for fixed-size (not dynamically allocated) integers. The goal of this one if to behave as close to the real thing as possible. In fact to that end, it did end up getting used in a commercial compiler.

One for variable-sized (dynamically allocated) integers.

One for string-based integers.

My Karatsuba step is in the second one. See uvarbigint.cpp, in uvarbigint::Multiply_Unsigned. It doesn't do brilliantly in terms of reducing allocations itself, but hopefully it shows something of value. I'd be interested to hear if you'd like to work together on combining our efforts.


"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

#10 Rattrap   Members   -  Reputation: 1606

Like
0Likes
Like

Posted 16 May 2014 - 06:24 AM

Going to have to really stare at your code a while, since it is structured very different than mine.  For example yours is using raw pointers and a lot of for loops.

 

I had original started writing an unsigned and signed version of the code, but once subtraction came into play with the unsigned big int, I just dropped it.  I didn't think it was really feasible once I tried to define 0 - 1.  It is well defined in C++ since the size of the native type doesn't grow and modulo wrap around comes into play.  I couldn't wrap my head around what the result would be, since it had no maximum value to use as a modulo.

 

Just an example, here is my signed addition code (minus a couple functions).  I included the shared functions that are called to perform the twos compliment on the fly.

// T is an unsigned integer and a template parameter for the class.
typedef std::vector<T> storage_t;
 
  /*
   * Helper function for returning an elements twos compliment value
   * Requires a persistant boolean value to track the carry/overflow
   * from element to element.
   */
  inline static const T GetSignedValue(bool& Carry, bool Sign, T Value)
  {
   /* Sign test */
   if(Sign == false)
   {
    /* Perform twos compliment to get negative value */
    const T OnesCompliment(~Value);
 
    if(Carry == true)
    {
     /* Handle twos compliment +1/addition carry */
     const T TwosCompliment(OnesCompliment + 1);
 
     /*
      * If the two's compliment value is zero, the Carry will always
      * be false throughout the rest of the elements.
      */
     Carry = TwosCompliment == 0;
     return TwosCompliment;
    }
    return OnesCompliment;
   }
   return Value;
  }
  /*
   * Perform an operation on a set of input values resulting in the output
   * Handles conversion of inputs and result to and from twos compliment
   */
  template <typename BinaryOperation>
  inline static const T SignedBinaryOperation(bool& LhsCarry, bool LhsSign, T LhsValue, bool& RhsCarry, bool RhsSign, T RhsValue, bool& ResultCarry, bool ResultSign, BinaryOperation binary_op)
  {
   /* Perform twos compliment on input, if needed */
   const T SignedLhsValue(GetSignedValue(LhsCarry, LhsSign, LhsValue));
   const T SignedRhsValue(GetSignedValue(RhsCarry, RhsSign, RhsValue));
 
   /* Perform operation */
   const T SignedResultValue(binary_op(SignedLhsValue, SignedRhsValue));
 
   /* Perform twos compliment on result, if needed */
   const T ResultValue(GetSignedValue(ResultCarry, ResultSign, SignedResultValue));
   return ResultValue;
  }
 
  inline static const T ElementAddition(T LhsValue, T RhsValue, bool &AdditionCarry)
  {
   const T ResultValue(LhsValue + RhsValue);
   if(AdditionCarry == true)
   {
    /* Add the carry */
    const T ResultWithCarry(ResultValue + 1);
    /*
     * Test for overflow
     * Because of the carry, we have to test for wrap around and equality
     * Worst case:
     *  FF + FF = FE
     *  FF + FF + 1 = FF
     */
    AdditionCarry = ResultWithCarry <= LhsValue;
    return ResultWithCarry;
   }
   else
   {
    /*
     * Test for overflow
     * FF + FF = FE
     */
    AdditionCarry = ResultValue < LhsValue;
    return ResultValue;
   }
  }
 
  /* Addition - Signed */
  inline static void Addition(bool LhsSign, const storage_t& Lhs, bool RhsSign, const storage_t& Rhs, bool& ResultSign, storage_t& Result)
  {
   /*
    * Determine Sign
    * This must be done first, since Lhs or Rhs could change, if &Lhs or &Rhs == &Result
    */
   if(LhsSign == RhsSign)
   {
    /* Signs Match */
    ResultSign = LhsSign;
 
    /*
     * Since the signs match, we can use AbsoluteAddition
     * |Lhs| + |Rhs| = |-Lhs| + |-Rhs|
     */
    AbsoluteAddition(Lhs, Rhs, Result);
    return;
   }
 
   /* Compare absolute value of Lhs and Rhs, to determine which is bigger */
   const int AbsoluteCompareResults(AbsoluteCompare(Lhs, Rhs));
   switch(AbsoluteCompareResults)
   {
   case -1:
    {
     /* |Lhs| < |Rhs| */
     ResultSign = RhsSign;
     break;
    }
 
   case 0:
    {
     /*
      * |Lhs| = |Rhs|
      * Lhs and Rhs Signs do not match, so values will cancel out
      * -Lhs + Rhs = Lhs + -Rhs = 0
      */
     ResultSign = true;
     Result.clear();
     return;
    }
 
   case 1:
    {
     /* |Lhs| > |Rhs| */
     ResultSign = LhsSign;
     break;
    }
 
   default:
    {
     throw std::exception();
    }
   }
 
   /*
    * Get the original sizes of the inputs
    * The resizing of Result could change the input sizes if the smaller of
    * the inputs is also the output.
    */
   const auto LhsSize(Lhs.size());
   const auto RhsSize(Rhs.size());
 
   /* Marking which of the inputs is greater */
   const bool LhsSizeGreaterThanRhsSize(LhsSize > RhsSize);
 
   /* Resize the output to match the largest input */
   Result.resize(LhsSizeGreaterThanRhsSize == true ? LhsSize : RhsSize);
 
   /* Create aliases to the shortest of the inputs */
   const auto& Shortest(LhsSizeGreaterThanRhsSize == true ? Rhs : Lhs);
   const auto& ShortestSign(LhsSizeGreaterThanRhsSize == true ? RhsSign : LhsSign);
 
   /*
    * Because of the possibiilty that Shortest == Result, it is neccessary to
    * track of the original size, so the original end element can be located.
    */
   const auto& ShortestSize(LhsSizeGreaterThanRhsSize  == true? RhsSize : LhsSize);
 
   /* Create an alias to the largest of the inputs */
   const auto& Longest(LhsSizeGreaterThanRhsSize  == true? Lhs : Rhs);
   const auto& LongestSign(LhsSizeGreaterThanRhsSize  == true? LhsSign : RhsSign);
 
   /*
    * Assists with twos compliment calculations
    * Addes the initial +1 as part of the twos compliment conversion
    */
   bool ShortestCarry(true);
   bool LongestCarry(true);
   bool ResultCarry(true);
   /* Tracks addition carry between elements */
   bool AdditionCarry(false);
 
   /* Perform addition operation on the inputs, until the end of the shortest range is encounter */
   auto ResultIterator(std::transform(std::begin(Shortest), (std::begin(Shortest) + ShortestSize), std::begin(Longest), std::begin(Result), [&ShortestCarry, &LongestCarry, &ResultCarry, &AdditionCarry, ShortestSign, LongestSign, ResultSign](T ShortestValue, T LongestValue) -> T
   {
    const T ResultValue(SignedBinaryOperation(ShortestCarry, ShortestSign, ShortestValue, LongestCarry, LongestSign, LongestValue, ResultCarry, ResultSign, [&AdditionCarry](T LhsValue, T RhsValue) -> T
    {
     const T ResultValue(ElementAddition(LhsValue, RhsValue, AdditionCarry));
     return ResultValue;
    }));
    return ResultValue;
   }));
 
   /* Perform addition operation on the longest input and treat shortest as 0 or negative equivelent */
   ResultIterator = std::transform(std::begin(Longest) + ShortestSize, std::end(Longest), ResultIterator, [&ShortestCarry, &LongestCarry, &ResultCarry, &AdditionCarry, ShortestSign, LongestSign, ResultSign](T LongestValue) -> T
   {
    const T ResultValue(SignedBinaryOperation(ShortestCarry, ShortestSign, 0, LongestCarry, LongestSign, LongestValue, ResultCarry, ResultSign, [&AdditionCarry](T LhsValue, T RhsValue) -> T
    {
     const T ResultValue(ElementAddition(LhsValue, RhsValue, AdditionCarry));
     return ResultValue;
    }));
    return ResultValue;
   });
   assert(ResultIterator == std::end(Result));
 
   /* Handle remaining carry, if one exists */
   if(true == AdditionCarry)
   {
    const T ResultValue(SignedBinaryOperation(ShortestCarry, ShortestSign, 0, LongestCarry, LongestSign, 0, ResultCarry, ResultSign, [&AdditionCarry](T LhsValue, T RhsValue) -> const T
    {
     const T ResultValue(ElementAddition(LhsValue, RhsValue, AdditionCarry));
     return ResultValue;
    }));
 
    /* Append the result, if one exists */
    if(ResultValue != 0)
    {
     Result.push_back(ResultValue);
    }
   }
 
   /* Remove any trailing zeros */
   Truncate(Result);
  }



#11 iMalc   Crossbones+   -  Reputation: 2299

Like
0Likes
Like

Posted 16 May 2014 - 06:41 PM

Yeah I really should change mine to just use a vector internally. I had been trying to be clever and save space.

Modulo wrap around is very easy for fixed-sized values. It just means you only keep the n lowest bits of the result.

But certainly for a variable-sized bigint, unsigned doesn't really make sense.

 

Your code seems to be very high level and well commented. Mine is more designed for high performance and less need to understand how it works hence less commenting.

In fact your code reminds me of this: http://ericlippert.com/2013/09/16/math-from-scratch-part-one/


"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

#12 Rattrap   Members   -  Reputation: 1606

Like
0Likes
Like

Posted 16 May 2014 - 07:06 PM

Mine is more designed for high performance and less need to understand how it works hence less commenting.

I think for the most part, mine should still have good performance. At this point, I put a lot of trust in the compiler to optimize things, like the intermediate variables, lambdas, and the algorithm based loops. They do make debugging a little easier. Plus there have been several iterations of this code, I figure the more I document it, they more likely I'll remember what the heck I was thinking at the time, and be less inclined to start over :).

#13 iMalc   Crossbones+   -  Reputation: 2299

Like
0Likes
Like

Posted 17 May 2014 - 04:50 PM

I hope that optimism pays off for ya.


"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms




Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.



PARTNERS