JohnnyCode, posted July 24, 2013:

Hi all,

I have a rather strange, specific question. If I have IEEE 754 big-endian 32-bit floating point values in memory, how would I perform arithmetic operations on them at the level of bit operators? For example, how would I define the following function

char* AddF(char* a, char* b) { }

so that it returns 4 bytes representing a 32-bit floating point number in IEEE 754 format? I would like to define addition, subtraction, multiplication and division. Could someone shed some light on what is actually done to the memory when it undergoes those operations?

Thanks a bunch!
Brother Bob, posted July 24, 2013:

A floating point value is typically (including in IEEE 754) stored in the format s×m×2^{e}, where s is the sign (either 1 or -1), m is the mantissa and e is the exponent. Operations follow from operating on this format. For example, multiplication becomes

(s_{1}×m_{1}×2^{e_{1}}) × (s_{2}×m_{2}×2^{e_{2}}) = (s_{1}×s_{2}) × (m_{1}×m_{2}) × 2^{e_{1}+e_{2}}.

That is, multiply the signs and the mantissas, and add the exponents.

However, since you mention IEEE 754 explicitly, the question is: do you want to implement some general floating point operations, or follow the exact details of IEEE 754? The former can be a decent exercise in understanding how floating point values work in general, but the latter will be a tremendous job.

The first thing you need to do is get your hands on the specification describing the exact details you need to implement. I believe this is the document you need, clicky. And no, it is not freely available: you have to pay for it, unless you have access to it through your university or employer.
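
To make the multiplication rule above concrete, here is a rough sketch in C. The name mul_f32_bits_sketch is made up for illustration, and the function assumes the bit patterns have already been loaded into host-order uint32_t values. It only covers the easy case: both inputs and the result are normal, finite numbers. Zeros, denormals, infinities, NaNs, overflow and underflow are left out, and round-to-nearest-even is assumed as the rounding mode.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two binary32 bit patterns.
   Assumption: inputs and result are normal, finite numbers.
   Special values, overflow and underflow are NOT handled. */
uint32_t mul_f32_bits_sketch(uint32_t a, uint32_t b)
{
    /* Multiplying the signs is an XOR of the sign bits. */
    uint32_t sign = (a ^ b) & 0x80000000u;

    /* Adding the exponents: both fields carry a bias of 127,
       so one bias has to be removed from the sum. */
    int32_t exp = (int32_t)((a >> 23) & 0xFF)
                + (int32_t)((b >> 23) & 0xFF) - 127;

    /* 24-bit mantissas with the implicit leading 1 restored. */
    uint64_t ma = (a & 0x007FFFFFu) | 0x00800000u;
    uint64_t mb = (b & 0x007FFFFFu) | 0x00800000u;

    /* Exact product, a value in [2^46, 2^48). */
    uint64_t prod = ma * mb;

    /* If the product reached 2^47 the mantissa is in [2, 4):
       shift one bit further and bump the exponent. */
    int shift = 23;
    if (prod & (1ull << 47)) {
        shift = 24;
        exp += 1;
    }

    uint64_t mant = prod >> shift;                /* top 24 bits     */
    uint64_t rem  = prod & ((1ull << shift) - 1); /* discarded bits  */
    uint64_t half = 1ull << (shift - 1);

    /* Round to nearest, ties to even. */
    if (rem > half || (rem == half && (mant & 1))) {
        mant += 1;
        if (mant == (1ull << 24)) {  /* rounding carried out */
            mant >>= 1;
            exp += 1;
        }
    }

    /* Reassemble; the implicit leading 1 is dropped again. */
    return sign | ((uint32_t)exp << 23) | ((uint32_t)mant & 0x007FFFFFu);
}
```

For instance, passing 0x3FC00000 (1.5) and 0x40000000 (2.0) gives 0x40400000 (3.0).
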
Pink Horror, posted July 25, 2013:

Don't return a pointer. If you insist on using char * instead of, say, a float / char array union, make the return address one of the arguments.
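
One possible shape for such an interface, as a sketch only: the caller owns the 4-byte output buffer and passes it in. The helpers load_be32 and store_be32 are made-up names, unsigned char is used instead of char to avoid sign-extension surprises, and the actual addition is delegated to the host's float type as a placeholder (assuming the host float is IEEE 754 binary32), since the point here is the signature and the byte handling rather than the arithmetic.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 big-endian bytes into a host-order integer. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: write a host-order integer as 4 big-endian bytes. */
static void store_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* The result goes into a caller-supplied buffer instead of a returned
   pointer, so nobody has to guess who allocates or frees it. */
void AddF(const unsigned char *a, const unsigned char *b, unsigned char *out)
{
    uint32_t ua = load_be32(a);
    uint32_t ub = load_be32(b);
    uint32_t ur;
    float fa, fb, fr;

    /* memcpy is the portable way to reinterpret the bits.
       Assumption: the host float is IEEE 754 binary32. */
    memcpy(&fa, &ua, sizeof fa);
    memcpy(&fb, &ub, sizeof fb);

    fr = fa + fb;  /* placeholder: a bit-level add would go here */

    memcpy(&ur, &fr, sizeof ur);
    store_be32(out, ur);
}
```

Because the bytes are assembled with shifts, this works the same on little-endian and big-endian hosts.
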
Adam_42, posted July 25, 2013 (edited July 25, 2013):

Agreed, char* is probably not the pointer type to use. int32_t* would make things a lot simpler, as you shouldn't need to worry about endianness that way.

Addition is relatively straightforward. Off the top of my head it goes something like this:

1. Extract s, e and m from the floating point number. They are 1, 8 and 23 bits respectively. Note that the mantissa has an implicit leading 1 bit, unless the number is denormal.
2. If either number is a NaN or an infinity, the result is a NaN or an infinity (with the appropriate sign). Inf - Inf = NaN. Inf + Inf = Inf.
3. Construct a signed integer mantissa for both values, using a decent number of bits; 32 is probably simplest. You need at least one extra bit for rounding purposes.
4. Shift one of those mantissas to the right by the difference in exponent values (you shift the smaller number). They are now lined up correctly.
5. Add the mantissas together, and work out the resulting sign and exponent by finding the most significant set/unset bit.
6. Round the result to the nearest 23-bit unsigned representation. Be careful with denormals here, and note that rounding may affect the exponent.
7. Check for overflow, and return the appropriate infinity if it has happened.
8. You probably need to check for a zero and/or denormal result too to get the right answer.
9. Reconstruct your new floating point number from the result.

Subtraction is of course trivial once you have addition working: simply flip the sign bit on one value and add them.

You might find this post and this series useful. Note that the bitfield trick is highly non-portable, as the ordering of bitfields can and will change between compilers.
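
The steps above can be sketched in C roughly as follows. This is one way to do it under stated assumptions, not a validated IEEE 754 implementation: add_f32_bits is a made-up name, the inputs are host-order uint32_t bit patterns, the rounding mode is round-to-nearest-even, and instead of signed mantissas it orders the operands by magnitude and adds or subtracts unsigned mantissas with three extra low bits (guard, round, sticky).

```c
#include <stdint.h>

/* Hypothetical sketch of binary32 addition on raw bit patterns. */
uint32_t add_f32_bits(uint32_t a, uint32_t b)
{
    /* Order the operands so that |a| >= |b|; the result then
       takes the sign of a. */
    if ((a & 0x7FFFFFFFu) < (b & 0x7FFFFFFFu)) {
        uint32_t t = a; a = b; b = t;
    }

    /* Step 1: extract sign, exponent and mantissa. */
    uint32_t sa = a >> 31, sb = b >> 31;
    int32_t  ea = (int32_t)((a >> 23) & 0xFF);
    int32_t  eb = (int32_t)((b >> 23) & 0xFF);
    uint32_t ma = a & 0x007FFFFFu, mb = b & 0x007FFFFFu;

    /* Step 2: NaN and infinity. After the swap, a is the NaN if any. */
    if (ea == 0xFF) {
        if (ma) return a | 0x00400000u;                 /* quiet NaN */
        if (eb == 0xFF && sa != sb) return 0x7FC00000u; /* Inf - Inf */
        return a;                                       /* +/- Inf   */
    }

    /* Step 3: restore the implicit 1 (denormals have none and use
       exponent 1), then make room for guard/round/sticky bits. */
    if (ea) ma |= 0x00800000u; else ea = 1;
    if (eb) mb |= 0x00800000u; else eb = 1;
    ma <<= 3;
    mb <<= 3;

    /* Step 4: align the smaller operand, folding every bit shifted
       out into the lowest (sticky) bit. */
    int32_t d = ea - eb;
    if (d >= 27) {
        mb = (mb != 0);
    } else if (d > 0) {
        uint32_t lost = mb & ((1u << d) - 1);
        mb = (mb >> d) | (lost != 0);
    }

    /* Step 5: add or subtract magnitudes. */
    uint32_t m = (sa == sb) ? ma + mb : ma - mb;
    if (m == 0)                       /* exact zero result */
        return (sa == sb) ? (sa << 31) : 0;

    int32_t e = ea;
    if (m & (1u << 27)) {             /* carry out of the top bit */
        m = (m >> 1) | (m & 1);
        e += 1;
    }
    while (!(m & (1u << 26)) && e > 1) {  /* cancellation: renormalise, */
        m <<= 1;                          /* stopping at denormal range */
        e -= 1;
    }

    /* Step 6: round to nearest, ties to even. */
    uint32_t grs = m & 7;
    m >>= 3;
    if (grs > 4 || (grs == 4 && (m & 1))) m += 1;
    if (m & (1u << 24)) {             /* rounding carried out */
        m >>= 1;
        e += 1;
    }

    /* Step 7: overflow gives an infinity with the result's sign. */
    if (e >= 255) return (sa << 31) | 0x7F800000u;

    /* Steps 8 and 9: a result without the leading 1 is denormal and
       gets exponent field 0; then reassemble the three fields. */
    uint32_t ef = (m & 0x00800000u) ? (uint32_t)e : 0;
    return (sa << 31) | (ef << 23) | (m & 0x007FFFFFu);
}
```

With this in place, subtraction as described above would be add_f32_bits(a, b ^ 0x80000000u). For big-endian bytes in memory, the four bytes still have to be assembled into the uint32_t in the right order first.
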