IEEE 754 floating point operations on bits

Hi all


I have a rather strange, specific question.

If I have IEEE 754 big-endian 32-bit floating point values in memory, how would I perform arithmetic operations on them at the level of bit operators?

For example, how would I define the following function

char* AddF(char* a, char* b)




so that it returns 4 bytes representing a 32-bit floating point number in IEEE 754 format?

I would like to define addition, subtraction, multiplication and division.

It would be great if someone could shed some light on what actually happens in memory when values undergo those operations.

Thanks a bunch!!!

A floating point value is typically (IEEE 754 included) stored in the format s × m × 2^e, where s is the sign (either 1 or -1), m is the mantissa and e is the exponent. Operations follow from operating on this format. For example, multiplication becomes (s1 × m1 × 2^e1) × (s2 × m2 × 2^e2) = (s1 × s2) × (m1 × m2) × 2^(e1 + e2). That is, multiply the signs and the mantissas, and add the exponents.
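As a rough illustration of that rule at the bit level, here is a minimal sketch in C. The function name `mul_f32_bits_normal` is made up for this example. It assumes both inputs and the result are normal, finite numbers (no zeros, denormals, infinities, NaNs, overflow or underflow) and that the inputs are already host-order 32-bit integers. The stored exponent carries a bias of 127, so one bias is subtracted after adding the two exponent fields. Rounding is to nearest, ties to even.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two binary32 bit patterns.
   Only valid when both inputs and the result are normal finite numbers. */
uint32_t mul_f32_bits_normal(uint32_t a, uint32_t b)
{
    /* s1 * s2: the product of two +/-1 signs is the XOR of the sign bits. */
    uint32_t s = (a ^ b) & 0x80000000u;

    /* e1 + e2: add the biased exponent fields and remove one bias. */
    int e = (int)((a >> 23) & 0xFF) + (int)((b >> 23) & 0xFF) - 127;

    /* m1 * m2: 24-bit mantissas (implicit leading 1 restored) give a
       product with its leading 1 at bit 46 or bit 47. */
    uint64_t m = (uint64_t)((a & 0x7FFFFFu) | 0x800000u) *
                 (uint64_t)((b & 0x7FFFFFu) | 0x800000u);

    /* Normalize so the leading 1 sits at bit 47. */
    if (m & (1ull << 47)) e++; else m <<= 1;

    /* Keep the top 24 bits, round to nearest (ties to even) on the rest. */
    uint32_t r = (uint32_t)(m >> 24);
    uint64_t rem = m & 0xFFFFFFu;
    if (rem > 0x800000u || (rem == 0x800000u && (r & 1u))) r++;
    if (r & (1u << 24)) { r >>= 1; e++; }   /* rounding carried out */

    /* Repack: sign, exponent field, 23 stored mantissa bits. */
    return s | ((uint32_t)e << 23) | (r & 0x7FFFFFu);
}
```

For example, 3.5 is 0x40600000 and -2.25 is 0xC0100000, and the function returns 0xC0FC0000, which is -7.875.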


However, since you mention IEEE 754 explicitly, the question is: do you want to implement some general floating point operations, or follow the exact details of IEEE 754? The former can be a decent exercise in understanding how floating point values work in general, but the latter is a tremendous job. The first thing you need to do is get your hands on the specification describing the exact details you need to implement. I believe this is the document you need, clicky. And no, it is not freely available: you have to pay for it, unless you have access to it through your university or employer.

Don't return a pointer. If you insist on using char* rather than, say, a union of a float and a char array, pass the destination address as one of the arguments.
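A minimal sketch of what that interface could look like, assuming the host's `float` is IEEE 754 binary32. The helpers `load_be32` and `store_be32` are names invented for this example, and `unsigned char` is used instead of `char` to avoid sign-extension surprises when shifting. The hardware `+` is only a stand-in here, marking where a bit-level addition would go.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 big-endian bytes into a host-order integer. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: write a host-order integer as 4 big-endian bytes. */
static void store_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* The caller owns the output buffer, so nothing has to be allocated
   or returned by pointer. */
void AddF(const unsigned char a[4], const unsigned char b[4],
          unsigned char out[4])
{
    uint32_t ua = load_be32(a), ub = load_be32(b), ur;
    float fa, fb, fr;

    /* memcpy is the portable way to reinterpret the bits as a float. */
    memcpy(&fa, &ua, sizeof fa);
    memcpy(&fb, &ub, sizeof fb);

    fr = fa + fb;   /* stand-in: a bit-level add would replace this line */

    memcpy(&ur, &fr, sizeof ur);
    store_be32(out, ur);
}
```

Because the bytes are assembled with shifts, the same code works on both little-endian and big-endian hosts.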

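Agreed, char* is probably not the pointer type to use. int32_t* would make things a lot simpler, as shifts and masks on a 32-bit integer do not depend on byte order. You only need to worry about endianness once, when loading big-endian data on a little-endian machine.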


Addition is relatively straightforward. Off the top of my head, it goes something like this:


1. Extract s, e and m from the floating point number. They are 1, 8 and 23 bits wide respectively. Note that the mantissa has an implicit leading 1 bit, unless the number is denormal.

2. If either number is NaN, the result is NaN. If either is Infinity, the result is Infinity with the appropriate sign, except that Inf - Inf = NaN. Inf + Inf = Inf.

3. Construct a signed integer mantissa value for both values, using a decent number of bits, 32 is probably simplest. You need at least one extra for rounding purposes.

4. Shift one of those mantissas to the right by the difference in exponent values (you shift the smaller number). They are now lined up correctly.

5. Add the mantissas together, and work out the resulting sign and exponent by finding the most significant set/unset bit.

6. Round the result to the nearest 23-bit unsigned representation. Be careful of denormals here, and rounding may affect the exponent.

7. Check for overflow, and return the appropriate infinity if it's happened.

8. You probably need to check for zero and/or a denormal result too to get the right answer.

9. Reconstruct your new floating point number from the result.
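The steps above can be sketched in C roughly as follows. This is a sketch under stated assumptions, not a verified IEEE 754 implementation. The name `add_f32_bits` is made up, the inputs are assumed to be already loaded as host-order 32-bit integers, rounding is to nearest with ties to even, and every NaN result is the single pattern 0x7FC00000. It also deviates from the list in two places: instead of signed mantissas it swaps the operands so the larger magnitude comes first and then adds or subtracts unsigned magnitudes, and it keeps three extra low bits (the last one "sticky") rather than one.

```c
#include <stdint.h>

/* Hypothetical helper: add two binary32 values given as raw bit patterns. */
uint32_t add_f32_bits(uint32_t a, uint32_t b)
{
    /* Keep the larger magnitude in a, so b is always the one shifted. */
    if ((a & 0x7FFFFFFFu) < (b & 0x7FFFFFFFu)) { uint32_t t = a; a = b; b = t; }

    /* Step 1: extract sign, exponent and mantissa fields. */
    uint32_t sa = a >> 31, sb = b >> 31;
    int ea = (int)((a >> 23) & 0xFF), eb = (int)((b >> 23) & 0xFF);
    uint32_t ma = a & 0x7FFFFFu, mb = b & 0x7FFFFFu;

    /* Step 2: NaN and infinity. After the swap, a holds any special value. */
    if (ea == 0xFF) {
        if (ma != 0) return 0x7FC00000u;                 /* NaN operand */
        if (eb == 0xFF && sa != sb) return 0x7FC00000u;  /* Inf - Inf   */
        return a;                                        /* +/- Inf     */
    }

    /* Step 3: restore the implicit 1 (denormals have none and use
       exponent 1), then make room for 3 extra rounding bits. */
    if (ea) ma |= 0x800000u; else ea = 1;
    if (eb) mb |= 0x800000u; else eb = 1;
    ma <<= 3; mb <<= 3;

    /* Step 4: align b; any bits shifted out are OR-ed into the sticky bit. */
    int d = ea - eb;
    if (d > 26) mb = (mb != 0);
    else mb = (mb >> d) | ((mb & ((1u << d) - 1u)) != 0);

    /* Step 5: add or subtract magnitudes, then normalize. */
    uint32_t m = (sa == sb) ? ma + mb : ma - mb;
    int e = ea;
    if (m == 0) return (sa == sb) ? (sa << 31) : 0u;     /* exact zero */
    if (m & (1u << 27)) { m = (m >> 1) | (m & 1u); e++; }
    while (!(m & (1u << 26)) && e > 1) { m <<= 1; e--; }

    /* Step 6: round to nearest, ties to even. */
    uint32_t low = m & 7u;
    m >>= 3;
    if (low > 4 || (low == 4 && (m & 1u))) m++;
    if (m & (1u << 24)) { m >>= 1; e++; }

    /* Step 7: overflow gives infinity with the result's sign. */
    if (e >= 0xFF) return (sa << 31) | 0x7F800000u;

    /* Steps 8 and 9: a result without the leading 1 is denormal and gets
       exponent field 0; then repack. */
    uint32_t efield = (m & 0x800000u) ? (uint32_t)e : 0u;
    return (sa << 31) | (efield << 23) | (m & 0x7FFFFFu);
}
```

For example, 0.1f is 0x3DCCCCCD and 0.2f is 0x3E4CCCCD, and the function returns 0x3E99999A, the same bit pattern a hardware single-precision add produces for that sum.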


Subtraction is of course trivial once you have addition working. Simply flip the sign bit of the value being subtracted and add the two.
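A small sketch of that trick, assuming the host's `float` is IEEE 754 binary32. The names `f32_to_bits`, `bits_to_f32` and `sub_via_sign_flip` are invented for this example, and the hardware `+` stands in for whatever bit-level addition routine you end up writing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: reinterpret a float as its bit pattern and back. */
static uint32_t f32_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_to_f32(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* a - b computed as a + (-b), where -b is b with bit 31 flipped. */
float sub_via_sign_flip(float a, float b)
{
    uint32_t neg_b = f32_to_bits(b) ^ 0x80000000u;  /* flip the sign bit */
    return a + bits_to_f32(neg_b);                  /* stand-in for a bit-level add */
}
```

Flipping the sign of the first operand instead would compute b - a, so it matters which value gets negated.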


You might find this post and this series useful. Note that the bitfield trick is highly non-portable as the ordering can and will change between compilers.

Edited by Adam_42
