Floating Point Arithmetic

I'm writing a simulator in Java that works with floating-point numbers. I want to know how to convert a value from single-precision IEEE 754 format to double-precision IEEE 754 format. I've tried Google but can't find anything on floating-point arithmetic. Can anyone help?

Maybe it's not exactly what you are looking for, but...

In the Linux kernel sources there is a library that performs FPU emulation. You could look there.

/def

Just cast it.

Or did you want to do all the bit-twiddling yourself, in order to demonstrate an understanding of the machine representation or something?
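As a minimal sketch of the cast approach (written in C++ to match the code later in this thread; in Java the equivalent expression is `(double) f`): every single-precision value is exactly representable in double precision, so the cast does not round. The function names `widen` and `widen_demo` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Widen a single-precision value to double precision with a plain cast.
// Every float value is exactly representable as a double, so nothing is rounded.
double widen(float f) {
    return static_cast<double>(f);
}

// Usage sketch: the value survives unchanged, including signed zero,
// infinity and NaN.
void widen_demo() {
    assert(widen(1.5f) == 1.5);
    assert(widen(-0.0f) == 0.0 && std::signbit(widen(-0.0f)));
    assert(std::isinf(widen(INFINITY)));
    assert(std::isnan(widen(NAN)));
}
```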

For bit twiddling try this.



float f;
int* i;
i = (int*)&f; // Do whatever you have to; you need the float's bits as an integer to allow bitwise arithmetic
Sign = *i & 0x800000; // 24 bits
Exp = *i & 0x7F8000;
Mant = *i & 0x007FFF;



That splits the float into its sign, exponent and mantissa sections. Then you have to glue it back together.

Exercise for the reader (hint: use the |'s, Luke)

From,
Nice coder

Quote:
Original post by Nice Coder
That splits the float into its sign, exponent and mantissa sections. Then you have to glue it back together.

Exercise for the reader (hint: use the |'s, Luke)


Not sufficient. The float and double exponents have a different bias (127 and 1023, respectively). Then there is the issue of infinities, NaNs and denormals.
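A rough sketch of what a bit-level conversion has to do once those points are taken into account: rebias the exponent from 127 to 1023, map an all-ones exponent (infinity or NaN) to an all-ones exponent, and renormalize float denormals, which are ordinary normal numbers in double precision. This assumes `float` and `double` are the IEEE 754 32-bit and 64-bit formats; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4 && sizeof(double) == 8, "assumes IEEE 754 sizes");

// Map the bit pattern of a single to the bit pattern of the equal-valued double.
std::uint64_t float_bits_to_double_bits(std::uint32_t f) {
    std::uint64_t sign = static_cast<std::uint64_t>(f >> 31) << 63;
    std::uint32_t exp  = (f >> 23) & 0xFF;   // biased by 127
    std::uint64_t mant = f & 0x007FFFFF;     // 23 fraction bits

    if (exp == 0xFF)                         // infinity (mant == 0) or NaN
        return sign | (0x7FFULL << 52) | (mant << 29);
    if (exp == 0) {
        if (mant == 0) return sign;          // signed zero
        // Denormal float: shift left until the leading 1 reaches bit 23,
        // lowering the exponent once per shift.
        int e = 1;                           // denormals use exponent 1 - 127
        while ((mant & 0x00800000) == 0) { mant <<= 1; --e; }
        mant &= 0x007FFFFF;                  // the leading 1 is now implicit
        return sign | (static_cast<std::uint64_t>(e - 127 + 1023) << 52) | (mant << 29);
    }
    // Normal number: only the bias changes; the fraction gains 29 zero bits.
    return sign | (static_cast<std::uint64_t>(exp + (1023 - 127)) << 52) | (mant << 29);
}

// Convenience wrapper that works on values instead of raw bits.
double widen_by_bits(float f) {
    std::uint32_t in;
    std::memcpy(&in, &f, sizeof in);
    std::uint64_t out = float_bits_to_double_bits(in);
    double d;
    std::memcpy(&d, &out, sizeof d);
    return d;
}

// Usage sketch: the result should agree with the compiler's own conversion.
void widen_by_bits_demo() {
    assert(widen_by_bits(1.0f) == 1.0);
    assert(widen_by_bits(-2.5f) == -2.5);
    assert(widen_by_bits(1e-40f) == static_cast<double>(1e-40f)); // denormal input
}
```

The fraction is shifted left by 29 because the double has 52 - 23 = 29 more mantissa bits than the float; no rounding is ever needed when widening.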

I haven't done 32->64 bit, but see the class below that I just finished for 16-bit floats. It handles infinities and NaNs, but I think I left out denormals.

Nice Coder, while I've no doubt that 24-bit floats exist somewhere, PCs use 32-bit floats, so the masks are:
Sign = i & 0x80000000;
Exp = i & 0x7F800000;
Mant = i & 0x007FFFFF;
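For illustration, here is one way the masks above might be used to split a 32-bit float into its fields and glue it back together with `|`. It is only a sketch: it assumes `float` is the IEEE 754 32-bit format, and `FloatFields`, `split_float` and `join_float` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct FloatFields {          // hypothetical helper type
    std::uint32_t sign;       // 0 or 1
    std::uint32_t exponent;   // still biased by 127, range 0..255
    std::uint32_t mantissa;   // 23 fraction bits, without the implicit 1
};

// Mask out each field and shift it down to bit 0.
FloatFields split_float(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);   // well-defined way to read the bits
    return { (u & 0x80000000u) >> 31, (u & 0x7F800000u) >> 23, u & 0x007FFFFFu };
}

// Shift each field back into place and OR them together.
float join_float(const FloatFields& p) {
    std::uint32_t u = (p.sign << 31) | (p.exponent << 23) | p.mantissa;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Usage sketch: -2.5 is -1.01 (binary) times 2^1, so the biased exponent
// is 128 and only the second fraction bit is set.
void split_demo() {
    FloatFields p = split_float(-2.5f);
    assert(p.sign == 1 && p.exponent == 128 && p.mantissa == 0x200000);
    assert(join_float(p) == -2.5f);
}
```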

64-bit double: (1-11-52)
sign = bit 63
exponent = bits 62-52 (biased by 1023)
mantissa = bits 51-0

32-bit float: (1-8-23)
sign = bit 31
exponent = bits 30-23 (biased by 127)
mantissa = bits 22-0

16-bit shortfloat: (1-5-10)
sign = bit 15
exponent = bits 14-10 (biased by 15)
mantissa = bits 9-0
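The three layouts differ only in their field widths, so a single routine can decode any of them. The following is a sketch under that assumption; `decode` and `decode_demo` are made-up names, and the bias is computed as 2^(exp_bits-1) - 1, which gives 15, 127 and 1023 for the formats listed above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode a raw bit pattern laid out as 1 sign bit, exp_bits exponent bits and
// mant_bits mantissa bits (most significant first) into its numeric value.
double decode(std::uint64_t bits, int exp_bits, int mant_bits) {
    assert(exp_bits >= 2 && exp_bits <= 11 && mant_bits >= 1 && mant_bits <= 52);
    const std::uint64_t mant = bits & ((1ULL << mant_bits) - 1);
    const int exp_max = (1 << exp_bits) - 1;
    const int exp = static_cast<int>((bits >> mant_bits) & exp_max);
    const bool negative = ((bits >> (exp_bits + mant_bits)) & 1) != 0;
    const int bias = (1 << (exp_bits - 1)) - 1;

    double value;
    if (exp == exp_max) {
        value = mant ? NAN : INFINITY;   // all-ones exponent: NaN or infinity
    } else if (exp == 0) {
        // Zero or denormal: no implicit leading 1, exponent fixed at 1 - bias.
        value = std::ldexp(static_cast<double>(mant), 1 - bias - mant_bits);
    } else {
        // Normal: restore the implicit leading 1, then remove the bias.
        value = std::ldexp(static_cast<double>(mant | (1ULL << mant_bits)),
                           exp - bias - mant_bits);
    }
    return negative ? -value : value;
}

// Usage sketch: the value 1.0 in each of the three formats.
void decode_demo() {
    assert(decode(0x3C00, 5, 10) == 1.0);                   // 16-bit
    assert(decode(0x3F800000, 8, 23) == 1.0);               // 32-bit
    assert(decode(0x3FF0000000000000ULL, 11, 52) == 1.0);   // 64-bit
}
```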

#include <cstring> // std::memcpy, used to read and write the float's bits safely

class shortfloat {
    unsigned short raw; // 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits
public:
    shortfloat() {}
    shortfloat(float f) {
        unsigned int u;
        std::memcpy(&u, &f, sizeof u); // copy out the float's 32 bits
        int exponent = ((u>>0x17)&0xFF)-0x70; // rebias from 127 to 15 (127-15 = 0x70)
        unsigned int mantissa = (u>>0x0D)&0x03FF; // keep the top 10 mantissa bits (truncates)
        if (exponent <= 0) { exponent = 0; mantissa = 0; } // too small for a normal shortfloat: flush to zero (no denormals)
        else if (exponent >= 0x1F) { // overflow, infinity or NaN
            bool nan = (u&0x7FFFFFFF) > 0x7F800000;
            exponent = 0x1F;
            mantissa = nan ? (mantissa|0x0200) : 0; // overflow becomes infinity instead of a NaN pattern
        }
        raw = ((u>>0x1F)<<0x0F) | (exponent<<0x0A) | mantissa;
    }
    operator float() const {
        int exponent = ((raw>>0x0A)&0x1F)+0x70; // rebias from 15 to 127
        int mantissa = raw&0x03FF;
        if (exponent == 0x70) { exponent = 0; mantissa = 0; } // zero (denormals are treated as zero)
        else if (exponent == 0x8F) exponent = 0xFF; // infinity or NaN
        unsigned int u = (((unsigned int)(raw>>0x0F))<<0x1F) | (exponent<<0x17) | (mantissa<<0x0D);
        float f;
        std::memcpy(&f, &u, sizeof f);
        return f;
    }
    // Arithmetic is done in float and the result is converted back.
    inline shortfloat& operator +=(const float b) { return *this = shortfloat((float)*this + b); }
    inline shortfloat& operator -=(const float b) { return *this = shortfloat((float)*this - b); }
    inline shortfloat& operator *=(const float b) { return *this = shortfloat((float)*this * b); }
    inline shortfloat& operator /=(const float b) { return *this = shortfloat((float)*this / b); }
    friend inline bool operator < (const shortfloat a, const shortfloat b) { return (float)a < (float)b; }
    friend inline bool operator < (const float a, const shortfloat b) { return a < (float)b; }
    friend inline bool operator < (const shortfloat a, const float b) { return (float)a < b; }
    friend inline bool operator > (const shortfloat a, const shortfloat b) { return (float)a > (float)b; }
    friend inline bool operator > (const float a, const shortfloat b) { return a > (float)b; }
    friend inline bool operator > (const shortfloat a, const float b) { return (float)a > b; }
    friend inline bool operator <=(const shortfloat a, const shortfloat b) { return (float)a <= (float)b; }
    friend inline bool operator <=(const float a, const shortfloat b) { return a <= (float)b; }
    friend inline bool operator <=(const shortfloat a, const float b) { return (float)a <= b; }
    friend inline bool operator >=(const shortfloat a, const shortfloat b) { return (float)a >= (float)b; }
    friend inline bool operator >=(const float a, const shortfloat b) { return a >= (float)b; }
    friend inline bool operator >=(const shortfloat a, const float b) { return (float)a >= b; }
    friend inline bool operator ==(const shortfloat a, const shortfloat b) { return a.raw == b.raw; }
    friend inline bool operator ==(const float a, const shortfloat b) { return a == (float)b; }
    friend inline bool operator ==(const shortfloat a, const float b) { return (float)a == b; }
    friend inline bool operator !=(const shortfloat a, const shortfloat b) { return a.raw != b.raw; }
    friend inline bool operator !=(const float a, const shortfloat b) { return a != (float)b; }
    friend inline bool operator !=(const shortfloat a, const float b) { return (float)a != b; }
    friend inline shortfloat operator - (const shortfloat a) { shortfloat b = a; b.raw ^= 0x8000; return b; } // flip the sign bit
    friend inline shortfloat operator + (const shortfloat a, const shortfloat b) { return shortfloat((float)a + (float)b); }
    friend inline shortfloat operator + (const float a, const shortfloat b) { return shortfloat(a + (float)b); }
    friend inline shortfloat operator + (const shortfloat a, const float b) { return shortfloat((float)a + b); }
    friend inline shortfloat operator - (const shortfloat a, const shortfloat b) { return shortfloat((float)a - (float)b); }
    friend inline shortfloat operator - (const float a, const shortfloat b) { return shortfloat(a - (float)b); }
    friend inline shortfloat operator - (const shortfloat a, const float b) { return shortfloat((float)a - b); }
    friend inline shortfloat operator * (const shortfloat a, const shortfloat b) { return shortfloat((float)a * (float)b); }
    friend inline shortfloat operator * (const float a, const shortfloat b) { return shortfloat(a * (float)b); }
    friend inline shortfloat operator * (const shortfloat a, const float b) { return shortfloat((float)a * b); }
    friend inline shortfloat operator / (const shortfloat a, const shortfloat b) { return shortfloat((float)a / (float)b); }
    friend inline shortfloat operator / (const float a, const shortfloat b) { return shortfloat(a / (float)b); }
    friend inline shortfloat operator / (const shortfloat a, const float b) { return shortfloat((float)a / b); }
};
