Floating Point Arithmetic

I'm writing a simulator in Java that works with floating-point numbers. I want to know how to convert a value from single-precision IEEE 754 format to double-precision IEEE 754 format. I've tried Google but can't find anything on floating-point arithmetic. Can anyone help?

Maybe it's not exactly what you are looking for, but...

The Linux kernel sources include a library that performs FPU emulation. You could look there.

/def

Just cast it.

Or did you want to do all the bit-twiddling yourself, in order to demonstrate understanding of the machine or something?
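For what it's worth, here is a minimal sketch of the "just cast it" route in C++ (in Java the equivalent would be a plain `(double) f` cast). Every float value is exactly representable as a double, so the widening loses nothing. The names `widen` and `doubleBits` are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");

// Hypothetical helper: widen a float to a double with an ordinary cast.
double widen(float f) {
    double d = static_cast<double>(f);
    // Widening is exact, so narrowing back must reproduce the input (NaN aside).
    assert(f != f || static_cast<float>(d) == f);
    return d;
}

// Hypothetical helper: return the raw IEEE 754 bit pattern of a double.
std::uint64_t doubleBits(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // copy the bytes instead of casting pointers
    return bits;
}
```

Note that `widen(0.1f)` is not equal to the double literal `0.1`: the cast preserves the float's value exactly, it does not recover the digits the float already lost.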

For bit twiddling try this.

float f
int* i
i = (int*)&f // Do whatever you have to; the float must be viewed as an integer to allow bitwise arithmetic
Sign = *i & 0x800000 // 24 bits
Exp = *i & 0x7F8000
Mant = *i & 0x007FFF

That splits the float into its sign, exponent and mantissa sections. Then you have to glue it back together.

Exercise for the reader (hint: use the |'s, Luke)

From,
Nice coder

Quote:
Original post by Nice Coder
That splits the float into its sign, exponent and mantissa sections. Then you have to glue it back together.

Exercise for the reader (hint: use the |'s, Luke)


Not sufficient. The float and double exponents have different biases (127 and 1023, respectively). Then there is the issue of infinities, NaNs and denormals.
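To make those points concrete, here is a rough sketch of a bit-level float-to-double conversion in C++. It rebiases the exponent (1023 - 127 = 896), shifts the 23-bit mantissa up by 29 bits to fill the 52-bit field, and treats zero, denormals, infinities and NaNs as separate cases. The function names are invented for this example. In Java the raw bits could be obtained with `Float.floatToIntBits` and turned back into a value with `Double.longBitsToDouble`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: map a 32-bit (1-8-23) pattern to the 64-bit (1-11-52) pattern
// of the same value.
std::uint64_t floatBitsToDoubleBits(std::uint32_t f) {
    std::uint64_t sign = static_cast<std::uint64_t>(f >> 31) << 63;
    std::uint32_t exponent = (f >> 23) & 0xFFu;
    std::uint64_t mantissa = f & 0x007FFFFFu;

    if (exponent == 0xFFu) {
        // Infinity (mantissa == 0) or NaN (mantissa != 0): all exponent bits set.
        return sign | 0x7FF0000000000000ull | (mantissa << 29);
    }
    if (exponent == 0) {
        if (mantissa == 0) return sign;  // signed zero
        // Denormal float: value = (mantissa / 2^23) * 2^-126 with no implicit 1.
        // Every float denormal is a normal double, so shift until the leading 1
        // reaches bit 23, then drop it because it becomes the implicit bit.
        int shift = 0;
        while ((mantissa & 0x00800000u) == 0) {
            mantissa <<= 1;
            ++shift;
        }
        assert(shift >= 1 && shift <= 23);
        mantissa &= 0x007FFFFFu;
        // Double exponent field: 1023 - 126 - shift = 897 - shift.
        return sign | (static_cast<std::uint64_t>(897 - shift) << 52) | (mantissa << 29);
    }
    // Normal number: rebias the exponent from 127 to 1023.
    return sign | (static_cast<std::uint64_t>(exponent + 896u) << 52) | (mantissa << 29);
}

// Hypothetical wrapper: convert an actual float through the bit-level routine.
double widenByBits(float f) {
    std::uint32_t in;
    std::memcpy(&in, &f, sizeof in);
    std::uint64_t out = floatBitsToDoubleBits(in);
    double d;
    std::memcpy(&d, &out, sizeof d);
    return d;
}
```

A simple sanity check is to compare `widenByBits(x)` against a plain cast of `x` to double for a range of inputs. The two should agree for everything except NaN, which never compares equal to itself.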

I haven't done 32->64 bit, but see the class below that I just finished for 16-bit floats. It handles infinities and NaNs, but I think I left out denormals.

Nice Coder, while I've no doubt that 24-bit floats exist somewhere, PCs use 32-bit floats, so the masks are:
Sign = i & 0x80000000;
Exp = i & 0x7F800000;
Mant = i & 0x007FFFFF;

64-bit double: (1-11-52)
sign = bit 63
exponent = bits 62-52 (biased by 1023)
mantissa = bits 51-0

32-bit float: (1-8-23)
sign = bit 31
exponent = bits 30-23 (biased by 127)
mantissa = bits 22-0

16-bit shortfloat: (1-5-10)
sign = bit 15
exponent = bits 14-10 (biased by 15)
mantissa = bits 9-0
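As an illustration of the 32-bit layout listed above, the following sketch pulls a float apart into its three fields and glues it back together with shifts and ORs. The struct and function names are invented for this example, and it assumes `float` is a 32-bit IEEE 754 type, which is true on PCs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");

// Hypothetical container for the three fields of a 32-bit float.
struct FloatFields {
    std::uint32_t sign;      // bit 31
    std::uint32_t exponent;  // bits 30-23, still biased by 127
    std::uint32_t mantissa;  // bits 22-0, without the implicit leading 1
};

// Split a float into sign, exponent and mantissa, each shifted down to bit 0.
FloatFields splitFloat(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // read the bit pattern safely
    FloatFields out;
    out.sign = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.mantissa = bits & 0x007FFFFFu;
    return out;
}

// Glue the fields back together with shifts and ORs.
float joinFloat(const FloatFields& in) {
    assert(in.sign <= 1u && in.exponent <= 0xFFu && in.mantissa <= 0x007FFFFFu);
    std::uint32_t bits = (in.sign << 31) | (in.exponent << 23) | in.mantissa;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, `splitFloat(1.0f)` yields sign 0, exponent 127 and mantissa 0, and `joinFloat(splitFloat(x))` should give back `x` for any non-NaN float.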

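#include <cstring> // std::memcpy

class shortfloat {
    unsigned short raw; // 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits
public:
    shortfloat() {}
    shortfloat(float f) {
        unsigned int u;
        std::memcpy(&u, &f, sizeof u); // read the float's bits without pointer casts
        int exponent = ((u >> 0x17) & 0xFF) - 0x70; // rebias: 127 - 15 = 0x70
        unsigned int mantissa = (u >> 0x0D) & 0x03FF; // keep the top 10 mantissa bits (truncates)
        if (exponent <= 0) {
            // Too small for a normal shortfloat: flush to zero (no denormals are produced)
            exponent = 0;
            mantissa = 0;
        } else if (exponent >= 0x1F) {
            // Overflow, infinity or NaN: only a source NaN keeps a nonzero mantissa
            bool isNaN = ((u >> 0x17) & 0xFF) == 0xFF && (u & 0x007FFFFF) != 0;
            exponent = 0x1F;
            mantissa = isNaN ? 0x0200 : 0;
        }
        raw = (unsigned short)(((u >> 0x1F) << 0x0F) | (exponent << 0x0A) | mantissa);
    }
    operator float() const {
        int exponent = ((raw >> 0x0A) & 0x1F) + 0x70; // rebias from 15 back to 127
        int mantissa = raw & 0x03FF;
        if (exponent == 0x70 && mantissa == 0) exponent = 0; // zero
        else if (exponent == 0x8F) exponent = 0xFF; // infinity or NaN
        // Denormal shortfloats (exponent field 0, mantissa != 0) are not handled here
        unsigned int u = ((unsigned int)(raw >> 0x0F) << 0x1F) | ((unsigned int)exponent << 0x17) | ((unsigned int)mantissa << 0x0D);
        float f;
        std::memcpy(&f, &u, sizeof f);
        return f;
    }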
    inline shortfloat operator +=(const float b) { return *this = shortfloat((float)*this + b); }
    inline shortfloat operator -=(const float b) { return *this = shortfloat((float)*this - b); }
    inline shortfloat operator *=(const float b) { return *this = shortfloat((float)*this * b); }
    inline shortfloat operator /=(const float b) { return *this = shortfloat((float)*this / b); }
    friend inline bool operator < (const shortfloat a, const shortfloat b) { return (float)a < (float)b; }
    friend inline bool operator < (const float a, const shortfloat b) { return a < (float)b; }
    friend inline bool operator < (const shortfloat a, const float b) { return (float)a < b; }
    friend inline bool operator > (const shortfloat a, const shortfloat b) { return (float)a > (float)b; }
    friend inline bool operator > (const float a, const shortfloat b) { return a > (float)b; }
    friend inline bool operator > (const shortfloat a, const float b) { return (float)a > b; }
    friend inline bool operator <=(const shortfloat a, const shortfloat b) { return (float)a <= (float)b; }
    friend inline bool operator <=(const float a, const shortfloat b) { return a <= (float)b; }
    friend inline bool operator <=(const shortfloat a, const float b) { return (float)a <= b; }
    friend inline bool operator >=(const shortfloat a, const shortfloat b) { return (float)a >= (float)b; }
    friend inline bool operator >=(const float a, const shortfloat b) { return a >= (float)b; }
    friend inline bool operator >=(const shortfloat a, const float b) { return (float)a >= b; }
    friend inline bool operator ==(const shortfloat a, const shortfloat b) { return a.raw == b.raw; }
    friend inline bool operator ==(const float a, const shortfloat b) { return a == (float)b; }
    friend inline bool operator ==(const shortfloat a, const float b) { return (float)a == b; }
    friend inline bool operator !=(const shortfloat a, const shortfloat b) { return a.raw != b.raw; }
    friend inline bool operator !=(const float a, const shortfloat b) { return a != (float)b; }
    friend inline bool operator !=(const shortfloat a, const float b) { return (float)a != b; }
    friend inline shortfloat operator - (const shortfloat a) { shortfloat b = a; b.raw ^= 0x8000; return b; }
    friend inline shortfloat operator + (const shortfloat a, const shortfloat b) { return shortfloat((float)a + (float)b); }
    friend inline shortfloat operator + (const float a, const shortfloat b) { return shortfloat(a + (float)b); }
    friend inline shortfloat operator + (const shortfloat a, const float b) { return shortfloat((float)a + b); }
    friend inline shortfloat operator - (const shortfloat a, const shortfloat b) { return shortfloat((float)a - (float)b); }
    friend inline shortfloat operator - (const float a, const shortfloat b) { return shortfloat(a - (float)b); }
    friend inline shortfloat operator - (const shortfloat a, const float b) { return shortfloat((float)a - b); }
    friend inline shortfloat operator * (const shortfloat a, const shortfloat b) { return shortfloat((float)a * (float)b); }
    friend inline shortfloat operator * (const float a, const shortfloat b) { return shortfloat(a * (float)b); }
    friend inline shortfloat operator * (const shortfloat a, const float b) { return shortfloat((float)a * b); }
    friend inline shortfloat operator / (const shortfloat a, const shortfloat b) { return shortfloat((float)a / (float)b); }
    friend inline shortfloat operator / (const float a, const shortfloat b) { return shortfloat(a / (float)b); }
    friend inline shortfloat operator / (const shortfloat a, const float b) { return shortfloat((float)a / b); }
};
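Since denormals were left out of the class, here is one possible way to handle them when expanding 16-bit floats to 32-bit. This is only a sketch that works on raw bit patterns rather than on the class itself, and the function name is made up. A denormal shortfloat has an exponent field of 0 and a value of mantissa * 2^-24, which always fits in a normal 32-bit float.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: expand a 16-bit (1-5-10) pattern to the 32-bit (1-8-23)
// pattern of the same value, including denormals.
std::uint32_t halfBitsToFloatBits(std::uint16_t h) {
    std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    std::uint32_t exponent = (h >> 10) & 0x1Fu;
    std::uint32_t mantissa = h & 0x03FFu;

    if (exponent == 0x1Fu) {
        // Infinity or NaN: set all eight exponent bits, keep the payload.
        return sign | 0x7F800000u | (mantissa << 13);
    }
    if (exponent == 0) {
        if (mantissa == 0) return sign;  // signed zero
        // Denormal: value = (mantissa / 2^10) * 2^-14 with no implicit 1.
        // Shift until the leading 1 reaches bit 10, then drop it.
        int shift = 0;
        while ((mantissa & 0x0400u) == 0) {
            mantissa <<= 1;
            ++shift;
        }
        assert(shift >= 1 && shift <= 10);
        mantissa &= 0x03FFu;
        // Float exponent field: 127 - 14 - shift = 113 - shift.
        return sign | (static_cast<std::uint32_t>(113 - shift) << 23) | (mantissa << 13);
    }
    // Normal number: rebias from 15 to 127 (add 112 = 0x70).
    return sign | ((exponent + 112u) << 23) | (mantissa << 13);
}
```

The smallest denormal, bit pattern 0x0001, comes out as 2^-24 in float form. Going the other way (float to shortfloat) would also need rounding and would have to produce denormals for small values, which this sketch does not attempt.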
