# Floating Point Arithmetic



I'm writing a simulator in Java that works with floating-point numbers. I want to know how to convert a value from single-precision IEEE 754 format to double-precision IEEE 754 format. I've tried Google but can't find anything on floating-point arithmetic. Can anyone help?

##### Share on other sites
Maybe it's not exactly what you are looking for, but...

The Linux kernel sources include a library that performs FPU emulation. You could look there.

/def

##### Share on other sites
Just cast it.

Or did you want to do all the bit-twiddling yourself, in order to demonstrate an understanding of the machine or something?
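A minimal sketch of the cast approach, written in C++ to match the other code in this thread (in Java the same widening happens implicitly with `double d = f;`). The names `widen` and `round_trips` are made up for illustration. The double format has more exponent and mantissa bits than the float format, so the conversion is exact and converting back returns the original value.

```cpp
// Widening a float to a double is exact: no rounding takes place.
double widen(float f) {
    return static_cast<double>(f);
}

// Hypothetical helper: true if f survives float -> double -> float unchanged.
// NaN compares unequal to itself, so it is checked separately.
bool round_trips(float f) {
    double d = widen(f);
    float back = static_cast<float>(d);
    return (f != f) ? (back != back) : (back == f);
}
```

Note that `widen(0.1f)` is not equal to the double literal `0.1`: the cast preserves the float's value, it does not recover digits that the float never stored.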

##### Share on other sites
For bit twiddling try this.

    float f
    int* i
    i = &f  // Do whatever you have to, you need the float to allow bitwise arithmetic
    Sign = i & 0x800000  // 24 bits
    Exp  = i & 0x7F8000
    Mant = i & 0x007FFF

That splits the float into its sign, exponent and mantissa sections. Then you have to glue it back together.

Exercise for the reader (hint: use the |'s, Luke)

From,
Nice coder

##### Share on other sites
Quote:
Original post by Nice Coder

That splits the float into its sign, exponent and mantissa sections. Then you have to glue it back together.

Exercise for the reader (hint: use the |'s, Luke)

Not sufficient. The float and double exponents have a different bias (127 and 1023, respectively). Then there is the issue of infinities, NaNs and denormals.
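The points above can be put together roughly as follows. This is only a sketch, assuming `float` and `double` are IEEE 754 binary32 and binary64; the function names `float_bits`, `double_from_bits` and `widen_bits` are made up for illustration. In Java, `Float.floatToRawIntBits` and `Double.longBitsToDouble` would play the role of the two `memcpy` helpers.

```cpp
#include <cstdint>
#include <cstring>

// Copy the bits of a float into a 32-bit integer (memcpy avoids the
// aliasing problems of pointer casts).
std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Copy a 64-bit pattern into a double.
double double_from_bits(std::uint64_t u) {
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}

// Convert a binary32 bit pattern to the binary64 pattern of the same value.
std::uint64_t widen_bits(std::uint32_t f) {
    std::uint64_t sign = static_cast<std::uint64_t>(f >> 31) << 63;
    std::uint32_t exp  = (f >> 23) & 0xFF;   // biased by 127
    std::uint64_t mant = f & 0x007FFFFF;     // 23 fraction bits

    if (exp == 0xFF) {
        // Infinity (mant == 0) or NaN (mant != 0): the exponent becomes
        // all ones and the fraction moves to the top of the 52-bit field.
        return sign | (0x7FFull << 52) | (mant << 29);
    }
    if (exp == 0) {
        if (mant == 0) return sign;          // signed zero
        // Denormal: value = 0.mant * 2^-126. Every float denormal is a
        // normal double, so shift until the leading 1 reaches bit 23.
        int e = -126;
        while ((mant & 0x00800000) == 0) {
            mant <<= 1;
            --e;
        }
        mant &= 0x007FFFFF;                  // drop the now-implicit leading 1
        return sign | (static_cast<std::uint64_t>(e + 1023) << 52) | (mant << 29);
    }
    // Normal number: rebias the exponent from 127 to 1023.
    return sign | (static_cast<std::uint64_t>(exp + (1023 - 127)) << 52) | (mant << 29);
}
```

For example, `double_from_bits(widen_bits(float_bits(0.1f)))` should compare equal to `static_cast<double>(0.1f)`. The fraction is shifted left by 29 because the double has 52 - 23 = 29 more fraction bits, and they are all zero after widening.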

##### Share on other sites
I haven't done the 32-bit to 64-bit conversion, but see the class below that I just finished for 16-bit floats. It handles infinities and NaNs, but I think I left out denormals.

Nice Coder, while I've no doubt that 24-bit floats exist somewhere, PCs use 32-bit floats, so the masks are:

    Sign = i & 0x80000000;
    Exp  = i & 0x7F800000;
    Mant = i & 0x007FFFFF;
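As a runnable illustration of the 32-bit masks, and of gluing the pieces back together with `|`, here is a small sketch. It assumes `float` is IEEE 754 binary32; the `FloatFields` struct and the `split` and `join` functions are hypothetical names, not something from the posts above.

```cpp
#include <cstdint>
#include <cstring>

struct FloatFields {
    std::uint32_t sign;      // 0 or 1
    std::uint32_t exponent;  // biased by 127, range 0..255
    std::uint32_t mantissa;  // 23 fraction bits
};

// Split a float into its three fields using the 32-bit masks.
FloatFields split(float f) {
    std::uint32_t i;
    std::memcpy(&i, &f, sizeof i);   // integer view of the float's bits
    FloatFields p;
    p.sign     = (i & 0x80000000u) >> 31;
    p.exponent = (i & 0x7F800000u) >> 23;
    p.mantissa =  i & 0x007FFFFFu;
    return p;
}

// Glue the fields back together with shifts and bitwise OR.
float join(const FloatFields& p) {
    std::uint32_t i = (p.sign << 31) | (p.exponent << 23) | p.mantissa;
    float f;
    std::memcpy(&f, &i, sizeof f);
    return f;
}
```

For instance, `split(1.0f)` gives sign 0, exponent 127 and mantissa 0, and `join(split(x))` returns `x` unchanged.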

64-bit double: (1-11-52)
sign = bit 63
exponent = bits 62-52 (biased by 1023)
mantissa = bits 51-0

32-bit float: (1-8-23)
sign = bit 31
exponent = bits 30-23 (biased by 127)
mantissa = bits 22-0

16-bit shortfloat: (1-5-10)
sign = bit 15
exponent = bits 14-10 (biased by 15)
mantissa = bits 9-0
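
To make the role of the bias concrete: for a normal number in any of the three layouts, with sign bit $s$, stored (biased) exponent $e$ and a mantissa field $m$ of $p$ bits, the value is given by the standard IEEE 754 formula below. The symbols $s$, $e$, $m$ and $p$ are just labels chosen here.

```latex
x = (-1)^{s} \times 2^{\,e - \mathrm{bias}} \times \left(1 + \frac{m}{2^{p}}\right),
\qquad (\mathrm{bias},\, p) \in \{(1023,\, 52),\ (127,\, 23),\ (15,\, 10)\}
```

For example, the float 1.0 has $s = 0$, $e = 127$, $m = 0$. The two extreme exponent values are reserved: an all-zero exponent encodes zero and denormals, with value $(-1)^{s} \times 2^{\,1 - \mathrm{bias}} \times m / 2^{p}$, and an all-ones exponent encodes infinity ($m = 0$) or NaN ($m \neq 0$).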

    class shortfloat {
        unsigned short raw;
    public:
        shortfloat() {}
        shortfloat(float f) {
            unsigned int u = *(reinterpret_cast<unsigned int*>(&f));
            // Rebias the exponent: 127 - 15 = 0x70, then clamp to the 5-bit range.
            int exponent = ((u>>0x17)&0xFF)-0x70;
            if (exponent < 0) exponent = 0;
            else if (exponent > 0x1F) exponent = 0x1F;
            // Sign to bit 15, exponent to bits 14-10, top 10 mantissa bits to bits 9-0.
            raw = ((u>>0x1F)<<0x0F) | (exponent<<0x0A) | ((u>>0x0D)&0x03FF);
        }
        operator float() const {
            int exponent = ((raw>>0x0A)&0x1F)+0x70;
            int mantissa = raw&0x03FF;
            if (exponent == 0x70 && mantissa == 0) exponent = 0;  // zero
            else if (exponent == 0x8F) exponent = 0xFF;           // infinity or NaN
            unsigned int u = (((unsigned long)(raw>>0x0F))<<0x1F) | (exponent<<0x17) | (mantissa<<0x0D);
            return *(reinterpret_cast<float*>(&u));
        }
        inline shortfloat operator +=(const float b) { return *this = shortfloat((float)*this + b); }
        inline shortfloat operator -=(const float b) { return *this = shortfloat((float)*this - b); }
        inline shortfloat operator *=(const float b) { return *this = shortfloat((float)*this * b); }
        inline shortfloat operator /=(const float b) { return *this = shortfloat((float)*this / b); }
        friend inline bool operator < (const shortfloat a, const shortfloat b) { return (float)a < (float)b; }
        friend inline bool operator < (const      float a, const shortfloat b) { return a < (float)b; }
        friend inline bool operator < (const shortfloat a, const      float b) { return (float)a < b; }
        friend inline bool operator > (const shortfloat a, const shortfloat b) { return (float)a > (float)b; }
        friend inline bool operator > (const      float a, const shortfloat b) { return a > (float)b; }
        friend inline bool operator > (const shortfloat a, const      float b) { return (float)a > b; }
        friend inline bool operator <=(const shortfloat a, const shortfloat b) { return (float)a <= (float)b; }
        friend inline bool operator <=(const      float a, const shortfloat b) { return a <= (float)b; }
        friend inline bool operator <=(const shortfloat a, const      float b) { return (float)a <= b; }
        friend inline bool operator >=(const shortfloat a, const shortfloat b) { return (float)a >= (float)b; }
        friend inline bool operator >=(const      float a, const shortfloat b) { return a >= (float)b; }
        friend inline bool operator >=(const shortfloat a, const      float b) { return (float)a >= b; }
        friend inline bool operator ==(const shortfloat a, const shortfloat b) { return a.raw == b.raw; }
        friend inline bool operator ==(const      float a, const shortfloat b) { return a == (float)b; }
        friend inline bool operator ==(const shortfloat a, const      float b) { return (float)a == b; }
        friend inline bool operator !=(const shortfloat a, const shortfloat b) { return a.raw != b.raw; }
        friend inline bool operator !=(const      float a, const shortfloat b) { return a != (float)b; }
        friend inline bool operator !=(const shortfloat a, const      float b) { return (float)a != b; }
        friend inline shortfloat operator - (const shortfloat a) { shortfloat b = a; b.raw ^= 0x8000; return b; }
        friend inline shortfloat operator + (const shortfloat a, const shortfloat b) { return shortfloat((float)a + (float)b); }
        friend inline shortfloat operator + (const      float a, const shortfloat b) { return shortfloat(a + (float)b); }
        friend inline shortfloat operator + (const shortfloat a, const      float b) { return shortfloat((float)a + b); }
        friend inline shortfloat operator - (const shortfloat a, const shortfloat b) { return shortfloat((float)a - (float)b); }
        friend inline shortfloat operator - (const      float a, const shortfloat b) { return shortfloat(a - (float)b); }
        friend inline shortfloat operator - (const shortfloat a, const      float b) { return shortfloat((float)a - b); }
        friend inline shortfloat operator * (const shortfloat a, const shortfloat b) { return shortfloat((float)a * (float)b); }
        friend inline shortfloat operator * (const      float a, const shortfloat b) { return shortfloat(a * (float)b); }
        friend inline shortfloat operator * (const shortfloat a, const      float b) { return shortfloat((float)a * b); }
        friend inline shortfloat operator / (const shortfloat a, const shortfloat b) { return shortfloat((float)a / (float)b); }
        friend inline shortfloat operator / (const      float a, const shortfloat b) { return shortfloat(a / (float)b); }
        friend inline shortfloat operator / (const shortfloat a, const      float b) { return shortfloat((float)a / b); }
    };
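
Since the class above leaves out denormals, here is one possible sketch of the decoding direction (16-bit to 32-bit) that includes them. It is not the poster's code: `half_to_float` is a hypothetical free function that takes the raw 16-bit pattern, and it assumes `float` is IEEE 754 binary32. The encoding direction would additionally need rounding, which is not shown.

```cpp
#include <cstdint>
#include <cstring>

// Decode a 16-bit (1-5-10) pattern into a float, including denormals.
float half_to_float(std::uint16_t h) {
    std::uint32_t sign = static_cast<std::uint32_t>(h >> 15) << 31;
    std::uint32_t exp  = (h >> 10) & 0x1F;   // biased by 15
    std::uint32_t mant = h & 0x03FF;         // 10 fraction bits
    std::uint32_t u;

    if (exp == 0x1F) {
        u = sign | 0x7F800000u | (mant << 13);                 // infinity or NaN
    } else if (exp != 0) {
        u = sign | ((exp + (127 - 15)) << 23) | (mant << 13);  // normal: rebias
    } else if (mant == 0) {
        u = sign;                                              // signed zero
    } else {
        // Denormal: value = 0.mant * 2^-14. It is a normal float, so shift
        // until the leading 1 reaches bit 10 and adjust the exponent.
        int e = -14;
        while ((mant & 0x0400) == 0) {
            mant <<= 1;
            --e;
        }
        mant &= 0x03FF;                      // drop the now-implicit leading 1
        u = sign | (static_cast<std::uint32_t>(e + 127) << 23) | (mant << 13);
    }

    float f;
    std::memcpy(&f, &u, sizeof f);           // avoids pointer-cast aliasing issues
    return f;
}
```

For example, the smallest positive denormal `0x0001` decodes to 2^-24, which is `1.0f / 16777216.0f`.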
