L. Spiro

How To Convert double To float


The reason the title surprised you is the same as the reason I can’t get proper results on Google: You’re thinking about a cast, and no matter what search terms I use I keep getting results talking about casts.

I enjoy super-low-level programming and I have both a personal interest in and a need for creating a manual cast from a double to a floating-point type given customizable properties (number of exponent bits, number of mantissa bits, is there a sign bit?, is there an implied mantissa bit?, etc.)

I’ve already written something that could be called a prototype and it correctly handles all “normal” conversions from a 64-bit double into any other type of floating-point format based on however many bits you want.
So as an example, let’s say I have PI as a double constant “3.1415926535897931”.  I manually cast it to a 32-bit float by specifying SIGNBIT=TRUE, EXP=8, MANT=24, IMPLICITMAN=TRUE and I get this result:
Actual [1,8,24,TRUE]FLOAT Result = 3.1415927410125732
Sign = 0
Exp = 0x0000000000000080
Man = 0x0000000000490fdb

So it manually gives the exact same result as a regular cast from double to float, and can round or truncate (here, rounding up was performed).  Casting manually seems useless, right?  But I want to support arbitrary floating-point types that are not present in C.  How about an example from a 16-bit float?

Actual [1,5,11,TRUE]FLOAT Result = 3.1406250000000000
Sign = 0
Exp = 0x0000000000000010
Man = 0x0000000000000248
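To make the Sign/Exp/Man fields above concrete, here is a minimal sketch that decodes such fields back into a value. `decodeFields` is a hypothetical helper written for illustration, not the converter's actual code. It assumes an implicit mantissa bit, a bias of 2^(expBits-1)-1, and it does not handle the all-ones exponent (Inf/NaN).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: rebuilds the value from separate sign/exponent/mantissa
// fields. manBits counts the implicit bit, matching the [1,8,24,TRUE] notation.
// Assumption: IEEE-style bias; an all-ones exponent (Inf/NaN) is not handled.
double decodeFields(uint64_t sign, uint64_t exp, uint64_t man,
                    int expBits, int manBits) {
    assert(expBits >= 2 && expBits <= 11 && manBits >= 2 && manBits <= 53);
    const int bias = (1 << (expBits - 1)) - 1;
    const int fracBits = manBits - 1;  // bits actually stored
    double sig;
    int e;
    if (exp == 0) {
        // Denormalized: no implicit 1, exponent pinned to the smallest normal one.
        sig = std::ldexp(static_cast<double>(man), -fracBits);
        e = 1 - bias;
    } else {
        // Normalized: implicit leading 1.
        sig = 1.0 + std::ldexp(static_cast<double>(man), -fracBits);
        e = static_cast<int>(exp) - bias;
    }
    const double value = std::ldexp(sig, e);
    return sign ? -value : value;
}
```

With the fields printed above, `decodeFields(0, 0x80, 0x490fdb, 8, 24)` gives back 3.1415927410125732 and `decodeFields(0, 0x10, 0x248, 5, 11)` gives back 3.140625.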

I also want to study arbitrary floating-point formats.  For example, the smallest non-0 number a 32-bit float can be is 1.4012984643248171e-45 and the max is 3.4028234663852886e+38.

How about for a 16-bit float?

Smallest non-0: 5.9604644775390625e-08
Max: 65504.000000000000.

How about this random format?
Smallest non-0: 2.0194839173657902e-28
Max: 1.8446744065119617e+19
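The limits quoted above follow directly from the bit counts. A rough sketch, assuming an IEEE-style layout (implicit mantissa bit, all-ones exponent reserved for Inf/NaN, exponent 0 used for denormals). The function names are made up for this example, and `manBits` counts the implicit bit as in the notation above.

```cpp
#include <cassert>
#include <cmath>

// Assumption: bias is 2^(expBits-1) - 1, as in the IEEE binary formats.
static int formatBias(int expBits, int manBits) {
    assert(expBits >= 2 && expBits <= 11 && manBits >= 2 && manBits <= 53);
    return (1 << (expBits - 1)) - 1;
}

// Largest finite value: mantissa all ones, largest non-reserved exponent.
double formatMax(int expBits, int manBits) {
    const int bias = formatBias(expBits, manBits);
    const double sig = 2.0 - std::ldexp(1.0, -(manBits - 1));
    return std::ldexp(sig, bias);
}

// Smallest positive normalized value: exponent field 1, mantissa 0.
double formatMinNormal(int expBits, int manBits) {
    return std::ldexp(1.0, 1 - formatBias(expBits, manBits));
}

// Smallest positive denormalized value: exponent field 0, mantissa 1.
double formatMinDenormal(int expBits, int manBits) {
    const int bias = formatBias(expBits, manBits);
    return std::ldexp(1.0, 1 - bias - (manBits - 1));
}
```

For the 16-bit case, `formatMax(5, 11)` is 65504 and `formatMinDenormal(5, 11)` is 5.9604644775390625e-08, matching the numbers listed earlier.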


So a lot of my converter is already working.  If you are still wondering what the point is, you can understand that a graphics programmer who has to work with 16-bit and 32-bit shader precision, F16 and F32 textures, D24 depth textures, and R11G11B10 float textures can really find something like this useful.  There are many floating-point formats out there but not really any tools to investigate those float formats.

Now to the Question

There are special cases for denormalized numbers that I am not currently handling, and I am temporarily making assumptions about sign bits etc.  Anyone have a good link to a break-down of casting from a floating-point value to another type of floating-point value manually?  Going over IEEE doesn’t provide example implementations nor does it really dig into the details.  The details I often find cover mostly what I have already implemented, which is from a normalized number converted to another normalized number.  I don’t really see guides on the best way to implement the cases where either the source or the destination is denormalized.

I can implement it “my way” but I definitely want to look at what has been done or at the very least go over specifications to ensure my way fully complies.

L. Spiro


Impossible AFAIK under the IEEE 754 standard, because your specific trouble is about digits. The binary representation of the fractional (mantissa) part of PI is longer than 23 bits, the maximum that a single-precision float can hold as its fractional part. So the compiler will convert it for you to the nearest (not clarified, but it's supposed to work like that) floating-point value that fits into a 23-bit fractional part.


You’ve missed the point.

The IEEE standard would give me guidelines that I can generalize to any floating-point format of any combination of bits, as demonstrated above.  In the example of PI I gave, you can see how it degraded in precision based on the number of bits I assigned to the exponent and mantissa.

From double = 3.1415926535897931, to float = 3.1415927410125732, to float16 = 3.1406250000000000.
It’s exactly this degradation that it is important to see and investigate for programmers in general, but heavily for graphics programmers.

And again, note that I am not relying on a CPU cast, I am doing the cast manually, so I am not worried about what the compiler will cast etc., and I am not restricted in the floating-point type.  In the last example I literally just invented a 38-bit float.  That’s the whole point.  I need to cast manually because I need to inspect float types that are not natively supported in C/C++.

If I could get clear documentation, or better yet pseudocode showing the micro-instruction process for converting a double to a float, I would be able to generalize it for my purposes to cast to anything.


Every number I posted above actually came out of my converter, which I wrote just last night.  Even the maximum float value came from my implementation rather than looking at FLT_MAX.  The fact that my class generates the same value as FLT_MAX is just because my implementation is 100% correct for normalized numbers.  That part is a completely-solved area.

I want to look at standards and example implementations so that I can be confident I’ve handled all edge cases specifically dealing with denormalized numbers.



L. Spiro


Customizing floating point seems interesting. I'm using C for rendering and math. Since C doesn't support 16-bit floats, I also need to do that manually or find another approach (just speaking for 16-bit):

What I'm thinking of doing: since I don't think all CPUs (except maybe ARM) support 16-bit float arithmetic, I'm considering doing half-precision arithmetic as 32-bit float and then converting the result back to 16-bit float for storage in memory (https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats). I think this would be better than implementing the arithmetic manually if performance matters (single instruction vs. multiple).

In the future I may look into some of the things you are looking at now; I just wanted to share my thoughts about 16-bit floats.

4 hours ago, L. Spiro said:

Anyone have a good link to a break-down of casting from a floating-point value to another type of floating-point value manually? ...  I don’t really see guides on the best way to implement the cases where either the source or the destination is denormalized.

As far as I can see, there is nothing specific on the matter. 

It looks like the standard defines several critical aspects, such as the numeric base (either 2 or 10), the sign/coefficient/quotient requirements, ranges, and so forth. Looks like you've got that covered.

Section 5.3 covers the conversion, but doesn't provide a specific process.  The conversion merely needs to happen and be properly rounded if narrower, exactly precise if wider.  Nothing specific about denormalized numbers so I presume the rules are the same.

I'm not exactly sure about your example values, since an implementation MAY use reserved exponents for +/- INF, SNAN/QNAN, +/- zero, and denormalized values, using the two lowest values for the markers. That's why 8-bit exponents have a range of -126 to +127 instead of the more typical -128 to +127, and 11-bit exponents range from -1022 to +1023 instead of the more typical -1024 to +1023.  Implementations are allowed to extend beyond that because the standard is careful to define the represented values rather than the encoding.

In your example you give a 16-bit float with a 5-bit exponent, but I think the representation allows for e-05 instead of e-08 since the two lowest values are reserved (excluding denormals). The 7-bit exponent would have values -64 and -63 reserved and so range from -62 to +63, meaning the lowest number would be e-19 rather than e-28 (excluding denormals).

In that regard I agree with your assessments. Denormalized numbers aren't treated differently other than a special value for the exponent. For a denormalized number I'd expect the mantissa to be extended or reduced in the same manner, rounding the mantissa if narrower, extending with zeros if wider.


Today I got it fully compliant with IEEE standards and full support for denormalized values.  I created some intrinsics that allow you to specify the properties of floats so you can make any kind of float you want as long as no components are larger than in a 64-bit double (a limitation I might handle in the future).

Here are some examples.

as_float10( 1.0 / 3 )
0.328125 (3EA80000h, 3FD5000000000000h)

as_float11( 1.0 / 3 )
0.33203125 (3EAA0000h, 3FD5400000000000h)

as_float14( 1.0 / 3 )
0.3330078125 (3EAA8000h, 3FD5500000000000h)

as_float16( 1.0 / 3 )
0.333251953125 (3EAAA000h, 3FD5540000000000h)

as_float32( 1.0 / 3 )
0.3333333432674407958984375 (3EAAAAABh, 3FD5555560000000h)

as_float64( 1.0 / 3 )
0.333333333333333314829616256247390992939472198486328125 (3EAAAAABh, 3FD5555555555555h)


These are just shortcuts for the common formats you might encounter.  For a custom type you can use the full intrinsic:
as_float( 1, 7, 20, true, 1.0 / 3 ) // as_float( signBits, expBits, manBits, implicitMantissa, value )
0.33333301544189453125 (3EAAAAA0h, 3FD5555400000000h)


I will be adding more features to get properties of custom floats too.  For example:
as_float_max( 1, 7, 20, true ) // Gets the maximum value for the given type of float.
as_float_min( 1, 7, 20, true ) // Gets the min non-0 value for the given type of float.
as_float_min_n( 1, 7, 20, true ) // Gets the min non-0 normalized value for the given type of float.
as_float_inf( 1, 7, 20, true ) // Etc.

Also some options to display the components of floats separately (sign, exponent, and mantissa), and more features.

L. Spiro



I'm very interested in the solution you found for cases where a normalized source floating-point value results in a denormalized destination.

How did you handle this?


When dealing with normalized numbers, you just need to make the exponents the same value, which is just a matter of getting the real exponent value from the source by taking the exponent integer and applying the bias, then converting to the new set of bits with the new bias.
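As a rough illustration of that re-biasing step (a sketch with hypothetical function names, assuming the usual bias of 2^(expBits-1)-1 in the destination and a source double that is itself normalized):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads the real (unbiased) exponent out of a normalized double's raw bits.
int unbiasedExponent(double value) {
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const int raw = static_cast<int>((bits >> 52) & 0x7FF);
    assert(raw != 0 && raw != 0x7FF);  // zero/denormal and Inf/NaN are other cases
    return raw - 1023;                 // remove the double's bias
}

// Applies the destination bias. The result may fall outside the destination's
// normalized range, so it is returned as a plain int and checked separately.
int rebiasExponent(double value, int expBits) {
    const int bias = (1 << (expBits - 1)) - 1;
    return unbiasedExponent(value) + bias;
}

// True if the biased exponent encodes a normalized number: not 0 (denormal)
// and not all ones (Inf/NaN).
bool isNormalExponent(int biased, int expBits) {
    return biased >= 1 && biased <= (1 << expBits) - 2;
}
```

For PI this yields 0x80 with 8 exponent bits and 0x10 with 5, the same Exp fields shown in the first post. A value such as 1e-10 re-biases to a negative number with 5 exponent bits, which is the signal that the destination has to be denormalized (or zero).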

If the exponents are the same, the rest of the number is converted just by shaving bits off the mantissa.  You can round the number by adding the highest bit shaved off.  My original implementation had the sign, exponent, and mantissa as separate numbers, so my implementation would have produced incorrect results when the mantissa was all F's and a 1 was added from the shaved bits.  My mantissa would wrap to 0 but the exponent wouldn't have increased by 1 as it should.
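A small sketch of why a single integer fixes the wrap-around: with the exponent sitting directly above the mantissa, the rounding carry flows into the exponent on its own. `packNormal` is a hypothetical name, and the rounding shown is the simple "add the highest shaved bit" variant described above, not round-half-to-even. It assumes the value is normalized in both the source and the destination.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the destination's exponent and mantissa fields packed together
// (sign excluded). manBits counts the implicit bit, so manBits - 1 are stored.
uint64_t packNormal(double value, int expBits, int manBits) {
    assert(expBits >= 2 && expBits <= 11 && manBits >= 2 && manBits <= 53);
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const int shift = 52 - (manBits - 1);  // mantissa bits to shave off
    const int64_t srcExp = static_cast<int64_t>((bits >> 52) & 0x7FF);
    const int64_t dstExp = srcExp - 1023 + ((1 << (expBits - 1)) - 1);
    assert(srcExp != 0 && srcExp != 0x7FF);               // normalized source
    assert(dstExp >= 1 && dstExp <= (1 << expBits) - 2);  // normalized destination
    // Exponent directly above the 52-bit mantissa, all in one integer.
    uint64_t packed = (static_cast<uint64_t>(dstExp) << 52)
                    | (bits & ((1ull << 52) - 1));
    // Adding half of the last kept unit is the same as adding the highest
    // shaved bit. If the mantissa was all ones, the carry bumps the exponent.
    if (shift > 0) packed += 1ull << (shift - 1);
    return packed >> shift;
}
```

For example, `packNormal(1.9999999999, 8, 24)` returns 0x40000000, which is exponent 0x80 with a zero mantissa (exactly 2.0), rather than a wrapped mantissa under the old exponent. If the carry happens at the largest normalized exponent, the packed result becomes the all-ones exponent with a zero mantissa, which is the Inf encoding.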

So now the problem with denormalized numbers can be made more general and refer instead to any case where the exponents in the source and destination are different (which will always be the case for denormalized numbers with different amounts of bits for the exponent).

The most common case is with denormals, so I will go a bit into that.  Denormalized numbers lose the implicit 1 on the mantissa.  For normalized cases the mantissa always increases the value based off the exponent (it is always Exponent * [1, 1.999999...]) whereas for denormalized cases the number always decreases based off the exponent (Exponent * [0.999999..., 0]).

You can't arrive at a general solution just with bit-shifting tricks now.  You have to take the source number and the denormalized exponent bias on the destination to determine what mantissa value will best match the source value when multiplied.  In other words, from a double to a float, ::pow( 2, -126 ) * X ≈ SrcValue, solve for X.
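One way to sketch the "solve for X" step, under the assumption that the destination uses an IEEE-style bias and that X is then scaled to the stored mantissa width and rounded to the nearest integer. `denormalMantissa` is a hypothetical name for this example only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Returns the mantissa field for a source value that is too small to be
// normalized in the destination. manBits counts the implicit bit, so the
// stored field is manBits - 1 bits wide.
uint64_t denormalMantissa(double src, int expBits, int manBits) {
    assert(expBits >= 2 && expBits <= 11 && manBits >= 2 && manBits <= 53);
    const int bias = (1 << (expBits - 1)) - 1;
    const int fracBits = manBits - 1;
    // Fixed denormal scale: 2^(1 - bias), i.e. 2^-126 for a 32-bit float.
    const double scale = std::ldexp(1.0, 1 - bias);
    assert(std::fabs(src) < scale);
    // Solve scale * X = |src| for X. Dividing by a power of two is exact here.
    const double x = std::fabs(src) / scale;  // X lies in [0, 1)
    // Express X in units of the last mantissa bit and round to nearest.
    // A result of 2^fracBits is not an error: read as packed bits it is
    // exponent field 1 with mantissa 0, the smallest normalized number.
    return static_cast<uint64_t>(std::nearbyint(std::ldexp(x, fracBits)));
}
```

For a 32-bit float, `denormalMantissa(1e-40, 8, 24)` gives 0x116C2, and for a 16-bit float `denormalMantissa(1e-5, 5, 11)` gives 0xA8.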

Now you have one generalized case for values where the exponent in both numbers matches and another generalized way to handle denormalized cases. Then you patch for cases where the smallest normalized number is closer to the source number than the highest denormalized number, plus rounding into Inf. Throughout, you work with a single integer so that adding to the mantissa properly rolls up the exponent when necessary.


Now throw that all away and copy this guy's code: https://stackoverflow.com/a/3542975

He hard-coded it from 32-bit floats to 16-bit floats.  I generalized it to go from 64-bit floats to anything else.  It works properly in all cases, including going into Inf and NaN, but doesn't allow specifying a rounding mode explicitly.  I will have to add that later.

L. Spiro

