🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!


Double to float C++

General and Gameplay Programming

Started by taby May 13, 2024 04:27 PM

181 comments, last by JoeJ 6 days, 20 hours ago

JoeJ

4,258

May 21, 2024 02:44 PM

taby said:
bits &= -1 << shift;

Probably it treats the -1 as a 32-bit integer, so shifts of 32 or more won't do what you expect.

Try bits &= -1ull << shift;

Or bits &= uint64_t(0xFFFFFFFFFFFFFFFFull) << shift;

This sucks. I'm never sure and it often causes me bugs. More bits, more trouble.
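A minimal sketch of the 64-bit mask idea from this post, assuming the goal is a mask with the lowest `shift` bits cleared. The helper name `low_bits_cleared` is made up for illustration. Starting from an unsigned 64-bit value sidesteps the question of how wide the literal is.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: a 64-bit mask with the lowest `shift` bits cleared.
// The operand is already uint64_t, so the shift is carried out in 64 bits.
// Valid for 0 <= shift <= 63; a shift of 64 or more is undefined behaviour.
std::uint64_t low_bits_cleared(unsigned shift)
{
    assert(shift < 64);
    return ~std::uint64_t{0} << shift;
}
```

For example, `low_bits_cleared(29)` yields `0xFFFFFFFFE0000000`, and `bits &= low_bits_cleared(shift);` replaces the `-1 << shift` expression.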

taby

1,508

Author

May 21, 2024 03:09 PM

Sorry, yes, I was using the wrong value for the shifting. I’ll try it tonight when I get home. 🙂

thanks again joej!

taby

1,508

Author

May 21, 2024 10:02 PM

Yes the problem was me. I got it working. Now to test if it works the way we want it to!

#include <cmath>
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <iostream>
using namespace std;

int main(void)
{
	cout << setprecision(30) << endl;

	double pi = 4.0 * atan(1.0);

	const int64_t mantissa_size = 52;
	uint64_t max = static_cast<uint64_t>(-1); // 2^64 - 1

	for (int64_t shift = 0; shift < mantissa_size; shift++)
	{
		// memcpy copies the bit pattern without the aliasing problems
		// of reinterpret_cast<uint64_t &>
		uint64_t bits = 0;
		memcpy(&bits, &pi, sizeof bits);
		bits = bits & (max << shift); // clear the lowest `shift` mantissa bits
		double reduced = 0.0;
		memcpy(&reduced, &bits, sizeof reduced);
		cout << shift << " " << reduced << endl;
	}

	return 0;
}

taby

1,508

Author

May 21, 2024 10:25 PM

I tried it out. It still does not snap to the closest float. :(

In fact, if I shift before I cast back to float, it doesn't work. So, shifting can't be the solution. I really appreciate your hard work joej! Sorry man.

JoeJ

4,258

May 21, 2024 10:55 PM

Interesting, because I'm quite certain that's what a conversion from double to float is doing: clipping the less significant bits. But of course from both mantissa and exponent.

Maybe there is rounding before the clip as well. Or your values are close to a change in exponent. Something like that is probably missing.

You can search for code examples that convert float to half (fp16). Many CPUs have no instruction for that, so examples should be plentiful.
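The difference between clipping and rounding can be checked directly. This is a minimal sketch, assuming IEEE 754 doubles and floats, the default round-to-nearest mode, and a finite input whose magnitude lies in float's normal range. Both function names are hypothetical. A double stores 52 mantissa bits and a float 23, so clipping means clearing the lowest 29.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: truncate toward zero by clearing the 29 lowest
// mantissa bits (52 double mantissa bits minus 23 float mantissa bits).
// Assumes d is finite and within float's normal range.
double truncate_to_float_precision(double d)
{
    assert(std::isfinite(d));
    std::uint64_t bits = 0;
    std::memcpy(&bits, &d, sizeof bits);      // copy the bit pattern out
    bits &= ~std::uint64_t{0} << 29;          // clear the low 29 bits
    double result = 0.0;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}

// The language-level conversion, which rounds instead of clipping.
double round_trip_through_float(double d)
{
    return static_cast<double>(static_cast<float>(d));
}
```

For 0.1 the two disagree: clipping lands just below 0.1, while the cast lands just above it, one float step higher. That would explain why masking alone does not snap to the closest float.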

taby

1,508

Author

May 21, 2024 11:06 PM

thanks again for the ideas. Yes I never thought to check fp16 conversion. You’re a saviour, man!

taby

1,508

Author

May 22, 2024 12:46 AM

I found this code, which might be helpful:

https://gamedev.stackexchange.com/a/17329/149713

#define F16_EXPONENT_BITS 0x1F
#define F16_EXPONENT_SHIFT 10
#define F16_EXPONENT_BIAS 15
#define F16_MANTISSA_BITS 0x3ff
#define F16_MANTISSA_SHIFT (23 - F16_EXPONENT_SHIFT)
#define F16_MAX_EXPONENT (F16_EXPONENT_BITS << F16_EXPONENT_SHIFT)

GLushort F32toF16(GLfloat val)
{
    GLuint f32 = (*(GLuint *) &val);
    GLushort f16 = 0;
    /* Decode IEEE 754 little-endian 32-bit floating-point value */
    int sign = (f32 >> 16) & 0x8000;
    /* Map exponent to the range [-127,128] */
    int exponent = ((f32 >> 23) & 0xff) - 127;
    int mantissa = f32 & 0x007fffff;
    if (exponent == 128) 
    { /* Infinity or NaN */
        f16 = sign | F16_MAX_EXPONENT;
        if (mantissa) f16 |= (mantissa & F16_MANTISSA_BITS);

    } 
    else if (exponent > 15) 
    { /* Overflow - flush to Infinity */
        f16 = sign | F16_MAX_EXPONENT;
    } 
    else if (exponent > -15) 
    { /* Representable value */
        exponent += F16_EXPONENT_BIAS;
        mantissa >>= F16_MANTISSA_SHIFT;
        f16 = sign | exponent << F16_EXPONENT_SHIFT | mantissa;
    }
    else 
    {
        f16 = sign;
    }
    return f16;
}
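The decode step at the top of F32toF16 can also be written without the pointer cast. A minimal sketch, assuming a 32-bit IEEE 754 float; `Float32Fields` and `decode_float32` are hypothetical names and are not part of the linked answer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical container for the three fields of a 32-bit float.
struct Float32Fields {
    bool negative;           // sign bit
    int exponent;            // unbiased exponent, in [-127, 128]
    std::uint32_t mantissa;  // the 23 stored mantissa bits
};

Float32Fields decode_float32(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);  // copy the bit pattern out

    Float32Fields fields;
    fields.negative = (bits >> 31) != 0;
    fields.exponent = static_cast<int>((bits >> 23) & 0xFF) - 127;
    fields.mantissa = bits & 0x007FFFFF;
    return fields;
}
```

`decode_float32(1.5f)` gives exponent 0 and mantissa `0x400000`, since 1.5 is 1.1 in binary and only the fractional bit is stored.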

taby

1,508

Author

May 22, 2024 12:52 AM

Edit:

This works great! frexp and copysign for the win!

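double truncate_normalized_double(double d)
{
	// Clamp to [0, 1]
	if (d <= 0.0)
		return 0.0;
	else if (d >= 1.0)
		return 1.0;

	int exponent = 0;

	// frexp splits d into a fraction in [0.5, 1) and a power-of-two exponent
	const double fraction = frexp(d, &exponent);

	// ldexp recombines them exactly, so d_final == d; no bits are dropped here
	const double d_final = ldexp(fraction, exponent);

	// copysignf takes float arguments, so the narrowing to float happens in
	// this call; d is positive at this point, so the sign argument is positive
	return copysignf(static_cast<float>(d_final), 1.0f);
}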

taby

1,508

Author

May 22, 2024 02:05 AM

It’s still not what I need. To make things simple, the range is from 0 through 1, so the exponent is always zero. I’ll be working on it all night lol
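The exponent claim is easy to check with frexp. A minimal sketch, with a made-up helper name, that reports the power-of-two exponent frexp assigns to a value:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: the exponent frexp reports for d, where
// d == fraction * 2^exponent and fraction lies in [0.5, 1).
// Assumes d is finite and non-zero.
int frexp_exponent(double d)
{
    assert(std::isfinite(d) && d != 0.0);
    int exponent = 0;
    static_cast<void>(std::frexp(d, &exponent));  // only the exponent is needed
    return exponent;
}
```

Values in [0.5, 1) report 0, but 0.25 reports -1 and 0.1 reports -3, so under this convention the exponent is not constant across the 0 through 1 range.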

taby

1,508

Author

May 22, 2024 05:01 PM

Sorry, it doesn't quite work. Surely I'm missing something obvious!?

#include <cstdint>
#include <iostream>
#include <iomanip>
#include <string>
#include <bitset>
using namespace std;


void get_truncated_bit_string(double d, string &s)
{
	s = "";

	for (int i = 63; i >= 0; i--)
	{
		if (i <= 31)
			s += '0';
		else
			s += to_string((reinterpret_cast<uint64_t&>(d) >> i) & 1);
	}
}

void get_double_bit_string(double d, string& s)
{
	s = "";

	for (int i = 63; i >= 0; i--)
		s += to_string((reinterpret_cast<uint64_t&>(d) >> i) & 1);
}


double truncate_normalized_double(double d)
{
	//return static_cast<double>(static_cast<float>(d));

	string sd = "";
	get_double_bit_string(d, sd);
	cout << sd << endl;

	std::bitset<64> Bitset64(sd);

	uint64_t value = Bitset64.to_ullong();

	double dv = reinterpret_cast<double&>(value);
	string sdv = "";
	get_truncated_bit_string(dv, sdv);
	cout << sdv << endl;

	double df = static_cast<double>(static_cast<float>(d));
	string sdf = "";
	get_double_bit_string(df, sdf);
	cout << sdf << endl;

	return dv;
}

int main(void)
{
	cout << setprecision(20) << endl;

	for(double d = 0.0; d <= 1.0; d += 0.1)
		cout << truncate_normalized_double(d) << endl << endl;

	return 0;
}
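In the listing above, the returned value is rebuilt from the untouched bit string, and only the printed string has its low bits zeroed. One possible way to get the same result as the commented-out static_cast line through bit manipulation is to round before masking. This is a sketch under stated assumptions: IEEE 754 doubles, the default round-to-nearest-even mode, and a finite input that is zero or within float's normal range. The name `round_to_float_precision` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: round a double to the nearest value that has only
// 23 mantissa bits, with ties going to the even neighbour.
// Assumes d is finite and is zero or within float's normal range.
double round_to_float_precision(double d)
{
    assert(std::isfinite(d));

    const int dropped_bits = 52 - 23;  // 29 mantissa bits a float cannot hold
    const std::uint64_t low_mask = (std::uint64_t{1} << dropped_bits) - 1;
    const std::uint64_t half = std::uint64_t{1} << (dropped_bits - 1);

    std::uint64_t bits = 0;
    std::memcpy(&bits, &d, sizeof bits);

    // Lowest bit that survives; it decides which way an exact tie goes.
    const std::uint64_t kept_lsb = (bits >> dropped_bits) & 1;

    // Adding just under half a step carries into the kept bits exactly when
    // the dropped part is above half, or equal to half with an odd kept bit.
    bits += half - 1 + kept_lsb;
    bits &= ~low_mask;  // now clear the dropped bits

    double result = 0.0;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

For the values in the 0 through 1 loop, `round_to_float_precision(d)` should compare equal to `static_cast<double>(static_cast<float>(d))`. Float subnormals and overflow to infinity are not handled here.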
