🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!


Double to float C++

General and Gameplay Programming Programming

Started by taby May 13, 2024 04:27 PM

181 comments, last by JoeJ 1 week, 1 day ago

taby

1,508

Author

May 26, 2024 12:06 AM

I don't know how many times I need to say it, but where alpha = beta = 1 you get zero precession.

As for the character of the function that calculates angle from bits, it is not monotonically decreasing like you are saying. 20 bits gives an angle of 5.7, 24 bits gives an angle of 43.5, and 30 bits gives an angle of 14.8. That's not monotonic. Sorry man, but you're not right all of the time.

taby

1,508

Author

May 26, 2024 02:39 AM

Oh yes, and it all works with the symplectic integrator too. Holy fucking shit, I’m twice as lucky!

taby

1,508

Author

May 26, 2024 03:39 AM

I'm interested in knowing how to calculate, in general, the next representable value going from 0 toward 1. For instance, nexttowardf(0, 1) returns the float 1.4013e-45. For double, using nexttoward(0, 1), the result is the double 4.94066e-324.

These values are much smaller than the epsilon, which is pow(2, -23) = 1.19209e-07 for float, etc.

Not sure what to make of them. Sorry. I don't know everything lol.

FLT_TRUE_MIN works well.
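A small sketch that may help relate the two numbers: epsilon is the gap between 1.0 and the next float above it, while nexttowardf(0, 1) is the gap directly above zero, i.e. the smallest subnormal (FLT_TRUE_MIN). The helper name gap_above is made up for illustration, and the sketch assumes IEEE 754 floats with subnormals enabled (no flush-to-zero, no fast-math).

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it.
// Hypothetical helper; x must be finite and below FLT_MAX.
float gap_above(float x)
{
    assert(std::isfinite(x) && x < FLT_MAX);
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

With this, gap_above(1.0f) should equal FLT_EPSILON and gap_above(0.0f) should equal the smallest subnormal. The spacing scales with the magnitude of the value, which is why the step at 0 is so much smaller than epsilon.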

JoeJ

4,258

May 26, 2024 05:13 AM

taby said:
it is not monotonically decreasing like you are saying.

I didn't say anything like that.
When i did your angle measurement each timestep, it gave nan half of the time, hence the speculation that the measurement was the explanation.
Using complex numbers avoids the need to clamp a dot product, just like atan2, so no nans.
But i got the same numbers as before, so that isn't the explanation and the measurement is not the problem.
(I still recommend clamping the dot product for acos, so no nans can happen in any case.)
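For reference, here is a minimal sketch of both options mentioned above: acos with a clamped dot product, and an atan2 variant that needs no clamp. Vec3 and the function names are hypothetical stand-ins, not the vector_3 class used in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };  // hypothetical stand-in for the thread's vector_3

double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

Vec3 cross(const Vec3& a, const Vec3& b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

double length(const Vec3& v) { return std::sqrt(dot(v, v)); }

// Angle via acos. Rounding can push the cosine slightly outside [-1, 1],
// which makes acos return nan, so the value is clamped first.
double angle_acos(const Vec3& a, const Vec3& b)
{
    const double la = length(a), lb = length(b);
    assert(la > 0.0 && lb > 0.0);
    return std::acos(std::clamp(dot(a, b) / (la * lb), -1.0, 1.0));
}

// Angle via atan2(|a x b|, a . b). No normalization or clamp is needed.
double angle_atan2(const Vec3& a, const Vec3& b)
{
    return std::atan2(length(cross(a, b)), dot(a, b));
}
```

Both return an angle in [0, pi] in radians. The atan2 form also tends to behave better for nearly parallel vectors, where acos loses precision.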

taby said:
I'm interested in knowing how to generally calculate the next toward value from 0 to 1.

If you can be sure no overflow happens that would require changing the exponent, the technical answer is simple:

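uint64_t bits;
memcpy(&bits, &value, sizeof bits);   // read the double's bit pattern (avoids the aliasing cast)
bits++;                               // step to the next bit pattern
memcpy(&value, &bits, sizeof value);  // write it back as a double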

This increases mantissa by one, and you can't do a change smaller than that.
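A sketch that wraps the increment in a function so it can be checked against std::nextafter. The name next_up_bits is made up, and the sketch assumes a 64-bit IEEE 754 double and a non-negative, finite input. As far as I can tell, a mantissa overflow is harmless for such inputs: the carry moves into the exponent field and still lands on the next representable value.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Next representable double above `value`, found by incrementing its bit pattern.
// Hypothetical helper; only valid for non-negative, finite inputs.
double next_up_bits(double value)
{
    assert(!std::signbit(value) && std::isfinite(value));
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // reinterpret without aliasing UB
    ++bits;                                    // a mantissa carry rolls into the exponent
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

For negative values the bit pattern would have to be decremented instead, because the magnitude grows with the integer while the value shrinks.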

I'm no floating point expert, but i've learned those things from GPU work, when figuring out we can use integer atomic min/max operations on floating point data. It works because exponent is in the higher bits, thus interpreting bits as integers works to compare for larger / smaller. (Negative numbers require some bit hacking before and after the atomic ops.)
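One common way to do the bit hacking for negative numbers, sketched on the CPU: map each float to an unsigned key whose integer order matches the float order, then run the integer min/max on the keys. The function names are hypothetical, NaN is excluded, and this is not necessarily the exact trick used in the GPU code mentioned above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Map a float to a uint32 key so that a < b implies key(a) < key(b).
// Hypothetical helper; NaN inputs are not supported.
std::uint32_t float_to_ordered(float f)
{
    assert(!std::isnan(f));
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    // Negative floats: flip all bits so a larger magnitude gives a smaller key.
    // Non-negative floats: set the sign bit so they sort above all negatives.
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Inverse mapping, applied after the integer min/max.
float ordered_to_float(std::uint32_t key)
{
    const std::uint32_t u = (key & 0x80000000u) ? (key & 0x7FFFFFFFu) : ~key;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

On a GPU the key would be fed to an integer atomic min or max and converted back afterwards. That part is omitted here.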

I'm really baffled by the failure to replicate the casting behavior with bit hacking.
The doubt about the scientific workflow here is one thing, but the failure at this low technical level is another.
It should work, but it does not. I have no idea what the reason is anymore.

taby

1,508

Author

May 26, 2024 03:28 PM

sorry man, I was mostly talking to the other posters.

it all works well with boost::multiprecision.

Thanks for the bit twiddling tutorial!

taby

1,508

Author

May 26, 2024 03:56 PM

I’m going to make up a set of orbit parameters for a fictitious planet. I’ll make the eccentricity high compared to Mercury's.

JoeJ

4,258

May 26, 2024 06:01 PM

taby said:
it all works well with boost::multiprecision.

What precision do you use then?

I did some serious research:

const vector_3 grav_dir = sun_pos - pos;
const double distance = grav_dir.length();
const double Rs = 2 * grav_constant * sun_mass / (speed_of_light * speed_of_light);

const double alpha = 2.0 - sqrt(1 - (vel.length() * vel.length()) / (speed_of_light * speed_of_light));

double beta = sqrt(1.0 - Rs / distance);

beta = std::clamp(beta, 0., 1.);

if (1)
{
	// shift = number of low mantissa bits to discard (29 in this first run)
	static int64_t maxDiff = 0;
	int64_t bits = (int64_t&)beta;
	bits = bits & int64_t(~0ull << shift); // bit hack: zero the low bits (truncates)
	//double beta1 = (double&)bits;
	double betaRef = static_cast<float>(beta); // reference: the cast (rounds)
	int64_t bitsRef = (int64_t&)betaRef;
	int64_t diff = abs(bitsRef - bits) >> shift; // difference in units of the last kept bit
	if (diff > maxDiff)
	{
		maxDiff = diff;
		SystemTools::Log("diff %i\n", (int)diff); // cast so the argument matches %i
	}
}

With this debug code, i can see if there is a difference between the casting and the bit hack.
Maybe it happens rarely, explaining the mystery.

But no. After 4 orbits, it only prints 1, so it's just the rounding error in the least significant bit.
Which should not matter, the fuck. Or does it?
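A standalone sketch of why a difference of 1 is expected: masking truncates the mantissa toward zero, while the cast to float rounds to nearest (under the default rounding mode), so the two can disagree by one unit in the last kept bit. The helper names are made up, and the sketch assumes positive values within float's normal range.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit pattern of a double as an unsigned integer.
std::uint64_t to_bits(double value)
{
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Zero the low `shift` mantissa bits of a double (truncation toward zero).
double truncate_mantissa(double value, int shift)
{
    assert(shift >= 0 && shift < 52);
    const std::uint64_t bits = to_bits(value) & (~std::uint64_t{0} << shift);
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Float-cast result minus truncated result, in units of the last kept bit.
// 29 = 52 - 23 is the number of mantissa bits a float lacks compared to a double.
std::int64_t ulp_diff_vs_float(double value)
{
    const int shift = 29;
    const double truncated = truncate_mantissa(value, shift);
    const double rounded = static_cast<float>(value);
    return static_cast<std::int64_t>(to_bits(rounded) >> shift)
         - static_cast<std::int64_t>(to_bits(truncated) >> shift);
}
```

For example, 0.1 rounds up when cast to float, so ulp_diff_vs_float(0.1) should be 1, while exactly representable values such as 0.5 give 0.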

I try again, with a shift of 28 instead of 29, to get this f*@ing bit right as well…

Progress!

Much better angles. Using the bit masked double, killing 28 bits.
(Debug output still just 1 ofc.)

But repeating the cast (with the newly added clamp in the code), it's still better:

Diff should be 42.9, iirc.

Last attempt, 27 bits:

worse.

So as expected, 28 bits gives the closest match.
But it is worse than expected. WHY?

There can be only one answer. The mystery must be about those lesser significant bits! It's the only difference.

Let's reveal the mystery, let's see god's magic:

This must be the explanation we're looking for. Is it zero? Or noisy bits?

Well… obviously we are not meant to understand the universe, taby. I've told you before. We shall not know.

If i run this code, it does not print it. It still prints diff, but it does not print quantumRandomness.

No joke. It does not print it. Which is even more of a mystery than the difference to the casting, no?

I shall stop at this point. God says so, loud and clear. \ o:- /

JoeJ

4,258

May 26, 2024 06:07 PM

… Hehe, it's just that VS takes a while to load million lines of text. : )

So, here is the mystery, explaining all the stuff out there:

meh.

taby

1,508

Author

May 26, 2024 07:02 PM

I thank you again for all of your help, man.

taby

1,508

Author

May 26, 2024 08:41 PM

The initial conditions are:

const MyBig dt = 0.01;

const MyBig speed_of_light = 299792458.0;
const MyBig grav_constant = 6.6743e-11;
const MyBig sun_mass = 1.98847e30;

custom_math::vector_3 sun_pos(0, 0, 0);

const MyBig initial_vel = 38858.47;

custom_math::vector_3 mercury_pos(0, 69817079000.0, 0);
custom_math::vector_3 mercury_vel(-initial_vel, 0, 0);
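To get a feel for the magnitudes, here is a sketch that plugs these constants into the Rs and beta expressions from the debug code earlier in the thread. It uses plain double rather than MyBig, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Schwarzschild radius Rs = 2GM / c^2, as in the Rs expression quoted earlier.
double schwarzschild_radius(double grav_constant, double mass, double speed_of_light)
{
    assert(speed_of_light > 0.0);
    return 2.0 * grav_constant * mass / (speed_of_light * speed_of_light);
}

// beta = sqrt(1 - Rs / distance), valid for distance > Rs.
double beta_at(double Rs, double distance)
{
    assert(distance > Rs);
    return std::sqrt(1.0 - Rs / distance);
}
```

With the values above, Rs comes out near 2953 m, and at the starting distance of 69817079000 m beta is below 1 by roughly 2.1e-8. That is less than half the float spacing just below 1 (2^-25, about 2.98e-8), so casting this particular beta to float should give exactly 1.0f.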

The types are (from boost::multiprecision):

typedef cpp_bin_float_100 MyBig;
typedef cpp_bin_float_24 MySmall;

where

using cpp_bin_float_24 = number<backends::cpp_bin_float<24, backends::digit_base_2, void, std::int16_t, -126, 127>, et_off >;
using cpp_bin_float_100 = number<backends::cpp_bin_float<100, backends::digit_base_2, void, std::int16_t, -126, 127>, et_off >;

...
