Double to float C++

Started by
179 comments, last by taby 1 day, 2 hours ago

Juliean said:

taby said:
If I comment the casting out, I get an answer of 7.75. If the casting remains, I get an answer of 42.66. The analytical solution gives an answer of 42.94.

taby said:
Not a problem. We're modelling the orbit of Mercury.

Are the distances/floating point values at any point in the calculation extremely large? Double→float does not only reduce precision, it also has to clamp values that were somehow out of range of the smaller number. That would be my only remaining guess of what could explain a difference.

Otherwise, you need to get a debugger out, and step through both versions of the code to see a difference. You obviously know what number you expect in a certain case. Record the inputs that you feed into the function, that then produces the different values. Make proceed_EulerDouble and proceed_EulerFloat. Call them with the same parameters, with a debugger attached, in a simplified version of the app. Step through the code and examine the values at each step. This is the only surefire way to diagnose this.

Thank you for the outline of the debugging strategy. I will think about this for a while now.

Advertisement

Thank you all for your input!!!

I tried to paste all your code in to mine, but probably i've missed something about the initial conditions.

Here is output after some frames:

pos is changing, but there seems no difference between casting or not casting. Both is just 1, at least after rounding for display.

(Edit: checking with debugger shows the close to one double value is rounded to exactly one by the cast.)

Posting my ported code in case you spot something quickly. (The code runs once per frame, so i use static to fake global variables and lambdas to fake functions.)

		if (visDebugTaby)
		{
			class vector_3
			{
			public:
				long double x, y, z;

				vector_3(const long double &src_x = 0, const long double &src_y = 0, const long double &src_z = 0)
				{
					x = src_x;
					y = src_y;
					z = src_z;
				};

				bool operator==(const vector_3 &rhs);
				bool operator!=(const vector_3 &rhs);
				void zero(void);
				void rotate_x(const long double &radians);
				void rotate_y(const long double &radians);
				void rotate_z(const long double &radians)
				{
					long double t_x = x;
    
					x = t_x*cos(radians) + y*-sin(radians);
					y = t_x*sin(radians) + y*cos(radians);
				}
				vector_3 operator+(const vector_3 &rhs);
				vector_3 operator-(const vector_3 &rhs)
				{
					return vector_3(x - rhs.x, y - rhs.y, z - rhs.z);
				}
				vector_3 operator*(const vector_3 &rhs);
				vector_3 operator*(const long double &rhs)
				{
					return vector_3(x*rhs, y*rhs, z*rhs);
				}
				vector_3 operator/(const long double &rhs)
				{
					return vector_3(x/rhs, y/rhs, z/rhs);
				}
				vector_3 &operator=(const vector_3 &rhs)
					{
	x = rhs.x;
	y = rhs.y;
	z = rhs.z;
	return *this;
}
				vector_3 &operator+=(const vector_3 &rhs)
				{
					x += rhs.x; y += rhs.y; z += rhs.z;
					return *this;
				}
				vector_3 &operator*=(const vector_3 &rhs);
				vector_3 &operator*=(const long double &rhs);
				vector_3 operator-(void);
				long double length(void) const
				{
					return sqrt(self_dot());
				}
				vector_3 &normalize(void)
				{
					long double len = length();

					if(len != 1)
					{
						x /= len;
						y /= len;
						z /= len;
					}

					return *this;
				}
				long double dot(const vector_3 &rhs) const
				{
					return x*rhs.x + y*rhs.y + z*rhs.z;
				}
				long double self_dot(void) const
				{
					return x*x + y*y + z*z;
				}
				vector_3 cross(const vector_3 &rhs) const;
			};


			const long double speed_of_light = 299792458;
			const long double grav_constant = 6.6743e-11;
			const long double sun_mass = 1.98847e30;
			const long double pi = 4.0 * atan(1.0);

			static vector_3 sun_pos(0, 0, 0);
			static vector_3 mercury_pos(0, 69817079000.0, 0);
			static vector_3 mercury_vel(-38860, 0, 0); // -38860
			static bool decreasing = true;
			static size_t orbit_count = 0;
			static vector_3 previous_dir(0, 1, 0);
			static long double delta = 6 * pi * grav_constant * sun_mass / (speed_of_light*speed_of_light * (1 - 0.2056*0.2056) * 57.909e9);
			static long double total = 0;
			static long unsigned int frame_count = 0;

			auto grav_acceleration = [&](const vector_3& pos, const vector_3& vel, const long double G)
			{
				vector_3 grav_dir = sun_pos - pos;

				const double distance = grav_dir.length();
				grav_dir.normalize();

				vector_3 accel = grav_dir * G * sun_mass / (distance * distance);

				return accel;
			};

			auto proceed_Euler = [&](vector_3& pos, vector_3& vel, const long double G, const long double dt)
			{
				ImGui::Text("pos (%f, %f, %f)", pos.x, pos.y, pos.z);
				ImGui::Text("vel (%f, %f, %f)", vel.x, vel.y, vel.z);

				const vector_3 grav_dir = sun_pos - pos;
				const double distance = grav_dir.length();
				const double Rs = 2 * grav_constant * sun_mass / (speed_of_light * speed_of_light);

				const double alpha = 2.0 - sqrt(1 - (vel.length() * vel.length()) / (speed_of_light * speed_of_light));

				double beta = sqrt(1.0 - Rs / distance);

				//beta = precision2(beta, 3);

				ImGui::Text("Rs %f  distance %f", Rs, distance);
				ImGui::Text("Rs / distance %f", Rs / distance);

				ImGui::Text("beta0 %f", beta);

				beta = static_cast<float>(beta);

				ImGui::Text("beta1 %f", beta);

				//cout << precision2(beta, 9) << " " << beta << endl;



				vector_3 accel = grav_acceleration(pos, vel, G);// *(1.0 / beta);

				vel += accel * dt * alpha;
				pos += vel * dt * beta;
			};


			frame_count++;

			const long double dt = 0.01;// 5e-6 * (speed_of_light / mercury_vel.length());

			vector_3 last_pos = mercury_pos;

			proceed_Euler(mercury_pos, mercury_vel, grav_constant, dt);


			/// this doesn't work all of the time... change it by keeping track of previous two positions


			if (decreasing)
			{
				if (mercury_pos.length() > last_pos.length())
				{
					// hit perihelion
					ImGui::Text("hit perihelion");
					decreasing = false;
					//return;
				}
			}
			else
			{
				if (mercury_pos.length() < last_pos.length())
				{
					// hit aphelion
					ImGui::Text("hit aphelion");

					orbit_count++;

					vector_3 current_dir = last_pos;
					current_dir.normalize();

					const long double d = current_dir.dot(previous_dir);

					const long double angle = acos(d);
					previous_dir = current_dir;

					vector_3 temp_pos = last_pos;
					temp_pos.rotate_z(-total);

					if (temp_pos.x < 0)
						total += angle;
					else
						total -= angle;

					const long double avg = total / orbit_count;

					static const long double num_orbits_per_earth_century = 365.0 / 88.0 * 100;
					static const long double to_arcseconds = 1.0 / (pi / (180.0 * 3600.0));

					ImGui::Text("orbit %f", orbit_count);
					ImGui::Text("dot   %f", d);
					ImGui::Text("total %f", total * num_orbits_per_earth_century * to_arcseconds);
					ImGui::Text("angle %f", angle * num_orbits_per_earth_century * to_arcseconds);
					ImGui::Text("delta %f", delta * num_orbits_per_earth_century * to_arcseconds);
					ImGui::Text("avg   %f", avg   * num_orbits_per_earth_century * to_arcseconds);

					//positions.clear();

					decreasing = true;
				}
			}
		}

I believe other people recommended this to you already, but you really should read it or re-read it: What Every Computer Scientist Should Know About Floating Point Numbers.

Floating point numbers operate within a precision. Floats give about 6 decimal digits of precision. Double gives up to 16 decimal digits of precision. Many operations will catastrophically cancel numbers when you mix big numbers with small numbers, effectively they round away to oblivion or vanish into numerical noise or other numeric errors.

When you start mixing big numbers with small numbers anything below the precision is ignored, as described in depth in that document. Normally floating point exceptions are disabled but you can dig through your compiler documentation on how to enable them. For me trying your code, enabling floating point exceptions I get errors right away about catastrophic cancelations in self_dot() called by grav_dir.length(), followed by multiple floating point exceptions in Rs = 2 * grav_constant * sun_mass / (speed_of_light * speed_of_light). I'm not going to step through the code, but it looks like you're mixing huge and tiny numbers all the time in ways likely to cause floating point math errors.

Simply, the floating point operations in C++ are not designed to work like that. There are specialized math libraries out there that can handle the math, but float and double won't do it.

Yeah, @joej, the only difference is that I’m casting as float the result of the sqrt function. Essentially, we are further quantizing the gravitational time dilation. What steps are taken during that further quantization are what I’m looking for, basically.

I’m sure it’ll have something to do with use of the std::numeric_limits

I used long doubles with Ubuntu, where they are twice as large as in Windows / Visual C++, and it gives practically the same answers as when using doubles. I don’t need a library.

I made it clesr from the start that yes, this is about precision. However, it’s not just precision, because there is definitely more code used during that cast… otherwise simply adjusting the precision would work (it does not).

Juliean said:
Double→float does not only reduce precision, it also has to clamp values that were somehow out of range of the smaller number.

Clamping could explain the big difference on the values given as example.
But if i convert a double value larger than FLT_MAX, the float result isn't clamped to that but actually infinity.

However, idk what long double does with Linux.

I made a post on Stack Overflow. Hopefully the trolls don't downvote it and get my ass banned…

https://stackoverflow.com/questions/78481482/in-c-what-is-the-standard-behaviour-of-static-castfloatsome-double-variabl

taby said:
I made a post on Stack Overflow.

They can only guess and point out the usual related topics.

But you say you get the same results on Windows too, so you could provide a minimal code example including numbers to reproduce the issue?
That's what you should do. Then we could see why exactly there is a large difference on results for your specific case.

Advertisement