TwoNybble

OpenGL GLSL Double Emulation Inconsistency Across Drivers


I have been working on a space-scale game that makes use of double emulation on the GPU to compute positions relative to the camera (RTE). This has worked beautifully on both my Debian (Linux) Laptop with an Intel HD 4000, and on my Windows PC (AMD R9 390). However, I recently upgraded my Debian distro, which I assume also installed a newer version of the Intel graphics drivers and Mesa (13.0.2). After this upgrade, my planet rendering lost all its precision and now the surface is like a "staircase" instead of the nice smooth land I had before. To confirm that this is a precision issue, I truncated the floating point operations on the working Windows build to only 32 bit, and it exhibited the same "staircase" surface from lack of precision.

 

I'm having difficulty figuring out where these precision issues are coming from. I am using identical code with the float type in GLSL, and both implementations report a floating point precision of p=23 when I query the GLSL vertex shader precision from OpenGL. At first I also suspected an optimization problem, since double emulation relies on operations that could be simplified mathematically but must be executed separately because of floating point rounding. However, when I set MESA_GLSL=nopt, the environment variable setting that is meant to disable shader optimization, I still see the same effect.

 

Finally, to get to the root of the problem, I began dissecting each individual calculation on the GPU and writing them to a 32 bit floating point texture to read back on the CPU. I implemented the same function on the CPU side, which works as I expect. However, the results on the GPU side do not line up.

 

GLSL:

vec2 ds_add(vec2 a, vec2 b) {
	float t1 = a.x + b.x;
	float e = t1 - a.x;
	float t2 = ((b.x - e) + (a.x - (t1 - e))) + a.y + b.y;
	//the above is the standard DSFUN90 Knuth algorithm.

	vec2 ret;
	ret.x = t1 + t2;
	ret.y = t2 - (ret.x - t1);
	return ret;
}

Sample Parts:

 

b.x - e = 0

t1 - e = 1.2038513422012329
a.x - (t1 - e) = 0

 

ret.x = 4.3432750701904297

ret.y = 0

 

CPU:

glm::vec2 ds_add(glm::vec2 a, glm::vec2 b) {
	float t1 = a.x + b.x;
	float e = t1 - a.x;
	float t2 = ((b.x - e) + (a.x - (t1 - e))) + a.y + b.y;

	glm::vec2 ret;
	ret.x = t1 + t2;
	ret.y = t2 - (ret.x - t1);
	return ret;
}

Sample Parts:

 

b.x - e = 0

t1 - e = 1.2038512229919434 (diverges from GPU after 7 decimal places)
a.x - (t1 - e) = 1.1920928955078125e-07 (the small bit I expect to see on the GPU but only shows up on the CPU implementation)

 

ret.x = 4.3432750701904297

ret.y = 9.1924000855669874e-08 (missing from GPU result)

 

From the above sample output, the GPU is somehow not lining up with the CPU results. The inputs to this function are split identically on the CPU side, so the problem can't be there. But for completeness, the inputs for this test were:

 

double d1 = 1.20385132193021958120313214;

double d2 = 3.13942384018421752835103287;

 

Those are the two numbers being summed. The CPU implementation above is very close to the pure double addition (to within 10^-14), while the GPU implementation is only within 10^-7, no better than a pure float implementation. What is going on? Why are the newer drivers or Mesa breaking my program's past behavior?


My guess is that the double precision types are being demoted to simply floats, but as to why, I don't know.

 

This probably doesn't directly answer your question, but I have used 64 bit fixed point math on GPUs before.  If used sparingly, the operations aren't too bad performance-wise, and addition/subtraction and multiplication are quite easy to implement.
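As a rough illustration of why fixed point addition and subtraction are simple, here is a minimal C++ sketch that holds a 64 bit value as two 32 bit words, the way a uvec2 might be used in a shader. The representation and all names (U64Parts, to_parts, from_parts, fx_add, fx_sub) are assumptions made for this example, not the implementation referred to above.

```cpp
#include <cstdint>

// Hypothetical 64 bit value stored as two 32 bit words (like a GLSL uvec2).
struct U64Parts {
    std::uint32_t lo;  // low 32 bits
    std::uint32_t hi;  // high 32 bits
};

// Helpers used only to check the sketch against native 64 bit integers.
U64Parts to_parts(std::uint64_t v) {
    return {static_cast<std::uint32_t>(v & 0xFFFFFFFFu),
            static_cast<std::uint32_t>(v >> 32)};
}

std::uint64_t from_parts(U64Parts p) {
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}

// Addition: add the low words, then propagate the carry into the high words.
U64Parts fx_add(U64Parts a, U64Parts b) {
    U64Parts r;
    r.lo = a.lo + b.lo;                              // wraps modulo 2^32
    std::uint32_t carry = (r.lo < a.lo) ? 1u : 0u;   // wrapped => carry out
    r.hi = a.hi + b.hi + carry;
    return r;
}

// Subtraction: subtract the low words, then propagate the borrow.
U64Parts fx_sub(U64Parts a, U64Parts b) {
    U64Parts r;
    r.lo = a.lo - b.lo;                              // wraps modulo 2^32
    std::uint32_t borrow = (a.lo < b.lo) ? 1u : 0u;
    r.hi = a.hi - b.hi - borrow;
    return r;
}
```

Because the arithmetic is pure integer math, the result does not depend on how a compiler reorders floating point expressions. With two's complement values the same add and subtract also work for signed fixed point numbers.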


Unfortunately, there are a fair number of double emulation calls, so I think I'd like to stick with them if I can get them working consistently. I plan on making use of hardware doubles on GPUs on which they perform well.

 

The doubles are not actually used on the GPU, or at all after they are split on the CPU side. The inputs to both the CPU and GPU ds_add function (vec2 a, vec2 b) are simply a vec2 that represents the "high" and "low" parts of the original double-type variable. The splitting function looks like this:

void ds_split(double d, float& hi, float& lo) {
    hi = (float) d;
    lo = (float) (d - hi);
}

However this is performed on the CPU for both the CPU and GPU tests, and so the inputs to each are as expected. For example, the double d1 is decomposed into a hi and lo piece that is used to create the vec2 a:

double d1 = 1.20385132193021958120313214;
float hi, lo;
ds_split(d1, hi, lo);
glm::vec2 a(hi, lo);

In the above example:

 

a.x = 1.2038513422012329

a.y = -2.0271013312367359e-08

 

So yes, the doubles are being transformed into 2 floats, but I don't believe this is where the problem lies. Thanks for the response!
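For anyone who wants to reproduce the CPU side, here is a minimal self-contained sketch of the split, add, and recombine steps. Float2 is a hypothetical stand-in for glm::vec2, and ds_join, ds_add_error, and float_add_error are illustrative helpers that do not appear in the code above. It assumes strict single precision evaluation, so it should not be built with options such as -ffast-math.

```cpp
#include <cmath>

// Hypothetical stand-in for glm::vec2: x is the high part, y the low part.
struct Float2 {
    float x;
    float y;
};

// Split a double into a float and the float-rounded remainder.
Float2 ds_split(double d) {
    float hi = static_cast<float>(d);
    float lo = static_cast<float>(d - static_cast<double>(hi));
    return {hi, lo};
}

// Same operation order as the ds_add posted in this thread.
Float2 ds_add(Float2 a, Float2 b) {
    float t1 = a.x + b.x;
    float e = t1 - a.x;
    float t2 = ((b.x - e) + (a.x - (t1 - e))) + a.y + b.y;

    Float2 ret;
    ret.x = t1 + t2;
    ret.y = t2 - (ret.x - t1);
    return ret;
}

// Recombine the two parts in double precision for comparison only.
double ds_join(Float2 v) {
    return static_cast<double>(v.x) + static_cast<double>(v.y);
}

// Absolute error of the emulated sum against a native double sum.
double ds_add_error(double d1, double d2) {
    return std::fabs(ds_join(ds_add(ds_split(d1), ds_split(d2))) - (d1 + d2));
}

// Absolute error of a plain float sum against a native double sum.
double float_add_error(double d1, double d2) {
    float s = static_cast<float>(d1) + static_cast<float>(d2);
    return std::fabs(static_cast<double>(s) - (d1 + d2));
}
```

Calling ds_add_error and float_add_error with the two test inputs d1 and d2 gives the two error magnitudes being compared in this thread: the emulated sum should land far closer to the double result than the plain float sum does.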

Edited by TwoNybble


I see what you're doing.  https://www.thasler.com/blog/blog/glsl-part2-emu did basically the same thing, but for him it seemed to work.
 
What puzzles me are the lines:
GLSL: t1 - e = 1.2038513422012329
CPU: t1 - e = 1.2038512229919434 (diverges from GPU after 7 decimal places)
If this is being done in single precision, you only get about 7 decimal digits of precision (23 bits with an implied 24th, regardless of the exponent).  That output has far too many digits to be a single precision number, so are you converting to double somewhere in your debug/output code?
 
Also the line:
To confirm that this is a precision issue, I truncated the floating point operations on the working Windows build to only 32 bit, and it exhibited the same "staircase" surface from lack of precision.
 
I'm probably misunderstanding you somewhere; but from the code presented you are emulating doubles not directly using them.  So when you state that you truncated the doubles, it confuses me, because I don't see any doubles.  All the code you presented (apart from the inputs to the ds_split function) is single precision.  Were you running the entire simulation using double precision?  What I mean is were you emulating doubles using doubles?  Is there some version of ds_add in your code that looks like dvec2 ds_add(dvec2 a, dvec2 b)?  Or do you have two versions of the code, one that is all doubles (no emulation) and one that is emulated?

 

My understanding is that emulating doubles using singles has significantly less precision than using doubles directly.  It may be that your emulated doubles, while giving you more precision than singles, are still not giving you enough precision to hide the staircase effect.

Edited by Ryan_001


I see what you're doing.  https://www.thasler.com/blog/blog/glsl-part2-emu did basically the same thing, but for him it seemed to work.

 

Yes, this was one of the sources I used for the double emulation in GLSL. And it did/does work for me as well. As I stated it worked on Debian Jessie with Intel HD 4000 graphics, and on my current Windows 7 computer with an AMD R9 390. However, it doesn't work on certain configurations (which I'm not totally sure of) such as Debian Testing (Linux kernel 4) on the same exact laptop, and a Dell Windows lab computer that I will need to look up the hardware details of.

 

 

What puzzles me are the lines:
GLSL: t1 - e = 1.2038513422012329
CPU: t1 - e = 1.2038512229919434 (diverges from GPU after 7 decimal places)
If this is being done in single precision, you only get about 7 decimal digits of precision (23 bits with an implied 24th, regardless of the exponent).  That output has far too many digits to be a single precision number, so are you converting to double somewhere in your debug/output code?

 

Yes, you are correct here. This is a mistake on my part: I had called std::setprecision(17) earlier in the code to look at some doubles. These are indeed both single precision numbers, and any digits past roughly the 7th are not meaningful in this case. Rest assured that both calculations are indeed single precision as I stated.
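One way to avoid this kind of confusion in debug output is to print floats with only as many digits as are needed to round-trip them. A minimal sketch, assuming IEEE single precision floats (where std::numeric_limits<float>::max_digits10 is 9); format_float is a hypothetical helper name:

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Format a float with just enough significant digits to read it back exactly.
// A local stream is used so the precision setting does not leak into other output.
std::string format_float(float v) {
    std::ostringstream out;
    out << std::setprecision(std::numeric_limits<float>::max_digits10) << v;
    return out.str();
}
```

Printed this way, the two t1 - e values from the first post still differ in their last digits, which suggests they are adjacent single precision values rather than the same number with printing noise.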

 

However, the other calculation (the third sample in which the GPU version is just 0) shows that the CPU variant (which is totally single precision floats) and the GPU variant (also single precision double emulation) do not agree despite doing the same thing. I realize that float may not conform to the IEEE standard on the GPU, but I should still have 23 bits (24 implied) of precision. It is also strange to me that this behavior has changed on the same exact GPU, with different drivers or configuration. I suppose it's possible they changed their float implementation but this seems odd.

 

 

I'm probably misunderstanding you somewhere; but from the code presented you are emulating doubles not directly using them.  So when you state that you truncated the doubles, it confuses me, because I don't see any doubles.

 

I'm afraid I've explained this quite poorly. Yes, I am indeed emulating doubles, and nowhere in the double emulation am I using doubles as you mention (except for splitting on the CPU side). I do a double sum of the numbers for comparison purposes (not shown), but the two implementations I showed above are indeed single precision double emulation. The CPU double emulation comes within two decimal digits of the "correct" double sum. The GPU double emulation is no more accurate than if I had cast them to floats.

 

Truncate was a poor choice of word on my part. I meant that I simply used single precision types instead of my double emulation in the GLSL on my Windows build to see what the effects of low precision are on the working simulation.

 

To summarize:

                    Windows 7 (AMD R9 390)     Debian Jessie (Intel HD 4000)    Debian Testing (Intel HD 4000)
Double Emulation       Correct Behavior,          Correct Behavior,                Incorrect Behavior,
(using single)         nearly "real" double       nearly "real" double             precision no better
                       precision.                 precision.                       than just using floats.

 

My understanding is that emulating doubles using singles has significantly less precision than using doubles directly.  It may be that your emulated doubles, while giving you more precision than singles, are still not giving you enough precision to hide the staircase effect.

 

You are correct that double emulation does not quite give the same precision as real doubles. However, my two working builds have been within 2 decimal digits, which is more than enough for my purposes. I know for a fact that when the double emulation is working correctly, there is enough precision and there is no staircase effect.

 

I am stumped as to why two seemingly identical implementations (GPU/CPU) are giving such significantly different results.

Edited by TwoNybble


The issue is probably excessive optimization in the driver's GLSL compiler. OpenGL doesn't specify floating point operations as stringently as IEEE 754 does, so the compiler might be assuming associative behavior of floats, and in that case this:

ret.x = t1 + t2;
ret.y = t2 - (ret.x - t1);

can simplify to this:

ret.x = t1 + t2;
ret.y = 0;

Ergo, no extra precision. AFAIK there is nothing that you can do about it. This StackOverflow question covers a similar situation and provides a couple of workarounds (no guarantees that they work for all drivers/compilers): https://stackoverflow.com/questions/35497160/glsl-compiler-optimizations-lead-to-incorrect-behavior-with-floating-point-opera

 

I doubt the MESA_GLSL=nopt environment option affects hardware accelerated drivers in any way. I'd be very surprised if it did.
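To make the suspected simplification concrete, here is a CPU-side sketch that places the intended evaluation next to what a compiler treating float math as real-number algebra could reduce it to. ds_add_reassociated is a hypothetical hand-written reduction for illustration, not the output of any actual driver, and Float2 is a stand-in for vec2. Note that the GPU value posted for t1 - e equals a.x, which is what substituting e = t1 - a.x algebraically would give.

```cpp
// Hypothetical stand-in for vec2: x is the high part, y the low part.
struct Float2 {
    float x;
    float y;
};

// Intended evaluation: every rounding step is kept in the written order.
Float2 ds_add(Float2 a, Float2 b) {
    float t1 = a.x + b.x;
    float e = t1 - a.x;
    float t2 = ((b.x - e) + (a.x - (t1 - e))) + a.y + b.y;

    Float2 ret;
    ret.x = t1 + t2;
    ret.y = t2 - (ret.x - t1);
    return ret;
}

// What algebraic simplification could leave behind (illustration only).
Float2 ds_add_reassociated(Float2 a, Float2 b) {
    float t1 = a.x + b.x;
    // With e = t1 - a.x, (t1 - e) reduces to a.x, so (a.x - (t1 - e)) is 0,
    // and (b.x - e) reduces to (a.x + b.x) - t1, which is also 0 in real
    // arithmetic. Only the low parts remain in t2.
    float t2 = a.y + b.y;

    Float2 ret;
    ret.x = t1 + t2;
    ret.y = 0.0f;  // t2 - ((t1 + t2) - t1) reduces to 0
    return ret;
}
```

Feeding both functions the split test inputs from the first post shows the difference: the intended version keeps a nonzero low part, while the reduced version always returns 0 there, so the result carries no more information than a single float.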

Edited by l0calh05t


Thanks for the reply. I suspected that it may be optimization, but I had hoped that there would be more control over this behavior. I suppose until a better solution is found, I will have to make use of one of these workarounds. It's unfortunate that OpenGL does not specify optimization and other "standard" client-side features in GLSL compilers, but I guess I understand why they chose not to.
