Procedurally-regenerating identical content on different machines (floating point determinism)


I've been designing the architecture for a distributed procedural content simulation engine for some time now, and it has recently become apparent that floating point determinism is going to be a thorn in my side when it comes to enabling distributed setups. For those of you with experience in (or reasonable knowledge of) floating point determinism issues, I'd really like to know how you'd approach this scenario:

 

Imagine you have two machines. Both can reasonably be assumed to use modern, common hardware and operating systems (PowerPCs, IBM mainframes and other gotchas are not relevant here). In other words, the two machines are either commodity x64 boxes on a server rack, or the relatively modern gaming machines of two friends who are sharing the workload I'll describe. The operating systems can be constrained to the standard three (Windows/OSX/Linux).

 

Anyway, imagine we have two machines, and one of them is going to procedurally generate some content based on known inputs. Let's say the content is simple Perlin-noise-based terrain. Taking this a step further, let's say we procedurally generate a set of physical objects (for brevity, perhaps they're simple wooden cubes of varying sizes and weights). We then use physics algorithms to scatter them around the terrain and allow them to settle based on physical characteristics such as gravity and the shape of the terrain.

 

Here's the catch:

  1. We don't know in advance which of the two machines will do the work.
  2. The machine that generates the content will not be able to guarantee that the other machine is available at the time that the work needs to be performed.
  3. The other machine will, at some later time, have to generate exactly the same outputs under the same conditions.
  4. We don't want to store the overall result as there will likely be too much data.
  5. It'd be nice to be able to offload some of the compute work (where relevant and reasonable) to the GPU (whatever is available on the machine in question).

Some places that I am willing to compromise:

  1. Reducing floating point precision
  2. Storing a limited subset of source data, as long as bulkier derivative data can be regenerated as needed
  3. Using large integers instead of floats
  4. Using fixed point calculations

Some additional questions:

  1. Is it possible to write integer-only versions of the popular coherent noise algorithms? (perlin/simplex/brownian/etc.)
  2. Can I get away with forcing IEEE754 floating point accuracy or will that compromise speed too much?
  3. Are 64-bit integers a thing on GPUs yet, or likely to become a thing any time soon?

As an addendum to this, I'm considering the possibility that this problem doesn't really have a good solution at this point in time, and that perhaps I need to be willing to simply store a lot of data and ensure that it is available when a machine needs to generate new content.


Thanks for the reply. What you've said sounds very much like what I've been anticipating will be a problem, particularly given that my design aims to facilitate a significant number of community-driven add-ons.

 



The safest route is to use a software-based numerics library that guarantees deterministic results and does not rely on CPU-implemented functionality. Such a library can potentially even be faster if the optimizer is allowed to work on it, whereas using the processor's floating point requires many pessimizations and constant validation that the floating-point environment is still in the proper state.

 

Any you'd recommend?


Sometimes you can get away with dropping the requirement for bit-for-bit identical results in the later stages of generation, as long as the early stages are perfectly identical. Depending on the particular kind of procedural generation you're doing, this can make things much easier. Perhaps most of the numerical instability exists during the early/mid stages of generation, so you can deal with that using fixed point math, while a bunch of the heavy number crunching happens during the later stages, where you can switch to floating point and perhaps offload some of the work to the GPU. Even if different GPUs produce different results, if the late-stage algorithms are numerically stable, they might still be close enough to work.
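
To make that split concrete, here's a rough, untested sketch (the function names and constants are arbitrary, and this is only one way to carve it up): the canonical base pass uses nothing but integer math, so it is bit-identical everywhere, while the cosmetic pass is free to use floats because small cross-machine differences only affect appearance.

#include <cmath>
#include <cstdint>

// Canonical pass: integer-only, so the result is bit-identical on every machine.
// The hash mixing constants below are arbitrary.
int32_t canonicalBaseHeight(uint32_t worldSeed, int32_t x, int32_t z)
{
    uint32_t h = worldSeed;
    h ^= static_cast<uint32_t>(x) * 0x9E3779B9u;
    h ^= static_cast<uint32_t>(z) * 0x85EBCA6Bu;
    h ^= h >> 16;
    h *= 0x7FEB352Du;
    h ^= h >> 15;
    return static_cast<int32_t>(h & 0xFFFF);      // height in 16.16 fixed point, range [0, 1)
}

// Cosmetic pass: floating point is acceptable here, because small differences
// between machines only change how the terrain looks, not the canonical state.
float cosmeticDetail(int32_t baseHeightFixed, float x, float z)
{
    float base = baseHeightFixed / 65536.0f;
    return base + 0.01f * std::sin(x * 0.37f) * std::cos(z * 0.23f);
}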


This is something to consider. I have taken some time to peek through Unreal's code to see what they did on a number of things. There is an item that may well be deprecated by now, but the comments describe what it was used for.

The guys over at Epic actually created a custom floating point class for both 16 and 32 bits. Their reasoning, noted in the comments, was that floating point results were not always the same across systems.

The same was done for StarCraft, as mentioned in one of their GDC talks about AI; they use a mix of floating point arithmetic and integers to do the computation.

Edited by Tangletail


Thanks for the replies.

 


STREFLOP is worth a look.
 
Though I don't see any particular barrier to designing new noise functions using integer math, and it's certainly possible to write them with a fixed-point math library.

 

Cheers, I'll take a look at that, and will investigate creating modified noise algorithms.
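
For what it's worth, a value-noise style function can be built from nothing but integer operations, which sidesteps the determinism question entirely for that stage. Below is a rough, untested sketch in 16.16 fixed point; the hash constants are arbitrary and this isn't a faithful Perlin implementation, just an existence proof of the approach.

#include <cstdint>

// Integer hash of a lattice point; deterministic on any conforming implementation.
static uint32_t hash2(uint32_t seed, int32_t xi, int32_t zi)
{
    uint32_t h = seed ^ (static_cast<uint32_t>(xi) * 0x9E3779B1u)
                      ^ (static_cast<uint32_t>(zi) * 0x85EBCA77u);
    h ^= h >> 15; h *= 0x2C1B3C6Du;
    h ^= h >> 12; h *= 0x297A2D39u;
    h ^= h >> 15;
    return h;
}

// Linear interpolation in 16.16 fixed point: t ranges 0..65536.
static int32_t lerpFixed(int32_t a, int32_t b, int32_t t)
{
    return a + static_cast<int32_t>((static_cast<int64_t>(b - a) * t) >> 16);
}

// Returns value noise in 16.16 fixed point, roughly 0..65536. x and z are 16.16 coordinates.
int32_t valueNoiseFixed(uint32_t seed, int32_t x, int32_t z)
{
    int32_t xi = x >> 16, zi = z >> 16;          // integer lattice cell
    int32_t tx = x & 0xFFFF, tz = z & 0xFFFF;    // fractional part, 0..65535

    // Smoothstep-style fade, still in fixed point: t*t*(3 - 2t) rescaled to 16.16.
    auto fade = [](int64_t t) -> int32_t {
        return static_cast<int32_t>((t * t * ((3 << 16) - 2 * t)) >> 32);
    };
    tx = fade(tx); tz = fade(tz);

    int32_t c00 = static_cast<int32_t>(hash2(seed, xi,     zi)     & 0xFFFF);
    int32_t c10 = static_cast<int32_t>(hash2(seed, xi + 1, zi)     & 0xFFFF);
    int32_t c01 = static_cast<int32_t>(hash2(seed, xi,     zi + 1) & 0xFFFF);
    int32_t c11 = static_cast<int32_t>(hash2(seed, xi + 1, zi + 1) & 0xFFFF);

    return lerpFixed(lerpFixed(c00, c10, tx), lerpFixed(c01, c11, tx), tz);
}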

 


Have you considered using fixed point numbers rather than floating point?

 

Yep. I don't have a sense of how that would affect performance though. I think I'm going to have to simply experiment with this and come to my own conclusions.

 


Sometimes you can get away with dropping the requirement for bit-for-bit identical results in the later stages of generation, as long as the early stages are perfectly identical. Depending on the particular kind of procedural generation you're doing, this can make things much easier. Perhaps most of the numerical instability exists during the early/mid stages of generation, so you can deal with that using fixed point math, while a bunch of the heavy number crunching happens during the later stages, where you can switch to floating point and perhaps offload some of the work to the GPU. Even if different GPUs produce different results, if the late-stage algorithms are numerically stable, they might still be close enough to work.

 

Yeah, I'd considered that, and it's certainly applicable to projecting source data into client-side game assets. The problem is that the "server" component must be able to canonically represent all game state, including the results of positional changes due to physics. Because those tend to involve compounded floating-point operations, I have a feeling they're going to be a problem for me. Perhaps there's an opportunity to get creative with physics code...
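
One way to get creative there: if the settle pass runs at a fixed timestep entirely in fixed point, the compounded updates stay bit-identical across machines. A rough, untested sketch (names and constants are arbitrary; gravity only, no cube-to-cube collisions):

#include <cstdint>

// One body settling under gravity, all state in 16.16 fixed point.
struct BodyFixed {
    int32_t y;    // height above terrain, 16.16
    int32_t vy;   // vertical velocity, 16.16
};

// One fixed-timestep update; gravity and dt are fixed-point constants, so the
// compounded arithmetic is identical on every machine given the same inputs.
void settleStep(BodyFixed& b, int32_t groundY)
{
    const int32_t kGravity = -(10 << 16);      // -10.0 in 16.16
    const int32_t kDt      = (1 << 16) / 60;   // 1/60 s in 16.16

    // v += g * dt  (64-bit intermediate to avoid overflow)
    b.vy += static_cast<int32_t>((static_cast<int64_t>(kGravity) * kDt) >> 16);
    // y += v * dt
    b.y  += static_cast<int32_t>((static_cast<int64_t>(b.vy) * kDt) >> 16);

    if (b.y < groundY) {   // clamp to the terrain and stop
        b.y  = groundY;
        b.vy = 0;
    }
}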

 


The guys over at Epic actually created a custom floating point class for both 16 and 32 bits. Their reasoning, noted in the comments, was that floating point results were not always the same across systems.

 

Sounds interesting, I'll look further into that, thanks.

Interesting fact: Microsoft compilers will automatically use SSE on x64 builds and will not emit FPU instructions. This is something I've exploited in the past to great benefit, because SSE determinism is actually a lot easier and less fussy than x87 determinism.

NB this does not work on Linux or (I *think*) OSX.



Interesting fact: Microsoft compilers will automatically use SSE on x64 builds and will not emit FPU instructions.

I can see how that would work for simple arithmetic and logic operations, but not so much on other operations.

 

Operations like exponents, logarithms, trig functions, and a few others are not available in SSE form but are implemented as intrinsics or direct FPU calls by most compilers, including Microsoft's.  Those also tend to be operations with high variation (still within tolerance but bit-for-bit differences) between chips.


I know this problem from replays and client-side extrapolation. For a long time I was seeking absolute precision. These days I occasionally program calculations with currency, which have to be precise to the cent. Anyway, for level generation I would now try a hybrid approach, where each level is generated completely on the server once. I would propagate the error tolerance, and if it gets too large, store the more or less random result of the server calculation in the level map. This is then combined with hints from the artists. Later, the file should be able to be unpacked on a mobile device without draining the battery (I am trying to justify the complete level generation). I am planning to do this for a simulation whose results will be used in the real world. I think games and CAM should share code. Premature optimization (here: sloppy but fast code) is the root of all evil.


Interesting fact: Microsoft compilers will automatically use SSE on x64 builds and will not emit FPU instructions.

I can see how that would work for simple arithmetic and logic operations, but not so much on other operations.
 
Operations like exponents, logarithms, trig functions, and a few others are not available in SSE form but are implemented as intrinsics or direct FPU calls by most compilers, including Microsoft's.  Those also tend to be operations with high variation (still within tolerance but bit-for-bit differences) between chips.



I was suspicious of this, so I pulled up some available code in VS2013 and poked around.

Stepping through the disassembly of cosf() (implemented in f:\dd\vctools\crt\fpw32\tran\amd64\cosf.asm but not provided with VS itself, so I'm going on my assembly skills alone here), it looks like there is a purely SSE2 implementation. It isn't using a hardware instruction for the actual transcendental function; it seems to be implementing the algorithm in software.

Sadly I don't have time to totally reverse engineer the math libraries for VS, but it seems like their claim of not emitting FPU instructions is accurate.


I was suspicious of this, so I pulled up some available code in VS2013 and poked around.

Stepping through the disassembly of cosf() (implemented in f:\dd\vctools\crt\fpw32\tran\amd64\cosf.asm but not provided with VS itself, so I'm going on my assembly skills alone here), it looks like there is a purely SSE2 implementation. It isn't using a hardware instruction for the actual transcendental function; it seems to be implementing the algorithm in software.

 

Interesting.  Seems I was wrong and you were right.

 

I was basing my comments on MSDN's description of what it does.  I didn't use an actual disassembly.  That's what I get for trusting the documentation.

 

Stepping through the code, I cannot reproduce what MSDN documentation for VS2015 says it should be doing.

 

 

The floating-point functions listed below do not have true intrinsic forms. Instead they have versions that pass arguments directly to the floating-point chip rather than pushing them onto the program stack:

acos, cosh, pow, tanh, asin, fmod, sinh

The floating-point functions listed below have true intrinsic forms when you specify /Oi, /Og, and /fp:fast (or any option that includes /Og: /Ox, /O1, and /O2):

atan, exp, log10, sqrt, atan2, log, sin, tan, cos

 

So just for fun, I put in a small function to call a few of them and print the results.  I made sure the flags matched those MSDN said were required.

 

You mentioned cosf() and x64.

 

Breaking into the disassembly of cosf() in a debug build, it puts the value into a SIMD register (xmm0) and calls a function (__imp_cosf) that does a bunch of SIMD work, mostly fused multiply/adds. The release build uses the same register and calls a different function (cosf) that seems to be based more around shuffles and subtractions. I don't know exactly what work it is doing, but it isn't calling the FPU instruction.

 

 
I then turned on all the optimization flags MSDN said were necessary and went through a few items from the list above (cos, exp, tan, atan); all of them called functions that appeared to be implemented with SIMD, packing the value into xmm0 and calling a library function. Only sqrt() performed the math operation directly, packing the value into xmm0 and issuing the sqrtpd instruction.

 

Switching from x64 to x86 and setting all the optimization flags indicated by MSDN, it still loads the value into xmm0 and calls various functions (___libm_sse2_cos, ___libm_sse2_exp, ___libm_sse2_tan, ___libm_sse2_atan); only sqrt() was implemented as an intrinsic call to sqrtpd.

 

 

 

I tried a few other compiler option variations (/O1, /Og, /GL, etc.)  and could not get the functions to generate the FPU intrinsics or call a function that sends it directly to the FPU, as the documentation describes, on either x86 or x64 release builds.

 

So it looks like you're right, and the documentation is wrong about all the functions except sqrt(). I cannot get them to behave as described, either by using "versions that pass arguments directly to the floating point chip" or by having "true intrinsic forms". Except for sqrt(), they all seem to call slower library functions.

Edited by frob



The problem is that the floating point standards for many operations offer guarantees of precision and accuracy. Results need to be close enough to the target, and even the manufacturers (Intel and AMD) are explicit that the microcode for some operations changes between chips, giving results that meet the precision and accuracy standard but are not bit-for-bit identical.

 

I'll add to this that the same is also true of math functions in the standard library. Implementations of the math functions can vary between vendors and can change without notice between versions, major or otherwise.

 

Does anyone know if there are any cross-platform libraries for deterministic floating point?

 

 

 

Finally, is it really the case that you absolutely need bit-for-bit identical results? It seems to me that you don't even need consistency for purely cosmetic things, and that simulation is usually robust (or just loose) enough to deal with margins of error/non-determinism within an epsilon. I have a hard time imagining why you need exact bit-wise agreement short of these values participating in a hash, but maybe that's a blind spot in my experience.



Does anyone know if there are any cross-platform libraries for deterministic floating point?

Already mentioned by swiftcoder, STREFLOP (STandalone REproducible FLOating-Point) is one.



Have you considered using fixed point numbers rather than floating point?

 

Yep. I don't have a sense of how that would affect performance though. I think I'm going to have to simply experiment with this and come to my own conclusions.

Conversion between floating point and fixed point is fairly expensive, but most of that expense is in the conversion of a floating point value to an integer. If you're going to use fixed point, it's probably wisest to do all your calculation with it.

I've used 16.16 fixed point quite a bit. On 32-bit archs that have a bit extending multiply (multiply two 32-bit numbers, get a 64-bit result) and a bit extended divide (divide a 64-bit number by a 32-bit number for a 32-bit result), and assuming your compiler recognizes that they can be used there, fixed point arithmetic probably outperforms floating point arithmetic.

On archs without a bit extended divide (such as ARM), division will be quite slow. In performance-sensitive situations where some precision loss is acceptable, you can use this hack to get near-integer-division performance: divresult = dividend * ((0x80000000 / (unsigned)(divisor >> 1)) << 1). Alternatively, you could use a format with a 15-bit fraction and use this trick for all division with no precision loss.

Fixed point is ideal for situations where results will be converted to integers at the end (e.g. generating images), because conversion from fixed point to an integer by truncation is a simple right shift. I used it for a software rasterizer last year, and even wrote some impressively fast trig functions for it.

If you're using SIMD on Intel or AMD processors, though, multiplication is a total nightmare and performs poorly.
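
For reference, a minimal, untested sketch of the 16.16 operations described above (the type and function names are arbitrary); the 64-bit intermediates are what the bit-extending multiply and divide provide:

#include <cstdint>

typedef int32_t fix16;   // 16 integer bits, 16 fractional bits

inline fix16 fix16_from_int(int32_t v) { return v << 16; }
inline int32_t fix16_to_int(fix16 v)   { return v >> 16; }   // truncation is just a right shift

inline fix16 fix16_mul(fix16 a, fix16 b)
{
    // (a / 2^16) * (b / 2^16) * 2^16: widen to 64 bits, multiply, shift back down.
    return static_cast<fix16>((static_cast<int64_t>(a) * b) >> 16);
}

inline fix16 fix16_div(fix16 a, fix16 b)
{
    // (a / b) * 2^16: widen the dividend first so the pre-shift doesn't overflow.
    return static_cast<fix16>((static_cast<int64_t>(a) << 16) / b);
}

As a quick sanity check, fix16_div(fix16_from_int(1), fix16_from_int(2)) yields 0x00008000, i.e. 0.5 in 16.16.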

Edited by nfries88

