Procedurally-regenerating identical content on different machines (floating point determinism)


I've been designing the architecture for a distributed procedural content simulation engine for some time now, and it's recently become apparent that floating point determinism is going to be a thorn in my side when it comes to enabling distributed setups. So, for those of you with experience in (or reasonable knowledge of) floating point determinism issues, I'd really like to know how you'd approach this scenario:

Imagine you have two machines. The machines can be reasonably guaranteed to use modern, common hardware and operating systems (PowerPCs, IBM mainframes and other gotchas are not relevant here). In other words, both machines will either be modern commodity x64 boxes in a server rack, or the relatively modern gaming machines of two friends who are sharing the workload I'll describe. The operating system types can be constrained to the standard three (Windows/OSX/Linux).

Anyway, imagine we have two machines, and one of them is going to procedurally-generate some content based on known inputs. Let's say the content is a simple perlin noise-based terrain. Taking this a step further, let's say we procedurally-generate a set of physical objects (for brevity, perhaps they're simple wooden cubes of varying sizes and weights). We then use physics algorithms to scatter them around the terrain and allow them to settle based on physical characteristics such as gravity and the shape of the terrain.

Here's the catch:

  1. We don't know in advance which of the two machines will do the work.
  2. The machine that generates the content will not be able to guarantee that the other machine is available at the time that the work needs to be performed.
  3. The other machine will, later on, have to generate the exact same outputs under the same conditions.
  4. We don't want to store the overall result as there will likely be too much data.
  5. It'd be nice to be able to offload some of the compute work (where relevant and reasonable) to the GPU (whatever is available on the machine in question).

Some places that I am willing to compromise:

  1. Reducing floating point precision
  2. Storing a limited subset of source data, as long as bulkier derivative data can be regenerated as needed
  3. Using large integers instead of floats
  4. Using fixed point calculations

Some additional questions:

  1. Is it possible to write integer-only versions of the popular coherent noise algorithms? (perlin/simplex/brownian/etc.)
  2. Can I get away with forcing IEEE754 floating point accuracy or will that compromise speed too much?
  3. Are 64-bit integers a thing on GPUs yet, or likely to become a thing any time soon?

As an addendum to this, I'm considering the possibility that this problem doesn't really have a good solution at this point in time, and that perhaps I need to be willing to simply store a lot of data and ensure that it is available when a machine needs to generate new content.


So you're looking for bit-for-bit identical results between arbitrary processors?

It is possible, but takes effort.

Achieving bit-for-bit identical results generally means disabling various optimizations, restricting yourself to basic operations, and constantly verifying and resetting flags (and asserting hard if you discover they are in an unexpected state).

The problem is that the floating point standards for many operations only offer guarantees of precision and accuracy. Results need only be close enough to the target, and even the manufacturers (Intel and AMD) are explicit that the microcode for some operations changes between chips, giving results that meet the precision and accuracy standard but are not bit-for-bit identical.

Some operations, particularly the trig functions, have had known issues where they leave data in the extended bits, causing subtle differences that are still within the required tolerance. Those bits can be cleared, but again you need to track down every case where it might be an issue.

Optimizers generally need to be disabled. It is quite common for optimizing compilers to generate several variants of a function and select the version matching the processor when the process is run. Since the different versions have different code paths, the results tend not to be bit-for-bit identical.

Optimizers also like to occasionally switch between FPU and SIMD operations, and many of those exchanges do not produce bit-for-bit identical results.

If your code follows slightly different code paths, perhaps one doing a load/store between operations and another leaving the value in a higher-precision register, you can get differences. Any minor difference has the potential to compound into diverging results.

There are also shifts in floating point control that happen behind your back; another library modifying the floating point flags, for example, can break repeatability. You'll be going along just fine, suddenly discover one of the FPU flags is different, and then have to track down which library set the flag and stop it from happening. Hopefully you can find and fix it; otherwise you'll need to find a way to prevent the library from interfering, so multiprocessing code may need some serious locks around operations if you cannot stop them elsewhere.
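To make that concrete, here's a minimal sketch of the kind of check-and-reset I mean, using the standard <cfenv> facilities (which control bits actually matter depends on your target and toolchain, so treat this purely as illustration):

```cpp
#include <cassert>
#include <cfenv>

// Sketch: verify the floating point environment is in the state deterministic
// code expects, and restore it if something (e.g. a third-party library)
// changed it behind our back.
#pragma STDC FENV_ACCESS ON

void ensure_expected_fp_state()
{
    // We rely on round-to-nearest; assert hard if anything changed it.
    if (std::fegetround() != FE_TONEAREST)
    {
        assert(false && "FP rounding mode was changed behind our back");
        std::fesetround(FE_TONEAREST);   // restore the mode we depend on
    }

    // Clear any sticky exception flags left over from earlier work.
    std::feclearexcept(FE_ALL_EXCEPT);
}
```

On Windows you'd typically reach for _controlfp_s to inspect the full control word (including the x87 precision bits), but the idea is the same: check before every deterministic chunk of work, and scream loudly when the state isn't what you expect.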

The safest route is to use a software-based numerics library that guarantees deterministic results and does not rely on CPU-implemented functionality. Such a library can potentially also be faster if the optimizer is allowed to hit it, whereas using the processor's floating point requires many pessimizations and constant validation that the system is still in the proper state.

Thanks for the reply. What you've said sounds pretty much like what I've been anticipating, particularly given that my design is meant to facilitate a significant number of community-driven add-ons.



The safest route is to use a software-based numerics library that guarantees deterministic results and does not rely on CPU-implemented functionality. Such a library can potentially also be faster if the optimizer is allowed to hit it, whereas using the processor's floating point requires many pessimizations and constant validation that the system is still in the proper state.

Any you'd recommend?

STREFLOP is worth a look.

Though I don't see any particular barrier to designing new noise functions using integer math, and it's certainly possible to write them with a fixed-point math library.
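For example, here's a rough sketch of an integer-only 2D value noise built on an integer hash of the lattice coordinates and Q16.16 fixed-point interpolation. The hash constants are arbitrary mixing values I'm using for illustration, not from any particular library:

```cpp
#include <cstdint>

// Integer-only 2D value noise sketch. All arithmetic is integral, so results
// are bit-for-bit identical on any conforming platform. Coordinates are
// Q16.16 fixed point (16 fractional bits).
static std::uint32_t hash2d(std::int32_t x, std::int32_t y, std::uint32_t seed)
{
    std::uint32_t h = seed;
    h ^= static_cast<std::uint32_t>(x) * 0x9E3779B1u;
    h ^= static_cast<std::uint32_t>(y) * 0x85EBCA77u;
    h ^= h >> 15;
    h *= 0xC2B2AE3Du;
    h ^= h >> 13;
    return h;
}

// Linear interpolation in Q16.16; t is in [0, 65536].
static std::int32_t lerp_q16(std::int32_t a, std::int32_t b, std::int32_t t)
{
    return a + static_cast<std::int32_t>(
        (static_cast<std::int64_t>(b - a) * t) >> 16);
}

// Returns a noise value in Q16.16, roughly in [0, 65536).
std::int32_t value_noise_2d(std::int32_t x_q16, std::int32_t y_q16, std::uint32_t seed)
{
    const std::int32_t xi = x_q16 >> 16, yi = y_q16 >> 16;        // lattice cell
    const std::int32_t xf = x_q16 & 0xFFFF, yf = y_q16 & 0xFFFF;  // fractional part

    // Corner values reduced to Q16.16 in [0, 65536).
    const std::int32_t c00 = hash2d(xi,     yi,     seed) >> 16;
    const std::int32_t c10 = hash2d(xi + 1, yi,     seed) >> 16;
    const std::int32_t c01 = hash2d(xi,     yi + 1, seed) >> 16;
    const std::int32_t c11 = hash2d(xi + 1, yi + 1, seed) >> 16;

    const std::int32_t top    = lerp_q16(c00, c10, xf);
    const std::int32_t bottom = lerp_q16(c01, c11, xf);
    return lerp_q16(top, bottom, yf);
}
```

Smoothing the interpolant (a fixed-point smoothstep) and summing octaves for fractional Brownian motion follow the same pattern, and gradient noise in the Perlin/simplex style mostly just adds an integer dot product per corner.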

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Have you considered using fixed point numbers rather than floating point?

"I would try to find halo source code by bungie best fps engine ever created, u see why call of duty loses speed due to its detail." -- GettingNifty

Sometimes you can get away with dropping the requirement for bit-for-bit identical results in the later stages of generation, as long as the early stages are perfectly identical. Depending on the particular kind of procedural generation you're doing, this can possibly make things much easier. Perhaps most of the numerical instability exists during the early/mid stages of generation, and so you can deal with that by using fixed point math, while a bunch of the heavy number crunching happens during the later stages, so you can switch to floating point and perhaps offload some of the work to the GPU. Even if different GPUs produce different results, if the late-stage algorithms are numerically stable, they might still be close enough to work.

"We should have a great fewer disputes in the world if words were taken for what they are, the signs of our ideas only, and not for things themselves." - John Locke

This is something to consider. I have taken some time to peek through Unreal's code to see what they did on a number of things. There is an item that may very well be deprecated by now, but the comments describe what it was used for.

The guys over at Epic actually created a custom floating point class for both 16 and 32 bits. Their reasoning, marked up in the comments, was that floating point results were not always the same across systems.

The same was done for StarCraft, as mentioned in one of their GDC videos about AI. They use floating point arithmetic and ints to do the computation.

Thanks for the replies.


STREFLOP is worth a look.

Though I don't see any particular barrier to designing new noise functions using integer math, and it's certainly possible to write them with a fixed-point math library.

Cheers, I'll take a look at that, and will investigate creating modified noise algorithms.


Have you considered using fixed point numbers rather than floating point?

Yep. I don't have a sense of how that would affect performance though. I think I'm going to have to simply experiment with this and come to my own conclusions.
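For the record, this is roughly the kind of minimal Q16.16 type I'm picturing for those experiments (just a sketch; overflow handling, sqrt, trig and so on left out):

```cpp
#include <cstdint>

// Minimal Q16.16 fixed-point sketch: 16 integer bits, 16 fractional bits.
// Addition/subtraction are plain integer ops; multiplication and division
// widen to 64 bits to preserve the fractional precision.
struct Fixed16
{
    std::int32_t raw;   // stored as value * 65536

    static Fixed16 from_int(std::int32_t v) { return { v * 65536 }; }
    static Fixed16 from_raw(std::int32_t r) { return { r }; }

    Fixed16 operator+(Fixed16 o) const { return { raw + o.raw }; }
    Fixed16 operator-(Fixed16 o) const { return { raw - o.raw }; }

    Fixed16 operator*(Fixed16 o) const
    {
        return { static_cast<std::int32_t>(
            (static_cast<std::int64_t>(raw) * o.raw) >> 16) };
    }

    Fixed16 operator/(Fixed16 o) const
    {
        return { static_cast<std::int32_t>(
            (static_cast<std::int64_t>(raw) << 16) / o.raw) };
    }

    double to_double() const { return raw / 65536.0; }  // debugging/display only
};
```

Every operation is defined purely in terms of integer arithmetic, so results should be identical everywhere; the trade-offs are the narrow range (roughly ±32768 here) and having to watch for overflow by hand.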


Sometimes you can get away with dropping the requirement for bit-for-bit identical results in the later stages of generation, as long as the early stages are perfectly identical. Depending on the particular kind of procedural generation you're doing, this can possibly make things much easier. Perhaps most of the numerical instability exists during the early/mid stages of generation, and so you can deal with that by using fixed point math, while a bunch of the heavy number crunching happens during the later stages, so you can switch to floating point and perhaps offload some of the work to the GPU. Even if different GPUs produce different results, if the late-stage algorithms are numerically stable, they might still be close enough to work.

Yeah, I'd considered that, and it is certainly applicable to projecting source data into client-side game assets. The problem is that the "server" component must be able to canonically represent all game state, including the results of positional changes due to physics. Because those tend to involve compounded floating-point operations, I have a feeling they're going to be a problem for me. Perhaps there's an opportunity to get creative with physics code...


The guys over at Epic actually created a custom floating point class for both 16 and 32 bits. Their reasoning, marked up in the comments, was that floating point results were not always the same across systems.

Sounds interesting, I'll look further into that, thanks.

Interesting fact: Microsoft compilers will automatically use SSE on x64 builds and will not emit FPU instructions. This is something I've exploited in the past to great benefit, because SSE determinism is actually a lot easier and less fussy than x87 determinism.

NB this does not work on Linux or (I *think*) OSX.
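For reference, these are the sorts of switches involved (spellings from memory, so double-check against your compiler's docs):

```cpp
// Build flags (from memory -- verify against your toolchain docs):
//   MSVC, x64:      SSE2 is already the default FP unit; add /fp:strict to
//                   also disable contractions and expression reordering.
//   GCC/Clang, x86: -msse2 -mfpmath=sse forces SSE math instead of x87;
//                   -ffp-contract=off stops fused multiply-adds from
//                   changing results between builds.

#include <cstdio>

int main()
{
#if defined(_M_X64) || defined(__SSE2__)
    std::puts("SSE2 code generation is in effect");
#else
    std::puts("warning: compiler may still be emitting x87 instructions");
#endif
}
```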

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]


Interesting fact: Microsoft compilers will automatically use SSE on x64 builds and will not emit FPU instructions.

I can see how that would work for simple arithmetic and logic operations, but not so much on other operations.

Operations like exponents, logarithms, trig functions, and a few others are not available in SSE form but are implemented as intrinsics or direct FPU calls by most compilers, including Microsoft's. Those also tend to be operations with high variation (still within tolerance but bit-for-bit differences) between chips.
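One workaround, if the precision trade-off is acceptable, is to replace those calls with your own polynomial approximations built only from comparisons, adds and multiplies, which compile to plain SSE scalar ops. A rough sketch for sine (input assumed already reduced to [-pi, pi]; accuracy is only a few decimal places, which may or may not be enough for a given use case):

```cpp
// Deterministic sine approximation using only comparisons, adds and
// multiplies, avoiding the library/FPU variation described above.
float det_sin(float x)
{
    const float pi = 3.14159265f;

    // Fold [-pi, pi] into [-pi/2, pi/2] using sin(x) = sin(pi - x).
    if (x >  0.5f * pi) x =  pi - x;
    if (x < -0.5f * pi) x = -pi - x;

    // Odd Taylor polynomial: x - x^3/3! + x^5/5! - x^7/7!
    // (the divisions below are compile-time constants).
    const float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f
                   + x2 * ( 1.0f / 120.0f
                   + x2 * (-1.0f / 5040.0f))));
}
```

You still have to make sure the compiler doesn't contract those into FMAs or reorder them differently per target, but that's a single, controllable setting rather than per-chip microcode you can't see.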

This topic is closed to new replies.
