Replay & recorded games


I don't see how that would work, because during recording you need to log everything: it is only when replaying that you will know where they diverge, and at that point you need the full log from the recording session.


My application is comparing multiple game states over a network - once they've been determined to be identical, the saved local states can be forgotten.

I don't know about C#, but in C++, if you want names you can use:

#define Log(X) RealLog(#X, X)
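
A quick sketch of how that macro can be used. RealLog here is a stand-in for whatever logging function you already have; the name and signature are just for illustration:

#include <cstdio>

// Stand-in logging function; in a real replay system this would append
// to the recording log instead of printing.
void RealLog(const char* name, float value)
{
    std::printf("%s = %.9g\n", name, value);
}

// The # operator turns the argument expression into a string literal,
// so the log gets both the expression text and its value.
#define Log(X) RealLog(#X, (X))

int main()
{
    float playerX = 1.5f;
    Log(playerX);          // logs: playerX = 1.5
    Log(playerX * 2.0f);   // logs: playerX * 2.0f = 3
    return 0;
}
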
Alternatively, forgetting the text format thing, can't you just write your game state out as a byte stream?
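
For the byte-stream idea, a rough sketch. The GameState fields and the FNV-1a checksum are purely illustrative; the point is that peers can exchange a small hash of the serialized state instead of the state itself, which also fits the "compare states over a network" use case above:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Illustrative state; a real game would have far more fields.
struct GameState
{
    std::int32_t frame;
    float        playerX;
    float        playerY;
};

// FNV-1a: a simple, well-known 64-bit hash, enough for spotting divergence.
std::uint64_t Fnv1a(const std::uint8_t* data, std::size_t size)
{
    std::uint64_t hash = 14695981039346656037ull;
    for (std::size_t i = 0; i < size; ++i)
    {
        hash ^= data[i];
        hash *= 1099511628211ull;
    }
    return hash;
}

std::uint64_t ChecksumState(const GameState& state)
{
    // Copy field by field into a packed buffer so struct padding
    // never leaks into the checksum.
    std::vector<std::uint8_t> bytes(sizeof(state.frame) + sizeof(state.playerX) + sizeof(state.playerY));
    std::uint8_t* p = bytes.data();
    std::memcpy(p, &state.frame,   sizeof(state.frame));   p += sizeof(state.frame);
    std::memcpy(p, &state.playerX, sizeof(state.playerX)); p += sizeof(state.playerX);
    std::memcpy(p, &state.playerY, sizeof(state.playerY));
    return Fnv1a(bytes.data(), bytes.size());
}

int main()
{
    GameState state{ 120, 1.5f, -3.25f };
    std::printf("state checksum: %llu\n",
                static_cast<unsigned long long>(ChecksumState(state)));
    return 0;
}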

Despite what other people are saying here, it is possible to save and load floats via text, correctly. But you have to be more careful about how you do it (specifically, the formatting):

https://randomascii.wordpress.com/2012/03/08/float-precisionfrom-zero-to-100-digits-2/

https://randomascii.wordpress.com/2013/02/07/float-precision-revisited-nine-digit-float-portability/
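
The short version of those articles: print a float with at least 9 significant decimal digits and a correctly rounding runtime gives you the same bits back on reading. A minimal check, assuming the C runtime's printf/scanf round correctly (modern ones do):

#include <cstdio>
#include <cstring>

int main()
{
    float original = 0.1f;   // not exactly representable in binary
    char text[64];
    // 9 significant digits (FLT_DECIMAL_DIG) are enough to round-trip
    // any finite 32-bit float through decimal text.
    std::snprintf(text, sizeof(text), "%.9g", original);

    float restored = 0.0f;
    std::sscanf(text, "%f", &restored);

    // Compare the underlying bits to prove the round trip was exact.
    unsigned int a, b;
    std::memcpy(&a, &original, sizeof(a));
    std::memcpy(&b, &restored, sizeof(b));
    std::printf("text = %s, bits equal: %s\n", text, (a == b) ? "yes" : "no");
    return 0;
}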

I think there are two different issues discussed with floating point in this thread:

- Storage and retrieval (ie round-trip from binary form to text back to binary form), and

- Computing with floating point between different machines (with or without switching CPU architecture).

I agree it should be possible to store and retrieve a floating point number such that after loading it again you have the exact same value. Languages like Python and Java do this when printing a floating point number, maybe other languages do it too (it makes a lot of sense to have such a property).

The floating point computations, on the other hand, are much less portable. I know that it happens between different processor architectures, and also between different versions of numeric libraries (not so much due to the processor, but because the algorithms used evolve). Others said it also happens with different instances of the same processor architecture, which was new to me.

For the OP, the latter issues should be non-existent, since he aims to reproduce behavior on a single machine, built with a single version of a single compiler. Even if different code paths would produce different results, that is already an error, since the aim is perfectly reproducible behavior.

Others said it also happens with different instances of the same processor architecture, which was new to me.


I've heard these claims before about x86 processors and I don't believe them. I think it originates from a hardware bug in certain Intel processors made in the 1990s. The number of these processors still in use by people wishing to play modern PC games seems likely to be low.

http://en.wikipedia.org/wiki/Pentium_FDIV_bug


Nope.

The FDIV bug (which I also got to work through) was entirely different.

As others have pointed out, and as many of the links call out, in order to get reliable behavior there are some steps that need to happen.

One of those steps is to limit the CPU's floating point precision. The x86 (actually x87) family performs floating point math with 80-bit floats internally, while a float variable as stored in memory is only 32 bits. Even when you change the floating point behavior, the internal calculations and lookup tables rely on having more bits available. This leads to all kinds of tricky behavior EVEN ON A SINGLE MACHINE.

Much of that is intentional. There have been many tradeoffs between performance, accuracy, and reproducibility of floating point math. They can be adjusted, but you need to know that they exist. If you want an exactly repeatable simulation you will need to set specific options that slow down performance.

You can take steps to make that behavior more deterministic. You can limit the CPU's precision to match your variable sizes. You can force a specific rounding mode, and force floating point optimizations to be turned off (more on that later). Then you can validate that they are still set that way every time after you call certain graphics library functions or physics library functions. Unfortunately for developers there are many libraries that will switch your floating point processor parameters behind your back, so if you go this route you will have a constant battle of setting __controlfp(...) before doing any critical math. These operations are slow, but necessary for deterministic behavior.
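
For reference, a minimal sketch of that dance using the MSVC-specific _controlfp_s variant of the _controlfp call mentioned above. Precision control only applies to x87 code paths (32-bit builds); other compilers and platforms have their own equivalents, so treat this as an illustration rather than a drop-in solution:

#include <float.h>    // _controlfp_s, _PC_24, _RC_NEAR, _MCW_PC, _MCW_RC (MSVC)
#include <cstdio>

void ForceDeterministicFpu()
{
    unsigned int current = 0;
#if defined(_M_IX86)
    // Precision control only exists for x87 (32-bit x86 builds):
    // 24-bit mantissa, i.e. plain float precision, instead of 80 bits.
    _controlfp_s(&current, _PC_24, _MCW_PC);
#endif
    // Round-to-nearest, in case a driver or library changed it.
    _controlfp_s(&current, _RC_NEAR, _MCW_RC);
}

void SimulationStep()
{
    // Re-assert the state before critical math, since third-party code
    // may have silently changed it, as described above.
    ForceDeterministicFpu();
    // ... deterministic simulation math goes here ...
}

int main()
{
    SimulationStep();
    std::printf("FPU control word pinned for this step\n");
    return 0;
}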

There are also some quirks in how a few constants are loaded. For example, loading the value of pi (FLDPI) loads a 66-bit value for pi into the 80-bit register, which is then rounded based on the current FPU control settings. These are known, well-documented issues, but inexperienced developers usually are not aware of them. You can get deterministic results, but it may mean not using the intrinsic operations and using math libraries instead.

Another one of those things is "transcendental functions". The chip manufacturers have always been VERY clear that transcendental functions -- including trig functions like sine and cosine that programmers rely on all the time -- generate different results on different chipsets, and developers are warned about this in hardware manuals. These functions also tend to rely on the value of pi, which, as we just covered, has some quirks. The Intel software developer manual has a three page description labeled "Transcendental Instruction Accuracy", warning that they have a specific range of accuracy, that the accuracy and precision can vary, that outside that range they are inaccurate, that values must be reduced in order to have "good accuracy", and even the warning that "A recommended alternative in order to avoid the accuracy issues that might be caused by FSIN, FCOS, FSINCOS, and FPTAN, is to use good quality mathematical library implementations of the sin, cos, sincos, and tan functions". Again, the solution is to use a slower software math library for the functions rather than the hardware implementation.
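
A toy example of that last point, just to show the shape of the idea: a fixed polynomial evaluated in a fixed order gives the same bits on every IEEE 754 machine, unlike FSIN. This naive version has no proper range reduction and is only reasonable for small inputs; a real deterministic math library does far more:

#include <cstdio>

// Truncated Taylor series for sin(x), fixed evaluation order.
// Accurate only for small |x|; no range reduction is done here.
float DeterministicSin(float x)
{
    float x2 = x * x;
    float term3 = (x * x2) / 6.0f;
    float term5 = (x * x2 * x2) / 120.0f;
    float term7 = (x * x2 * x2 * x2) / 5040.0f;
    return ((x - term3) + term5) - term7;   // explicit grouping, fixed order
}

int main()
{
    std::printf("%.9g\n", DeterministicSin(0.5f));   // roughly 0.479426
    return 0;
}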

Another is changes between compilations. In one compilation your compiler may choose to keep intermediate results on the floating point stack; in another compilation it may spill them out to memory. In yet another compilation your compiler may switch the order of operations in ways that are permitted and within floating point tolerance but produce subtly different results. As a quick example, take x = a + b + c. This may be compiled as ((a+b)+c), or it may be (a+(b+c)), or it may be ((a+c)+b). However, if the values are not at the same floating point scale, say they are off by two or three orders of magnitude, maybe 0.12345 and 0.0012345 and 0.00012345, the final result of the operation can differ based on the order in which it is computed. This means both disabling certain optimizations and also heavy use of parentheses, or breaking compound operations into multiple lines, to ensure the compiler maintains the order of operations.
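
A quick demonstration of that reordering effect. The magnitude gap is exaggerated here so the difference is guaranteed, but the same thing happens more subtly at smaller gaps:

#include <cstdio>

int main()
{
    float a =  1.0e20f;
    float b = -1.0e20f;
    float c =  1.0f;

    float leftFirst  = (a + b) + c;   // (0) + 1            -> 1
    float rightFirst = a + (b + c);   // 1 is lost in b + c -> 0

    std::printf("(a+b)+c = %g\n", leftFirst);
    std::printf("a+(b+c) = %g\n", rightFirst);
    return 0;
}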

Another of those is SIMD optimizations. Some of the SIMD functions are well documented to produce different results than the FPU versions. There are some minor differences for operations like fused multiply-add, bigger differences for things like sqrt and reciprocal instructions. The SSE version and the FPU version of the functions are computed differently, and many have different results. Your compiler may be helpfully choosing a faster computation method, but that method may be giving subtly different results. So to get deterministic results you need to disable these optimizations.
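
For example (a rough sketch): the SSE reciprocal-square-root estimate is only an approximation with about 12 bits of precision, and its exact bits are not pinned down by the instruction set, so it does not match 1.0f / sqrt(x) and can even differ between CPU vendors:

#include <cmath>
#include <cstdio>
#include <xmmintrin.h>   // SSE intrinsics

int main()
{
    float x = 3.0f;

    // Fully specified by IEEE 754: divide and square root are exactly rounded.
    float exact = 1.0f / std::sqrt(x);

    // rsqrtss: fast approximation; the exact bits of the estimate are
    // implementation-defined, not part of the architectural contract.
    float approx = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));

    std::printf("1/sqrt(x) = %.9g\n", exact);
    std::printf("rsqrtss   = %.9g\n", approx);
    return 0;
}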

Another is more traditional non-SIMD optimizations and different code paths. Sometimes the compiler may recognize that a pattern is happening and reorder the instructions. Just like the "change between compilation" issue above, these optimizations can mean that even though you have the same C++ code in multiple places that does not mean the same CPU instructions are generated. In one place the compiler may recognize a pattern and apply an optimization, in another place the compiler may not be able to validate some assumptions and not apply the optimization.

Another is when compilers generate two different code paths depending on the detected hardware. Newer processors have CPU instructions that are faster or better than older alternatives and optimizing compilers frequently generate code that provides both options. One of them, called fused multiply-accumulate, follows the pattern x = x + (y*z), or x = x - (y*z). The compiler may generate a fused multiply-add or fused multiply-subtract instruction if it is available, or may use the older process of just doing the work directly. To ensure deterministic behavior across machines you need to disable these optimizations and compile only to an older chipset.
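
A small demonstration of the fused multiply-add difference. Build it with floating point contraction disabled (for example /fp:strict on MSVC or -ffp-contract=off on GCC/Clang) so the compiler does not fuse the "separate" version on its own:

#include <cmath>
#include <cstdio>

int main()
{
    float eps = 1.0f / 8388608.0f;   // 2^-23, one ulp at 1.0
    float x = -1.0f;
    float y = 1.0f + eps;
    float z = 1.0f - eps;

    // Two roundings: y*z rounds to exactly 1.0f, then 1.0f + (-1.0f) = 0.
    float separate = x + (y * z);

    // One rounding: the exact product 1 - 2^-46 survives into the addition,
    // so the tiny difference is preserved.
    float fused = std::fmaf(y, z, x);

    std::printf("separate = %.9g\n", separate);   // 0
    std::printf("fused    = %.9g\n", fused);      // about -1.42e-14
    return 0;
}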

And as discussed, multiprocessing environments can trigger other non-deterministic results as values are shifted around internally. This can be overcome by adding locks around certain computations, which come with a performance penalty.

It is possible to make the result deterministic between machines, that is true. However, that task is not easy. That task requires disabling many optimizations. That task requires adding certain slow operations to ensure floating point state has not changed. That task requires much validation and that takes time.

Can it be done? Yes. But the process is painful and expensive.

I was commenting only on claimed different behaviour of a single executable on different machines, so compilation is not relevant.

I use 32-bit float operations (SSE doesn't internally use 80-bit extended precision) with no denormals (FTZ bit set), code floating point atomically (e.g. no x = a + b + c) and use compiler flags to restrict float optimisations. And you can only use + - * / sqrt() operations - not things like sin().


I stand by my claim that these things give identical behaviour on any x86 chip (except the ones with the FDIV bug).
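
For what it's worth, the "no denormals" part of that setup looks something like this with the SSE control-register intrinsics in an SSE build (a sketch: FTZ flushes denormal results to zero, DAZ treats denormal inputs as zero):

#include <cstdio>
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE

int main()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    // 1e-20 * 1e-20 = 1e-40, which is a denormal; with FTZ set it becomes 0,
    // removing one source of variation (and a large performance cliff).
    volatile float tiny = 1.0e-20f;
    volatile float result = tiny * tiny;
    std::printf("%g\n", result);   // prints 0 with FTZ on, ~1e-40 without
    return 0;
}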

I was commenting only on claimed different behaviour of a single executable on different machines

I stand by my claim that these things give identical behaviour on any x86 chip (except the ones with the FDIV bug).



Not even Intel or AMD give those guarantees.

In fact, both vendors explicitly disavow those properties.


The same executable with the same input on the same machine can give different results. This is well-documented behavior of floating point. The default configuration gives results that are non-deterministic from the application's point of view: the chip's state is influenced by other things going on in the system, so by default, run-after-run tests of complex floating point simulations can give different results. If you need bit-for-bit reproducible results you need to do a lot of work: disable optimizations and limit your compiler's generated output to a subset of the available math operations.


Since you didn't seem to read the links given earlier on the Gaffer on Games site, let's just pull out two Intel white papers on the subject.



The first white paper from Intel asks: if I rerun the identical program on the identical input data on an identical processor, will I get an identical result? The answer is not "yes" or "no", but "maybe", with a long list of potential gotchas.

In their conclusion, they write up: "The run-to-run variations in floating-point results discussed above are in general very tiny, typically in the last bit of the mantissa for individual operations, so of order 1 part in 10^7 for single precision and 1 part in 10^16 for double precision. These reflect the finite precision of floating-point arithmetic; one result is typically not more “correct” than another. These variations may become significant if there are large cancellations (of which the user is sometimes unaware), or if the user depends on bit-for-bit agreement for reasons of quality control. In either case, the variations in results may be a warning that the true numerical uncertainty in the result may be larger, sometimes a lot larger, than the user realizes."

Another white paper, "Why doesn't my application always give the same answer?" goes through several examples:

* "it is sometimes useful to have a degree of reproducibility that goes beyond the inherent accuracy of a computation. Some software quality assurance tests may require close, or even bit-for-bit, agreement between results before and after software changes, even though the mathematical uncertainty in the result of the computation may be considerably larger. The right compiler options can deliver consistent, closely reproducible results while preserving good (though not optimal) performance."

* "Slightly different results were observed when the same application was run on different numbers of processors [...] This was because loop bounds, and hence data alignment, changed when the problem decomposition changed to match the different number of MPI processes. This in turn changed which loop iterations were in the vectorized loop kernel and which formed part of the loop prologue or epilogue. Different generated code in the prologue or epilogue compared to the vectorized kernel can give slightly different results for the same data." Some libraries outside your visibility do slightly different things on different architectures. This can be fixed by forcing parameters to be other than their default or using different libraries.

* "Slightly different results were observed when re-running the same (non-threaded) binary on the same data on the same processor. This was caused by variations in the starting address and alignment of the global stack, resulting from events external to the program. The resulting change in local stack alignment led to changes in which loop iterations were assigned to the loop prologue or epilogue, and which to the vectorized loop kernel. This in turn led to changes in the order of operations for vectorized reductions". This can be fixed by turning off some automatic vectorization optimizations.

* "The math libraries contain function implementations that are optimized differently for different processors. The code automatically detects what type of processor it is running on, and selects the implementation accordingly. For example, a function involving complex arithmetic might have implementations both with and without Intel SSE3 instructions. The implementation that used Intel SSE3 instructions would be invoked only on a processor that was known to support these." So even though the same executable is used, different processors use different paths through the executable, giving different results. This can be fixed by forcing the compiler to ignore alternate chipsets, only using a single base set of instructions and ignoring chipset-specific optimizations.


Again, that is Intel warning developers that running the same binary with identical input on an identical processor can give different results. Neither set of results is wrong, but they are not bit-for-bit identical.


AMD has similar developer notes. Microsoft's compiler has similar notes in their optimizer settings.

I stand by my claim that these things give identical behaviour on any x86 chip (except the ones with the FDIV bug).


Hopefully those links will change your mind. Otherwise, scroll back up to the Gaffer On Games articles about Deterministic Lockstep, and the other on Floating Point Determinism. Both of those discuss in depth that identical runs of the same binary on identical hardware give different results unless you go to lengths to ensure they are the same.

Unless you jump through a lot of very careful hoops, turn off a lot of optimizations, and introduce some slower operations, you can get different results even on the same computer with the same executable and the same exact input. This has nothing to do with the old FDIV bug.

It has everything to do with chip manufacturers enabling tradeoffs between reproducibility, precision, and performance.

Hopefully those links will change your mind. Otherwise, scroll back up to the Gaffer On Games articles about Deterministic Lockstep, and the other on Floating Point Determinism.


Unless you jump through a lot of very careful hoops, turn off a lot of optimizations, and introduce some slower operations, you can get different results even on the same computer with the same executable and the same exact input.


No, all the cases you mention are caught by:

I use 32-bit float operations (SSE doesn't internally use 80-bit extended precision) with no denormals (FTZ bit set), code floating point atomically (e.g. no x = a + b + c) and use compiler flags to restrict float optimisations. And you can only use + - * / sqrt() operations - not things like sin().


Yes, this prevents the compiler using fancy instructions like fused multiply-add and reciprocal sqrt. But I don't consider the performance penalty significant, given the benefits of the networking model it enables.

As evidence for my claim that IEEE 754 specifies the exact binary result of calculations, I cite 'What Every Computer Scientist Should Know About Floating-Point Arithmetic' by David Goldberg:

http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

Which says:

"The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even)."

"One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic."

"There is not complete agreement on what operations a floating-point standard should cover. In addition to the basic operations +, -, × and /, the IEEE standard also specifies that square root, remainder, and conversion between integer and floating-point be correctly rounded. It also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers)."


Although I mostly agree (and am very annoyed by the apparent majority of programmers who seem to think floating point is some kind of unspecified magic), there is one case that is not covered by your original quote: parallelization. If you parallelize any kind of reduction (e.g. a sum) this can introduce non-determinism, unless you ensure that the order of operations is always the same. Integers with wraparound do not have this problem; saturating integer operations, however, do.

l0calh05t: Quite right - I had assumed single-threaded code.

That's the biggest performance trade-off, in my opinion. You could do multi-threaded computation, but like you say, you'd have to sort the results when combining.
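
One way to structure that, as a sketch: give each thread a fixed slice of the data and always combine the per-thread partial sums in the same order, regardless of which thread finishes first:

#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    std::vector<float> values(1000, 0.001f);
    const unsigned threadCount = 4;

    std::vector<float> partials(threadCount, 0.0f);
    std::vector<std::thread> threads;

    for (unsigned t = 0; t < threadCount; ++t)
    {
        threads.emplace_back([&, t]
        {
            // Fixed, contiguous slice per thread -> fixed order within the slice.
            std::size_t begin = values.size() * t / threadCount;
            std::size_t end   = values.size() * (t + 1) / threadCount;
            float sum = 0.0f;
            for (std::size_t i = begin; i < end; ++i)
                sum += values[i];
            partials[t] = sum;
        });
    }
    for (auto& th : threads)
        th.join();

    // Combine partial sums in thread-index order, never completion order.
    float total = 0.0f;
    for (unsigned t = 0; t < threadCount; ++t)
        total += partials[t];

    std::printf("total = %.9g\n", total);
    return 0;
}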


frob: Now I realise I had made an assumption which I may not have made clear enough - I was referring to 'floating point' as just the bits covered by the IEEE 754 standard. I accept that everything outside it (like sin, inv sqrt etc) may not always give the same result across all processors.

