> Others said it also happens with different instances of the same processor architecture, which was new to me.
>
> I've heard these claims before about x86 processors and I don't believe them. I think it originates from a hardware bug in certain Intel processors made in the 1990s. The number of these processors still in use by people wishing to play modern PC games seems likely to be low.
>
> http://en.wikipedia.org/wiki/Pentium_FDIV_bug
Nope.
The FDIV bug (which I also got to work through) was entirely different.
As others have pointed out, and as many of the links call out, there are several steps you need to take in order to get reliable, repeatable behavior.
One of those steps is to limit the CPU's floating point precision. The x86 (actually x87) family performs floating point math internally with 80-bit floats, but a float variable as stored in memory is only 32 bits. Even when you change the floating point behavior, the internal calculations and lookup tables rely on having more bits available. This leads to all kinds of tricky behavior EVEN ON A SINGLE MACHINE.
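As a hedged illustration of that single-machine weirdness (this assumes a build that actually uses the x87 FPU, such as a 32-bit build without strict floating point flags; modern x64 builds default to SSE2 and will usually print the boring answer):

```cpp
#include <cstdio>

int main() {
    // volatile keeps the compiler from constant-folding the division
    volatile float a = 1.0f, b = 3.0f;
    float q = a / b;            // may be held in an 80-bit x87 register
    volatile float stored = q;  // forced out to a 32-bit memory slot

    // On an x87 build the comparison may use the 80-bit register value,
    // so the rounded 32-bit copy can compare unequal to "the same" value.
    if (stored != q)
        printf("same expression, two different values\n");
    else
        printf("values matched on this build\n");
    return 0;
}
```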
Much of that is intentional. There have been many tradeoffs between performance, accuracy, and reproducibility of floating point math. They can be adjusted, but you need to know that they exist. If you want an exactly repeatable simulation you will need to set specific options that slow down performance.
You can take steps to make that behavior more deterministic. You can limit the CPU's precision to match your variable sizes, force a specific rounding mode, and force floating point optimizations to be turned off (more on that later). Then you can validate that those settings are still in place every time after you call into certain graphics or physics libraries. Unfortunately for developers there are many libraries that will switch your floating point processor parameters behind your back, so if you go this route you will have a constant battle of setting __controlfp(...) before doing any critical math. These operations are slow, but necessary for deterministic behavior.
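A minimal sketch of what that battle looks like, using the MSVC control word functions (_controlfp_s is the newer secure variant of __controlfp; the precision control field only exists on the x87 side, so this applies to 32-bit builds, and the flags shown are illustrative of the idea rather than a complete list):

```cpp
#include <float.h>
#include <cassert>

// Force 24-bit (float) precision and round-to-nearest on the x87 FPU.
void set_deterministic_fp() {
    unsigned int prev = 0;
    _controlfp_s(&prev, _PC_24 | _RC_NEAR, _MCW_PC | _MCW_RC);
}

// Call after returning from graphics/physics libraries that may have
// silently changed the control word behind your back.
void assert_fp_unchanged() {
    unsigned int current = _controlfp(0, 0);  // mask of 0 reads only
    assert((current & _MCW_PC) == _PC_24);
    assert((current & _MCW_RC) == _RC_NEAR);
}
```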
There are also some quirks in how a few constants are loaded. For example, the instruction that loads pi (FLDPI) loads a 66-bit value for pi into the 80-bit register, which is then rounded based on the current FPU control settings. These are known, well documented issues, but inexperienced developers usually are not aware of them. You can get deterministic results, but it may mean avoiding the intrinsic operations and using math libraries instead.
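To see what FLDPI actually hands you, here is a hedged sketch using GCC/Clang inline assembly on x86 (assuming the FPU is at its default extended precision; under a reduced precision control setting the constant rounds differently). The point is that the hardware constant carries more correct bits than the double your variables can store:

```cpp
#include <cstdio>

// Pushes the x87's internal pi (a 66-bit constant, rounded to the
// current precision setting) onto the FP stack and returns it.
long double hardware_pi() {
    long double pi;
    asm("fldpi" : "=t"(pi));
    return pi;
}

int main() {
    printf("FLDPI (extended): %.21Lf\n", hardware_pi());
    printf("pi as a double:   %.21f\n", 3.141592653589793);
    // The two printouts diverge after roughly 16 significant digits.
    return 0;
}
```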
Another one of those things is "transcendental functions". The chip manufacturers have always been VERY clear that transcendental functions -- including trig functions like sine and cosine that programmers rely on all the time -- generate different results on different chipsets, and developers are warned about this in the hardware manuals. These functions also tend to rely on the value of pi, which we just covered has some quirks. The Intel software developer manual has a three page section labeled "Transcendental Instruction Accuracy" warning that these instructions have a specific range of accuracy, that the accuracy and precision can vary, that outside that range they are inaccurate, that values must be reduced in order to have "good accuracy", and even the recommendation that "A recommended alternative in order to avoid the accuracy issues that might be caused by FSIN, FCOS, FSINCOS, and FPTAN, is to use good quality mathematical library implementations of the sin, cos, sincos, and tan functions". Again, the solution is to use a slower software math library for these functions rather than the hardware implementation.
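A hedged demonstration of why the manual says that (GCC/Clang x86 inline assembly again; the exact digits FSIN produces depend on the chip). Take the double closest to pi: its sine should be about 1.2246e-16, and a good software library gets that, while FSIN's 66-bit internal reduction loses most of those bits:

```cpp
#include <cmath>
#include <cstdio>

// x87 FSIN applied to a double.
double hardware_sin(double x) {
    double result;
    asm("fsin" : "=t"(result) : "0"(x));
    return result;
}

int main() {
    const double near_pi = 3.141592653589793;  // closest double to pi
    printf("FSIN:    %.17g\n", hardware_sin(near_pi));
    printf("library: %.17g\n", std::sin(near_pi));
    // On chips using the 66-bit reduction, the FSIN result disagrees
    // with the library after only a few significant digits.
    return 0;
}
```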
Another is changes between compilations. In one compilation your compiler may keep intermediate results on the floating point stack at full width, in another it may spill them out to memory, rounding them to their declared size. In yet another compilation your compiler may switch the order of operations in ways that are permitted and within floating point tolerance but produce subtly different results. As a quick example, take x = a + b + c. This may be compiled as ((a+b) + c), or as (a+(b+c)), or as ((a+c)+b). However, if the values are not at the same floating point scale -- say they are off by two or three orders of magnitude, maybe 0.12345 and 0.0012345 and 0.00012345 -- the final result will differ based on the order in which the terms are combined. This means both disabling certain optimizations and making heavy use of parentheses, or breaking compound operations into multiple lines, to ensure the compiler maintains the order of operations.
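The associativity problem is easy to see with plain doubles (this part is ordinary IEEE 754 behavior, not compiler-specific):

```cpp
#include <cstdio>

int main() {
    double a = 0.1, b = 0.2, c = 0.3;
    printf("(a+b)+c = %.17g\n", (a + b) + c);  // 0.60000000000000009
    printf("a+(b+c) = %.17g\n", a + (b + c));  // 0.59999999999999998
    return 0;
}
```

Same three values, same addition, different last bits depending purely on grouping.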
Another of those is SIMD optimizations. Some of the SIMD functions are well documented to produce different results than the FPU versions. There are minor differences for operations like fused multiply-add, and bigger differences for things like the sqrt and reciprocal instructions. The SSE version and the FPU version of a function are computed differently, and many give different results. Your compiler may be helpfully choosing a faster computation method, but that method may give subtly different results. So to get deterministic results you need to disable these optimizations.
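For example, the SSE approximate reciprocal square root intrinsic is documented to give only about 12 bits of precision, and its exact answer is allowed to vary between chip families:

```cpp
#include <xmmintrin.h>
#include <cmath>
#include <cstdio>

int main() {
    float x = 3.0f;
    // SSE approximate reciprocal square root: ~12 bits of precision,
    // with chip-dependent results.
    float approx = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    // Full-precision version computed through the library path.
    float exact = 1.0f / std::sqrt(x);
    printf("rsqrtss: %.9g\nexact:   %.9g\n", approx, exact);
    return 0;
}
```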
Another is more traditional non-SIMD optimizations and different code paths. Sometimes the compiler may recognize that a pattern is happening and reorder the instructions. Just like the "changes between compilations" issue above, these optimizations mean that having the same C++ code in multiple places does not guarantee the same CPU instructions are generated. In one place the compiler may recognize a pattern and apply an optimization, in another place the compiler may not be able to validate some assumptions and not apply the optimization.
Another is when compilers generate two different code paths depending on the detected hardware. Newer processors have CPU instructions that are faster or better than older alternatives, and optimizing compilers frequently generate code that provides both options. One of them, called fused multiply-accumulate, follows the pattern x = x + (y*z), or x = x - (y*z). The compiler may generate a fused multiply-add or fused multiply-subtract instruction if it is available, or may fall back to doing the work directly with separate instructions. To ensure deterministic behavior across machines you need to disable these optimizations and compile only for an older chipset.
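A small sketch of why the two code paths disagree: the fused instruction rounds once at the very end, while the separate multiply and add round twice (0x1p-30 is a C++17 hex-float literal for 2^-30):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    double a = 1.0 + 0x1p-30;  // exactly representable
    // Separate multiply and subtract: a*a is rounded to a double first,
    // which discards the tiny 2^-60 term before the subtraction happens.
    double separate = a * a - 1.0;
    // Fused multiply-add: a single rounding at the end keeps that term.
    double fused = std::fma(a, a, -1.0);
    printf("separate: %.17g\n", separate);
    printf("fused:    %.17g\n", fused);  // differs in the low bits
    return 0;
}
```

On GCC/Clang whether the fused form is emitted depends on -ffp-contract and on whether you target an FMA-capable chipset (e.g. -mfma); MSVC has similar /fp options. Pinning those flags down is part of the determinism checklist.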
And as discussed, multiprocessing environments can trigger other non-deterministic results as values are shifted around internally. This can be overcome by adding locks around certain computations, or by imposing a fixed order on how results are combined, both of which come with a performance penalty.
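As a hedged sketch of the fixed-order approach (all names here are illustrative): give each thread its own slice, then combine the partial sums in index order, so the final float no longer depends on which thread finishes first:

```cpp
#include <numeric>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    std::vector<float> data(1000000);
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = 1.0f / float(i + 1);

    const int kThreads = 4;
    std::vector<float> partial(kThreads, 0.0f);
    std::vector<std::thread> pool;
    const std::size_t chunk = data.size() / kThreads;

    for (int t = 0; t < kThreads; ++t) {
        pool.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end =
                (t == kThreads - 1) ? data.size() : begin + chunk;
            // Each thread writes only its own slot: no shared accumulator,
            // and each partial sum has a fixed internal addition order.
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0f);
        });
    }
    for (auto& th : pool) th.join();

    // Combine in fixed index order, not completion order. This is what
    // keeps the result bit-identical from run to run.
    float total = 0.0f;
    for (float p : partial) total += p;
    printf("%.9g\n", total);
    return 0;
}
```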
It is possible to make the result deterministic between machines, that is true. However, that task is not easy. That task requires disabling many optimizations. That task requires adding certain slow operations to ensure floating point state has not changed. That task requires much validation and that takes time.
Can it be done? Yes. But the process is painful and expensive.