In my game, I have implemented a replay system for debugging purposes. When recording it uses the command pattern (all player actions are stored as a command data structure in a list) and seed numbers for the RNGs to store enough data to replay a game session. I also store a small amount of data to compare with and verify during replay.
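As a minimal sketch of this kind of recording (all names here are illustrative, not the actual game's API): store one RNG seed plus a list of command records, then re-seed and re-apply on replay.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Command:
    tick: int        # simulation tick the command applies to
    name: str        # hypothetical action name, e.g. "move"
    args: tuple = () # action parameters

@dataclass
class Recording:
    rng_seed: int                              # re-seed the simulation RNG with this on replay
    commands: list = field(default_factory=list)

    def record(self, cmd: Command):
        self.commands.append(cmd)

# Recording: seed the RNG once, log every player action.
rec = Recording(rng_seed=12345)
rng = random.Random(rec.rng_seed)
rec.record(Command(tick=3, name="move", args=(1, 0)))

# Replay: re-seed with the stored seed; the RNG stream is identical.
replay_rng = random.Random(rec.rng_seed)
assert rng.random() == replay_rng.random()
```

The key point is that the RNG must be seeded at exactly the same moment relative to the command stream in both runs; any extra draw from the RNG (e.g. by UI or particle effects sharing it) desynchronizes everything after it.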
The problem is that verification during replay most often reveals a divergence from the recorded game. This makes the system unusable for most purposes.
I have already sunk a lot of time into the system but still hope that it can be made more reliable. I just need a way to find out why and how the replay session diverges. One idea is to use my save game feature to compute a CRC checksum over the entire simulation state instead of the few variables that I am storing now. This would give reliable information on when the divergence happens. But I would still need to find the place in the code, or the specific simulation data, that causes the divergence. Is there any other way than extensive logging to do this? And how would this work without bogging down the whole program?
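The checksum idea can be sketched roughly like this (the state layout and function names are made up for illustration): hash a canonical binary serialization of the state once per tick, and compare the recorded and replayed checksum streams to pinpoint the first diverging tick.

```python
import struct
import zlib

def state_checksum(state: dict) -> int:
    """CRC32 over a canonical binary serialization of the state.

    Floats are packed as raw 64-bit doubles so no precision is lost;
    hashing str(x) instead could mask tiny divergences.
    """
    buf = bytearray()
    for key in sorted(state):          # fixed order => same bytes every run
        buf += key.encode()
        val = state[key]
        if isinstance(val, float):
            buf += struct.pack("<d", val)
        else:
            buf += struct.pack("<q", val)
    return zlib.crc32(bytes(buf))

# During recording, store one checksum per tick; during replay, compare.
recorded = [state_checksum({"x": 1.0, "hp": 100}),
            state_checksum({"x": 1.5, "hp": 97})]
replayed = [state_checksum({"x": 1.0, "hp": 100}),
            state_checksum({"x": 1.5000001, "hp": 97})]

first_bad = next((t for t, (a, b) in enumerate(zip(recorded, replayed))
                  if a != b), None)
print(first_bad)  # first tick where the simulations diverge: 1
```

A CRC per tick is cheap compared to a full state dump; once the first bad tick is known, a detailed dump of just that tick (or a bisection over sub-systems hashed separately) narrows down which data diverged, without logging everything all the time.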
Replay & recorded games
On the other hand, there are of course plenty of opportunities to get a replay not just slightly different, but altogether wrong. Your use of the plural in "seeds for the RNGs" suggests that you have several RNGs running concurrently. Accessing them concurrently from different threads too, maybe? Bang. The scheduler assigns one time slice differently, and everything explodes. More information would obviously be needed to give a more fact-based answer (an uninitialized non-pointer variable somewhere is equally possible). But that's one rather obvious thing that can go wrong.
The game is single threaded and I only have one RNG for the simulation, so the description was a bit inaccurate.
"I think it is impossible to do a perfect, non-diverging replay"

You just declared OpenTTD cannot exist.
Network play in OpenTTD is done by having each machine in the game update its local state, and only sending the changes made by the users between the clients.
Obviously that system will break down when some clients compute different results. This is called a desync, and indeed, it's a nightmare to trace the cause.
There is a whole set of code conventions in place to avoid desyncs (I don't know them all, but e.g. no use of floating-point numbers, since different CPU architectures have different ideas about rounding). In addition, there is code in place to dump all commands and their results as they are executed, which creates GBs of data if you let it run for a few hours on the server.
Here are some useful links:
http://gafferongames.com/game-physics/fix-your-timestep/
http://gafferongames.com/networked-physics/deterministic-lockstep/
OK, so according to what you describe, OpenTTD has a (non-authoritative?) server that merely records the changes made by (trusted?) clients.
That is different insofar as it is known exactly what happens at every tick, and sure enough you can play this back endlessly, and exactly binary-identically. Once the client has decided that this and that has happened (and shown it to the user), that is exactly what is logged at the server, too.
What remains as an interesting problem is what happens if two clients decide to build a house in the same location at the same time. If clients are to decide on outcomes, this can be... mucho fun. Not even thinking about a cheating client...
But the OP stores player events (keypresses, mouse clicks, joystick... what else?) while a simulation runs at some rate. Ideally a fixed timestep, but nothing has been said; in the worst case it could even be an FPS-dependent rate! Even with a fixed-timestep simulation, however, it's very hard, if possible at all, to get a 100% identical playback.
The outcome is not fixed, only the inputs are. Ideally, of course, if the inputs are the same, the output should be binary-identical (computers are deterministic machines, after all!). But in practice a computer is not a completely deterministic machine, even in the absence of floating-point math. Not if scheduling, preemption, and timers are part of the equation.
https://platformrpg.wordpress.com/
If you haven't been designing for determinism from the start though you may have a hard time getting it to work after the fact. As PeterStock suggested, fix your time step.
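The "fix your timestep" advice from the linked article boils down to an accumulator loop; a minimal sketch (step size and function names are assumptions, not from the OP's code):

```python
DT = 1.0 / 60.0  # fixed simulation step, assuming a 60 Hz simulation

def run_frame(sim_state, frame_time, accumulator):
    """Advance the simulation in fixed DT steps, carrying the remainder."""
    accumulator += frame_time
    while accumulator >= DT:
        sim_state["ticks"] += 1   # stand-in for the real update(sim_state, DT)
        accumulator -= DT
    return accumulator

state = {"ticks": 0}
acc = 0.0
# Variable frame times coming from the renderer...
for frame_time in (0.016, 0.020, 0.030, 0.010):
    acc = run_frame(state, frame_time, acc)
# ...but the simulation only ever advanced in exact multiples of DT,
# so a replay that feeds the same commands per tick stays in step.
print(state["ticks"])  # 4
```

With this structure, the replay file only needs commands tagged with tick numbers; the rendering frame rate during playback no longer matters.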
Be sure you aren't losing numerical precision when saving floating-point values. Don't store the numbers in a text format like JSON; save them in a binary format.
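A quick way to see the problem: round-tripping a double through a short decimal string loses bits, while packing the raw 64-bit representation is exact. (Python's default `repr` is actually lossless, so this sketch truncates to 7 digits to mimic a format like "0.0045674".)

```python
import struct

dt = 0.0045674301282753706  # a typical variable frame time

# Text round-trip with truncated digits (as in "%.7f") is lossy.
text = "%.7f" % dt
assert float(text) != dt

# Binary round-trip of the raw IEEE-754 double is exact.
packed = struct.pack("<d", dt)
assert struct.unpack("<d", packed)[0] == dt
```

Even a divergence in the 8th decimal place of a frame time compounds over thousands of frames into a visibly different simulation.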
If you are recording and replaying across different processors, there may be floating-point rounding differences, and there is nothing you can do about it except switching to a fixed-point simulation, which I don't recommend.
Just having reliable replays on the same machine would be a big help in bug fixing.
I am using variable/non-fixed time steps. When recording, I store the elapsed time for each frame in a text file, as a string like "0.0045674", along with the number of ticks, which is an integer. During replay, I pass these values to the simulation. Not sure if this can cause problems?
It's not actually necessary to use a fixed time step to get repeatable/deterministic behaviour for replays, but it is (much) easier for certain applications (like keeping 2 games in sync over a network).
For variable time steps, you need to make sure you use the same sized time steps during replays, which it sounds like you already are.
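A minimal sketch of that approach, under the assumption that the step sizes are stored losslessly in binary (the `advance` function is a stand-in, not the OP's code): feeding the replay the exact same bit patterns performs the exact same float operations, so on the same machine the result matches bit for bit.

```python
import struct

def advance(pos: float, vel: float, dt: float) -> float:
    return pos + vel * dt  # stand-in for the real simulation step

# Record: save each frame's dt as a raw double so replay gets identical bits.
dts = [0.016, 0.021, 0.013]
recorded = b"".join(struct.pack("<d", dt) for dt in dts)

pos_live = 0.0
for dt in dts:
    pos_live = advance(pos_live, 2.0, dt)

# Replay: unpack the same bit patterns and apply the same steps in order.
pos_replay = 0.0
for (dt,) in struct.iter_unpack("<d", recorded):
    pos_replay = advance(pos_replay, 2.0, dt)

assert pos_replay == pos_live  # bit-identical on the same machine
```

If the steps were instead stored as truncated decimal strings, the replayed dt values would differ from the live ones in the last bits, which is exactly the kind of slow drift described in the question.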
Tracking down determinism bugs will be hard, but if you do get to the bottom of them (and there will likely be many, not just one!) then the knowledge you gain from it will likely be useful in the future :-)