Going back to the OP, I'd recommend doing the following to make the replay system robust:
* Do a binary save after every single step of the simulation.
* After each step, load the previous step's save and apply the same input again.
* Compare the CRC of the original state against the CRC of the replayed state.
* If there's a difference, then compare a binary save of the original step against a binary save from the repeated step
* Find out which byte of the binary saves is the first one that's different.
* Load one of the binary saves again, but this time assert when the loader reaches the offset of that first differing byte.
* The callstack at the assert tells you which member of which entity has diverged, which should give you a pretty strong clue about where you went wrong.
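
To make the steps above concrete, here's a rough sketch in Python. Everything in it is made up for illustration: `Sim` is a toy simulation with integer `(x, vx)` entities, and `Reader`, `first_diff`, `checked_step` and `trap_offset` are hypothetical names, not from any real engine. Your own save format and step function will look different, but the shape of the check should carry over.

```python
import struct
import zlib


class Reader:
    """Reads fields from a binary save; raises if a read covers trap_offset."""

    def __init__(self, data, trap_offset=None):
        self.data, self.pos, self.trap_offset = data, 0, trap_offset

    def read(self, fmt, label):
        size = struct.calcsize(fmt)
        t = self.trap_offset
        if t is not None and self.pos <= t < self.pos + size:
            # The traceback from here shows which field is being loaded.
            raise AssertionError(f"diverged field: {label} (byte {t})")
        values = struct.unpack_from(fmt, self.data, self.pos)
        self.pos += size
        return values


class Sim:
    """Toy deterministic simulation (hypothetical stand-in for the real game)."""

    def __init__(self, entities):
        self.entities = [list(e) for e in entities]

    def step(self, inp):
        for e in self.entities:
            e[1] += inp   # input changes velocity
            e[0] += e[1]  # velocity changes position

    def save(self):
        out = struct.pack("<I", len(self.entities))
        for x, vx in self.entities:
            out += struct.pack("<ii", x, vx)
        return out

    @classmethod
    def load(cls, data, trap_offset=None):
        reader = Reader(data, trap_offset)
        (count,) = reader.read("<I", "entity count")
        entities = []
        for i in range(count):
            (x,) = reader.read("<i", f"entity[{i}].x")
            (vx,) = reader.read("<i", f"entity[{i}].vx")
            entities.append((x, vx))
        return cls(entities)


def first_diff(a, b):
    """Index of the first differing byte, or None if the saves are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


def checked_step(sim, prev_save, inp):
    """Step sim, replay the same input from prev_save, and compare the results."""
    sim.step(inp)
    original = sim.save()

    replayed_sim = Sim.load(prev_save)
    replayed_sim.step(inp)
    replayed = replayed_sim.save()

    if zlib.crc32(original) != zlib.crc32(replayed):
        offset = first_diff(original, replayed)
        Sim.load(original, trap_offset=offset)  # raises, naming the field
        raise AssertionError(f"saves differ at byte {offset}")
    return original  # becomes prev_save for the next step
```

You'd call `checked_step` in place of the plain step in debug builds, feeding each returned save back in as `prev_save`. The useful part is that the loader itself raises at the divergent byte, so the callstack points straight at the field being deserialized rather than leaving you staring at a hex offset.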
Do this all the time in your debug builds so you catch any new bugs immediately, and not just when you're explicitly trying to test your replay system.
I recently wrote an article which covers some relevant stuff (in the context of lockstep multiplayer): http://www.tundragames.com/minimizing-the-pain-of-lockstep-multiplayer/