Lockstep RTS: 90% done, 90% to go

The good news is that I have multiple instances of my game successfully running in lockstep across the loopback interface. Each instance runs in a small 400x300 window that allows me to issue commands from any player I want. I can select a few of one player's ships and give them a move order, and a very short time later the ships are moving on all the other players' screens.

The bad news is that each game stalls fairly frequently. I'm still analyzing the data, but I'm having trouble coming up with any decent theories. The game instances just can't seem to communicate fast enough to keep ahead of the simulation updates. I'd like to hear any general advice or thoughts on what my problem might be. I doubt what I've said so far will be very useful, so I'll post a concise summary of my logic below.

First, some background: my game is being developed in Python with PyGame and PyOpenGL. I'm using Python's socket module for my low-level communication. This will eventually be replaced with something like enet or RakNet, but for now the sockets allow me to easily send messages between players via TCP/IP. I'm working on a 1.6 GHz laptop; running two instances of my app yields frame rates of roughly 80 FPS. Ping is usually around 10 ms, but I've seen values as high as 20 ms across the loopback. The game attempts to update the world every 50ms (i.e. 20 "ticks" per second), and player commands are scheduled two ticks in the future.

Without further ado, here is my main loop, with special emphasis on the lockstep networking code:

-------------------------------------

receive messages from other players over the network

process local input and create game commands

sysTime = current local system time

if local input commands have been created:
    schedule the commands in the local world simulation for world tick + 2
    send the commands to the other players

if player is host:
    lastAllReadyTick = last tick for which a "ready" message was received from all players
    if an "all ready" message wasn't sent for lastAllReadyTick:
        tickLength = best tick length that can be achieved by every player (based on pings and frame rates)
        send "all ready" message to all players; the message includes the calculated tickLength

if sysTime > nextTickTime:
    nextTickTime = sysTime + tickLength
    stalled = lastAllReadyTick < world tick

    if not stalled:
        if local player is a client and hasn't sent a "ready" message for world tick + 2:
            bestTickLength = highest average ping to any other player OR local frame length (whichever is greater)
            send "ready" message to host

        update the world simulation
        world tick += 1

render graphics
-------------------------------------
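
For concreteness, here's a rough Python sketch of that loop. The World class and the net transport are simplified stand-ins rather than my real code, and the ready/all-ready bookkeeping is condensed:

-------------------------------------
import time

COMMAND_DELAY = 2    # commands execute this many ticks in the future
TICK_LENGTH = 0.05   # 50 ms per tick (20 ticks per second)

class World:
    """Stand-in for the real deterministic simulation."""
    def __init__(self):
        self.pending = {}  # tick -> list of commands
    def schedule(self, tick, commands):
        self.pending.setdefault(tick, []).extend(commands)
    def update(self, tick):
        for cmd in self.pending.pop(tick, []):
            pass  # apply cmd deterministically here

def main_loop(world, net):
    # 'net' is a hypothetical transport with poll(), send_to_host(),
    # and send_to_all(); messages are (kind, tick, payload) tuples
    world_tick = 0            # next tick the simulation will process
    last_all_ready_tick = -1  # last tick the host confirmed "all ready"
    last_ready_sent = -1      # last tick we reported "ready" for
    next_tick_time = time.time()

    while True:
        # drain the network: "all ready" advances last_all_ready_tick,
        # remote commands are scheduled exactly like local ones
        for kind, tick, payload in net.poll():
            if kind == "all_ready":
                last_all_ready_tick = max(last_all_ready_tick, tick)
            elif kind == "commands":
                world.schedule(tick, payload)

        # local input becomes commands scheduled COMMAND_DELAY ticks ahead
        commands = []  # in the real game, gathered from input events
        if commands:
            world.schedule(world_tick + COMMAND_DELAY, commands)
            net.send_to_all(("commands", world_tick + COMMAND_DELAY, commands))

        # advance the simulation at fixed intervals, unless stalled
        now = time.time()
        if now > next_tick_time:
            next_tick_time = now + TICK_LENGTH
            stalled = last_all_ready_tick < world_tick
            if not stalled:
                if last_ready_sent < world_tick + COMMAND_DELAY:
                    last_ready_sent = world_tick + COMMAND_DELAY
                    net.send_to_host(("ready", last_ready_sent, None))
                world.update(world_tick)
                world_tick += 1

        # render graphics here every pass; the world only moves on ticks
-------------------------------------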
In case it isn't obvious from my pseudo-code, here are a few of my high-level thoughts:

1) Each player updates the world simulation at fixed intervals.
2) At the start of each tick, each client will send a "ready for tick X" message to the host.
3) When the host receives "ready for tick X" from all players, an "all ready for tick X" message will be sent to all other players.
4) The lastAllReadyTick variable stores the last tick for which an "all ready" message was sent or received. The game is stalled if the world simulation's tick exceeds this value.
5) The tickLength variable may change periodically, due to calculations by the host player's game. Changes to tick length take effect when the next tick is scheduled. For example, the host player may schedule tick 7 immediately after setting tickLength to 55, but client #1 may not receive the updated tick length until tick 8, due to network latency. In that example, the host's and client's tick lengths for tick 7 will differ.
6) The "world tick" variable is technically the simulation tick that will be updated next. Therefore, the game instance is considered "stalled" if lastAllReadyTick < world tick. If lastAllReadyTick == world tick, then the simulation still has one more tick to process.

I apologize for the length of this post. Hopefully someone takes the time to read it and gives me that small nugget of advice that will push me forward. I can even hope that this thread eventually becomes a decent resource for future lockstep RTS developers.

Thanks in advance,
Mike
So, are you basically counting the game as 'stalled' if you don't get a response within the 50ms? My concentration span's too short to read all your code and explanations to try and work out what you mean, I'm afraid.

How much data do you send each time? And have you tried "mySocket.setsockopt(IPPROTO_TCP, TCP_NODELAY, 1)"?
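
In Python's socket module that looks something like this (assuming mySocket is the TCP socket you send game messages on):

-------------------------------------
import socket

mySocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# disable Nagle coalescing so small game messages go out immediately
mySocket.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
-------------------------------------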
The game is "stalled" if a response isn't received within 100ms, since I'm delaying the commands by two ticks.

I'll give TCP_NODELAY a shot.
I can't tell if it made a difference or not. I'm still getting stall messages. I'll analyze the log messages some more tonight after work and see if I can figure anything out.
If you're only sending small amounts of data relatively infrequently and using TCP, even with Nagle coalescing disabled you could be suffering from the way most stacks now utilise delayed ACKs. As insane as it sounds, see if the stalls happen less if you increase the amount of data you're sending.
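
Something like this would let you test it (frame_message and MIN_PAYLOAD are just placeholder names; it assumes each message is length-prefixed so the receiver can discard the padding):

-------------------------------------
import struct

MIN_PAYLOAD = 512  # bytes; an arbitrary test value, not a tuned number

def frame_message(data):
    # 4-byte prefix holds the real length; the receiver reads the prefix,
    # takes that many bytes, and skips the rest of the frame (every frame
    # is at least MIN_PAYLOAD bytes, so it knows how much to skip)
    frame = struct.pack("!I", len(data)) + data
    if len(frame) < MIN_PAYLOAD:
        frame += b"\x00" * (MIN_PAYLOAD - len(frame))
    return frame
-------------------------------------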
If you are running multiple games on a single CPU, scheduling (as in CPU/threads/processes) may come into play. The Windows pre-emption quantum can be as big as 150 milliseconds, depending on system settings.

Try adding a Sleep(10) after processing each game tick, in each client. This will reduce frame rate, but will probably improve responsiveness.
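
In Python that would be something like:

-------------------------------------
import time

# after processing each game tick, yield the rest of the time slice
# so the other local instance gets scheduled promptly
time.sleep(0.01)  # roughly the Sleep(10) suggested above
-------------------------------------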

Last, for play over the internet, a two-tick delay between command and action is not enough for practical play. I would suggest counting on at least 250 ms of latency for robustness.
enum Bool { True, False, FileNotFound };
Thanks for the ideas, guys. I'll try them all tonight.

I suppose I should have mentioned that my laptop is a dual-core Turion 64. How would that affect scheduling CPU time slices?
It'd probably help, unless they both end up getting scheduled onto the same core. Do you have any problems if just two copies are playing? If you hook it up to a second computer, do you have any problem with two instances on one machine and a third on the other?
You're running a PyGame app; are you emptying the event queue every single frame (as you should be)?
Quote:Original post by Kylotan
You're running a PyGame app; are you emptying the event queue every single frame (as you should be)?


I believe so. At the start of the main loop, I have a large "for event in pygame.event.get():" loop. That should clear the event list, right?
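
For reference, it's basically the standard idiom, something like:

-------------------------------------
import pygame

# once per frame: drain the entire event queue; pygame.event.get()
# with no arguments removes and returns every pending event
for event in pygame.event.get():
    if event.type == pygame.QUIT:
        running = False  # 'running' would be the main-loop flag
    # ... translate the remaining events into game commands ...
-------------------------------------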

I'm starting to think I have a fundamental flaw in my logic. The host instance on my laptop was stalling, even when the client instance was running on my desktop machine. Tick length was set to 100ms, and command delay was set to 5. Stalls seemed to occur less frequently, but they still happened fairly regularly.

Another idea I just had was to schedule ticks based on the scheduled tick time + tick length. Right now, I'm scheduling the next tick based on the current system time. If tick X was scheduled to occur at time 1000, it might not actually be processed until 1020, depending on how quickly the main loop is running. Tick X + 1 would then be scheduled for 1120, which significantly disrupts the "one tick every 100ms" rule. If I scheduled the next tick time for 1000 + 100, I would get a more consistent tick rate on all machines, which might help eliminate the stalls.
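
Something like this, using the variable names from my pseudo-code (the caveat being that the loop must be willing to run several ticks back-to-back to catch up after falling behind):

-------------------------------------
# current scheme (drifts): each late tick pushes every later tick back
nextTickTime = sysTime + tickLength

# proposed scheme (drift-free): anchor to the previous deadline instead,
# so a late tick is followed by a shorter wait and the long-run rate
# stays at one tick per tickLength
while sysTime > nextTickTime:
    nextTickTime += tickLength
    # update world simulation ...
-------------------------------------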

I didn't try padding my messages with extra data, so I'll try that tonight, too.

- Mike
