Details of dynamic load balancing

Started by
73 comments, last by hplus0603 16 years, 6 months ago
I was looking at the Asheron's Call server technology, where the load is apparently dynamically distributed over several servers. I'm sure it's been used again more recently, but I'm out of touch with more modern MMOs. Anyway, I'm interested in how all this works, but can't find much information on it. So any answers or even speculation is welcome. I'm assuming there's something like a thin proxy layer between the client and the server pool, and that it can be told to switch which server each client is currently handled by. I'm unsure as to how the handover from one server to another would work though. One would assume that each character is managed by just one server at any given time - all processing regarding that character happens in that process. But however you divide the load up, I would assume it's possible for a character on one server to be able to see, and therefore interact with, a character on another server. In such a case, which server gets to make the call on what happens, and perform the relevant data writes etc.? And whatever arbitration strategy exists, how does it cope with race conditions where 2 servers have a different opinion of the interactions between 2 characters?
Advertisement
Quote:I'm assuming there's something like a thin proxy layer between the client and the server pool, and that it can be told to switch which server each client is currently handled by.


Connection proxy, front-end process, message router.

Quote:I'm unsure as to how the handover from one server to another would work though. One would assume that each character is managed by just one server at any given time - all processing regarding that character happens in that process.


Serialize it to another machine, let proxy know that the location has changed.

Quote:But however you divide the load up, I would assume it's possible for a character on one server to be able to see, and therefore interact with, a character on another server. In such a case, which server gets to make the call on what happens, and perform the relevant data writes etc.?


Up to you. There's always only one server that may change the state. All others will simply listen to these changes. Race conditions, reliability, ACID all come into play for transactions (trades, item transfers, ...)

Quote:And whatever arbitration strategy exists, how does it cope with race conditions where 2 servers have a different opinion of the interactions between 2 characters?


If you have that, you're in trouble. Interactions between servers must be reliable in the first place, so that all known race conditions are foreseen, and resolved in deterministic manner.

Situations where it doesn't matter may simply ignore it (inconsistent location could be one).

Another way is full-state synchronization, where all servers sync entire state, or wait until such synchronization is possible, before proceeding further.

Yet another way could be similar to enterprise solutions, where everything is passed through reliable global message queue, which enforces proper sequencing.
Quote:Original post by Antheus
Quote:But however you divide the load up, I would assume it's possible for a character on one server to be able to see, and therefore interact with, a character on another server. In such a case, which server gets to make the call on what happens, and perform the relevant data writes etc.?


Up to you. There's always only one server that may change the state. All others will simply listen to these changes. Race conditions, reliability, ACID all come into play for transactions (trades, item transfers, ...)


Yeah, I know it's 'up to me', the question is, how? :) If 2 people on separate servers approach each other and finally fight/talk/whatever, what strategy would be used for handling that? The immediate approach would be to have the mediator process pick the least loaded of the 2 servers in question and push the other character onto it. But is that always practical? Could there be 3rd (and 4th, 5th...) characters involved, perhaps at longer distances, which could break this strategy by ending up with all the characters on one server? Or is it just a case of ensuring that binary transactions can only happen at such a short range that this will never be a problem?

Quote:If you have that, you're in trouble. Interactions between servers must be reliable in the first place, so that all known race conditions are foreseen, and resolved in deterministic manner.


Again, I'm aware it's a problem, but am interested in how to approach it. eg. If 2 players on different servers cast an instant death spell on each other at approximately the same time, how do we approach that? At some point the client action can be flagged as affecting another player, which would trigger an atomic migration of the other player, which in turn means that there ends up being just 1 server that can handle it completely. I just wonder how this is done.

I suppose that any such event will have 2 characters, and events that reference characters on other servers, and pulling external players across from another server is an atomic operation enforced by the mediator process. The event on the 2nd server would then fail, and I suppose there has to be a clean way of dealing with that.
Quote:Again, I'm aware it's a problem, but am interested in how to approach it. eg. If 2 players on different servers cast an instant death spell on each other at approximately the same time, how do we approach that? At some point the client action can be flagged as affecting another player, which would trigger an atomic migration of the other player, which in turn means that there ends up being just 1 server that can handle it completely. I just wonder how this is done.


It's still up to you.

You need to establish some way to order the events. Which happened first? Or did they both happen at exactly the same time.

The communication between servers then needs to be transacted.

P1 (Server1, Time:116): cast Insta-death
send message1 from P1 to P2
P2 (Server2, Time:117): cast Insta-death
send message2 from P2 to P1
P2 receives message1.
P2 is already in middle of transaction
P1 receives message2.
P1 is already in middle of transaction


Now the problem occurs, since you have a dead-lock. You need a way to determine how to resolve this. One way is to abort one of the transactions. Using time-stamps, P1 shot first.

So it resolves like this:

P2 receives message1
message1 was started at 116, P2's transaction at 117.
Abort P2's transaction.
Send abort message to P2's transaction.
Roll back P2 to state before transaction was started.
Perform message1.
P2 dies.
Send transaction complete to P1.

That's one way, but it doesn't support events that occur in the same time.

It's not feasible to migrate clients across servers for each operation. What about large pvp, where there's AoE? You'd need to have hundreds of thousands of migrations for each possible interaction.

Another option is bucked synchronization. There, you process all events that a client received in time t..t+dt. In our case, it would be 110 - 120. Player one would perform insta-death and a few more actions. Player two would do the same, killing the other player in the process.

The state here wouldn't be perfectly accurate, but would be accurate enough. This also preserves the state, as long as actions on individual objects are performed atomically. Although by the time P1 receives message that it will be insta-killed by P2 (P2 has been already insta-killed by P1), P1 would still need to die.

This is the reason for warm-ups and cooldowns, and projectiles, and overall slow action. It blurs such events and makes them believable.


Transactions are tricky for other reasons. This is a trivial transaction. But consider 4 participants, all on their own servers. P1 affects P2,P3, which sends message to P1-P4. At same time, P2 affects P4, which requires feedback from P1. P3 and P4 both affect P1. Lots of possible dead-locks, roll-backs, conflicts, ...

This is why many actions are often not transacted, and inaccurate state is used instead where it doesn't matter (combat, movement), and transactions are used only for important infrequent slow events (trades, quests, spawns).

Trades are even worse. Let's say someone opens trade window, then goes AFK. Transaction started, what now. And player sending message will be aborted. Putting such character in the middle of combat zone will cause all AoEs to be aborted.
Quote:Original post by Antheus
It's not feasible to migrate clients across servers for each operation. What about large pvp, where there's AoE? You'd need to have hundreds of thousands of migrations for each possible interaction.


Surely it wouldn't need migration for each operation, just the first few, until all characters that are interacting are on the same server. Once you get to that point, that server owns all the actors it needs to perform future operations on them.

I would hope that area of effect attacks, and indeed many other interactions, could be handled by asynchronous messages, which can be broadcast by the client's current server and the results of which filter back later if needed.

I just wonder about how unwieldy it would get if you need to query data about a character that is on a different server, without resorting to explicitly asynchronous requests for data.

Quote:This is why many actions are often not transacted, and inaccurate state is used instead where it doesn't matter (combat, movement), and transactions are used only for important infrequent slow events (trades, quests, spawns).


Yeah. I can see the benefits of doing everything via asynchronous events wherever possible and not worrying about slightly out-of-date information, after all, even the Instant Death Spell example isn't a big deal if both casters end up dying.
Quote:Once you get to that point, that server owns all the actors it needs to perform future operations on them.


You cannot do that universally. If you can put all objects into a single server, then you don't need load balancing.

What happens once the server is full? There goes your benefit...

The concept of self-organizing maps has been explored, mostly with distributed hash tables. Even there, pre-calculated distributions are used.

Then there's another problem. Server that owns objects needs to maintain world data as well. How long does that take so set up, and how frequently can you switch it?

How do you limit the largest "size", whatever that means, that a single server can span. What happens, when that size isn't sufficient.

Your players form a chain spanning your entire world (1000 kilometers). Then they attack each other in sequence. All of them are migrated to single server, and that server now needs to load 1000km of terrain.

While contrived example, how can you prevent such event from occuring? Or do you rely on pure luck that some such chain won't occur.
Well, it's not much different from the case where loads of people were in one Everquest zone or something. It's certainly suboptimal, but it's not usually fatal in itself.

And it still leaves me wondering; if 2 characters are managed by different servers, how does server 1 query a character on server 2 for data? I suppose each server could have a read-only character cache, pulled from the persistence layer as needed, periodically invalidated somehow.
Quote:Original post by Kylotan
Well, it's not much different from the case where loads of people were in one Everquest zone or something. It's certainly suboptimal, but it's not usually fatal in itself.


Everquest had a lot of problems with zoning initially. Could have something to do with protocol. It was recently mentioned for SWG that player warping was not transacted, which results in frequent problems entering the dungeons when players cross different server boundaries.

Quote:And it still leaves me wondering; if 2 characters are managed by different servers, how does server 1 query a character on server 2 for data? I suppose each server could have a read-only character cache, pulled from the persistence layer as needed, periodically invalidated somehow.


It could. But what happens if a character dies, but you haven't been notified yet and use old values.

How often do you synchronize? On every change? By batch? You can also always delay by one game-step and take that into consideration in logic.

Or... You can go the Erlang way, and share nothing, just pass everything by value in messages. Which is not nearly as bad as it sounds. So rather than combat manager querying participants, Player 1 sends a message (P1, with a gun, 500 damage, fire) to Player 2. P2 then deals with this information, and applies the damage to itself.
Asheron's Call wasn't dynamic; it split the world across different servers with a fixed grid cell size AFAICR. When you walk close to a border, a ghost of you gets created on the other cell (just like a ghost of you is created on your client). When it's time to hand off, your client connects to the new server, is connected to two servers for a brief moment, and when the new server is ready, it requests authority from the old server; the server side objects switch between authoritive and ghost, and the hand-off is done.

Note that there is a small period of time where the transaction is currently happening, where the authority is nowhere -- the previous server has handed it off, the packages are in flight, and the new server hasn't yet accepted it. That's OK. If the servers crash, you'll go back to the last checkpointed state, just like when there's a single server authority.

Finally, if you don't want to have the client connected to two servers, you can switch the connection, and forward client inputs from ghost to master until the connection is changed. You could then drive the client re-connection based on where the data comes in -- don't tell the client to switch, until you receive input for a ghost. At that point, tell the client who the master is.
enum Bool { True, False, FileNotFound };
Quote:Original post by Antheus
Quote:Original post by Kylotan
I suppose each server could have a read-only character cache, pulled from the persistence layer as needed, periodically invalidated somehow.


It could. But what happens if a character dies, but you haven't been notified yet and use old values.

How often do you synchronize? On every change? By batch? You can also always delay by one game-step and take that into consideration in logic.


I don't know how often I'd synchronise, but I expect it it doesn't take long to broadcast a message to invalidate the data for one character. Then it'll just get synchronised next time it's requested. (Or earlier, if there was some sort of intelligent pre-caching.)

Quote:Or... You can go the Erlang way, and share nothing, just pass everything by value in messages. Which is not nearly as bad as it sounds.


It solves a large class of problems, but still leaves you with others. Possibly foremost is performance, since sending a network message to another server to retrieve data is going to be very slow compared to caching it. Almost as big a problem is the programming paradigm - nobody is going to be coding in Erlang, and wedging such an asynchronous system into C++ is not going to be pretty or intuitive to work with. instead of just being able to display "character[id]->hit_points", you now have to presumably form a message to request the character in question and pass in a callback to be executed when a reply arrives. No doubt it would fit well with Stackless Python's coroutines, and may be exactly how EVE Online does things. But it doesn't seem very practical for typical use.

Quote:Original post by hplus0603
Asheron's Call wasn't dynamic; it split the world across different servers with a fixed grid cell size AFAICR.


Several articles, including this one on Gamasutra by their lead designer, claim that the load balancing was indeed dynamic and that it wasn't geographically based. It's quite low on detail beyond that, however.

Quote:When you walk close to a border, a ghost of you gets created on the other cell (just like a ghost of you is created on your client).


I'm having trouble envisaging what happens if someone attacks your 'ghost' or otherwise attempts to interact with it. In fact, I think that's the crux of the entire problem for me.

Quote:When it's time to hand off, your client connects to the new server, is connected to two servers for a brief moment, and when the new server is ready, it requests authority from the old server; the server side objects switch between authoritive and ghost, and the hand-off is done.


Strangely I'd thought that you'd need a proxy to stay connected while changing servers, but had completely overlooked the obvious possibility of having 2 connections open at once. And this allows for bandwidth load balancing as well as CPU and memory load balancing, which is good.

This topic is closed to new replies.

Advertisement