Longstreet

Attempt at load balancing


I've been working for a few weeks (very slowly) on a set of classes on top of OpenTNL (the Open Torque Network Library) to handle balancing game instances across several servers. I tried to do this last year with RakNet, but I scrapped that when it became incomprehensible. I don't know how this type of thing is supposed to be done, so it's basically the most naive implementation possible.

Each server is broken into a number of "pockets" (my terminology). A game world can be added to each pocket. The "world" can actually be anything: chat, VoIP, Quake 3, whatever. The idea is that the worlds can trigger object/client movement between pockets, such as walking over a server line or entering a new chat room. The pocket manager for the server handles the thread that signals the worlds to update each frame, if they need to.

The master server keeps track of how much time each pocket takes to update one frame. It then uses an extremely simple (naive) greedy algorithm to move the pockets around so each server takes about the same amount of time to make one complete update.

That's about as far as I've gotten, and of course it's very abstract. For instance, anyone can connect to the master server. The client must request to be put in an entry pocket of some type. That's all the master server does. The world implementation inside the pocket has to handle further authorization such as login, character selection, or whatever it actually does.

I may just be wasting my time, but so far I haven't seen anything specific on the subject, and it interests me. I've worried about taking network load into account, but I'm not sure how to quantify it or weight it in the algorithm. What do other people think? Thanks,
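To make the balancing step concrete, here is a minimal sketch of one possible greedy scheme. It is not my actual code, just the general idea: place the heaviest pockets first, each onto whichever server currently has the least total frame time. The names `greedy_assign`, `server_loads`, and the pocket ids are made up for illustration, and the sketch rebuilds the assignment from scratch, so it ignores the cost of actually moving a pocket.

```python
import heapq


def greedy_assign(pocket_times, server_count):
    """Assign pockets to servers, heaviest first, onto the least-loaded server.

    pocket_times: dict mapping a pocket id to its measured frame time (ms).
    Returns a dict mapping server index to a list of pocket ids.
    """
    # Min-heap of (total_load, server_index) so the lightest server pops first.
    heap = [(0.0, s) for s in range(server_count)]
    heapq.heapify(heap)
    assignment = {s: [] for s in range(server_count)}

    # Heaviest pockets first; ties broken by pocket id for a stable result.
    ordered = sorted(pocket_times.items(), key=lambda kv: (-kv[1], kv[0]))
    for pocket, cost in ordered:
        load, server = heapq.heappop(heap)
        assignment[server].append(pocket)
        heapq.heappush(heap, (load + cost, server))
    return assignment


def server_loads(pocket_times, assignment):
    """Total frame time per server for a given assignment."""
    return {s: sum(pocket_times[p] for p in pockets)
            for s, pockets in assignment.items()}


# Hypothetical measurements: frame time in milliseconds per pocket.
times = {"a": 8.0, "b": 5.0, "c": 4.0, "d": 3.0}
plan = greedy_assign(times, server_count=2)
loads = server_loads(times, plan)
```

Network load could in principle be folded into the per-pocket cost as a weighted sum, but the weights would have to come from measurement, which is the part I don't know how to do yet.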

Live load-balancing transfers are extremely costly in terms of time. It's really not recommended to move chunks of world around while live, as it's extremely difficult to accurately gauge when to do so, except in the most blatant situations. Occasionally, load-balancing transfers can cause more problems than they solve:

Let's consider a bog-standard MMO, broken into zones of varying size (large terrain zones and small 'building' zones representing the inside of one building). Authority for zones can be moved from process to process.

A transfer requires:
1a) Pack up the state within the given zone (freezing time in that zone); this is work on the source process.
1b) Simultaneously load up the zone on the destination process.
1c) Inform simulation layer (if multi-zone simulation is used) that events for the zone should be queued.

2a) Transfer of states to new process.
2b) Inform clients of new process IP / alter routing information for client if SW gateway / server multiplexing is used.
2c) Inform simulation layer (if multi-zone simulation is used) of new zone location.

3a) Unfreeze zone on destination process
3b) Resolve queued events for the zone on the destination process
3c) Unload and free resources associated with the zone on the original process.
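
The steps above can be sketched in a few lines. This is an in-memory toy under heavy assumptions: `ZoneProcess`, `transfer_zone`, the `routing` table, and the `apply_event` callback are all hypothetical names, and a real transfer would do each step over the network with the zone genuinely frozen in between.

```python
from collections import deque


class ZoneProcess:
    """Hypothetical stand-in for one server process that hosts zones."""

    def __init__(self, name):
        self.name = name
        self.zones = {}  # zone id -> state dict


def transfer_zone(zone_id, source, dest, routing, pending_events, apply_event):
    """Move authority for zone_id from source to dest (steps 1a to 3c)."""
    # 1a / 1c: the zone is frozen; events that arrive from now on are
    # assumed to be appended to pending_events instead of being applied.
    packed_state = dict(source.zones[zone_id])  # snapshot of the zone state

    # 1b / 2a: load the zone on the destination and hand over the state.
    dest.zones[zone_id] = packed_state

    # 2b / 2c: point clients and the simulation layer at the new process.
    routing[zone_id] = dest.name

    # 3a / 3b: unfreeze and resolve queued events in arrival order.
    while pending_events:
        apply_event(dest.zones[zone_id], pending_events.popleft())

    # 3c: unload the zone on the original process.
    del source.zones[zone_id]


def add_gold(state, amount):
    """Example event handler: an event is just a gold delta here."""
    state["gold"] += amount


proc_a, proc_b = ZoneProcess("A"), ZoneProcess("B")
proc_a.zones["town"] = {"gold": 10}
routing = {"town": "A"}
queued = deque([5, -3])  # events that arrived while the zone was frozen
transfer_zone("town", proc_a, proc_b, routing, queued, add_gold)
```

Even in this toy, everything between the snapshot and the event replay is time during which the zone cannot advance.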

As you can see, the transfer is quite a lot of work, and it causes lag not just in the zone being transferred but also (if inter-zone interaction is allowed) in zones trying to interact with it.

Now, transferring live zones (full of clients) around is a nasty thing to do - each client experiences the additional lag of the transfer on top of whatever lag they already had (in essence much heavier lag for a bit, then far less).
A better idea is to maintain stats over a longer period of time to determine which processes should be in charge of which zones, and update them at cluster restart, or when the cluster is 'quiet' (few clients connected).

I maintain persistent network processing stats for each and every 'zone' in my hierarchy (PORs in my posts). These are used to determine placement on cluster startup (with spurious statistics smoothed out). I've also implemented a 'lag warning' mechanism: should the lag at any point exceed a cutoff period, an administrator can relocate the zone manually after examining the reason for the network processing delay (command queue, volume of data, hardware problem).
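
As a rough illustration of the idea (not my actual implementation), the sketch below keeps per-zone processing samples, flags any sample over a cutoff, and uses the median as the placement figure. The class name `ZoneStats`, the cutoff value, and the choice of the median as the smoothing method are all assumptions made for the example.

```python
from statistics import median


class ZoneStats:
    """Hypothetical per-zone record of network processing times (ms)."""

    def __init__(self, lag_cutoff_ms):
        self.lag_cutoff_ms = lag_cutoff_ms
        self.samples = []

    def record(self, processing_ms):
        """Store a sample; return True if it should raise a lag warning."""
        self.samples.append(processing_ms)
        return processing_ms > self.lag_cutoff_ms

    def placement_cost(self):
        """Smoothed cost used for placement at cluster startup.

        The median is one simple way to damp spurious spikes; it is an
        assumption here, not the only reasonable choice.
        """
        return median(self.samples)


stats = ZoneStats(lag_cutoff_ms=50.0)
warnings = [stats.record(ms) for ms in (10.0, 12.0, 300.0, 11.0)]
cost = stats.placement_cost()
```

The single 300 ms spike raises a warning for an administrator to look at, but barely moves the figure used for placement.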

In short: don't try to load balance every cycle (you will make things worse), and only transfer when you really have to.

Sure, if the load distribution never varies much then it's not worth it. As you say, if load stays fairly constant, then the server will only rebalance itself one or a few times a day. It's there for the unexpected.

I haven't implemented the rebalance trigger yet, but I think it would need to be a time-averaged standard deviation of the per-server update times, triggering above some threshold. The balance algorithm will also prefer to move smaller zones first. Sure, for only a 5% increase in performance it may not be worth it. But I would prefer a few seconds of zone lock if it keeps the server responsive the rest of the time, and not all types of application/games can depend on static load-heavy areas. This is a learning experience of course; I don't expect to make anything magical.
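
Since the trigger isn't written yet, this is only a sketch of what I have in mind. It takes the standard deviation of the per-server frame times each tick, smooths it with an exponential moving average, and fires when the smoothed value crosses a threshold. The class name `RebalanceTrigger`, the `alpha` smoothing factor, and the threshold are placeholders, and the exponential average is just one way to do the time averaging.

```python
from statistics import pstdev


class RebalanceTrigger:
    """Fires when the time-averaged spread of server loads gets too large."""

    def __init__(self, threshold_ms, alpha=0.2):
        self.threshold_ms = threshold_ms
        self.alpha = alpha        # weight given to the newest measurement
        self.smoothed = None      # running average of the spread

    def update(self, server_frame_times):
        """Feed one tick of per-server frame times; return True to rebalance."""
        spread = pstdev(server_frame_times)
        if self.smoothed is None:
            self.smoothed = spread
        else:
            self.smoothed = (self.alpha * spread
                             + (1.0 - self.alpha) * self.smoothed)
        return self.smoothed > self.threshold_ms


trigger = RebalanceTrigger(threshold_ms=6.0, alpha=0.5)
balanced = trigger.update([10.0, 10.0])     # servers even, spread is 0
one_spike = trigger.update([0.0, 20.0])     # first uneven tick
sustained = trigger.update([0.0, 20.0])     # imbalance persists
```

With these numbers a single uneven tick does not fire, but a second one in a row does, which is the behaviour I want: ignore blips, react to sustained imbalance.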

I also want to experiment with dynamically splitting zones, but I haven't thought of a way to generalize it.
