Distributed server architecture for load balancing

_winterdyne_ · 2005-10-27T12:47:30

Okay, seeing as my previous architecture thread didn't seem to draw a lot of flak, I guess most of it was at least semi-rational. I hope. However, I'm trying to puzzle a way around congestion based lag, and I'd like to ramble about it where it can get critique. Pick it apart as best you can please. Apologies if this seems long and drawn out, it is. But I hope you'll find it a relatively interesting read. I mentioned a relevance layer previously, and since what I'm talking about relies on the concept, I'll describe it here, along with a brief description of the architecture. Architecture: A game instance is run on one or more machines (boxes) and consists of a three kinds of processes - a game database server (pretty standard), a 'master' (world) server and a number of generic 'slave' (zone) servers, which operate in a heirarchy. There need only be one server process running on any given box, but there can only be one 'master' within a cluster (sometimes referred to as a microcluster). Edit: Connected clients are handled through a connection object which migrates between server processes as clients move around. The overall structure of the system looks sort of like this: Relevance Graph: The game universe in the system is not broken up into rectangular 2d zones, but instead is organised more like a heirarchical (sp?) scenegraph, with network-aware areas of variable size. This is referred to as the 'relevance graph'. Nodes in this graph can be thought of as 'points of relevance', and coincide with physical features, such as rooms, areas of terrain, etc. Actors and other objects in the game universe a 'relevance limpets' and are *always* attached to a point of relevance. The relevance graph is used to determine event relevance within the game system, both at the network and simulation layers. Anything that happens in a point of relevance is 'most likely' going to be important to everything contained in it. Neighbouring points may only receive events of a certain type (a soundproof, sealed glass box might only receive visually-oriented messages). The graph allows a designer to set what spreads how at any point in the world. Events also have a given radius of effect which is checked with the relevance graph as well. Neighbourhood is not used here to imply physcial adjacency, just that events of certain types occuring within (or passing through) one point of relevance may affect another. Absence of neighbourhood implies that points should have NO bearing on each other. Shortcuts are used around this system in some places, where a specific entity is the target of an event. The master server maintains a list of where all entities are, so a message can be efficiently delivered without propogating along the heirarchy as normal. A server process is given a node of the relevance graph to deal with, and it deals with that and all points leafward (generally 'contained by') of that point, unless another server takes control. Slave servers are also informed of a master, which will manage the addition of slaves. Slaves are informed of neighbour or child relations with other slaves, so direct communication amongst themselves is possible. All of the points of relevance that a server process has control over are referred to as its 'domain'. A domain, functionally, is close to a traditional mmorpg zone, whereas the point of relevance is more conceptually similar. Update process: Updating the relevance graph occurs in three phases, all of which occur in one server 'tick' in the server update threads. The threads on each of the servers in a microcluster are brought into sync during this process. Firstly, a logical update runs, which fires the events from their generators and adds them to any relevant recipients. High priority UDP is used to transfer these across server boundaries if required. This process must complete before the second phase can begin. This phase is started by the master server and triggers the process in slaves through a cascade effect. The second phase of the update is the handling of receivers that have had an event passed to them. This may include the addition of messages to those receivers' outbound queues, including appropriate network messages to an avatar's associated client. This process occurs simultaneously on all server processes in the microcluster, and is synchronised (started, flagged finished) by the master server. Multi processor servers may utilise worker threads to process more than one part of the heirarchy simultaneously. The final phase is network transmission (incoming messages are handled by a separate thread) to clients - this is done on each server in batches, a POR at a time. Firstly out of date or obsolete empirical state updates are discarded from the outbound queues, then those queues are processed and sent. This process has an overall timeout value, and unsent messages are preserved for the next cycle (and are sent in order/priority). Timeout occurences indicate congestion in the area and mark the process as congested. Timing and data stats are kept for data transmission and time to transmit at each POR and can be used to determine where the congestion is actually occurring. UNDER time exits (early finish) for this process is also recorded by the server process. Stats are transferred up to the master server regularly for congestion detection and handling. Illustrative example (edit): Consider the diagram below: Each box represents a node in the relevance graph. Here we have a simple game world where the two areas, Dark Forest and Dwarf Mountain are segregated. Both are 'adventuring areas' and it's assumed the player base will usually be evenly split between these two areas. As such, Dwarf Mountain has been assigned to a slave server. The heirarchies in each domain exhibit internal neighbourhood - Dark_forest_main, the POR that models the bulk of the Dark Forest is designated as a neighbour of the Dark_forest_clearing. Any message generated in either of these can have an effect (subject to range, etc) on the other. Cross-server neighbourhood is illustrated between the cellar, and the tunnel. This is a good example of where the population can be kept low (limited numbers will fit in a cellar or tunnel) to limit the amount of traffic between the two POR's. Parent-child relationships imply neighbourhood (what happens in the clearing is heard and seen in the cottage and vice versa) but only by one step - what happens in the clearing is NOT seen or heard in the cellar. Considerations: I expect the majority of traffic to be chat and object-description exchange. Such traffic should mostly be contained within a single server, with descriptions being drawn from cache, not database. Cross-server traffic is likely to be '/tells' or events occuring 'at the edges' of a server boundary. Good (physical) world design should keep this low. I don't want to enforce a 'zone limit' for occupancy unless I can possibly help it, apart from areas where such a limit makes sense (typically room-like areas). This is an option open to a game designer, and although potentially useful is not a requisite of the library I'm building. I do want to allow more than 50 players to gather in a quiet location and let them meet without them lagging up an entire server, especially somebody's combat. In games where real-time (or almost real-time) combat is used this would be highly annoying. There are a *lot* of mobiles wandering around the world. The system is NOT designed to cope with 'blanket coverage' of players, but more with uneven distributions. Mobiles are aggregated and simplified when not visible to a player, and indeed can usually be aggregated even when visible. The system's designed to be configured at startup (before players connect) so the master server can inform all the various slaves what they need to manage and allow time for load up and synchronisation of the abstract simulation layer (simplified geography and demographics used for statistical simulation of mobiles, resources etc). This takes some time. Congestion here I define as the situation where the network exchange TO CLIENTS in the server process update times out significantly and persistently, apparently due to traffic to an isolatable (leaf) point of relevance. Note that traffic is measured in complete UDP message sizes. Packet storming is a server security issue and is dealt with at the UDP level itself. The strategy I have in mind is as follows: Defining significant as a point where more than 10% of network traffic is left over still to process persistently; Defining persistently as for a period of approximately 3 second, or proportionately less if the volume of outstanding network traffic (congestion) increases. On detecting a congestion situation in a server process, we locate areas of congestion within that process' - first looking at leaf nodes averages, then progress up the relevance heirarchy until a disproportionate congestive average is found at a certain layer. We eventually determine an area of the heirarchy that is responsible for the congestion, and know how much traffic it has pending, how much it typically generates, and also the statistics of all processes, and as a result each box running such processes in the microcluster. What I can't decide is what to do with the guilty chunk of heirarchy once it's identified, and I'm trying to think of ways of reintegrating the heirarchy once it's no longer necessary to be segregated from its greater body - and indeed how to judge that situation. When such a segment of the relevance graph is definitely causing congestion it seems obvious to transfer ownership of that chunk to a quieter process. This has the obvious flaw of causing fragmentation of the relevance graph, which is a Bad Thing. I need to come up with a means of determining whether the cost of transferring a chunk of the graph is worthwhile, and some form of 'defragging' the graph occasionally. Does anyone know of any systems that do this, or has anyone got any ideas for things I might need to track to make this work efficiently? Sorry for the (exceedingly) long post, but hey, I'm scratching my head... and I need coffee. [Edited by - _winterdyne_ on September 19, 2005 5:49:04 AM]

_winterdyne_

530

Author

October 12, 2005 04:13 AM

Quote:
When the ship lurches, you want the player to lurch, too, not just stay rock solid on the ship's deck. In fact, gravity doesn't change when the ship moves, but the ship changes. If you run an actual physical simulation, it would be simpler to simply change the ship, keeping everything in world coordinates, because the physical simulation will take care of everything. ("having to update all the objects" is not a problem -- because they are simulated, they are updated every frame anyway)

I see what you're saying here, but this doesn't really affect the relevance graph, unless you consider falling from one POR to another. It does affect the physics layer for player position checking and may alter pathfinding hierarchies.

The effect you're describing could be achieved by using a matrix stack to concatenate transformations as the hierarchy is traversed, effectively giving each POR a transformation specific to its orientation and position.

Rather than simply having position, a POR has a matrix (or quat/vector) to dictate where it is. In most cases there won't be any rotation, just translation, since blocks are easier to work with in design terms.

The most complex objects, in terms of physics abstraction (collision hull), at any time on the server are likely to be static POR geometry sets. Other items are likely to be simple boxes, particles, spheres, cylinders, or self-righting cylinder entities such as creatures. I don't intend to implement rag-doll physics on the server, nor do I intend to model 'long footprint' entities (think a horse from above). All of these are transformed by the matrix stack.

In order to lurch (or indeed rotate relative to gravity), we do actually need to inverse-transform the inherited gravity vector which will dirty the physical abstractions for the POR. New accelerations can be calculated for them and they can then update as normal. Route paths for certain movement types will have to be recalculated based on the altered gravity vector (a floor may become a wall) so angular tolerances for paths on the route finding node map must be checked.
Since self-righters will have to self-right, a player associated with the simulation entity can receive a 'I have self righted by [optional quaternion]' message to make their avatar 'stagger', in full on Star Trek styley. The quaternion can be used to determine direction of stagger. Immediately following may be an 'I am falling' message.

Any POR can specify a new gravity vector - although this is intended for spaceships with no global gravity vector - allowing us to model artificial gravity environments but NOT centripetal force environments.

Winterdyne Solutions Ltd is recruiting - this thread for details!

_winterdyne_

530

Author

October 12, 2005 05:42 AM

Quote:
Also are there any additional complexities for an event that dont just have a simple origin and spherical effect (shout) but have more directional interactions (ie- an arrow fired) that may need to do a collision check/ LOS (line of site) and/or non-instantaneous (a traveling arrow) that itself moves over time and ack! may have secondary effects (like being visible/viewed by other obects as it moves...).

For a typical MMOG, I'd probably not model items such as arrow shots as explicit entities. An arrow shot would be calculated to either miss or hit, and the visible representation of it would be handled on the client. A standard spherical event would be used to notify of the shot, or the entry of the shot to a POR.

The mobile POR is designed to segregate a chunk of hierarchy - ideally within itself. Typically, the mobile POR would be 'teleported' into, rather than entered through a neighbour relation. It's not really designed for its contents to be interacted with.

Assuming you have a ship, with a deck, and belowdecks section, with several PORs. The deck, and entry to belowdecks (potentially visible) I'd model as a standard mobile limpet, with a complex collision hull. Mobiles on deck are linked to the limpet, characters below decks are not, they are tied to a mobile POR, since they are 'separated' from the world at large.
Linking the mobile POR to the limpet for the ship as well gives us a single point of control; Rotating the limpet (according to the normal of the bit of sea its on) rotates the mobile POR (but not its physical abstractions which are separate, but do get a modified gravity vector etc. as described).
Anything linked to the limpet inherits its transformation when updated, so any physical abstractions may get dirtied.

This allows the ship itself to be interacted with, as well as people on decks, as standard occupants of the fixed POR the ship is in, whilst keeping local interactions below decks separate from the world at large (keeping network updates down).

Winterdyne Solutions Ltd is recruiting - this thread for details!

hplus0603

11,917

October 12, 2005 10:51 AM

Quote:Unfortunately you will have to decide where you want the 'realism' to stop (in order to make a game that doesnt require a supercomputer to run in real time). Sure you could have the player lurch about (and check all the friction effects that keep objects in place most of the time) but what of the entire structure of the ship/boat. Do you want to have to calculate all those structures effects by every force upon them to calculate all the transformations for positions (culling methods would help this some). Its probably more cost effective to apply the various 'lurch' forces to the player and other 'moveables' within a local coordinate system to minimize the CPU load.

OK, I respectfully disagree. My daytime job includes a distributed simulation system that simulates all physical entities in global space, and it works very well. We've solved all the problems the OP talks about (although differently from his suggestion), and we've been operating since 2001. The servers, and clients, are regular x86 PC hardware, not supercomputers.

enum Bool { True, False, FileNotFound };

_winterdyne_

530

Author

October 12, 2005 11:39 AM

Bear in mind that I'm aiming for very small clusters with this library - it's intended use is for low-budget / indie projects, so a large cluster performing complex simulation of a large number of entities is out of the window.

Instead, since there may be large numbers of small items lying around, simplified physics have to be used serverside(particles for objects). Performing true physics on such items seems like a lot of work for a tiny cluster, which is preoccupied with game event handling.

Winterdyne Solutions Ltd is recruiting - this thread for details!

Anonymous

October 12, 2005 09:36 PM

Quote:Original post by hplus0603
Quote:Unfortunately you will have to decide where you want the 'realism' to stop (in order to make a game that doesnt require a supercomputer to run in real time). Sure you could have the player lurch about (and check all the friction effects that keep objects in place most of the time) but what of the entire structure of the ship/boat. Do you want to have to calculate all those structures effects by every force upon them to calculate all the transformations for positions (culling methods would help this some). Its probably more cost effective to apply the various 'lurch' forces to the player and other 'moveables' within a local coordinate system to minimize the CPU load.

OK, I respectfully disagree. My daytime job includes a distributed simulation system that simulates all physical entities in global space, and it works very well. We've solved all the problems the OP talks about (although differently from his suggestion), and we've been operating since 2001. The servers, and clients, are regular x86 PC hardware, not supercomputers.

Frame rate (or simulation cycles per second) ??
Total Object count??
Average number of objects in an overlapping vicinity??
Average events per object per cycle??
Seamless boundries??
Complex terrain (mobile vehicles where other object navigate in)??

Some game have a much higher simulation complexity than others.

Im considering scaleability since game worlds are getting bigger and more complex AND have situations where large numbers of very active players congregate in a small area (ie- like a battle or the 'bank').

hplus0603

11,917

October 13, 2005 12:04 AM

We don't have all of the data public, but yes, it's a continuous world that scales up by the amount of hardware you plug into the cluster; the actual mapping from world to hardware is heterogenous (not just same-sized squares). We step everything at 30 Hz (client and server), and the number of messages per object is "whatever interactions actually happen" (it's not really a limitation in the system). The limits to scalability are mainly related to how dense you want the congregations of simulated objects to be, and how complex they are, as well as what the CPU memory speed is.

You can check out the web site at http://www.forterrainc.com/ and you can also try the older version of the platform via the free trial download of http://www.there.com/ . Other things we do include simulation of an entire round planet, the size of Earth; a very believable model of human emotions; a fully working virtual economy driven by player-created content; and integrated voice chat that routes through the server (all interactions are server authenticated). There.com runs on pretty old server hardware; it has some "city plaza" type locations with hundreds of avatars and more hundreds of other simulated objects (user-customizable buildings, trees, etc).

And, yes, some players have taped down the "forward" button and driven vehicles around the entire world. It takes them about three weeks ;-)

enum Bool { True, False, FileNotFound };

_winterdyne_

530

Author

October 13, 2005 06:51 AM

Nice product!

Somewhat larger scale than I'm aiming, but similar goals.

I'm interested in what happens with large, active congregations in your system.
I assume your quadtree is stored on a centralised for the operating grid so node relations can be queried centrally, and I assume you're tracking stats for each node on that server as well. I assume that each node knows where the master server is (probably its own machine in the grid). I also assume that you can isolate a quadtree node that's causing lag, and I assume each process is aware of at least a portion of the quadtree to allow direct inter-process communication (given that you've mentioned a smart switch for the grid).

When a particular node in your quadtree is known to be causing lag, what do you do with it? You've mentioned that each node has its own process - are you moving processes to machines with less lag using beowulf as you mentioned earlier? Are you subdividing the node (similarly to the mechanism I have planned) and distributing the subnodes? As you mentioned, this doesn't cause problems with non-seamless worlds, but your reference to continuous implies its seamless, so taking a process out of the loop whilst shifting it in its entirity is going to cause synchronisation issues, made worse by the amount of active content in the node.

Hehe, here comes the 'I could tell you, but I'd have to kill you' post. ;-)

Winterdyne Solutions Ltd is recruiting - this thread for details!

hplus0603

11,917

October 13, 2005 11:21 AM

I could tell you how it all works, but first you'd have to sign a bunch of legal papers :-)

Quote:node relations can be queried centrally

The only thing we really need central querying for in the entire system is the relation "given this object ID, what is the home storage server for that object". Everything else is distributed, and scales by adding more discrete hardware, in one way or another.

We don't use Beowulf, but instead built our own application-layer clustering infrastructure.

Simulating objects will never make a remote query within the time of a single step -- doing that would kill performance. In fact, we could probably tolerate having a distributed data center (different servers in different centers), although that's not something we're officially supporting nor currently working to support.

We run one server process per machine. Running multiple processes has no advantage, because the area served by our processes can be irregular in shape (and even discontiguous, although that's usually not a great idea for other reasons). If we need to shift load, we change the area that each machine is responsible for, rather than moving the processes. Each simulating object knows how to move itself to the "most optimal" server for that object, so when we change around mappings, the appropriate objects will automatically migrate. Usually the players won't notice when their objects migrate (because of the "seamless streaming world" implementation, which already involves real-time migration).

enum Bool { True, False, FileNotFound };

_winterdyne_

530

Author

October 13, 2005 12:05 PM

Quote:Original post by hplus0603
I could tell you how it all works, but first you'd have to sign a bunch of legal papers :-)

Isn't that always the way? :-)

Quote:
We run one server process per machine. Running multiple processes has no advantage, because the area served by our processes can be irregular in shape (and even discontiguous, although that's usually not a great idea for other reasons). If we need to shift load, we change the area that each machine is responsible for, rather than moving the processes. Each simulating object knows how to move itself to the "most optimal" server for that object, so when we change around mappings, the appropriate objects will automatically migrate. Usually the players won't notice when their objects migrate (because of the "seamless streaming world" implementation, which already involves real-time migration).

So, given a change in area on a particular machine/process that change has to be migrated to all processes in the grid? You've stated you use a modified quadtree, I take it this is used to determine which is the most optimal server, given an objects extents and the known areas covered by each process in the grid. Elegant, given a fixed origin coordinate system. I also assume you are generally dealing with a 2D world (as far as zones are concerned).

Couple of questions, I was reading up on DungeonSiege's continuous world design and they ran across floating point precision errors at large distances. In short they overcame this by using an alterable point of reference. Are you using sliding scales for determining quadtree nodes (a 10km tree vs a 1m tree)?

Also, given an irregular shape, how do you determine continuity? Colinear edges on area perimeters? It's one of the reasons my fixed POR's have AABBs rather than arbitrary - I considered the design difficulties of placing continuous arbitrary hulls nightmarish, not to mention the fact I always hated Tetris, especially in 3d, whereas most people can easily figure out how to put together axis aligned box.

Winterdyne Solutions Ltd is recruiting - this thread for details!

Anonymous

October 14, 2005 04:14 AM

Quote:Original post by hplus0603
I could tell you how it all works, but first you'd have to sign a bunch of legal papers :-)

Quote:node relations can be queried centrally

The only thing we really need central querying for in the entire system is the relation "given this object ID, what is the home storage server for that object". Everything else is distributed, and scales by adding more discrete hardware, in one way or another.

We don't use Beowulf, but instead built our own application-layer clustering infrastructure.

Simulating objects will never make a remote query within the time of a single step -- doing that would kill performance. In fact, we could probably tolerate having a distributed data center (different servers in different centers), although that's not something we're officially supporting nor currently working to support.

We run one server process per machine. Running multiple processes has no advantage, because the area served by our processes can be irregular in shape (and even discontiguous, although that's usually not a great idea for other reasons). If we need to shift load, we change the area that each machine is responsible for, rather than moving the processes. Each simulating object knows how to move itself to the "most optimal" server for that object, so when we change around mappings, the appropriate objects will automatically migrate. Usually the players won't notice when their objects migrate (because of the "seamless streaming world" implementation, which already involves real-time migration).

It must be pretty messy shifting boundries (to do load leveling) with irregular/discontinuous areas. How long does a transition usually take when an entire area has to be locked down so that the new boundries can be send to the adjacent areas (a new area created...) and any objects reassigned to that new area?? I suppose some work could be done ahead of time to build up the data for a new area (adjacent areas prompted for the change..) before the actual transition.

At what degradation of the 30hz cycling target is a overly busy area split up ??
The heuristics for controling this automaticly must be nasty.

It always happens that whatever the worst case scenarios are, the players will wind up doing it. I would think that small busy areas would not really be fixable by this method because interzone events would still have to be transmitted to the adjacent areas (unless event filtering greatly lowers the number sent between areas or you still have the N^2 problem...).

[Of course its possible that you might have simulations with a very high load from NPC AI activation/reactive scenery and other secondary processing near players that requires farming out to more machines and would greatly outweigh the inter-machine event transfer overhead.]

I assume there is also an area anealing done as well....

Distributed server architecture for load balancing

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Distributed server architecture for load balancing

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines