Arjag

Many core MMO server

Hi, I'm currently trying to design / think through an architecture for game servers that can utilize the processing power of a large number of cores. Intel's Nehalem architecture will feature up to 8 cores per chip with 2 threads per core. With four of those in one server a programmer will have access to 64 threads (32 cores).

Basically two designs come to mind:

a) Big seamless world: The world is (possibly dynamically) partitioned into regions handled by different physical servers. The servers communicate with each other and transfer entities between each other as needed. From the player's point of view it looks like one big server without zoning.

b) One server per zone: One zone is handled by one server. If a player wants to switch zones he gets a loading screen and his client connects to another server.

Now two scenarios:

- Rising number of players, but in distinct areas (i.e. in the same zone but not interacting with each other): a) works fine. If one server gets overloaded, readjust the borders or add new servers. b) Distributing non-interacting players to different threads is doable. The server will reach a ceiling at some point, but it can handle many more players than one server in a), since all objects are local and the server doesn't have any communication overhead. (A seamless server would have to use some sort of actor model to handle actions on remote objects, which is much more expensive.)

- Rising number of players in one small area (e.g. a big event, a town raid): a) Since this server is less efficient than b), it will max out earlier. The question is whether adding more servers to the scene (e.g. setting a zone border in the middle of the battlefield) will do any good. You get more processing power, but communication overhead also increases (a lot!). b) Theoretically this would work better than a), IF it's possible to efficiently distribute the workload to 64 threads.

Note: The idea here is to build a server that can handle a lot of players in one place. A "seamless" server architecture means seamless within a zone. This can of course be one giant zone (which brings other problems, like a higher crash probability and trouble if one server gets overloaded). On top of that, other systems can be layered (like instancing or shards).

Note 2: Some tasks can be offloaded quite easily (authentication & login, chat, possibly even all client communication through proxy servers); this is about handling the main simulation.

Note 3: This is not tied to a specific game design or project. I'm just experimenting a bit.

Does that make any sense? :)

- Is one heavily threaded server more efficient for a big fight than a cluster of servers?
- Is a server with 32 cores feasible for such a task, or will other bottlenecks limit it (memory bandwidth / usage, cache issues, network bandwidth)?
- How would you program such a server?
- Does it even make sense to build a 32-core server, or is it more likely that we'll get one or two chips and a bunch of Cell-like co-processors? How would you map a game server onto those?

Thanks for any insights you guys have on this subject :)
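
To make design a) a little more concrete, here is a minimal C++ sketch of the ownership-lookup side of a dynamically partitioned world. The Region / RegionMap names and the strip-based split are assumptions made up for this example; the genuinely hard parts (entity handoff and interactions across a border) are exactly what it leaves out.

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Hypothetical sketch: the seamless world is cut into axis-aligned strips,
    // each owned by one simulation server. Borders can be moved at runtime to
    // rebalance load (design a); design b) would map whole zones to servers instead.
    struct Region {
        float         minX, maxX;  // world-space strip owned by this server
        std::uint32_t serverId;    // which physical box simulates it
    };

    class RegionMap {
    public:
        void setRegions(std::vector<Region> regions) { regions_ = std::move(regions); }

        // Which server owns an entity at world position x?
        std::uint32_t ownerOf(float x) const {
            for (const Region& r : regions_)
                if (x >= r.minX && x < r.maxX) return r.serverId;
            return 0;  // fallback server
        }

        // Rebalancing is just moving a border; migrating the affected entities
        // (and handling interactions across the border) is the hard part.
        void moveBorder(std::size_t leftRegion, float newBorder) {
            regions_[leftRegion].maxX     = newBorder;
            regions_[leftRegion + 1].minX = newBorder;
        }

    private:
        std::vector<Region> regions_;
    };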

I'm not sure whether MMO servers are designed with specific hardware in mind (but maybe they are, I don't know).

I can't imagine the servers not communicating with one another. I just plan to have the server applications open connections to either adjacent zones or all zones.

The basic idea, from what I've been able to gather over time, is that you want to create one zoneManager application and one loginManager application. The login runs on one server (it doesn't need any more power than that) and the zoneManager applications work on the zones they are given. The login server manages the server nodes. Say you have 9 zones and 4 zone servers; each zoneManager application can run many zones inside it.

Okay, so that's fine; you have something like:

server 1 loginManager server
server 2 zoneManager server managing zones 1,2,3
server 3 zoneManager server managing zones 4,5
server 4 zoneManager server managing zones 6,7
server 5 zoneManager server managing zones 8,9

But oh no, server 2 is hitting its CPU capacity, and it detects that zone 1 is at fault and requires much more processing. The login server is constantly querying this data from the server cluster. The login server finds that servers 4 and 5 are almost empty and tells them to spawn new zones, loading in the required information (like level data) for zones 2 and 3 respectively. So server 4 has an empty instance of zone 2 and server 5 has an empty instance of zone 3. All players in zones 2 and 3 are told to open a new connection to servers 4 and 5, which the login server has told to expect their connections. So all the players in zones 2 and 3 are now connected to two servers. In what might be a second of lag, zones 2 and 3 stop their simulation and come to a complete halt, serializing everything over to servers 4 and 5, which effectively copies the world. The players are told to make the change, drop their connection to server 2, and use their new socket connected to either 4 or 5. Server 2 destroys zones 2 and 3, begins churning the data only for zone 1, and the game rebalances. The loginManager is constantly watching for these situations and deciding when these server rebalances need to occur.
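
The handoff sequence described above can be written down as a small state machine. This is only a sketch under the assumptions in the post (a login/cluster manager driving the process); MigrationPhase and ZoneMigration are made-up names, and all of the real work sits behind the phase names:

    // Hypothetical sketch of the zone-migration sequence described above; the
    // phase names follow the steps in the post, and the real work (serialization,
    // socket redirection) would live behind whatever code drives these phases.
    enum class MigrationPhase {
        SpawnTargetZone,     // cluster manager tells the target server to load level data
        ClientsPreconnect,   // players open a second connection to the target server
        FreezeAndSerialize,  // source zone halts its simulation and streams its state over
        SwitchClients,       // players drop the old connection and keep the new one
        DestroySourceZone,   // source server frees the zone and keeps only what is left
        Done
    };

    struct ZoneMigration {
        int zoneId;
        int sourceServer;
        int targetServer;
        MigrationPhase phase = MigrationPhase::SpawnTargetZone;

        // Called by the login/cluster manager each time the current phase reports completion.
        void advance() {
            if (phase != MigrationPhase::Done)
                phase = static_cast<MigrationPhase>(static_cast<int>(phase) + 1);
        }
    };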

Quote:
Intel's Nehalem architecture will feature up to 8 cores per chip with 2 threads per core. With four of those in one server a programmer will have access to 64 threads (32 cores).


If you watch Sutter's talks, the current average performance gain from hyperthreading is on the order of 20%. So, roughly 38 threads' worth of performance (32 cores x 1.2), not 64.

Quote:
a) since all objects are local and the server doesn't have any communication overhead.


If you want to scale, you want to replicate resources, not share them and use locks.

Proper distributed actor-like local communication has about the same overhead as network communication, minus the wire latency. You need to fully serialize the data you're sharing, then send it, and potentially wait until the receiver can accept it.

Local message passing in this way is some 5-10 times slower than non-inlined function calls even when no data is passed. Once you want to pass actual data, each message involves the following: a lock/unlock, heap memory allocation/de-allocation, and a copy of all the data you're sending.

Lock-less algorithms can be used instead of locking, but they make memory allocation considerably harder. The copy of the data cannot be avoided in a true distributed model; if you share it instead, then you need to lock it until the receiver has processed the message in its entirety.
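
To illustrate where those per-message costs come from, here is a deliberately naive C++ sketch of a local mailbox between threads; the Mailbox name and interface are assumptions for the example, and each send pays exactly the lock/unlock, heap allocation, and full copy listed above:

    #include <cstddef>
    #include <cstring>
    #include <deque>
    #include <mutex>
    #include <utility>
    #include <vector>

    // Deliberately naive inter-thread mailbox. Every send pays exactly the costs
    // listed above: a lock/unlock, a heap allocation (the vector), and a full
    // copy of the payload. Lock-free variants remove the mutex, not the copy.
    class Mailbox {
    public:
        void send(const void* data, std::size_t size) {
            std::vector<char> copy(size);              // heap allocation
            std::memcpy(copy.data(), data, size);      // copy of all data sent
            std::lock_guard<std::mutex> lock(mutex_);  // lock (unlock on scope exit)
            queue_.push_back(std::move(copy));
        }

        bool receive(std::vector<char>& out) {
            std::lock_guard<std::mutex> lock(mutex_);
            if (queue_.empty()) return false;
            out = std::move(queue_.front());
            queue_.pop_front();
            return true;
        }

    private:
        std::mutex mutex_;
        std::deque<std::vector<char>> queue_;
    };

The same shape shows up whether the receiver is another thread or another box; only the wire latency differs.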

Quote:
- Is one heavily threaded server more efficient for a big fight than a cluster of servers?


There isn't much difference really (except in physical size). It all depends on how you distribute the model.


And just as a side note - seamless servers aren't new, and for all practical purposes they're a solved problem. They all have hard limits on capacity, but in practice that limit is above what users actually require. Especially in games, fidelity has proven more desirable than sheer size. On typical game servers the local population will be counted in dozens, since anything more becomes problematic from a client-side rendering as well as a gameplay perspective.

I agree; seamless servers that hand off between areas are not new, and it's typically done on single- or dual-core machines with Ethernet between them.

The problem with putting zillions of cores on a die is that each core will have less memory bandwidth available to it. Thus, workloads on the cores must be structured to fit in the cache as much as possible. That makes multi-core machines great for partitioning things like per-object simulation, but makes them troublesome for doing things like global collision detection.

There are solutions, such as building everything as one big set of state machines, each of which gets a regular "tick" from a thread within a pool of service threads. If you can tie the type of service (collision vs simulation vs chat vs access control vs ...) to specific cores, that's even better, because you will have cache locality within each core.
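
A minimal sketch of that idea, assuming a modern C++ threading API; StateMachine and TickPool are hypothetical names, and pinning a worker to a physical core is platform-specific, so it is only mentioned in the comments:

    #include <atomic>
    #include <cstddef>
    #include <memory>
    #include <thread>
    #include <utility>
    #include <vector>

    // Each worker thread owns one fixed "shard" of state machines and ticks them
    // in a loop, so a machine's data tends to stay warm in that worker's cache.
    // Pinning a worker to a physical core is platform-specific and omitted here.
    struct StateMachine {
        virtual ~StateMachine() = default;
        virtual void tick(float dt) = 0;
    };

    class TickPool {
    public:
        explicit TickPool(std::vector<std::vector<std::unique_ptr<StateMachine>>> shards)
            : shards_(std::move(shards)) {}

        void run(float dt) {
            running_ = true;
            for (std::size_t i = 0; i < shards_.size(); ++i)
                workers_.emplace_back([this, i, dt] {
                    while (running_)
                        for (auto& m : shards_[i]) m->tick(dt);  // regular tick, no sharing
                });
        }

        void stop() {
            running_ = false;
            for (auto& w : workers_) w.join();
        }

    private:
        std::vector<std::vector<std::unique_ptr<StateMachine>>> shards_;
        std::vector<std::thread> workers_;
        std::atomic<bool> running_{false};
    };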


Don't bank on having all the processing power you need in one machine.

You will eventually have to scale beyond what fits on a single motherboard. Many MMOs have a farm of hundreds or thousands of servers to meet their processing needs: a whole layer of player connection servers, a bank of servers for the zones, DB servers, and often a bank of servers for the NPC AI/behavior. Zones can interact at their boundaries (and you can easily run more zones than you have cores...).

My project is heavy on AI, and I gave up quite a while ago on trying to fit everything on one computer (even with single/dual quad-core CPUs). Clusters of servers are needed to do halfway intelligent AI behaviors, and future interactive terrain will vastly increase the processing power needed for the zone simulations.

You will have to accommodate crossing between server machines as well as between cores within the same server. You might still do something clever by positioning certain threads adjacent on the same CPU for processing that requires fine-grained data dependencies (locks within the same CPU, via memory sharing, are faster than going across a network link to another CPU/set of cores). But you quickly run out of those opportunities and are forced to interact across the more costly inter-machine communications.

I also wouldn't count on 2 threads per core. You might move some background task to a core's second hardware thread, but two primary threads will fight each other for resources too much to be efficient.

Quote:
future interactive terrain will vastly increase the processing power needed


I'm not so sure about that. Interactive terrain really shouldn't be much more expensive, computationally, than current terrain. However, interactive terrain is a giant distribution (networking) challenge, as you need to get the terrain modification data out to everyone who can interact in an area, and sometimes that data is pretty large.

That being said, these are really two different questions.

Question 1): How can I build an architecture that scales across nodes connected with a communication bus (Ethernet, NUMA)?

Question 2): How can I build an implementation that exploits all available power on nodes that have many CPU cores connected through shared memory?

The original question is question 2, however, the description of the question makes it seem as if you're trying to tackle question 1. Note that, when CPUs get big enough caches, or when mainstream servers become NUMA, then solutions for question 1 may start mattering even for question 2 (as the line will start blurring).



Something to consider:

How many players do you plan for a single core to handle?

How often are that many players going to be in a single zone? (zone in this case meaning either a minimum partition of your big seamless world, or an actual zone, however you end up partitioning things) What's your pain threshold here (minimum server FPS for the game to be playable), and how many players can you handle while still staying within that limit?

Depending on the nature of your game, how much server-side processing you're planning to do, and how much of your game logic (AI, physics) you can push onto other servers, you may find that you will -very- seldom push the limits of one core here for a particular zone/instance/partition.

In that case, you may not want to add the tenfold complexity of multithreading the simulation itself. Simply running more processes per physical server nets you the same scalability in terms of running more zones at once (i.e. physical machine A could run one separate process per core, each of which handles a variable number of zones depending on load).
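
For what it's worth, that setup can be sketched in a few lines, assuming a plain fixed-timestep loop; Zone, the 50 ms tick, and the zone count are stand-ins, and the only point is that the simulation process never needs a second thread:

    #include <chrono>
    #include <thread>
    #include <vector>

    // One single-threaded simulation process; the operator launches one of these
    // per core, and the cluster manager assigns each a variable set of zones.
    // Zone stands in for the real zone type.
    struct Zone {
        void update(float dt) { (void)dt; /* run AI, physics, scripts for this zone */ }
    };

    int main() {
        std::vector<Zone> zones(4);                       // assigned by the cluster manager
        const auto step = std::chrono::milliseconds(50);  // 20 server ticks per second

        for (;;) {
            const auto frameStart = std::chrono::steady_clock::now();
            for (Zone& z : zones)
                z.update(0.05f);
            std::this_thread::sleep_until(frameStart + step);
        }
    }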

And for this type of setup, two boxes with 4 cores each are better than one with 8, and are more cost-efficient to scale upward.

Seamless worlds require more inter-server communication than fully zoned ones (zoned worlds may only require player transfers; everything else can be partitioned off to separate server types), but it should still be kept to a minimum, so the same principles apply.

Quote:
Original post by hplus0603
Quote:
future interactive terrain will vastly increase the processing power needed


I'm not so sure about that. Interactive terrain really shouldn't be much more expensive, computationally, than current terrain. However, interactive terrain is a giant distribution (networking) challenge, as you need to get the terrain modification data out to everyone who can interact in an area, and sometimes that data is pretty large.





I was originally thinking of interactive furniture - things you could manipulate, each with reactive scripted behaviors/triggers etc. - but then, as an extension of that, real physics-style terrain (e.g. a rope bridge, a pile of rocks that you could roll, or buildings made up of wall components that can be deformed/destroyed...).

Think of interactions done simultaneously by different players on the same objects, and the server having to arbitrate/resolve the composite outcome of the interactions (and then post the changes to clients, and possibly to other zones' overlap boundaries AND to NPC AI running on other servers).


That could add up to A LOT more processing -- it's like having many thousands of the dim NPCs (of which we now have only dozens) running in each zone. That also multiplies the amount of data required per zone.


Quote:
Original post by wodinoneeye
Quote:
Original post by hplus0603
Quote:
future interactive terrain will vastly increase the processing power needed


I'm not so sure about that. Interactive terrain really shouldn't be much more expensive, computationally, than current terrain. However, interactive terrain is a giant distribution (networking) challenge, as you need to get the terrain modification data out to everyone who can interact in an area, and sometimes that data is pretty large.





I was originally thinking of interactive furniture - things you could manipulate, each with reactive scripted behaviors/triggers etc. - but then, as an extension of that, real physics-style terrain (e.g. a rope bridge, a pile of rocks that you could roll, or buildings made up of wall components that can be deformed/destroyed...).

Think of interactions done simultaneously by different players on the same objects, and the server having to arbitrate/resolve the composite outcome of the interactions (and then post the changes to clients, and possibly to other zones' overlap boundaries AND to NPC AI running on other servers).


That could add up to A LOT more processing -- it's like having many thousands of the dim NPCs (of which we now have only dozens) running in each zone. That also multiplies the amount of data required per zone.


There was a talk about this at GDC called "Mercenaries 2: Networked Physics in a Large Streaming World"; it should be worth a look.

Quote:
Original post by wodinoneeye

Think of interactions done simultaneously by different players on the same objects, and the server having to arbitrate/resolve the composite outcome of the interactions (and then post the changes to clients, and possibly to other zones' overlap boundaries AND to NPC AI running on other servers).


That could add up to A LOT more processing -- it's like having many thousands of the dim NPCs (of which we now have only dozens) running in each zone. That also multiplies the amount of data required per zone.


That doesn't add to processing cost as much as it requires you to fully synchronize everything. If you wish to preserve the proper sequence of events, you are almost guaranteed to be unable to extrapolate, which makes it considerably harder to provide smooth gameplay.

A considerably bigger problem here would be properly resolving interactions where clients with latencies anywhere from 10 to 1000 ms are all interacting with the same object. Do you handle each action immediately, rolling back every time as needed, resulting in lots of re-processing and rubber-banding, or do you wait for the slowest clients, resulting in large delays for everyone?

The speed of light constraining communications would be the limiting factor. Even with infinite CPU power, the client-side latency would remain the same.
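
For the first option (apply immediately, roll back on a late input), a bare-bones sketch might look like the following; WorldState, Input, and the callbacks are placeholders for the real game, and the only point is the rewind-and-replay shape of the loop:

    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <utility>

    // Bare-bones shape of "apply immediately, roll back when a late input arrives":
    // keep per-tick snapshots, rewind to the tick of the late input, apply it,
    // then re-simulate forward to the present.
    struct WorldState { /* authoritative state for the contested object(s) */ };
    struct Input      { std::uint32_t tick; /* plus whatever the action was */ };

    class RollbackBuffer {
    public:
        RollbackBuffer(WorldState initial,
                       std::function<void(WorldState&, std::uint32_t)> simulateTick)
            : simulateTick_(std::move(simulateTick)) { history_.push_back({0, initial}); }

        void applyLateInput(const Input& in, std::uint32_t currentTick,
                            const std::function<void(WorldState&, const Input&)>& applyInput) {
            while (history_.back().tick > in.tick) history_.pop_back();  // rewind
            WorldState s = history_.back().state;
            applyInput(s, in);
            for (std::uint32_t t = in.tick; t < currentTick; ++t) {      // replay
                simulateTick_(s, t);
                history_.push_back({t + 1, s});
            }
        }

    private:
        struct Snapshot { std::uint32_t tick; WorldState state; };
        std::deque<Snapshot> history_;
        std::function<void(WorldState&, std::uint32_t)> simulateTick_;
    };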

Quote:
Original post by hplus0603
That being said, these are really two different questions.

Question 1): How can I build an architecture that scales across nodes connected with a communication bus (Ethernet, NUMA)?

Question 2): How can I build an implementation that exploits all available power on nodes that have many CPU cores connected through shared memory?

The original question is question 2, however, the description of the question makes it seem as if you're trying to tackle question 1. Note that, when CPUs get big enough caches, or when mainstream servers become NUMA, then solutions for question 1 may start mattering even for question 2 (as the line will start blurring).


You are right, my original question was 2). But my main concern is how to handle more players in one place. There are solutions that scale with more players as long as they are not too close to each other (they can be handled by different nodes). I know that it won't be possible to scale the "players in one place" scenario the same way (1 server handles 40 players, but 100 servers handling 4,000 players doesn't work if they are all sitting on top of each other), but getting that number up somewhat would be nice.

Compared to networked servers, using shared memory reduces communication overhead (even with NUMA the latency will be lower). That's why I asked how to design a server that runs on one many-core box. But more generally speaking, the question is:

How to build a server that:
- allows more players to interact with each other (in one place)
- scales with future processor development (up to some limit)
(- scales horizontally)
(By scaling I mean scaling the number of players in one place. As some of you noted, scaling the total number of players is already solved.)

Quote:
Original post by Antheus
Quote:
Original post by wodinoneeye

Think of interactions done simultaneously by different players on the same objects, and the server having to arbitrate/resolve the composite outcome of the interactions (and then post the changes to clients, and possibly to other zones' overlap boundaries AND to NPC AI running on other servers).


That could add up to A LOT more processing -- it's like having many thousands of the dim NPCs (of which we now have only dozens) running in each zone. That also multiplies the amount of data required per zone.


That doesn't add to processing cost as much as it requires you to fully synchronize everything. If you wish to preserve the proper sequence of events, you are almost guaranteed to be unable to extrapolate, which makes it considerably harder to provide smooth gameplay.

A considerably bigger problem here would be properly resolving interactions where clients with latencies anywhere from 10 to 1000 ms are all interacting with the same object. Do you handle each action immediately, rolling back every time as needed, resulting in lots of re-processing and rubber-banding, or do you wait for the slowest clients, resulting in large delays for everyone?

The speed of light constraining communications would be the limiting factor. Even with infinite CPU power, the client-side latency would remain the same.




You might be surprised how much CPU could be consumed by having that many separate interactive objects. Remember, this is a whole zone (probably smaller than current zones because of the N^2 relations) which also has to handle crunch situations where large numbers of players congregate at one time. NPCs would most likely also interact with the terrain/passive objects. And yes, I was considering attempting to resolve/handle temporal conflicts/rollbacks within as short a time as possible (the majority of effects would likely be actuated by a single player in any case, but again, crunch situations have to be handled).

Many events CAN be extrapolated/corrected fully (because they don't depend on split-second results) and others could be brought into the 'good enough' range.

Quote:
Original post by Arjag
Quote:
Original post by hplus0603
That being said, these are really two different questions.

Question 1): How can I build an architecture that scales across nodes connected with a communication bus (Ethernet, NUMA)?

Question 2): How can I build an implementation that exploits all available power on nodes that have many CPU cores connected through shared memory?

The original question is question 2, however, the description of the question makes it seem as if you're trying to tackle question 1. Note that, when CPUs get big enough caches, or when mainstream servers become NUMA, then solutions for question 1 may start mattering even for question 2 (as the line will start blurring).


You are right, my original question was 2). But my main concern is how to handle more players in one place. There are solutions that scale with more players as long as they are not too close to each other (they can be handled by different nodes). I know that it won't be possible to scale the "players in one place" scenario the same way (1 server handles 40 players, but 100 servers handling 4,000 players doesn't work if they are all sitting on top of each other), but getting that number up somewhat would be nice.

Compared to networked servers, using shared memory reduces communication overhead (even with NUMA the latency will be lower). That's why I asked how to design a server that runs on one many-core box. But more generally speaking, the question is:

How to build a server that:
- allows more players to interact with each other (in one place)
- scales with future processor development (up to some limit)
(- scales horizontally)
(By scaling I mean scaling the number of players in one place. As some of you noted, scaling the total number of players is already solved.)





One way they have done it in the past was to narrow down the processing of the interdependent data (i.e. collision calcs for the moving objects) to dedicated CPUs/machines, while leaving the independent processing (bookkeeping / pre-validation / collision with static terrain) on a farm of client proxy servers (where parallelization could be done easily).

The N^2 relationship processing could then actually fit on a single thread on one CPU (or maybe now a multi-core / shared-memory machine) to cut out the lock overhead (or at least have faster locks) and the communication delays required by coarser parallelization.
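
A rough sketch of that split, with MovingObject, the two-phase frame, and the thread fan-out all assumed for the example: the independent per-object work is spread over worker threads on disjoint slices, while the interdependent N^2 pass stays on one thread so it needs no locks at all:

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Two-phase frame: the independent per-object work is fanned out to worker
    // threads over disjoint slices (no shared writes, so no locks), while the
    // interdependent N^2 pass stays on a single thread, lock-free by construction.
    struct MovingObject {
        void integrate(float dt) { (void)dt; /* independent: move, animate, bookkeep */ }
    };

    void resolvePair(MovingObject&, MovingObject&) { /* interdependent: collision, arbitration */ }

    void simulateFrame(std::vector<MovingObject>& objs, float dt, unsigned workers) {
        // Phase 1: independent work, parallel over disjoint slices (workers >= 1).
        std::vector<std::thread> pool;
        const std::size_t chunk = (objs.size() + workers - 1) / workers;
        for (unsigned w = 0; w < workers; ++w)
            pool.emplace_back([&objs, dt, chunk, w] {
                const std::size_t begin = w * chunk;
                const std::size_t end   = std::min(objs.size(), begin + chunk);
                for (std::size_t i = begin; i < end; ++i) objs[i].integrate(dt);
            });
        for (auto& t : pool) t.join();

        // Phase 2: the N^2 pairwise pass, single-threaded.
        for (std::size_t i = 0; i < objs.size(); ++i)
            for (std::size_t j = i + 1; j < objs.size(); ++j)
                resolvePair(objs[i], objs[j]);
    }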


---

There may be a significant problem with getting enough cache memory for each core to run the kind of processing we need in this application. 64 cores with 2-4 MB each might not happen for a VERY long time. There is also the problem of a memory architecture feeding that many cores from the same large data mass we have in these games. It won't all fit in cache, and the constant stream of random data fetches would likely overwhelm today's limited memory paths.

Inter-machine communications might improve (10 Gb fibre, grid/multiport parallel networks) but are still orders of magnitude slower than direct memory access.

The more complex and numerous interactions expected in future games will only exacerbate the situation.

Quote:
Original post by wodinoneeye

You might be surprised how much CPU could be consumed by having that many separate interactive objects. Remember, this is a whole zone (probably smaller than current zones because of the N^2 relations) which also has to handle crunch situations where large numbers of players congregate at one time. NPCs would most likely also interact with the terrain/passive objects.


No, I wouldn't be surprised. But from my understanding the usual solution is to reduce the workload: lower the update rate, lower the number of interactions, make interactions local, approximate the results.

All this obviously assumes O(n^2). Even under ideal conditions, increasing the number of cores yields diminishing returns. And the N^2 traffic will, IMHO, hit the bus/network considerably harder than the CPU. So unless there's something you can do about that, I'd assume there's a hard ceiling on how much you can do.
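
Just to put rough numbers on that O(n^2) point: with all-pairs interactions there are n(n-1)/2 pairs, so 40 players in one spot means 780 pairs per tick, 400 players means 79,800, and 4,000 players means just under 8 million. A 100x increase in local population is roughly a 10,000x increase in pairwise work, which is why lowering the update rate and making interactions local buys far more than adding cores.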

But eventually, there's a hard limit on how much one can process. And at that point, unless the problem can be vectorized, I don't believe it's possible to push beyond.


Quote:
Many events CAN be extrapolated/corrected fully (because they don't depend on split-second results) and others could be brought into the 'good enough' range.


Obviously, those aren't counted in this workload.

Quote:
Original post by wodinoneeye
Quote:
Original post by slvmnd
You should check out Project Darkstar. Seems interesting at least.

Looks like it's Java based. I would think this kind of thing would need C for the low-level (performance-critical) layers.


With JIT compilation, etc., Java is barely slower than C at this point, and the networking support is better than what the majority of C libraries provide.

Quote:
You should check out Project Darkstar. Seems interesting at least.


If anyone mentions Java vs C anymore in this thread, I will lock the thread.

The problem with Project Darkstar is that it doesn't scale AT ALL for dependent objects. Sun has publicly stated that a Darkstar server cannot be used for physical interaction. Also, because all objects are serialized Java chunks, you can't run regular queries on the objects; you HAVE to find the objects using the object reference network.

If you have collision detection for players, then the immediate interaction can only be with at most 14 other players (imagine stacking oranges around a single orange). If you have fast memory, you can scale this by putting all the "close" players in a single piece of memory and having different cores work on different players, making sure that no player is allowed to move into penetration with another player's old OR projected new position. This, in effect, limits the density of players, because of their physical volume ("personal space" if you will), to numbers that stay tractable.
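
One possible sketch of the "close players in a single piece of memory" part, using a simple uniform grid; Player, ProximityGrid, and the cell key scheme are invented for the example:

    #include <cmath>
    #include <unordered_map>
    #include <vector>

    // Bucket players into fixed-size world cells so that everyone who could
    // possibly touch each other sits in one contiguous vector; a single core can
    // then process a whole cell with good cache locality, and the cell size
    // bounds how many bodies can physically crowd into it.
    struct Player { float x, y; /* plus the rest of the entity */ };

    class ProximityGrid {
    public:
        explicit ProximityGrid(float cellSize) : cellSize_(cellSize) {}

        void insert(const Player& p) { cells_[keyFor(p.x, p.y)].push_back(p); }

        // One cell = one cache-friendly work unit to hand to a core.
        const std::vector<Player>* cellAt(float x, float y) const {
            const auto it = cells_.find(keyFor(x, y));
            return it == cells_.end() ? nullptr : &it->second;
        }

    private:
        long long keyFor(float x, float y) const {
            const long long cx = static_cast<long long>(std::floor(x / cellSize_));
            const long long cy = static_cast<long long>(std::floor(y / cellSize_));
            return (cx << 32) ^ (cy & 0xffffffffLL);
        }

        float cellSize_;
        std::unordered_map<long long, std::vector<Player>> cells_;
    };

Because players take up physical volume, the number of players per cell (and therefore the pairwise work per cell) stays bounded, which is the point made above.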

If I remember correctly, Sun already is shipping (or at least talking about shipping) a 64-core SMP CPU.

