MMOs and modern scaling techniques


If you have GDC Vault access, look up a talk by Pat Wyatt from GDC 2013 I think it was... maybe 2012.

If you don't have GDC Vault access, what's wrong with you?! :-P


It's far too expensive for me, unfortunately.

I wouldn't put so much trust in "how web developers approach scalability".


I appreciate that a lot of what used to be considered the state-of-the-art is now not considered best practice. But still, there are sites today deploying technology that services many more concurrent clients than single-shard MMOs can manage. The question is whether it would be possible for MMOs to do the same... or not. From what people are saying, the answer appears to be "Yes, of course... if you can overcome the complexity... which nobody is going to talk about in any detail". ;)

But still the problem remains that we have one socket per TCP connection and that sucks hard.

That's not really the problem I am trying to discuss though. Firstly, because you can distribute the front end servers quite easily. And secondly, because you don't necessarily need to use TCP for your main client connection anyway.

My experience of MMOs is that once you've done your optimisation at the I/O level - e.g. getting your buffers the right size, perhaps using a proxy so that your game server is not spending half its time servicing network interrupts, etc. - you'll hit a CPU cap with the gameplay before you hit a cap imposed by networking delays between the server and the player clients. Character interactions are O(N²) whereas your number of connections is only O(N), and the value of N is greater for character interactions because it includes NPCs. The cap in my experience seems to be around 500-1500 players per process, depending on how complex the computations for each player are.
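
To make that scaling difference concrete, here is a rough back-of-envelope sketch (my own, with purely illustrative numbers, not anything from the thread) showing how per-second connection work grows linearly while naive pairwise interaction checks grow quadratically:

# Illustrative sketch: connection handling is O(N) per tick, while naive
# character/character interaction checks are O(N^2) in the number of actors
# (players + NPCs). All figures below are made up for illustration.

def per_second_costs(players, npcs, ticks_per_second=10):
    actors = players + npcs
    connection_events = players * ticks_per_second                # O(N)
    pair_checks = actors * (actors - 1) // 2 * ticks_per_second   # O(N^2)
    return connection_events, pair_checks

for players in (500, 1000, 1500):
    conn, pairs = per_second_costs(players, npcs=players)   # assume roughly as many NPCs as players
    print("%5d players: ~%12s connection events/s, ~%15s pair checks/s"
          % (players, format(conn, ","), format(pairs, ",")))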

Sure, at a very high level with distributed servers like Amazon EC2, these paradigms work. But beware that a user waiting 5 seconds for the search results of their long-lost friend on Facebook is acceptable(*). A game with a 5-second lag for casting a spell is not.


Sure. That's the backbone of my suspicion. The argument I had which inspired me to start this thread included the other guy saying that my traditional approach is quite obviously not widely used because it would cost half a million dollars per month on Amazon EC2. Trying to tell him that most MMOs - in the original meaning of the word - do not and will not run on EC2, would probably have been futile.

There are two reasons not to run games on EC2:

1) Amazon charges an arm and a leg for bandwidth. You can buy it MUCH cheaper in a co-lo facility.

2) Virtualization induces scheduling jitter, which impacts real-time physics simulation. If your CPU suddenly goes away for 100 milliseconds, that's a six-frame stutter, which is quite noticeable. When I measured this, it could get as bad as 1500 milliseconds.
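
For what it's worth, a minimal sketch (mine, not from the post) of how you could measure that kind of scheduler jitter yourself: sleep one simulation frame at a time and record how late each wakeup is. On a quiet bare-metal box the worst lateness stays small; on a noisy virtualized host you can see spikes like the ones described above.

import time

FRAME = 1.0 / 60.0                    # target 60 Hz simulation step
worst_late = 0.0
next_tick = time.monotonic()
for _ in range(60 * 60):              # run for roughly one minute
    next_tick += FRAME
    delay = next_tick - time.monotonic()
    if delay > 0:
        time.sleep(delay)
    # How late did we actually wake up relative to the intended tick?
    late = time.monotonic() - next_tick
    worst_late = max(worst_late, late)

print("worst wakeup lateness: %.1f ms" % (worst_late * 1000))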

The real world performance of network infrastructure is not even slightly approaching 50% of the speed of light



15:48 ~ jwatte@AF002000$ traceroute www.interserver.net
traceroute to www.interserver.net (198.41.189.28), 30 hops max, 60 byte packets
 1  * * *
 2  208.71.159.129 (208.71.159.129)  0.299 ms  0.573 ms  0.492 ms
 3  117.Vl117-Cr01-PAIX-PAL.unwiredltd.net (204.11.106.45)  2.014 ms  1.902 ms  1.864 ms
 4  209.63.145.114 (209.63.145.114)  3.698 ms  3.710 ms  3.687 ms
 5  be-1.br02.chcgildt.integra.net (209.63.82.186)  56.066 ms  56.075 ms  56.052 ms
 6  xe-1-2-0.edge01.ord02.as13335.net (206.223.119.180)  54.250 ms  53.702 ms  53.516 ms
 7  198.41.189.28 (198.41.189.28)  53.469 ms  53.513 ms  53.503 ms
15:48 ~ jwatte@AF002000$

The light transmission time from Oakland to New York and back (2,900 miles each way) in copper (about 2/3 the speed of light in vacuum) is about 46 milliseconds. In this case (from well-connected data center to well-connected data center) we are substantially CLOSER to the speed of light than 50%: 46/53 is about 87% of that theoretical best.
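
The arithmetic behind that 46 ms figure, as a quick sketch using the same assumed numbers (2,900 miles each way, signal speed about 2/3 of c):

MILES_EACH_WAY = 2900
METERS_ROUND_TRIP = MILES_EACH_WAY * 1609.34 * 2     # out and back, in meters
SIGNAL_SPEED = (2.0 / 3.0) * 299792458.0             # ~2e8 m/s in copper/fiber
best_rtt_ms = METERS_ROUND_TRIP / SIGNAL_SPEED * 1000
print("theoretical best RTT: %.0f ms" % best_rtt_ms)  # prints ~47 ms, matching the ~46 ms estimate above
# Measured ~53 ms in the traceroute, so roughly 46/53 = 87% of the wire-limited best case.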

Most of the delay comes from slow residential "last mile" connection issues, and WiFi access points, which may vary from sub-millisecond to dozens-of-milliseconds.

enum Bool { True, False, FileNotFound };

3) The storage throughput is terrible. If you write anything to disk, it's a colossal pain. Unfortunately, even with much friendlier providers (DigitalOcean, Linode), storage pools are tied to server size, with an absolute maximum.

Also, I won't argue data-center-to-data-center speed; you're absolutely correct. I wouldn't personally ignore last-mile infrastructure when discussing gaming, though, which is the assumption I made in my response.

Kylotan wrote:

The question is whether it would be possible for MMOs to do the same... or not. From what people are saying, the answer appears to be "Yes, of course... if you can overcome the complexity... which nobody is going to talk about in any detail". ;)


I agree with you, except I see no "of course" there.

As you said, character/character interaction is N-squared; connections (and web architecture) is all built around scaling out the N problem. No real-time physics simulation engine exists that scales out across machines along the axis of the number of cross-interacting entities, although they exist (expensively, see DIS) for making each separate actor extremely complex.

If your needs match those of Farmville, an EC2 based web solution is great. The developers were quoted as saying "we're glad we had scripted bringing up more EC2 instances, because we couldn't have done so manually to keep up with the growth in demand."
I would love for there to exist a similarly flexible solution and architecture for the N-squared character interaction problem. But there doesn't, for rather deep technical reasons as described above.
enum Bool { True, False, FileNotFound };


Are ping results like those above physically reliable? I mean, when I ping from city A to city B and ping reports 50 ms, does that really mean the data is available in B after a 50 ms delay? (Does ping give the one-way travel time? How does it synchronise the clocks?)

As far as I know, it may show this value as a somewhat theoretical one, and the real physical times may be larger (though I know very little about this; I'm just investigating out of curiosity).

Ping is answered at a very low layer of the protocol stack; no application is involved. So it is nearly the time needed for the physical transport from A to B.

I mean, when I ping from city A to city B and ping reports 50 ms, does that really mean the data is available in B after a 50 ms delay? (Does ping give the one-way travel time?)


Ping gives the two-way travel time. The time from A to B is roughly half the ping time.

If the sending application and the receiving application are written properly, and are running on servers that are properly provisioned (not overloaded,) then the application-to-application time will be very similar to the ping time. However, if there are problems in the implementation of the application, or the management of the server, application-to-application time may be a lot longer.
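
If you want to check that for your own client/server pair, here is a hedged sketch of an application-level round-trip measurement over UDP; "echo.example.net" and port 9999 are placeholders for an echo server you run yourself (one that simply sends each datagram straight back):

import socket
import time

HOST, PORT = "echo.example.net", 9999   # hypothetical echo server you control
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2.0)

samples = []
for _ in range(10):
    sent = time.monotonic()
    sock.sendto(b"ping", (HOST, PORT))
    try:
        sock.recvfrom(64)
        samples.append((time.monotonic() - sent) * 1000.0)
    except socket.timeout:
        pass                             # UDP: the packet may simply be dropped

if samples:
    best = min(samples)                  # the best sample is closest to the wire latency
    print("RTT ~%.1f ms, one-way ~%.1f ms" % (best, best / 2))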
enum Bool { True, False, FileNotFound };


Well, at least that's some very good news.

I have a radio-wave internet connection and get pings of 150-250 ms (occasionally 400 or 600 ms). Is that because of the radio connection? Do people with wired (buried cable) connections usually have it much lower? Is there a reasonable average ping to assume, and an average ping for a fast connection?

Do MMO games work at a much slower rate than those pings?

Yes, for radio-based internet, it's typically frequency arbitration and occasionally drops/collisions that increase the latency.

For wired connections, how much your ISP adds depends on the quality of their network and their willingness to peer with well-connected back ends.

Comcast (my home ISP) adds about 20 milliseconds going 10 miles, and adds between 0.01% and 10% packet drop depending on who I'm trying to talk to.

MMOs can live with seconds of latency; it depends on the play style. If the play style focuses on physics simulation and player/player interaction, low latency is very important (as it is for an FPS).

enum Bool { True, False, FileNotFound };

