MMOs and modern scaling techniques

Ben Sizer · 2014-07-20T00:15:34

(NB. I am using MMO in the traditional sense of the term, ie. a shared persistent world running in real-time, not in the modern broader sense, where games like Farmville or DOTA may have a 'massive' number of concurrent players but there is little or no data that is shared AND persistent AND updating in real-time.) In recent discussions with web and app developers one thing has become quite clear to me - the way they tend to approach scalability these days is somewhat different to how game developers do it. They are generally using a purer form of horizontal scaling - fire up a bunch of processes, each mostly isolated, communicating occasionally via message passing or via a database. This plays nicely with new technologies such as Amazon EC2, and is capable of handling 'web-scale' amounts of traffic - eg. clients numbering the the tens or hundreds of thousands - without problem. And because the processes only communicate asynchronously, you might start up 8 separate processes on an 8-core server to make best use of the hardware. In my experience of MMO development, this is not how it works. There is a lot of horizontal scaling, but instead of firing up servers on demand, we pre-allocate them and tend to divide them geographically - both in terms of real world location so as to be closer to players, and in terms of in-game locations, so that characters that are co-located also share the same game process. This would seem to require more effort on the game developer's part but also imposes several extra limitations, such as making it harder to play with friends located overseas on different shards, requiring each game server to have different configuration and data, etc. Then there is the idea of 'instancing' a zone, which could be thought of as another geographical partition except in an invisible 4th dimension (and that is how I have implemented it in the past). MMOs do have a second trick up their sleeves, in terms of it being common to farm out certain tasks to various heterogeneous servers. A typical web app might just have many instances of the front-end server and one database (possibly with some cache servers in between), but in my experience MMOs will often have specific servers for handling authentication, chat and communications, accounts and transactions, etc. It's almost like extreme refactoring; if a piece of functionality can run asynchronously from the gameplay then it can be siphoned out into a new server and messaging to and from the game server set up accordingly. But in general, MMO game servers are limited in their capacity, so that you can typically only get 500-1500 players in one place. You can change the definition of 'place' by adding instancing and shards, you can make the world seem to hold more characters by seamlessly linking servers together at the boundaries, and you can increase concurrency a bit more via farming out tasks to special servers. So I wonder; are we doing it wrong? And more specifically, can we move to a system of homogeneous server nodes, created on demand, communicating via message passing, to achieve a larger single-shard world? Partly, the current MMO server architecture seems to be born out of habit. What started off as servers designed to accommodate a small number of people grew and grew until we have what we see today - but the underlying assumption is that a game server should (in most cases) be able to take a request from a client, process it atomically and synchronously, and alter the game state instantly, often replying at the same time. We keep all game information in RAM because that is the only way we can effectively handle the request synchronously. And we keep all co-located entities in the same RAM because that's the only way we can easily handle multiple-entity transactions (eg. trading gold for items). But does this need to be the case? My guess is that the main reason we can't move to a more distributed architecture comes partly down to latency but mostly down to complexity. If characters exist across an arbitrary number of servers, any action involving multiple characters is going to require passing messages to those other processes and getting all the responses back before proceeding. This turns behaviour that used to be a single function into either a coroutine (awkward in C++) or some sort of callback chain, also requiring error-detection (eg. if one entity no longer exists by the time the messages get processed) and synchronisation (eg. if one entity is no longer in a valid state for the behaviour once all the data is collected). This seems somewhat intractable to me - if what used to be a simple piece of functionality is now 3 or 4 times as complex, you're unlikely to get the game finished. And will the latency be too high? For many actions, I expect not, but for others, I fear it would. But am I wrong? Outside of games people are writing large and complex applications using message queues and asynchronous behaviour. My suspicion is that they can do this because they don't have a large amount of shared state (eg. world and character data). But maybe it's because they know ways to accomplish these tasks that somehow the game development community has either not become aware of or simply not been able to implement yet. Obviously there have been attempts to mix the two ideas, by running many homogeneous servers but attempting to co-locate all relevant data on demand so that the actual work can be done in the traditional way, by operating atomically on entities in RAM. On paper this looks like a great solution, with the only problem being that it doesn't seem to work in practice. (eg. Project Darkstar and various offshoots.) Sending the entities across the network so that they can be operated on appears to be like trying to send the mountain to Mohammed rather than him going to the mountain (ie. sending the message to the entity). What you gain in programming simplicity you lose in serialisation costs and network latency. A weaker version of this would be automatic geographical load balancing, I suppose. So, I'd like to hear any thoughts on this. Can we make online games more amenable to an async message-passing approach? Or are there fundamental limitations at play?

Networking and Multiplayer Programming

Started by Kylotan June 10, 2014 01:26 PM

65 comments, last by wodinoneeye 9 years, 9 months ago

JMDeruty

112

July 17, 2014 10:10 AM

In my opinion, gameplay comes first, then we imagine software to make it possible. I mean that building an angine to enable compact crowds of players with collision & physics simulation don't looks like enough to define the player experience and what really matters. Is the compact crowd (public transportation like) even something desirable to provide a good experience? If not, the issue should be dealt with before (disable collisions in crowded/marketplace areas)

And if you need that in the game (large scale -interesting- close combat battles?), I doubt that a generic engine would do.

That mean a lot of testing & tweaking, keeping in mind that what's important is not accurancy but player experience according to the gameplay mechanics. Players everywhere would mean that individuals loose significance. That's not PvP, but Mob vs Mob. Maybe some sort of LOD with precise local interactions then working with densities and statistical behaviors at longer range to reduce the quantity of informations required. In this case you have to deal with the issues listed by hplus0603. At the end it depends on the game's priorities and compromises.

PS: I have met web architects advising to use simple stateless loadbalancing for realtime multiplayer with Redis to manage send queues and game state. If you need physics or whatever, it won't work. Too much latency, and that's not Redis fault. The software is fine but not made for this. In my opinion, that's shoe horning REST optimisation into something which cannot be made stateless. Because there are to many player interactions. It works for chats however because: 1) There is not so much user interactions 2) latency is not really a concern.

hplus0603

11,916

July 17, 2014 04:09 PM

In my opinion, gameplay comes first, then we imagine software to make it possible.

More than one game company has died by designing something that, in the end, the engineering team couldn't actually deliver.

enum Bool { True, False, FileNotFound };

Servant of the Lord

33,739

July 17, 2014 05:33 PM

In my opinion, gameplay comes first, then we imagine software to make it possible.

More than one game company has died by designing something that, in the end, the engineering team couldn't actually deliver.

Death by Romero. Or Blackley if your prefer.

While gameplay and story definitely trumps graphics, designers need to know what tech limits they have and work within (and press against and stretch) those limits as far as possible, not dream up whatever amazing thing comes to their mind, and then realize 5 months before the deadline that you'll have to make major cuts to actually be able to release the game on current technology. Having to cut alot of gameplay features is like poorly resizing an image and leaving ugly graphical artifacts. It ruins the cohesiveness and polish of the game.

JMDeruty

112

July 17, 2014 05:39 PM

In my opinion, gameplay comes first, then we imagine software to make it possible.

More than one game company has died by designing something that, in the end, the engineering team couldn't actually deliver.

Fine, I cannot disagree with that But that's not the what I meant: Many companies died by designing something which was way more complicated and beautiful than what the actual need required. So without clear goal & specifications, we (myself first) tend to overthink things.

Sorry to have implied that. At the end this question is more a question of well designed development process, and evaluation. And thanks for the reference, I didn't know it.

More in line with the topic, a few years ago I had read from Bigworld about using simulated players to evaluate the scaling behavior of MMOs. With the availability of AWS or Azure, do you know if this kind of testing has been used effectively by studios? Apart from the difficulty of simulating pertinent player behaviors, are there other pitfalls?

Servant of the Lord

33,739

July 17, 2014 06:58 PM

More in line with the topic, a few years ago I had read from Bigworld about using simulated players to evaluate the scaling behavior of MMOs. With the availability of AWS or Azure, do you know if this kind of testing has been used effectively by studios? Apart from the difficulty of simulating pertinent player behaviors, are there other pitfalls?

I don't know about MMOs, but the indie game SpyParty (a 1 vs 1 game) wrote up a series of articles about using Amazon to loadtest their lobby and game hosting server:

Loadtesting for Open beta, Part 1 & [Part 2], [Part 3], [Part 4]

I found it to be an interesting read.

Jurie Horneman

125

July 18, 2014 12:26 PM

Fascinating subject. A couple of observations and half-remembered stories, which I hope contribute a little bit, even if they don't answer Ben's original questions:

- A good friend of mine worked on a proper MMO back in 2001 (Rhyzome), and I remember him doing the math for me and showing that he had, I dunno, 10 CPU cycles to decide which data to send where (based on the data fields per player and the number of players per server). So he had to write multiple levels of prioritization algorithms. He told me this story to illustrate how unsuited big, expensive web servers were for MMOs. This has probably changed, although I don't know to which degree.

- Unlike what Ben said I am seeing signs of splitting things up into multiple servers per service type in the web world. It seems to have become the architectural style du jour, in fact. But I have no strong evidence, this is just what I'm picking up.

- State seems to be the key difference. HTTP is stateless, games not so much. The rising popularity of unit testing and test-driven development correlates with the rise of web dev. The same goes for functional programming and the rise of back end languages like Erlang and Clojure.

- I've seen a case in a social game company of engineering pushing back design because of increased state. I assume this happens a lot. I've certainly seen enough cases of eventual consistency in social games.

- "Proper" MMOs are probably the cheetahs (or koalas) of the engineering world. They grew for the historical reasons Ben described, and now they occupy a niche that is very hard to fill in any other way than the way they do it.

- If you want to know how the big web companies do scalability, I highly recommend http://highscalability.com There's a lot of material (presentations, papers) available. Favorite arcane detail: bidding algorithms to control EC2 costs...

wodinoneeye

1,691

July 20, 2014 12:05 AM

Higher complexity can require scaling to be even greater (not linear) than previous simpler games.

(Good) NPC AI for example with farmed-out AI processing -- and those seperate NPC AI computers having to maintain their own local world representations - with all the volumes of world map updates flowing (hopefully across a high speed server network). Now with the greatly increase data traffic through individual world map zone servers (the state book keeping process) that starts overwhelming/burdening them (requiring yet more of them and the overhead of the zone/area edge handling - for large continuous worlds)

Communication bound limitations as a secondary effect to the data processing bound (and AI uses magnitudes more CPU and significantly more local data per 'smart' object)

The new complexity can require another O(N^2) expansion as there are more interactions across CPUs handling fewer and fewer objects each (and the traffic having to go across the much slower network interface instead of within the same shared memory space)

--------------------------------------------[size="1"]Ratings are Opinion, not Fact

wodinoneeye

1,691

July 20, 2014 12:15 AM

Ping times are a source of complexity and gameplay challenges, but they are not a source of scalability problems.

Ping times for wired connections will not drop dramatically in the future, because they are bound by the speed of light -- current internet is already within a factor of 50% of the speed of light, so the maximum possible gains are quite well bounded.

I'm not sure if I'm misreading you. But I feel like what you said is very misleading. The real world performance of network infrastructure is not even slightly approaching 50% light. We typically max at 20% in best case scenarios.

The majority of transit time is eaten up by protocol encoding/decoding in hardware, and improving the hardware or the protocol can dramatically increase transit latency. Ex. Going from tcp to infinband inside a cluster can reduce latency from 2milliseconds to nanoseconds.

Not saying it's practical by any means, but we're bound by switches/protocols far more than light.

And by the number of 'hops' along the way of the path the data takes (repeating the above overhead over and over).

Maybe in the future we will have a more 'Johnny Canal' (https://screen.yahoo.com/johnny-canal-000000884.html) type Internet system

with fewer hops but well that costs lots of cash ...

--------------------------------------------[size="1"]Ratings are Opinion, not Fact

MMOs and modern scaling techniques

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

MMOs and modern scaling techniques

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines