MMOs and modern scaling techniques



#1 Kylotan   Moderators   -  Reputation: 3338


Posted 10 June 2014 - 07:26 AM

(NB. I am using MMO in the traditional sense of the term, ie. a shared persistent world running in real-time, not in the modern broader sense, where games like Farmville or DOTA may have a 'massive' number of concurrent players but there is little or no data that is shared AND persistent AND updating in real-time.)

 

In recent discussions with web and app developers one thing has become quite clear to me - the way they tend to approach scalability these days is somewhat different to how game developers do it. They are generally using a purer form of horizontal scaling - fire up a bunch of processes, each mostly isolated, communicating occasionally via message passing or via a database. This plays nicely with new technologies such as Amazon EC2, and is capable of handling 'web-scale' amounts of traffic - eg. clients numbering in the tens or hundreds of thousands - without problem. And because the processes only communicate asynchronously, you might start up 8 separate processes on an 8-core server to make best use of the hardware.

 

In my experience of MMO development, this is not how it works. There is a lot of horizontal scaling, but instead of firing up servers on demand, we pre-allocate them and tend to divide them geographically - both in terms of real world location so as to be closer to players, and in terms of in-game locations, so that characters that are co-located also share the same game process. This would seem to require more effort on the game developer's part but also imposes several extra limitations, such as making it harder to play with friends located overseas on different shards, requiring each game server to have different configuration and data, etc. Then there is the idea of 'instancing' a zone, which could be thought of as another geographical partition except in an invisible 4th dimension (and that is how I have implemented it in the past).

 

MMOs do have a second trick up their sleeve: it is common to farm out certain tasks to various heterogeneous servers. A typical web app might just have many instances of the front-end server and one database (possibly with some cache servers in between), but in my experience MMOs will often have specific servers for handling authentication, chat and communications, accounts and transactions, etc. It's almost like extreme refactoring; if a piece of functionality can run asynchronously from the gameplay then it can be siphoned off into a new server, with messaging to and from the game server set up accordingly.

 

But in general, MMO game servers are limited in their capacity, so that you can typically only get 500-1500 players in one place. You can change the definition of 'place' by adding instancing and shards, you can make the world seem to hold more characters by seamlessly linking servers together at the boundaries, and you can increase concurrency a bit more via farming out tasks to special servers.

 

So I wonder; are we doing it wrong? And more specifically, can we move to a system of homogeneous server nodes, created on demand, communicating via message passing, to achieve a larger single-shard world?

 

Partly, the current MMO server architecture seems to be born out of habit. What started off as servers designed to accommodate a small number of people grew and grew until we have what we see today - but the underlying assumption is that a game server should (in most cases) be able to take a request from a client, process it atomically and synchronously, and alter the game state instantly, often replying at the same time. We keep all game information in RAM because that is the only way we can effectively handle the request synchronously. And we keep all co-located entities in the same RAM because that's the only way we can easily handle multiple-entity transactions (eg. trading gold for items). But does this need to be the case?

 

My guess is that the main reason we can't move to a more distributed architecture comes partly down to latency but mostly down to complexity. If characters exist across an arbitrary number of servers, any action involving multiple characters is going to require passing messages to those other processes and getting all the responses back before proceeding. This turns behaviour that used to be a single function into either a coroutine (awkward in C++) or some sort of callback chain, also requiring error-detection (eg. if one entity no longer exists by the time the messages get processed) and synchronisation (eg. if one entity is no longer in a valid state for the behaviour once all the data is collected). This seems somewhat intractable to me - if what used to be a simple piece of functionality is now 3 or 4 times as complex, you're unlikely to get the game finished. And will the latency be too high? For many actions, I expect not, but for others, I fear it would.

 

But am I wrong? Outside of games people are writing large and complex applications using message queues and asynchronous behaviour. My suspicion is that they can do this because they don't have a large amount of shared state (eg. world and character data). But maybe it's because they know ways to accomplish these tasks that somehow the game development community has either not become aware of or simply not been able to implement yet.

 

Obviously there have been attempts to mix the two ideas, by running many homogeneous servers but attempting to co-locate all relevant data on demand so that the actual work can be done in the traditional way, by operating atomically on entities in RAM. On paper this looks like a great solution, with the only problem being that it doesn't seem to work in practice. (eg. Project Darkstar and various offshoots.) Sending the entities across the network so that they can be operated on appears to be like trying to send the mountain to Mohammed rather than him going to the mountain (ie. sending the message to the entity). What you gain in programming simplicity you lose in serialisation costs and network latency. A weaker version of this would be automatic geographical load balancing, I suppose.

 

So, I'd like to hear any thoughts on this. Can we make online games more amenable to an async message-passing approach? Or are there fundamental limitations at play?




#2 hplus0603   Moderators   -  Reputation: 5163


Posted 10 June 2014 - 10:08 AM

If I remember correctly, City of Heroes actually would spin up additional instances of "the same area" when player counts got too high. We also did that in There.com, to support large parties. This is good from a player point of view, too; you'd rather see all the players you can interact with, than having 1000 players in the same area and you can only see the nearest three meters...

 

When it comes to distributed simulation with scalable player density (not just player count), that's a much tougher nut to crack. Web applications do not have nearly the level of coupling between different business objects that games do. When researching this back in the day, the best option I could come up with would be one where objects are only allowed to affect other objects one tick into the future, and there would be a big cross-bar of messaging of "these are all the interactions I detected this turn" that would be exchanged between the simulation servers between ticks. There would still be an n-squared scaling problem between servers, but with 40 or even 100 Gbps local networking, you can get pretty high in numbers before that becomes a bottleneck. This architecture lets objects be homed on a particular server, and then roam the entire world (assuming the server can keep the static world in RAM) and interact with all other objects with a one-tick delay. This is also very similar to the original DIS distribution model, except DIS typically had a single entity (plane, tank, etc.) per network node, rather than, say, 1000 entities per node.
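
To make that concrete, here is a toy sketch of the tick loop I'm describing (the names Interaction and SimServer are just illustrative; this is not code from There.com or any shipped engine):

# Interactions detected during one tick are only applied on the next tick,
# after every server has exchanged its detected-interaction batch with peers.
from dataclasses import dataclass

@dataclass
class Interaction:
    target_id: int   # entity affected; may be homed on another server
    effect: dict     # e.g. {"damage": 12}

class SimServer:
    def __init__(self, local_entities):
        self.local_entities = local_entities   # entities homed on this server

    def tick(self, incoming):
        # 'incoming' is every interaction any server detected last tick.
        for inter in incoming:
            target = self.local_entities.get(inter.target_id)
            if target is not None:
                target.apply(inter.effect)

        # Simulate locally, recording new interactions without applying them.
        detected = []
        for entity in self.local_entities.values():
            detected.extend(entity.simulate())

        # The caller broadcasts 'detected' to every peer (the n-squared
        # cross-bar) so everyone can apply it on the next tick.
        return detected

The point is that no server ever reaches into another server's state mid-tick; everything crosses the wire as a batch between ticks.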

 

The other problem is that dense simulation IS an N-squared problem. If everyone piles into a big dogpile, you will potentially have interactions between all players and all other players. There's no way around that. Same thing as collision detection -- in the worst case, every object really DOES interact with every other object.


enum Bool { True, False, FileNotFound };

#3 Kylotan   Moderators   -  Reputation: 3338


Posted 10 June 2014 - 11:01 AM


If I remember correctly, City of Heroes actually would spin up additional instances of "the same area" when player counts got too high. We also did that in There.com, to support large parties. This is good from a player point of view, too; you'd rather see all the players you can interact with, than having 1000 players in the same area and you can only see the nearest three meters...

 

Yeah, I alluded to that at the end of my 3rd paragraph. To be honest I hate the 'instancing' approach, but it does help you scale and some players prefer it (eg. so that they and their friends get a dungeon all to themselves).

 

 

 


Web applications do not have nearly the level of coupling between different business objects that games do.

 

I am inclined to agree. However I met someone yesterday who vehemently claimed that modern online games are already using web scaling approaches - though, when pressed, he was unable or unwilling to name a game that does this and which has a large amount of shared state, citing only that League of Legends has 500K concurrent players. But as we know, that is actually 50K concurrent games each with 10 players - quite a different problem.

 

 

 


[...]the best option I could come up with would be one where objects are only allowed to affect other objects one tick into the future[...]

I've seen similar approaches used in threaded systems to reduce contention, and it seems to work well, if you are able to get all the logic right. I am still concerned about algorithms such as trading, however. Is there an easy way to conduct atomic trades in such a system that is impossible to exploit? I am aware of (though not too familiar with) concepts like three-phase commits for such purposes, but I suspect that writing all events that affect 2 or more entities using such a system would be too complex.

 

I guess I am generally interested in whether we're missing any tricks from web and app development, by being stuck in our ways and developing servers in much the same way we did 10 years ago. For example, some people are suggesting holding all data in persistent storage and manipulating it via memcached, and on a similar line Cryptic once insisted that they needed every change to characters to hit the DB (resulting in them writing their own database to make this feasible), rather than the usual "change in memory, serialise to DB later" approach. But do these methods pay off?


Edited by Kylotan, 10 June 2014 - 11:07 AM.


#4 SeanMiddleditch   Members   -  Reputation: 5113


Posted 10 June 2014 - 11:59 AM

Your best bet is to just look at documentation and GDC presentations from companies already doing single-shard MMOs, like CCP (with EVE Online).

#5 VFe   Members   -  Reputation: 120


Posted 10 June 2014 - 03:02 PM

I'm in a somewhat similar situation: I'm currently involved in the design of scalable game systems. I come from a "big data" devops background, and have been facing many of the same issues. I have the same impression you got, that what those of us writing "large applications" do is completely alien to most game developers who only have game dev backgrounds.

 

 

My take-away so far:

 


  • The game itself being designed to be scalable is probably the biggest thing. GIGO. Some types of game-play are just inherently bad for scalability. So you're left with a choice of latency vs scalability.

  • In many cases, there are ways to break events apart - increasing their complexity - in order to increase their scalability. In situations where this is asymptotically impractical or impossible, you need to design out the local machine state. That is to say, if events must be processed sequentially, design them so that the sequential elements can be processed on any server. Borrow from REST where you can. The trade-off here is that you're often increasing your processing or network overhead by as much as 3x... but honestly, it's often worth it. It's counter-intuitive, especially for game developers, to design purposefully wasteful systems.

  • Another problem for game devs in particular: OOP (the design and principles, not necessarily the languages) is the enemy. It requires an extreme paradigm shift; the more state you can design away, the more scalability you get.


 

We're using a system very similar to what you're describing - asynchronous message passing across geographically diverse clusters (implemented in Clojure and Erlang, if anyone's curious, with Riak as our back-end database) - for large parts of our game system. And it works really well.

 

I'd say you can use these methods for just about any game system outside FPS, fighting games, maybe racing. Anything that is super latency sensitive is going to end up being a bad choice, but for everything else, I think it's a major win.

 

 

I guess I am generally interested in whether we're missing any tricks from web and app development, by being stuck in our ways and developing servers in much the same way we did 10 years ago. For example, some people are suggesting holding all data in persistent storage and manipulating it via memcached, and on a similar line Cryptic once insisted that they needed every change to characters to hit the DB (resulting in them writing their own database to make this feasible), rather than the usual "change in memory, serialise to DB later" approach. But do these methods pay off?

 

 

This, imho, is actually a really powerful method for games, and similar to how we're doing things: every action is recorded to persistent storage (Riak), and then other players are essentially reading other players' writes, with the memcache or application layer managing it.

 

Depending on the DB setup, this can be fast enough for real-time, but as with anything it's a tradeoff a la the CAP theorem. For us, we decided that a little bit more latency in exchange for more scalability was a worthwhile tradeoff.
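
As a rough sketch of the shape of that (the store and cache objects below are hypothetical stand-ins, not our actual Riak/memcache wiring):

import json, time, uuid

def record_action(store, cache, player_id, action):
    # Persist the action first - the store is the durable source of truth.
    doc = {"player": player_id, "action": action, "ts": time.time()}
    store.put(f"action:{player_id}:{uuid.uuid4()}", json.dumps(doc))
    # Then update the cached view that other players actually read from.
    recent = cache.get(f"recent:{player_id}") or []
    cache.set(f"recent:{player_id}", [doc] + recent[:49])

def observe_player(store, cache, player_id):
    # Other players read your writes through the cache; a miss falls back
    # to scanning the persistent store and repopulating the cache.
    recent = cache.get(f"recent:{player_id}")
    if recent is None:
        docs = [json.loads(v) for v in store.scan(f"action:{player_id}:")]
        recent = sorted(docs, key=lambda d: d["ts"], reverse=True)[:50]
        cache.set(f"recent:{player_id}", recent)
    return recent

How stale the cached view is allowed to get is exactly the latency-vs-scalability knob mentioned above.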


Edited by VFe, 10 June 2014 - 03:13 PM.


#6 Kylotan   Moderators   -  Reputation: 3338


Posted 10 June 2014 - 03:59 PM

Your best bet is to just look at documentation and GDC presentations from companies already doing single-shard MMOs, like CCP (with EVE Online).

 

EVE's architecture - as far as I can tell - is not really any different from most of the others, in that it's geographically partitioned (specifically, one process per Solar System). They have some pretty hefty hardware on the back-end for persistence, which presumably is why they don't need to run multiple shards. I'm guessing that they don't have a great need for low latency either, which helps.

 

I don't know any other single-shard MMOs that are of a significant size; I'd be interested to learn of them (and especially of their architecture).


Edited by Kylotan, 10 June 2014 - 03:59 PM.


#7 Kylotan   Moderators   -  Reputation: 3338


Posted 10 June 2014 - 04:10 PM


Some types of game-play are just inherently bad for scalability.

 

Sure. My hypothesis is that the traditional MMORPG is bad for scalability. Lots of actions depend on being able to query, and conditionally modify, more than one entity simultaneously. If I want to trade gold for an NPC's sword, how do we do that without being able to lock both characters? It's not an intractable problem - there are algorithms for coordinating trades between 2 distributed agents - but they are 10x more complex to implement than if a single process had exclusive access to them both.

 

(The flippant answer is usually to delegate this sort of problem to the database; but while this can make the exchange atomic, it doesn't help you much with ensuring that what is in memory is consistent, unless you opt for basically not storing this information in memory, which brings back the latency problem... etc)
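
For what it's worth, the kind of algorithm I mean looks roughly like the reserve-then-commit sketch below. Everything here is hypothetical (the ask/tell helpers, the message shapes), and real code would still need timeouts and retries on top - which is rather my point about complexity.

# Two-party trade via reserve/commit/cancel messages. Each agent only ever
# mutates its own state; the coordinator never touches it directly.
def trade(coordinator, buyer, seller, gold, item_id):
    # Phase 1: ask each side to place its half in escrow (reversible).
    gold_res = coordinator.ask(buyer, {"type": "reserve_gold", "amount": gold})
    item_res = coordinator.ask(seller, {"type": "reserve_item", "item": item_id})

    if not (gold_res["ok"] and item_res["ok"]):
        # Release whichever reservation did succeed, then give up.
        if gold_res["ok"]:
            coordinator.tell(buyer, {"type": "cancel", "token": gold_res["token"]})
        if item_res["ok"]:
            coordinator.tell(seller, {"type": "cancel", "token": item_res["token"]})
        return False

    # Phase 2: both halves are escrowed; swap them.
    coordinator.tell(buyer, {"type": "commit", "token": gold_res["token"],
                             "receive_item": item_id})
    coordinator.tell(seller, {"type": "commit", "token": item_res["token"],
                              "receive_gold": gold})
    return True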

 


every action is recorded to persistent storage(Riak), and then other players are essentially reading other players writes in a sense. With the memcache or application layer managing it.

 

This sounds interesting, but I would love to hear some insights into how complex multi-player interactions are implemented. Queries that can be high performance one-liners when the data is in memory are slow queries when you call out to memcache. And aren't there still potentially race conditions in these cases?



#8 hplus0603   Moderators   -  Reputation: 5163


Posted 10 June 2014 - 08:05 PM

modern online games are already using web scaling approaches


There really are two kinds of things going on in most games.
One is the real-time component, where many other players see my real-time movements and my real-time actions.
The other is the persistent game state component, which is not that different from any other web services application.
The trick is that, if you try to apply web services approaches to the real-time stuff, you will fail, and if you try to apply the real-time approach to the web services stuff, you will pay way too much to build whatever you want to build.

Some projects have tried to unify these two worlds in one way or another (for example, Project Darkstar) but it's never really worked out well. In my opinion, doing that is a bit like trying to unify real-time voice communication with one-way streamed HD video delivery -- the requirements are so drastically different, it just doesn't make any sense to do that.
That analogy isn't entirely bad: One-way streamed video is a lot more like "web apps" than real-time voice communication, which is more like interactive games, except without the N-squared interactions and physics rules.

So, one way of approaching the scalability problem is to dumb down the gameplay until it's no longer a real-time, interactive, shared world. That doesn't make for fun games for those people who want those things. The other way is to be careful about only building the bits that you really do need for your desired gameplay, and optimizing those parts to the best of the state of the art.
Getting the game developers and the web developers to understand each other, and work under the same roof, AND making sure that the right tool is used for the right problem, while delivering the needed infrastructure for gameplay development, is really hard to pull off; I've seen many projects die because they miss one of those "legs" and don't even know what they're missing. (A bit of the Dunning-Kruger effect, coupled with "I've got a great hammer!")
enum Bool { True, False, FileNotFound };

#9 Washu   Senior Moderators   -  Reputation: 4992


Posted 10 June 2014 - 09:35 PM


EVE's architecture - as far as I can tell - is not really any different from most of the others, in that it's geographically partitioned (specifically, one process per Solar System).

Yes, one solar system per process; however, they have the ability to move individual systems around to give a system extra processing power when it needs it. When you've got 7.5k players participating in a single battle in a single system, it requires a lot of processing power - the physics simulation alone takes a huge amount of time.

 


They have some pretty hefty hardware on the back-end for persistence, which presumably is why they don't need to run multiple shards.

The reason they don't run multiple shards is that it doesn't fit the game. The economy and the resources used to build everything are the core; multiple shards would defeat that purpose. As such, it was architected around the idea of a single unified system.

 


I'm guessing that they don't have a great need for low latency either, which helps.

Up until the introduction of time dilation, latency and system resources were actually the major factors in determining the winner of a fight. When they switched from standard blocking sockets to IOCP they reduced the system resource issue, but latency was still a major factor in determining who won a fight. Prior to TD and the IOCP switch, the dreaded black screen of death (as it was known) was the major issue facing players in determining the winner of a fight. In essence, whoever got the most players into a system first... won. People with less latency (mainly Brits and Europeans) had a distinct advantage here, due to the location of the EVE data center.

 

I do recommend reviewing their GDC presentations, especially on their back-end architecture; it's quite flexible. Now, it wouldn't necessarily work in a zoneless system, although I bet that through some clever trickery, using player locality, you could engineer a system of virtual boxes or spheres of players that overlapped and allowed you to grow or shrink on demand.


In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.
ScapeCode - Blog | SlimDX


#10 snacktime   Members   -  Reputation: 291


Posted 10 June 2014 - 10:04 PM

Ok so I'll chime in here.

 

Distributed architectures are the right tool for the job. What I've seen is that the game industry is just not where most of the innovation with server-side architectures is happening, so they are a bit behind. That has been my observation from working in the industry and just looking around and talking to people.

 

When it comes to performance, it's all about architecture. And the tools available on the server side have been moving fast in recent years. Traditionally, large multiplayer games weren't really great with architecture anyway, and for a variety of reasons they just kind of stuck with what they had. This is changing, but slowly.

 

I think the game industry also had other bad influences, like trying to apply the lessons learned about client-side performance to the server.

 

The secret to the current distributed systems is that almost all of them are based on message passing, usually using the actor model. No shared state, messages are immutable, and there is no reliability at the message or networking level. Threading models are also entirely different. For example, the platform I work a lot with, Akka, in simple terms passes actors between threads instead of locking each one to a specific thread. They can use amazingly small thread pools to achieve high levels of concurrency.
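
As a rough illustration of that model (a toy, not Akka): an actor is just private state plus a mailbox, and a small pool of workers runs whichever actor currently has mail.

import queue, threading

class Actor:
    def __init__(self, state):
        self.state = state               # private to this actor, never shared
        self.mailbox = queue.Queue()
        self.lock = threading.Lock()     # stands in for the scheduler's
                                         # one-message-at-a-time guarantee

    def receive(self, message):
        if message["type"] == "add_gold":
            self.state["gold"] = self.state.get("gold", 0) + message["amount"]

runnable = queue.Queue()                 # actors with pending messages

def send(actor, message):
    actor.mailbox.put(message)           # fire-and-forget, immutable message
    runnable.put(actor)                  # schedule the actor for one delivery

def worker():
    while True:
        actor = runnable.get()
        with actor.lock:                 # never run two messages on one actor
            actor.receive(actor.mailbox.get())

for _ in range(4):                       # tiny thread pool, as many actors as you like
    threading.Thread(target=worker, daemon=True).start()

player = Actor({"gold": 40})
send(player, {"type": "add_gold", "amount": 10})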

 

What you get out of all that is a system that scales with very deterministic performance, and you have a method to distribute almost any workload over a large number of servers.

 

Another thing to keep in mind is that absolute performance usually matters very little on the server side.  This is something that is just counter-intuitive for many game developers.  For an average request to a server, the response time difference between a server written in a slow vs a fast language is sub 1ms.  Usually it's in microseconds.  And when you factor in network and disk IO latencies, it's white noise.  That's why scaling with commodity hardware using productive languages is commonplace on the server side.  The reason why you don't see more productive languages used for highly concurrent stuff is not that they are not performant enough; it's that almost all of them still have a GIL (global interpreter lock) that limits them to basically running on a single CPU in a single process. My favorite model now for being productive is to use the JVM, but use JVM languages such as JRuby or Clojure when possible, and drop to Java only when I really need to.

 

For some of the specific techniques used in distributed systems, consistent hashing is a common tool.  You can use that to spread workloads over a cluster and when a node goes down, messages just get hashed to another node and things move on.
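
The core of a consistent-hash ring is only a few lines - sketch below, node names made up:

import bisect, hashlib

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.ring = []                   # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):      # virtual nodes smooth out the load
                self.ring.append((_hash(f"{node}#{i}"), node))
        self.ring.sort()

    def node_for(self, key):
        h = _hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

    def remove(self, node):
        # When a node dies, its keys simply fall through to the next node.
        self.ring = [(h, n) for (h, n) in self.ring if n != node]

ring = HashRing(["game-1", "game-2", "game-3"])
print(ring.node_for("player:1234"))      # which node owns this player's work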

 

Handling things like transactions is not difficult; I do it fairly often.  I use an actor with an FSM, and it handles the negotiation between the parties.  You write the code so all messages are idempotent, and from there it's straightforward.

 

Handling persistence in general is fairly straightforward on a distributed system.  I use Akka a lot, and I basically have a large distributed memory store based on actors in a distributed hash ring, backed by NoSQL databases with an optional write-behind cache in between.  Because every unique key you store is mapped to a single actor, everything is serialized.  For atomic updates I use an approach that's similar to a stored procedure.  Note that I didn't say this was necessarily easy.  There are very few off-the-shelf solutions for stuff like this.  You can find the tools to make it all, but you have to wire it up yourself.
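
Stripped of the Akka machinery, the shape of it is something like this (a toy sketch; the backend is any hypothetical object with get/put):

import threading, time

class WriteBehindStore:
    def __init__(self, backend, flush_interval=1.0):
        self.backend = backend           # e.g. a nosql client with get/put
        self.data = {}                   # in-memory authoritative copies
        self.dirty = set()
        self.lock = threading.Lock()     # stands in for "one actor per key"
        threading.Thread(target=self._flusher, args=(flush_interval,),
                         daemon=True).start()

    def update(self, key, fn):
        # All updates to a key go through here, so they apply serially.
        with self.lock:
            current = self.data.get(key)
            if current is None:
                current = self.backend.get(key)   # lazy load from the DB
            self.data[key] = fn(current)
            self.dirty.add(key)
            return self.data[key]

    def _flusher(self, interval):
        while True:
            time.sleep(interval)
            with self.lock:
                keys, self.dirty = self.dirty, set()
                snapshot = {k: self.data[k] for k in keys}
            for key, value in snapshot.items():
                self.backend.put(key, value)      # durable, but behind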

 

Having worked on large games before, my recent experience with distributed systems has been very positive.  A lot of it comes down to how concurrency is handled.  Having good abstractions for that in the actor model makes so many things simpler.  Not to say there are no challenges left - you hit walls with any system - I'm just hitting them much later now with almost everything.



#11 hplus0603   Moderators   -  Reputation: 5163


Posted 11 June 2014 - 11:17 AM

Distributed architectures are the right tool for the job. What I've seen is that the game industry is just not where most of the innovation with server-side architectures is happening, so they are a bit behind.


You are absolutely correct that game developers often don't even know that they don't know how to do business back ends effectively. Game technologies in general, and real-time simulations in particular, are like race cars running highly tuned engines on a highly controlled race track. Business systems are like trucks carrying freight over a large network of roads. Game developers, too often, try to carry freight on race cars.

Race cars are great at what they do, though. That's the bit that is fundamentally different from the kind of business-object, web architectures that you are talking about. (And I extend that to not-just-HTTP; things like Storm or JMS or Scalding also fit in that mold.) When you say that "the optimization of a single server doesn't matter," it tells me that you've never tried to run a real physics simulation of thousands of entities, that can all interact, in real time, at high frame rates. There are entire classes of game play and game feel that are available now, that were not available ten years ago, explicitly because you can cram more processing power and higher memory throughput into a single box. A network is, by comparison, a high-latency, low-bandwidth channel. If you're not interested in those kinds of games, then you wouldn't need to know about this, but the challenge we discuss here is games where this matters just as much as the ability to provide consistent long-term data services to lots of users.

I see, equally often, business-level engineers, and web engineers, who are very good at what they do, failing to realize the demands of game-type realtime interactive simulation. (Or, similarly, multimedia like audio -- check out the web audio spec for a real train wreck some time.)

A really successful, large-scale company has both truckers and race car drivers, and gets them to talk to each other and understand each other's unique challenges.

Edited by hplus0603, 11 June 2014 - 12:58 PM.

enum Bool { True, False, FileNotFound };

#12 Kylotan   Moderators   -  Reputation: 3338


Posted 11 June 2014 - 12:47 PM

This is basically the discussion that led me to post here - two sides each saying "I've done it successfully, and you can't really do it the way that the other people do it". Obviously this can't be entirely true. :) I suspect there is more to it.

 

Let me ask some more concrete questions.

 

 

 


No shared state, messages are immutable, and there is no reliability at the message or networking level.

 

Ok - but there is so much in game development that I can't imagine trying to code in this way. Say a player wants to buy an item from a store. The shared state model works like this:

BuyItem(player, store, itemID):
    Check player has enough funds, abort if not
    Check itemID exists in store inventory, abort if not
    Deduct funds from player
    Add funds to store
    Add instance of itemID to player inventory
    Commit player funds and inventory to DB
    Notify player client and any observing clients of purchase

This is literally a 7-line function if you have decent routines already set up. Let's say you have to do it in a message-passing way, where the store and player are potentially in different processes. What I see - assuming you have coroutines or some other syntactic sugar to make this look reasonably sequential rather than callback hell - is something like this:

BuyItem(player, store, itemID):
    Check player has enough funds, abort if not
    Ask store if it has itemID in store inventory. Wait for response.
    If store replied that it did not have the item in inventory:
        abort.
    Check player STILL has enough funds, abort if not
    Deduct funds from player
    Tell store to add funds. Wait for response.
    If store replied that it did not have the item in inventory:
        add funds to player
        abort
    Add instance of itemID to player inventory
    Commit player funds to DB
    Notify player client and any observing clients of purchase

This is at least 30% longer (not counting any extra code for the store) and has to include various extra error-checks, which are going to make things error-prone. I suspect it gets even more complex when you try and trade items in both directions because you need both sides to be able to place items in escrow before the exchange (whereas here, it was just the money).
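
For illustration, here is roughly what I mean by the coroutine version, sketched with async/await; the ask/persist helpers and message shapes are made up.

async def buy_item(player, store_ref, item_id, price):
    if player.funds < price:
        return "insufficient funds"

    # Ask the store process to put the item on hold; wait for its reply.
    reply = await store_ref.ask({"type": "reserve", "item": item_id})
    if not reply["ok"]:
        return "item not available"

    if player.funds < price:                 # re-check: funds may have changed
        await store_ref.ask({"type": "cancel", "token": reply["token"]})
        return "insufficient funds"

    player.funds -= price
    confirm = await store_ref.ask({"type": "confirm", "token": reply["token"],
                                   "payment": price})
    if not confirm["ok"]:                    # store died or timed out
        player.funds += price                # roll our side back
        return "trade failed"

    player.inventory.add(item_id)
    await persist(player)                    # commit funds and inventory to DB
    return "ok"

Even with the syntactic sugar, every await is a point where the world can change under you, which is where the extra checks come from.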

 

So... is there an easier or safer way I could have written this? I wouldn't even attempt this in C++ - without coroutines it would be hard to maintain the state through the routine. I suppose some system that allows me to perform a rollback of an actor would simplify the error-handling but there are still more potential error cases than if you had access to both sides of the trade and could perform it atomically.

 

You talk about "using an actor with a FSM", but I can't imagine having to write an FSM for each of the states in the above interaction. Again, comparing that to a 7-line function, it's hard to justify in programmer time, even if it undoubtedly scales further. I appreciate something like Akka simplifies both the message-passing and the state machine aspects, so there is that - but it's still a fair bit of extra complexity, right? (Writing a message handler for each state, swapping the message handler each time, stashing messages for other states while you do so, etc.)

 

Maybe you can generalise a bit - eg. make all your buying/selling/giving/stealing into one single 'trade' operation? Then at least you're not writing unique code in each case.

 

As for "writing the code so all messages are idempotent" - is that truly practical? I mean, beyond the trivial but worthless case of attaching a unique ID to every message and checking that the message hasn't been already executed, of course. For example, take the trading code above - if one actor wants to send 10 gold pieces to another, how do you handle that in an idempotent way? You can't send "add 10 gold" because that will give you 20 if the message arrives twice. You can't send "set gold to 50" because you didn't know the actor had 40 gold in the first place.

 

Perhaps that is not the sort of operation you want to make idempotent, and instead have the persistent store treat it as a transaction. Fair enough, and the latency wouldn't matter if you only do this for things that don't occur hundreds of times per second and if your language makes it practical. (But maybe there aren't all that many such routines? The most common one is movement, and that is easily handled in an idempotent way, certainly.)

 

Forgive my ignorance if there is a simple and well-known answer to this problem; it's been a while since I examined distributed systems on an academic level.


Edited by Kylotan, 11 June 2014 - 12:47 PM.


#13 ApochPiQ   Moderators   -  Reputation: 15063


Posted 11 June 2014 - 04:22 PM

There's a lot of confusion and mistruth surrounding the way successful MMOs tend to be implemented. This can be attributed to any number of causes, but I think the bottom line is twofold:

1. Successful MMO developers know a lot more about distributed scale than the "hrrr drrr web-scale" crowd tends to realize.
2. Successful MMO developers rarely divulge all the secrets to their success. This feeds into Point 1.


These are hard problems, no doubt. It takes truly excellent engineering to solve them. People who claim to have found solutions but can't point to shipped and operational code are almost certainly in for a rude surprise when they actually try and put millions of players on their systems.

Scale is a very nonlinear thing. A lot of people intuit scale as being roughly linear... a hundred people is twice as many as fifty people, right? The fact is, you can get truly bizarre behavior with high-scalability systems. You might go from thousands of connections and 2% CPU usage and 90% free RAM to all of your hardware deadlocked and overheating by just adding a few dozen connections. You might get things that run faster by some metrics when loaded up with players. And so on.

Scalable distributed software is like quantum physics. If you think you can intuit what's going on, you're fucking delusional, period. This stuff is notoriously difficult and messy.


To (sort of, not really) answer the concrete questions that were posed earlier...

Yes, you need to make virtually all messages and transactions idempotent. The exceptions are fairly boring cruft, like pings and keep-alives and whatnot. But be careful if you assume that a ping/keep-alive isn't going to become a potential source of performance pain.

And no, you don't need to write a huge amount of extra code. Much of it can be abstracted and factored into reusable components fairly easily. You only need to write three-phase commit once, and then apply that routine to your transactions going forward; and so on. If you design carefully, you can eliminate a lot of the redundant cruft, and most of your code reads like a DSL that speaks in terms of messages and transactions.
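
To be clear about what "write it once" means: a coordinator along these lines gets written a single time, and every transaction in the game goes through it. This is a heavily simplified sketch - no timeouts, no coordinator recovery - and the participant interface is invented for illustration.

def three_phase_commit(participants, transaction):
    # Phase 1: everyone votes on whether they *could* apply the transaction.
    if not all(p.can_commit(transaction) for p in participants):
        for p in participants:
            p.abort(transaction)
        return False

    # Phase 2: everyone acknowledges it is prepared (state is recoverable).
    if not all(p.pre_commit(transaction) for p in participants):
        for p in participants:
            p.abort(transaction)
        return False

    # Phase 3: apply for real on every participant.
    for p in participants:
        p.do_commit(transaction)
    return True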

That said, it does take writing it the hard way once or twice to learn where you can refactor effectively; I don't know of any shortcuts to that.

#14 snacktime   Members   -  Reputation: 291


Posted 11 June 2014 - 04:26 PM

This is basically the discussion that led me to post here - two sides each saying "I've done it successfully, and you can't really do it the way that the other people do it". Obviously this can't be entirely true. :) I suspect there is more to it.

 

Let me ask some more concrete questions.

 

 

 


No shared state, messages are immutable, and there is no reliability at the message or networking level.

 

Ok - but there is so much in game development that I can't imagine trying to code in this way. Say a player wants to buy an item from a store. The shared state model works like this:

BuyItem(player, store, itemID):
    Check player has enough funds, abort if not
    Check itemID exists in store inventory, abort if not
    Deduct funds from player
    Add funds to store
    Add instance of itemID to player inventory
    Commit player funds and inventory to DB
    Notify player client and any observing clients of purchase

This is literally a 7-line function if you have decent routines already set up. Let's say you have to do it in a message-passing way, where the store and player are potentially in different processes. What I see - assuming you have coroutines or some other syntactic sugar to make this look reasonably sequential rather than callback hell - is something like this:

BuyItem(player, store, itemID):
    Check player has enough funds, abort if not
    Ask store if it has itemID in store inventory. Wait for response.
    If store replied that it did not have the item in inventory:
        abort.
    Check player STILL has enough funds, abort if not
    Deduct funds from player
    Tell store to add funds. Wait for response.
    If store replied that it did not have the item in inventory:
        add funds to player
        abort
    Add instance of itemID to player inventory
    Commit player funds to DB
    Notify player client and any observing clients of purchase

This is at least 30% longer (not counting any extra code for the store) and has to include various extra error-checks, which are going to make things error-prone. I suspect it gets even more complex when you try and trade items in both directions because you need both sides to be able to place items in escrow before the exchange (whereas here, it was just the money).

 

So... is there an easier or safer way I could have written this? I wouldn't even attempt this in C++ - without coroutines it would be hard to maintain the state through the routine. I suppose some system that allows me to perform a rollback of an actor would simplify the error-handling but there are still more potential error cases than if you had access to both sides of the trade and could perform it atomically.

 

You talk about "using an actor with a FSM", but I can't imagine having to write an FSM for each of the states in the above interaction. Again, comparing that to a 7-line function, it's hard to justify in programmer time, even if it undoubtedly scales further. I appreciate something like Akka simplifies both the message-passing and the state machine aspects, so there is that - but it's still a fair bit of extra complexity, right? (Writing a message handler for each state, swapping the message handler each time, stashing messages for other states while you do so, etc.)

 

Maybe you can generalise a bit - eg. make all your buying/selling/giving/stealing into one single 'trade' operation? Then at least you're not writing unique code in each case.

 

As for "writing the code so all messages are idempotent" - is that truly practical? I mean, beyond the trivial but worthless case of attaching a unique ID to every message and checking that the message hasn't been already executed, of course. For example, take the trading code above - if one actor wants to send 10 gold pieces to another, how do you handle that in an idempotent way? You can't send "add 10 gold" because that will give you 20 if the message arrives twice. You can't send "set gold to 50" because you didn't know the actor had 40 gold in the first place.

 

Perhaps that is not the sort of operation you want to make idempotent, and instead have the persistent store treat it as a transaction. Fair enough, and the latency wouldn't matter if you only do this for things that don't occur hundreds of times per second and if your language makes it practical. (But maybe there aren't all that many such routines? The most common one is movement, and that is easily handled in an idempotent way, certainly.)

 

Forgive my ignorance if there is a simple and well-known answer to this problem; it's been a while since I examined distributed systems on an academic level.

 

Handling something like a transaction is really not that different in a distributed system.  All network applications deal with unreliable messaging; reliability and sequencing have to be added in somewhere.  Modern approaches put it at the layer that defined the need in the first place, as opposed to putting it into a subsystem and relying on it for higher-level needs, which is just a leaky abstraction and an accident waiting to happen.

 

If a client wants to send 10 gold to someone and sends a request to do that, the client has no way of knowing if the request was processed correctly without an ack.  But the ack can be lost, so the situation where you might have to resend the same request is present in all networked applications.
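
Concretely, making that resend safe just means the receiver remembers which request ids it has already applied - sketch below; the channel.request helper is made up:

applied_requests = set()   # in practice persisted alongside the balance

def handle_transfer(player, request_id, amount):
    # Receiver side: the same request can arrive twice, but only applies once.
    if request_id in applied_requests:
        return {"ack": request_id, "duplicate": True}
    player["gold"] += amount
    applied_requests.add(request_id)
    return {"ack": request_id, "duplicate": False}

def send_transfer(channel, request_id, amount, retries=5):
    # Sender side: keep the same request_id across retries; a lost ack just
    # means we resend, and the handler above makes that harmless.
    for _ in range(retries):
        ack = channel.request({"id": request_id, "amount": amount})
        if ack is not None:
            return ack
    raise TimeoutError("transfer not acknowledged")

So "add 10 gold" stays an add; the idempotence lives in the request id, not in turning it into a "set gold to 50".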

 

Blocking vs non blocking is mostly an issue that comes up at scale.  For things that don't happen 100,000 times per second, you don't need to ensure that stuff doesn't block.

 

As for FSMs, I use them a lot, but mostly because I can do them in Ruby, and Ruby DSLs I find easy to read and maintain.  That Akka state machine stuff I don't like and have never used; it seems a lot more awkward than it needs to be.  Things like purchasing items are not in the hot path, so I can afford to use more expressive languages for stuff like that.



#15 Kylotan   Moderators   -  Reputation: 3338


Posted 11 June 2014 - 05:08 PM


1. Successful MMO developers know a lot more about distributed scale than the "hrrr drrr web-scale" crowd tends to realize.
2. Successful MMO developers rarely divulge all the secrets to their success. This feeds into Point 1.

 

And yet, pretty much every published example of MMO scaling seems to focus on the old-school methods. You'd think, given how much has been said on the matter, that there would be at least one instance of people talking about using different methods, but I've not seen one. I was hoping someone on this thread would be able to point me in the right direction. Instead, I'm in much the same position as I was before I posted - people insist that newer methods are being used, but provide no citations. :)

 


Yes, you need to make virtually all messages and transactions idempotent.

 

I'd love to see an example of how to do this, given that many operations are most naturally expressed as changes relative to a previous state (which may not be known). I assume there is literature on this but I can't find it.

 


You only need to write three-phase commit once, and then apply that routine to your transactions going forward; and so on.

 

I suspected this might be the case but I am sceptical about the overhead in both latency and complexity. But then I don't have any firm evidence for either. :)



#16 Kylotan   Moderators   -  Reputation: 3338


Posted 11 June 2014 - 05:30 PM


If a client wants to send 10 gold to someone and sends a request to do that, the client has no way of knowing if the request was processed correctly without an ack. But the ack can be lost, so the situation where you might have to resend the same request is present in all networked applications.

 

I think we're crossing wires a bit here. Reliable messaging is a trivial problem to solve (TCP, or a layer over UDP), and thus it is easy to know that either (a) the request was processed correctly, or will be at some time in the very near future, or (b) the other process has terminated, and thus all bets are off. It's not clear why you need application-level re-transmission. But even that's assuming a multiple-server approach - in a single game server approach, this issue never arises at all - there is a variable in memory that contains the current quantity of gold and you just increment that variable with a 100% success rate. Multiple objects? No problem - just modify each of them before your routine is done.

 

What you're saying, is that you willingly forgo those simple guarantees in order to pursue a different approach, one which scales to higher throughput better. That's fine, but these are new problems, unique to that way of working, not intrinsic to the 'business logic' at all. With 2 objects co-located in one process you get atomicity, consistency, and isolation for free, and delegate durability to your DB as a high-latency background task.


Edited by Kylotan, 11 June 2014 - 05:32 PM.


#17 snacktime   Members   -  Reputation: 291


Posted 11 June 2014 - 06:09 PM

 


If a client wants to send 10 gold to someone and sends a request to do that, the client has no way of knowing if the request was processed correctly without an ack. But the ack can be lost, so the situation where you might have to resend the same request is present in all networked applications.

 

I think we're crossing wires a bit here. Reliable messaging is a trivial problem to solve (TCP, or a layer over UDP), and thus it is easy to know that either (a) the request was processed correctly, or will be at some time in the very near future, or (b) the other process has terminated, and thus all bets are off. It's not clear why you need application-level re-transmission. But even that's assuming a multiple-server approach - in a single game server approach, this issue never arises at all - there is a variable in memory that contains the current quantity of gold and you just increment that variable with a 100% success rate. Multiple objects? No problem - just modify each of them before your routine is done.

 

What you're saying, is that you willingly forgo those simple guarantees in order to pursue a different approach, one which scales to higher throughput better. That's fine, but these are new problems, unique to that way of working, not intrinsic to the 'business logic' at all. With 2 objects co-located in one process you get atomicity, consistency, and isolation for free, and delegate durability to your DB as a high-latency background task.

 

 

So this is an interesting topic actually.  The trend is to move reliability back up to the layer that defined the need in the first place, instead of relying on a subsystem to provide it.  

 

Just because the network layer guarantees that the packets arrive doesn't mean they get delivered to the business logic correctly, or processed correctly.  If you think 'reliable' UDP or TCP makes your system reliable, you are lying to yourself.

 

http://www.infoq.com/articles/no-reliable-messaging

http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.txt

http://doc.akka.io/docs/akka/2.3.3/general/message-delivery-reliability.html



#18 hplus0603   Moderators   -  Reputation: 5163


Posted 11 June 2014 - 06:49 PM

First: Try writing a robust real-time physics engine that can support even a small number like 200 players with vehicles on top of the web architecture that @snacktime and @VFe describe. I don't think that's the right tool for the job.

Second:

You'd think, given how much has been said on the matter, that there would be at least one instance of people talking about using different methods, but I've not seen one.


Personally, I've actually talked a lot about this in this very forum for the last ten-fifteen years. For reference, the first one I worked on was There.com, which was a full-on single-instance physically-based virtual world. It supported full client- and server-side physics rewind; a procedural-and-customized plane the size of Earth; fully customizable/composable avatars with user-generated content commerce; voice chat in world; vehicles ridable by multiple players, and a lot of other things; timeframe about 2001. The second one (where I still work) is IMVU.com where we eschew physics in the current "room based" experience because it's so messy. IMVU.com is written almost entirely on top of web architecture for all the transactional stuff, and on top of a custom low-latency ephemeral message queue (written in Erlang) for the real-time stuff. Most of that's sporadically documented in the engineering blog: http://engineering.imvu.com/
enum Bool { True, False, FileNotFound };

#19 ApochPiQ   Moderators   -  Reputation: 15063


Posted 11 June 2014 - 07:26 PM

If you have GDC Vault access, look up a talk by Pat Wyatt from GDC 2013 I think it was... maybe 2012.

 

If you don't have GDC Vault access, what's wrong with you?! :-P



#20 fir   Members   -  Reputation: -452


Posted 12 June 2014 - 04:02 AM

Cool thread, though I feel too incompetent to say much here.

Can someone tell me what pings (ping times) have to do with this?

Are ping times the source of most of the trouble, or is the problem something different? Is there any prospect that ping times will drop noticeably across the web in the future? Can someone say what their typical range is today? How responsive is an average connection between a player and a game server usually, i.e. the time for player -> server, then processing, then the time to read the other side's data?





