SpatialOS single shard MMO


Hi! I'm Gabriel, former game dev and very interested in niche topics such as client/server network architectures or pathfinding for games, now working at Improbable. I hope I can clarify a few things about SpatialOS - apologies for the long reply but there’s loads of great stuff to discuss here!

@khawk your understanding of SpatialOS is pretty much spot-on. An alternative explanation that I really like is this: imagine the traditional Entity-Component-System architecture, where each system is a distributed system rather than a thread on a server. SpatialOS lets you do this, without having to actually write a distributed system.
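To make that analogy concrete, here is a minimal sketch (invented names, not the SpatialOS API) of an ECS system written so it only touches the components it simulates and communicates purely through state updates - which is exactly what lets it run as a separate worker process:

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using EntityId = std::uint64_t;

struct Position { float x, y, z; };
struct Velocity { float dx, dy, dz; };

// The slice of the world this worker has checked out - only the
// component types the physics system actually needs.
struct PhysicsView {
    std::unordered_map<EntityId, Position> positions;
    std::unordered_map<EntityId, Velocity> velocities;
};

// One "system as a worker": apply incoming state, simulate one step,
// and return the state updates to send back to the runtime.
std::vector<std::pair<EntityId, Position>> PhysicsStep(PhysicsView& view, float dt) {
    std::vector<std::pair<EntityId, Position>> updates;
    for (auto& [id, vel] : view.velocities) {
        auto it = view.positions.find(id);
        if (it == view.positions.end()) continue;
        it->second.x += vel.dx * dt;
        it->second.y += vel.dy * dt;
        it->second.z += vel.dz * dt;
        updates.emplace_back(id, it->second);
    }
    return updates;
}

Nothing in PhysicsStep knows whether the view came from local memory or from the network, which is the property the "distributed ECS" description relies on.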

Quote

 FWIW, I did some press time with them at GDC

I was there!!! From your description it sounds like you tried Worlds Adrift by Bossa Studios, or some of the other games by our partners (like Chronicles of Elyria by @JWalsh, whom I also had the pleasure to meet!). Did anyone give you a 1:1 tour of our Wizards demo game? We're now offering this tour on the website, and I can't recommend it enough - it should make the core concepts and the workflow clear to everyone (if it doesn't, please message me - I run the team responsible for this content, so feedback is more than welcome!)

@hplus0603

Quote

latencies cannot be managed, where noisy neighbors can flood your network,

Yes, but this is also the case for any client-server game, and it depends on your internet connection at home; whether there's a single server or a swarm of workers on the other side can't improve that.

Quote

processes that communicate intimately end up being placed on different floors of a mile-long data center

The internal latencies in a mile-long data center are so small compared to the latency from your home to the datacenter, that the latency you experience as a gamer will be dominated by the latter (as I said above, this is no different to connecting to a game with a single server).

That said, SpatialOS is called SpatialOS for a reason :) Locality of reference is one of the core concepts of how the load balancing and worker allocation algorithms work.  We go to great lengths to make sure that entities in close proximity in the virtual world are physically close in the datacenter - usually within the same physical server.

To see all this in action, I'd suggest you take a look at Worlds Adrift, which has to be pretty much real-time because it's a very physics-heavy game. Every single thing in the game is physically simulated, including individual ship parts - as you can see in that video!
Quote

N-squared, as we know, leads to an upper limit on the number of objects that can go into a single server. Designing your game to avoid this helps not just servers, but also gameplay.

Yep, agreed. Just like we can't work around the speed of light to provide zero end-to-end latency, we can't do much about the way O(n²) works :( As you point out, this has to be solved with a mixture of game design and clever software techniques.

Note that O(n²) appears in different places. Interactions of objects within a worker are one, as you point out, but there's also O(n²) network communication between workers, and that is something we can and do address - through smart distribution and migration of entities between physical servers. This is another non-trivial problem that SpatialOS solves, and one that is invisible to the developer and the player (see this for an example).
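As an illustration of that idea (this is the textbook version, not Improbable's actual algorithm): if entities are assigned to workers by spatial cell, most interactions stay inside one worker, and cross-worker traffic is proportional to the border rather than to n².

#include <cmath>
#include <cstdint>

struct Vec2 { float x, y; };

// Map a world position to a worker index on a gridW x gridH grid of
// regions; entities in the same cell are simulated by the same worker,
// and an entity crossing a cell boundary migrates to the neighbour.
std::uint32_t WorkerFor(Vec2 pos, float cellSize, std::int64_t gridW, std::int64_t gridH) {
    std::int64_t cx = static_cast<std::int64_t>(std::floor(pos.x / cellSize));
    std::int64_t cy = static_cast<std::int64_t>(std::floor(pos.y / cellSize));
    cx = ((cx % gridW) + gridW) % gridW;   // wrap negatives into [0, gridW)
    cy = ((cy % gridH) + gridH) % gridH;
    return static_cast<std::uint32_t>(cy * gridW + cx);
}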

@hplus0603 your thoughts about the impact of the perception range are very interesting. SpatialOS does this in a different way, though.

In SpatialOS, the allocation of workers is dynamic, and it follows the workload of the simulation around. This minimises the migrations that need to happen. There is co-simulation where the areas overlap, but we make this invisible both to developers and players. Distributed physics is a particularly fun example of this - we've written about that in a blog post.

Second, SpatialOS does this at the component level rather than the entity level. So you could have a game world simulated by 100 physics workers (e.g. instances of PhysX) but only 10 game logic workers if it's a physics-heavy game. Or 10 physics workers, 50 AI workers and 5 pathfinding workers. The point is that every kind of worker needs to "see" just a narrow subset of the components of an entity (generally, the ones it is able to simulate), and this reduces the bandwidth requirements tremendously. Workers follow the workload, and they do this in a layered way, so in practice there are far fewer physical migrations than you may think.
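A toy illustration of component-level checkout (names invented; in the real product this kind of interest is declared in schema and configuration rather than code like this):

#include <cstdint>
#include <set>

using ComponentId = std::uint32_t;

constexpr ComponentId kPosition     = 1;
constexpr ComponentId kRigidBody    = 2;
constexpr ComponentId kAiBlackboard = 3;

// Each worker type declares the component set it needs to "see";
// everything else on the same entity is never sent to it.
struct WorkerInterest {
    std::set<ComponentId> components;
    bool Wants(ComponentId c) const { return components.count(c) > 0; }
};

// A physics worker checks out only physical state...
const WorkerInterest kPhysicsWorker{{kPosition, kRigidBody}};
// ...while an AI worker checks out position plus its own components,
// so the two worker types can scale independently.
const WorkerInterest kAiWorker{{kPosition, kAiBlackboard}};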

@drainedman

Quote

I can't afford $50k but I am planning on building a beowulf cluster to support a kind of MMO with 100k+ entities (not all human but all are persistent).

Sounds like a job for SpatialOS :) You get the benefits of running a cluster, but without any of the complexity of setting up and running a cluster. In fact, you don't have to write networking code at all - you see this in practice in our Wizards demo.

About the $50k, are you aware of our Games Innovation Program? We understand this is a concern for a lot of developers, so in a nutshell, we've partnered with Google Cloud to offer subsidies for usage costs to users enrolled in the Program. Read more details in this article. Also, development on your local machine is always free, so you can try SpatialOS with your SDK of choice, try the Wizards demo or the Pirates tutorial, etc.

@Kylotan

Quote

Potential downsides include hosting and operation costs, since you can probably get it cheaper via a specialised solution (although you also run the risk of over-specifying and paying for capacity you don't need), and development costs, because you have to write the entire game using their paradigm, which may not suit you.

You would be surprised! We ran the numbers in detail, and the economics may not be so different, especially when you really scale your game up. And at that point you also need to consider the cost and expertise of having a dedicated infrastructure / DevOps team. Spilt Milk, the creators of Lazarus, touch upon these topics in this article. As a small team with no previous large scale networking experience, they went from zero to a continuous playable alpha with 3000 concurrent players in 4 months.

Once again, this was a great thread to read! Happy to address any other questions you may have :)

1 hour ago, ggambett said:

Yes, but this is also the case for any client-server game, and it depends on your internet connection at home;

 

1 hour ago, ggambett said:

The internal latencies in a mile-long data center are so small compared to the latency from your home to the datacenter, that the latency you experience as a gamer will be dominated by the latter

Interesting, so what range of network performance would be ideal from the player's point of view?

I mean, if only players with the highest bandwidth can play, then what is the point of a huge game world?

 

How do you plan to deal with the view range? It is notable that in the Wizards demo the camera is set up to prevent the player from seeing into the distance. The Pirates tutorial has nothing there, only a flat plane.

 

Isn't the point of making a large world to have enough space for a huge player base, for players to see they are in a large open world, and to have hundreds of players interacting with each other at once?

At the moment it still looks like having servers for each region is a better idea than having one large server where you lump all the players together. Then there is also the fact that the further a player is from the server, the more they will lag.

(Replying to ggambett, 2 posts up)

Hi Gabriel, thanks for coming on here and answering some questions. I can see why the product is an attractive one for people who aren't able or willing to manage their own hosting. That is one reason why it is a more attractive proposition than what my former company was offering 10 years ago, because although we provided a very similar architecture, the expectation there was that the game developer would provide and manage their own servers. So things have changed there.

I still think the paradigm shift necessary to use such a system can be complex. The "zero networking code" aspect is not a big deal these days since most game engines offer that to some degree with state replication - but learning to write code with fewer 'shared-everything' assumptions and more message passing is tricky for many game developers. That's not a criticism of your tech, as the problem is intrinsic to running a distributed simulation. But it's also why some games have gone down the WoW route and simply decided it wasn't a problem that was worth solving.

Additionally, trying to describe it as like an entity-component system for the cloud is a negative in this regard because it's clear from posts on Gamedev.net that most developers aren't comfortable with that approach and struggle significantly with creating clear partitions between components and in handling complex multi-entity/component interactions. That's arguably why the 2 major engines allow full communication between arbitrary entities and components - the alternatives make otherwise easy interactions quite complex to handle.

 

Quote

Quote

latencies cannot be managed, where noisy neighbors can flood your network,

Yes, but this is also the case for any client-server game, and it depends on your internet connection at home; whether there's a single server or a swarm of workers on the other side can't improve that.

I think you misunderstood me. I'm talking entirely about things that go on inside a virtualized, cloud-hosted data center.

Because it uses virtualization for the machine hosts, you are subject to the requirements of the virtualization platform, and that often introduces significant (many milliseconds) latencies in scheduling, because the VM hosts are all optimized for batch throughput, not for low-latency real-time processing. For real-time simulation running close to full machine utilization, a physics step time that goes from 15.5 milliseconds to 17.5 milliseconds will make you miss your deadline (the frame budget at 60 Hz is about 16.7 milliseconds). For real-time physics games, I much prefer bare metal for this reason.

It's also interesting that you mention co-simulation across visibility borders and PhysX at the same time. PhysX is not deterministic, so any co-simulation across borders will diverge. With enough authoritative network state snapshots, you can mash that with brute force, of course.

Regarding the "borders moving with load," that's something we looked at an implemented for There.com, but it ended up not being useful for real gameplay, because players tended to gather in the same kind of gathering places. Meanwhile, the view distance across borders (i e, how much you need to co-simulate) has to be determined by the "visibility range" of your objects. If your object is a missile cruiser with a range of 150 kilometers, you have to have an instance of the object on any server that touches this radius, so that it can do target acquisition. (Either you have an instance of the cruiser on each server within the radius, OR you bring a copy of each object within that radius to the cruiser's server -- if there are fewer cruisers than targets, you want the former, for hopefully obvious reasons.) If you have a soldier with a sniper rifle with a 2 kilometer scope, you have to be able to see each object within two kilometers, or the player will be sad.

I'm pointing this out not to cast shade on SpatialOS, but to show that any distributed server framework has to be used with gameplay design that goes hand-in-hand with the networking/simulation capabilities, and each solution will bring with it specific limitations you have to accept as a game designer. Using words such as "invisible to the developer" or "without having to think about distribution" sounds great in marketing, but ends up not actually being helpful to the end developer. And, honestly, it's actually untrue for all but the most trivial kinds of games. I've found that the companies that end up doing the best in gaming are those that are clear about the pros and cons of their systems, and that do not make over-simplified promises they cannot actually deliver on (without tons of caveats) in their marketing.

 

enum Bool { True, False, FileNotFound };
1 hour ago, Scouting Ninja said:

 I mean, if only players with the highest bandwidth can play, then what is the point of a huge game world?

I didn't say that; apologies if I expressed myself in an unclear way. What I tried to say is that the player's latency to a datacenter is the same regardless of whether there's a single server inside the datacenter running the game, or a cluster of a hundred (because the internal latency is minimal). So the network requirements of a SpatialOS game are no more strict than a regular game running on a single server. On the flip side, the SpatialOS game is not limited by whatever that single server and engine can handle (and servers can get only so big).

 

1 hour ago, Scouting Ninja said:

How do you plan to deal with the view range?

Each worker has a configurable "checkout radius", which is effectively the view range. There are interesting LOD techniques you can apply to updates to minimise the bandwidth impact of a big checkout radius.
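One common LOD trick of that kind (a sketch of the general technique, not a description of a SpatialOS feature) is to scale an entity's network update rate down with distance from the observer, so a large checkout radius doesn't multiply bandwidth:

#include <algorithm>

// Returns updates per second for an entity 'distance' metres away:
// full rate inside 'nearRange', decaying linearly to 'minHz' at 'farRange'.
float UpdateRateHz(float distance, float nearRange, float farRange,
                   float maxHz = 60.0f, float minHz = 1.0f) {
    if (distance <= nearRange) return maxHz;
    if (distance >= farRange) return minHz;
    float t = (distance - nearRange) / (farRange - nearRange);
    return maxHz + t * (minHz - maxHz);  // blend between the two rates
}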

We also have the concept of streaming queries, which allows workers to "see" entities that wouldn't normally fall within their checkout radius. This is used in Worlds Adrift, for example, where islands are visible from vast distances, much farther away than where a worker would need 60 Hz position updates for things on their surface. Hopefully this also answers @hplus0603's question (although this is the only instance of "view across borders", because the border of a client's region of interest is its view range).

 

2 hours ago, Scouting Ninja said:

Isn't the point of making a large world to have enough space for a huge player base, for players to see they are in a large open world, and to have hundreds of players interacting with each other at once?

Absolutely, and this is exactly the kind of experience that SpatialOS enables - take a look at Worlds Adrift for an example of exactly that :)

 

2 hours ago, Scouting Ninja said:

At the moment it still looks like having servers for each region is a better idea than having one large server where you lump all the players together.

That may be more appropriate for some types of games. SpatialOS offers different load-balancing modes, and one of them puts workers in a static configuration. Note that even in this case, workers aren't equivalent to servers handling regions; all the workers combine to simulate a single continuous game world with no hard boundaries, regardless of whether you choose a static or a dynamic worker allocation setup.

 

2 hours ago, Kylotan said:

I can see why the product is an attractive one for people who aren't able or willing to manage their own hosting.

Making massively distributed systems is not exactly simple, especially for non-embarrassingly-parallelizable problems such as physics. Of course as a game developer you want to spend most of your time and effort making a game and exploring the creative possibilities of large worlds, not a spatially distributed cloud compute platform!

But even studios that are experts in, and famous for, long-running MMOs see the value in using SpatialOS. The clearest example I can offer of this is Jagex, the creators of RuneScape, with whom we've recently partnered.
 

2 hours ago, Kylotan said:

I still think the paradigm shift necessary to use such a system can be complex. [...] That's not a criticism of your tech, as the problem is intrinsic to running a distributed simulation.

There is a bit of a paradigm shift, absolutely. But what you get for your investment in learning how to work with SpatialOS is the possibility of building games of a scale and richness that is currently beyond the reach of most game developers.

 

2 hours ago, Kylotan said:

most developers [...] struggle significantly with creating clear partitions between components [...] That's arguably why the 2 major engines allow full communication between arbitrary entities and components - the alternatives make otherwise easy interactions quite complex to handle.

I would argue that's not a limitation of the ECS (or ECW) architecture. In fact, this limitation doesn't exist in SpatialOS - any component of any entity can communicate with any other component of any other entity by sending a command (essentially RPCs), no matter where it is (in the game world, or in computational terms).
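For illustration, here is an invented sketch of the command pattern described above - a request/response pair addressed to a component on another entity, delivered regardless of which worker hosts the target (this is not the SDK's actual API; the handler is stubbed in-process):

#include <cstdint>
#include <functional>
#include <string>

using EntityId = std::uint64_t;

struct OpenDoorRequest { EntityId door; };
struct OpenDoorResponse { bool opened; std::string reason; };

// Stand-in for the target component's command handler; in a deployment
// it could live on an entirely different worker.
OpenDoorResponse HandleOpenDoor(const OpenDoorRequest& req) {
    if (req.door == 0) return {false, "no such door"};
    return {true, ""};
}

// Commands may fail (timeout, authority migration), so the reply arrives
// asynchronously and the caller decides whether to retry.
void SendOpenDoor(const OpenDoorRequest& req,
                  std::function<void(const OpenDoorResponse&)> onReply) {
    onReply(HandleOpenDoor(req));  // delivered over the network in reality
}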

 

1 hour ago, hplus0603 said:

PhysX is not deterministic, so any co-simulation across borders will diverge. With enough authoritative network state snapshots, you can mash that with brute force, of course.

I have limited knowledge of physics simulation, but I understand stable simulation is not trivial even on a single instance of a physics engine; forcing more than one physics engine to cooperate, when neither is even aware of the other's existence, is not a matter of brute force :) I refer again to this experiment we made (the video is pretty cool!).

But the broader point is that a SpatialOS game developer doesn't even have to think about this - no discussion about whether to use brute force or something more subtle, no code to deal with this or with anything related to co-simulation.

 

1 hour ago, hplus0603 said:

Using words such as "invisible to the developer" or "without having to think about distribution" sounds great in marketing, but ends up not actually being helpful to the end developer.

But this is pretty much true in the case of SpatialOS, and I say this as both a game developer and a hardcore software engineer, not as a marketing person. As discussed above, there is a bit of a paradigm shift required, but you really don't have to think in terms of implementing or running a distributed system; game logic involves workers receiving state updates, doing whatever computation they need, and sending back state updates. There really isn't any need to write networking code, or any kind of manual synchronization code, or, in general, to even be aware that you're making a massively distributed game rather than a single-player game running on a single machine, except in the broadest terms (e.g. commands may fail, so you may need to add some custom retry logic if you're not happy with the defaults).
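Schematically, that loop looks something like this (all names invented; the connection functions are stubs standing in for the runtime):

#include <chrono>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <vector>

using EntityId = std::uint64_t;
struct Position { float x, y, z; };
struct StateUpdate { EntityId id; Position pos; };

// Stubs standing in for the runtime connection.
std::vector<StateUpdate> ReceiveUpdates() { return {}; }
void PublishUpdates(const std::vector<StateUpdate>&) {}

void WorkerLoop() {
    std::unordered_map<EntityId, Position> entities;  // state this worker simulates
    const float dt = 1.0f / 30.0f;
    for (;;) {
        for (const auto& u : ReceiveUpdates())        // 1. apply incoming state
            entities[u.id] = u.pos;
        std::vector<StateUpdate> out;
        for (auto& [id, pos] : entities) {            // 2. simulate one step
            pos.z -= 9.8f * dt;                       //    toy gravity
            out.push_back({id, pos});
        }
        PublishUpdates(out);                          // 3. send state back
        std::this_thread::sleep_for(std::chrono::milliseconds(33));
    }
}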

Don't take my word for this; you can play actual games built on SpatialOS right now (Worlds Adrift, Lazarus). You can download the SDK and play with the Wizards demo or the Pirates tutorial right now. It's all freely available, fully documented, in production, and with enough examples and starter projects to get you started (github.com/spatialos).

Quote

game logic involves workers receiving state updates, doing whatever computation they need, and sending back state updates

I know what you're talking about. I built a system that had many similar properties, including this programming model for entities. Our sales people used the same marketing claims. 

It turns out, there are things developers do "as a matter of course" that end up generating way too much RPC traffic to scale well. Developers need to know what the distribution decomposition is, if they want to get anywhere near (say, within an order of magnitude of) the theoretical maximum performance of the system. Naive developers, even using your carefully crafted API that attempts to make developing distributed objects easy, and "hiding" RPC/messaging, WILL flood your system to the point where scalability is 1/100th of what it should be.

Further, developers will assume that RPC or events are reliable AND bounded in time. As you know, you can't get both of those at the same time across a lossy network ("two generals problem.")

If, in the context of "paradigm shift," you suggest that developers also need to train themselves to know about these things, then yes, once you accept the paradigm, you live inside the paradigm. But that paradigm includes limitations that are imposed by your particular distribution model. That's an unavoidable outcome of distributed games, and it's what makes distributed games (and other systems) an order of magnitude harder to work with than in-RAM single-player games, and pretending that they're the same does nobody any favors. (Except possibly salespeople on commission who would prefer to close deals early over closing the right deals -- luckily, I've managed to avoid most of those in my life!)

 

enum Bool { True, False, FileNotFound };

This is how my tiny brain understands the problem.

Ultimately, locality is at the core of it (server side).

Suppose an entity is processed 30 times in 1 second.

In order for this entity to process its behaviour within that slice of time, it must take in information about its nearby space. For example, a brick falling through space will need to know about its neighbours so it can go bumpity-bump-bump with other bricks. We can cheat a bit by generally disregarding entities a long way away to reduce the number of interactions from squared to linear.

When working within a single process we can do entity neighbour lookups very quickly, as RAM access is pretty quick (we have seen this used to great effect with CUDA PhysX demos and so on). 10,000 entities at 30 Hz will mean 300,000 neighbour queries per second, on top of physics, behaviour calculations, etc. Quite manageable within one process.
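A minimal uniform-grid broadphase of the kind described here (a generic sketch, not any particular engine's implementation): bucket entities by cell so a neighbour query touches only nearby cells instead of all N entities - the squared-to-roughly-linear cheat from above.

#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

using EntityId = std::uint64_t;
struct Vec2 { float x, y; };

struct Grid {
    float cell;
    std::unordered_map<std::int64_t, std::vector<EntityId>> buckets;

    // Pack a (cx, cy) cell coordinate into one well-defined 64-bit key.
    std::int64_t Key(int cx, int cy) const {
        return (static_cast<std::int64_t>(static_cast<std::uint32_t>(cx)) << 32)
             | static_cast<std::uint32_t>(cy);
    }
    void Insert(EntityId id, Vec2 p) {
        buckets[Key(int(std::floor(p.x / cell)), int(std::floor(p.y / cell)))].push_back(id);
    }
    // Gather candidates from the 3x3 block of cells around 'p'.
    std::vector<EntityId> Nearby(Vec2 p) const {
        std::vector<EntityId> out;
        int cx = int(std::floor(p.x / cell)), cy = int(std::floor(p.y / cell));
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy) {
                auto it = buckets.find(Key(cx + dx, cy + dy));
                if (it != buckets.end())
                    out.insert(out.end(), it->second.begin(), it->second.end());
            }
        return out;
    }
};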

However, to scale up to more entities we want to split the workload across two or more nodes, and those processes cannot access each other's RAM. Non-local entities (entities in different processes) must talk to each other through some other medium.

Quick, large distributed shared memory isn't an option with current hardware (although surely some hardware guru could build it), so we use something like standard networking. Because of this our communication speed between non-local entities has dropped by a factor of about 200 or worse.

To compound this latency, we find that highly dynamic environments such as MMOs will vary the load, which can mean high volumes of traffic in concentrated areas. Some entities will travel insanely fast across multiple nodes (speeding bullets, airplanes, cars).

Also, some very selfish entities are particularly inconsiderate and want to exchange HIGH volumes of data, and want to do it instantly. Transactions in marketplaces spring to mind. Or perhaps a car with a 10-hour pre-planned route. Or an entity with a daily schedule.

We can't really ever get around these limitations until we (somehow) increase the speed of access across all nodes to be as quick as if they were all a single unified node.

Meanwhile, designing the game/simulation is critical to making the whole experience balance out. I don't even think it's possible, in the generic sense, to make a scalable MMO with current hardware. Actors, services, workers, etc. I view as a kind of syntax flim-flam, froo-froo. It does little to address the limitations.

My personal pet favourite tech to tackle this problem is MPI: https://en.wikipedia.org/wiki/Message_Passing_Interface

You are not wrong :-)

Quote

We can cheat a bit by generally disregarding entities a long way away to reduce the number of interactions from squared to linear.

It's actually still quadratic, but quadratic in a smaller number (number of entities divided by number of servers, times number of entity copies needed for cross-border resolution.)

Similarly, a single locality query for "nearby" objects is not a constant cost, but actually has a cost that is linear in the number of neighbors. While there may be 300,000 queries for 10,000 entities at 30 Hz, each query may return more than one entity, and thus may cost more than "1" along a few cost metrics (storage, memory touched, entities to check against, etc.)
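Restating the first point in symbols, with N entities, S servers, and B ghost copies per server for cross-border resolution:

S \cdot \left(\frac{N}{S} + B\right)^{2} \;=\; \frac{N^{2}}{S} + 2NB + SB^{2} \;\;\xrightarrow{\,B \to 0\,}\;\; \frac{N^{2}}{S}

So for a fixed server count the work is still quadratic in N; adding servers divides the leading term, while B grows with the perception range discussed earlier in the thread.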
 

Quote

My personal pet favourite tech to tackle this problem is MPI

MPI lets you send messages between processes, using non-lossy but also not-real-time-aware TCP RPC.

This is a useful primitive to use when building distributed systems, but it doesn't really get at the real question, which is "how do you structure your game design to make best use of distributed servers, and avoid placing undue burden on the server system that you have?"
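For concreteness, the primitive in question is small; a minimal MPI exchange between two ranks looks like this (compile with mpic++, launch with mpirun -n 2):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    float state = 42.0f;
    if (rank == 0) {
        // Rank 0 owns the entity state and sends it to rank 1.
        MPI_Send(&state, 1, MPI_FLOAT, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&state, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received state %.1f\n", state);
    }
    MPI_Finalize();  // delivery is reliable, but nothing bounds its latency
    return 0;
}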

It seems like SpatialOS implements a particular kind of trade-off and an API to support developing entities for that trade-off. Other systems do a similar thing, reaching different conclusions with different base assumptions. This is why "how can I compare these different platforms?" is such a hard question, because it totally depends on the specifics of your game. Farmville works great on plain Amazon EC2 web server instances. Unreal Tournament, not as much.

 

enum Bool { True, False, FileNotFound };
13 minutes ago, hplus0603 said:

This is a useful primitive to use when building distributed systems, but it doesn't really get at the real question, which is "how do you structure your game design to make best use of distributed servers, and avoid placing undue burden on the server system that you have?"

It seems like SpatialOS implements a particular kind of trade-off and an API to support developing entities for that trade-off. Other systems do a similar thing, reaching different conclusions with different base assumptions. This is why "how can I compare these different platforms?" is such a hard question, because it totally depends on the specifics of your game. Farmville works great on plain Amazon EC2 web server instances. Unreal Tournament, not as much.

 

Indeed MPI is just a tool and not a complete solution.

SpatialOS is a "solution" platform but it remains to be seen that it can do the more demanding applications such as a huge Unreal tournament, for example.

Of course, as you pointed out, we have seen this kind of tech many times before. Pikkoserver and Shinra spring to mind as the latest failures.

I don't think it's an impossible feat as such, just a bit limiting with current hardware. I don't really know! I suspect the answer lies more in hardware than in software, however.

14 hours ago, ggambett said:

 

17 hours ago, Kylotan said:

most developers [...] struggle significantly with creating clear partitions between components [...] That's arguably why the 2 major engines allow full communication between arbitrary entities and components - the alternatives make otherwise easy interactions quite complex to handle.

I would argue that's not a limitation of the ECS (or ECW) architecture. In fact, this limitation doesn't exist in SpatialOS - any component of any entity can communicate with any other component of any other entity by sending a command (essentially RPCs), no matter where it is (in the game world, or in computational terms).

I think hplus0603 already touched upon this, but the problem is that this decomposition never comes for free. When you split logic over multiple objects you have to decide how and when those objects communicate, and what gets communicated. It's rare that any desired communication is impossible but it's common that certain communications become more verbose, more complex, slower, or all of the above.

With a distributed simulation the problem gets worse. If one component has to send an asynchronous message to another to get data, that is quick but complex. If you wrap the asynchronous message and the receipt of a response into an RPC call, that is simple but slow. I see that your commands use a Request/Response pair, which is basically the former approach, and is exactly what we used at the first MMO company I worked at. It is elegant at the networking and simulation level, and complex at the game logic level.
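A sketch of the two shapes of that trade-off (all names invented):

#include <cstdint>
#include <functional>
#include <unordered_map>

using RequestId = std::uint64_t;

// Asynchronous style: cheap on the wire, but game logic has to carry a
// pending-request table and pick the conversation up in a later frame.
struct PendingRequests {
    std::unordered_map<RequestId, std::function<void(int)>> callbacks;
    void OnResponse(RequestId id, int value) {
        auto it = callbacks.find(id);
        if (it != callbacks.end()) {
            it->second(value);
            callbacks.erase(it);
        }
    }
};

// Blocking-RPC style: trivially readable, but the caller stalls for a
// full network round trip (stubbed here), which is the "slow" half.
int GetStatBlocking() { return 0; }  // stub: would block on the network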

Again, I'm not criticising the SpatialOS technology - just stating that it really does require a mental shift to implement things using such a model and that developers still have to think like a network programmer even if they never have to worry about packets or serialisation. :)

