Thoughts about protocol & technology choices

Hello everyone! Before sharing my thoughts I'd like to introduce myself, making clear what my goals are and what I'd like to attain. I am 23 years old, living in Germany, currently going through an apprenticeship as an application developer in a purely office-centric company. I have been into hacking code ever since I got my first "real personal" computer, and I like my job very much. Who else gets to combine tiny bits of information and transform them into something actually meaningful? ;) However, since I started working at that company I have slowly become aware that I do indeed hate office-centric work, probably because it is mainly database chatter: process, display and print. And the number of people who actually enjoy my work is very limited, which I also do not like much. The whole thing is rather unsatisfying, in my opinion. I do not think I can continue working as a programmer in office-centric companies for long, since my mind is getting blunt over time :(

Since I became aware that I have probably wasted years of my life playing games, I got the mad idea of getting into the game industry. Now here is the problem: every major company producing (commercially successful) games requires applicants to provide references I cannot fulfil (yet), so I decided to design an architecture suited to serve as the core network technology for MMO-style games. I started brainstorming about the right way to allow a large number of players to be part of a unique, consistent world, and from a huge amount of postings, articles and opinions on this topic I extracted the following statements:

1. Hardware is cheap
2. Personnel is expensive

Having taken these statements to heart, I decided to build my toolchain on .NET. The performance loss is compensated by cheap hardware, code maintainability is good, and refactoring can be done within minutes, provided you have a clean design from the start. Furthermore, I believe Microsoft will abstract away much of the complexity of multicore programming in the .NET Framework, which will let programmers focus on other important parts of the code instead of debugging failures across multiple cores.

This being said, I'd be pleased if you would review my concept and let me know your opinions and suggestions. The architecture consists of the following servers/services:

1. Service Broker
The service broker periodically receives UDP beacons from the logon and zone servers. This allows it to decide how to distribute client connections among the servers.

2. Authentication/Logon
Processes incoming logon requests, retrieves details from the billing gateway, drops the connection on authentication failure, or initiates a session which is transmitted to the service broker. The session is mandatory for later load balancing and billing enforcement (such as a "playtime elapsed" event).

3. Billing Gateway
Generic software which retrieves billing details on request from the authentication/logon server. It will return an error if there is no more playtime available, and otherwise return the timestamp used by the "playtime elapsed" event.

Now here is the tricky but highly interesting part: the zone servers and their synchronization process.

4. Zone
Zone servers receive incoming client requests, get the proper connection parameters from the service broker and serve the clients until one of the following events is triggered:

4.1 Player level reaching limit
This event is triggered once the player count exceeds n% of the player limit. Now things are getting interesting.
Since we are exceeding our precalculated resource limit, we need to split the load across multiple zone servers via synchronization. The zone server reaching critical levels sends a request to the service broker, which returns a node that has spare resources. On the physical level the whole operation does not consume a lot of time, depending on the load of the server we are trying to synchronize to; with proper limits configured, the response time will not exceed the timespan between the last and the next update cycle. Once the two zone servers have acknowledged the transfer of the client, the service broker is notified of the successful transfer. The service broker then sends the node address and port to the client, which immediately disconnects from the initial zone server and connects to the given node. From then on, the two zone servers keep synchronizing the world state through a dedicated connection at a fixed rate of half the update cycle, in an asynchronous manner so as not to clog up the network.

There are many minor details I may not be aware of; however, I think my concept is ready for production once I have your opinions and suggestions. It would be nice of you to point out possible bottlenecks or simply blatant logical errors. I'll take all suggestions and advice into account :)

Thanks for your attention,
Yours, Raven Noir
First of all, I'm not an expert. I have just gathered information by asking and reading.
To start, hardware might be cheap, but MMOG servers tend to require good hardware, and there is a limit to the hardware you can stack in a server. So I would recommend using C++. If the design is good (and you seem to be a very methodical and organized programmer) it will be maintainable. Second, it will be portable, supposing you use some portable network layer like sdl_net, enet or RakNet. Portability is a plus. Currently I can run my very simple server under XP or Linux; it works the same.
The rest of your ideas sound good, perhaps a bit complex for my taste, but I agree that you have a pretty nice and scalable design there that I would like for my own project.
I'm not sure about Mono's long-term support, which is something you should consider since many servers favor Linux/Unix variants. I'm also not sure about the networking implementations under those platforms. I imagine they are pretty solid, but only testing will show how they behave under real stress.

Quote:The zone server reaching critical levels sends a request to the service broker, which returns a node that has spare resources.
On the physical level the whole operation does not consume a lot of time, depending on the load of the server we are trying to synchronize to; with proper limits configured, the response time will not exceed the timespan between the last and the next update cycle.


What happens if there are no spare resources?
How do you handle a node that lags and cannot keep up?
How do you restore a failing, misbehaving, malfunctioning node?
What happens if a node fails?

How is the world simulation split across these nodes? Are all the objects in simulation remote? Or is the world partitioned? If the former, how will you keep the traffic down to a reasonable level when there are lots of updates? If the latter, how will you synchronize the objects that are on zone boundaries?

When distributing objects across different nodes, how do you intend to solve data consistency problems and potentially rollbacks?

How will the game handle gameplay issues where players bunch up in the same area? That will overload a zone if they are spatially partitioned. Do you refuse player entry into the area?

Will the client be written in C# as well? If not, how will you keep two different codebases synchronized with respect to object models and network traffic?

How will storage be realized? How will you ensure data integrity? How will you future-proof the databases, or what kind of mechanisms will you provide for supporting schema migrations?

What kind of spatial partitioning, client bandwidth throttling and area-of-interest management do you intend to employ? What kind of mechanism will you use for world state synchronization? The usual baseline + delta approach, something more rigid (FPS style), something more lax (SecondLife geometry data)? Which type of proximity search? Quad trees, R-trees, some adaptive geometry trees? Or none at all, and the entire zone data is sent?

What kind of object model will you employ for server-side objects? How many objects do you expect? How will this model be updated and maintained, and how much impact will it have on code? Will it be fully distributed (as in CORBA, Ice), or more hard-coded (one process/service, one process per function or even per zone)?

How will you stress test the server? Up to 500 users, a single machine will do, but above that number the load becomes unviable. How do you test various failure aspects, or better yet, how do you prove the viability of your server without the usual months of beta testing where the annoying bugs get worked out?

I don't expect you to answer most of this, but these are all just the high-level design choices I've encountered.
Quote:Original post by Antheus
I'm not sure about Mono's long-term support, which is something you should consider since many servers favor Linux/Unix variants. I'm also not sure about the networking implementations under those platforms. I imagine they are pretty solid, but only testing will show how they behave under real stress.


Because of the questionable licensing terms of Windows Vista, I have been doing some research on Mono runtime performance, since I'd love to run my software on Unix-like operating systems too.
The Mono runtime for Windows is approx. 2.5 times slower than the .NET runtime built by Microsoft. Moreover, I cannot rely on Novell to keep paying the developers of Mono. I once ran into a platform-dependent bug: Mono clogging up the CPU in a virtual machine and not executing my binary on FreeBSD. I offered shell access to the developers, but they refused my offer, telling me that Novell is paying them to make Mono work on Linux and there was very little interest in making it work on non-Linux platforms.

Having been told so, I dropped the idea of using Mono, since it is considerably slower than .NET on Windows and it targets Linux only. Yes, you heard it: I do not like Linux much because of its catastrophic code quality and the widespread lie of it "being more secure than Windows".

To be honest, with a brain and a little technical knowledge anyone can configure any OS to be highly resistant to attacks :)

Quote:What happens if there are no spare resources? What happens if a node fails?

Hopefully this will never happen, since I wrote a syslog server that stores messages to a database, which can be queried periodically to identify possible error sources and/or bottlenecks.

Quote:How do you handle a node that lags and cannot keep up? How do you restore a failing, misbehaving, malfunctioning node?

I am still unsure about the right action to take if this happens.
In case the software crashes, triggering some exception, data can be synchronized to the database. In case of a hardware failure, all clients would be forced to log in again anyway.

The best would be to (a rough sketch follows this list):

1. make it auto-detect that things are going wrong
2. make it query the service broker for a node address that can handle the number of connected clients
3. notify that node
4. receive the ACK for the transfer/synchronization
5. transfer each client
6. notify the service broker about the successful transfer of all clients
7. notify the service broker that it cannot handle clients anymore, requiring some attention from tech staff
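
To make that sequence a bit more concrete, here is a very rough sketch of the evacuation logic in C#. Every interface and method name below is made up purely for illustration; nothing here exists yet.

```csharp
using System.Collections.Generic;

// Placeholder interfaces; the real services would implement these.
interface IZoneNode
{
    int PlayerCount { get; }
    IEnumerable<long> ClientSessions { get; }
    bool RequestTransfer(IZoneNode target);                    // steps 3 + 4: notify node, wait for ACK
    void TransferClient(long session, IZoneNode target);       // step 5
}

interface IServiceBroker
{
    IZoneNode FindSpareNode(int requiredCapacity);             // step 2
    void ReportTransferComplete(IZoneNode from, IZoneNode to); // step 6
    void MarkOutOfService(IZoneNode node);                     // step 7: tech staff takes over
}

static class Failover
{
    // Step 1 (detecting that things are going wrong) happens elsewhere;
    // this runs once trouble has been detected on 'failing'.
    public static void Evacuate(IZoneNode failing, IServiceBroker broker)
    {
        IZoneNode target = broker.FindSpareNode(failing.PlayerCount);
        if (target == null || !failing.RequestTransfer(target))
            return; // no spare capacity or no ACK: leave clients where they are, alert staff

        foreach (long session in failing.ClientSessions)
            failing.TransferClient(session, target);

        broker.ReportTransferComplete(failing, target);
        broker.MarkOutOfService(failing);
    }
}
```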

Quote:How is the world simulation split across these nodes?

The world is split into zones of equal size.

Quote:Are all the objects in simulation remote?

Sorry, I do not fully understand this question. What does "remote" mean in this context?

Quote:Or is the world partitioned? If the former, how will you keep the traffic down to a reasonable level when there are lots of updates?

I thought it would be sane to build each server with three different NICs, each handling traffic exclusively for:

1. Incoming connections from clients
2. Synchronizations between zones, health monitoring and logging
3. Database queries and updates

This will, by all means, increase administrative overhead and triple the wiring; however, it will make life easier by splitting up traffic physically and logically, making it easier for system operators to identify the source of a possible networking problem. Each kind of traffic benefits from exclusive use of the interface it passes over. Binding to the "any" address would leave the kernel to do the dirty job of routing traffic between the interfaces.
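
If the sockets are instead bound explicitly to each NIC's address, the split is enforced at the socket level. A minimal sketch of what I mean; the addresses and ports are pure placeholders:

```csharp
using System.Net;
using System.Net.Sockets;

class NicBindingSketch
{
    static void Main()
    {
        // Placeholder addresses for the three physical interfaces.
        IPAddress clientNic = IPAddress.Parse("10.0.1.10"); // 1. client connections
        IPAddress syncNic   = IPAddress.Parse("10.0.2.10"); // 2. zone sync, health, logging
        IPAddress dbNic     = IPAddress.Parse("10.0.3.10"); // 3. database traffic

        // Bind explicitly instead of using IPAddress.Any, so each kind of
        // traffic stays on its own interface.
        TcpListener clientListener = new TcpListener(clientNic, 5000);
        clientListener.Start();

        UdpClient syncSocket = new UdpClient(new IPEndPoint(syncNic, 5001));

        // For outgoing connections (e.g. to the database), bind before connecting
        // so they leave over the intended NIC.
        Socket dbSocket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        dbSocket.Bind(new IPEndPoint(dbNic, 0));
    }
}
```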

Quote:If the latter, how will you synchronize the objects that are on zone boundaries? When distributing objects across different nodes, how do you intend to solve data consistency problems and potentially rollbacks?

This one gave me headaches, seriously :)
Once a client approaches the border of a zone, the server currently handling that client notifies the server(s) adjacent to that border that the client might be crossing soon. Once the other server(s) receive that notification, they create an object for that client, synchronizing the coordinates, status and other parameters every half server/client update cycle.
Once the client has crossed the border to the other server, the original server marks its client object "remote" while the server now handling the client marks it "local". The server that handled the client initially destroys the client object once the coordinates exceed its "view" range.
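
Roughly what I have in mind, as a sketch; the enum and the PlayerObject class are hypothetical and only illustrate the local/remote marking:

```csharp
enum Ownership { Local, Remote }

// Hypothetical representation of a player on a zone server.
class PlayerObject
{
    public long Id;
    public Ownership Ownership;
    public float X, Y, Z;
}

class ZoneServer
{
    // Called when a neighbouring zone tells us one of its players
    // is approaching the shared border.
    public PlayerObject OnNeighbourWarning(long playerId, float x, float y, float z)
    {
        // Create a ghost copy; it is refreshed every half update cycle
        // from the neighbour's synchronization stream.
        return new PlayerObject { Id = playerId, Ownership = Ownership.Remote, X = x, Y = y, Z = z };
    }

    // Called once the player actually crosses the border into this zone.
    public void OnHandover(PlayerObject ghost)
    {
        ghost.Ownership = Ownership.Local; // we now simulate it authoritatively
        // ...the previous zone flips its copy to Remote and destroys it
        // once the player leaves that zone's view range.
    }
}
```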

Quote:How will the game handle gameplay issues where players bunch up in the same area? That will overload a zone if they are spatially partitioned. Do you refuse player entry into the area?

Well, the best solution would be to design key locations far away from borders in order to avoid such a situation. But I have seen this scenario myself; it cannot always be avoided.

I earlier mentioned the service broker which will receive UDP health beacons.

Once the non-critical player limit of a zone has been reached, the service broker is informed and delegates new clients to another server, which synchronizes clients using the remote/local distinction scheme until they leave the border area.
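
For the beacon side of the broker, here is a minimal sketch of how the UDP health beacons could be received and used to pick the least-loaded node. The port number and the beacon layout are pure assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;

// Last known status of a zone/logon node, as reported by its beacon.
class NodeStatus
{
    public IPEndPoint Endpoint;
    public int PlayerCount;
    public DateTime LastSeen;
}

class ServiceBroker
{
    readonly Dictionary<string, NodeStatus> nodes = new Dictionary<string, NodeStatus>();
    readonly UdpClient listener = new UdpClient(4000); // beacon port is an assumption

    // Drain all beacons that have arrived since the last call.
    public void PumpBeacons()
    {
        IPEndPoint sender = new IPEndPoint(IPAddress.Any, 0);
        while (listener.Available > 0)
        {
            byte[] data = listener.Receive(ref sender);
            // Assumed beacon layout: [0..3] player count, [4..5] game port.
            int playerCount = BitConverter.ToInt32(data, 0);
            int gamePort = BitConverter.ToUInt16(data, 4);
            string key = sender.Address + ":" + gamePort;
            nodes[key] = new NodeStatus
            {
                Endpoint = new IPEndPoint(sender.Address, gamePort),
                PlayerCount = playerCount,
                LastSeen = DateTime.UtcNow
            };
        }
    }

    // Returns the least-loaded node that sent a beacon recently, or null.
    public NodeStatus PickSpareNode()
    {
        NodeStatus best = null;
        foreach (NodeStatus n in nodes.Values)
        {
            if ((DateTime.UtcNow - n.LastSeen).TotalSeconds > 10) continue; // stale node
            if (best == null || n.PlayerCount < best.PlayerCount) best = n;
        }
        return best;
    }
}
```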

Quote:Will the client be written in C# as well? If not, how will you keep two different codebases synchronized with respect to object models and network traffic?


Writing the client in C# would, of course, let it benefit the most from the already existing codebase. Let's say the first four bytes of a packet contain a header, which can be used for a local lookup to figure out what format the rest of the packet has.

Using this technique, packets could be modeled inside a UML tool that can generate output for multiple languages.
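
A minimal sketch of the four-byte header idea; the packet IDs and handler wiring are made up for illustration:

```csharp
using System;
using System.Collections.Generic;

delegate void PacketHandler(byte[] payload);

class PacketDispatcher
{
    // Maps the 4-byte header to a handler that knows the payload layout.
    readonly Dictionary<uint, PacketHandler> handlers = new Dictionary<uint, PacketHandler>();

    public void Register(uint packetId, PacketHandler handler)
    {
        handlers[packetId] = handler;
    }

    public void Dispatch(byte[] packet)
    {
        if (packet.Length < 4) return;              // malformed, drop it
        uint id = BitConverter.ToUInt32(packet, 0); // first four bytes = header
        PacketHandler handler;
        if (handlers.TryGetValue(id, out handler))
        {
            byte[] payload = new byte[packet.Length - 4];
            Buffer.BlockCopy(packet, 4, payload, 0, payload.Length);
            handler(payload);
        }
    }
}
```

On the server a handler would be registered per packet type (e.g. dispatcher.Register(0x0001, HandleMove)), and the client could be generated from the same packet model.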

Quote:How will storage be realized? How will you ensure data integrity? How will you future-proof the databases, or what kind of mechanisms will you provide for supporting schema migrations?


I planned to use NHibernate to relieve the developer (i.e. me) from writing SQL, while also allowing the developer to switch to any RDBMS supported by the .NET Framework and NHibernate.

As far as I know, both can handle Oracle, MS SQL Server and MySQL.

In case the database needs to be migrated, there are plenty of tools that can do that in a fast and reliable way.
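
Just to show what the NHibernate side could look like, here is a rough sketch. The Character entity and its mapping file are hypothetical; the configuration is assumed to live in the usual hibernate.cfg.xml:

```csharp
using NHibernate;
using NHibernate.Cfg;

// Hypothetical persistent entity; the actual mapping would live in Character.hbm.xml.
// Members are virtual so NHibernate can proxy them for lazy loading.
public class Character
{
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }
    public virtual int Level { get; set; }
}

public class Persistence
{
    // Reads hibernate.cfg.xml (connection string, dialect, mappings).
    static readonly ISessionFactory Factory =
        new Configuration().Configure().BuildSessionFactory();

    public static void SaveCharacter(Character c)
    {
        using (ISession session = Factory.OpenSession())
        using (ITransaction tx = session.BeginTransaction())
        {
            session.SaveOrUpdate(c);
            tx.Commit();
        }
    }
}
```

Switching the RDBMS then only means changing the dialect and connection string in the configuration, not the code.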

Quote:What kind of spatial partitioning, client bandwidth throttling and area-of-interest management do you intend to employ? What kind of mechanism will you use for world state synchronization? The usual baseline + delta approach, something more rigid (FPS style), something more lax (SecondLife geometry data)? Which type of proximity search? Quad trees, R-trees, some adaptive geometry trees? Or none at all, and the entire zone data is sent?

You got me there, which is damn good because I have to figure out how to solve this problem :)

I thought the easiest and most reliable way to keep the client in sync with the world is to send one packet per object once the client logs in, and then alter the game state whenever some player or server-side actor triggers an update action.
In the worst case, a large-scale combat scenario, I'd expect a total of approx. 300 to 600 players.

This, of course, is a rather large amount of data; however, with a packets-per-second limit I'd push them into a queue and send out the info for each object within the client's range of action. If some packets/objects are still queued when the next update cycle is reached, I'd remove them and push new ones into the queue.

Most games I have been playing over the last few years had a server-side upstream bandwidth limit of approx. 10 KB/s per client (I have been graphing traffic with SNMP a lot).

The "initial object" packets I had been constructing averaged approx. 500 bytes, allowing me to dump approx. 20 objects in the first second. However, many games I have been playing never checked whether the client renderer was ready when they connected. Since I do not want to make the same mistake, I'd set an "invulnerable and petrified" flag on the client until every object in its range of action has been sent and the renderer is ready to draw. This gives me plenty of room to send the packets to the player while his computer is busy loading resources.

Let's assume the renderer is busy for approx. 10 seconds loading assets. I'd have time to send 200 initial objects to the client. Since I do not expect to have 200 players in the range of action, I have spare bandwidth I could use to apply object states.
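
To illustrate the per-client queue with a bandwidth cap, here is a rough sketch. The 10 KB/s budget and the packet abstraction are assumptions, not an existing implementation:

```csharp
using System;
using System.Collections.Generic;

class OutgoingPacket
{
    public byte[] Data;
    public int Priority; // lower value = more urgent
}

class ClientSendQueue
{
    const int BytesPerSecondBudget = 10 * 1024; // ~10 KB/s upstream per client
    readonly List<OutgoingPacket> queue = new List<OutgoingPacket>();

    public void Enqueue(OutgoingPacket p)
    {
        queue.Add(p);
        queue.Sort((a, b) => a.Priority.CompareTo(b.Priority));
    }

    // Called once per update cycle; returns the packets that fit into this
    // cycle's share of the bandwidth budget. Anything stale can be dropped
    // and replaced by fresher state in the next cycle.
    public List<OutgoingPacket> DrainForCycle(double cycleSeconds)
    {
        int budget = (int)(BytesPerSecondBudget * cycleSeconds);
        List<OutgoingPacket> toSend = new List<OutgoingPacket>();
        while (queue.Count > 0 && budget >= queue[0].Data.Length)
        {
            budget -= queue[0].Data.Length;
            toSend.Add(queue[0]);
            queue.RemoveAt(0);
        }
        return toSend;
    }
}
```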

Oh, I forgot to mention that all of my objects are bound by the ISendPriority interface contract :)

I'd also implement an interface contracting each object to expose its world coordinates as a Vector3, plus a base movement speed, movement direction and acceleration. Based on that data the server could do some interpolation and prediction, using quad trees for the spatial lookups.
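
Roughly what I mean by those contracts; this is only a sketch and the member names are illustrative:

```csharp
// Simple value type for world coordinates.
public struct Vector3
{
    public double X, Y, Z;
}

// Objects tell the networking layer how urgent their updates are.
public interface ISendPriority
{
    int SendPriority { get; }
}

// Contract for anything the server has to interpolate/predict.
public interface IMovable : ISendPriority
{
    Vector3 Position { get; }
    Vector3 Direction { get; }   // normalized movement direction
    double BaseSpeed { get; }
    double Acceleration { get; }
}
```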

Quote:What kind of object model will you employ for server-side objects? How many objects do you expect? How will this model be updated and maintained, and how much impact will it have on code? Will it be fully distributed (as in CORBA, Ice), or more hard-coded (one process/service, one process per function or even per zone)?

I partially answered one of your questions earlier; please forgive me for that :)

However, I am used to designing software with the worst case in mind, because I have already had the unpleasant honor of seeing companies go down the drain because they underestimated the numbers.

Because of this I'd expect 200,000 to 500,000 objects (items) per zone on the floor, plus the 200 to 500 players, plus approx. 5,000 NPCs.

Each object/item has at least 4 fields (a unique id and a Vector3), each 8 bytes long, for a grand total of 32 bytes per object; multiplied by the worst-case count of items and actors this gives roughly 15.4 MB of RAM usage.

Once an object is picked up, it is removed from the world model and assigned to the player who picked it up, dropping the Vector3. Since I have no idea what the economy will look like, I assume for the worst case that items never get picked up.
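
For the record, the back-of-the-envelope math behind that figure, assuming one 8-byte id plus three 8-byte coordinates per entry:

```csharp
using System;
using System.Runtime.InteropServices;

// 8-byte id + 3 x 8-byte coordinates = 32 bytes per world entry.
[StructLayout(LayoutKind.Sequential, Pack = 8)]
struct WorldEntry
{
    public long Id;
    public double X, Y, Z;
}

class FootprintEstimate
{
    static void Main()
    {
        long entries = 500000 + 500 + 5000; // worst-case items + players + NPCs
        long bytes = entries * Marshal.SizeOf(typeof(WorldEntry)); // 32 bytes each
        Console.WriteLine("{0:F1} MB", bytes / (1024.0 * 1024.0)); // prints ~15.4 MB
    }
}
```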

Quote:How will you stress test the server? Up to 500 users, a single machine will do, but above that number the load becomes unviable. How do you test various failure aspects, or better yet, how do you prove the viability of your server without the usual months of beta testing where the annoying bugs get worked out?


I guess I will have to write my own bot to stress-test the server.

Quote:I don't expect you to answer most of this, but these are all just the high-level design choices I've encountered.

I did try to answer your questions as well as I can. Thank you for pointing out all of these issues; I guess I have to refine my concept a little more.

Thank you for your helpful post; do not hesitate to let me know if I am committing design errors or other stupid things :)


Thanks for your time,
Yours, Raven Noir
Quote:The "initial object" packets I had been constructing averaged approx. 500 bytes, allowing me to dump approx. 20 objects in the first second. However, many games I have been playing never checked whether the client renderer was ready when they connected. Since I do not want to make the same mistake, I'd set an "invulnerable and petrified" flag on the client until every object in its range of action has been sent and the renderer is ready to draw. This gives me plenty of room to send the packets to the player while his computer is busy loading resources.


What if the client can never catch up?

Dial-up with 3 KB/s downstream and 5% packet loss, while the world state updates come in at 2.5 KB/s.

Quote:Sorry, I do not fully understand this question. What does "remote" mean in this context?


All objects are physically located in object servers. Simulation nodes only contain shadow copies of these objects, and they send changes back to object servers.

You get a one- or two-step delay in change propagation, but you have a central location from which you can access any in-world object. That makes it much easier to handle zone transfers, painless to handle boundary conditions, and it can reduce database load.
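
A minimal sketch of that split, with made-up names; in a real system the copies and change notifications would of course travel over the network rather than through direct calls:

```csharp
using System.Collections.Generic;

class WorldObject
{
    public long Id;
    public double X, Y, Z;
}

// Authoritative copies live here; any node can look an object up by id.
class ObjectServer
{
    readonly Dictionary<long, WorldObject> objects = new Dictionary<long, WorldObject>();

    public WorldObject GetCopy(long id)
    {
        WorldObject o = objects[id];
        return new WorldObject { Id = o.Id, X = o.X, Y = o.Y, Z = o.Z };
    }

    // Simulation nodes push their local changes back here.
    public void ApplyChange(long id, double x, double y, double z)
    {
        WorldObject o = objects[id];
        o.X = x; o.Y = y; o.Z = z;
    }
}

// A simulation node keeps only shadow copies and forwards changes,
// so zone transfers and boundary cases have one central authority.
class SimulationNode
{
    readonly ObjectServer master;
    readonly Dictionary<long, WorldObject> shadows = new Dictionary<long, WorldObject>();

    public SimulationNode(ObjectServer master) { this.master = master; }

    public void MoveObject(long id, double x, double y, double z)
    {
        WorldObject shadow;
        if (!shadows.TryGetValue(id, out shadow))
        {
            shadow = master.GetCopy(id); // first touch: pull a shadow copy
            shadows[id] = shadow;
        }
        shadow.X = x; shadow.Y = y; shadow.Z = z;
        master.ApplyChange(id, x, y, z); // in reality this would be batched/asynchronous
    }
}
```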

Quote:In case the database needs to be migrated, there are plenty of tools that can do that in a fast and reliable way.


I was referring to developers changing the schema due to changing requirements and feature additions.

In that case you'll need a way to synchronize the clients, client data and server data, and ensure that applying the change to the database can be performed efficiently. Assume a 10-20 GB database is affected.

Quote:Each object/item has at least 4 fields (a unique id and a Vector3), each 8 bytes long, for a grand total of 32 bytes per object; multiplied by the worst-case count of items and actors this gives roughly 15.4 MB of RAM usage.


OK, that covers the index.

What about the object data? Granted, the objects in your case could be trivial, and their appearance and behaviour uniquely identified by a 32-64 bit ID.

But then there are player inventories (50 x 8), player achievements (100 x 8), the quest log (?), skills (1024 / 8), and so on.

For objects you'll need additional data as well. Permissions, ownership, expiry timer, customization and random attributes.

All of this adds up quickly.

Also, how do you intend to define the client/server interfaces for looking at objects? Almost without exception these games share only a subset of the data with the clients, and between nodes, to keep traffic down and increase security by applying a need-to-know data-sharing approach.

Quote:The world is split into zones of equal size.


But what happens if one zone is overloaded? How do you split the load to another zone - a player doesn't want to move to a non-busy area. They want to play in zone A, but zone A is capped.

Do you create a new instance of the zone and move superfluous players there? Or do you distribute the zone between two nodes - and how do you ensure simulation consistency then?

Also, assume a player bandwidth of 3 KB/s, an average latency of 250 ms (20-1000 ms range), and a packet loss of 5% (from a 0-30% range). You'll need this to determine the game update loops and possibly handle rollbacks when clients arrive late.

The average/peak player actions are also important: how many per second, how many messages the simulation will need to handle in total every second, and how many actions will need to be performed in response. While this number won't be a deal breaker, it may require you to split some of them across nodes, especially if you have potentially expensive operations like DB lookups in there. A few players opening their inventory, each prompting a DB lookup, can easily block your main game loop for too long.
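
For example, even something as simple as pushing the lookup onto the thread pool instead of doing it inline keeps the game loop from stalling. A sketch with made-up method names, not a prescription:

```csharp
using System;
using System.Threading;

class InventoryService
{
    // Called from the game loop; must not block on the database.
    public void RequestInventory(long playerId, Action<string[]> deliver)
    {
        ThreadPool.QueueUserWorkItem(state =>
        {
            string[] items = LoadInventoryFromDb(playerId); // slow, runs off the game loop
            // Hand the result back; the game loop picks it up on its next tick
            // (in a real server this would go through a thread-safe queue).
            deliver(items);
        });
    }

    string[] LoadInventoryFromDb(long playerId)
    {
        Thread.Sleep(50); // stand-in for a real DB query
        return new string[0];
    }
}
```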

Quote:I did try to answer your questions as well as I can. Thank you for pointing out all of these issues; I guess I have to refine my concept a little more.


They aren't really questions and don't even require an answer. Just food for thought.

