MMORPG and the ol' UDP vs TCP

68 comments, last by hplus0603 18 years, 10 months ago
Quote:Original post by Martin Piper
EQ Only uses UDP for game communication.

Ah ok, I don't know a lot about EQ, so I just took the info from the emulator site; thanks for the correction. I confused this with DAoC's hybrid system, I guess.

Quote:
Everquest2, which uses UDP for game data, also has much better network performance than WoW.

With less than 1/4 of the players as well, which is a factor.

Quote:
I've also not noticed Asheron's Call problems that are related to them using a specific network protocol. The game might be rubbish, but that is not a network protocol related problem.

My point was exactly that it is not a network-protocol-related problem. I said that they use UDP but the game runs horribly, not because of the protocol but because of the server architecture. Who knows, the same could probably be said about WoW.
I'm still amazed that people even debate this. TCP alone is a poor solution for any kind of a realtime game, if only because even a single dropped packet causes a stall in all network data delivery until that data loss is noted and retransmitted.

Hybrid TCP/UDP systems are needlessly complicated and suffer from problems like bandwidth overconsumption by the TCP stream, maintenance of separate channels for UDP and TCP, misordered delivery of updates, etc.

I was going to go into greater detail, but then I realized I already have in the design fundamentals section of the Torque Network Library reference. The packet loss section gives a good explanation for why neither UDP nor TCP provide the right abstraction level for realtime game network programming.

- Mark
Network developers will agree that TCP and TCP+UDP are not optimal for real-time game/simulation applications. Custom UDP protocols can be more efficient. However, if the networked application sends too much data, any advantage provided by a custom UDP protocol is lost. In all cases, if the reliable queues fill up faster than they can drain, lag will be present (beyond network latency), up to the point the channel must be closed due to queue overflow.
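The queue-backlog lag described above can be modeled with a short sketch (the class name, rates, and limits are illustrative, not from any particular engine):

```python
class ReliableChannel:
    """Toy model: reliable data queues up; lag beyond network latency
    grows with the backlog, and the channel must close on overflow."""

    def __init__(self, drain_bytes_per_sec, max_queue_bytes):
        self.drain = drain_bytes_per_sec
        self.limit = max_queue_bytes
        self.queued = 0
        self.open = True

    def tick(self, produced_bytes, dt=1.0):
        if not self.open:
            return
        self.queued += produced_bytes
        self.queued = max(0, self.queued - self.drain * dt)
        if self.queued > self.limit:
            self.open = False  # queue overflow: must drop the connection

    def extra_lag(self):
        # Seconds of added delay before a newly queued byte gets sent.
        return self.queued / self.drain
```

Producing 4 KB/s into a channel that drains 2 KB/s adds roughly a second of lag per second of play until the queue limit forces a disconnect, which is the failure mode described above.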

A well designed network game can work fine using TCP or TCP+UDP. Beginning network programmers will have a much easier time using TCP+UDP than trying to create a custom UDP protocol from scratch. The argument that TCP+UDP is overly complex is invalid, as creating a custom UDP protocol is much more complicated. In the context of marketing material for an existing, well-tuned and debugged custom UDP protocol, such an argument can make sense (more in terms of efficiency than complexity: the developer must now become familiar with third-party code).

The argument about out-of-order (early) unreliable data arriving before reliable data (from the TNL site) is easily dealt with by tracking state. Example: a reliable activation message is sent to an object via the reliable channel and gets lost in transit. An unreliable position update packet arrives before the activation message. The object state is checked, and since the object is not active, the position update is ignored.
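The state-tracking fix in that example can be sketched in a few lines (the object and handler names are invented for illustration):

```python
class NetObject:
    """Tracks activation state so early unreliable updates
    for a not-yet-activated object are simply dropped."""

    def __init__(self):
        self.active = False
        self.position = None

    def on_activate(self):
        # Arrives via the reliable channel (possibly late/retransmitted).
        self.active = True

    def on_position_update(self, pos):
        # Arrives via the unreliable channel, possibly before activation.
        if not self.active:
            return False  # activation lost or late: ignore the update
        self.position = pos
        return True
```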

How valuable is out-of-order reliable support (OOORS)? In a game/simulation where objects move and stop for long periods of time, or when turning on/off non-state-affecting effects/props, bandwidth can be saved. However, in a game where objects are constantly moving (or stop for very short periods of time), OOORS provides little to no benefit, and if extra packet bits are required to allow for support of OOORS, it's a bandwidth loss.

Quote:Original post by markf_gg
I'm still amazed that people even debate this. TCP alone is a poor solution for any kind of a realtime game, if only because even a single dropped packet causes a stall in all network data delivery until that data loss is noted and retransmitted.


TCP alone is the only option for games operating in restricted environments (for example, when only HTTP/HTTPS is open at the firewall). As long as the TCP queue is effectively monitored (different methods for *nix and Win32), and decent client-side prediction is implemented, it is possible to work around stall issues.
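One portable way to monitor the TCP queue is to put the socket in non-blocking mode and keep an application-side backlog: whatever `send()` refuses stays buffered locally, so the backlog size becomes a direct congestion signal. A minimal sketch (error handling trimmed; the platform-specific queue queries mentioned above are an alternative):

```python
import socket

class MonitoredSender:
    """Non-blocking TCP sender that tracks its unsent backlog.
    When the backlog grows, the game layer can skip non-critical
    updates instead of stalling."""

    def __init__(self, sock):
        sock.setblocking(False)
        self.sock = sock
        self.backlog = b""

    def send(self, data):
        self.backlog += data
        try:
            n = self.sock.send(self.backlog)
        except BlockingIOError:
            n = 0  # kernel send buffer full: everything stays queued
        self.backlog = self.backlog[n:]

    def queued_bytes(self):
        return len(self.backlog)
```

Polling `queued_bytes()` each frame gives the "effectively monitored" queue the post describes; prediction then hides whatever latency the backlog implies.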

While TCP will never be ideal for a high-data-rate FPS, it can work fine for RTS games and slower-moving MMOGs. If the game is primarily running lock-step, where everything must be delivered in order, guaranteed, TCP alone will work fine. During high-congestion periods, TCP may, by design, slow down faster than a custom UDP protocol. However, this may be an advantage for an MMOG with thousands of players, where a poorly designed custom UDP protocol may fall apart (it keeps sending data at a high(er) rate, preventing the network from recovering). I suspect this is one reason why existing UDP-based MMOGs with thousands of players can fall apart. TCP is well designed to efficiently handle this case.

Quote:Original post by markf_gg
Hybrid TCP/UDP systems are needlessly complicated and suffer from problems like bandwidth overconsumption by the TCP stream, maintenance of separate channels for UDP and TCP, misordered delivery of updates, etc.


Bandwidth over-consumption is going to come from the unreliable channel, not the reliable channel. During periods of high congestion, the unreliable channel should be cut until the reliable channel(s) queue(s) can drain (data that is not truly state-critical should never be added to the reliable queue). All other arguments can be ameliorated at the network game design layer.
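The policy above (suppress unreliable traffic first, let the reliable queues drain) can be sketched as a simple gate; the threshold value and names are illustrative:

```python
def plan_outgoing(reliable_queue_bytes, unreliable_msgs,
                  drain_threshold=4096):
    """Under congestion (reliable backlog past the threshold),
    cut unreliable sends entirely so the reliable queue can drain.
    Otherwise, send the unreliable updates as usual."""
    if reliable_queue_bytes > drain_threshold:
        return []
    return unreliable_msgs
```

Since position updates and similar unreliable data are superseded by the next update anyway, dropping them during congestion costs little, while every byte freed helps the state-critical reliable queue recover.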

Quote:Original post by markf_gg
I was going to go into greater detail, but then I realized I already have in the design fundamentals section of the Torque Network Library reference. The packet loss section gives a good explanation for why neither UDP nor TCP provide the right abstraction level for realtime game network programming.
- Mark


The Torque Network Library looks like a good network toolkit (and to be fair, so do RakNet and ReplicaNet). While the arguments given do well to support licensing/purchasing a pre-made, well-tested custom UDP network toolkit, the biggest problem, by far, is network game design as opposed to the underlying network protocol.

I created a custom reliable UDP protocol in a case where the TCP implementation wasn't quite finished. The new UDP protocol ended up being more efficient than a TCP+UDP model (due to packet overhead savings and retransmit optimizations). Even so, the game would grind to a halt during high reliable state data sends. This required a significant redesign of networked game elements. Thus, while every bit of bandwidth helps, the burden of efficiency and game play quality resides in the game design, not the network protocol.

This was for the first full, Xbox Live enabled game, and it was finished early (network-enabled games tend to ship late due to underestimation of networking issues). While the game only supported 4 players, many more flying and moving objects were active, as well as many rapid reliable state changes (a nature of the game: too late to completely remedy by the time I joined the project). Voice was enabled for all players, all the time (as opposed to only hearing players near each other). The game played with little to no perceptible lag, even below 64kbps (voice took ~32kbps).

Again, I recommend that developers look into developing or licensing/purchasing custom reliable UDP protocols (RakNet, TNL, and ReplicaNet appear to be good choices). However, TCP and TCP+UDP can work fine: the real work in making a game play well under all internet conditions is centered around the network game design itself, not the network protocol. Likewise, if a game plays well/poorly on the internet, it can’t be attributed-to/blamed-on TCP, TCP+UDP, or a custom UDP protocol. It’s the network game design itself.

[Edited by - John Schultz on May 16, 2005 5:55:13 PM]
Quote:Original post by graveyard filla
Quote:Original post by Anonymous Poster
In UDP packets are received in the order that they arrive.


This isn't true. UDP packets can arrive out of order, and in fact duplicates and other nasty things can arrive as well.


Yes I should have explained that, sorry. I meant under optimal conditions they are received in the order that they are given, but are certainly not guaranteed to do so. (In fact I did cover this in the other portion of my post, perhaps you didn't read far enough.)

Quote:In UDP packets are received in the order that they arrive. Which is great for movement, or actions, because the last action is less important than the current action. You can mitigate the problems with out of order sequences on UDP very easily, simply by queuing the messages as they arrive, and then re-requesting those that didn't make it. Once your packets are ordered, then you can process them.


...mitigate the problems with out of order sequences on UDP very easily.....
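For what it's worth, the queue-and-re-request scheme quoted above can be sketched with per-packet sequence numbers. This is an illustrative example, not code from any particular library; it handles reordering and duplicates, and reports gaps so the sender can be asked to retransmit:

```python
class ReorderBuffer:
    """Delivers payloads in sequence order, drops duplicates,
    and exposes missing sequence numbers for re-request."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}  # seq -> payload, held until gaps fill

    def receive(self, seq, payload):
        delivered = []
        if seq < self.next_seq or seq in self.pending:
            return delivered  # duplicate packet: ignore it
        self.pending[seq] = payload
        # Drain everything that is now contiguous.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

    def missing(self):
        # Gaps between the next expected seq and the newest held packet.
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]
```

"Very easily" is doing some work in the quote: the buffer itself is simple, but deciding when to re-request, how long to hold packets, and what to do on overflow is where the real protocol design effort goes.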

[Edited by - bit64 on May 16, 2005 5:12:14 PM]
Don't be afraid to be yourself. Nobody else ever will be.
Quote:Original post by John Schultz
However, if the networked application sends too much data, any advantage provided by a custom UDP protocol is lost. In all cases, if the reliable queues fill up faster than they can drain, lag will be present (beyond network latency), up to the point the channel must be closed due to queue overflow.

This is one of the reasons why TCP+UDP is a poor combination - it gets people thinking in a message-oriented mindset where the only two primitives for data delivery are guaranteed and unguaranteed messages. What often ends up happening in a 3D simulation is that a large portion of messages get tagged as "reliable", overflowing the queue.

Quote:
How valuable is out-of-order reliable support (OOORS)? In a game/simulation where objects move and stop for long periods of time, or when turning on/off non-state-affecting effects/props, bandwidth can be saved.

In practice we've found out of order delivered reliable events to be of limited use. There is however, no per-packet overhead for supporting reliable OOO data in the TNL model, since TNL also supports strictly unreliable event sends.

Quote:
TCP alone is the only option for games operating in restricted environments (for example, when only HTTP/HTTPS is open at the firewall). As long as the TCP queue is effectively monitored (different methods for *nix and Win32), and decent client-side prediction is implemented, it is possible to work around stall issues.

It should be possible for a network system to support TCP connections for those clients that can't connect via UDP. I think I'll add that as an option to TNL :). It would still support all the higher level primitives like prioritization of object updates and fixed bandwidth consumption.

Quote:
However, this may be an advantage for a MMOG with thousands of players, where a poorly designed custom UDP protocol may fall apart (keeps sending data at a high(er) rate, preventing the network from recovering).

Well, I would never suggest using a poorly designed custom UDP protocol ;)

Quote:
Again, I recommend that developers look into developing or licensing/purchasing custom reliable UDP protocols (RakNet, TNL, and ReplicaNet appear to be good choices). However, TCP and TCP+UDP can work fine: the real work in making a game play well under all internet conditions is centered around the network game design itself, not the network protocol. Likewise, if a game plays well/poorly on the internet, it can’t be attributed-to/blamed-on TCP, TCP+UDP, or a custom UDP protocol. It’s the network game design itself.

The network toolkit you use often affects to a great degree the higher level design of your game networking. TNL, for example, isn't just a low level delivery protocol - it supports a rich set of data delivery policies that greatly simplify the higher level game networking design. TNL also uses a fixed per-client bandwidth setting, meaning that no matter how many objects are being updated, clients' network connections will never be flooded.
Quote:Original post by Saruman
By large I mean that with anything over 200 players you are going to start having some major issues, maybe even with a lower number of connected players.

The main bottleneck in the RakNet API is memory usage for tracking duplicate packets. In ReliabilityLayer.h you will see a giant array, and this is a problem space that needs to be solved, as for any large (>100) number of connected clients you are going to have issues. I know Kevin has worked on this but I do not know where he has gotten or what design he chose, and I am pretty sure he does not want to commit until the doxygen and OS X port are complete as it would set back other people's work.

There are other minor issues that really should be cleaned up, and IOCP support is something that you would definitely want back in if you are running on a Windows platform server.

Hope that helps.


wow, that's pretty surprising to me. And to think the architecture in my game that uses RakNet should be able to handle way more than 200 players.... not that I ever expected that many people to play, but it's always nice to be scalable.

FTA, my 2D futuristic action MMORPG
Quote:Original post by graveyard filla
wow, that's pretty surprising to me. And to think the architecture in my game that uses RakNet should be able to handle way more than 200 players.... not that I ever expected that many people to play, but it's always nice to be scalable.

Note that, as I said, Kevin will be fixing this, so it is not like this will be a persistent issue in the future. You could also fix the main problem yourself just by changing that big array to something more feasible.
Quote:Original post by markf_gg
Quote:Original post by John Schultz
However, if the networked application sends too much data, any advantage provided by a custom UDP protocol is lost. In all cases, if the reliable queues fill up faster than they can drain, lag will be present (beyond network latency), up to the point the channel must be closed due to queue overflow.

This is one of the reasons why TCP+UDP is a poor combination - it gets people thinking in a message-oriented mindset where the only two primitives for data delivery are guaranteed and unguaranteed messages. What often ends up happening in a 3D simulation is that a large portion of messages get tagged as "reliable", overflowing the queue.

Quote:
How valuable is out-of-order reliable support (OOORS)? In a game/simulation where objects move and stop for long periods of time, or when turning on/off non-state-affecting effects/props, bandwidth can be saved.

In practice we've found out of order delivered reliable events to be of limited use. There is however, no per-packet overhead for supporting reliable OOO data in the TNL model, since TNL also supports strictly unreliable event sends.


If OOORS is of limited use, what other class of data beyond guaranteed and non-guaranteed do you see of value? I agree that the biggest problem is too much data sent as guaranteed, which is a network game design issue. I have not yet seen a strong argument for supporting other classes of data. Either the data absolutely has to get there, or it doesn't. Perhaps you can give an example where this is not true?

Quote:Original post by markf_gg
Quote:
TCP alone is the only option for games operating in restricted environments (for example, when only HTTP/HTTPS is open at the firewall). As long as the TCP queue is effectively monitored (different methods for *nix and Win32), and decent client-side prediction is implemented, it is possible to work around stall issues.

It should be possible for a network system to support TCP connections for those clients that can't connect via UDP. I think I'll add that as an option to TNL :). It would still support all the higher level primitives like prioritization of object updates and fixed bandwidth consumption.


That's cool. When you get it working, perhaps post benchmarks showing any performance differences between the two (for a variety of bandwidth and network conditions, game types, etc.). I believe you'll need to use overlapped I/O and IOCP to determine the TCP send queue state on Win32 (queue flags/read-options exist for *nix).

Quote:Original post by markf_gg
Quote:
However, this may be an advantage for a MMOG with thousands of players, where a poorly designed custom UDP protocol may fall apart (keeps sending data at a high(er) rate, preventing the network from recovering).

Well, I would never suggest using a poorly designed custom UDP protocol ;)


The developer may not know that their implementation is poor until stressed under real-world internet conditions. WRT the previous link, I discovered that my first custom protocol was quite poor during network simulation and analysis (I thought it was decent before testing). It was during this analysis that I found that even the native TCP implementation (for this device) was broken. This is one advantage to using a pre-made toolkit, provided the toolkit authors can provide benchmarks/statistics showing that their design can survive worst-case internet conditions (thousands of players, etc., as with an MMOG). TCP has been researched/studied for around 20 years: its strengths and weaknesses are well known (internet bandwidth balancing/optimization is still a hard, not fully solved problem). RED and WRED are newer designs for router queues that help TCP behave more efficiently during high-congestion situations. This is another reason to start with the basic TCP design when designing a custom protocol.

Quote:Original post by markf_gg
Quote:
Again, I recommend that developers look into developing or licensing/purchasing custom reliable UDP protocols (RakNet, TNL, and ReplicaNet appear to be good choices). However, TCP and TCP+UDP can work fine: the real work in making a game play well under all internet conditions is centered around the network game design itself, not the network protocol. Likewise, if a game plays well/poorly on the internet, it can't be attributed-to/blamed-on TCP, TCP+UDP, or a custom UDP protocol. It’s the network game design itself.

The network toolkit you use often affects to a great degree the higher level design of your game networking. TNL, for example, isn't just a low level delivery protocol - it supports a rich set of data delivery policies that greatly simplify the higher level game networking design. TNL also uses a fixed per-client bandwidth setting, meaning that no matter how many objects are being updated, clients' network connections will never be flooded.


That's true. My point has been that arguments against TCP and TCP+UDP are without merit in the cases where developers are aware of the network game design issues and are (for whatever reason: time, policy, skill level, cost, target market) limited to a TCP or TCP+UDP solution. In cases of congestion, TNL must drop any clients that haven't allowed the server to drain its reliable queue(s): at some point, no matter how much queue memory is present, you've got to call it quits and drop the client. While you state that TNL supports a fixed bandwidth option, do you mean fixed maximum bandwidth? Does TNL also implement a filtered mean-deviation estimator to dynamically (and near optimally) adjust bandwidth for live internet conditions? The latter is far more important than the former (I would look at the former as a tuning tool, but in general would not want to artificially limit client bandwidth: there is no reason to do so if the custom protocol is optimally adapting for all the live channels).
Quote:Original post by John Schultz
If OOORS is of limited use, what other class of data beyond guaranteed and non-guaranteed do you see of value? I agree that the biggest problem is too much data sent as guaranteed, which is a network game design issue. I have not yet seen a strong argument for supporting other classes of data. Either the data absolutely has to get there, or it doesn't. Perhaps you can give an example where this is not true?

TNL and the Torque and Tribes engines before it introduced a data delivery policy called "most recent state guarantee" which is to say that for a given object, the current state of the object will, at some point, be reflected to clients interested in that object. This is at the heart of the ghosting facility of TNL and what sets it apart from many other networking packages (e.g., RakNet).

In this system, rather than having simulation objects "push" data events to clients, the objects simply mark themselves as having dirty states. When TNL decides it's time to send another packet to a particular client, it sorts dirty objects based on a user-supplied prioritization function and then writes object updates into the packet until the packet is full. Any dropped packets simply set the dirty state flags for that object for that client that were not subsequently updated in a later packet.

All remote object (ghost) creation messages are sent using this system as well, thus substantially limiting both the number of guaranteed and unguaranteed messages sent to clients. Because TNL tracks which states exist on which clients, there's no need to pulse unguaranteed messages (in case some position state was lost) or to send lots of guaranteed object creation/deletion messages.
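The dirty-state scheme described above can be sketched roughly as follows. This is a simplified illustration of the idea, not TNL's actual API; the data layout and priority function are invented:

```python
def build_packet(objects, priority_fn, budget_bytes):
    """Fill one packet with the highest-priority dirty objects.
    Objects that don't fit stay dirty and are retried next packet;
    a dropped packet would simply re-mark its objects as dirty."""
    packet, used = [], 0
    for obj in sorted((o for o in objects if o["dirty"]),
                      key=priority_fn, reverse=True):
        size = len(obj["state"])
        if used + size > budget_bytes:
            continue  # over budget: object stays dirty for next time
        packet.append((obj["id"], obj["state"]))
        used += size
        obj["dirty"] = False
    return packet
```

The key property is that nothing is ever queued per-message: only the latest state is held, so a slow client falls behind in freshness rather than accumulating an unbounded reliable backlog.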

The other data classification I've found useful is the "quickest delivery" data type -- player input, for example, where a dropped packet or two shouldn't require a round-trip back to the client for a re-send. This is mainly a presentation issue for other clients in the simulation.
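One common way to get this "quickest delivery" behavior without round-trip resends is input redundancy: each outgoing packet repeats the last few input frames, so a single dropped packet costs nothing. A sketch (the window size and names are illustrative, not from TNL):

```python
from collections import deque

class InputSender:
    """Each packet carries the last few input frames; one or two
    dropped packets never require a retransmit round-trip."""

    def __init__(self, redundancy=3):
        self.history = deque(maxlen=redundancy)
        self.frame = 0

    def make_packet(self, new_input):
        self.history.append((self.frame, new_input))
        self.frame += 1
        return list(self.history)  # oldest retained frame first

def apply_packet(packet, last_applied):
    """Receiver side: keep only frames newer than what was seen."""
    return [(f, i) for f, i in packet if f > last_applied]
```

If the packet carrying frame 1 is lost, the packet carrying frame 2 still contains frame 1, so the receiver recovers it with zero extra latency.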
Quote:
The developer may not know that their implementation is poor until stressed under real-world internet conditions.

Well, they could always use a network technology that's been proven successful in AAA networked games back to oh, say 1998...

Quote:
While you state that TNL supports a fixed bandwidth option, do you mean fixed maximum bandwidth? Does TNL also implement a filtered mean-deviation estimator to dynamically (and near optimally) adjust bandwidth for live internet conditions?

TNL does have an adaptive bandwidth option for connections, but it's fairly primitive at this point. In the products I've shipped with it (Starsiege: TRIBES and Tribes 2) we simply fixed the client bandwidth at 2kbytes/sec and 3kbytes/sec respectively. Due to the nature of the most-recent state data guarantee we always filled up each packet to the client. The resulting gameplay was of sufficient quality that we didn't bother attempting to adaptively adjust bandwidth settings on the fly, although we did allow clients to adjust the params slightly upwards if they had broadband connections.

I am currently looking at improving our adaptive rate code to more easily allow TNL's use in higher bandwidth, non-simulation applications. Can you recommend anything I should read on the subject? A near optimal filtered mean-deviation estimator sounds like it might be what I'm looking for :)
Quote:Original post by markf_gg
Quote:Original post by John Schultz
If OOORS is of limited use, what other class of data beyond guaranteed and non-guaranteed do you see of value? I agree that the biggest problem is too much data sent as guaranteed, which is a network game design issue. I have not yet seen a strong argument for supporting other classes of data. Either the data absolutely has to get there, or it doesn't. Perhaps you can give an example where this is not true?

In this system, rather than having simulation objects "push" data events to clients, the objects simply mark themselves as having dirty states. When TNL decides it's time to send another packet to a particular client, it sorts dirty objects based on a user-supplied prioritization function and then writes object updates into the packet until the packet is full. Any dropped packets simply set the dirty state flags for that object for that client that were not subsequently updated in a later packet.


This makes sense for objects that are rapidly changing state when the system is not running lock-step (or don't require ordered state consistency). To date, I have not run into a problem where this method can provide significant bandwidth savings, but I'll keep it in mind as a future option.

Quote:Original post by markf_gg
Quote:
The developer may not know that their implementation is poor until stressed under real-world internet conditions.

Well, they could always use a network technology that's been proven successful in AAA networked games back to oh, say 1998...


Given that this thread is titled MMORPG..., have you tested TNL with 2000-3000 player connections, under real-world internet conditions?

Quote:Original post by markf_gg
Quote:
While you state that TNL supports a fixed bandwidth option, do you mean fixed maximum bandwidth? Does TNL also implement a filtered mean-deviation estimator to dynamically (and near optimally) adjust bandwidth for live internet conditions?

TNL does have an adaptive bandwidth option for connections, but it's fairly primitive at this point. In the products I've shipped with it (Starsiege: TRIBES and Tribes 2) we simply fixed the client bandwith at 2kbytes/sec and 3kbytes/sec respectively. Due to the nature of the most-recent state data guarantee we always filled up each packet to the client. The resulting gameplay was of sufficient quality that we didn't bother attempting to adaptively adjust bandwidth settings on the fly, although we did allow clients to adjust the params slightly upwards if they had broadband connections.


Worst case scenario analysis for a MMORPG and 3000 very active players:

3kbytes/sec * 3000 players = 9000kbytes/sec, 72,000kbits/sec, 72Mbits/sec.

This means you'll probably have many fat pipes, as well as extra routing capabilities to deal with varying internet conditions. Given the unpredictability of network conditions, if the server does not actively adapt its bandwidth output, the system is going to fall apart (lots of data, lots of connections, lots of unpredictability). While this example isn't much of a proof, the bandwidth/complexity concepts come from studying the design of TCP, and why it is able to allow millions (billions) of connections to run relatively smoothly over a very complicated network (or web) of dataflows.

Quote:Original post by markf_gg
I am currently looking at improving our adaptive rate code to more easily allow TNL's use in higher bandwidth, non-simulation applications. Can you recommend anything I should read on the subject? A near optimal filtered mean-deviation estimator sounds like it might be what I'm looking for :)


Van Jacobson's paper, Congestion Avoidance and Control, written in 1988, is an excellent starting point. The history is also fascinating: it describes a time when the early internet could collapse. In the almost 20 years since the paper was written, there have not been significant improvements (for all cases). These algorithms, as well as their variants, make up the core features of, you guessed it, TCP. To go back further in time, see RFC 793, written for DARPA in 1981.

Thus, I hope it is clear why I have been defending TCP*: it really is an excellent protocol. Some of its features are not ideal for games/simulations, but the RTO calculation (see appendix A in Jacobson's paper) is required if a custom protocol is to be used in a large-scale, real-world internet environment (such as an MMORPG). It's probable that the UDP-based MMORPGs that fail/fall apart do so due to poor bandwidth management.
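The RTO calculation from appendix A of Jacobson's paper (later standardized in RFC 6298) maintains a smoothed RTT plus a mean-deviation estimate; a sketch with the classic gains:

```python
class RtoEstimator:
    """Jacobson/Karels retransmission timeout: exponentially
    filtered SRTT and mean deviation (RTTVAR), with the standard
    gains alpha = 1/8 and beta = 1/4; RTO = SRTT + 4 * RTTVAR."""

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def sample(self, rtt):
        if self.srtt is None:
            # First measurement initializes both filters.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Update RTTVAR with the old SRTT, then update SRTT.
            self.rttvar += 0.25 * (abs(self.srtt - rtt) - self.rttvar)
            self.srtt += 0.125 * (rtt - self.srtt)
        return self.rto()

    def rto(self):
        return self.srtt + 4 * self.rttvar
```

Note the order of updates: RTTVAR is computed against the old SRTT, per the paper. The shifts by 1/8 and 1/4 were chosen so the filters reduce to integer adds and shifts in a kernel implementation; floats are used here for clarity.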

More links here.

In summary, study the history and design of TCP, and use the best feature(s) for custom game/simulation protocols, while leaving out (or retuning) features that hurt game/simulation performance.




* I believe this is the first paper to describe TCP, by Vinton Cerf and Robert Kahn in 1974, BSW (Before Star Wars ;-)). TCP/IP allowed ARPANET to become the Internet and later the World-Wide Web. Robert Kahn talks about TCP and the birth of UDP.

[Edited by - John Schultz on May 18, 2005 4:19:23 AM]

This topic is closed to new replies.
