Expectation: More Go for Less Dough
The gaming user community is always changing. As new DSL, fiber, and cable modem solutions bring lower cost, higher bandwidth connectivity to the home user, expectations rise among casual and
die-hard on-line gamers alike: more, better, faster! Their personal measure of connection success is ping. The perception remains: "good ping" means "good play". The game hosting companies, game
publishers, and individuals who host network game servers are on the hook to deliver the goods. A rented 32-player Half-Life Counter-Strike server on a T1 connection might have
gone for $150 per month a couple of years back. Now one can be had for $25. So how is a game server hosting company supposed to make money?
One way hosting companies have managed to make this happen is with higher performance hardware. Newer P4-generation and recent dual-core machines, running server class flavors of Windows or Linux,
are able to host multiple game engines on the same system.
Another cost management development for gaming hosts is the "self-managed" game server. In the same way that tools like Webmin 1) have enabled web server
hosting companies to offer $5 per month web hosting to thousands by lowering their own administrative costs, a refined set of public domain and low-cost remote game administration tools has made
it relatively easy for game server hosting companies to pass routine maintenance of their rented servers on to even mildly technical customers, from uploading new game maps
and player files to stopping and starting different configurations of game server processes.
Server Squeeze: How to Get More
So the network and the server hardware are in place. There are as many customers as the infrastructure can bear. What else can be done to improve the performance, and possibly the capacity,
without adding more hardware? One approach that remains is to get the server software itself to run better.
This means working to "adjust" the server binaries to improve their performance. This assumes, of course, that you have access to the source code for the game server programs. If you are a game
developer or part of the mod community for your favorite game, you may already have access to the game source. If you are a big enough customer of the game (for example a large gamer café
owner), or a big sales enabler (by virtue of the large number of servers you host for Company X's latest release), you may be able to negotiate access to the source or convince the vendor to make
some performance improvements of their own on your behalf.
Assuming you have access to the source, you can take several steps down the optimizing path, including application of processor-agnostic general optimization and optimization targeted to your
server hardware's specific processor type, including 64-bit architectures.
A good first step toward beefing up your server program is to compile it with an effective optimizing compiler. One choice is the Intel C/C++ Compiler 2),
available in both Linux and Windows flavors. If the server is a Windows system, the software developer can use their pre-existing Microsoft DevStudio IDE to manage projects and compiles, with the
Intel C and C++ compilers underneath. If the server is a Linux system, the user can choose to use the Eclipse software development CDT environment or good, old fashioned command line editors and
How to start? For this exercise, a solid game engine example was selected: Richard Stanway's R1Q2 3). This is a tightened and enhanced version of the Quake
2 engine, which was released to the Open Source community by id Software back in 2001. Older code? Yes, but many game programmers cut their teeth on Q2 mod development. It's a known space and a good
reference point. Rich's R1Q2 was coupled with code from the LOX Q2 mod, an "extreme weapons" mod built by David S. Martin and friends, and enhanced by Geoff Joy and others. The LOX mod is a good
example of performance challenging code, as the massive number of events that can be created by a single player with the right weapon selections and feature combinations can bring an otherwise
healthy server to its knees.
Again for this example, the target server platform is Linux, the default choice among server hosting companies where game server engines have a Linux server offering. The test server used was a
vanilla Red Hat Enterprise Linux 3 (Taroon Update 4) server, running on a 3.7 GHz Pentium 4 with 1 Gig of RAM, spinning a standard Serial ATA hard drive. Note that all of the steps being discussed
here, including the optimization techniques and compiler features, are applicable to or available on Windows as well.
Get the code. Unwrapping the code and doing a straight gcc compile using the -O2 optimization switch with the provided makefiles generated usable binaries that performed as expected. A pair of
client machines running on an isolated net connected without issue and achieved pings varying from 15 to 35 ms. Since this code has had some level of grooming, compiler warnings were
Perform reference benchmarking. In this case, two client machines were connected to the server, running its standard version of binaries, from a local network connection. Their static pings were
recorded, as were their pings when the server was stressed. In this case, the stress test involved having the players from both client test machines launch 4 napalm grenades per second from a fixed
location on the server's default level, generating at least 128 in-game explosions per second. Client "freeze" behavior, typical in this server stress condition, was monitored, as was the frequency of
"RATEDROP" warnings, issued from the server when a significant drop in server-client data exchange rate is detected.
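The measurements described above reduce to a few summary numbers per client. The following is a minimal sketch of that bookkeeping, not the tooling used for this study; the function names, the sample readings, and the warning count are assumptions chosen for illustration.

```python
from statistics import mean


def average_ping(samples_ms):
    """Average the ping readings (in milliseconds) taken over one window."""
    return mean(samples_ms)


def warnings_per_second(warning_count, window_seconds):
    """Rate of RATEDROP warnings observed during a measurement window."""
    return warning_count / window_seconds


if __name__ == "__main__":
    # Hypothetical readings, not data from the test runs.
    static_samples = [17, 18, 19, 18]
    print("static ping (ms):", average_ping(static_samples))
    print("warnings / sec:", warnings_per_second(5, 20))
```

Keeping the window length explicit makes it easier to compare runs recorded over different durations.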
Get the Intel compiler. The Intel C/C++ compiler package is available for demo download, with academic, non-commercial, and commercial licenses. The software installs on nearly all major Linux
distributions, including those not supporting RPM.
Update the makefiles to enable optimization options. In this case, that meant changing "CC=gcc" to "CC=icc". The R1Q2 makefile required no dependency changes or LDFLAGS changes. The LOX makefile
required a minor change to the LDFLAGS setting to accommodate the new library home for a couple of key string functions.
For round one of our compiler optimization exercise, CFLAGS was changed to add the -O2 optimization switch. This is the most commonly recommended option, performing many
optimizations for speed without significant regard to the impact on code size, including but not limited to:
- Forward substitution
- Constant propagation
- Dead static function, code, and store elimination
- Tail recursions
- Partial redundancy elimination
One thing that became clear during the initial build with the Intel compiler was that the number of warnings increased, going from 4 to 62. Most of the warnings were variable type checking issues.
Some of them warranted further investigation. In this case, only minor code changes were required. The newly rebuilt binaries were tested and results gathered.
For the next round of optimization, the -O2 CFLAGS option was changed to -O3. This option, according to the documentation, contains "more aggressive
optimizations, such as prefetching, scalar replacement, and loop and memory access transformations". This includes all of the features of the -O2 optimization, plus loop unrolling, code replication
to eliminate branches, and padding of certain power-of-two arrays to improve cache use. Again, the newly built binaries were tested and results were gathered.
For round three, the binaries were built with an added switch: -axN. This switch enables processor-targeted optimization, in this case specifically for Intel Pentium 4 and
compatible chips. Once again the new binaries were tested.
The final round of compiler switch optimization called for changing the -axN switch to -axP. This option optimizes the output for Intel Pentium 4 processors with
Streaming SIMD Extensions 3 (SSE3) instruction support. Once more the resulting binaries were tested.
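The build rounds described above can be kept straight with a small table of compiler and flag combinations. The sketch below is one way to generate a make invocation for each round. It assumes the makefile honors CC and CFLAGS passed on the make command line, which is a substitute for editing the makefile as was done in this exercise. Overriding CFLAGS this way may drop other flags the makefile sets, so treat it as a starting point. The names BUILD_ROUNDS and make_command are hypothetical.

```python
# Compiler and CFLAGS for each build round described in the text.
BUILD_ROUNDS = [
    ("reference", "gcc", "-O2"),
    ("round 1", "icc", "-O2"),
    ("round 2", "icc", "-O3"),
    ("round 3", "icc", "-O3 -axN"),
    ("final", "icc", "-O3 -axP"),
]


def make_command(compiler, cflags):
    """Return a make invocation that overrides CC and CFLAGS.

    Assumption: the makefile reads CC and CFLAGS and needs no other
    flags that a command-line override would discard.
    """
    return ["make", f"CC={compiler}", f"CFLAGS={cflags}"]


if __name__ == "__main__":
    for name, compiler, cflags in BUILD_ROUNDS:
        print(f"{name}: {' '.join(make_command(compiler, cflags))}")
```

Returning the command as a list lets it be passed to subprocess.run without shell quoting concerns.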
The two client machines used for the test included:
- Machine 1 - a 2.9 GHz Pentium 4 with 512 Mbytes of RAM, running an R1Q2 Quake 2 client in OpenGL mode at 1024 x 768 resolution
- Machine 2 - a 1.3 GHz Celeron with 512 Mbytes of RAM, running a stock 3.20 ID client in software rendering mode at 1024 x 768 resolution
Here is a summary of the results:
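| Measurement | Client | Reference (gcc -O2) | icc -O2 | icc -O3 | icc -O3 -axN | icc -O3 -axP |
|---|---|---|---|---|---|---|
| Static ping, ms (20 sec avg) | Machine 1 | 18 | 18 | 16 | 15 | 13 |
| Static ping, ms (20 sec avg) | Machine 2 | 35 | 35 | 33 | 32 | 23 |
| Stress ping, ms (20 sec avg) | Machine 1 | 50 | 50 | 45 | 43 | 42 |
| Stress ping, ms (20 sec avg) | Machine 2 | 60 | 58 | 55 | 53 | 49 |
| Perceived lag freeze | Machine 1 | YES (~3 sec) | NO | NO | NO | NO |
| Perceived lag freeze | Machine 2 | YES (~6 sec) | YES (~2 sec) | NO | NO | NO |
| Stress test frame drop warnings / sec | Machine 1 | 0.125 | 0.125 | 0.1 | 0.1 | 0.08 |
| Stress test frame drop warnings / sec | Machine 2 | 0.25 | 0.25 | 0.17 | 0.17 | 0.12 |
| Post-stress test recovery to static ping rate | Machine 1 | 10 sec | 8 sec | 7 sec | 6 sec | 3 sec |
| Post-stress test recovery to static ping rate | Machine 2 | 14 sec | 11 sec | 10 sec | 9 sec | 6 sec |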
Compile Optimization Conclusions
The above results show that there is no significant ping difference between the gcc -O2 and icc -O2 behavior during relatively inactive periods, but perceived lag on the client side is reduced
somewhat. Similarly, frame drop warning rates and recovery times after stress events are mildly better with the icc compiler. Results are somewhat more significant when going to a -O3 optimization
level and even more dramatic when including the processor targeting options -axN and -axP.
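To put rough numbers on these observations, the change between the reference build and the final build can be expressed as a percentage. The sketch below uses the first and last pair of values from the stress ping and recovery rows of the results summary, reading each pair as Machine 1 followed by Machine 2. The helper name percent_reduction and the dictionary layout are illustrative assumptions.

```python
def percent_reduction(before, after):
    """Percentage by which a measurement dropped from before to after."""
    return (before - after) * 100.0 / before


# (gcc -O2 reference, icc -O3 -axP) pairs taken from the results summary.
RESULTS = {
    "Machine 1": {"stress_ping_ms": (50, 42), "recovery_sec": (10, 3)},
    "Machine 2": {"stress_ping_ms": (60, 49), "recovery_sec": (14, 6)},
}

if __name__ == "__main__":
    for machine, rows in RESULTS.items():
        for measurement, (before, after) in rows.items():
            change = percent_reduction(before, after)
            print(f"{machine} {measurement}: {change:.0f}% lower")
```

On these figures the recovery time improves far more than the stress ping does, which is consistent with the point that ping alone understates the gain.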
The above tests are not a perfect model for behavior in a dynamic environment, where players will be connecting from across the country or across the globe. But they do serve to demonstrate the
opportunity for improvement.
Clearly, ping is not the only measure of performance. While the improvements made to the test programs did improve ping somewhat, most of the impact was seen in the server's ability to maintain
smooth gameplay, or to restore smooth gameplay after periods of intense activity. And this is what it is all about.
Additional Steps to Improve the Binaries
The gains demonstrated above may be significant enough for some. If still more performance improvement is required, there are a number of additional steps that can be taken. While these steps are
beyond the scope of this article, they are worth mentioning as areas of future exploration, especially for developers of new game offerings.
One of these steps is to apply profilers to determine where the hotspots (bottlenecks) are in the game server program. Tools such as Intel's VTune Performance Analyzer product can be employed to
locate the sources of program slowness, identify key algorithms that can be improved, and point toward other opportunities to optimize program behavior.
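The idea of hotspot hunting can be illustrated on a small scale. VTune works on native binaries; the sketch below instead uses Python's standard cProfile and pstats modules as a stand-in, simply to show how a profiler points at the function where the time goes. The functions server_frame, cheap_update, and expensive_explosions are hypothetical and are not taken from the game source.

```python
import cProfile
import pstats


def cheap_update(n):
    """Light per-frame work."""
    return sum(range(n))


def expensive_explosions(n):
    """Deliberately heavy nested loop standing in for a hotspot."""
    total = 0
    for i in range(n):
        for j in range(200):
            total += (i * j) % 7
    return total


def server_frame():
    """One simulated server frame: a little light work, a lot of heavy work."""
    cheap_update(100)
    expensive_explosions(2000)


def find_hotspot(func, runs=3):
    """Profile func and return the name of the function with the most own time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(runs):
        func()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) to (cc, nc, tottime, cumtime, callers).
    key, _ = max(stats.stats.items(), key=lambda item: item[1][2])
    return key[2]


if __name__ == "__main__":
    print("hotspot:", find_hotspot(server_frame))
```

Ranking by own time rather than cumulative time keeps the caller, server_frame, from masking the function that does the actual work.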
Another approach that can work hand in hand with performance analysis is addition of threading techniques to the software. Individual hotspots in the program can be threaded, using available
threading libraries and new or modified code, to streamline program operations and to take advantage of the performance gains offered by new dual core processor technologies.
Other Ways to Improve Server Performance
There are, of course, fundamental things that a game server administrator can do to ensure that the game being hosted is optimally configured and makes best use of all the work that went into
coding and compiling it well. Several key server configuration parameters may be adjustable for a particular game, significantly impacting overall performance. While these vary from game to game,
they can include:
- Practical player limit (i.e. don't let the user adjust this number past their purchased limit or sell player count limit packages that exceed the game engine's ability to deliver)
- Hard ceiling to connected ping of players (i.e. players with ping greater than a specific limit are not allowed to connect or are disconnected during game play to protect playability for the rest)
- Limited bandwidth or disabled downloading of maps, models, audio, and other optional "level-specific" content.
- Limited bandwidth or disabled uploading of player-specific content, such as "skins" and "sprays", where such features are supported by the game.
- Cap on frames per second performance (may be expressed as max number of player updates per second)
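Two of the limits listed above, the practical player limit and the ping ceiling, can be sketched as simple checks. This is an illustration only: real games expose these settings through their own configuration variables, and the class name, field names, and values here are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ServerLimits:
    purchased_players: int    # player slots the customer paid for
    engine_max_players: int   # what the engine can realistically sustain
    max_ping_ms: int          # hard ceiling on connected ping

    def effective_player_limit(self, requested):
        """Clamp a requested player count to the purchased and engine limits."""
        return max(1, min(requested, self.purchased_players, self.engine_max_players))

    def may_connect(self, ping_ms):
        """Allow a connection only when the ping is at or below the ceiling."""
        return ping_ms <= self.max_ping_ms


if __name__ == "__main__":
    limits = ServerLimits(purchased_players=16, engine_max_players=32, max_ping_ms=150)
    print(limits.effective_player_limit(64))
    print(limits.may_connect(200))
```

Taking the minimum of the requested, purchased, and engine limits covers both cases in the first bullet: the customer cannot exceed the package, and the package cannot exceed what the engine can deliver.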
A last option: you can always change the game. A novel approach to minimizing server performance impact from level-specific content download, adopted by Richard Stanway in his R1Q2 package,
involves outsourcing of map / texture / audio downloads to an HTTP server. This means that the map download function can be optionally offloaded to a separate system, perhaps one on a separate subnet
to minimize network impact, with transfers running at a higher TCP data rate than the game's existing UDP connections can support. The downside to doing this with an existing, released game is that
it will probably require client-side changes as well. This sort of approach would work well applied to the design of a new game server engine, and could readily be applied to a rewrite of an existing
engine where the server code has been released to the Open Source community.
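One piece of such a design is translating a requested content path into a URL on the separate download host. The sketch below shows that mapping under stated assumptions: the base URL is a placeholder, the function name is hypothetical, and this is not a description of how R1Q2 itself implements the feature. Paths that try to climb out of the content root are rejected.

```python
from posixpath import normpath
from urllib.parse import quote

# Placeholder host, not a real download server.
DOWNLOAD_BASE_URL = "http://downloads.example.com/q2/"


def http_download_url(game_path, base_url=DOWNLOAD_BASE_URL):
    """Map a game-relative content path to a URL on the download server.

    Returns None for empty, absolute, or parent-directory paths.
    """
    cleaned = normpath(game_path.replace("\\", "/"))
    if cleaned in (".", "..") or cleaned.startswith(("/", "../")):
        return None
    return base_url + quote(cleaned)


if __name__ == "__main__":
    print(http_download_url("maps/q2dm1.bsp"))
    print(http_download_url("../server.cfg"))
```

Rejecting unsafe paths before building the URL keeps a client from requesting files outside the published content tree.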
This type of distributed data transfer between the game server and the client is also another excellent application for threading techniques. In environments supporting several Massively Multiplayer
servers, these could even be scaled up to support deployment in clustered environments, with specific components of the cluster performing particular aspects of client updating and content download
About the Author
Doug Helbling is a software engineer for Intel's software development product deployment team. He works to develop Linux product delivery solutions, various game mods and case studies. His latest
project includes an optimization study of GarageGames' Torque engine.
1) Webmin web-based interface for system administration http://www.webmin.com
2) Intel C/C++ compilers and related software products http://www.intel.com
3) R1Q2 Quake 2 release http://www.r1ch.net/stuff/r1q2