My game server is eating my CPU

uri8700    100
Hi

I made a multiplayer 3D game and the server eats up way too much CPU. On a quad-core 3GHz machine it only supports 5 players before things get really slow.

The server is multithreaded, with one thread per player, and checks the line for input 60 times a second.

Does this make sense? Is my design somehow flawed? Does anybody know how I can up the number of players?

I can't check the line less often because then player movement is disrupted.

Thanks

Hodgman    51222
[quote name='_TL_' timestamp='1310960148' post='4836620']one thread per player[/quote]
[url="http://www.gamedev.net/index.php?app=forums&module=forums&section=rules&f=15"]Networking and Multiplayer FAQ[/url] -- Question 10 ("Should I spawn a thread per connection in my game code?").
Having one thread per player is likely slower, more complex and more bug-prone than just using a single thread for all players.
You can read data from your sockets in a non-blocking manner, so that the thread doesn't stall when there's no data from a particular player.

[quote]checks the line for input 60 times a second.[/quote]
Many action games, like Counter-Strike, only update 20 times per second and interpolate between results. Perhaps you can describe your algorithms/architecture a bit?
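For illustration, a rough sketch of what that single-threaded, non-blocking read loop might look like (BSD-style sockets assumed to already be in non-blocking mode; Player, handle_input and disconnect are placeholders, not anything from your code):

[code]
#include <sys/types.h>
#include <sys/socket.h>
#include <cerrno>
#include <vector>

struct Player { int sock; bool connected; };

void handle_input(Player&, const char*, ssize_t) { /* apply key presses, etc. */ }
void disconnect(Player& p) { p.connected = false; }

// One pass over every connected player; never blocks, never spins.
void poll_all_players(std::vector<Player>& players)
{
    char buf[512];
    for (size_t i = 0; i < players.size(); ++i)
    {
        Player& p = players[i];
        if (!p.connected)
            continue;

        for (;;)   // drain whatever has arrived on this socket
        {
            ssize_t n = recv(p.sock, buf, sizeof(buf), 0);
            if (n > 0)
                handle_input(p, buf, n);                      // got some input
            else if (n == 0)
                { disconnect(p); break; }                     // peer closed the connection
            else if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;                                        // nothing more right now
            else
                { disconnect(p); break; }                     // real error
        }
    }
}
[/code]

Call poll_all_players() once per server tick; every connection gets serviced from the same thread.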

uri8700    100
I'll take a look at the FAQ, thanks. But my sockets are already non-blocking.

The server checks for updates every 1/60th of a second; when it gets an instruction (key_was_pressed), it calculates the object's new position and sends an update to all players.

frob    44904
[quote name='_TL_' timestamp='1310960148' post='4836620']
Hi

I made a multiplayer 3D game and the server eats up way too much CPU. On a quad-core 3GHz machine it only supports 5 players before things get really slow.

The server is multithreaded, with one thread per player, and checks the line for input 60 times a second.

Does this make sense? Is my design somehow flawed? Does anybody know how I can up the number of players?

I can't check the line less often because then player movement is disrupted.

Thanks
[/quote]

Figure out WHY things get slow, then fix those things.

You say it "eats up way too much CPU". That probably means the problem is not the networking. Throwing a few buffers to your network card is not a CPU hog.

Use a profiler to figure out what is consuming all the time in your program. The "Premium" and "Ultimate" editions of Visual Studio include a profiler, and there is also a general CLR profiler that is moderately okay. There are free profilers such as AMD's CodeAnalyst for C++, or, if you are using a managed/.NET environment, tools like NProf, SlimTune, and the EQATEC profiler.

Look over your profile results carefully. Be sure you are measuring the optimized version of your code, and that you measure both before AND after making changes.



Using a profiler is pretty easy; there are many tutorials out there. Basically, you build your program with some special libraries and run it. At some point you use a tool (such as your debugger) to start profiling, let it run for a few seconds, then stop profiling. The system works out the results and gives you a bunch of tables showing which functions were called, how long they took, and how many times they were called.

Profiling networked code is not so different from profiling local code, other than that it takes a while to set up the normal working conditions of connecting multiple machines and getting to the problem spots.

If you haven't profiled your code before, usually the first time through you notice obviously stupid things like calling "strlen()" a billion times each update, or running an A* pathfinding on hundreds of items every frame, or similar accidental mistakes that you just didn't realize were there. From there you backtrack to why you are calling strlen a billion times and change the code to only call it a few times. Or you look at the code that is running pathfinding on every object every frame, and prune it down to running it only on a few items at a time. Or you'll notice that moving an object runs a query on every other game object instead of just those nearby. Or you'll notice other entirely different things.

Profiling and improving code is a learned art. You need to discover what is slow, and replace it with something faster. Generally, fixing profiler-discovered hot spots means reusing or caching values, spreading work across time, or changing your algorithm to one that processes less data. Sometimes the improvements are as simple as moving code outside a loop. Other times they require significant work writing tools to preprocess data so it can be managed more quickly in game, such as converting a basic mesh map into a spatial tree. Every issue is different, and has its own unique challenge.

rip-off    10976
Your networking threads should be using relatively low amounts of CPU. If they aren't, something is wrong. If you are using a non-blocking socket in each thread, those threads are probably spinning on the socket as fast as possible looking for incoming messages, which needlessly burns CPU time.

As mentioned, you generally don't want a "thread per client", but perhaps switching to blocking reads might allow you some breathing space until you have time to make the larger architectural changes.
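A minimal sketch of that stopgap, assuming BSD-style sockets (on Winsock the rough equivalent of the fcntl() call is ioctlsocket(sock, FIONBIO, &zero)); the helper names here are placeholders:

[code]
#include <fcntl.h>
#include <sys/types.h>
#include <sys/socket.h>

// Put a socket back into blocking mode so recv() parks the thread in the
// kernel until data arrives, instead of returning immediately.
bool make_blocking(int sock)
{
    int flags = fcntl(sock, F_GETFL, 0);
    if (flags < 0)
        return false;
    return fcntl(sock, F_SETFL, flags & ~O_NONBLOCK) == 0;
}

// Per-player thread body: an idle connection now costs (almost) no CPU.
void player_thread(int sock)
{
    char buf[512];
    for (;;)
    {
        ssize_t n = recv(sock, buf, sizeof(buf), 0);   // blocks until data arrives
        if (n <= 0)
            break;                                     // disconnect or error
        // ... parse the input and queue it for the simulation ...
    }
}
[/code]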

EJH    315
Do you have a "Sleep(1)" at the end of the server update function? That should make it much easier on the CPU.

Are you running this on a home network? Consumer broadband in most places has really poor upload speeds, which is what a game server needs most. You mention that it starts struggling at around 5 players, so that might be where you are hitting your upload rate limit.

Also, do you really need 60 updates a second? Try dropping it a bit.

ApochPiQ    23000
[quote name='EJH' timestamp='1311012109' post='4836917']
Do you have a "Sleep(1)" at the end of the server update function? That should make it much easier on the CPU.
[/quote]

Sleeping is the wrong way to relieve CPU usage, [i]especially[/i] in a network server.

The correct approach is to wait directly on the sockets, using blocking modes, IOCPs, select() with a timeout, or whatever. Just sleeping and then busy-polling the socket is incredibly bad design.

uri8700    100
Thank you all for your suggestions.

I'm doing a Sleep(33) on every thread to regulate to 30 recv()s per second, apparently non-blocking ones. What is IOCP? I'm running locally using the loopback interface. Now I can support 10-15 users, and the limiting factor is probably client use of CPU (unlike before when the server was the hog). I'll have to find another machine to do a more realistic test. But it's already much better :) thanks

frob    44904
[font="arial, verdana, tahoma, sans-serif"][size="2"][quote name='rip-off' timestamp='1310989495' post='4836761']
If you are using a non-blocking socket in a thread...
[/quote]
[quote name='EJH' timestamp='1311012109' post='4836917'][/size][/font]Do you have a "Sleep(1)" at the end the server update function? That should make it much nicer to the CPU.

You running this on a home network? Consumer broadband in most places has really bad up speeds, which is what a game server needs the most. You mention "5 players and it starts crapping out" so that might be where you are hitting your up rate limit.

Also, do you really need 60 updates a second?
[/quote]

That's going about it backwards: looking at solutions first, and then seeing whether the problems exist.


Open a profiler, measure, and discover the thing that is actually slow. After the actual problem is discovered, fix that problem. Then measure again.

In real-life examples I've seen, I have profiled code and discovered that the performance problems stemmed entirely from accidental assumptions. The issues are generally not the problems that were suspected.

In one specific case a very tight loop inside a tool was calling strlen() on every item in a huge list, every single update. The programmer working on the system thought the problem was in a completely different system, and kept swearing up and down that the code was as fast as it could go and the loop fully optimized. Obviously, walking the full length of every string like that is painful for performance. A very minor change to stop re-calculating the lengths took the tool from requiring about five minutes to running almost instantly.


In another specific case, a tool was trying to calculate dependencies on data files within the build system. The existing system had been in use for almost a decade on multiple shipped titles. Various people had mucked with the system, made some improvements buying a few seconds here or there, and moved on. I got sick of it and plugged it into the profiler to see what was taking so long. Every time it validated a file, it would run a query across the network, see if the file existed on a remote server's disk, and then continue. It ran this query on every item in the game database of several hundred thousand items. People on the team just assumed that because the game had grown for so many years and it was a legacy system, that's just how it was. After measuring and finding the actual problems with a profiler, I first fixed the remote disk lookup and changed it to a local lookup. This dropped the run time from about 15 minutes to about 3 minutes. Profiling some more, I could still see that a huge amount of the remaining time was spent in OS calls to look up file names; there were thousands of times more lookups than the number of actual files. So I scanned the directory tree once at startup, cached it in a hash map, and used that for the lookups. This dropped the total run time to about 45 seconds.

In another specific case, processing some line-wrapped text caused the system to grind to a halt. One programmer suspected the font renderer, because stopping execution always showed it running in the pre-render step. Another thought it was the font engine calculating font sizes, since it frequently stopped while looking up those details. After ACTUALLY PROFILING the system, the culprit turned out to be an incredibly naive algorithm for detecting when to wrap a line of text: the first character would be pre-rendered, then the code would calculate its bounding box and see if it fit. Then the first two characters would be pre-rendered and the bounding box manually re-calculated. Then three characters would be pre-rendered, the bounding box figured out, and the line wrapped if appropriate. So the worst case of writing out several paragraphs of line-wrapped text caused a thousand or so pre-renders to calculate word wrap by adding one character at a time. To compound it, the results were re-calculated every frame. Fixing that naive algorithm completely solved the bottleneck.

If those other programmers had simply jumped in and followed their guesses about what was slow, they would not have found and fixed the problems.




Always get a proper diagnosis before prescribing a solution.

XXChester    1364
[quote name='_TL_' timestamp='1310960148' post='4836620']
Hi

I made a multiplayer 3D game and the server eats up way too much CPU. On a quad-core 3GHz machine it only supports 5 players before things get really slow.

The server is multithreaded, with one thread per player, and checks the line for input 60 times a second.

Does this make sense? Is my design somehow flawed? Does anybody know how I can up the number of players?

I can't check the line less often because then player movement is disrupted.

Thanks
[/quote]

You answered your own question. You spawn a thread for each player and you only have 4 cores. This wouldn't be a problem if you put a small delay in the threads, but since you just let them run 60 times a second you are going to gobble up all of the CPU. I ran into a similar problem with a chat application, but a simple Thread.Sleep for 10 milliseconds changed my server from using 100% CPU to 4%. This is probably still not the optimal way to do it.

Geri    367
- Your design is correct; 10 is a very small number of threads, so the problem is something else.

- 60 is too much. Interpolate, and use at most 25 position updates per second.

- Maybe the empty spins of your loops are eating the performance away? Try adding a Sleep(4) in the while() loops where you receive the data.

- Maybe your database/object/model/character management is too slow and eats too much CPU?

EJH    315
[quote name='ApochPiQ' timestamp='1311016288' post='4836962']
[quote name='EJH' timestamp='1311012109' post='4836917']
Do you have a "Sleep(1)" at the end of the server update function? That should make it much easier on the CPU.
[/quote]

Sleeping is the wrong way to relieve CPU usage, [i]especially[/i] in a network server.

The correct approach is to wait directly on the sockets, using blocking modes, IOCPs, select() with a timeout, or whatever. Just sleeping and then busy-polling the socket is incredibly bad design.
[/quote]

Heh, I never suggested busy polling. Sleep(1) in our UDP game server vastly reduced the CPU usage coming just from the main server update loop, where game-related stuff and outgoing event timers get updated. Nothing to do with polling sockets. :)

Is there anything wrong with that? I mean, a program consisting entirely of a single for loop that increments an integer can spike a core to 100%.

[quote name='_TL_' timestamp='1311043875' post='4837157']I'm doing a Sleep(33) on every thread to regulate to 30 recv()s per second, apparently non-blocking ones.[/quote]

That's not going to be a very reliable timer. Pretty sure when you Sleep(x) the OS doesn't guarantee a return in x ms, just "at least x" ms. You need a real timer running on your server if it is anything more than just a relay.

hplus0603    11347
[quote name='EJH' timestamp='1311108420' post='4837611']
[quote name='_TL_' timestamp='1311043875' post='4837157']I'm doing a Sleep(33) on every thread to regulate to 30 recv()s per second, apparently non-blocking ones.[/quote]

That's not going to be a very reliable timer. Pretty sure when you Sleep(x) the OS doesn't guarantee a return in x ms, just "at least x" ms. You need a real timer running on your server if it is anything more than just a relay.
[/quote]

Most real-time systems are event driven, and handle events out of a queue. "events" can be things like "data is available on socket X" or "timer Y says it's time to step the simulation."

In these architectures, your main primitives are:
1) pend some event (timer, pending read, etc)
2) wait, blocking, for some event to fire
3) handle the event

You will see that select() is implemented like this, but it only works for file handles. I/O Completion Ports on NT and evented I/O on Linux work like this as well, and for a bigger set of objects. A library like boost::asio abstracts the platform-specific details and gives you this system using the reactor (for the event queue) and any number of worker threads you want to use to pull events off the queue.

The queue will be blocking -- if a thread asks for work, and there's nothing in the queue, it will be blocked until there is work to do. These kinds of queues are the best way to multi-thread a server in many cases, because you can spawn one thread per CPU core, and make maximal usage of available hardware resources, as long as all I/O is asynchronous/evented as well -- no blocking on file reads in the "pure" model! You have to make sure that your event handlers are all thread safe, though, which may introduce more serialization than ideal, but for some problem domains ("everybody is in the same room") it's hard to avoid.
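For what it's worth, a minimal sketch of that model using boost::asio might look like the following; the port number, tick rate and handler names are placeholders, not anything from this thread. One io_service acts as the event queue: the timer pends the next simulation step, the socket pends the next datagram, and run() blocks until either fires.

[code]
#include <boost/asio.hpp>

using boost::asio::ip::udp;

boost::asio::io_service io;                              // the event queue
udp::socket sock(io, udp::endpoint(udp::v4(), 4000));    // placeholder port
boost::asio::deadline_timer tick_timer(io);
udp::endpoint sender;
char buf[512];

void step_simulation() { /* advance the game one fixed step */ }
void handle_packet(std::size_t /*bytes*/) { /* parse one client message */ }

void on_receive(const boost::system::error_code& ec, std::size_t bytes)
{
    if (!ec)
        handle_packet(bytes);
    sock.async_receive_from(boost::asio::buffer(buf), sender, &on_receive);  // pend next read
}

void on_tick(const boost::system::error_code& ec)
{
    if (!ec)
        step_simulation();
    tick_timer.expires_from_now(boost::posix_time::milliseconds(33));        // ~30 Hz
    tick_timer.async_wait(&on_tick);                                         // pend next tick
}

int main()
{
    sock.async_receive_from(boost::asio::buffer(buf), sender, &on_receive);
    tick_timer.expires_from_now(boost::posix_time::milliseconds(33));
    tick_timer.async_wait(&on_tick);
    io.run();   // blocks waiting on both events; no busy-polling, no Sleep()
}
[/code]

With more worker threads you would call io.run() from each of them, which is where the thread-safety caveat above comes in.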

ApochPiQ    23000
[quote name='EJH' timestamp='1311108420' post='4837611']
Heh, I never suggested busy polling. Sleep(1) in our UDP game server vastly reduced the CPU usage coming just from the main server update loop, where game-related stuff and outgoing event timers get updated. Nothing to do with polling sockets. :)

Is there anything wrong with that? I mean, a program consisting entirely of a single for loop that increments an integer can spike a core to 100%.[/quote]

Yes, there's something wrong with that.

A server should not be doing so much work that it pegs a core, [i]unless there are enough clients actively requesting that much work to be done[/i]. If your server registers any non-trivial CPU usage with nobody connected, you're doing it wrong.

Sleeping is also the wrong way to relieve CPU usage. The correct solution is to do less work, or wait intelligently using OS wait primitives where applicable. Sleeping a thread to make the CPU usage drop is very bad, because it's analogous to popping the tires on your car so you can't drive over the speed limit. Sure, the cops will quit pulling you over for speeding, but you're also going to never get anywhere because [i]your tires are flat. [/i]As soon as you sleep - even Sleep(0) - to drop CPU usage, your scalability just got axed by an order of magnitude at least. When you [i]do[/i] need to use an entire core to do real work, you'll be wasting time yielding CPU for no good reason, just so you chew up less cycles in idle.

The point of idle is that you should use [i]no[/i] CPU, not "less" and not "a tiny bit." If nobody asks your server to run the game simulation, you shouldn't be wasting cycles on it.

hplus0603    11347
[quote name='_TL_' timestamp='1311125007' post='4837734']
OK, can anybody suggest a method other than Sleep() for regulating server ticks?
[/quote]

I already did.
To re-cap:
Either use evented I/O (libevent, boost::asio, etc) with timers for ticks, or use select() with a timeout for when the next tick is supposed to happen.

EJH    315
[quote name='ApochPiQ' timestamp='1311114347' post='4837670']
Sleeping is also the wrong way to relieve CPU usage. The correct solution is to do less work, or wait intelligently using OS wait primitives where applicable. Sleeping a thread to make the CPU usage drop is very bad, because it's analogous to popping the tires on your car so you can't drive over the speed limit. Sure, the cops will quit pulling you over for speeding, but you're also going to never get anywhere because [i]your tires are flat. [/i]As soon as you sleep - even Sleep(0) - to drop CPU usage, your scalability just got axed by an order of magnitude at least. When you [i]do[/i] need to use an entire core to do real work, you'll be wasting time yielding CPU for no good reason, just so you chew up less cycles in idle.

The point of idle is that you should use [i]no[/i] CPU, not "less" and not "a tiny bit." If nobody asks your server to run the game simulation, you shouldn't be wasting cycles on it.
[/quote]

Ok, let's say there are hundreds of timed events that happen at random intervals on the server. Whenever any of these events occurs, some clients must be informed. What is the proper way to handle that with OS wait primitives? Currently my main server thread looks something like this (Lidgren Network, UDP, 32-player game):


[code]
while (1)
{
    // (1) update clock
    // (2) handle incoming admin commands

    // note: 3 through 5 are not necessarily sequential
    // (3) handle incoming message traffic
    // (4) update all game objects, timed events, etc.
    // (5) send any outgoing traffic

    // (6) sleep(1)
}
[/code]

It always has low CPU usage, even with 32 players on 6-year-old hardware. Bandwidth would become an issue long before CPU would. But for future reference, I'd like to know the "proper" way to handle things with OS primitives if you have potentially hundreds of events going out at random times from your server, many of which have no dependence on incoming traffic. Just wondering... in Windows, C# preferably. ;)

hplus0603    11347
[quote name='EJH' timestamp='1311187907' post='4838095']
Ok, let's say there are hundreds of timed events that happen at random intervals on the server. Whenever any of these events occurs, some clients must be informed. What is the proper way to handle that with OS wait primitives?
[/quote]

First, simulation events should happen only during well-defined time steps. Time should be measured in "steps" not floating-point seconds. This means that any simulation event always happens during some particular time step.

Second, you schedule notifications to clients. Any event that happens during a time step between the last outgoing packet and the current outgoing packet should be put into the packet being sent to the user. In the simplest implementation, there's a packet per time step per user -- at 30 steps per second, this is quite doable. In more advanced systems, you may have higher step rates (especially if doing precision physics), and you may bundle more steps into a single packet for clients with higher latency, for example.
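A bare-bones sketch of that per-step scheduling; Event, Client and send_packet are invented purely for illustration:

[code]
#include <string>
#include <vector>

struct Event  { unsigned step; std::string payload; };   // whatever the simulation generates
struct Client { int id; /* socket/address would live here */ };

std::vector<Event> pending_events;   // filled by the simulation during the current step

void send_packet(const Client&, unsigned /*step*/, const std::vector<Event>&)
{
    // serialize the step number plus the events and hand the buffer to the socket layer
}

// Called once at the end of each simulation step: every event that happened
// during step N goes into the one packet each client receives for step N.
void end_of_step(unsigned step, const std::vector<Client>& clients)
{
    for (size_t i = 0; i < clients.size(); ++i)
        send_packet(clients[i], step, pending_events);
    pending_events.clear();
}
[/code]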

ApochPiQ    23000
[quote name='hplus0603' timestamp='1311191844' post='4838122']
[quote name='EJH' timestamp='1311187907' post='4838095']
Ok, let's say there are hundreds of timed events that happen at random intervals on the server. Whenever any of these events occurs, some clients must be informed. What is the proper way to handle that with OS wait primitives?
[/quote]

First, simulation events should happen only during well-defined time steps. Time should be measured in "steps" not floating-point seconds. This means that any simulation event always happens during some particular time step.

Second, you schedule notifications to clients. Any event that happens during a time step between the last outgoing packet and the current outgoing packet should be put into the packet being sent to the user. In the simplest implementation, there's a packet per time step per user -- at 30 steps per second, this is quite doable. In more advanced systems, you may have higher step rates (especially if doing precision physics), and you may bundle more steps into a single packet for clients with higher latency, for example.
[/quote]



This is exactly right.

Discretization is the name of the game. Instead of saying "this event happens in 12 ms, this one in 34 ms" you say "this happens at tick 1+offset 2ms, this one in tick 3+offset 4ms." Then you gather the events that occur in each tick, and send them (along with their tick-relative time offsets) out to the clients. This lets you round-robin updates to multiple clients over the course of a tick, which is a good solution for multithreaded servers; you just tell everyone the timestamp of the tick start and then give them all the information in (authoritative server time) relative offsets.

The server simply runs a simulation tick ahead of what it sends to the clients, so it can aggregate the updates correctly.

uri8700    100
hplus, can you give an example algorithm similar to EJH's using select() instead of Sleep() and the clock? His algorithm is more or less like mine.

uri8700    100
I use an algorithm similar to EJH's, but my server is consuming 45% CPU if I do Sleep(1). I have to go up to Sleep(30) or so for CPU usage to drop low enough. How can that be? Just checking the timer isn't supposed to be much of a CPU hog, even if it's done 1000 times per second.

hplus0603    11347
[quote name='ApochPiQ' timestamp='1311204664' post='4838187']
this happens at tick 1+offset 2ms, this one in tick 3+offset 4ms.
[/quote]

In my opinion, there should be no offsets within the ticks. Everything happens "at" the tick. On the CPU, of course, some things happen before other things, but they are all logically expected to happen during that particular tick.

The only time when sub-ticks matter is when you do presentation things like animations and sound effects -- and those are entirely client-side, derived from the simulation state, and thus do not need any explicit sub-frame simulation synchronization.

evillive2    779
[quote name='_TL_' timestamp='1311209426' post='4838207']
hplus, can you give an example algorithm similar to EJH's using select() instead of Sleep() and the clock? His algorithm is more or less like mine.
[/quote]
The last argument to select() is a structure holding a timeout suggestion. Sleep() will wait for AT LEAST the timeout requested, regardless of whether anything is waiting. select(), on the other hand, will wait for UP TO the timeout requested if nothing is happening on the socket sets being polled. If something does happen on those socket sets, the select() call returns regardless of whether the timeout has expired. I believe libevent has a similar mechanism, but I don't recall it right now.

That being said, if you are already using select() to poll sockets, just pass a small timeout in the timeout argument. Note that this argument is a timeval structure, not just a simple number.

[url="http://linux.die.net/man/2/select"]man select[/url]

ApochPiQ    23000
[quote name='hplus0603' timestamp='1311210060' post='4838215']
[quote name='ApochPiQ' timestamp='1311204664' post='4838187']
this happens at tick 1+offset 2ms, this one in tick 3+offset 4ms.
[/quote]

In my opinion, there should be no offsets within the ticks. Everything happens "at" the tick. On the CPU, of course, some things happen before other things, but they are all logically expected to happen during that particular tick.

The only time when sub-ticks matter is when you do presentation things like animations and sound effects -- and those are entirely client-side, derived from the simulation state, and thus do not need any explicit sub-frame simulation synchronization.
[/quote]

In the common case I would tend to agree, but there are times (such as when doing expensive physics simulation server-side) where it's nice to be able to offset within a tick and do some tweening from that to get to client-presented values. Also helps a bit with perceived latency in certain edge cases.

Suppose you need to run complex physics in addition to some higher-level game logic. Your physics threads get snarled on a nasty collision resolution, for instance, and take slightly longer than the tick budget to return. Instead of deferring the results of the collision by an entire tick, or assigning it to the prior tick and possibly getting premature collision response animation/etc., you use an offset to hint to the client that it needs to do some interpolation to make things look continuous.

Coupled with roundtrip time estimates, you can use this to help resolve "I shot you first" type situations, although admittedly it requires a degree of care to ensure that the arbitration actually produces results that "feel" correct. Bungie did some interesting stuff with this in Reach, and talked about it at length at GDC 2011.

hplus0603    11347
[quote name='ApochPiQ' timestamp='1311223884' post='4838299']
Suppose you need to run complex physics in addition to some higher-level game logic. Your physics threads get snarled on a nasty collision resolution, for instance, and take slightly longer than the tick budget to return. Instead of deferring the results of the collision by an entire tick, or assigning it to the prior tick and possibly getting premature collision response animation/etc., you use an offset to hint to the client that it needs to do some interpolation to make things look continuous.
[/quote]


There is no case in reality where this makes sense. If your physics simulation takes longer to run one tick than the duration of a tick, you're likely heading for the Death Spiral of Death. However, let's assume that there's a temporary CPU stall for about one tick's worth of time -- maybe because of virtualization, maybe because of scheduling, maybe because a backup process started -- whatever. Then what? How is this different from a network stall for about one tick's worth of time?
Your system needs to be able to deal with this; typically by adapting the estimated clock offset when there's a snag. In physics simulation time, there is only "the step," and no events happen at a resolution finer than "the step." Separately, your client may lerp the display of various events to times between the actual "step" times, but that's entirely a client-side decision. There is no case in a fixed time-step simulation where it makes sense for the server to try to offset events by less than a step.
