LycaonX

Multithreaded server - horrible performance


I've got a game server with 160-200 clients. Well, a potential server. I'm using .NET's sockets with the async methods (BeginReceive, etc). My problem is that I'm experiencing MAJOR lag once more than 25 clients are connected. This can be any random mix of actual players and NPC connections. Even clients connected over my LAN, or on the local computer itself, see ping times of 500-3500ms. Now, these are not ICMP pings; they're measured over the client's single TCP connection, as I suspect most online games do it.

With just a dozen or two player clients, pings are in the usual range you'd expect from random players all over the world: 50ms to 400ish (for dialup). My personal ping from my client to the server is ALWAYS 0-1ms with any number of clients under 25. Once I add more in, my ping climbs up to 3500ms, which is ridiculous for a self-ping.

From what little research I've been able to turn up on Google, the async methods of .NET sockets grab threads from the process ThreadPool, which makes up to 25 threads available to classes that use it. So this means I have 25 threads servicing 200 connections, which I think MIGHT be the problem, but I would like the advice of more experienced coders.

Most of the 'clients' are NPC connections. The NPC application is in C++ and not easily modifiable, so I'm not too keen on messing with it at the moment. If I look at Task Manager, I see that the NPC app has a single thread for each NPC connection.

So I'm kind of flailing in the dark here. I'm not sure what the 'name' of the solution is, so I'm not sure what to type into Google to find whatever answer I need. Let me know if you need more information in order to diagnose the problem, and/or fix it.
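For reference, the receive path follows the standard Begin/EndReceive pattern, roughly this shape (a simplified C# sketch with invented identifiers, not my actual code):

    // Standard async receive loop (sketch; identifiers invented)
    using System;
    using System.Net.Sockets;

    class ClientConnection
    {
        private readonly Socket _socket;
        private readonly byte[] _buffer = new byte[4096];

        public ClientConnection(Socket socket)
        {
            _socket = socket;
            // Post the first asynchronous read.
            _socket.BeginReceive(_buffer, 0, _buffer.Length, SocketFlags.None, OnReceive, null);
        }

        private void OnReceive(IAsyncResult ar)
        {
            int bytesRead = _socket.EndReceive(ar);
            if (bytesRead == 0) { _socket.Close(); return; } // remote side disconnected

            // ...hand the bytes off for processing...

            // Post the next read so data keeps flowing.
            _socket.BeginReceive(_buffer, 0, _buffer.Length, SocketFlags.None, OnReceive, null);
        }
    }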

Threadpools and non-blocking sockets are the way to go. I don't know about the async versions; I've never used them because of the potential for thread-lock issues.

I also suspect there's some calling overhead with all the async functions being invoked.

In fact, having only a single thread (like haproxy does) can be faster than threading.

How does your memory behave? It could be collection-resize issues (fixable by reserving more space up front), or garbage collection kicking in.

On the MMOs I've worked on we had a maximum of 12 threads, and sometimes (during live debugging) only 1. And it ran perfectly (in Java, mind you).

Also: did you try profiling the server?
Did you verify that the server only allocates a maximum of 25 threads? (As mentioned, you might speed up the application by lowering the thread-pool count, although that might hide the real issue.)

Yeah, I profiled it. I also ramped the NPC connections up to 400 and it still shows a max of 25 threads.

The async socket methods internally use the ThreadPool, according to MSDN and several other sites found via Google.

From what I read, calling a socket Begin* method grabs an idle thread from the process's ThreadPool and holds onto it until the async operation completes, then releases that thread back to the pool.

OK. Did you try lowering the number of threads? (The whole async thing could be swamping the CPU(s).)

What did the profiler say with regard to memory/CPU usage?

You could implement thread pooling yourself with non-blocking sockets to see if that helps. I did this with great success in Java, so the performance should be similar.

Microsoft recommends async sockets on the Windows platform, and others usually recommend them as well, so it sounds like you are using the preferred method. Could you profile your app and see where it spends most of its time?

I am thinking the problem you are having is that with only 25 threads, you can only perform 25 async read operations at the same time. Any more will have to wait in a queue. And since an async read operation will not return until some data has been received, your queued-up clients are waiting on other clients to send data, which would definitely increase ping times.

Quote:
Original post by landagen
I am thinking the problem you are having is that with only 25 threads, you can only perform 25 async read operations at the same time.


That's what I'm thinking, as well. Like I mentioned, the NPC application appears to use one thread per connection... but it's very well known that the NPC client we have is part of a project where... eh... saying 'bad coding practices' would be putting it in a nice light.

I really wouldn't mind recoding the client class to use blocking sockets, each on its own thread, but performance-wise, is having 200-400 threads really a smart idea?

I haven't done any network programming in .NET, but have you considered giving the old-fashioned Select a try? I see it's still included as a static method on the Socket class.
Quote:

That's what I'm thinking, as well. Like I mentioned, the NPC application appears to use one thread per connection... but it's very well known that the NPC client we have is part of a project where... eh... saying 'bad coding practices' would be putting it in a nice light.

I really wouldn't mind recoding the client class to use blocking sockets, each on its own thread, but performance-wise, is having 200-400 threads really a smart idea?
C# allocates 1MB per thread stack by default, so you are talking about an order of magnitude or more increase in memory usage.
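To illustrate the Select route: a minimal single-threaded polling loop might look like this (a sketch only; HandlePacket and the list housekeeping are invented):

    // Single-threaded Select loop (sketch; HandlePacket is invented)
    using System.Collections.Generic;
    using System.Net.Sockets;
    using System.Threading;

    void PollLoop(List<Socket> clients)
    {
        byte[] buffer = new byte[4096];
        while (true)
        {
            if (clients.Count == 0) { Thread.Sleep(1); continue; }

            // Select prunes the list down to sockets that are ready, so pass a copy.
            List<Socket> readable = new List<Socket>(clients);
            Socket.Select(readable, null, null, 1000); // timeout in microseconds

            foreach (Socket s in readable)
            {
                int n = s.Receive(buffer);     // won't block; Select said it's ready
                if (n == 0) clients.Remove(s); // clean disconnect
                else HandlePacket(s, buffer, n);
            }
        }
    }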

Looks like 85% of the total time is spent in Thread.Sleep calls, spread across three threads (main, movement, misc game events). The other 15% is spread pretty evenly across the ~50 or so other methods being called, anywhere between 0.3% and 0.8% of total processing time each.

Probably not the answer you're looking for, but you could use XF.Network instead of System.Net.

I've scaled it up to 1000 or so concurrent connections without trouble, and the API is delicious.

EDIT: Just saw your last reply. You aren't calling Thread.Sleep inside your callbacks, are you? (The methods that take IAsyncResult and call Socket.EndXXX.) Those are threadpool threads; you really shouldn't be doing anything heavy there. Just deserialize your messages and pass them to your main thread. If the sleeps are in your main threads, and they're taking up 85% of your time, then obviously your socket code isn't the problem :P
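In other words, something with this shape (a sketch; ClientState, Deserialize, and the queue are invented names):

    // Lean completion callback: parse and hand off, nothing else (sketch)
    void OnReceive(IAsyncResult ar)
    {
        ClientState state = (ClientState)ar.AsyncState;     // invented per-client holder
        int bytesRead = state.Socket.EndReceive(ar);
        if (bytesRead == 0) { state.Socket.Close(); return; }

        Message msg = Deserialize(state.Buffer, bytesRead); // cheap parse only
        mainThreadQueue.Enqueue(msg);                       // game logic happens elsewhere

        state.Socket.BeginReceive(state.Buffer, 0, state.Buffer.Length,
                                  SocketFlags.None, OnReceive, state);
    }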

Quote:
Original post by CadetUmfer
EDIT: Just saw your last reply. You aren't calling Thread.Sleep inside your callbacks, are you?


Noooo, I know better than to do that :p The async callback is set up as a "get-in, get-out as fast as possible" method. The data received is immediately passed off for handling.

Side note: I dabbled with the ThreadPool.SetMaxThreads() method, and from the output it looks like threads from the pool are NOT 'held'; they're only used to actually invoke the async callback. So, for those of you reading this with similar problems: no, setting MaxThreads to 250/250 does NOT make a difference.
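For anyone curious, this is roughly what I poked at (hypothetical sketch of the same calls, inside Main(), say):

    // Checking the pool limits before and after (sketch)
    int worker, io;
    ThreadPool.GetMaxThreads(out worker, out io);
    Console.WriteLine("max before: {0} worker / {1} completion-port", worker, io);

    ThreadPool.SetMaxThreads(250, 250); // returns false if the request is rejected

    ThreadPool.GetMaxThreads(out worker, out io);
    Console.WriteLine("max after: {0} worker / {1} completion-port", worker, io);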

Quote:
Original post by LycaonX
Noooo, I know better than to do that :p The async callback is set up as a "get-in, get-out as fast as possible" method. The data received is immediately passed off for handling.


200 connections is nothing, unless they each try to send megabytes of data per second.

So there is a problem with how the data is processed.

Quote:
Side note: I dabbled with the ThreadPool.SetMaxThreads() method, and from the output,


If completion handling is not the point of contention, then the maximum number of threads used will be roughly equal to the number of cores. Looking at thread times on a four-core machine, a maximum of four threads should be seeing 95% of the time, with another thread getting the rest.

But as long as processing is fast enough, one or two threads should be more than enough - but that is all handled automatically.

Simply divide the network bandwidth by the packet size: that gives packets per second, and its inverse is the time available to process one packet. All cores together have this much time before the handlers start hogging the networking part.

For example, if each packet is 500 bytes and the network can handle 50 megabytes/second (symmetric upload/download, echo server), that is 100,000 packets per second, so each request needs to be processed within 10 microseconds, or 40 microseconds per thread on a 4-core machine.

But this is just the networking part, i.e. the point at which more than 4 threads would be needed. Most realistic systems will not be able to do useful work at such rates anyway. So if handlers complete faster than that, there should never be more than 4 active completion threads. Most systems only see thousands of messages per second, around 1% of this theoretical limit, which a single thread on a single core can handle while sitting idle 90% of the time. That leaves a lot of time for processing.
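Spelled out with the numbers above (illustrative arithmetic only):

    // Per-packet time budget at the rates above (illustrative)
    static double PerPacketBudgetMicroseconds()
    {
        double bandwidth  = 50e6;   // bytes/second
        double packetSize = 500;    // bytes
        double packetsPerSecond = bandwidth / packetSize;  // = 100,000
        return 1e6 / packetsPerSecond;                     // = 10 us; x4 cores = 40 us per thread
    }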


Short version: it is not the networking API that is causing the delays, it's how the data is handled by the application.

I think you may need to post some code for us to be more helpful. I would definitely post your callback function and any procedures your callback calls. Do you have any locks or anything in your callback? Also, post the code that actually makes the call to BeginRead, along with how and when it is called.

What is the CPU load on your server? Are you actually using all the CPU? If not, then you have an algorithmic bug in how you handle your networking, or perhaps a locking bug where you serialize on some blocking code.

Btw: BeginReceive/EndReceive will end up using a thread pool and I/O completion ports in the implementation, so it's a reasonably efficient way to do I/O.

I'm running on my desktop at the moment: quad-core 3.0GHz, 8GB of RAM. I'm lucky if I hit 10% CPU usage with Firefox (20 tabs open), WoW running (sometimes three copies simultaneously), miscellaneous folders open, IM programs, mIRC, all the usual junk.

As far as how the data is processed: the server is modular. At run time, the server loads up all the classes that handle the various opcodes the client sends (for example chat, changelevel, etc). These are compiled in memory, one at a time, each handler as its own class, then loaded and stored in a Dictionary(Of Opcode, Handler).

I know you'd like a perfect copy/paste of the code, but I'm not allowed to distribute it. I suppose I could copy out all the pertinent code and pseudo-code the parts that are specific to the server. I didn't come up with the restriction; I just signed the paper.

The program has a simple class using an async socket (instead of a TcpListener) to accept connections. I don't see any issues with it; I've had 500 incoming connections handled gracefully, all 500 in under three seconds.
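The accept side has roughly this shape (a C# sketch with invented names, since I can't post the real code):

    // Async accept loop (sketch; names invented)
    // Requires: using System.Net; using System.Net.Sockets;
    void StartListening(int port)
    {
        listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        listener.Bind(new IPEndPoint(IPAddress.Any, port));
        listener.Listen(100);
        listener.BeginAccept(OnAccept, null);
    }

    void OnAccept(IAsyncResult ar)
    {
        Socket client = listener.EndAccept(ar);
        listener.BeginAccept(OnAccept, null); // immediately post the next accept
        OnClientConnected(client);            // invented: raises the event the ClientManager handles
    }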

The new socket is passed to the ClientManager via an event, which assigns the socket to a Client class (which holds the socket and info on the client, like the username). The ClientManager then takes over the async operations, initiating handshaking, then passing received data to the main thread via an event.

The main thread then acts on the data. It creates a Packet class from the byte array, which separates the byte data into an opcode, a payload length, and opcode-specific data. The main thread then checks the handler dictionary for the opcode. If there is a handler for that specific opcode, the client and data are passed ByVal to the handler and processed from there. If not, an exception is logged to a file.
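The dispatch boils down to something like this (again a sketch, in C# with invented type names):

    // Opcode dispatch (sketch; Opcode/IHandler/Client/Packet are invented stand-ins)
    Dictionary<Opcode, IHandler> handlers; // filled at startup from the compiled modules

    void Dispatch(Client client, Packet packet)
    {
        IHandler handler;
        if (handlers.TryGetValue(packet.Opcode, out handler))
            handler.Handle(client, packet);                          // runs on the main thread
        else
            LogException("No handler for opcode " + packet.Opcode);  // logged to file
    }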

I am going to profile the execution time of each handler; it's possible that one or more of them are taking longer than usual. It's a pain, though, since VS doesn't appear to be able to debug assemblies that are loaded via Assembly.Load. Yes, I do have .GenerateDebugInformation = True in the compile parameters.

Instead of handling the data immediately, as soon as it hits the method in the main thread, I've also tried using a Queue serviced by a separate thread. When the data arrives via the event in the main thread, I SyncLock m_Packets.GetType, add the client/data, then End SyncLock.

In the queue-processing thread, I check whether m_Packets.Count > 0. If it is, m_Packets is locked, one packet is .Dequeue'd, End SyncLock (so other packets can be queued while the current one is being processed), and this loops until all queued packets are processed. If there are no packets in the queue, I Sleep(10) and the loop starts over.

Whether I use the queue or process each packet as it arrives, I see no visible difference in the latency. I haven't profiled to actually check, though.

Also, Ozak, I'm not sure how you'd implement socket operations for 200 connections without threading. Whether you use a separate manually-created thread and do blocking operations in it, or use the built-in async methods, you end up using threads from the process ThreadPool either way. Would you mind giving me the basic name of that approach so I can Google it myself?

Although this is unlikely to be the solution to your problem: you should never lock on a publicly accessible object, and especially not on an entire type. Just make a new private object and lock on that instead.
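Something along these lines (a sketch; QueuedPacket and Process are invented), which would also let you drop the Sleep(10) poll in your queue thread:

    // Private lock object plus Monitor.Wait/Pulse instead of lock-on-type and polling (sketch)
    // Requires: using System.Collections.Generic; using System.Threading;
    private readonly object packetLock = new object();
    private readonly Queue<QueuedPacket> packets = new Queue<QueuedPacket>();

    void EnqueuePacket(QueuedPacket p)        // called from the receive event
    {
        lock (packetLock)
        {
            packets.Enqueue(p);
            Monitor.Pulse(packetLock);        // wake the consumer
        }
    }

    void QueueThread()
    {
        while (true)
        {
            QueuedPacket p;
            lock (packetLock)
            {
                while (packets.Count == 0)
                    Monitor.Wait(packetLock); // sleeps until Pulse; no Sleep(10) needed
                p = packets.Dequeue();
            }
            Process(p);                       // outside the lock so producers aren't blocked
        }
    }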

The topic seems to have died down, but I profiled all the packets used; here's the info. Numbers are seconds, measured over roughly 24.5 hours.

PlayerMsg:        0.015      ' Handles player skill use
PlayerUpdate:     0.031      ' Handles movement updates
ServerLogin:      5.141      ' Handles player logins
ServerLogout:     0.579      ' Handles ALL logouts
LevelLogin:       0.076      ' Handles level changing
GetLevelPlayers:  1.35       ' Used by NPCs to scan for players so they can spawn in the correct areas
ChangeAvatar:     11.388     ' Used by all clients, player and NPC, to update their physical look (this seems pretty high)
AgentLogin:       163.216    ' Used by NPC logins
TotalTime:        86437682.723

There are other opcodes, but none logged enough time to appear on the list. ChangeAvatar and AgentLogin are fairly high compared to the others.

Are you running the server on a home connection? Around 25 players might be where you run out of upstream bandwidth. Not sure how that would lag a local connection, though.

My C# .NET server uses the Begin* methods, with very little lag (though I haven't been able to stress-test it as much as I would like).
Are you on .NET 3.5? When I upgraded from 2.0 to 3.5 there was a slight increase in processing time for sockets. Also, 3.5 introduces a newer, more C-like socket implementation (which I have not taken advantage of yet).

Look Here:
http://msdn.microsoft.com/en-us/magazine/cc163356.aspx

Quote:
Original post by hplus0603
It's hard to tell how you arrived at those numbers


Nothing fancy; just grabbing Environment.TickCount immediately before and after each packet is processed.

As far as the connection goes, I'm on what most would consider a fairly low-capacity upload (896 kbps), but the most my server has ever sent out over the DSL is ~25 kilobytes per second. 90% of the traffic is from the NPCs, which run either on the same computer as the server or on a LAN machine.

I messed around with it a bit last night and I was (randomly?) getting anywhere from 10 to 300 ms ping from the client on my laptop to the server on my desktop.

It's running on top of 3.5; I'll see if 2.0 makes any difference.

If you're on Vista or 7, you should use the Performance Monitor to monitor incoming and outgoing network traffic.

Also, TickCount has terrible resolution -- it may update only once every 20 milliseconds or so. For timing, use a System.Diagnostics.Stopwatch.
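For example (a sketch; handler, client, and packet stand in for whatever you're measuring):

    // Timing one handler with Stopwatch instead of TickCount (sketch)
    // Requires: using System.Diagnostics;
    Stopwatch sw = Stopwatch.StartNew();
    handler.Handle(client, packet);   // the code being measured (invented names)
    sw.Stop();
    Console.WriteLine("{0}: {1:F3} ms", handler, sw.Elapsed.TotalMilliseconds);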

In general, a server using BeginReceive()/EndReceive() should have no problem processing thousands of simultaneous connections, at least when running on Windows Server. There may be artificial limitations in the consumer versions of Windows.

