
Berkeley Sockets and IOCP


Recently I asked a very seasoned developer for his personal recommendation on receiving packets, queuing them, and processing them. To provide a little context, I asked:

Hi,


I've been studying Uru code. I had a question.

I'm looking for how packets are queued and eventually exposed to the rest of the game. Would you recommend running low-level packet recv and send on a thread, while communicating with the rest of the game via an event queue? In particular I wanted to see how synchronization primitives were used to build a higher-level abstraction for game logic code to use.

I'm assuming there's actor-model-style encapsulation and messaging going on. But, like we talked about, it's hard to see the big picture by reading code, due to its distributed nature.

Thanks!
Randy

And got this response:

Assuming that you're going to be running many game contexts in one process, I have a complicated answer to your question, but the short answer is to receive on worker threads and queue messages onto a per-game queue, which is then scheduled to be run when the worker threads run out of input to receive. GetQueuedCompletionStatus with zero timeout works perfectly for this. There is probably a similar but harder way to do this with epoll, which I've read about but have avoided so far because it seems not as well designed as IOCP and kqueue. Packets should be sent directly from the game worker thread, not queued to be sent. This whole design in general avoids context switching.


The games themselves are essentially fat actors that run single threaded.

So it sounds like he's recommending using a thread pool that treats receiving packets as its top priority. If no packets are ready, the worker threads do packet processing; here, packet processing means opening packets up, decrypting them, or whatever else is needed. They then place these processed packets into a queue for the game thread to pick up.

Sounds like a totally solid design. The worker threads are reused as much as possible. A more naive design would submit jobs to a thread pool for packet receives, and separate jobs for packet processing. This involves potential context switches between jobs when the OS does scheduling. Another naive solution is to do packet receive + processing in a single job, which does not prioritize picking up new packets, and will likely incur context switches as each receive + process job completes.

That's the context. My question is: are Berkeley sockets implemented as a slightly higher-level layer on top of IOCP on Windows? Sort of like how fscanf does internal buffering over syscalls, is there a similar relationship between the POSIX API and the raw Windows IOCP APIs?

I'm trying to gauge how high- or low-level my current POSIX implementation is. I plan on writing some dedicated game server code soon, but don't really want to pay for Windows servers, and was hoping to just keep my POSIX stuff as-is. However, I also wanted to learn a little more about IOCP, to understand the perf tradeoffs, since I'll be paying for server costs myself.

Edited by Randy Gaul


Hmm, you know, I don't think I asked my question very well. I'm trying to gain insight into these different options for UDP in terms of collecting packets and handing them off to game instances:

  • recvfrom on custom thread pool, blocking
  • recvfrom on custom thread pool, polling with non-blocking
  • recvfrom without a pool, polling with non-blocking
  • IOCP for Windows

IOCP is a Windows concept; it has no connection to the POSIX standard or to Berkeley sockets at all.

The equivalent notions on other platforms are epoll and kqueue.

Polling a socket is generally wasteful and scales poorly, but it's acceptable in the 1-20 connections range. For larger connection counts or better latency, you are much better off having your reads block a dedicated thread that wakes up when the OS sends them data.

Whether it makes sense to use IOCP or the other solutions depends entirely on your scale needs. If you don't need hundreds of simultaneous connections with minimal per-packet latency, don't bother with complex approaches like IOCP.


Thanks for the reply. Here we see an option (option 3, under the section UDP: Only-One-Socket Problem). 

Polling a socket is generally wasteful and scales poorly, but it's acceptable in the 1-20 connections range. For larger connection counts or better latency, you are much better off having your reads block a dedicated thread that wakes up when the OS sends them data.

Whether it makes sense to use IOCP or the other solutions depends entirely on your scale needs. If you don't need hundreds of simultaneous connections with minimal per-packet latency, don't bother with complex approaches like IOCP.

In the link, it seems like it's saying that reading from the socket on a separate thread, and dispatching packets to a pool, is a pretty optimized approach, and that to optimize beyond this you would have multiple threads reading off a socket.

In this case, wouldn't polling be like a spinlock, and blocking be like signaling/sleeping? Just asking for a little clarification on what you mean by "polling is generally wasteful".

Also, I found it interesting that the link describes a couple of options and states they seemed roughly equal to IOCP in terms of practical performance. Thoughts?

Multiple threads per socket is not going to be an improvement, though. It just shifts the complexity around.

A socket is a logical unit of communication. If reading from it is done on N threads, and responding to that input is only done on, say, one thread, you have an N-to-1 bottleneck. If response is done on N threads, you have N threads of synchronization nightmares if any of the inputs affect each other - and in a game, that's almost a guarantee.

Polling is somewhat like spinning, yes, in that you're keeping a thread "hot" just to sit around and do no useful work. Blocking lets the thread go dormant until input is ready, much like using a signal or condition variable could wake a thread in general.


The advice you linked is fine for games with fewer than about 1,000 concurrent connections per host. But my experience is that IOCP (and comparable setups) vastly outscales select() or other asynchronous/non-blocking approaches. The idea is to minimize the amount of time spent either waiting on a socket or waiting on a thread synchronization primitive.

Artificial benchmarks are one thing, but actual games that need >1k CCU on a single host are rare. I think this contributes to some of the mythology and controversy around asynchronicity tactics. It's hard to reach a scale where you can even measure the differences, but once you do, the differences add up fast.


FWIW: IOCP and non-blocking sockets are quite different architectures, and actually aim at different workloads. BTW - we're speaking about servers, right?

With IOCP, they maintain a thread pool for you, and process each incoming request in-whatever-thread-is-more-convenient-for-them. It means that IF you want to access some-state-shared-between-different-requests - you DO need an explicit mutex lock within your request handler. IOCP is handy for those cases when, while processing your requests, you do NOT need to lock a mutex (or when contention on the mutex/whatever-other-resource is negligible). IOCP is explicitly supported on Windows only, but it is possible to write pretty much the same thing on Linux too.

An alternative approach is to dedicate your sockets (like "all the 10 sockets for this MOBA match") to appropriate threads (often, though not strictly necessary - the very same threads which run your game loops) - and to use a select()-like function (these days - poll()/epoll()/kqueue()/WaitForMultipleObjects() are superior to select(), but the difference is not interesting at this level) to get the packet as it arrives from any of these sockets (or if nothing arrives - you wait until your next network tick starts). In this case, you do NOT need to bother about mutexes and can simply feed the packets from the select()-like function into your game loop (it is ok, because everything stays within the same thread). As a rule of thumb - this Shared-Nothing architecture is _superior_ to the IOCP-based one at least for games (when applying IOCP to games - it means pretty high contention on the mutexes, causing lots of unnecessary context switching).

Some anecdotal evidence from the real world. Once upon a time, there was a game with hundreds of thousands of simultaneous players - and it had servers running on Windows, with ~30 non-blocking sockets per thread (and using WaitForMultipleObjects() to get the packets), and with ~10K players/server. And at some point an experiment was made to rewrite it to IOCP (because everybody was running around saying "IOCP is THE thing, we're stupid not to use it"). As a result - after spending the several weeks necessary to do it, IOCP did work, but the performance difference, while being pretty small (~5%), was in favor of the _original_ non-blocking Shared-Nothing implementation.

Bottom line: IMNSHO, while IOCP may be good for web-browser-like type of load (BTW, historically IOCP was developed specifically for IIS), and while it IS possible to write workable stuff based on IOCP, I _strongly_ prefer classical Shared-Nothing non-blocking select()-like-based approach to IOCP; as a side benefit - it works exactly the same way on Linux too (replacing WaitForMultipleObjects() with epoll() is very straightforward). 

Phew (sorry if it is a bit sketchy)...

 

But my experience is that IOCP (and comparable setups) vastly outscales select() or other asynchronous/non-blocking approaches. 

Have to say that my experience is different - see, in particular, that real-world experiment above, plus also there is a plausible theoretical explanation too: Shared-Nothing stuff works better because it causes significantly fewer context switches (a context switch, including cache invalidation costs, can cost as much as 1M CPU clock cycles(!)), AND because of better affinity of threads to RAM (which is important for NUMA - and for the last 20 years typical "workhorse" servers are 2-socket NUMA boxes). And we didn't even start to discuss stuff such as RSS/RPS/... - where affinity to threads/cores is even more important (but dealing with RSS/RPS/etc. admittedly goes into the realm of black magic). In addition - while IOCP tends to scale worse-than-linear, non-blocking stuff tends to scale better-than-linear(!) - which is a complicated phenomenon which won't fit into the post here. And from a very different perspective - well, Linux is known to scale at least as well as Windows (and without IOCP), which means that IOCP is not really THAT important as MS is trying to tell us :-).

 

Last but not least on scaling: of course, trying to push _all_ the sockets into one single thread won't scale - but having like 30 sockets/thread for 10K players will get us to ~300 threads, and 300 threads per 10 cores tends to scale surprisingly well (though probably, the "sweet spot" is even lower - at 50 threads per 10 cores, but this is already game-dependent etc.).

Edited by No-Bugs Hare


Thanks for the answer No-Bugs Hare!

An alternative approach is to dedicate your sockets (like "all the 10 sockets for this MOBA match") to appropriate threads (often, though not strictly necessary - the very same threads which run your game loops) - and to use a select()-like function (these days - poll()/epoll()/kqueue()/WaitForMultipleObjects() are superior to select(), but the difference is not interesting at this level) to get the packet as it arrives from any of these sockets (or if nothing arrives - you wait until your next network tick starts). In this case, you do NOT need to bother about mutexes and can simply feed the packets from the select()-like function into your game loop (it is ok, because everything stays within the same thread). As a rule of thumb - this Shared-Nothing architecture is _superior_ to the IOCP-based one at least for games (when applying IOCP to games - it means pretty high contention on the mutexes, causing lots of unnecessary context switching).

I wanted to clarify here; are you proposing that game instances sit in a dedicated thread, along with associated sockets? So this thread has an entire self-contained loop requiring no state syncs with other threads? Something like this (pseudo code):

while ( 1 )
	packet = select( sockets )
	process( packet )
	send_to_game( packet )

Or were you suggesting something slightly different?

I'm assuming the process could be pretty similar for UDP, but select replaced with a little more custom abstraction.

Edited by Randy Gaul


For a simulation game - it would be more like this (note similarities to the classical game loop):
  

while ( 1 )
    packet = select( sockets, time_until_next_network_tick() )
    if ( packet )
        game_process_packet( packet )   // similar to game loop's "process_inputs()"
                                        // MAY just store the packet and process within game_update()
    else                                // timeout, need to simulate next tick
        game_update()                   // similar to game loop's "update()"
        game_post_updates_to_clients()  // similar to game loop's "render()"

A potentially-better-performing alternative (assuming that you do NOT really need to process packets until the next tick, and for UDP - make sure to test it for losing extra packets due to the stack running out of space while waiting(!) - and play with SO_RCVBUF if necessary) could be

while ( 1 )
    sleep( time_until_next_network_tick() )
    while ( 1 )                         // read all the packets from the UDP stack buffer
        packet = select( sockets, 0 )   // timeout is 0, no blocking here!
        if ( packet )
            game_process_packet( packet )  // similar to game loop's "process_inputs()"
        else
            break
    // no more packets - proceed with the tick
    game_update()                       // similar to game loop's "update()"
    game_post_updates_to_clients()      // similar to game loop's "render()"

Edited by No-Bugs Hare


Just for completeness, for later readers: I have studied multiplatform network I/O before. I started to implement it in my game engine and decided to build an abstraction over the two different, but also somewhat similar, systems described in this Stack Overflow post:

https://stackoverflow.com/questions/2794535/linux-and-i-o-completion-ports

So some kind of IOCP equivalent does exist, in a different form.

IOCP is vastly superior to non-blocking I/O, not so much because it is inherently better (it is not), but because a) you use it with overlapped I/O, which is the "correct" and "natural" thing under Windows; b) you can do networking and file I/O and some general tasks in one and the same thing; and c) a great deal of work has been done to make it perform well, whereas non-blocking sockets are merely a compatibility hack implemented in the Winsock library (which, by design, needs polling).

IOCP is much like epoll or kqueue, only just the other way around. You pass some descriptors to watch to the kernel once, and the kernel keeps it in its books (as opposed to select/poll where you must pass the descriptor set every time), and then you can wait on something to happen as often as you like. But rather than being notified of readiness (epoll/kqueue), you are notified of completion.

IOCP might, for some obscure reason, do some polling (I wouldn't know, and I won't say it doesn't since I don't know) but in principle it doesn't need to do that. You fire off an overlapped operation, and the kernel completes it (with a thread from its I/O pool that is blocked until an interrupt happens).
Then it signals the IOCP, which wakes your last-waiting thread and boosts its priority by one level for two quanta, so you get a fast response (no such thing as "high priority" needed; this is done automatically). Wake order is LIFO, not arbitrary, which is also beneficial.

Another nice property of overlapped I/O is that it's practically impossible for your receive buffer to fill up: as long as you have some in-flight operations left over, data is being read without you doing anything, whenever something comes in. Whereas with non-blocking I/O, nothing happens while your thread isn't explicitly receiving, which might in the odd case cause a packet to get dropped because of a full buffer. That's admittedly a rare thing to happen, but it's more likely all the same.

Of course a shared-nothing implementation will scale better due to less locking. This is pretty obvious if you think on it.

But not all games can do a shared-nothing implementation, cf. an MMO. This is where your scalability becomes highly contingent on efficient dispatching of incoming communications to the worker thread pool. It's hard to beat IOCP and the non-Windows equivalents when you're in that situation.

But to reiterate what I've already said, that isn't most games. Most games are fine taking a more simplistic approach, and IOCP is exceedingly tough to get right. It's not worth the effort unless you are really pushing the OS.


A more naive design would submit jobs to a thread pool for packet receives,


The game doesn't need to "ask" for data. Data will be coming in. The game should just process what's there.

If you're using UDP, you only need one thread, because there is only one socket, and there is only one network card, and presumably your CPU can shuffle data faster than your network card can send/receive it over the internet.

Thus, the simplest possible implementation, which is portable across all kinds of systems, and is also fast, is something that uses a simple loop with non-blocking socket calls:

int udp = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
if ((udp < 0)
  || (fcntl(udp, F_SETFL, fcntl(udp, F_GETFL, 0) | O_NONBLOCK) < 0)
  || (bind(udp, (struct sockaddr *)&listenaddr, listenaddrsize) < 0)) {
  you_lose();  //  placeholder: handle socket setup failure
}

while (true) {
  int worked = 0;
  int r;
  packet *p;
  socklen_t addrlen = sizeof(address);
  //  Drain the incoming queue first.
  while ((r = recvfrom(udp, buffer, bufsize, 0, (struct sockaddr *)&address, &addrlen)) > 0) {
    dispatch_incoming_packet(buffer, r, address, addrlen);
    ++worked;
    addrlen = sizeof(address);  //  recvfrom updates addrlen, so reset it
  }
  //  Then flush the outgoing queue.
  while ((p = dequeue_outgoing_packet()) != NULL) {
    if (sendto(udp, p->buffer, p->size, 0, (struct sockaddr *)&p->address, p->addrlen) < 0) {
      requeue_outgoing_packet(p);
      break;
    } else {
      ++worked;
      free_packet(p);
    }
  }
  if (!worked) {
    usleep(1000);    //  one millisecond is a good balance on modern systems
  }
}
Obviously, there are some bits that you have to implement here; this just shows the structure of an active network send/receive thread for UDP.

Because the kernel queues incoming and outgoing packets per socket, it's efficient (enough) to first drain the incoming queue, then generate the outgoing queue (from the game,) and then repeat. If there's no work to do (no packet came in, nothing to send) then the system is running at low load, and sleeps a millisecond to save some CPU.
"Polling" I/O like this is often frowned upon when teaching systems programmers, because it can be inefficient (burning CPU polling when there's nothing to do), and it can add additional latency (the sleep means the CPU won't wake up the instant a packet comes in).
That's a valid concern, but in this case, the construct as shown above is often the best solution for real-time networked games (which is kind of a special case compared to traditional server programming.)

There's still interesting code inside the "dispatch incoming message" function, where you have to figure out whether the address in question is an existing connection that belongs to a known client, or whether it's a new client trying to connect, and then route it appropriately.
A hash table of existing clients, pointing at their game instances, is usually used; when the client is not in the hash table, you can assume it's a new client trying to connect, and route it to some actor that deals with that.


7 years ago, we were handling between 100 and 500 concurrent clients on one process with nothing more complex than select() calls, on Windows and Linux. It helped to have this on a secondary thread which performed the basic read and deserialisation before handing the data to the logic thread, but that was a luxury rather than a requirement as there was plenty of CPU time to spare. This does require a strong separation between deserialisation and game logic, of course.

It's interesting however to see that the initial post mentions "many game contexts in one process" - if the idea is to host a lot of individual games (e.g. MOBA or RTS sessions) rather than one big session (e.g. an MMO), then you'll quickly reach the point where background threads no longer give you 'free' CPU time, and I'd speculate that context switching could be costly, especially if you have OS-level interrupts trying to wake up each game's receiving thread every few milliseconds.


7 years ago, we were handling between 100 and 500 concurrent clients on one process with nothing more complex than select() calls, on Windows and Linux

While this will admittedly "work", even under Windows if you define FD_SETSIZE, it is nevertheless an awful approach. Not only are you transferring two kilobytes of data to the kernel every time (the descriptor set is not a bitfield!), but more importantly, the reason why FD_SETSIZE is only 64 with Winsock is that 64 happens to be the limit of what WaitForMultipleObjects can handle. Which, in the case of calling select on 500 sockets, means that Winsock will spawn 8 threads, each doing a wait-all on a subset of 64 sockets (well, one of them only has 52), and your main thread doing a wait-all on these threads. Compared to IOCP, that's just... a disastrous approach.

It's not quite that disastrous under Linux, but still, epoll_wait will be roughly 30-40 times (30-40 times, not 30-40%) faster for a set of several hundred descriptors.

 

Don't get me wrong, I'm not saying select is inherently bad. For watching 2 or 3 descriptors, or for watching a moderately small set of descriptors that changes very often, it's mighty fine (possibly even better). Only just, for mostly-static descriptor sets with a count in the hundreds, one might really consider something better which is readily available, and doesn't cost anything extra.


you'll quickly reach the point where background threads no longer give you 'free' CPU time,


The server can still use a single UDP socket for all the different sessions.
Only if you use TCP do you need select() and more sockets.

