[Article] Using UDP with IOCP

Started by Drew_Benton
7 comments, last by hplus0603 14 years, 11 months ago
I finally had some time to sit down and write a little article about using UDP and IOCP together. This effort stemmed from my attempts earlier this month to get IOCP and TCP performance up to par. I'll eventually get back to IOCP and TCP, but for now UDP meets the needs of my desired goals.

Included in the package is the article itself in DOC and PDF formats, the actual code for the library, as well as a simple echo client/server demo showing how to use it all. You need Boost to compile the code and Visual Studio 2005 or above, since I use variable arguments via a macro. I hope the article and source code serve as a good resource for programmers hoping to learn how to properly use UDP with IOCP, as there aren't any resources on that topic right now. I am fairly confident in the correctness of my work since I've done months of research, development, and trial and error to make sure I wasn't messing anything up. I'm sure there are minor things to nitpick, though.

Writing a Windows High Performance Scalable UDP Server Using Input Output Completion Ports

The concept itself is of most use as an academic resource right now. Now that RakNet is free for indie development, that library is the best free choice available for game developers looking for UDP. In addition, there is enet, a lower-level UDP library for games that is more lightweight than RakNet. This library is at a lower level than enet; all it does is handle sending and receiving. That means most people might not have an immediate use for it, but for those that do, I hope it helps. I'm open to any feedback! [smile]
Thanks for an interesting read =) Always good to learn a bit more about a subject while one compiles...

Quote:
There is an important limitation to the number of posted events that you need to be aware of. 32-bit Windows supports only 2,500 events posted at once. 64-bit editions of Windows support 32,767 events....snip...


What happens when the event queue is full? I guess this is a situation that can lead to deadlocks? I guess it depends on the locking mechanics, but what if you are dispatching an event and want to send, but the event queue is full?

And what exactly is an event? An async function call? A write/read event? Close/open? All of them?
www.ageofconan.com
Quote:What happens when the event queue is full? I guess this is a situation that can lead to deadlocks? I guess it depends on the locking mechanics, but what if you are dispatching an event and want to send, but the event queue is full?


Ah, good question. I didn't cover this, but all it means is slightly less efficiency in the event handling. Winsock will buffer incoming data internally, but when you have posted receives, you avoid some extra work in copying data from the internal buffers. Everything still works fine, but to get the best benefits it's good to have pending events. And by 'events', I mean overlapped, posted WSARecvFrom calls.
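For anyone unfamiliar with what a posted receive looks like, here is a minimal sketch of posting one overlapped WSARecvFrom (error handling trimmed, and the struct/field names here are illustrative rather than copied from the library):

#include <winsock2.h>

struct PendingRecv
{
    OVERLAPPED  overlapped;     // must be zeroed before each post
    WSABUF      wsaBuf;
    char        buffer[1500];
    sockaddr_in from;           // filled in by Winsock with the sender's address
    INT         fromLen;
};

bool PostRecv(SOCKET s, PendingRecv * ev)
{
    ZeroMemory(&ev->overlapped, sizeof(ev->overlapped));
    ev->wsaBuf.buf = ev->buffer;
    ev->wsaBuf.len = sizeof(ev->buffer);
    ev->fromLen    = sizeof(ev->from);

    DWORD bytes = 0;
    DWORD flags = 0;
    int result = WSARecvFrom(s, &ev->wsaBuf, 1, &bytes, &flags,
                             reinterpret_cast<sockaddr *>(&ev->from), &ev->fromLen,
                             &ev->overlapped, NULL);
    // 0 means it completed immediately (still reported through the IOCP);
    // WSA_IO_PENDING means the receive is now queued as a pending event.
    return result == 0 || WSAGetLastError() == WSA_IO_PENDING;
}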

Quote:And what exactly is an event? An async function call? A write/read event? Close/open? All of them?


There are quite a few 'events' going on in the system, but in the context of using the server, an event is a read/write event resulting from a WSASendTo or WSARecvFrom. Each of these events wraps an OVERLAPPED structure.

There are no connections in UDP, so there is no concept of a close and open. There are async function call events, but those are part of the internal system that the end user does not need to be concerned with. The write events are not actually exposed to the end user either, but they are generated when the user calls a Send function. The read events are the only events the end user can process.
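For reference, this is roughly the shape of one of those events. The overlapped/operation/dataBufferSize names come from the worker thread code posted further down; the remaining members are just my shorthand for what the real definition holds:

struct UDPOverlappedEvent      // sketch only - see the library for the real definition
{
    OVERLAPPED  overlapped;       // one OVERLAPPED per outstanding WSASendTo/WSARecvFrom
    UINT8       operation;        // HPS_OPERATION_READ or HPS_OPERATION_WRITE
    UINT16      dataBufferSize;   // bytes transferred, filled in on completion
    sockaddr_in remoteAddress;    // peer the datagram came from / is going to
    char        dataBuffer[1500]; // payload storage for this operation
};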

I hope that clears those points up!
There are two issues I see with this design:
1) Handlers are invoked in thread pool, but then serialized to APC
2) Sending is left as an exercise to the reader

For #1, networking-related functionality is unlikely to be worth distributing across a thread pool. It may help, for example, if packets need to be unpacked or decrypted, but plain serialization is unlikely to generate any noteworthy CPU load.

But almost any application will need to figure out which connection a packet belongs to, do something within the context of that connection, and notify the application that there is something to be done.

The other issue is that, due to concurrent processing in the IOCP completion thread pool, you cannot efficiently handle connection-specific packets. This includes updating sequence numbers (what happens if three workers start processing three sequential packets from the same client?). In practice, this quickly becomes a deal breaker, since there's not much you can do concurrently.

This is a considerable conceptual issue - IOCP will not resequence UDP packets to match your logical ordering (which connection, which client, anything), yet almost always each packet will need to be processed in an organized manner. With TCP, this is solved by only ever issuing one recv per socket, avoiding the issue altogether, even if multiple threads are processing the completions.

Compare this to a game loop:
while (running) {
  GetQueuedCompletionStatusEx(..., ..., ..., timeout);
  // process received packets
  while (delta_time > 0.1) {
    // run simulation step
    // put data to send into token ring
  }
  for (sockets_with_pending_data) {
    // try to do non-blocking send
    // if we would block, stop and wait for next loop
  }
}

This takes care of several things:
- since it might be impossible to send all data now, rather than putting load on the kernel, keep pending data in each connection. On each pass, send as much as possible. A token ring can be used to store the sockets with pending data (see the sketch after this list)
- received packets are processed locally, which allows you to do either complete processing or quick, connection-specific handling (IP checksum, hash checks, sequence updates, acks); you can then still off-load heavy-weight processing to a thread pool (probably not needed). This is very handy if you use encapsulated/nested handlers where each higher-tier system processes part of the contents. The IOCP handler would merely check the UDP packet wrapper and some basic op-code, perhaps a hash check, but everything else would be handled higher up.
- if you use boost asio, the three parts above can be split very efficiently across several independent actors, each of which can run in its own thread, and which can run across up to three worker threads without an explicit need for synchronization. This can also be implemented manually in a more light-weight manner.
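Rough sketch of the pending-data idea from the first point. The Connection type, container choices, and names are made up for illustration; the token ring is simply whatever structure holds the sockets that still have data queued:

#include <winsock2.h>
#include <deque>
#include <list>
#include <vector>

struct Connection
{
    SOCKET      socket;                         // non-blocking UDP socket
    sockaddr_in addr;                           // peer address
    std::deque< std::vector<char> > pending;    // datagrams waiting to go out
};

// Called once per game-loop pass for every connection with queued data.
void FlushPendingSends(std::list<Connection *> & sendList)
{
    for(std::list<Connection *>::iterator it = sendList.begin(); it != sendList.end(); )
    {
        Connection * c = *it;
        while(!c->pending.empty())
        {
            std::vector<char> & d = c->pending.front();
            int sent = sendto(c->socket, &d[0], static_cast<int>(d.size()), 0,
                              reinterpret_cast<sockaddr *>(&c->addr), sizeof(c->addr));
            if(sent == SOCKET_ERROR && WSAGetLastError() == WSAEWOULDBLOCK)
                break;                          // kernel buffer full - retry next pass
            c->pending.pop_front();
        }
        if(c->pending.empty())
            it = sendList.erase(it);            // nothing left for this connection
        else
            ++it;                               // still blocked - keep it in the list
    }
}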


Your design is ok for servers where each packet is independent and doesn't require any shared state to process. But commonly, this isn't the case, and that's where multi-threaded UDP doesn't benefit much, if anything at all.
Thanks for the feedback Antheus, however I think there are some misconceptions about the code and concepts itself.

Quote:1) Handlers are invoked in thread pool, but then serialized to APC


The WorkerThread function, which is the main function for all thread pool threads, does not have any handlers in it, if by handlers you mean event handlers. It does call QueueUserAPC to queue an APC to process the event, but from the worker thread's standpoint that is not serializing behavior. The call to the Boost pool malloc does serialize things somewhat, due to an internal mutex being used, but I can't comment on whether that really impacts the execution of the threads.

Or do you mean, the benefits of having multiple threads running is negated because I do serialize to a single thread? If that's the case, then I'd not agree because the goal is to process as much network data as efficiently as possible and by passing it to the APC function via a queue, it is always executed in a single thread context.

Maybe I'm misunderstanding, can you explain what you mean if I am?

Here's that code again with comments removed:
void HighPerformanceUDPServerData::WorkerThread()
{
    BOOL result = 0;
    DWORD numberOfBytes = 0;
    ULONG_PTR key = 0;
    OVERLAPPED * lpOverlapped = 0;
    UDPOverlappedEvent * event = 0;
    for(;;)
    {
        result = GetQueuedCompletionStatus(completionPort, &numberOfBytes, &key, &lpOverlapped, INFINITE);
        if(key == -1)
            break;
        event = CONTAINING_RECORD(lpOverlapped, UDPOverlappedEvent, overlapped);
        if(result == TRUE)
        {
            if(key == 0)
            {
                if(event->operation == HPS_OPERATION_READ)
                {
                    if(InterlockedDecrement(&curPendingRecvs) < refillPendingRecvs)
                    {
                        SetEvent(hRefillEvent);
                    }
                    event->dataBufferSize = static_cast<UINT16>(numberOfBytes);
                    APCEvent * apcData = reinterpret_cast<APCEvent *>(Pool_APCEvent::malloc());
                    apcData->event = event;
                    apcData->server = this;
                    if(QueueUserAPC(ProcessEventWrapper, hEventProcessingThread, reinterpret_cast<ULONG_PTR>(apcData)) == 0)
                    {
                        LOG("QueueUserAPC failed. The event [%i bytes] will be discarded. GetLastError returned [%i]", event->dataBufferSize, GetLastError());
                        Pool_UDPOverlappedEvent::free(event);
                        Pool_APCEvent::free(apcData);
                    }
                }
                else if(event->operation == HPS_OPERATION_WRITE)
                {
                    Pool_UDPOverlappedEvent::free(event);
                }
                else
                {
                    LOG("Invalid event operation specified [%.2X].", event->operation);
                    Pool_UDPOverlappedEvent::free(event);
                }
            }
            else
            {
                Pool_UDPOverlappedEvent::free(event);
            }
        }
        else
        {
            LOG("GetQueuedCompletionStatus failed. GetLastError returned [%i]", GetLastError());
            Pool_UDPOverlappedEvent::free(event);
        }
    }
}


Quote:2) Sending is left as an exercise to the reader


There are sending mechanisms already built in. Users can either send to a specific IP/port or "reply" to data by passing in the already-filled-in sockaddr_in. In addition, I have two overloads for those functions, one for sending raw data and another for sending data using a WSABUF structure (which is needed for enet!).
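To give an idea of what a send looks like, here is a rough sketch of an overlapped WSASendTo using the event structure sketched earlier; the real Send overloads in the library do more bookkeeping (pooled allocation, etc.) than this:

bool PostSend(SOCKET s, const sockaddr_in & to, const char * data, UINT16 size)
{
    // In the library this would come from the Boost pool; plain new keeps the sketch short.
    UDPOverlappedEvent * ev = new UDPOverlappedEvent();
    ZeroMemory(&ev->overlapped, sizeof(ev->overlapped));
    ev->operation = HPS_OPERATION_WRITE;
    ev->dataBufferSize = size;
    memcpy(ev->dataBuffer, data, size);

    WSABUF buf;
    buf.buf = ev->dataBuffer;
    buf.len = size;

    DWORD bytesSent = 0;
    int result = WSASendTo(s, &buf, 1, &bytesSent, 0,
                           reinterpret_cast<const sockaddr *>(&to), sizeof(to),
                           &ev->overlapped, NULL);
    if(result != 0 && WSAGetLastError() != WSA_IO_PENDING)
    {
        delete ev;      // send failed outright; no completion will ever arrive
        return false;
    }
    // A real implementation frees/returns the event when the write completion
    // is dequeued (as the HPS_OPERATION_WRITE branch in WorkerThread does).
    return true;
}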

Of course, if users want reliable or sequenced sends, then you are right. As I mentioned in the article, writing a higher-layer protocol on top of UDP is left to the reader, as it is too complicated to be shown. I couldn't write one myself, which is why I went with modifying enet, but that library is not release-ready or that practical, in my opinion, for the time being.

Quote:For #1, networking-related functionality is unlikely to be worth distributing across a thread pool. It may help, for example, if packets need to be unpacked or decrypted, but plain serialization is unlikely to generate any noteworthy CPU load.


Hmm, no networking-related functionality is being distributed across the thread pool as it stands right now. Each thread is simply handling IOCP events with GetQueuedCompletionStatus and then asynchronously queuing those events.

The call to ProcessEventWrapper is not actually executed in that thread pool thread; it's executed in the APCEventProcessThreadWrapper function's thread, whose handle is stored in hEventProcessingThread.

Quote:But almost any application will need to figure out which connection a packet belongs to, do something within the context of that connection, and notify the application that there is something to be done.


The OnClientToServer callback function gives users the server that the data originated on, the sockaddr_in address of the peer, the number of bytes, and finally the data buffer. This function is executed in the context of the thread whose handle is hEventProcessingThread and is always serialized, so only one packet is processed at a time.
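For illustration, the callback looks something like this from the user's side (the signature is paraphrased from the description above, not copied from the header, so treat the exact types as approximate):

// Always invoked one packet at a time, in the event processing thread's context.
void OnClientToServer(HighPerformanceUDPServerData * server, // server the data arrived on
                      const sockaddr_in & peer,              // reusable for a "reply" send
                      UINT16 size,                           // number of bytes received
                      const char * data)                     // the data buffer itself
{
    // user protocol logic goes here
}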

Quote:The other issue is that, due to concurrent processing in the IOCP completion thread pool, you cannot efficiently handle connection-specific packets. This includes updating sequence numbers (what happens if three workers start processing three sequential packets from the same client?). In practice, this quickly becomes a deal breaker, since there's not much you can do concurrently


For TCP that might be true, but with UDP and this design, I've gotten around that problem by using APCs. Let's say worker threads 1 and 2 each handle a packet, A and B respectively, from some client X. I can't tell you which order they will be processed in, but they will be processed either A then B, or B then A. Since UDP is unreliable, it doesn't even matter at this level. If another layer were placed on top of this code, as I did with enet, that logic is responsible for packet reordering.

Quote:This is a considerable conceptual issue - IOCP will not resequence UDP packets to match your logical ordering (which connection, which client, anything), yet almost always each packet will need to be processed in an organized manner. With TCP, this is solved by only ever issuing one recv per socket, avoiding the issue altogether, even if multiple threads are processing the completions.


As mentioned previously, this is what I am counting on, and exactly why IOCP with multiple posted recvs fits UDP better than TCP! Once again though, this is only possible because I use APCs to serialize all data into one thread context for processing, and that processing happens in its own dedicated thread.
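To show the mechanic, here's a stripped-down sketch of what the processing thread does. The real APCEventProcessThreadWrapper has more to it, and the running flag here is just a stand-in for however the thread is actually told to shut down:

// Runs as its own thread; its handle is what gets stored in hEventProcessingThread.
DWORD WINAPI APCEventProcessThreadSketch(LPVOID param)
{
    volatile LONG * running = reinterpret_cast<volatile LONG *>(param);
    while(*running)
    {
        // An alertable wait is what lets the APCs queued by QueueUserAPC execute here,
        // so every ProcessEventWrapper call runs in this single thread context.
        SleepEx(INFINITE, TRUE);
    }
    return 0;
}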

Quote:
Compare this to a game loop:
while (running) {
  GetQueuedCompletionStatusEx(..., ..., ..., timeout);
  // process received packets
  while (delta_time > 0.1) {
    // run simulation step
    // put data to send into token ring
  }
  for (sockets_with_pending_data) {
    // try to do non-blocking send
    // if we would block, stop and wait for next loop
  }
}



That kind of logic is still possible with this design if you write your own layer above the currently provided code. My layer is just the foundational send/recv layer with no extras; it runs in its own threads to recv/send data as it's told to. The only difference is that, right now, the first thing you list is something that always goes on in the background, but you can easily add another layer in your own user program to requeue events into your own queues and accomplish all three of those things. In fact, that's exactly the behavior of enet when I use my code to handle the recv/send work and pass the data to enet in its higher-level layer.

Quote:Your design is ok for servers where each packet is independent and doesn't require any shared state to process. But commonly, this isn't the case, and that's where multi-threaded UDP doesn't benefit much, if anything at all.


This server is a barebones, no-processing framework that is only meant to handle, quite literally, the send and recv process of UDP over IOCP. However, by adding your own logic layer to the code, as I did with enet on top of this code, you get your shared-state processing and dependent packets, and it all works fine because that functionality is outside the scope of this code anyway.

So what ends up happening is the UDP IOCP server is constantly reading/writing in its own threads as events happen or are posted, but the actual data processing or higher-level protocol logic is always executed in a separate thread. So basically, your server is able to pull in more UDP traffic and then handle the processing of that traffic in another thread.

I feel like my post is going to read as really convoluted, so here are two examples that illustrate what I am doing. Let's say on a regular UDP server it takes 1 second to get a packet from the wire and 2 seconds to process it. If the server ran for 15 seconds, that means only 5 packets could be processed.

Let's say this IOCP UDP server ran with only 1 worker thread. In 15 seconds it would have taken 15 packets off the wire but only processed 7 of them. This is because the recv code is in its own thread, independent of the actual processing, and the APC system is 'really fast' for passing events.

Now you might say, well, if we simply split up the regular UDP code, we would reach those same results, and that's true. However, the regular UDP code is not as efficient as the IOCP model and it won't scale. That example is of course a worst case; you expect to be able to process packets a lot faster than the rate at which they come off the wire.

Let's say the regular UDP code again takes 1 second to take a packet off the wire, and can process a packet in 0.5 seconds. In that 15-second span, only 10 total packets would be processed (1.5 seconds per packet). On the IOCP UDP model, 15 packets would be taken off the wire and all 15 would be processed in that time. As the ratio of processing speed to receiving speed increases, the more efficient the server will be.
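Spelled out, the arithmetic behind those made-up numbers looks like this:

  sequential model:  15 s / (1.0 s recv + 0.5 s process) = 10 packets handled
  IOCP model (recv and processing overlap):
    recv thread:       15 s / 1.0 s = 15 packets off the wire
    processing thread: 15 s / 0.5 s = capacity for 30, so all 15 are processed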

Those are of course just exaggerated, made-up numbers to show the concept, but the point is that the server is able to take more packets off the wire at a time than a regular UDP solution, and in turn, if you can process them fast enough (which you should be able to), your server is more powerful.

Does that logic make sense, or am I wrong? I've reasoned this all out while I was developing the server, but maybe I am missing something.
Quote:Original post by Drew_Benton
Or do you mean, the benefits of having multiple threads running is negated because I do serialize to a single thread? If that's the case, then I'd not agree because the goal is to process as much network data as efficiently as possible and by passing it to the APC function via a queue, it is always executed in a single thread context.

This. The IOCP efficiency gains in the case of UDP are mostly negated, since each packet needs to be dispatched to workers, which only then try to handle it within the context of a connection.

In the case of TCP, IOCP does all that, then invokes the handler for the connection.

Quote:Does that logic make sense, or am I wrong? I've reasoned this all out while I was developing the server, but maybe I am missing something.


It makes sense, but I have never encountered a case where it would be possible to do much useful work in isolation.

I'm not claiming it can't be done, especially if you design your UDP protocol logic to take advantage of that.

The only time I could benefit from concurrent processing was when dealing with zlib-compressed packets (decompression is kinda slow). But in that particular case it turned out that the recv:send ratio was 1:10, so zlib was the bottleneck on the sending side.

The rest assumes that there is some protocol on top of UDP that manages acks, resends and flow control.

Multithreaded handlers complicate UDP protocol handling. For example, when you receive an ack/nack, or when you receive a sequenced packet, you want to respond immediately. But that requires you to update the UDP connection's state. If you have multiple handlers doing this, you either need to lock the connection state or serialize to a single thread. In both cases the benefits of multi-threading are lost, since the work done concurrently is very small and possibly drowned in overhead.

With a single-threaded handler you would examine only the protocol-related data and act on it immediately. The rest would be put into a token ring keyed by connection. Then, whenever the main thread has time, it grabs the token ring and processes its contents (while the network handler is filling out a new one). In a way, it's an emulation of how TCP is handled by IOCP.
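Something along these lines, assuming Boost threads since the library already requires Boost (the ConnectionId/Packet types and all names here are placeholders):

#include <boost/thread/mutex.hpp>
#include <map>
#include <vector>

typedef unsigned int       ConnectionId;   // however you identify a connection
typedef std::vector<char>  Packet;         // raw datagram payload

boost::mutex                                   swapLock;
std::map<ConnectionId, std::vector<Packet> >   filling;    // network thread appends here
std::map<ConnectionId, std::vector<Packet> >   draining;   // main thread consumes this

// Network handler, per received packet: protocol bits (acks, sequence updates) are
// handled immediately; everything else is parked here under a very short lock.
void ParkPacket(ConnectionId id, const Packet & p)
{
    boost::mutex::scoped_lock lock(swapLock);
    filling[id].push_back(p);
}

// Main thread, once per tick: grab everything accumulated so far in one swap,
// then process it per connection with no further locking needed.
void DrainBatch()
{
    {
        boost::mutex::scoped_lock lock(swapLock);
        filling.swap(draining);
    }
    for(std::map<ConnectionId, std::vector<Packet> >::iterator it = draining.begin();
        it != draining.end(); ++it)
    {
        // it->first is the connection, it->second holds its packets in arrival order
    }
    draining.clear();
}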

This approach has several benefits. Inter-thread communication operates on batches (or a single packet in the worst case), and it can dispatch to its own thread pool by connection (each connection, or several connections, to a single worker). This can considerably reduce threading-related overhead.

But commonly, a server will receive far less data than it sends, so receive performance isn't even an issue. It also depends on network capacity, CPU, etc. On a typical system with a 100 Mbit network, it's unlikely any of this will matter.
Quote:Original post by Antheus
This. The IOCP efficiency gains in the case of UDP are mostly negated, since each packet needs to be dispatched to workers, which only then try to handle it within the context of a connection.


Ok, I'm following you now. The thing is, though, I can't think of a comparable case where the TCP behavior doesn't approach this same serializing of data for logical processing when you are dealing with global state. On a web server, sure, that's understandable since there is no global state; you just process data in the current connection context and move on.

If you have a TCP-based game with one recv posted per socket, when you go to handle the data in the worker threads you will still have to have some sort of synchronization in place to work with 'the whole', as most state-changing behavior would need to be processed in a global context and not just the connection-specific context. I'd call that action/reaction, and that does not necessarily mean you immediately send a response back, but in queuing the data you would have to have some sort of lock or synchronization for it to work with multiple worker threads. I don't think using a lock-less object would help, as it'd also need to be wait-free.

Here's an example. Imagine you get a 'player requested to move here' packet from the client. Let's say the worker thread did the calculations, verified it, and updated the player's state to reflect the movement. Let's say it also set a flag that there was a movement and it needs to be broadcast to nearby players. At some point the main processing logic, which has to be outside the worker threads, would need to lock that player object to safely handle the move event, and in addition lock all players that are immediately affected by the event (say this player was on fire and burn damage was being applied). That kind of action would interfere with the networking system, as packets could not be processed until the main processing thread was done.

So I am saying that even if you did not do any action/reaction logic in the client context of a worker thread and simply updated the connection-specific state (which would not require any locking or synchronization, since only one event can happen at a time there), ultimately the server will still have to process that connection context, and at that point some synchronization would have to be done in a way that I think would be less efficient than simply putting all events into a queue that is processed in a separate thread.

In addition, I thought the whole goal of the IOCP model was to pull data from the wire and pass it along for processing, rather than actually doing the processing in the worker threads? I mean, I agree that the practicality of the model is iffy, but essentially I'm viewing it as (go programmer art!):


If your game could figure out a way to process multiple packets at once, then that would be great for the IOCP model, but I'm under the presumption that most games are reduced to this behavior of one processing context to avoid the costs of locking and synchronizing multiple threads all across the board. I'm not sure, since this topic is kept so under wraps.


I guess what I'm asking now is whether there is anything technically inaccurate or just plain wrong about any of the claims made or logic used. I can understand the argument that it might not give results as beneficial as with TCP, or that multiple network threads might not benefit a game since, to ensure thread safety, data should be processed in one context to minimize locks and synchronization (or vice versa, for that matter). It might not even be that much better than using a single thread to process the UDP data, since a game server is never going to handle so much traffic that the scale of data reaches levels where the IOCP model shows its best benefit. But I wouldn't consider those things that negatively impact the concept in the way that, say, creating one thread per connection does - with that approach your server will simply die upwards of a thousand clients or so, and it is technically a bad thing to do, whereas this model just "might" not outperform a single-threaded design by enough to justify the time and effort of having created it.
Quote:Original post by Drew_Benton

If you have a TCP-based game with one recv posted per socket, when you go to handle the data in the worker threads you will still have to have some sort of synchronization in place to work with 'the whole', as most state-changing behavior would need to be processed in a global context and not just the connection-specific context.
This is the main complication.

Quote:Let's say the worker thread did the calculations, verified it, and updated the player's state to reflect the movement. Let's say it also set a flag that there was a movement and it needs to be broadcast to nearby players. At some point the main processing logic, which has to be outside the worker threads, would need to lock that player object to safely handle the move event, and in addition lock all players that are immediately affected by the event (say this player was on fire and burn damage was being applied). That kind of action would interfere with the networking system, as packets could not be processed until the main processing thread was done.
While this would be ideal, it's very difficult to actually implement. This is what I was pointing out - there's little work that can be done concurrently, so you might as well drop the overhead and do everything in a single thread.

Quote:In addition, I thought the whole goal of the IOCP model was to pull data from the wire and pass it along for processing, rather than actually doing the processing in the worker threads?
Either way, is there even enough work to be done for any benefits to be had from concurrency?

UDP packets arrive sequentially on the same socket. Processing of the returned data, such as passing it into the main processing thread, will be minimal and almost certainly not an issue. Is there any work that can be done concurrently at all? I mentioned that packet validation or compression could be done, but I cannot say how much of an improvement it would bring. If either of those tasks is expensive, the improvement could be considerable.

Quote:I guess what I'm asking now is whether there is anything technically inaccurate or just plain wrong about any of the claims made or logic used.


No, nothing is wrong. There's just the question of practicality, but that needs to be evaluated on a practical example to see if there are any benefits, since implementation may determine how efficient the server actually is.
I think that IOCP is useful even if you don't do work in the completion thread(s). It's a good API for posting potentially thousands of I/O requests, and handling them all, in the case of using TCP. select() and WaitForMultipleObjects() will only go to 64 FDs. It's possible that select() would work if you #defined the fd set size higher, but the implementation would likely be quite inefficient.

Thus, even if you just spawn one thread to call GetQueuedCompletionStatus(), and have that thread just call QueueUserAPC() or some similar function to push work to another, centralized thread, then using IOCP is the best way to express more than 64 simultaneous connections.

Similarly, if you use UDP, but you use a signature hashing scheme to authenticate endpoints, then you can easily put the UDP recv() call and hash verification into one thread, and only post verified requests onto the other, centralized worker thread. This may save a few percent for the processing thread, if you have lots of network traffic and an expensive hash function. However, in that case, you don't need IOCP, because there's only the one socket to recv() on.
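A sketch of that single recv-and-verify thread; VerifyHash and PostToWorker stand in for whatever signature check and hand-off queue you actually use:

#include <winsock2.h>

bool VerifyHash(const char * data, int size);                          // your signature check
void PostToWorker(const sockaddr_in & from, const char * data, int n); // your hand-off queue

void RecvVerifyLoop(SOCKET s)   // s is the single UDP socket, in blocking mode
{
    char buffer[1500];
    sockaddr_in from;
    for(;;)
    {
        int fromLen = sizeof(from);
        int bytes = recvfrom(s, buffer, sizeof(buffer), 0,
                             reinterpret_cast<sockaddr *>(&from), &fromLen);
        if(bytes == SOCKET_ERROR)
            break;                              // socket closed or fatal error
        if(!VerifyHash(buffer, bytes))
            continue;                           // drop unauthenticated datagrams here
        PostToWorker(from, buffer, bytes);      // only verified packets reach the worker
    }
}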

So, I wouldn't totally rule out this technique in general, but you'd have to have a very parallel workload (not your typical world update loop), and you'd have to use TCP to get significant benefits from using more than one worker thread and IOCP.
enum Bool { True, False, FileNotFound };

This topic is closed to new replies.
