Trying to get my IOCP code on par with expected performance

Quote:If you provide properly aligned and lockable buffers in the read, the kernel can read straight into your buffer, instead of going through a separate buffer, so a 0-byte read is by no means a guaranteed win IMO.


Ah, ok, that makes sense. Thanks for that explanation. I went ahead and just switched to the regular way of passing in the buffer and size to WSARecv. I also reread the topic on Scalable Server Architecture and see I was misinterpreting some of the information there; I should also not touch the SO_SNDBUF/SO_RCVBUF options on a socket since I will always have overlapped reads posted on a socket in this setup. I can see now in other setups, that might not be the case.
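Just to be concrete, the receive call I ended up with is shaped roughly like this (a stripped-down sketch, not the exact code posted below; the zero-byte variant is mentioned in the comment for contrast):

#include <winsock2.h>

// Sketch: posts a normal overlapped read so the kernel can complete straight
// into our buffer. For the zero-byte "data ready" style you would instead pass
// a WSABUF of {0, NULL} and do a second recv once the completion fires.
bool PostOverlappedRecv(SOCKET sock, OVERLAPPED * ov, char * buffer, u_long len)
{
    WSABUF wsaBuf = { len, buffer };
    DWORD bytes = 0;
    DWORD flags = 0;
    if(WSARecv(sock, &wsaBuf, 1, &bytes, &flags, ov, NULL) == SOCKET_ERROR)
    {
        return (WSAGetLastError() == WSA_IO_PENDING); // pending is the normal case
    }
    return true; // completed immediately; a completion packet is still queued
}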

Quote:What kind of CPU and how much RAM are you using for this test server?


Here is the information from CPU-Z (I'm running XP 32-bit though, so 3.5gb is usable):



In terms of the test:
* Program is run in release mode.
* Initial memory usage is 8.1mb (1000 preallocated connections)
* 500 connections from one laptop via LAN

Server:
* Memory usage rises to 9.5mb
* ~42kb/s traffic into the server (reported by NetMeter)
* ~28kb/s traffic out of the server (not sure why though)

Clients:
* ~42kb/s total traffic out
* ~28kb/s traffic into the clients (not sure why either)
* Each client takes up 2.3mb on the laptop, virtually no CPU time
* CPU utilization on the laptop is ~2-4%, from other system applications running
* Each client sends 32 bytes and Sleeps 1 second, 32 bytes/s output consistently

Results:
* Up to the ~500-connection mark, no service time exceeds 5s. As soon as 500 is hit, one or two of those warnings fire occasionally. If I try adding 100 more or so, more connections begin to have longer service times.

I also ran Wireshark during the test and I think I might see the real problem at hand. I ran a capture for just under 2 minutes to get the 500 clients connected (takes about 70s on one computer) and then added 100 more on another computer until they started timing out.

I applied a filter to the traffic, "tcp.analysis.ack_rtt > 1 && tcp.dstport == 15779", and starting around 50 seconds (by which point a little over 450 clients would be connected) the RTT to ACK for the packets rises to 1.5s. Towards the 80s mark, where a little over 600 clients would be connected, there is a whole bunch of "retransmissions" (lines that are black and red in Wireshark) and their RTT to ACK is 2s. Getting towards the 90s mark, there are a couple of entries that hit an RTT to ACK of 3s!

That would explain why the service time is gradually getting longer as more connections are being added and clients sporadically disconnect. I was running 500 clients per laptop, which it seems the network can't handle from one source.

If I split the test up into 2 x 250 connection parts and watch Wireshark, I see far fewer retransmits and never get any notifications of the delayed reads. If I try running 400 clients per computer, then right towards the 600 mark, I start getting more retransmits and longer service time delays in my program. As I hit almost 800 connections, the retransmits were filling up Wireshark and most of the connections were failing in my program due to longer service times.

I see now that my code seems to be more than suitable for handling a lot more connections and traffic, but my network and my current test setup are not. What would you suggest is the best way to go about testing code like this in general to avoid problems like this? I mean, when you are in an early stage and want to test the upper bounds of your code, but have nothing much to lure random testers in and you need lots of traffic, do you just have to wait?

Let's say I wanted to try and pull off a larger test; would the problem lie in my router? I.e., if I set up a dedicated server to run on at home and got, let's say, 10 or so older computers (you know, those P4 512mb Dell Optiplex ones) connected via LAN, do you think the router still couldn't handle it, or are the computers themselves the problem?

Thanks for your continued help [smile]

Code-wise, I've cleaned the code up a bit and removed a few things that were no longer necessary. This code is still far from usable for anything serious, but I'm adding it for anyone who stumbles upon the thread:
/*
    A lot of resources were consulted and used in this code. Major resources used include:
        MSDN
        http://win32.mvps.org/network/sockhim.html
        Network Programming for Microsoft Windows
        CodeProject's IOCP articles
            http://www.codeproject.com/KB/IP/IOCP_how_to_cook.aspx
            http://www.codeproject.com/KB/IP/SimpleIOCPApp.aspx
            http://www.codeproject.com/KB/IP/iocp.aspx
    Larger blocks of comments are mostly from the second reference.
    I used comments from that project to help understand the particulars of IOCP.
*/
#include <winsock2.h>
#include <mswsock.h>
#include <windows.h>
#include <list>
#include <vector>
#include <algorithm>
#include <iostream>

#pragma comment(lib, "ws2_32.lib")

HANDLE hPacketProcessingThread = INVALID_HANDLE_VALUE;

// Logical states for the overlapped structure
const int HPS_CONNECTION_STATE_CLOSED = 0;
const int HPS_CONNECTION_STATE_ACCEPT = 1;
const int HPS_CONNECTION_STATE_READ = 2;

// Max bytes for the recv buffer
const int HPS_OVERLAPPED_BUFFER_RECV_SIZE = 8192;

// Max bytes for the send buffer
const int HPS_OVERLAPPED_BUFFER_SEND_SIZE = 8192;

// The size of the sockaddr_in parameter
const int HPS_SOCKADDR_SIZE = (sizeof(SOCKADDR_IN) + 16);

DWORD WINAPI WorkerThreadWrapper(LPVOID lpParam);
DWORD WINAPI ScavengerThreadWrapper(LPVOID lpParam);

struct tHighPerformanceServerData;
struct tWorkerThreadData;

struct tWorkerThreadWrapperData
{
    tHighPerformanceServerData * serverData;
    tWorkerThreadData * threadData;
};

struct tConnectionLocalData
{
    DWORD dwUid;

    tConnectionLocalData() :
        dwUid(-1)
    {
    }

    ~tConnectionLocalData()
    {
    }
};

struct tConnectionGlobalData
{
    LPFN_ACCEPTEX lpfnAcceptEx;
    LPFN_GETACCEPTEXSOCKADDRS lpfnGetAcceptExSockaddrs;
    SOCKET listenSocket;
    HANDLE hCompletionPort;
    DWORD dwNumberOfConcurrentThreads;
    DWORD dwReadTimeTimeout;
    DWORD dwAcceptTimeTimeout;
    int initialReceiveSize;

    tConnectionGlobalData() :
        lpfnAcceptEx(NULL),
        listenSocket(INVALID_SOCKET),
        hCompletionPort(INVALID_HANDLE_VALUE),
        dwNumberOfConcurrentThreads(0),
        lpfnGetAcceptExSockaddrs(NULL),
        dwReadTimeTimeout(-1),
        dwAcceptTimeTimeout(5000),
        initialReceiveSize(0)
    {
    }
};

struct tConnectionData
{
public:
    OVERLAPPED overlapped;
    SOCKET socket_;
    sockaddr_in address;
    WORD sendBufferSize;
    BYTE recvBufferData[HPS_OVERLAPPED_BUFFER_RECV_SIZE];
    INT connectionState;
    DWORD dwLastReadTime;
    tConnectionGlobalData * globalDataPtr;
    tConnectionLocalData * localDataPtr;

public:
    tConnectionData(tConnectionGlobalData * gblData) :
        socket_(INVALID_SOCKET),
        connectionState(HPS_CONNECTION_STATE_CLOSED),
        sendBufferSize(0),
        dwLastReadTime(0),
        globalDataPtr(gblData),
        localDataPtr(0)
    {
        memset(&overlapped, 0, sizeof(overlapped));
        memset(&address, 0, sizeof(address));
        localDataPtr = new tConnectionLocalData;
    }

    ~tConnectionData()
    {
        delete localDataPtr;
    }

    bool Initialize()
    {
        connectionState = HPS_CONNECTION_STATE_CLOSED;
        if(socket_ != INVALID_SOCKET) // Prevent resource leaks
        {
            return Close(true, true);
        }
        socket_ = WSASocket(AF_INET, SOCK_STREAM, 0, NULL, 0, WSA_FLAG_OVERLAPPED);
        if(socket_ == INVALID_SOCKET)
        {
            // TODO: Handle error
            return false;
        }

        // We still need to associate the newly connected socket to our IOCP:
        HANDLE hResult = CreateIoCompletionPort((HANDLE)socket_, globalDataPtr->hCompletionPort, 0, globalDataPtr->dwNumberOfConcurrentThreads);
        if(hResult != globalDataPtr->hCompletionPort)
        {
            // TODO: Handle error
            return false;
        }

        DWORD numberOfBytes = 0; // Not used in this mode
        if(globalDataPtr->lpfnAcceptEx(globalDataPtr->listenSocket, socket_, recvBufferData, globalDataPtr->initialReceiveSize, HPS_SOCKADDR_SIZE, HPS_SOCKADDR_SIZE, &numberOfBytes, &overlapped) == FALSE)
        {
            DWORD dwError = GetLastError();
            if(dwError != ERROR_IO_PENDING)
            {
                closesocket(socket_);
                socket_ = INVALID_SOCKET;
                // TODO: Handle error
                return false;
            }
        }

        // Update the state the connection is in
        connectionState = HPS_CONNECTION_STATE_ACCEPT;

        // Success
        return true;
    }

    bool Close(bool force, bool reuse)
    {
        if(socket_ != INVALID_SOCKET)
        {
            struct linger li = {0, 0};
            if(force == true) // Default: SO_DONTLINGER
            {
                li.l_onoff = 1; // SO_LINGER, timeout = 0
            }
            setsockopt(socket_, SOL_SOCKET, SO_LINGER, (char *)&li, sizeof(li));
            closesocket(socket_);
            socket_ = INVALID_SOCKET;
        }
        connectionState = HPS_CONNECTION_STATE_CLOSED;
        if(reuse == true)
        {
            return Initialize();
        }
        return true;
    }

    void ProcessIO(DWORD numberOfBytes)
    {
        if(connectionState == HPS_CONNECTION_STATE_READ)
        {
            if(numberOfBytes == SOCKET_ERROR)
            {
                // TODO: Log error
                Close(true, true);
                return;
            }
            else if(numberOfBytes == 0) // connection closing?
            {
                // TODO: Log error
                Close(false, true);
                return;
            }
            dwLastReadTime = GetTickCount();

            //
            // TODO: Process data sent from the client here
            //

            PostRead();
        }
        else if(connectionState == HPS_CONNECTION_STATE_ACCEPT)
        {
            // On Windows XP and later, once the AcceptEx function completes and the SO_UPDATE_ACCEPT_CONTEXT option is set on the accepted socket,
            // the local address associated with the accepted socket can also be retrieved using the getsockname function. Likewise, the remote
            // address associated with the accepted socket can be retrieved using the getpeername function.
            setsockopt(socket_, SOL_SOCKET, SO_UPDATE_ACCEPT_CONTEXT, (char *)&globalDataPtr->listenSocket, sizeof(globalDataPtr->listenSocket));

            dwLastReadTime = GetTickCount();

            if(globalDataPtr->initialReceiveSize != 0)
            {
                //
                // TODO: Process data sent from a ConnectEx call here
                //
            }

            // We are ready to start receiving from the client
            PostRead();
        }
    }

    // This function will post a read operation on the socket. This means that an IOCP event
    // notification will be fired when the socket has data available for reading to it.
    void PostRead()
    {
        connectionState = HPS_CONNECTION_STATE_READ;
        WSABUF recvBufferDescriptor = {HPS_OVERLAPPED_BUFFER_RECV_SIZE, (char *)recvBufferData};
        DWORD numberOfBytes = 0;
        DWORD recvFlags = 0;
        BOOL result = WSARecv(socket_, &recvBufferDescriptor, 1, &numberOfBytes, &recvFlags, &overlapped, NULL);
        if(result == SOCKET_ERROR)
        {
            if(GetLastError() != ERROR_IO_PENDING)
            {
                // TODO: Handle error
                Close(true, true);
            }
        }
    }
};

struct tWorkerThreadData
{
public:
    HANDLE hThread;
    DWORD dwThreadId;

public:
    tWorkerThreadData() :
        hThread(INVALID_HANDLE_VALUE),
        dwThreadId(0)
    {
    }

    ~tWorkerThreadData()
    {
    }
};

struct tHighPerformanceServerData
{
public:
    WORD wPort;
    int backLog;

    HANDLE hCompletionPort;
    DWORD dwNumberOfConcurrentThreads;
    DWORD dwNumberOfWorkerThreads;
    LONG lRunningWorkerThreadCount;

    SOCKET sListenSocket;
    SOCKADDR_IN saInternetAddr;

    GUID GuidAcceptEx;
    LPFN_ACCEPTEX lpfnAcceptEx;

    GUID GuidGetAcceptExSockaddrs;
    LPFN_GETACCEPTEXSOCKADDRS lpfnGetAcceptExSockaddrs;

    CRITICAL_SECTION workerThreadCS;
    std::list<tWorkerThreadData *> workerThreads;

    DWORD dwInitialConnectionPoolCount;
    std::list<tConnectionData *> connectionPool;

    HANDLE hScavengerThread;
    DWORD dwScavengerThreadId;
    DWORD dwScavengerDelay; // milliseconds between runs of the idle socket scavenger
    HANDLE hScavengerExitEvent; // tells scavenger thread when to die

    DWORD dwWorkerThreadScaleValue;

    tConnectionGlobalData globalData;

public:
    tHighPerformanceServerData() :
        hCompletionPort(INVALID_HANDLE_VALUE),
        dwNumberOfConcurrentThreads(0),
        dwNumberOfWorkerThreads(0),
        lRunningWorkerThreadCount(0),
        sListenSocket(INVALID_SOCKET),
        wPort(0),
        lpfnAcceptEx(NULL),
        lpfnGetAcceptExSockaddrs(NULL),
        dwInitialConnectionPoolCount(1000),
        dwScavengerDelay(1000),
        hScavengerExitEvent(NULL),
        hScavengerThread(INVALID_HANDLE_VALUE),
        dwScavengerThreadId(0),
        dwWorkerThreadScaleValue(1),
        backLog(SOMAXCONN)
    {
        GUID guidAcceptEx = WSAID_ACCEPTEX;
        memcpy(&GuidAcceptEx, &guidAcceptEx, sizeof(guidAcceptEx));
        GUID guidGetAcceptExSockaddrs = WSAID_GETACCEPTEXSOCKADDRS;
        memcpy(&GuidGetAcceptExSockaddrs, &guidGetAcceptExSockaddrs, sizeof(guidGetAcceptExSockaddrs));
        InitializeCriticalSection(&workerThreadCS);
    }

    ~tHighPerformanceServerData()
    {
        DeleteCriticalSection(&workerThreadCS);
    }

    int WorkerThread()
    {
        BOOL result = 0;
        DWORD numberOfBytes = 0;
        ULONG key = 0;
        OVERLAPPED * lpOverlapped = 0;
        InterlockedIncrement(&lRunningWorkerThreadCount);
        while(true)
        {
            tConnectionData * connectionData = 0;
            InterlockedDecrement(&lRunningWorkerThreadCount);
            result = GetQueuedCompletionStatus(hCompletionPort, &numberOfBytes, &key, &lpOverlapped, INFINITE);
            if(key == -1)
            {
                break; // Time to exit the worker thread
            }
            connectionData = CONTAINING_RECORD(lpOverlapped, tConnectionData, overlapped);
            if(connectionData == 0)
            {
                // TODO: Handle error
                continue;
            }
            InterlockedIncrement(&lRunningWorkerThreadCount);
            if(result == TRUE)
            {
                // We have an I/O to process
                connectionData->ProcessIO(numberOfBytes);
            }
            else
            {
                // Close this socket and make space for a new one if we are still listening
                connectionData->Close(true, ((sListenSocket == INVALID_SOCKET) ? false : true));
            }
        }
        return 0;
    }

    int ScavengerThread()
    {
        while(true)
        {
            int count[4] = {0};
            std::list<tConnectionData *>::iterator itr = connectionPool.begin();
            while(itr != connectionPool.end())
            {
                tConnectionData * connection = (*itr);
                count[connection->connectionState]++;
                // AcceptEx() called, but no completion yet
                if(connection->connectionState == HPS_CONNECTION_STATE_ACCEPT)
                {
                    // determine if the socket is connected
                    int seconds = 0;
                    int length = sizeof(seconds);
                    if(0 == getsockopt(connection->socket_, SOL_SOCKET, SO_CONNECT_TIME, (char *)&seconds, &length))
                    {
                        if(seconds != -1)
                        {
                            seconds *= 1000;
                            if(seconds > (int)globalData.dwAcceptTimeTimeout)
                            {
                                printf("[%i][Accept] idle timeout after %i ms.\n", connection->socket_, seconds);
                                // closesocket() here causes an immediate IOCP notification with an error indication;
                                // that will cause our worker thread to call Close().
                                closesocket(connection->socket_);
                                connection->socket_ = INVALID_SOCKET;
                                connection->connectionState = HPS_CONNECTION_STATE_CLOSED;
                            }
                        }
                        // No connection made on this socket yet
                        else if(seconds == -1)
                        {
                            // Nothing to do
                        }
                    }
                }
                // The client is in a read or write state, doesn't matter which. We want to make sure
                // activity still exists as desired.
                else
                {
                    bool doClose = false;
                    DWORD tick = GetTickCount();
                    DWORD dwLastTime = tick - connection->dwLastReadTime;
                    if(dwLastTime > globalData.dwReadTimeTimeout)
                    {
                        printf("[%i][Read] idle timeout after %i ms.\n", connection->socket_, dwLastTime);
                        doClose = true;
                    }
                    else if(dwLastTime > ((float)globalData.dwReadTimeTimeout * .5))
                    {
                        printf("[%i][Read] %i ms\n", connection->socket_, dwLastTime);
                    }
                    if(doClose)
                    {
                        closesocket(connection->socket_);
                        connection->socket_ = INVALID_SOCKET;
                        connection->connectionState = HPS_CONNECTION_STATE_CLOSED;
                    }
                }
                itr++;
            }
            printf("[Closed]: %.4i [Accept]: %.4i [Read]: %.4i [Write]: %.4i\n", count[0], count[1], count[2], count[3]);

            // Pause until next run due
            DWORD result = WaitForSingleObject(hScavengerExitEvent, dwScavengerDelay);
            if(result != WAIT_TIMEOUT)
            {
                break;
            }
        }
        return 0;
    }

    DWORD AddConnectionsToPool(long count)
    {
        // We cannot add more connections once the server has started
        if(hScavengerThread != INVALID_HANDLE_VALUE)
        {
            return 0;
        }
        DWORD total = 0;
        for(long index = 0; index < count; ++index)
        {
            tConnectionData * connection = new tConnectionData(&globalData);
            bool result = connection->Initialize();
            if(result == true)
            {
                connectionPool.push_back(connection);
                total++;
            }
            else
            {
                // TODO: Handle error
                delete connection;
            }
        }
        return total;
    }

    DWORD AddWorkerThreads(DWORD count)
    {
        DWORD total = 0;
        for(DWORD index = 0; index < count; ++index)
        {
            tWorkerThreadWrapperData * workerThreadData = new tWorkerThreadWrapperData;
            tWorkerThreadData * threadData = new tWorkerThreadData;
            threadData->hThread = CreateThread(NULL, 0, WorkerThreadWrapper, workerThreadData, CREATE_SUSPENDED, &threadData->dwThreadId);
            if(threadData->hThread != NULL)
            {
                total++;
                EnterCriticalSection(&workerThreadCS);
                workerThreads.push_back(threadData);
                LeaveCriticalSection(&workerThreadCS);
                workerThreadData->serverData = this;
                workerThreadData->threadData = threadData;
                DWORD dwResult = ResumeThread(threadData->hThread);
                if(dwResult == (DWORD)-1)
                {
                    // TODO: Handle error
                    __asm nop
                }
            }
            else
            {
                delete workerThreadData;
                delete threadData;
            }
        }
        return total;
    }
};

DWORD WINAPI WorkerThreadWrapper(LPVOID lpParam)
{
    tWorkerThreadWrapperData * data = (tWorkerThreadWrapperData *)lpParam;
    DWORD dwResult = data->serverData->WorkerThread();
    LPCRITICAL_SECTION pCS = &data->serverData->workerThreadCS;
    EnterCriticalSection(pCS);
    std::list<tWorkerThreadData *>::iterator itr = data->serverData->workerThreads.begin();
    while(itr != data->serverData->workerThreads.end())
    {
        tWorkerThreadData * td = (*itr);
        if(td->dwThreadId == data->threadData->dwThreadId && td->hThread == data->threadData->hThread)
        {
            printf("Removing worker thread [%X][%X]\n", data->threadData->hThread, data->threadData->dwThreadId);
            data->serverData->workerThreads.erase(itr);
            break;
        }
        itr++;
    }
    delete data->threadData;
    delete data;
    LeaveCriticalSection(pCS);
    return dwResult;
}

DWORD WINAPI ScavengerThreadWrapper(LPVOID lpParam)
{
    return ((tHighPerformanceServerData *)lpParam)->ScavengerThread();
}

bool InitializeWinsock()
{
    WSADATA wd = { 0 };
    if(WSAStartup(MAKEWORD(2, 2), &wd) != 0)
    {
        // TODO: Handle error
        return false;
    }
    if(LOBYTE( wd.wVersion ) < 2)
    {
        WSACleanup();
        // TODO: Handle error
        return false;
    }
    return true;
}

void DeinitializeWinsock()
{
    WSACleanup();
}

// Our high performance server :)
class cHighPerformanceServer
{
private:
    tHighPerformanceServerData * internalData;

public:
    cHighPerformanceServer()
    {
        internalData = new tHighPerformanceServerData;
    }

    ~cHighPerformanceServer()
    {
        delete internalData;
    }

    bool Create(unsigned short port)
    {
        // Get the system information
        SYSTEM_INFO SystemInfo;
        GetSystemInfo(&SystemInfo);

        // Try to create an I/O completion port
        internalData->hCompletionPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, internalData->dwNumberOfConcurrentThreads);
        if(internalData->hCompletionPort == NULL)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        // Calculate how many worker threads we should create to process IOCP events
        DWORD dwNumberOfWorkerThreads = internalData->dwNumberOfWorkerThreads;
        if(internalData->dwNumberOfWorkerThreads == 0)
        {
            if(internalData->dwNumberOfConcurrentThreads == 0)
            {
                dwNumberOfWorkerThreads = SystemInfo.dwNumberOfProcessors * internalData->dwWorkerThreadScaleValue;
            }
            else
            {
                dwNumberOfWorkerThreads = internalData->dwNumberOfConcurrentThreads * internalData->dwWorkerThreadScaleValue;
            }
        }

        // Create the worker threads!
        DWORD dwWorkerTotal = internalData->AddWorkerThreads(dwNumberOfWorkerThreads);
        if(dwWorkerTotal != dwNumberOfWorkerThreads)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        internalData->sListenSocket = WSASocket(AF_INET, SOCK_STREAM, 0, NULL, 0, WSA_FLAG_OVERLAPPED);
        if(internalData->sListenSocket == INVALID_SOCKET)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        // Bind the socket to the port
        internalData->wPort = port;
        internalData->saInternetAddr.sin_family = AF_INET;
        internalData->saInternetAddr.sin_addr.s_addr = htonl(INADDR_ANY);
        internalData->saInternetAddr.sin_port = htons(internalData->wPort);
        int bindResult = bind(internalData->sListenSocket, (PSOCKADDR) &internalData->saInternetAddr, sizeof(internalData->saInternetAddr));
        if(bindResult == SOCKET_ERROR)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        int listenResult = listen(internalData->sListenSocket, internalData->backLog);
        if(listenResult == SOCKET_ERROR)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        DWORD dwBytes = 0;
        int ioctlResult = WSAIoctl(internalData->sListenSocket, SIO_GET_EXTENSION_FUNCTION_POINTER,
            &internalData->GuidAcceptEx, sizeof(internalData->GuidAcceptEx), &internalData->lpfnAcceptEx,
            sizeof(internalData->lpfnAcceptEx), &dwBytes, NULL, NULL);
        if(ioctlResult == SOCKET_ERROR)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        dwBytes = 0;
        ioctlResult = WSAIoctl(internalData->sListenSocket, SIO_GET_EXTENSION_FUNCTION_POINTER,
            &internalData->GuidGetAcceptExSockaddrs, sizeof(internalData->GuidGetAcceptExSockaddrs), &internalData->lpfnGetAcceptExSockaddrs,
            sizeof(internalData->lpfnGetAcceptExSockaddrs), &dwBytes, NULL, NULL);
        if(ioctlResult == SOCKET_ERROR)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        // Assign the global data for our connections
        internalData->globalData.lpfnAcceptEx = internalData->lpfnAcceptEx;
        internalData->globalData.lpfnGetAcceptExSockaddrs = internalData->lpfnGetAcceptExSockaddrs;
        internalData->globalData.listenSocket = internalData->sListenSocket;
        internalData->globalData.hCompletionPort = internalData->hCompletionPort;
        internalData->globalData.dwNumberOfConcurrentThreads = internalData->dwNumberOfConcurrentThreads;
        internalData->globalData.dwReadTimeTimeout = 10000; // TODO: Variable
        internalData->globalData.dwAcceptTimeTimeout = 5000; // TODO: Variable
        internalData->globalData.initialReceiveSize = 0; // Do not accept anything from AcceptEx
        // If we wanted to accept data sent from ConnectEx via AcceptEx
        //internalData->globalData.initialReceiveSize = HPS_OVERLAPPED_BUFFER_RECV_SIZE - ((sizeof(SOCKADDR_IN) + 16) * 2);

        DWORD dwConnectionTotal = internalData->AddConnectionsToPool(internalData->dwInitialConnectionPoolCount);
        if(dwConnectionTotal != internalData->dwInitialConnectionPoolCount)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        // Connect the listener socket to IOCP
        if(CreateIoCompletionPort((HANDLE)internalData->sListenSocket, internalData->hCompletionPort, 0, internalData->dwNumberOfConcurrentThreads) == 0)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        internalData->hScavengerExitEvent = CreateEvent(0, TRUE, FALSE, 0);
        if(internalData->hScavengerExitEvent == NULL)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        internalData->hScavengerThread = CreateThread(0, 0, ScavengerThreadWrapper, internalData, CREATE_SUSPENDED, &internalData->dwScavengerThreadId);
        if(internalData->hScavengerThread == NULL)
        {
            // TODO: Log error
            Destroy();
            return false;
        }

        DWORD dwResult = ResumeThread(internalData->hScavengerThread);
        if(dwResult == (DWORD)-1)
        {
            // TODO: Log error
            Destroy();
            __asm nop
        }

        // Success!
        return true;
    }

    void Destroy()
    {
        if(internalData->hScavengerExitEvent != NULL)
        {
            SetEvent(internalData->hScavengerExitEvent);
            if(internalData->hScavengerThread != INVALID_HANDLE_VALUE)
            {
                int result = WaitForSingleObject(internalData->hScavengerThread, internalData->dwScavengerDelay * 2);
                if(result != WAIT_OBJECT_0)
                {
                    // TODO: Log error
                    __asm nop
                }
                CloseHandle(internalData->hScavengerThread);
                internalData->hScavengerThread = INVALID_HANDLE_VALUE;
            }
            CloseHandle(internalData->hScavengerExitEvent);
            internalData->hScavengerExitEvent = NULL;
        }

        if(internalData->sListenSocket != INVALID_SOCKET)
        {
            closesocket(internalData->sListenSocket);
            internalData->sListenSocket = INVALID_SOCKET;
        }

        std::vector<HANDLE> workerThreadHandles;
        std::list<tWorkerThreadData *>::iterator itr = internalData->workerThreads.begin();
        while(itr != internalData->workerThreads.end())
        {
            workerThreadHandles.push_back((*itr)->hThread);
            itr++;
        }

        // Clean up the worker threads waiting on the IOCP
        if(internalData->hCompletionPort != INVALID_HANDLE_VALUE)
        {
            EnterCriticalSection(&internalData->workerThreadCS);
            size_t count = internalData->workerThreads.size();
            for(size_t x = 0; x < count; ++x)
            {
                PostQueuedCompletionStatus(internalData->hCompletionPort, 0, -1, 0);
            }
            LeaveCriticalSection(&internalData->workerThreadCS);
        }

        // Wait for all worker threads to close
        for(size_t x = 0; x < workerThreadHandles.size(); x += MAXIMUM_WAIT_OBJECTS)
        {
            DWORD count = min(MAXIMUM_WAIT_OBJECTS, workerThreadHandles.size() - x);
            DWORD dwResult = WaitForMultipleObjects(count, &workerThreadHandles[x], TRUE, count * 1000);
            if(dwResult != WAIT_OBJECT_0)
            {
                // TODO: Log error
                __asm nop
            }
        }

        // Sanity check
        if(internalData->workerThreads.size())
        {
            // TODO: Log error
            printf("%i worker threads did not finish...resources will be leaked.\n", internalData->workerThreads.size());
        }

        if(internalData->connectionPool.size())
        {
            std::list<tConnectionData * >::iterator itr = internalData->connectionPool.begin();
            while(itr != internalData->connectionPool.end())
            {
                closesocket((*itr)->socket_);
                delete (*itr);
                itr++;
            }
            internalData->connectionPool.clear();
        }

        if(internalData->hCompletionPort != INVALID_HANDLE_VALUE)
        {
            CloseHandle(internalData->hCompletionPort);
            internalData->hCompletionPort = INVALID_HANDLE_VALUE;
        }
    }
};

HANDLE exitEvent = 0;

BOOL __stdcall ConsoleHandler(DWORD ConsoleEvent)
{
    switch (ConsoleEvent)
    {
        case CTRL_LOGOFF_EVENT:
        case CTRL_C_EVENT:
        case CTRL_BREAK_EVENT:
        case CTRL_CLOSE_EVENT:
        case CTRL_SHUTDOWN_EVENT:
        {
            if(exitEvent != 0)
            {
                SetEvent(exitEvent);
                return TRUE;
            }
        }
    }
    return FALSE;
}

int main(int argc, char * argv[])
{
    printf("sizeof(tConnectionData) = %i\n", sizeof(tConnectionData));
    if(InitializeWinsock() == false)
        return 0;
    cHighPerformanceServer server;
    if(server.Create(15779) == false)
    {
        return 0;
    }
    exitEvent = CreateEvent(0, TRUE, FALSE, 0);
    SetConsoleCtrlHandler(ConsoleHandler, TRUE);
    WaitForSingleObject(exitEvent, INFINITE);
    SetConsoleCtrlHandler(ConsoleHandler, FALSE);
    server.Destroy();
    DeinitializeWinsock();
    CloseHandle(exitEvent);
    return 0;
}
Quote:Original post by Drew_Benton
That would explain why the service time is gradually getting longer as more connections are being added and clients sporadically disconnect. I was running 500 clients per laptop, which it seems the network can't handle from one source.

The network can handle it fine, the desktop cannot. 500 connections coming from one machine is quite a large workload, especially if it then has to queue up and send data on those connections. Spreading out your connections across many machines should improve your test case, and allow you to implement other tests more easily as well.

Also, and I can't find a reference for this so take it with a grain of salt, I recall reading somewhere that the Windows XP networking stack has certain limitations that the server OS ones do not.
Quote:
If I split the test up into 2 x 250 connection parts and watch Wireshark, I see far fewer retransmits and never get any notifications of the delayed reads. If I try running 400 clients per computer, then right towards the 600 mark, I start getting more retransmits and longer service time delays in my program. As I hit almost 800 connections, the retransmits were filling up Wireshark and most of the connections were failing in my program due to longer service times.

I see now that my code seems to be more than suitable for handling a lot more connections and traffic, but my network and my current test setup are not. What would you suggest is the best way to go about testing code like this in general to avoid problems like this? I mean, when you are in an early stage and want to test the upper bounds of your code, but have nothing much to lure random testers in and you need lots of traffic, do you just have to wait?

Let's say I wanted to try and pull off a larger test; would the problem lie in my router? I.e., if I set up a dedicated server to run on at home and got, let's say, 10 or so older computers (you know, those P4 512mb Dell Optiplex ones) connected via LAN, do you think the router still couldn't handle it, or are the computers themselves the problem?

I doubt your router is the problem. While most consumer routers are fairly limited in their internet bandwidth abilities, their internal switches can generally operate fairly well (and you should be hitting only the switch if you're on the local network). Spreading out the connections to be coming from multiple clients will create a more realistic test scenario. Configuring your host OS to be a server class one (even a 180 day trial of Server 2008 will do) would help as well.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

Thanks Washu, that's great information to know. I think I'll go that route of testing on a server OS with a few more computers across the LAN. I'll post some more results in a couple of days or so after I get everything set up and test again. I might also invest some time in making the code 64-bit compatible (or at least make sure it works in 64-bit mode) so I can get a 2-for-1 deal out of it.

I might as well get this type of work done now, because I'm going to have to pick up some more testing hardware anyway to work through the distributed computing and to make a simple program that can work across multiple computers at once. I won't go crazy or anything, just get some cheap basics that should work out fine.

Thanks again everyone, I'll keep you updated. [smile]
Although you're measuring latency while sending small amounts of data, I don't see any TCP_NODELAY option.
Quote:Original post by Antheus
Although you're measuring latency while sending small amounts of data, I don't see any TCP_NODELAY option.


Thanks, I forgot about that flag in the test client. Setting it seemed to help a little, but it really emphasizes the problem with the testing hardware/OS combination on the server.
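For reference, this is roughly all it took in the test client (a sketch; clientSocket stands in for the already-connected test socket):

// Disable Nagle's algorithm so the 32-byte payloads go out immediately instead
// of being coalesced while waiting for outstanding ACKs.
BOOL noDelay = TRUE;
if(setsockopt(clientSocket, IPPROTO_TCP, TCP_NODELAY,
              (const char *)&noDelay, sizeof(noDelay)) == SOCKET_ERROR)
{
    // TODO: Handle error (WSAGetLastError())
}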

I ran another large test trying to get 750 clients from one computer to make sure nothing had changed, and sure enough, the higher the client count got, the more TCP retransmissions flooded the screen with longer RTTs to ACK. In addition, I was getting a flood of duplicate ACKs, which to me looks like there is too much activity going on for the hardware on the test clients; they're not churning through the data fast enough. Similarly, another test with 400 clients on each computer resulted in the same behavior, so I'm certain now about the OS theory.

I don't expect to be able to get more testing hardware or a server setup until some time next week, so for now I'll keep working on the small-case examples with the IOCP code, using what I have. I still have to solve some more problems dealing with concurrency and stream processing. That's ok, since this is really all for fun and learning; I'm not in a rush, nor do I have deadlines looming. I've also already started my initial research and development on making my small distributed/cluster code to tackle the scalability issues I originally talked about. So I have plenty to do until I can run a larger, more effective test. When I do, I'll make sure to write up the results and the hardware used for reference.

Thanks everyone, this thread has been a great benefit to me!
That's an interesting topic, I'm interested to see if/how you'll solve your problem.

I don't know if it's too complex or not for you to test, but maybe you could try to change your network system to use a single UDP socket on the server to which the clients can send/receive data. The main point here would be to test if you're hitting a TCP socket count overhead/limit somewhere. With UDP your packets won't be reliable anymore, but since it's just for a test, it shouldn't matter..

Y.
Quote:Original post by Ysaneya
That's an interesting topic, I'm interested to see if/how you'll solve your problem.


Me too! I'm hoping a trial run on a server will turn out positive, but I'm still deciding how to go about that test. I need my own server setup, which I've had planned for a while, so I was just going to set one up this upcoming week. Before making that kind of plunge, though, I've had to spend a lot of time researching options and making future considerations:
A) Buy a refurbished 1U server unit from geeks.com, load up a server OS, and test.
B) Lease a dedicated server for a month and test on that, so it's more "real world" testing.
C) Make a future investment and simply build my own via Newegg.

As much as I want to set up my own on my local network, I don't know enough about the process to feel like I'd be making an educated purchase, so I'm leaning towards a 30-day lease of a dedicated server for the purpose of testing this stuff. The other ideas I had about the distributed computing and shards/MMO architecture I think I can get away with on my local network using my existing desktop and laptops, so no worries there, as those goals aren't based on being able to act as real servers. This IOCP stuff, though, I want to get right.

Quote:I don't know if it's too complex or not for you to test, but maybe you could try to change your network system to use a single UDP socket on the server to which the clients can send/receive data. The main point here would be to test if you're hitting a TCP socket count overhead/limit somewhere. With UDP your packets won't be reliable anymore, but since it's just for a test, it shouldn't matter..


I've started the process of making my IOCP server UDP based, but I'm getting a little stuck on one specific issue. In TCP, each connection has a WSARecv posted on it, and it simply sits pending until there is something going on.

In my current UDP setup, however, when I post an overlapped WSARecvFrom on the main listen socket, it returns immediately with an error code of ERROR_IO_PENDING. This is bad because, since it does not block, the server will run at 100% CPU usage as it's polling the recv buffers rather than waiting for a completion event.

I've found that if I instead just post a blocking WSARecvFrom, and then use PostQueuedCompletionStatus to notify the worker threads, then I can utilize the system as it should be, but that's defeating the purpose, as it's bottlenecked at the non-overlapped WSARecvFrom call.

Going back to the overlapped WSARecvFrom call, each call does post a recv operation for the IOCP code to use, but I don't really know how to make it so that it's not flooding the connection with overlapped recv requests. I.e., running it in a loop will post an overlapped operation each iteration, so calling it 1000 times a second means 1000 queued overlapped operations waiting to be handled. Running like that for 60 seconds is 60,000 overlapped operations waiting to be handled, or in UDP terms, 60,000 packets a minute.

I'm not really sure at the moment how to go about coding limitations on that process in an efficient manner that would allow me to do the same caliber of TCP testing using UDP. My only initial thought on the matter is to have some value that stores how many posted overlapped operations there are and, when that number hits some limit, Sleep the recv thread a little to see if those events have drained. In the worker threads, I'd decrement the value on each completed operation. I was thinking I could build that system using the Interlocked family of functions, but that doesn't really address what to do when I have too many posted, since Sleeping seems a bit of a hack.

What I think I should be doing is having the worker threads decrement the overlapped counter as mentioned before, but have a detection in place to set a global event that the UDP thread waits on while the current overlapped counter value is greater than some safe value. When that event is triggered, it will loop through the WSARecvFrom calls again, posting requests and updating the counter until it hits the max quota, and then wait for the event again. That seems better than Sleep, and more effective and scalable overall.
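Here's roughly what I'm picturing, as a sketch rather than working code; the names, buffer size, and thresholds below are all made up for illustration:

// Hedged sketch of the "keep N receives pending" idea. Everything here
// (names, limits, g_* globals) is illustrative, not from the actual server.
#include <winsock2.h>
#include <windows.h>

const LONG MAX_PENDING_RECVS = 2500;   // high-water mark
const LONG LOW_WATER_MARK    = 1250;   // repost once we fall to half

volatile LONG g_pendingRecvs = 0;      // how many WSARecvFroms are outstanding
HANDLE g_replenishEvent = NULL;        // auto-reset event: CreateEvent(NULL, FALSE, FALSE, NULL)

struct tUdpRecvContext
{
    OVERLAPPED overlapped;
    WSABUF wsaBuf;
    char buffer[512];
    sockaddr_in from;
    int fromLen;
};

bool PostUdpRecv(SOCKET udpSocket, tUdpRecvContext * ctx)
{
    memset(&ctx->overlapped, 0, sizeof(ctx->overlapped));
    ctx->wsaBuf.buf = ctx->buffer;
    ctx->wsaBuf.len = sizeof(ctx->buffer);
    ctx->fromLen = sizeof(ctx->from);
    DWORD bytes = 0, flags = 0;
    int result = WSARecvFrom(udpSocket, &ctx->wsaBuf, 1, &bytes, &flags,
                             (sockaddr *)&ctx->from, &ctx->fromLen,
                             &ctx->overlapped, NULL);
    if(result == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING)
    {
        return false; // TODO: Handle error
    }
    InterlockedIncrement(&g_pendingRecvs);
    return true;
}

// Called by a worker thread after it processes one completed WSARecvFrom.
void OnUdpRecvCompleted()
{
    if(InterlockedDecrement(&g_pendingRecvs) <= LOW_WATER_MARK)
    {
        SetEvent(g_replenishEvent); // wake the posting thread instead of Sleep()ing
    }
}

// Posting thread: block on the event, then top the pool back up to the maximum.
void ReplenishLoop(SOCKET udpSocket)
{
    while(WaitForSingleObject(g_replenishEvent, INFINITE) == WAIT_OBJECT_0)
    {
        while(g_pendingRecvs < MAX_PENDING_RECVS)
        {
            PostUdpRecv(udpSocket, new tUdpRecvContext); // freed by the worker thread
        }
    }
}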

I should be able to finish this code and test the idea in the next couple of days. I actually have a simple UDP setup I'm working with for the distributed system component I'm making. More to come soon! [smile]
Hi Drew. Just FYI:

The server was a VM based on a dual-core Centrino with 4gb RAM, and it hosted 6 different servers.
The router was a Fortigate 200 (not a home router). Clients: 25 different computers belonging to different subnetworks, each of them running several clients. Each packet was about 600 bytes. The result was 800 CCU without any significant exhaustion of server resources (I guess it would have been very easy to increase this amount, but we were limited by the client side...)
Thanks for those results aissp! I am still looking forward to testing the TCP version on a server myself and checking it out. I think I will run out of client resources before you did though [lol]

I've been working on the UDP version and after testing it, the results seem unreal. I mean, I know I am testing on localhost and all, but it's amazing how it's turning out.

Just some preliminary numbers on the current results. I am still thinking something has to be wrong, but so far it looks like I'm doing it right. I will spend some more time thinking it through, though. I used TeamViewer to RDC into my laptops, so there was always a small amount of UDP network traffic (~4kb/s) going on.

Server: UDP IOCP with 2500 overlapped receives kept pending at a time (see my Journal for why 2500!). As soon as this value hit 1/2, more requests were posted to bring it back to the max; this logic only triggers on 1/2 empty. The server also has a hard-coded limitation of tracking 8k "connections" (arbitrary). Since this is early-stage work, I have to limit the server to one IOCP thread and one worker thread to avoid any synchronization issues with storing timing data. Unlike the TCP server, this server has to make use of boost::singleton_pool for the massive number of allocations and deallocations of the connection objects posted on each overlapped event. That wouldn't make performance "better" though.
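For the pooling, the idea is just a tagged singleton_pool fronting the per-receive context allocations; a rough sketch (reusing the tUdpRecvContext struct from the sketch above, names illustrative):

// Hedged sketch of how boost::singleton_pool can back the per-receive contexts.
#include <boost/pool/singleton_pool.hpp>
#include <new>

struct UdpContextPoolTag { };
typedef boost::singleton_pool<UdpContextPoolTag, sizeof(tUdpRecvContext)> UdpContextPool;

tUdpRecvContext * AllocContext()
{
    void * p = UdpContextPool::malloc();          // fixed-size block from the pool
    return p ? new (p) tUdpRecvContext() : NULL;  // placement-new into pool memory
}

void FreeContext(tUdpRecvContext * ctx)
{
    ctx->~tUdpRecvContext();                      // run the destructor manually
    UdpContextPool::free(ctx);                    // return the block to the pool
}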

Client: Creates a UDP socket and sends 32 bytes each second. Each client uses __rdtsc to generate its own UID and sends that to the server. The server stores client data in a stdext::hash_map keyed by the 64-bit number sent. A simple client, just like the TCP one was.
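The client loop is nothing fancy; a rough sketch of it (the payload layout and localhost address here are illustrative, not copied from the real client):

// Hedged sketch of the UDP test client: one socket, a 64-bit UID from __rdtsc,
// and a 32-byte payload sent once per second.
#include <winsock2.h>
#include <windows.h>
#include <intrin.h>

int main()
{
    WSADATA wd;
    WSAStartup(MAKEWORD(2, 2), &wd);

    SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    sockaddr_in server = { 0 };
    server.sin_family = AF_INET;
    server.sin_port = htons(15779);                   // same port as the TCP test
    server.sin_addr.s_addr = inet_addr("127.0.0.1");  // localhost for this sketch

    unsigned __int64 uid = __rdtsc();                 // "unique" id for this client
    char payload[32] = { 0 };
    memcpy(payload, &uid, sizeof(uid));               // first 8 bytes = client uid

    while(true)
    {
        sendto(s, payload, sizeof(payload), 0, (sockaddr *)&server, sizeof(server));
        Sleep(1000);
    }
    // not reached in this sketch: closesocket(s); WSACleanup();
    return 0;
}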

I ran 1000 processes of the Client on each of two laptops, 2000 total clients. Initial memory usage was 29mb (high for many reasons; it's modified TCP code) and stayed in that range. CPU usage was most often 1%, sometimes 2% or 0%, throughout the first 5 minutes of the test. Each minute that passed added about 1 second of CPU time. About 150kb/s of data was being processed by the application according to NetMeter.

The test ran for just a little over 5 minutes, with informal timing. Out of the 2000 connections, 1033 had an average service time just over 1 second, but still within a millisecond or two of it (< 1002 ms); 967 had an average service time just under 1 second, again within a millisecond or two (> 998 ms). Given that half of the clients have their data handled faster than they are sending it, that makes me a little concerned. I think I need to add sequence numbers to the packets to try to figure out whether duplicates are throwing off the values or it's just the lack of precision of GetTickCount. I should move to a higher-precision timer for a more accurate test.
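Something along these lines is probably what I'll switch to for the timing (a sketch; GetTickCount's coarse ~10-16 ms resolution is what I suspect is smearing the numbers):

// Hedged sketch: millisecond timestamps from the high-resolution counter
// instead of GetTickCount(), for sub-millisecond service-time measurements.
#include <windows.h>

double GetPreciseMilliseconds()
{
    static LARGE_INTEGER frequency = { 0 };
    if(frequency.QuadPart == 0)
    {
        QueryPerformanceFrequency(&frequency); // counts per second, fixed at boot
    }
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    return (double)now.QuadPart * 1000.0 / (double)frequency.QuadPart;
}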

For a second test, I ran 1500 processes of the Client on each laptop, 3000 total clients for the server. For this test, about 220kb/s of data was being processed by the applications according to NetMeter. CPU usage was more often 1-2% rather than the 0% of the previous test. As a result, almost 2 seconds of CPU time were spent per minute on this test.

The results of this test were more reflective of the amount of traffic, but not by much: 325 clients were still running with an average service time of around 999ms, 2204 were somewhere around 1000ms plus fractions, and 471 had averages above 1s, between 1-5ms over.

Download: Results 1 | Results 2. The format is: [Client Index] => Average Service Time

Of course, this testing is by no means scientific or "official", but the obvious difference between TCP and UDP is being seen; I just didn't know it would be this big. I'm sure my tests are far from optimal, but I'll keep at it this weekend to get better, replicable tests for when I test on a server. I will also try to make an all-in-one UDP/TCP client, as that would be more useful than maintaining two projects.

These posts are getting longer and longer each time, so I'll stop here, but I did want to mention I also made a one-process test client that spawned more UDP connections, and that testing likewise went well. Each laptop tried to run 4,000 UDP sockets and traffic was ~675kb/s. Occasional dips did take place where it dropped down to 1/3 of that, which should be reflected in the service time latencies. I ran a quick test for a couple of minutes (7950 total clients actually made it) and the results were as follows: Results 3. Yes, the first few at ~500ms are puzzling, but everything else looks normal and scaled way better than the TCP testing did!
Quote:which it seems the network can't handle from one source.


Actually, I think the problem is the testing client. When there are 500 separate processes, each taking 2 megabytes, and each doing their own I/O, the client-side kernel will have trouble scheduling all that. If you wrote one uber-client that used IOCP to multiplex a large number of client connections, I believe that client could sustain many more connections. From the point of view of the server, there would be no difference, because each connection is still a separate client-side port.
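A rough sketch of what I mean, with error handling and the send/receive logic omitted and the names made up -- one process driving many outbound connections through a single completion port via ConnectEx:

// Hedged sketch of an IOCP-multiplexed test client: N outbound sockets driven
// from one completion port. ConnectEx requires the socket to be bound first.
#include <winsock2.h>
#include <mswsock.h>
#include <windows.h>

bool StartClientConnections(HANDLE iocp, const sockaddr_in & server, int count)
{
    for(int i = 0; i < count; ++i)
    {
        SOCKET s = WSASocket(AF_INET, SOCK_STREAM, 0, NULL, 0, WSA_FLAG_OVERLAPPED);

        // ConnectEx is fetched per-socket here for brevity; once is enough in practice.
        LPFN_CONNECTEX lpfnConnectEx = NULL;
        GUID guid = WSAID_CONNECTEX;
        DWORD bytes = 0;
        WSAIoctl(s, SIO_GET_EXTENSION_FUNCTION_POINTER, &guid, sizeof(guid),
                 &lpfnConnectEx, sizeof(lpfnConnectEx), &bytes, NULL, NULL);

        // ConnectEx requires a bound socket; bind to any local address/port.
        sockaddr_in local = { 0 };
        local.sin_family = AF_INET;
        bind(s, (sockaddr *)&local, sizeof(local));

        CreateIoCompletionPort((HANDLE)s, iocp, (ULONG_PTR)s, 0);

        OVERLAPPED * ov = new OVERLAPPED;
        memset(ov, 0, sizeof(*ov));
        if(lpfnConnectEx(s, (const sockaddr *)&server, sizeof(server),
                         NULL, 0, NULL, ov) == FALSE &&
           WSAGetLastError() != ERROR_IO_PENDING)
        {
            // TODO: Handle error for this socket
            closesocket(s);
            delete ov;
        }
        // When each connect completion is dequeued, set SO_UPDATE_CONNECT_CONTEXT on the
        // socket and then post reads/writes from the same worker-thread loop pattern
        // used in the server code above.
    }
    return true;
}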

The reason you get about half as much data back as in is that TCP overhead is at least 40 bytes per packet (20 for IP and 20 for TCP without options). Thus, your 32 bytes of payload are actually less than the overhead: each data packet is at least 72 bytes on the wire, while the bare ACK coming back is only about 40.

Also, when posting measurements, "kB" is usually 1000 bytes, "kb" is usually 1000 bits, "KB" is usually 1024 bytes, and "Kb" is usually 1024 bits. Thus, when you post "48 kb" I read that as bits, meaning you're doing I/O that would fit on a modem. My guess is you meant bytes. A lot of people don't use the same conventions, though (and there's "kibi" and "kby" and others, too) so it's best to write out the unit: "kilobytes/s."
enum Bool { True, False, FileNotFound };

