epoll troubles [Solved]

3 comments, last by Megahertz 16 years, 3 months ago
Ugh. I hate having to post my code on here to get help with what's going on, but I've just not been able to find the problem. Basically, here's what's going on.

I've written a few-hundred-line epoll server (Fedora Core 4) that listens for connections, accepts them, and continues to receive data on them until I shut down the client. The client program (Windows XP) does nothing more than create a given number of clients, connect them to the server, and at a given interval (5 times a second at the moment) send a 512-byte packet to the server.

When I first start up the server and the client program, everything is fine. Everything connects, no errors on either side, and the clients merrily start sending their packets to the server. However, at some random time, usually about 20-30 seconds after the clients start sending packets, the last 1-5 (so far) clients in the list start blocking on their send calls.

At first I was using blocking sockets on the sending side; hey, if a socket blocks for a few ms, no biggie. Problem is that it would block almost indefinitely. After getting this problem I switched to non-blocking sockets for the client. As expected, I now get WSAEWOULDBLOCK errors where previously I blocked. OK, that's fine, however now with the non-blocking code I get inconsistencies in which of those last 1-5 sockets actually give a WSAEWOULDBLOCK error. OK, that's sort of fine too, except it contradicts the previous behavior, where once a socket blocked, it stayed blocked. I've got no idea what it's waiting on... for so long!?

OK, here's the kicker: I swapped out the CTestServer::Run() method (posted below) of the server and converted just that method to use select rather than epoll, and I have absolutely no issues with the sockets. No blocking. I took a shower, went and got food, and came back and it was still purring along.

In any event, here's the code for the meat of the server. I can post more should there be a request, but if there's a problem, I'll bet it's here.

Thanks in advance for any help.

-=[ Megahertz ]=-

int CTestServer::Run()
{
    // Create our epoll file descriptor
    const int max_events = 1024;
    int epoll_fd = epoll_create(max_events);
    if (epoll_fd == -1) {
        printf("epoll_create\n");
        return 0;
    }

    // Add our server fd to the epoll event loop
    struct epoll_event event;
    event.events = EPOLLIN | EPOLLERR | EPOLLHUP | EPOLLET;
    event.data.fd = listener->GetSocket();
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listener->GetSocket(), &event) == -1) {
        printf("epoll_ctl\n");
        return 0;
    }
    int conncount = 0;
    printf("Waiting....\n");

    struct epoll_event events[max_events];
    // Execute the epoll event loop
    while (true) {

        // wait for something to happen
        int num_fds = epoll_wait(epoll_fd, events, max_events, -1);

        // loop through all the fds that had activity
        for (int i = 0; i < num_fds; i++) {

            // Case 1: Error condition
            if (events[i].events & (EPOLLHUP | EPOLLERR)) {
                printf("epoll: EPOLLERR");
                close(events[i].data.fd);
                return 0; // just a test to see if it should bail out.
                //continue;
            }
            assert(events[i].events & EPOLLIN);

            // Case 2: Our server is receiving a connection
            if (events[i].data.fd == listener->GetSocket()) {
                struct sockaddr remote_addr;
                socklen_t addr_size = sizeof(remote_addr);
                int connection = accept(listener->GetSocket(), &remote_addr, &addr_size);
                if (connection == -1) {
                    if (errno != EAGAIN && errno != EWOULDBLOCK) {
                        printf("accept error\n");
                    }
                    return 0;
                }

                // Add the connection to our epoll loop
                event.data.fd = connection;
                if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, connection, &event) == -1) {
                    printf("epoll_ctl\n");
                    return 0;
                }
                printf("New Connection! %d\n", conncount++);
                continue;
            }

            // Case 3: One of our existing connections has data to read
            char buf[5120];
            memset(buf, 0, sizeof(buf));
            int bytes = recv(events[i].data.fd, buf, sizeof(buf), 0);
            if (bytes <= 0) { // so far, unless I terminate the clients, this never happens (which is good)
                close(events[i].data.fd);
                printf("Closed connection: %d\n", events[i].data.fd);
            }
            else {
                printf(" %d Bytes from client %d\n", bytes, events[i].data.fd);
            }
        }
    }
}



[Edited by - Megahertz on January 1, 2008 12:10:59 AM]
That the clients' send calls are blocking indicates that the TCP window has filled up, which means that the server isn't reading the incoming data from the TCP stream.

One obvious thing I notice is that you're not removing sockets from epoll control when the connection closes; you just close the socket. Not sure what epoll does in that case (I've only ever used epoll with UDP, so I'm not really familiar with how it interacts with connection-based networking). That may or may not be important, but it's worth looking into.
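
On the question of what epoll does when a registered socket is simply closed: as far as the epoll(7) man page describes it, the descriptor drops out of the epoll set automatically once every descriptor referring to the same open file has been closed, so a plain close() is normally enough unless the socket was duplicated (dup, fork). Unregistering explicitly first is still a tidy habit. A minimal sketch, where the helper name remove_and_close is hypothetical and not part of the posted server:

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>

// Hypothetical helper (not from the original post): stop watching fd,
// then close it. Returns the result of close(): 0 on success, -1 on error.
int remove_and_close(int epoll_fd, int fd)
{
    // Kernels before 2.6.9 require a non-null event pointer even for
    // EPOLL_CTL_DEL, so a dummy is passed for portability.
    struct epoll_event dummy = {};
    if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, &dummy) == -1 && errno != ENOENT) {
        // ENOENT only means the fd was not registered; anything else is worth logging.
        perror("epoll_ctl(EPOLL_CTL_DEL)");
    }
    return close(fd);
}
```

In the posted Run() method this would stand in for the bare close(events[i].data.fd) calls.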

But the first thing I'd do to debug this is verify that the server is receiving as much data as your full collection of clients is sending. My intuition is that something's slipping through the cracks somewhere in the server's receive loop, but I don't immediately see a bug in the code, except for the one I mentioned above.
Well, the connections never get closed until I actually shut down the clients. I will eventually need to remove the FDs from the list when I code the clients to arbitrarily disconnect, but it's a non-issue at the moment.

I'm pretty sure all the clients are receiving all the data they've been sent, but I do not have code/proof to back up that claim. I'll code up some byte counters and track this information and see if that's an issue.

-=[ Megahertz ]=-


Whoops, quick double-check.


What you're saying is that with this epoll-based Linux server, the last 1-5 WinXP clients start to block on their 'send' calls, yes?

If that's the case, then you want to be counting the bytes sent by the WinXP clients and verifying that the Linux server is receiving the same total number of bytes. The likely problem is that the Linux server isn't actually pulling the data out of the TCP stream, causing the TCP window to fill up and stopping the WinXP clients from putting any more data into it.
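
The byte accounting suggested above could look roughly like this on the server side. This is only a sketch: the ByteCounters type and its member names are made up for illustration, and the client would need a matching tally of what its send calls returned.

```cpp
#include <cstdio>
#include <map>

// Hypothetical bookkeeping type (not from the original post): tallies the
// bytes received per socket and overall, so the totals can be compared
// against what the clients report having sent.
struct ByteCounters {
    std::map<int, unsigned long long> per_fd;
    unsigned long long total = 0;

    // Call with the return value of recv(); errors (<0) and EOF (0) are ignored.
    void add(int fd, long bytes)
    {
        if (bytes > 0) {
            per_fd[fd] += static_cast<unsigned long long>(bytes);
            total += static_cast<unsigned long long>(bytes);
        }
    }

    // Print one line per socket plus the grand total.
    void report() const
    {
        for (std::map<int, unsigned long long>::const_iterator it = per_fd.begin();
             it != per_fd.end(); ++it) {
            printf("fd %d: %llu bytes\n", it->first, it->second);
        }
        printf("total: %llu bytes\n", total);
    }
};
```

Calling add(events[i].data.fd, bytes) right after the recv in the posted loop, and report() on shutdown, would show which sockets (if any) never delivered data.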

Just wanted to make sure that we're not talking at cross-purposes. :)
Bleh. I figured it out based off the information in your post.

So yeah, I started tracking the bytes sent vs. what was received. Sure enough, there were inconsistencies. I thought I had taken care of a problem earlier, but I guess my fix wasn't good enough.

Basically, the client app creates a number of clients that connect to the server. My test cases that were breaking had somewhere over 50 clients. Anything less and it was OK, though technically it can still break given what I now know. With that many clients connecting at once, the server couldn't keep up and some connections never actually got connected. But I would think that send would throw an error on a connection like that. It'll need further review to figure out what's going on. Half-open connection, maybe?

Anyway, to fix that I put a Sleep(1) in between each client calling connect, to give the server time to keep up. Well, I guess it worked the few times I tested it, but after that it worked maybe 1 time out of 100.

So yeah, I was sending X and receiving less than X.

I upped the Sleep from 1 to 10 and that got all the clients connected and everything is being sent AND received.

I've tried upping the backlog on my Linux box, but it's not working AFAICT. Something else I'll have to look into.
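
One thing worth checking on the backlog: Linux silently caps the value passed to listen() at the system-wide limit in /proc/sys/net/core/somaxconn, so asking for a larger backlog in code has no effect until that limit is raised too. A small sketch for seeing what the kernel will actually grant; the function names read_somaxconn and effective_backlog are made up for illustration:

```cpp
#include <algorithm>
#include <fstream>

// Hypothetical helper: returns the system-wide cap on listen() backlogs,
// or -1 if the proc file cannot be read.
int read_somaxconn()
{
    std::ifstream f("/proc/sys/net/core/somaxconn");
    int value = -1;
    if (!(f >> value)) {
        return -1;
    }
    return value;
}

// Hypothetical helper: the backlog the kernel would actually use for a
// given request. An unknown cap (<= 0) leaves the request unchanged.
int effective_backlog(int requested, int cap)
{
    return cap > 0 ? std::min(requested, cap) : requested;
}
```

For example, effective_backlog(1024, read_somaxconn()) yields 128 on a box where the cap is still 128, no matter what the server passes to listen().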

Also I read a snippet while searching the web about a possible bug with fast connecting clients. I'll have to look into that too.

Anyway, thanks for the help. A second pair of eyes and ears always works wonders. =)

HAPPY NEW YEAR GUYS!!

-=[ Megahertz ]=-

