
boost::asio causing seemingly random crashes

I'm using boost::asio for UDP communication and I'm experiencing what appear to be random crashes, which I don't even know how to begin to solve. I've been trying to figure this out for 3 weeks now and I'm about at my wit's end.

Some information I've discovered while working on this:

  • It happens about 1 in 3 times I run it.
  • When I add console output using std::cout that is written once every frame, it never happens. I have no idea why this would affect it.
  • It only happens when I run the exe directly. When I run it in Visual Studio in either debug or release mode, it's fine.
  • It only happens when I'm running the client and the server within the same process. If I run one machine as a dedicated server and another as a client, it does not happen.
  • It only seems to be a problem when it's in my actual game engine. I created a bare-bones process that does nothing but simulate the UDP communication that the game would do, and that works fine.
  • I've tried having the client and server IO done in two separate threads with two separate io_service objects, as well as in the same thread with a single io_service. There was no difference either way.

The crash happens during the initiation/synchronization step. The protocol is as follows:

  1. The client sends a connection initiation packet to the server.
  2. Upon receiving this, the server sends the client a few packets containing a list of UUID and integer ID pairs.
  3. The client receives each of these packets and builds its own map of the server-generated IDs to the client-generated IDs.

Here's a screenshot of the console output, which shows that it is crashing in the middle of writing a line in the client's handleReceive function. I should note, it doesn't always crash at this very moment. Sometimes it does write out all console lines fully.

Lrb69mQ.png

 

Here is the code of the handleReceive function in the client IO thread:

void UdpClient::handleReceive(const boost::system::error_code& error, std::size_t bytesReceived)
{
    if ((!error || error == error::message_size) && bytesReceived > 0 && udpPacket_.decodeHeader())
    {
        UdpPacketType packetType = udpPacket_.getPacketType();
        int bodyLength = udpPacket_.getBodyLength();
        char* body = udpPacket_.getBody();

        switch (packetType)
        {
            case UDP_PACKET_SERVER_ID_LIST: // Read a batch of IDs from the server, and map them to client IDs.
            {
                DebugHelper::streamLock->lock();
                std::cout << "[" << boost::this_thread::get_id() << "] Synching IDs on client. " << std::endl;
                DebugHelper::streamLock->unlock();

                int uuidSize = boost::uuids::uuid::static_size();

                // Combined size of a UUID and an integer ID.
                int combinedSize = sizeof(int) + uuidSize;

                for (int i = 0; i < bodyLength; i += combinedSize)
                {
                    // Read the UUID from the packet body.
                    boost::uuids::uuid uuid = networkUtility_.buildUuid(body, i);

                    // Read the server generated integer ID from the packet body.
                    int serverId = networkUtility_.buildInteger(body, i + uuidSize);

                    // Get the locally generated integer ID.
                    int localId = BaseIds::getIntegerFromUuid(uuid);

                    clientLayer_->mapLocalIdToServerId(localId, serverId);
                }

                DebugHelper::streamLock->lock();
                std::cout << "[" << boost::this_thread::get_id() << "] ID batch processed on client. " << std::endl;
                DebugHelper::streamLock->unlock();
                break;
            }

I don't know if maybe there are some tricks to working with threads, or UDP/IP that might be helpful? I'm just completely stuck at this point and I don't know where to go from here. Any ideas are appreciated!

 

From the symptoms, it sounds like the difference is in timing changes. Adding printing will change timing, as will most of the other things you're talking about.
You haven't shown us the stack trace of the crash, nor the declaration/use of the variables involved in the crash.

You can also log to a file rather than the console, to make sure that you capture all output before the crash.
Or log to OutputDebugString().
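
A minimal sketch of the file-logging idea, assuming a hypothetical helper named appendLogLine and a log path of your own choosing (neither comes from the original code). Flushing after every line hands the text to the operating system before the next statement runs, so the last line written should still be in the file if the process crashes right afterwards.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: append one line to a log file and flush immediately,
// so the line has left the program's own buffers before anything else runs.
bool appendLogLine(const std::string& path, const std::string& message)
{
    std::ofstream out(path, std::ios::app);
    if (!out)
        return false;          // the file could not be opened

    out << message << '\n';
    out.flush();               // push the line out of the stream buffer now
    return static_cast<bool>(out);
}
```

Reopening the file for every line is slow, but for a short-lived debugging session that is usually an acceptable trade for not losing the final lines.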

Separately, and probably not related, but I thought I should mention it: the "streamLock" you're using is locked manually, rather than through an RAII "locker" object that automatically unlocks when leaving scope. Manual locking is a very bad pattern, and you should switch to a lock holder that is a local variable.
Edited by hplus0603


 

Is there a way to get the stack trace when running the exe? It never happens when running inside Visual Studio.

 

And thanks for the tip about locking, I will change that.

Edited by JackOfCandles



Once it crashes, you get a dialog box.
You can use that dialog box to attach with Visual Studio.


 

 

 

Sorry for the long delay in getting back to you. I had a technical issue that prevented me from continuing, followed by a family emergency that I had to leave town for. Just a string of bad luck! But now everything is back to normal and I can continue working on this. I was able to attach the VS debugger to the faulting process as you suggested, and based on the call stack it looks like it is crashing when rendering.

vIIOKXt.png

I think atioglxx.dll is related to my video card driver. As a test, I tried running it on a machine with a different video card, and sure enough it worked fine. I did try updating the driver, but there didn't seem to be any change. That DLL is still the same version I originally had.

 

I really have no idea how the multi-threaded UDP I/O would in any way be causing a fault in the graphics driver. Maybe this isn't as big of a deal as I thought, as my card is pretty old (Radeon HD 5970 from 2009), but it feels really unsatisfying to just leave it hanging, knowing that this could be happening for who knows what other video cards.

What if you record the data you receive on UDP, and write it to a file, then run your program in a mode that plays back the file without using a socket?
That way, you should be able to reproduce the crash without using the networking code.
If you can do that, then it's pretty clear that it's an ATI OpenGL driver issue. Those are not particularly uncommon.


 

Well, I'm not sure this would make a difference, because at one point not only did I try sending no data (other than the packet header), I also tried removing the body of the handleReceive function as well. And now it just got more puzzling. I removed the call to the render function just to see what happens, and while this did decrease the frequency of crashes, it does still crash fairly often. Each time I attached to the debugger to check where it's crashing and commented the code out, only for it to crash again somewhere else. It's turned into a sort of whack-a-mole game.

Here are a couple screenshots of the other crashes and their call stacks. At this point, I'm not convinced it is an OpenGL driver issue. One of the crashes happened in the fmod dll, and one in my own code with the line in question being a simple boolean conditional. It seems unlikely that there would be a bug in all of these independently. It almost feels like something is failing in the network IO thread, but it's breaking on whatever instruction it is executing in the main game loop thread.
 

MatQ2fe.png

 

c9ZHUQ1.png

Sounds like you have a "memory smasher" bug, such as a write-after-delete.

You might want to try enabling different memory debugging tools, depending on what version of Visual Studio you have.
You might want to look at the memory hex dump before and after the data that's been corrupted.
You can look at the disassembly, figure out which registers point at the bad areas, see where those values come from, and then hex dump that memory to look for patterns.

Another option, if this crash is reproducible, is to use a memory write breakpoint to figure out what's writing the thing that's crashing.


 

Ooh that sounds bad. Currently I'm using the free Visual Studio Express 2015, so I'm not sure if those tools are available, but maybe I'll have to bite the bullet and buy the real thing. Hopefully I can still buy 2015, because I tried 2017 originally but had to downgrade because boost wasn't compatible with it when trying to build the boost::python DLLs. Ughh... I'll see what I can find though!


I would also recommend inspecting the code very carefully for bad use of memory and pointers. Just from those screenshots above there are far more 'new' and 'delete' statements than I am used to seeing in modern programs, which implies this is quite dangerous code. I certainly wouldn't expect 2 'new' calls in an input-polling function. Consider whether these can be changed to local or member variables, and if they do have to be dynamically allocated, consider using standard library containers such as vectors, or using smart pointers to manage the lifetime.
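
As a rough illustration of that advice, here is a sketch of a polling function that needs no new or delete at all. The types InputEvent and InputDevice and the functions pollInput and makeDevice are hypothetical; the code in the screenshots is not reproduced here, so this only shows the general shape.

```cpp
#include <memory>
#include <vector>

// Hypothetical types for illustration; not taken from the poster's engine.
struct InputEvent
{
    int key;
    bool pressed;
};

struct InputDevice
{
    std::vector<int> pendingKeys;   // keys pressed since the last poll
};

// Returns the events by value. The vector owns its storage and frees it
// automatically, so there is no matching delete to forget or to run twice.
std::vector<InputEvent> pollInput(InputDevice& device)
{
    std::vector<InputEvent> events;
    events.reserve(device.pendingKeys.size());
    for (int key : device.pendingKeys)
        events.push_back(InputEvent{key, true});
    device.pendingKeys.clear();
    return events;
}

// When an object really has to live on the heap, std::unique_ptr ties its
// lifetime to its owner and destroys it exactly once.
std::unique_ptr<InputDevice> makeDevice()
{
    return std::make_unique<InputDevice>();
}
```

The caller simply writes auto events = pollInput(*device); and lets both objects clean up after themselves when they go out of scope.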


I just wanted to update this, in case anyone comes across it in the future looking for answers. I'm pretty sure I know what the problem was, though I was unable to verify it for certain. I was using the same port number for both the client and the server. So when the client and the server were running on the same machine, it would sometimes be writing to both at the same time, causing them to step on each other. I believe this would explain all of the symptoms I was experiencing: it only happened when running both on the same machine, and it only happened during the initial synchronization process (all other types of communication happen in a more ordered manner). As I said, I wasn't able to verify this for certain, because I discovered it in the process of switching from boost::asio to SDL_net, and I would have had to spend a ton of time reverting a lot of changes (my reasoning was that SDL_net is more lightweight, and I didn't really need the advanced functionality offered by boost::asio). To think, such a minor, stupid mistake is probably the cause of such a massive headache. Ugh...

 

Thank you for all of the advice offered in this thread, it is much appreciated!

 

Quote

I was using the same port number for both the client and the server. So when the client and the server were running on the same machine, it would sometimes be writing to both at the same time, causing them to step on each other.

 

That doesn't sound quite right to me.

If you use the SO_REUSEADDR and/or SO_REUSEPORT socket options, more than one socket can send from, and receive on, the same port number. The kernel will allocate incoming data to one of the sockets in some way that is not usually deterministic. However, nothing will "step on" anything else in this setup. It's totally supported by the kernel API and system libraries.

Maybe what is happening is that one of your programs (the client, for example) sends a message it thinks is going to the server, but the kernel turns around and hands it right back to the client, which then does not process it correctly.

If that is the case, and causes your program to crash, then your program has a remotely exploitable bug and running your game would expose you to possible shenanigans from the greater internet. No matter what kind of data arrives on a socket, and in what order, your program should never mis-process the data or crash. For random noise, you should detect that you don't know what the packet means, and stop processing it early. For packets that seem correctly formed but have the wrong "meaning," you should detect that they don't make sense in the context ("server packet received on client") and log the problem and ignore the packet. Similarly, if packets contain fields that are too short, or too long, for the intended usage, you should detect this, mark the packet as corrupt, and stop processing. If you ever let data under the control of a remote computer point your program at uninitialized data, or past the end of the packet, or into a part of your program that hasn't been initialized yet, your program has a remote code execution vulnerability.
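
To make the "too short or too long" point concrete, here is a sketch of a bounds-checked parser for the ID-list body described earlier in the thread (a 16-byte UUID followed by an integer ID, repeated). The names parseIdList, IdRecord, kUuidSize and kRecordSize are hypothetical, and the 4-byte big-endian encoding of the ID is an assumption, since the wire format is not shown. The point is only that a body which is not a whole number of records gets rejected before anything is read from it.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kUuidSize = 16;               // bytes in a UUID
constexpr std::size_t kRecordSize = kUuidSize + 4;  // UUID + 32-bit ID (assumed)

struct IdRecord
{
    std::array<std::uint8_t, kUuidSize> uuid;
    std::int32_t serverId;
};

// Parses the body of an ID-list packet. Returns false, leaving `out` empty,
// if the body is missing, empty, or not a whole number of records, so no
// read can run past the end of the buffer.
bool parseIdList(const std::uint8_t* body, std::size_t bodyLength,
                 std::vector<IdRecord>& out)
{
    out.clear();
    if (body == nullptr || bodyLength == 0 || bodyLength % kRecordSize != 0)
        return false;

    for (std::size_t offset = 0; offset < bodyLength; offset += kRecordSize)
    {
        IdRecord record;
        std::copy(body + offset, body + offset + kUuidSize, record.uuid.begin());

        // Assumption: the ID is sent big-endian (network byte order).
        const std::uint8_t* p = body + offset + kUuidSize;
        const std::uint32_t raw = (std::uint32_t(p[0]) << 24) |
                                  (std::uint32_t(p[1]) << 16) |
                                  (std::uint32_t(p[2]) << 8) |
                                  std::uint32_t(p[3]);
        record.serverId = static_cast<std::int32_t>(raw);
        out.push_back(record);
    }
    return true;
}
```

By contrast, a loop of the form for (i = 0; i < bodyLength; i += combinedSize) reads a full record on its last iteration even when fewer bytes remain, which is the kind of overrun the paragraph above warns about.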

 

 


I'm not sure if I am using REUSEADDR or REUSEPORT, I've not heard of those. I should note that while I said the client and server are running on the same machine, it is more accurate to say they are running in the same process. I have two GameStateManager objects and I call update on both inside the main game loop. I'm not sure if that makes a difference though. I do have some packet verification code. I generated a protocol ID that I check for, for example, and there is a packet header which contains the size to verify it is not greater than the max size, but there is no distinction between a packet for a client and for the server. Well that's not true actually, I store a packet type value, which can be used to implicitly determine if it is a type meant for the client or for the server, and it would get ignored in the switch statement in the processPacket() function.
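
The checks described above (protocol ID, declared size, packet type) could be made explicit in one place, along with a direction check so that a server-only packet arriving at the client is rejected by name rather than falling through a switch. This is a sketch under assumptions: the header layout, the protocol ID value, the size limit and every identifier below are invented for illustration and are not the actual packet format.

```cpp
#include <cstddef>
#include <cstdint>

// All values below are assumptions for illustration.
constexpr std::uint32_t kProtocolId = 0x47414D45;
constexpr std::size_t kHeaderSize = 7;    // 4-byte protocol ID, 1-byte type, 2-byte body length
constexpr std::size_t kMaxBodySize = 1024;

enum class PacketType : std::uint8_t
{
    ClientConnect = 1,   // sent by the client, handled by the server
    ServerIdList = 2     // sent by the server, handled by the client
};

enum class Receiver { Client, Server };

enum class PacketCheck { Ok, TooShort, WrongProtocol, BadLength, UnknownType, WrongDirection };

PacketCheck validatePacket(const std::uint8_t* data, std::size_t bytesReceived,
                           Receiver receiver)
{
    if (data == nullptr || bytesReceived < kHeaderSize)
        return PacketCheck::TooShort;

    const std::uint32_t protocolId = (std::uint32_t(data[0]) << 24) |
                                     (std::uint32_t(data[1]) << 16) |
                                     (std::uint32_t(data[2]) << 8) |
                                     std::uint32_t(data[3]);
    if (protocolId != kProtocolId)
        return PacketCheck::WrongProtocol;

    // The declared length must respect the limit and match what actually arrived.
    const std::size_t bodyLength = (std::size_t(data[5]) << 8) | std::size_t(data[6]);
    if (bodyLength > kMaxBodySize || kHeaderSize + bodyLength != bytesReceived)
        return PacketCheck::BadLength;

    switch (static_cast<PacketType>(data[4]))
    {
        case PacketType::ClientConnect:
            return receiver == Receiver::Server ? PacketCheck::Ok : PacketCheck::WrongDirection;
        case PacketType::ServerIdList:
            return receiver == Receiver::Client ? PacketCheck::Ok : PacketCheck::WrongDirection;
    }
    return PacketCheck::UnknownType;
}
```

Returning a distinct result for each failure makes it easy to log exactly why a packet was dropped, which helps when a client and a server in the same process end up seeing each other's traffic.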

Edited by JackOfCandles


You are saying that the client and server are using the same port. You cannot use the same port for two sockets on the same computer unless you turn on some feature to allow sharing of ports.

Is it perhaps the case that your program is using a single socket, for both client and server, when running inside the same process? If so, there is no way you can reliably pass messages back and forth between client and server.

Is it perhaps the case that the client and server GameStateManager use the same socket object, but run in different threads? If so, they could totally have thread-unsafe code in them, that would cause corruption in your address space, totally separately from what networking you're actually using.

It sounds to me as if you've bitten off a slightly bigger chunk than you can reasonably chew at this point -- you're using advanced, asynchronous libraries, networking, and multiple logical processes in a single physical process, all of which are pretty advanced concepts and require significant experience to get right. You may find that you're better off if you simplify the code, as follows:

  1. Only run one client or one server in one process. It's OK to compile in the code for both, as long as you only activate one, perhaps based on command line options.
  2. For a server, create a socket, and bind it to a known port, on address INADDR_ANY.
  3. For a client, create a socket, and do not bind it; instead use sendto() to send the data to the server.
  4. To poll your network for incoming data, set the socket to non-blocking when you create it, and simply call recvfrom() in a loop until there are no packets to dequeue.
  5. This means your code can use a single thread per client and per server.

With the requirements of ASIO out of the way, and the requirements of multi-threaded code out of the way, and the requirements of shared client/server in the same process out of the way, this should make it possible for you to concentrate on the networking code, and make it easier to reproduce and debug whatever problems you're having.


I did already remove boost::asio from the equation, replacing it with SDL_net (and set a separate port for client and server), and I am no longer having any problems. That's not necessarily conclusive that the problem was because of the boost::asio stuff, but that's what I'm leaning towards. As it stands, everything seems to be working correctly at this point.

Edited by JackOfCandles

