Stressing IOCP server

24 comments, last by ramdy 13 years, 3 months ago
Hello,

First of all, I'm not a network expert. I have been learning about sockets for some months because I need to implement an efficient server for that game we have all dreamed about (I especially want it to be robust and stable). After implementing the very basic I/O models (blocking, WSAAsyncSelect, etc.), I ended up working with the completion port model. The work is almost finished and I am now in the beta-testing phase. For that, I start a server and then start around 10 clients on other machines on the LAN. These clients continuously send strings (around 200 characters long) to the server, and everything works fine, but after 1-2 hours (when the server has collected about 100-200 MB), the server crashes :'(. Of course, I could prevent all this flooding, but it is a stress test, after all.
I look at memory and the server stays stable (no memory leaks). Maybe I'm pushing the server too hard, so it can't keep up with so much data and something overflows somewhere... I don't know whether that is something that can happen, and I don't even get an error message :'( It just crashes.

Thanks in advance!
You mean your socket server program crashes? I'd add better logging first. You can't force too much data on a server. The kernel manages the receive buffers, and you can set their size, but it won't just crash your program. It could be returning an error you're not catching, though. It could also be a buffer overflow in your code. Are you debugging your code to see where it crashes?

Also, for creating a stable server, I recommend just using boost asio.

I look at memory and the server stays stable (no memory leaks). Maybe I'm pushing the server too hard, so it can't keep up with so much data and something overflows somewhere... I don't know whether that is something that can happen, and I don't even get an error message :'( It just crashes.



What's wrong with the following debugging method?

1) Start server process
2) Attach with debugger
3) Set debugger to break on exceptions
4) Start hammering the server with data
5) Wait for crash to show up in debugger

Also, if you want a server to be extremely robust, then you need to build it with redundancy and failure as design features. C++ is terrible for such architectures; I'd look at a dynamic language (like Python or JavaScript) or a runtime with explicit support for robustness (like Erlang/OTP). You get less done per CPU cycle, but you can make guarantees about robustness, which is often more important than the little-k factor of scalability.
enum Bool { True, False, FileNotFound };
Thanks, Sirisian and hplus0603.

I prefer not to use any third-party library, since I don't want any restrictions on the resulting software. I have also already spent so much time on this implementation that I wouldn't like to switch to a library now, on top of the time needed to learn it...

About debugging and the chosen language: I'm using C++ because the IOCP server is to be integrated into a 3D engine (Quest3D), which makes things very hard to debug, especially when the crash appears after several hours of stress.

What I think is that I'm stressing it "too much"; I'm just flooding the server. Otherwise, it might never crash... I have found that I could separate the parsing of a received packet from the completion port event processing, so that as soon as a packet is received I just store it and issue another WSARecv. This may help, or it may just delay the crash for some time... I'm wondering whether such a crash can happen to any server that is "open" to processing any incoming data, or whether robustness comes precisely from preventing this flooding.

Also, could you suggest a good "stress test tool"?

Thanks again.





About debugging and the chosen language: I'm using C++ because the IOCP server is to be integrated into a 3D engine (Quest3D), which makes things very hard to debug


My suggestion will work just fine with a C++ server. In fact, I've done just that myself.

These days, there are some other nice tools, too, such as VMware Replay Debugging. I wish I had had that ten or twenty years ago :-)

I prefer not to use any third-party library, since I don't want any restrictions on the resulting software. I have also already spent so much time on this implementation that I wouldn't like to switch to a library now, on top of the time needed to learn it...

Boost is under the Boost license, which is essentially "do whatever you want", and most C++ programmers are familiar with it, so it's not much of a restriction. (Usually for Boost threads and the automatic cross-platform code. For instance, the WebSocket server I wrote compiles on Windows and Linux with the same code.) I understand, though, if you don't want a dependency like that.

About debugging and the chosen language: I'm using C++ because the IOCP server is to be integrated into a 3D engine (Quest3D), which makes things very hard to debug, especially when the crash appears after several hours of stress.

I have a rotating multithreaded log library that I made using Boost threads. It works fairly well. Also, abstracting the networking so it can be easily logged is a good strategy, if you haven't done that already.

What I think is that I'm stressing it "too much"; I'm just flooding the server.

That's the idea of a stress test. The program might slow down but it should never crash.

Otherwise, it might never crash... I have found that I could separate the parsing of a received packet from the completion port event processing, so that as soon as a packet is received I just store it and issue another WSARecv.

That shouldn't be much of a problem. Processing a packet in the callback usually isn't that big of a deal. (The CPU is much faster than the I/O.) So in your receive callbacks you register another callback, right? If so, that's the normal setup and won't crash. If you fall behind, the TCP receive buffer will just fill up and flow control will throttle the senders; at worst the socket might return an error.
Some types of bugs simply cannot be solved with tools. These types of bugs come about from a misunderstanding of how an API or protocol works. In essence, you think something should work one way and code it so, but in actuality, it really works another way. The real kicker is that the "wrong" way works 99% of the time under regular conditions, but you notice a small number of bugs where everything just breaks down and you can't figure out why. Most of the time, these bugs only show themselves after hours and hours of testing, which obviously makes them harder to debug. Even knowing what the problem is doesn't mean you can instantly solve it, because you are unaware of the underlying misunderstanding about how it all works. Let me give you an example of such a bug I faced at the end of last year. This might seem kind of long, but I'd advise you to read it through, as reading about other people's experiences might give you ideas for how to solve your own problem.

I was writing a "proxy" program for a 3rd party protocol that would intercept all the packets and allow users to work with them as needed for logging, analysis, etc. In that particular branch of code, I migrated fully to boost::asio to make my life simpler. I had spent the better part of 3 years trying to write my own net code, trying out IOCP, select, WSAEventSelect, and everything else that's doable, trying to find the "perfect" solution for me. Needless to say, boost::asio solved all my problems and allowed me to cut my code base by over 75%. My old projects' code was easily 1200+ LoC each, and my new boost::asio code was in the range of 300-400 LoC.

Anyways, so I had my new boost::asio wrapper that I was happily using. I made the rest of my code cross platform as well so I can use the project on Windows and Linux. I began days of testing to make sure it all worked. At first, everything seemed to be working fine on Windows. I'd run my proxy and run a game client through it and everything was as expected. No crashes and no disconnects. Content with that, I then began running my tests on Linux. I used VirtualBox and Fedora for those tests. For the first couple of hours, everything seemed normal, but one client out of a few would randomly crash. I reran all the tests on Windows and could not reproduce such behavior.

At that point, I wasn't sure what to blame. Windows seemed to work fine; Linux mostly did, but it had the occasional bug. Was it a bug in my code, boost::asio's code, or was something happening on the server that was causing the client to crash? I wasn't quite sure, so I began debugging. Starting out, I logged all the packets the client received as well as the proxy. That was a lot of data that didn't really help much, since I didn't quite know what to look for. With any state-based logic system, there's so much going on that "valid" packets can trigger a crash under the right circumstances if the state is corrupt from something else, so that was a dead end. I wrote custom debugging tools for the client to find out why it crashed. This helped a lot, since I was able to see why the client died. It turned out it was receiving packets with an entity id it did not know, and the poorly coded client would simply abort. So I was getting somewhere, but why did this seemingly only happen on Linux and not Windows?

I went back to my code and began looking over it for anything that could possibly be wrong. Note that the actual network code was so small that I was deceived into thinking it was a library problem for a while. I'm talking about probably only 100 lines of real code that did anything, and those were still pretty simple. So I figured at that point that I must have a fundamental misunderstanding about something, but I didn't know what. I was looking through my code, and the only thing whose guarantees I could not explain was my use of strands. To avoid locks and ordering issues, I simply posted events to a strand so they'd be executed serially.

I started a new project which only used the strand and io_service to check my logic. I made a simple test where I posted two events to the strand and outputted the results. If my logic was correct, I should always see message 1 followed by message 2. I did. I ran the tests 49 times in a row and everything was as expected. However, on the 50th run (these are real numbers, not made up) the second message came first and then the first message. I reran the tests over and over and it seemed on average maybe once in 50 or so attempts the messages would get reordered. I knew right then this was my problem, but why was it happening was the bigger question.

I went through the docs and I reread everything. As I reread, I noticed some subtle wording of things. In my code, I was using the syntax of:
io_service.post( strand.wrap( boost::bind( Func ) ) );
which worked, so I thought it was correct. However, this code does not do what I thought it would. The strand object doesn't get to wrap the call until after the post function has completed. This means that if you had two of these calls next to each other, one might actually preempt the other, since handlers are not guaranteed to run in order when calling post on the io_service. It was my understanding that using the strand would make it so, but the strand only comes into play after the fact.

After playing with code some, I realized the correct syntax I was supposed to be using in my packet handling logic:
strand.post( boost::bind( Func ) );

I updated all my code and reran all my tests over and over. Problem solved. Everything worked as expected all the time on both OSs. After I solve any serious bug like this, I take the time to look at what went wrong and why. I ask myself: could I have avoided this problem? I came to the conclusion that in this case the answer was no. I simply had a fundamental misunderstanding about the way strand worked, and even after reading the docs I didn't quite understand the way the API was meant to be used. Most of the time it worked, and I was getting ready to release code that would have been buggy. Luckily, if I notice any problems, I know they are not flukes and that there is an underlying issue to be addressed.

So how can this help you? I'd take hplus's advice: run your server in debug mode in your debugger, hammer the server, and wait for the crash to show up. Most likely, the crash will be due to some memory problem, such as a WSAOVERLAPPED object being freed while in use, some sort of corruption under a certain circumstance, or maybe an object not being cleaned up after use and leaking resources. It's also possible you are running out of non-paged memory, but that's unlikely, since if you do, your system usually just BSODs: drivers need such memory, and if it's not available they'll crash and burn hard.

In either case, though, you need to decouple your networking system from the rest of the engine so you can test it stand-alone to find the problem. You want modularity so you can build a system out of pieces that you are sure work. Trying to write one large integrated system is hard. Maintenance can be a nightmare, and debugging even more so when things go wrong. Having the ability to test a component outside of the system will make life a lot easier for you in the short and long run.

Lastly, I would strongly suggest that anyone who wants high-performance networking on Windows, with code that also runs on Linux and on both x86 and x64 platforms, give boost::asio a go. However, as my own past problems show, doing so won't by itself fix your problem or make you a better programmer. After you fix the bug, if you do, you should consider why it happened, how you can prevent it from happening again, and whether you really should use another library as the core that takes care of such things for you. After struggling with tons of problems over the years of doing it myself, I decided it was time to let people who knew what they were doing and were willing to share it with the public (boost) handle it for me, and I'll never write custom core code again!

Good luck!

Some types of bugs simply cannot be solved with tools.


I respectfully beg to disagree. Having used all kinds of loggers, kernel monitors, in-circuit emulators, source-level debuggers, source code analysis tools, oscilloscopes, virtualization, logic probes, and other things I've probably forgotten about, I know that the tool *always* exists to help you find the bug. At times, you may not have access to the best tool for the job (try getting an ICE for a high-speed multi-core CPU and see whether you won't be out your pocket money for quite some time after that :-) but then, you can try another tool.

The best tool, though, is going back to design principles. If you assume something should happen in a certain way, then assert that. If you think A should always complete before B, then insert (interlocked increment) counters in A and B, and assert that the A counter is always greater than B. If the assert hits, you know you have a problem in your assumption. Now, inspect the code you think ensures the property you're relying on. Add asserts. Repeat :-) Sometimes you have to disassemble the code, looking for a compiler bug (only to realize you spelled some constant wrong and the hex value of the constant isn't what you thought it would be). Sometimes you have to step into system libraries. That's just what it takes to develop in C/C++!
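
To make the counter idea concrete, here is a minimal sketch in portable C++, using std::atomic as a stand-in for the Win32 interlocked functions. All names (g_aCompleted, MarkADone, and so on) are made up for illustration, and it assumes every B is paired with exactly one A that should have finished earlier.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counters, one per step whose ordering is being assumed.
static std::atomic<long> g_aCompleted{0};  // bumped when an A finishes
static std::atomic<long> g_bStarted{0};    // bumped when a B begins

// Call as the very last thing step A does.
void MarkADone() {
    g_aCompleted.fetch_add(1);
}

// Call as the very first thing step B does.
// Returns false if more Bs have started than As have completed,
// i.e. the "A always completes before B" assumption was violated.
bool MarkBStartAndCheck() {
    const long b = g_bStarted.fetch_add(1) + 1;  // count including this B
    const long a = g_aCompleted.load();
    return a >= b;
}

// Convenience wrapper for debug builds: trips the assert on a violation.
void BeginB() {
    const bool ok = MarkBStartAndCheck();
    assert(ok && "B started before its A completed");
    (void)ok;  // avoid an unused-variable warning when NDEBUG is defined
}
```

Calling MarkADone() once and then starting B twice would make the second check fail, which is exactly the kind of broken assumption the assert is meant to surface.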

Replay debugging is great for hard-to-reproduce bugs because it will capture the problem in the act, and you can replay the failure over and over, seeing exactly how it goes wrong, and then start tracing the bug backwards. (Data breakpoints are sometimes a godsend, btw!)

That being said, for some of the most frustrating bugs, I find that starting to compose a bug report/question on a forum like here, or stackoverflow, is the best debugging tool. While doing so, I think, in my mind, about all the lame objections that people will have to my bug report, and I formulate a pre-emptive explanation of why I've excluded that possible cause for the bug. In doing so, I often end up formulating a response to some question, only to realize I haven't actually verified the answer -- and that turns out to be the solution. The post never needs to be posted.

Anyway, thinking "this can't be debugged" is generally not helpful. Of course it can be debugged. What you should be saying is either "_I_ can't debug this" (at which point you should take all the help you can get) or "debugging this isn't worth the effort" (in which case you should physically delete the code from your machine). Everything _can_ be debugged, with sufficient application of effort and taking of advice :-)
I have to agree with hplus0603 on this, running a debugger (replay debugging, if you have it, is great -- it's like repeatedly watching a video of a bank robbery except you can watch the robbers from different angles and look under their masks, too) is the "normal" and canonical way of addressing a failure. It's the simplest and (usually) safest thing to do.

Though, in some cases, logic can help you narrow down the possible causes too. Assumptions are very dangerous since they can be misleading, but they can sometimes still get you there.

You mention that memory usage is constant, so I would assume that unless you're on a 32 bit system (you did not mention) and have some serious memory fragmentation going on (which may cause allocations to fail despite the working set not growing -- but this is an error condition that you should be able to catch, it shouldn't just crash and burn without a hint!), it is most likely a thread synchronisation / locking issue.

IOCP works by waking up a worker thread when data arrives, and the worker does "something". IOCP works reliably for thousands of programs on millions of systems, every day. Unless you have a bug in your IOCP code of course, but then it would not work fine for 250,000 packets before failing. It would crash and burn the first second.
Plus, it would not only fail under high concurrency, but also when you connect a single client. So, this is not likely it.

It also isn't likely something not being initialized (a null pointer, or a file not opened), because that too, should crash during the first second and regardless of concurrency.

Concurrent memory allocation/deallocation works perfectly well and 100% reliably on millions of systems every day, provided that you don't do any special tampering or outright wrong things. Examples of that are freeing memory that was allocated on a different heap, using the HEAP_NO_SERIALIZE flag, using a buggy homebrewed allocator, or using a non-ABA-safe lockless scheme (in a false sense of optimization).
Freeing from the wrong heap can happen accidentally if you allocate in a DLL and free in the main program, but similar to bad pointers, this won't work nicely 250,000 times before crashing.
Other than that, you would have to explicitly do something, so you would know if that was the case.

Which leaves writing to some shared resource from a worker thread without proper locking. This would be what I'd assume as the likely cause, and which would be the first thing I'd look at if the debugger doesn't point me directly at an "oh duh!" type of error on the first run.
Probably the single most important thing with threading is to make no "it will work" assumptions. It won't. Be pedantic. When reading from anything that anyone else might possibly write to, or when writing to anything that anyone else might possibly read from or write to, locking is a must.
Clever optimizations (did anyone say "lockfree"?) often turn out to bite you in the rear. Trying to avoid one critical section (which is a truly cheap thing!) often turns out to be no more than a few percent faster in reality, but often causes weeks/months of headaches at inappropriate times (in addition to being more complicated in the first place). I had to learn this lesson the hard way. :(
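
As a rough illustration of "locking is a must", here is a small sketch of a container that several worker threads append to. It uses std::mutex as a portable stand-in for a Win32 critical section; PacketQueue and RunWorkers are hypothetical names, not anything from the original poster's server.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical shared store that worker threads write received data into.
class PacketQueue {
public:
    void Push(std::string packet) {
        std::lock_guard<std::mutex> lock(mutex_);  // every write takes the lock
        packets_.push_back(std::move(packet));
    }

    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mutex_);  // reads take it as well
        return packets_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::string> packets_;
};

// Starts `workers` threads that each push `perWorker` packets,
// waits for all of them, and returns the final queue size.
std::size_t RunWorkers(PacketQueue& queue, int workers, int perWorker) {
    assert(workers >= 0 && perWorker >= 0);
    std::vector<std::thread> threads;
    for (int w = 0; w < workers; ++w) {
        threads.emplace_back([&queue, perWorker] {
            for (int i = 0; i < perWorker; ++i) {
                queue.Push("payload");
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return queue.Size();
}
```

Without the lock in Push, concurrent push_back calls on the same vector are a data race, and the typical symptom is the kind of crash that only shows up after many thousands of operations.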
Thank you all for such instructive answers!

I finally figured out how to debug my application inside VC++ after also asking for help on the Quest3D forums. The first thing I found is something strange.

Where I use:

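DWORD Ret;
// Parenthesize the assignment; otherwise Ret receives the result of the != comparison (0 or 1).
if ( (Ret = WSAStartup( MAKEWORD(2,2), &wsaData )) != 0 )
{
    StateLog.Log("WSAStartup() failed with error: ", Ret);
    Server.SetStateToError();
    return -1;
}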

In the debug info, I found that the szDescription field of wsaData has the value: 0x0b8931c4 "Winsock 2.0"
Why 2.0? Shouldn't it be 2.2?

Thanks.
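
A side note on checks shaped like the WSAStartup one above: it is worth making sure the assignment is parenthesized, because != binds tighter than =. The sketch below shows the difference with a made-up FakeStartup function standing in for the real call, so it runs anywhere; the error value used in the demo is arbitrary.

```cpp
#include <cassert>

// Hypothetical stand-in for an API call that returns 0 on success
// and a nonzero error code on failure.
int FakeStartup(int errorCode) {
    return errorCode;
}

// Without parentheses this parses as ret = (FakeStartup(...) != 0),
// so ret only ever holds 0 or 1 and the real error code is lost.
int ReturnCodeWithoutParens(int errorCode) {
    int ret = 0;
    ret = FakeStartup(errorCode) != 0;
    return ret;
}

// With parentheses, the call's result is stored first and then compared.
int ReturnCodeWithParens(int errorCode) {
    int ret = 0;
    if ((ret = FakeStartup(errorCode)) != 0) {
        return ret;  // the actual error code
    }
    return 0;
}

// Small demonstration of the difference.
void DemoPrecedence() {
    assert(ReturnCodeWithoutParens(10091) == 1);
    assert(ReturnCodeWithParens(10091) == 10091);
}
```

The branch is taken in both versions when the call fails, so the only visible symptom of the unparenthesized form is a logged error code of 1 instead of the real one.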
