Would anyone like a nice IO Completion Ports example?

For anyone who polls this board: I have a rather nice and clean example of using IO completion ports with sockets to make an echo server, and I am using it for interviewing right now. Obviously I am not going to need it once I am done, and I would not mind posting most of the code (it's a lot of code) along with a short description of what is going on.

This would be very useful to someone starting out with server code, because it touches on some very complicated topics like lock-free algorithms, synchronization, and lookaside lists. This is not something that someone outside of a professional programming environment would be likely to see. I want to post this because all the completion port examples I have seen on the net are either fundamentally broken, have improper/broken synchronization, or are just very bad code - at one point in my life I wasted a lot of time with those examples.

The lack of any good examples makes me nervous; I am not sure if anyone would actually take the time to review some very complex C code and see how completion ports and the surrounding synchronization, buffer handling, and lock-free algorithms are correctly implemented in a working system.

So... Would anyone be interested in seeing this code, or will I just be wasting a lot of my time doing a write-up and post? Please post if you would like to see it. One rule: if your post includes a request to translate this code into C++/C#/Java or to port it to Linux, please don't post.

-Karl Strings
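For readers who have never seen the shape of the pattern at all, here is a minimal sketch of the core IOCP receive loop while we wait for Karl's write-up. To be clear, this is not Karl's code: the IO_CTX struct and the single worker thread are simplifying assumptions, the echo is done with a blocking send() for brevity, and error handling is elided.

#include <winsock2.h>
#include <windows.h>
#include <stdlib.h>

#define BUF_SIZE 4096

typedef struct {
    OVERLAPPED ov;     /* first member, so the OVERLAPPED* from the port casts back */
    WSABUF     wsabuf;
    char       buf[BUF_SIZE];
    SOCKET     sock;
} IO_CTX;

static void post_recv(IO_CTX *ctx)
{
    DWORD flags = 0;
    ctx->wsabuf.buf = ctx->buf;
    ctx->wsabuf.len = BUF_SIZE;
    ZeroMemory(&ctx->ov, sizeof(ctx->ov));
    /* Completion (success or failure) is delivered through the port. */
    WSARecv(ctx->sock, &ctx->wsabuf, 1, NULL, &flags, &ctx->ov, NULL);
}

static DWORD WINAPI worker(LPVOID param)
{
    HANDLE iocp = (HANDLE)param;
    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED *ov;

    for (;;) {
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
        if (ov == NULL) break;           /* port closed or fatal error */
        IO_CTX *ctx = (IO_CTX *)ov;      /* OVERLAPPED is the first member */
        if (!ok || bytes == 0) {         /* failed I/O or graceful close */
            closesocket(ctx->sock);
            free(ctx);
            continue;
        }
        send(ctx->sock, ctx->buf, (int)bytes, 0);  /* echo back */
        post_recv(ctx);                  /* queue the next read */
    }
    return 0;
}

/* Elsewhere, after accept(): calloc an IO_CTX, set ctx->sock, associate the
   socket via CreateIoCompletionPort((HANDLE)ctx->sock, iocp, 0, 0), and call
   post_recv(ctx) to queue the first read. */

Note that everything Karl lists - correct synchronization, buffer lifetime, lookaside lists - is exactly what this sketch leaves out, which is why a full worked example would be valuable.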

I think this is a great idea. Good examples are always helpful, especially when it is clearly stated why they do things in a certain way (plus, perhaps, a note on why it is wrong to do the same thing in a different way).
Plus, I've been wanting to learn IOCP for a while now :D. You wouldn't want to leave somebody salivating for a great example and not getting it, would you?

I'm also interested in it. I want to learn IOCP too, and it's true that it is difficult to find good information and code examples on the net.

Personally, I have no more use for such an example (apart from being curious - you never know whether you can still learn something from someone else's code), but it is certainly something that would be very valuable. It might save people who are still struggling with that particular thing several weeks of pain.

It might be even more helpful if it included file i/o as well, and a Linux epoll port. Completion ports are much easier (almost trivial, once you have waded through all the documentation) to get right with only sockets, and the same is true for epoll. Getting both sockets and file reads/writes into one "thing" without blocking, however, is rather non-trivial.
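For reference, the association side of this is the easy part samoth alludes to: one port can service sockets and files alike, provided the file is opened with FILE_FLAG_OVERLAPPED. A rough sketch (not samoth's code; the KEY_SOCKET/KEY_FILE completion keys are made up for routing):

#include <winsock2.h>
#include <windows.h>

#define KEY_SOCKET 1   /* made-up keys used to route completions */
#define KEY_FILE   2

static HANDLE make_port(void)   /* assumes WSAStartup has already run */
{
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);

    SOCKET s = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP, NULL, 0,
                         WSA_FLAG_OVERLAPPED);
    CreateIoCompletionPort((HANDLE)s, iocp, KEY_SOCKET, 0);

    HANDLE f = CreateFileA("data.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_ALWAYS, FILE_FLAG_OVERLAPPED, NULL);
    CreateIoCompletionPort(f, iocp, KEY_FILE, 0);

    /* GetQueuedCompletionStatus() now returns completions for both handles;
       the key says which subsystem a completion belongs to. The hard part
       samoth describes is everything that happens after that. */
    return iocp;
}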

Quote:
Original post by samoth

It might be even more helpful if it included file i/o as well, and a Linux epoll port.


Or, instead of dealing with API peculiarities, just focus on the actual tasks.

Quote:
Original post by Antheus
Quote:
Original post by samoth

It might be even more helpful if it included file i/o as well, and a Linux epoll port.


Or, instead of dealing with API peculiarities, just focus on the actual tasks.


This has been an area of contention for me for quite some time. In most applications, abstracting the OS with a library is a good thing: it lets developers concentrate more on what they are doing and less on what the underlying OS is doing. This changes when you enter the arena of system and/or server software. Modern operating systems provide lots of support for high-performance applications, and to take full advantage of that support, one must intimately tie the application to the OS. You need to be just as concerned with the underlying OS and what it is doing as with what you are doing.

Adding general-purpose abstraction layers so close to the hardware does make your life easier, but you lose a key advantage. The vast majority of the time that advantage does not matter (super-performant code on a quad-core 2.4GHz machine gains very little for user applications), and you can use C#/.NET, Java, Boost, whatever your flavor is, to its fullest potential and be just fine. For pedagogical reasons, though, it's best that people learn what is really happening on the metal. It keeps people from writing 10,000-element linked lists that have to be iterated through every 100ms, because it is easy to write "LinkedList<MyClass> LList; LList.Find(Whatever);" without understanding the consequences of big O and cache poisoning.
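To make that last point concrete, here is a contrived illustration (not Karl's code). Both scans below are O(n), but the array walks memory sequentially and the hardware prefetcher hides most of the latency, while the list chases a pointer - and potentially eats a cache miss - at every node:

#include <stddef.h>

typedef struct node { struct node *next; int key; } node;

/* Pointer chasing: each hop can land on a cold cache line. */
int find_in_list(const node *head, int key)
{
    for (; head; head = head->next)
        if (head->key == key) return 1;
    return 0;
}

/* Sequential scan: same big O, far friendlier to the cache. */
int find_in_array(const int *keys, size_t n, int key)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (keys[i] == key) return 1;
    return 0;
}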

Anyway, it seems as if some people are interested, so I will do a write-up and post once interview season is over.

PS. The MSDN example for AcceptEx uses IOCP, and it is completely broken. If anyone is looking for a good IOCP QA exercise, go to http://msdn.microsoft.com/en-us/library/ms737524(VS.85).aspx and figure out why it is broken. If you post your answer here, I will let you know whether you are right. It's right on the surface - no digging required - so most people should be able to find it just by looking at the code sample long enough, and I doubt throwing the code into a debugger will be any help.

-Karl Strings

Quote:
Original post by Karl Strings
This has been an area of contention for me for quite some time. In most applications, abstracting the OS with a library is a good thing: it lets developers concentrate more on what they are doing and less on what the underlying OS is doing. This changes when you enter the arena of system and/or server software. Modern operating systems provide lots of support for high-performance applications, and to take full advantage of that support, one must intimately tie the application to the OS. You need to be just as concerned with the underlying OS and what it is doing as with what you are doing.


Why would that be necessary, when abstraction layers such as Boost.Asio, which Antheus suggested, simply provide wrappers around these high-performance OS facilities? One may argue that the abstraction layer might be slightly slower than calling the native API directly, but knowing the Boost community, I would say the performance loss is negligible. Besides, that sort of reasoning can be taken as far as you like: "The C++ compiler generates slow code; I'm better off writing the assembly by hand."

Quote:
Original post by Karl Strings
Adding general-purpose abstraction layers so close to the hardware does make your life easier, but you lose a key advantage. The vast majority of the time that advantage does not matter (super-performant code on a quad-core 2.4GHz machine gains very little for user applications), and you can use C#/.NET, Java, Boost, whatever your flavor is, to its fullest potential and be just fine. For pedagogical reasons, though, it's best that people learn what is really happening on the metal. It keeps people from writing 10,000-element linked lists that have to be iterated through every 100ms, because it is easy to write "LinkedList<MyClass> LList; LList.Find(Whatever);" without understanding the consequences of big O and cache poisoning.


In a sense, I agree with you on this. It is important to understand your tools. However, whether or not it's worth your time really depends on what your goal is. If you are trying to gain a deeper understanding of the technology you are working with (in this case, operating-system-specific scalable IO), playing around with it is obviously the way to go. However, if you're making a game, you will probably be better off using a reliable abstraction layer written by people who know what they are doing (such as the Boost developers). That way, you just might even finish your project. ;)

A few days ago I ran across a thread on an electronics forum where a guy presented his idea for making an ADC (Analog-to-digital converter) from discrete components. The replies were generally discouraging, and the consensus appeared to be that "10 years ago, it might've been fun; today, you're probably better off making something with an ADC." Once again, making the ADC from discrete components would enable him to learn something. Building something else using an ADC IC would allow him to accomplish something.

Quote:
Original post by Windryder
Why would that be necessary, when abstraction layers such as Boost.Asio, which Antheus suggested, simply provide wrappers around these high-performance OS facilities?


My reply was regarding "Linux and file handling", not the rest of the thread or the OP.

Quote:
Original post by Antheus
Quote:
Original post by Windryder
Why would that be necessary, when abstraction layers such as Boost.Asio, which Antheus suggested, simply provide wrappers around these high-performance OS facilities?


My reply was regarding "Linux and file handling", not the rest of the thread or the OP.


From your post I did get the feeling that you were encouraging people to apply this principle in a more general context ("focus on the actual tasks"). Hence I thought it appropriate to refer to your suggestion in my post.

Quote:
Original post by Windryder

From your post I did get the feeling that you were encouraging people to apply this principle in a more general context ("focus on the actual tasks").


If someone wants to study IOCP using plain C, that's fine.

But when it comes to cross-platform portability and related abstractions, the problem is solved satisfactorily by asio or ACE, and reinventing the wheel in that particular case is somewhat counter-productive.

Portability, especially in C and C++, is a tedious nightmare of typedefs, macros, and lots of obscure debugging around undocumented and other edge cases; studying something like that for the purpose of learning is, IMHO, a waste of time. It's just one of those things that requires a years-long tenure of hands-on practice and simply cannot be taught from a book or a source-code tutorial.

Or, at the very least, it's probably best to start with some existing abstraction and then drill down into why certain choices were made the way they were.

I am an engineer, not a clergyman. I try to stay away from religious topics such as languages and libraries, but apparently I have slipped into the trap again.

I think this thread has deviated too far from its original intention.

-Karl Strings

Try searching CodeProject.com; there are a lot of tutorials about IOCP in C++ there.

Quote:
Original post by Windryder
Quote:
Original post by Karl Strings
This has been an area of contention for me for quite some time. In most applications, abstracting the OS with a library is a good thing: it lets developers concentrate more on what they are doing and less on what the underlying OS is doing. This changes when you enter the arena of system and/or server software. Modern operating systems provide lots of support for high-performance applications, and to take full advantage of that support, one must intimately tie the application to the OS. You need to be just as concerned with the underlying OS and what it is doing as with what you are doing.


Why would that be necessary, when abstraction layers such as Boost.Asio, which Antheus suggested, simply provide wrappers around these high-performance OS facilities? One may argue that the abstraction layer might be slightly slower than calling the native API directly, but knowing the Boost community, I would say the performance loss is negligible.



My advice when it comes to Boost is to "chew the meat and spit out the bones." Some of the libraries are excellent, but not all of them. The performance of the Boost Serialization library is significantly worse than that of the C++ Middleware Writer across a number of tests. I'm not very familiar with the Asio library, but I want to point out that you should be careful regarding what you use. If you're not, your company/work will go the way of the dodo.

Another weakness of traditional C++ serialization libraries is that they don't automate the production of serialization functions. That is a bonus one gets with the C++ Middleware Writer.

Brian Wood
www.webEbenezer.net

It's quite nice to read that someone else has been working on a project similar to the one I've been on for a long time. However, instead of just focusing on sockets, I'm writing a platform for handling streams efficiently. Just like you, I originally worked in C, but after the complexity of the code increased, I decided to swap to C++ to keep the code relatively trivial, clean, and easy to maintain.

Even though completion ports do their job of avoiding needless memory bandwidth usage and cache thrashing, from a performance point of view (and to keep things trivial) it would make sense to be able to map kernel read and write buffers directly into user space, instead of mapping user space buffers into kernel space (as completion ports do).

Quote:
Original post by skorhone
... from a performance point of view (and to keep things trivial) it would make sense to be able to map kernel read and write buffers directly into user space, instead of mapping user space buffers into kernel space (as completion ports do).


Explain? I would also like to know the reason you think that mapping memory from kernel space to user space is trivial...

Just to clarify: mapping memory from one ring to another is not usually trivial. It may not be hard to "do", but it is very easy to open your code up to some wicked race conditions that are very hard to debug.

If your goal is to map memory, have fun; just be aware that you can hit some weird race conditions that may not be apparent just from looking at the code.

I think the general idea is that if the device driver gets to specify the parameters for a buffer (alignment requirements, contiguity requirements, physical address requirements, etc.), then it may be more efficient (due to hardware limitations) than some arbitrary buffer allocated by the application.

For example, if some DMA operation needs 64-byte aligned buffers and a user hands in a 16-byte aligned buffer (or even a 4-byte aligned buffer from malloc()), then chances are the driver will need to re-buffer, which triples the cost of the operation(!): the original write into an internal buffer, a read out of the internal buffer, and a write into the user buffer.

And before you say that all hardware should support DMA to arbitrary scattered memory pages with arbitrary alignment, let me point at the 99% of hardware out there that doesn't :-)
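A sketch of the application-side workaround, assuming a hypothetical driver that wants 64-byte alignment: allocate the I/O buffers aligned up front so the driver never has to re-buffer. On MSVC:

#include <malloc.h>   /* _aligned_malloc / _aligned_free */

void aligned_buffer_example(void)
{
    char *buf = (char *)_aligned_malloc(4096, 64);  /* 64-byte aligned */
    if (buf != NULL) {
        /* ... post the overlapped ReadFile/WSARecv with buf as usual ... */
        _aligned_free(buf);
    }
}

Of course, 64 bytes here is just a stand-in; as noted above, the application generally has no way to know the real requirement.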

I didn't mean that the implementation of this kind of interface would be trivial - it obviously isn't. What I meant was that such an interface could be easier to use efficiently (as you already noted, most completion port implementations are more or less wrong).

The interface could be something as simple as:

flags |= WSA_MEM_MAP;                 /* hypothetical flag */
buffers[0].buf = NULL;                /* WSARecv fills this in */
buffers[0].len = 4096;                /* max read length */
WSARecv(sock, buffers, 1, &bytesRead, &flags, overlapped, NULL);
.
<read completes>
.
WSAFree(buffers, 1);                  /* hypothetical release call */

Instead of:

WSARecv(sock, buffers, 1, &bytesRead, &flags, overlapped, NULL);  /* x3: post three overlapped reads */
.
<logic to process results in the thread pool in the order they were initiated>

Quote:
Original post by skorhone
I didn't mean that the implementation of this kind of interface would be trivial - it obviously isn't. What I meant was that such an interface could be easier to use efficiently (as you already noted, most completion port implementations are more or less wrong).

The interface could be something as simple as:

flags |= WSA_MEM_MAP;                 /* hypothetical flag */
buffers[0].buf = NULL;                /* WSARecv fills this in */
buffers[0].len = 4096;                /* max read length */
WSARecv(sock, buffers, 1, &bytesRead, &flags, overlapped, NULL);
.
<read completes>
.
WSAFree(buffers, 1);                  /* hypothetical release call */

Instead of:

WSARecv(sock, buffers, 1, &bytesRead, &flags, overlapped, NULL);  /* x3: post three overlapped reads */
.
<logic to process results in the thread pool in the order they were initiated>


Let me make sure I understand this... You want the allocation to happen in the kernel, and WSARecv will return a kernel-allocated buffer to you on completion? I am assuming that you are doing this to ensure proper memory alignment and avoid double allocations?

I think you will find better results and fewer bugs by using lookaside lists. You can allocate your memory however you need in user land at startup and free it at program close: one round of allocations and frees, everything aligned however you want, and - most important - no kernel code needed.
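For the curious, here is a minimal sketch of that idea - not from Karl's example - using the Win32 SList primitives, which are themselves a lock-free lookaside list. The buffers are carved out once at startup, and acquire/release is a lock-free pop/push:

#include <windows.h>
#include <malloc.h>

#define BUF_SIZE  4096
#define BUF_COUNT 1024

typedef struct {
    SLIST_ENTRY entry;   /* must be first, and MEMORY_ALLOCATION_ALIGNMENT aligned */
    char        data[BUF_SIZE];
} IO_BUF;

static SLIST_HEADER g_free_list;

void bufs_init(void)     /* one round of allocations at startup */
{
    int i;
    InitializeSListHead(&g_free_list);
    for (i = 0; i < BUF_COUNT; i++) {
        IO_BUF *b = (IO_BUF *)_aligned_malloc(sizeof(IO_BUF),
                                              MEMORY_ALLOCATION_ALIGNMENT);
        InterlockedPushEntrySList(&g_free_list, &b->entry);
    }
}

IO_BUF *buf_get(void)    /* lock-free pop; NULL when the list is exhausted */
{
    return (IO_BUF *)InterlockedPopEntrySList(&g_free_list);
}

void buf_put(IO_BUF *b)  /* lock-free push back onto the free list */
{
    InterlockedPushEntrySList(&g_free_list, &b->entry);
}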

True, but the point would be to make application development trivial. Completion ports have been around for over a decade, yet there is still only a handful of proper implementations, even though the documentation is above average - there must be a reason for that :)

In my opinion, an interface that is hard to use has a design flaw. Enough ranting; I'm just trying to point out that completion ports, as they stand, are probably not the ultimate solution. With a proper interface, perhaps.

Quote:
In my opinion, an interface that is hard to use has a design flaw.


If an operation is, at its core, complex, then any proper interface to it must, by extension, be complex. If you try to put a simple interface on a complex operation, you will end up doing something wrong, or something inefficient.

That being said, the wrapper idiom for I/O completion ports found in the .NET Framework is pretty elegant, in my opinion. Most well-written .NET programs use I/O completion ports and may not even know it: any time you call BeginRead() or similar, you're really using I/O completion ports with the Windows worker thread pool.

Quote:
Original post by Karl Strings
Quote:
Original post by skorhone
I didn't mean that the implementation of this kind of interface would be trivial - it obviously isn't. What I meant was that such an interface could be easier to use efficiently (as you already noted, most completion port implementations are more or less wrong).

The interface could be something as simple as:

flags |= WSA_MEM_MAP;                 /* hypothetical flag */
buffers[0].buf = NULL;                /* WSARecv fills this in */
buffers[0].len = 4096;                /* max read length */
WSARecv(sock, buffers, 1, &bytesRead, &flags, overlapped, NULL);
.
<read completes>
.
WSAFree(buffers, 1);                  /* hypothetical release call */

Instead of:

WSARecv(sock, buffers, 1, &bytesRead, &flags, overlapped, NULL);  /* x3: post three overlapped reads */
.
<logic to process results in the thread pool in the order they were initiated>


Let me make sure I understand this... You want the allocation to happen in the kernel, and WSARecv will return a kernel-allocated buffer to you on completion? I am assuming that you are doing this to ensure proper memory alignment and avoid double allocations?

I think you will find better results and fewer bugs by using lookaside lists. You can allocate your memory however you need in user land at startup and free it at program close: one round of allocations and frees, everything aligned however you want, and - most important - no kernel code needed.


I looked into this issue a bit and came up with some interesting tidbits:

1) SetFileIoOverlappedRange is new to Vista/Server 2008, and it addresses the use of the OVERLAPPED struct; this optimizes away one set of allocations.

2) Regardless of what you do, send and recv (the WSA variants included) will always copy to/from internal device buffers. The memory alignment requirements for those buffers are not going to be known by the program - in some cases maybe not even by the kernel (only the driver) - so the best solution is either to buy server-class hardware that can handle whatever you throw at it, or to make all send/recv buffers page aligned (4KiB in most cases). One way to do this is to suck up a whole page via VirtualAlloc and shove your buffer at the beginning (see the sketch after this list), but this does not scale well and may cause locality problems depending on the hardware you are using; you can end up causing more problems than you are solving.

3) Allocating memory in such a way that you can control its locality of reference will get you a much bigger bang for the buck than guessing about memory alignment issues that you simply can't know about in advance, or that change with minor hardware differences.
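For completeness, the VirtualAlloc trick from point 2 looks like this - a sketch, with the caveat given above that dedicating whole pages to small buffers hurts locality:

#include <windows.h>

void *alloc_page_aligned_buf(void)
{
    /* VirtualAlloc hands back page-aligned memory, so a 4 KiB commit is one
       page-aligned buffer; release it with VirtualFree(p, 0, MEM_RELEASE). */
    return VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}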

A serious server will be run on serious hardware; a simple server should be run on modest hardware. The penalty for misaligned buffers is not that significant when you are only dealing with a few hundred connections or light loads. The point here is that it will scale correctly without code changes if you follow the IOCP interface, so don't worry about it. Worry about how to allocate your buffers to address other performance issues.

I can assure you: when it comes to performance, the *last* place that needs tweaking is the kernel/OS/hardware. If you follow good programming practices and adhere to the API, you will have fewer bugs and better performance overall. And your program won't crash or perform badly when you change hardware or get an OS update...

Quote:
The memory alignment requirements for those buffers are not going to be known by the program - in some cases maybe not even by the kernel (only the driver)


Actually, if you look back in the forum, you'll see that when I worked for a hardware/OS company, we had great results with an I/O model where the driver got to allocate the buffers. Some of my musings come from that experience. My assertion is that if that could be the "base" API instead of the copying read()/write(), then portability of high-performance software would be a lot easier.

Also, in data centers, "serious" hardware is on the way out. Cheap, throwaway nodes (be they blades or pizza boxes) are pretty much the scalability standard, and the other important development is server virtualization. When you have a hypervisor running your "real" hardware drivers, making sure that the shuffling between your app, the kernel, the driver, the hypervisor, and the hypervisor's driver is efficient becomes even more important. Paravirtualization based on buffers coming from the bottom up, instead of the top down, is safer than mapping user pages across, and faster than copying.
