Scalable network/disk IO server - Linux best practices

Quote:a bunch of hippies and communists working for free actually are keeping up with a company spending millions of dollars on it


Personally, I've contributed to Linux in the past, and while some have called me a bleeding-heart pinko liberal, I'm certainly neither a hippie nor a communist :-)

However, in all honesty, I wouldn't say that the Linux kernel is keeping up. It has some areas of strength (I like iptables, for example!) but on average, the driver model is a lot more primitive, the support for hot-plug devices is quite arcane (try hot-plugging a USB graphics card on Linux and see how far it gets you), and device support is a lot sparser than on Windows (in part because vendors choose to support Windows over Linux, of course).

Perversely, all that kernel magic matters most on the client side, not the server side. Servers typically have simple needs: shuffle as much data as possible through the chosen APIs (networking and disk), and then get out of the way! Clients, dealing with an ever-changing device landscape (even during active use), have it much tougher. Yet another reason why Linux clients really aren't first-class citizens at this point.
enum Bool { True, False, FileNotFound };
Quote:Original post by Katie
What amazes me about core kernel performance like this is not the levels of differences between the two options, but that Microsoft actually ISN'T a generation ahead; that a bunch of hippies and communists working for free actually are keeping up with a company spending millions of dollars on it.


A common misconception. Stats concerning contributions to the Linux kernel have shown that employees of large corporations, such as Red Hat or IBM, are the primary contributors to the Linux kernel. The number of lone hacker contributions is comparatively small. Linux has become something that Microsoft's competitors are cooperatively developing, rather than an enthusiast system.
Quote:Original post by samoth
do you have any *usable* documentation regarding this?

Unfortunately, there isn't any good documentation. I had to poke around the kernel and libaio source code to figure out how to connect eventfd to aio. It's very simple once you get it working, but the fact that there is no documentation is pretty frustrating. BTW, the libaio man pages refer to *kernel* data structures, not libaio data structures. They're very similar, but the names don't match.

Anyway, you create a notification event by calling eventfd(0, 0). After that, when you issue an AIO request (via io_submit), you connect it to the event fd via an undocumented libaio function io_set_eventfd. Once you do that, it's smooth sailing - you can epoll the event fd, and when the poll wakes up to handle the event, io_getevents is guaranteed to return some events for you.
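To make that concrete, here's a minimal sketch of the wiring. Error handling is stripped, the file name and buffer size are just placeholders, and it assumes a file opened with O_DIRECT plus a 4 KB aligned buffer (which kernel AIO wants). Link with -laio:

/* eventfd + libaio + epoll in one loop */
#define _GNU_SOURCE
#include <libaio.h>
#include <sys/eventfd.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int file_fd = open("data.bin", O_RDONLY | O_DIRECT);
    int ev_fd   = eventfd(0, 0);                  /* the notification event */
    int ep_fd   = epoll_create1(0);

    io_context_t ctx = 0;
    io_queue_init(128, &ctx);

    /* register the eventfd with epoll, right next to your sockets */
    struct epoll_event eev = { .events = EPOLLIN, .data.fd = ev_fd };
    epoll_ctl(ep_fd, EPOLL_CTL_ADD, ev_fd, &eev);

    /* prepare an aligned read and attach the eventfd to the request */
    void *buf;
    posix_memalign(&buf, 4096, 4096);
    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, file_fd, buf, 4096, 0);
    io_set_eventfd(&cb, ev_fd);                   /* the undocumented call */
    io_submit(ctx, 1, cbs);

    /* the usual epoll loop: disk completions now look like any other fd */
    for (;;) {
        struct epoll_event ev;
        if (epoll_wait(ep_fd, &ev, 1, -1) < 1)
            continue;
        if (ev.data.fd == ev_fd) {
            uint64_t completed;
            read(ev_fd, &completed, sizeof completed);  /* completion count */
            struct io_event events[16];
            int got = io_getevents(ctx, 1, 16, events, NULL);
            /* ... handle 'got' finished requests; break for this demo ... */
            (void)got;
            break;
        }
        /* ... otherwise handle socket readiness here ... */
    }
    return 0;
}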

This way you can interleave network events and disk io events within epoll in a single thread, which hugely simplifies the synchronization aspects of your code (you effectively delegate all synchronization to the kernel). You then build a state machine that drives its transitions off network and IO events, and voilà - you get a highly scalable network server [smile].
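Purely for illustration, the per-connection state machine can be as small as this (the states and the on_event handler are invented names; real transitions depend on your protocol):

/* one connection object, advanced by whatever fd woke up the epoll loop */
enum conn_state {
    READING_REQUEST,    /* waiting for request bytes from the socket   */
    DISK_READ_PENDING,  /* io_submit() issued, waiting on the eventfd  */
    WRITING_RESPONSE,   /* streaming the completed buffer to the socket */
    CLOSED
};

struct connection {
    int             sock_fd;
    enum conn_state state;
    /* ... buffers, offsets, pending iocb, etc. ... */
};

static void on_event(struct connection *c, int is_disk_completion)
{
    switch (c->state) {
    case READING_REQUEST:
        /* parse the request, io_submit() the disk read, attach the eventfd */
        c->state = DISK_READ_PENDING;
        break;
    case DISK_READ_PENDING:
        if (is_disk_completion)
            c->state = WRITING_RESPONSE;   /* the data is in our buffer now */
        break;
    case WRITING_RESPONSE:
        /* non-blocking send(); once everything is written: */
        c->state = READING_REQUEST;        /* or CLOSED for a one-shot request */
        break;
    case CLOSED:
        break;
    }
}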

If you have any specific questions, I'd be happy to answer them here.
Quote:Anyway, you create a notification event by calling eventfd(0, 0). After that, when you issue an AIO request (via io_submit), you connect it to the event fd via an undocumented libaio function io_set_eventfd.
Great, thank you :-)
One day in the future, when AIO can finally handle the buffer cache (a developer provided several patches with different strategies 3-4 years ago, so it is maybe only another 5-10 years until they're accepted), this will make AIO a really good option.

Quote:This way you can interleave network events and disk io events within epoll in a single thread.
Yup, that's just how one would wish the world to be. Except for the buffer cache thing.

I still don't get why kernel developers seem to think that "asynchronous" and "buffer cache" have to be mutually exclusive. Sure, DMA is great because the controller does all the work, "zero copy" is a cool buzzword, and "no locks" sounds cool too. But from an application developer's point of view, I couldn't care less, as long as it just works. Even if buffered AIO were 300% slower than "normal" buffered transfers for some obscure reason (but why should it be?), it would still be several thousand times faster than the best you could ever expect from your hard disk.
There are certainly some applications that will actually want raw, unbuffered DMA access (for a file server shoving data off disk onto the network controller directly, or for having a guarantee that a database transaction is really on disk when you think it is [which you don't know anyway on most write-cache enabled drives today]).
But for the vast majority of applications, the buffer cache is really the most powerful optimisation ever invented. We're talking "nano" versus "milli" in access times and "giga" versus "mega" in transfer rates (this assumes that reading from the buffer cache actually has to do a memcpy, which isn't even necessarily the case if alignment constraints similar to those for DMA are imposed).
Access time and transfer rate may seem insignificant with AIO; after all, the operations run asynchronously, so the latencies are hidden. However, that is a fallacy. Access time still means you can't get more than a few hundred "operations" done per second, even if nobody else is using the physical drive. If you issue more, your requests will queue up, and at some point something will have to block or fail. Fast producer, slow consumer -- there is no other way.
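(To put rough numbers on that: assuming a ~10 ms average seek plus rotational latency, 1 s / 10 ms works out to roughly 100 random operations per second per spindle.)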
Granted, the kernel might be able to merge some operations and save a few seeks, and you can always throw a bigger RAID with more disks at the problem. So make that "thousand" or "few thousand" instead of "few hundred".
However, using the buffer cache, you'll do tens of thousands or hundreds of thousands of operations per second without even wasting a thought; the only real downside is that if you're writing and the UPS fails, you may lose data.
Quote:which is not even necessarily the case


Actually, read() and write() are defined in copying terms -- another UNIX "specialty."

In BeOS, we had some success letting the kernel/driver allocate buffers at the request of the application, with the application then referencing those buffers for async I/O. Sadly, the filesystem people had to live by the POSIX interface, so we didn't get that for file I/O, but for video and audio it worked very well!

I think a block I/O API entirely based on allocating buffers and requesting operations on those buffers would be able to perform very well. Specifically, the task of allocating a buffer doesn't necessarily need to map it into your memory space. If you're copying data, you could allocate a buffer, issue an asynchronous read, and when it completes, forward that buffer to an asynchronous write, without touching or mapping the data at all. If you actually need to read the data, you can call a function to map it into your address space -- this call would also block, waiting for any pending IO to complete before it returns (or return with an error, if non-blocking).
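Purely as an illustration of the idea (none of these functions exist anywhere; the names and signatures are invented), such an interface might look something like:

/* hypothetical buffer-centric block I/O interface */
#include <stddef.h>
#include <sys/types.h>

typedef struct kbuf kbuf;                      /* opaque, kernel-owned buffer   */

kbuf *kbuf_alloc(size_t size);                 /* allocate, not yet mapped      */
int   kbuf_read (int fd, kbuf *b, off_t off);  /* queue async read into buffer  */
int   kbuf_write(int fd, kbuf *b, off_t off);  /* queue async write from buffer */
void *kbuf_map  (kbuf *b);                     /* map into user space; waits
                                                  for (or fails on) pending IO  */
void  kbuf_free (kbuf *b);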

enum Bool { True, False, FileNotFound };
Quote:Original post by samoth
I still don't get why kernel developers seem to think that "asynchronous" and "buffer cache" have to be mutually exclusive.

Well, I think the idea might be that if the buffer cache performs well enough (> 90% hit rate), then you have no need for an async call, since the operation is non-blocking anyway. I hear you, though. You can always emulate it with thread pools, but it would be nice if the kernel offered a "real" solution.
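For what it's worth, a rough sketch of that emulation, with one worker per request shown for brevity (a real implementation would use a fixed pool and a request queue, and struct read_req is just an invented name): the worker does the blocking, cache-friendly pread() and then pokes an eventfd that the main epoll loop is already watching.

#include <pthread.h>
#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>

struct read_req {
    int     fd;
    void   *buf;
    size_t  len;
    off_t   off;
    int     done_efd;   /* eventfd already registered with the epoll loop */
    ssize_t result;
};

static void *read_worker(void *arg)
{
    struct read_req *req = arg;

    /* blocking read that goes through the buffer cache like any normal read */
    req->result = pread(req->fd, req->buf, req->len, req->off);

    uint64_t one = 1;
    write(req->done_efd, &one, sizeof one);   /* wake the epoll loop */
    return NULL;
}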


Quote:I think a block I/O API entirely based on allocating buffers and requesting operations on those buffers would be able to perform very well.

To be fair, Linux has sendfile and splice and friends...
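Indeed, for the shove-a-file-at-a-socket case, sendfile already covers a lot of ground; something like this (error handling omitted; sock_fd is assumed to be an already-connected TCP socket):

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static void send_whole_file(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    while (offset < st.st_size) {
        /* kernel moves data page cache -> socket; no user-space buffer */
        ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (n <= 0)
            break;
    }
    close(file_fd);
}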
Quote:Original post by Rycross
Quote:Original post by Katie
What amazes me about core kernel performance like this is not the levels of differences between the two options, but that Microsoft actually ISN'T a generation ahead; that a bunch of hippies and communists working for free actually are keeping up with a company spending millions of dollars on it.


A common misconception. Stats concerning contributions to the Linux kernel have shown that employees of large corporations, such as Red Hat or IBM, are the primary contributors to the Linux kernel. The number of lone hacker contributions is comparatively small. Linux has become something that Microsoft's competitors are cooperatively developing, rather than an enthusiast system.


It should, however, be noted that a lot of the corporate kernel contributions are drivers, not improvements to the core kernel itself. Corporations such as Intel and Microsoft pretty much exclusively contribute drivers.

It's still great that so many large corporations contribute drivers to the kernel, though, since drivers are one of the few types of software that should always be open source, IMHO. (Knowing that I will be able to get my hardware working in the next Windows or Linux version is far more important than having access to the source code for my applications or my OS.)

And to hplus: there are really good printer drivers for Linux, if you use hardware from a decent vendor. (Scanners are in a worse position, but there are good options there as well.)
[size="1"]I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!
Quote:if the buffer cache performs well enough (> 90% hit rate), then you have no need for an async call


But each time the buffer cache misses, you will stall the thread, which means that it cannot serve other requests (ones that might have hit the cache). Cached and non-cached traffic should both be asynchronous for best performance; it's really that simple!
enum Bool { True, False, FileNotFound };
Quote:there are really good printer drivers for Linux, if you use hardware from a decent vendor


Where "decent vendor" is defined as "a vendor with Linux drivers"? Most consumers tend to define "decent vendor" as "the vendor that gives me the best output for the cheapest price."
enum Bool { True, False, FileNotFound };
Quote:Original post by hplus0603
But each time the buffer cache misses, you will stall the thread

One way around this is to create multiple threads per CPU with decreasing priorities. That way, if one thread does stall, the others will take over the outstanding requests. This is probably good practice anyway, because even with fully asynchronous I/O you might occasionally stall *somewhere* (memory paged out, etc.). IOCP does this automatically, but on Linux it takes a bit of creativity to get the same behaviour [smile]
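One possible shape of that, as a sketch: every thread runs the same epoll loop on a shared epoll fd, but later threads get a higher nice value, so they only win the CPU when a more-preferred thread is stuck in a blocking operation. The thread count and nice step are arbitrary, run_event_loop stands in for your epoll/state-machine loop, and the per-thread nice trick (setpriority with a tid) is Linux-specific.

#define _GNU_SOURCE
#include <pthread.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

extern void run_event_loop(int epoll_fd);   /* your epoll/state-machine loop */

struct worker_arg { int epoll_fd; int nice_value; };

static void *worker(void *p)
{
    struct worker_arg *a = p;
    /* Linux allows a per-thread nice value: PRIO_PROCESS with the thread id */
    setpriority(PRIO_PROCESS, (id_t)syscall(SYS_gettid), a->nice_value);
    run_event_loop(a->epoll_fd);
    return NULL;
}

static void spawn_workers(int epoll_fd, int threads_per_cpu)
{
    static struct worker_arg args[8];
    for (int i = 0; i < threads_per_cpu && i < 8; ++i) {
        args[i] = (struct worker_arg){ epoll_fd, i };   /* nice 0, 1, 2, ... */
        pthread_t t;
        pthread_create(&t, NULL, worker, &args[i]);
    }
}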

Of course, fully asynchronous IO is much better than a blocking design for high-performance software. I was just pointing out that the situation isn't that bad.

