Quote:Anyway, you create a notification event by calling eventfd(0, 0). After that, when you issue an AIO request (via io_submit), you connect it to the event fd via an undocumented libaio function io_set_eventfd.
Great, thank you :-)
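For anyone who stumbles over this thread later, here is a minimal sketch of what that looks like in C. It is untested and error handling is omitted; "data.bin" is just a placeholder file, and you need to link with -laio for libaio (which ships io_set_eventfd as an inline helper that sets the resfd field on the iocb):

Code:
#define _GNU_SOURCE          /* for O_DIRECT */
#include <libaio.h>
#include <sys/eventfd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
    int efd = eventfd(0, 0);                 /* notification fd */
    int fd  = open("data.bin", O_RDONLY | O_DIRECT);

    io_context_t ctx = 0;
    io_setup(8, &ctx);                       /* up to 8 in-flight requests */

    void *buf;
    posix_memalign(&buf, 512, 4096);         /* O_DIRECT wants aligned buffers */

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);    /* read 4 KiB from offset 0 */
    io_set_eventfd(&cb, efd);                /* completion bumps the eventfd */

    io_submit(ctx, 1, cbs);

    uint64_t done;
    read(efd, &done, sizeof(done));          /* blocks until a completion arrives */

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);      /* reap the finished request */
    printf("read returned %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    free(buf);
    close(fd);
    close(efd);
    return 0;
}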
One day in the future, when AIO can finally handle the buffer cache (someone posted several patches with different strategies 3-4 years ago, so it is maybe only another 5-10 years until they're accepted), this will make AIO a really good option.
Quote:This way you can interleave network events and disk io events within epoll in a single thread.
Yup, that's exactly how one would wish the world to be. Except for the buffer cache thing.
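For completeness, a rough sketch of that single-threaded loop (again untested; the names event_loop, epfd and sock are just placeholders, and efd/ctx are the eventfd and io_context_t from the snippet above). The AIO eventfd sits in the same epoll set as the sockets, so disk completions and network readiness are handled by one thread:

Code:
#include <libaio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

void event_loop(int epfd, int efd, int sock, io_context_t ctx)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);         /* watch AIO completions */
    ev.data.fd = sock;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);        /* watch the socket too */

    for (;;) {
        struct epoll_event out[16];
        int n = epoll_wait(epfd, out, 16, -1);

        for (int i = 0; i < n; i++) {
            if (out[i].data.fd == efd) {
                uint64_t ncomplete;
                read(efd, &ncomplete, sizeof(ncomplete));  /* how many finished */

                struct io_event evs[16];
                struct timespec zero = { 0, 0 };
                int got = io_getevents(ctx, 0, 16, evs, &zero);
                for (int j = 0; j < got; j++) {
                    /* handle evs[j].res / evs[j].data here */
                }
            } else {
                /* ordinary network readiness: read()/write() the socket */
            }
        }
    }
}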
I still don't get why kernel developers seem to think that "asynchronous" and "buffer cache" have to be mutually exclusive. Sure, DMA is great because the controller does all the work, "zero copy" is a cool buzzword, and "no locks" sounds cool too. But from an application developer's point of view, I couldn't care less, as long as it just works. Even if buffered AIO were 300% slower than "normal" buffered transfers for some obscure reason (though why should it be?), it would still be several thousand times faster than the best you could ever expect from your hard disk.
There are certainly some applications that actually want raw, unbuffered DMA access (a file server shoving data from disk straight onto the network controller, or a database that needs a guarantee that a transaction is really on disk when it thinks it is [which you don't know anyway on most write-cache-enabled drives today]).
But for the vast majority of applications, the buffer cache is really the most powerful optimisation ever invented. We're talking "nano" versus "milli" in access times and "giga" versus "mega" in transfer rates (this assumes that reading from the buffer cache actually has to do a memcpy, which isn't even necessarily the case if alignment constraints similar to those for DMA are met).
Access time and transfer rate may seem insignificant with AIO; after all, the requests run asynchronously, so the latency is hidden. However, that is a fallacy. Access time still means you can't get more than a few hundred "operations" done per second, even if nobody else is using the physical drive. If you issue more, your requests will queue up, and at some point something will have to block or fail. Fast producer, slow consumer -- there is no other way.
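(Back-of-envelope, assuming a typical ~10 ms average access time for a single rotational disk: 1000 ms / 10 ms is roughly 100 random operations per second per spindle, no matter how many requests you queue up in front of it.)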
Granted, the kernel might be able to merge some operations and save a few seeks, and you can always throw a bigger RAID with more disks at the problem. So make that "thousand" or "few thousand" instead of "few hundred".
However, using the buffer cache, you'll do tens of thousands or hundreds of thousands of operations per second without even wasting a thought on it, the only real downside being that if you're writing and the UPS fails, you may lose data.