Scalable network/disk IO server - Linux best practices

I'm writing a highly scalable network server that needs to do a lot of disk IO. On Windows, the best way to do it is well understood - just use IO completion ports. On Linux, there seem to be a plethora of polling mechanisms, event mechanisms, etc. From what I understand the best accepted practice is to use epoll in combination with aio. Is this correct?

If this is so, I plan to design the server in the following way. I'll create a thread pool of workers (equal to the number of CPUs on the machine), and a single thread to listen/accept a socket connection and transfer it to one of the workers from the pool (probably in round-robin fashion). Each worker will then have an epoll_pwait loop so that it wakes up when there is a network IO ready, or an AIO ready from disk. If there is a network IO, the thread will process it by issuing an AIO request with an appropriate signal mask. When the AIO request completes, the thread can then wake up, handle the AIO, and reply to the client.

Is there something I'm missing? Is this an acceptable "best practice" to build this sort of server?
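For concreteness, here is a minimal sketch of what one worker's loop might look like. It assumes the libaio interface (io_setup/io_submit/io_getevents) and uses an eventfd registered with epoll for completion notification instead of the signal-mask approach described above, simply because that is easier to show compactly; batch sizes and names are illustrative, and error handling is omitted:

    /* Sketch of one worker's event loop: epoll watches the worker's sockets
     * plus an eventfd that the kernel signals when submitted AIO requests
     * complete (set up per-iocb with libaio's io_set_eventfd()). */
    #include <libaio.h>
    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <stdint.h>
    #include <unistd.h>

    void worker_loop(int epfd, int aio_efd, io_context_t ctx)
    {
        struct epoll_event evs[64];
        for (;;) {
            int n = epoll_wait(epfd, evs, 64, -1);
            for (int i = 0; i < n; i++) {
                if (evs[i].data.fd == aio_efd) {
                    uint64_t ndone;
                    read(aio_efd, &ndone, sizeof ndone);   /* # of completed AIOs */
                    struct io_event done[64];
                    int got = io_getevents(ctx, 1, 64, done, NULL);
                    for (int j = 0; j < got; j++) {
                        /* done[j].data carries the per-request context:
                         * build and send the reply to the client here   */
                    }
                } else {
                    /* a socket is readable: parse the request, then
                     * io_prep_pread(&cb, file_fd, buf, len, off);
                     * io_set_eventfd(&cb, aio_efd);
                     * io_submit(ctx, 1, &cbp);                          */
                }
            }
        }
    }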
I think that will work fine.

You should realize that putting SSDs in RAID 0 into the machine is likely to do more for raw performance than any specific software structure, though :-)
enum Bool { True, False, FileNotFound };
Quote:Original post by CoffeeMug
I'm writing a highly scalable network server that needs to do a lot of disk IO. On Windows, the best way to do it is well understood - just use IO completion ports. On Linux, there seem to be a plethora of polling mechanisms, event mechanisms, etc. From what I understand the best accepted practice is to use epoll in combination with aio. Is this correct?

If this is so, I plan to design the server in the following way. I'll create a thread pool of workers (equal to the number of CPUs on the machine), and a single thread to listen/accept a socket connection and transfer it to one of the workers from the pool (probably in round-robin fashion). Each worker will then have an epoll_pwait loop so that it wakes up when there is a network IO ready, or an AIO ready from disk. If there is a network IO, the thread will process it by issuing an AIO request with an appropriate signal mask. When the AIO request completes, the thread can then wake up, handle the AIO, and reply to the client.

Is there something I'm missing? Is this an acceptable "best practice" to build this sort of server?




Depending on how much work individual requests involve (such as a very big block), you wouldn't want to do round-robin job dispatching, but rather use some mechanism where the worker threads ask for a new job when they are done.


Another consideration is that you may want to assign each thread its own disk (one per disk), since having two threads working on the same physical disk will likely have them fighting each other (and slowing to a crawl) as the head is constantly sent to different parts of the disk. Thus a more sophisticated scheduling scheme based on data location might be more efficient.

Some disk controllers (hardware and software) do a good job of optimizing disk requests, but that would again be a source of delay for any particular thread's job (and again a reason not to assign them round-robin).
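As a rough illustration of sorting requests by disk, one simple (hypothetical) way to route a request to a per-disk work queue is to key on the device ID that fstat reports for the file; the queue count is assumed to be maintained elsewhere:

    #include <sys/stat.h>

    /* Pick a per-disk worker queue for this fd: files on the same
     * physical device land in the same queue. */
    int queue_for_fd(int fd, int num_queues)
    {
        struct stat st;
        if (fstat(fd, &st) != 0)
            return 0;                           /* fall back to queue 0 on error */
        return (int)(st.st_dev % num_queues);   /* same device -> same queue */
    }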

--------------------------------------------
Ratings are Opinion, not Fact
Note that the threads in the suggestion are for handling the response to I/O -- not the waiting. AIO means you can have more requests outstanding than you have CPUs. This means that the kernel gets to schedule I/O across disks, rather than the application. Often, the kernel is in a better position to do so.

If you have really large I/O requests, I would also recommend splitting them into individual bites of a maximum size, say 1 MB or 8 MB. This will reduce the maximum jitter of requests, because no one big request can come in and lock up the disk channel for a very long time. It will reduce the throughput of really large requests, but generally that is the least important factor to optimize for.
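A minimal sketch of that splitting, assuming the libaio interface and an illustrative 1 MB chunk size (the caller provides the iocb arrays and keeps them alive until the completions are reaped):

    #include <libaio.h>
    #include <stddef.h>

    #define CHUNK (1 << 20)   /* 1 MB per sub-request */

    /* Break one large read into CHUNK-sized sub-requests so a single big
     * request can't monopolize the disk queue. */
    int submit_read_chunked(io_context_t ctx, int fd, char *buf,
                            size_t len, long long off,
                            struct iocb *cbs, struct iocb **ptrs)
    {
        long n = 0;
        for (size_t done = 0; done < len; done += CHUNK, n++) {
            size_t this_len = (len - done < CHUNK) ? len - done : CHUNK;
            io_prep_pread(&cbs[n], fd, buf + done, this_len, off + done);
            ptrs[n] = &cbs[n];
        }
        return io_submit(ctx, n, ptrs);   /* returns how many were queued */
    }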
enum Bool { True, False, FileNotFound };
wodinoneeye's point about disks is very valid -- several worker threads per disk, sorting the jobs by disk and handing them out to the appropriate workers is the way to go.

It's reasonably sensible to have a set of workers, one per disk, handling the connections -- establish what the connection wants to talk about and where on the disks that is, then give it to the appropriate worker. That worker then epolls the connections.

When the connection definitely wants data, set up the task and hand it off to the disk worker pool, which reads/writes the data as appropriate. When the data rx/tx is complete, the main comms-handling thread can have the connection back again.

The result of this is that even if (say) one of your disks is slow, the others will run OK.

Multiple threads per disk are better than a single thread -- the kernel (and the disk system) get a chance to optimise the accesses. If you force the accesses to be sequential by issuing the reads in turn, you cannot take advantage of that. The worst-case scenario with multiple threads is going to look the same as the sequential accesses.

Don't use the async-IO stuff. All it does is thread it behind the scenes (the POSIX one) or not work on most disks (the other one).

We experimented with memory-mapping the files, and tagging/untagging pages as connections wanted to read them. I don't recall the details ATM, but Linux has a mechanism for tagging pages as "please read them ASAP" or "swap this out, I don't care about it", and the swapper will pull them in or out when it can. The intention was that we'd return and read the memory later, but the product was end-of-lifed before we got that running.
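For reference, a minimal sketch of that approach using madvise (which samoth names below): map the file, ask the kernel to start reading a range in the background, come back for the data later, and drop it when finished. Error handling omitted:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    void *map_and_prefetch(const char *path, size_t *len_out)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                                /* the mapping survives the close  */
        madvise(p, st.st_size, MADV_WILLNEED);    /* "please read this in ASAP"      */
        *len_out = st.st_size;
        return p;   /* later: madvise(p, len, MADV_DONTNEED) and munmap(p, len) */
    }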
Well, I'm using "the other one" (io_submit and friends) :) I haven't gotten around to testing its behavior yet, but from what I understand it always works on files opened with O_DIRECT, and works on XFS even without O_DIRECT (this took some digging through the kernel code). Hopefully I'll get to do some benchmarks soon, so I can post the results.
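For anyone trying the same thing, here is a minimal sketch of a single io_submit read against a file opened with O_DIRECT; the file name is a placeholder and error checking is omitted. Note that O_DIRECT requires the buffer, length, and offset to be suitably aligned, hence posix_memalign:

    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdlib.h>

    int main(void)
    {
        int fd = open("data.db", O_RDONLY | O_DIRECT);   /* placeholder file name */

        void *buf;
        posix_memalign(&buf, 4096, 4096);      /* aligned buffer for O_DIRECT */

        io_context_t ctx = 0;
        io_setup(64, &ctx);                    /* nr_events = 64 */

        struct iocb cb, *cbp = &cb;
        io_prep_pread(&cb, fd, buf, 4096, 0);
        io_submit(ctx, 1, &cbp);

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);    /* ev.res: bytes read, or -errno */

        io_destroy(ctx);
        return 0;
    }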

Linux support for this kind of stuff is surprisingly bad - AIO is crucial to high-performance networking. It might be that the OS supports these types of workloads well, but the documentation is terrible, and Google returns obsolete results with almost no credible benchmarks. Sorry, but ranting helps me blow off some steam :)
Begin Free Software rant:

It's not that surprising that Linux is behind on this, because it's the kind of thing that is really hard to get right, and that requires lots of dedicated people working really hard all over the kernel and drivers to implement it. While a team of hundreds can do this for the Windows NT kernel (because they're being paid to do it), it's much harder for a team of thousands of volunteers, most of whom are *not* experts, to do the same thing.

There are some things that Free Software is great at, especially farming out fairly repetitive, simpler tasks with high visibility. There are other things that Free Software doesn't do nearly as well; this includes anything that requires deep skills and a lot of time investment, and doesn't give that much emotional/social reward.

Good printer drivers are another one of those things that never happen for Linux, btw :-)

Also, much Free Software is developed by college students who may have lots of time and dedication, but don't have the 20 years of experience necessary to get the subtle things right. That's why it often works "well enough" but seldom works as well as solutions built by long-time experts. And long-time experts find that they can often get money for their expertise, thus their output is seldom Free Software.

And, finally, yes, you want more than one thread per disk, else you won't be able to re-order requests on a per-disk basis. Re-ordering requests is the most important tool in reducing request latency and improving overall throughput, at least when each request is relatively small (below a handful of megabytes, say).
enum Bool { True, False, FileNotFound };
Quote:Linux support for this kind of stuff is surprisingly bad

Unfortunately, Linux support for asynchronous IO is not only surprisingly bad, it's practically non-existent, at least if you define "asynchronous" as "will not block".
I've found the madvise(MADV_WILLNEED) approach as stated by Katie to be the best thing you can do. It is nearly as fast in the worst case, somewhat faster in the average case (because it isn't ignoring the buffer cache, which is totally braindead), and orders of magnitude faster in favourable cases. More importantly, however, it is reliable.

Using io_submit, I was unpleasantly surprised to find that once you exceed some threshold (either request-wise or size-wise), your "asynchronous" requests suddenly start taking dozens of milliseconds to submit. Sure enough, you still get asynchronous completion some time later, but to what avail...
This wasn't documented anywhere in the man pages, and you got no error code or anything, it just screwed up silently.
After much, much searching, I found a reply in some mailing list archive which was something like "yeah, that happens, the command queue has a fixed size", and a statement from Mr.T. which was something like "yeah, so what, most people don't even use epoll today, there's no really urgent need for that stuff anyway". I can't reproduce the exact wording now, as that was a couple of months ago, but it was something along those lines.
Add to that the fact that of course it doesn't work asynchronously at all if you don't turn off the buffer cache.

Having said that, async IO is not *that* much better under Windows per se, but at least if you spend some time reading the ifs and whens, it can be made to work reliably if you don't mind the overhead of pushing everything through IOCP at least twice (in that case, it works really, really, really well).

It is stunning that something so seemingly easy as "load this stuff, I don't care how you do it or when, just don't block me" is so darn hard to implement.
Quote:Original post by samoth
Using io_submit, I was unpleasantly surprised to find that once you exceed some threshold (either request-wise or size-wise), your "asynchronous" requests suddenly start taking dozens of milliseconds to submit.

That actually doesn't surprise me. If the kernel didn't do this, it would be very easy to execute a DoS attack (just submit spurious IO requests in an infinite loop). Do you remember if the nr_events parameter to io_setup affects this behavior? According to the docs you should be able to submit at least nr_events requests without blocking.


I'm implementing a high-performance database, so I can't use the page cache anyway. Since I have to implement my own caching, having to use O_DIRECT to make io_submit work isn't a problem. Having it block before reaching nr_events, however, would be a problem. I'll test this today and post the results.
Ok, just did some tests.

Despite all the fear-mongering out there, I found that io_submit works rather well (for open source software [grin]). It takes about 0.8 microseconds to submit (as opposed to milliseconds when I run the same timing code on regular preads). If I submit more requests than the kernel can handle, I get an -EAGAIN return value without a delay.

One open question is how many requests the kernel can handle without returning -EAGAIN. I found that even if I set nr_events (when creating the io context) to a large number (larger than the total number of requests), in many cases I still get -EAGAIN. This happens mostly on slower drives - on a faster drive it is less of a problem. I suppose the kernel doesn't allow the outstanding queue of requests to grow too big, no matter what you set nr_events to.
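A small sketch of how one might cope with that, assuming the libaio interface: on -EAGAIN (or a partial submit), reap some completions with io_getevents to free queue slots and retry the remainder. In real code the reaped events would be handed back to the caller rather than discarded:

    #include <libaio.h>
    #include <errno.h>

    int submit_all(io_context_t ctx, struct iocb **cbs, int n)
    {
        int submitted = 0;
        while (submitted < n) {
            int ret = io_submit(ctx, n - submitted, cbs + submitted);
            if (ret > 0) {
                submitted += ret;                      /* partial submits are legal */
            } else if (ret == -EAGAIN) {
                struct io_event ev[32];
                io_getevents(ctx, 1, 32, ev, NULL);    /* free up queue slots */
            } else {
                return ret;                            /* some other error */
            }
        }
        return submitted;
    }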

I'm digging through the kernel aio code now (linux/fs/aio.c if you're interested), but it'll take me a while to ramp up. I'll post here if I find something.

