Scalable network/disk IO server - Linux best practices

26 comments, last by SimonForsman 14 years, 6 months ago
Quote:Do you remember if the nr_events parameter to io_setup affects this behavior?
No, the only thing that I remember is that it happened even with as few as a dozen or so requests if each request was supposed to read a few dozen megabytes of data. I can only guess why that happened, probably because one big request was broken up into several smaller bits that the hardware can chew, or something, and this was probably limited as well?
I didn't investigate too deeply, as this scared me off pretty fast, and the mmap/madvise thing worked really well.

Quote:
If I submit more requests than the kernel can handle, I get -EAGAIN return value without a delay.
That must have been changed since I tried then, which is a good thing. I mean, an error code is perfectly acceptable, no doubt. Just, blocking for 10-12 ms without warning is a no-go :-)


Quote:I'm implementing a high performance database, so I can't use the page cache anyway. Since I have to implement my own caching, having to use O_DIRECT to make io_submit work isn't a problem.
But even that should probably be possible with memory mapping and using the page cache. I'm not sure if anyone has ever tried any such thing for a database (at least, if you need transaction safety), but I think it should work just fine.
You can use msync(...MS_ASYNC) to initiate page writeback, and later msync(...MS_SYNC) before considering a transaction complete. For loading, madvise (and maybe mincore), as proposed above.
Shame I don't have time for writing a database server, would be interesting to explore how well that works.
Quote:Original post by samoth
Quote:Do you remember if the nr_events parameter to io_setup affects this behavior?
No, the only thing that I remember is that it happened even with as few as a dozen or so requests if each request was supposed to read a few dozen megabytes of data. I can only guess why that happened, probably because one big request was broken up into several smaller bits that the hardware can chew, or something, and this was probably limited as well?

Actually, I just figured out when the kernel returns -EAGAIN. The queue size is the smaller of the nr_events value passed to io_setup and the number stored in /sys/block/sdX/queue/nr_requests.

I'm using fairly small blocks now (512B), so the behavior you've observed might still persist. BTW, io_submit is only asynchronous with O_DIRECT, so this might have been the issue.

The page cache does a few things in suboptimal ways (well, it's really good for general-purpose workloads, but specialized code can do much better). MyISAM uses the page cache, for example, but in a specialized situation it's possible to do much better without it.
One of the core problems in Linux async I/O is that the syscall model was initially synchronous. open/close/ioctl/read/write are all expected to return when done. The only way for the kernel to send asynchronous information back to the application is through a signal -- there is no concept of a callback function.

The real way to implement async I/O in a kernel, and get the best performance, is to make the kernel interface inherently asynchronous, and to have a kernel-defined message queue per process (or per thread, or per user-defined token, or whatever). The higher-level blocking calls would then just be built as a layer on top of the async stuff.

This is similar to how, for I/O, you really want to define a block-level interface, with a buffered character-level interface built on top. At least that part, most implementations get mostly right :-)

Unfortunately, so much UNIX software is so rooted in the synchronous model, that it's really quite hard to attempt to move the entire stack (applications, C library, kernel, drivers) to an asynchronous model.

On Windows, the window message loop, I/O completion ports, and APCs all serve as well-defined system-to-application messaging models. In fact, one of Windows' problems is that it has too many return mechanisms :-) Thus, writing asynchronous code on Windows (at any layer of the stack) is generally more straightforward than on UNIX. It's by no means ideal, though.

Another problem is the way the GUI is bolted on top of UNIX. While that's great for running headless servers, it's a real problem when it comes to interaction between the GUI system and the kernel.
enum Bool { True, False, FileNotFound };
Is Grand Central Dispatch for user mode tasks only, or does it apply to syscalls as well? It claims to work with file descriptors, but is the asynchronous processing emulated or actually implemented in kernel?
Quote:Original post by hplus0603
there is no concept of a callback function.

Well, there's io_getevents. Also, the AIO mechanism can communicate with the process through an eventfd - this is ideal because you can interleave network IO and disk IO in one thread. There are also real-time signals, which are essentially queues similar to Windows message queues. I'd say the lack of interfaces isn't the problem - it's the lack of a single well-defined standard. That's the downside of bazaar development (of course there are upsides as well).
Quote:Original post by Antheus
Is Grand Central Dispatch for user mode tasks only, or does it apply to syscalls as well? It claims to work with file descriptors, but is the asynchronous processing emulated or actually implemented in kernel?
There certainly is kernel support involved, although when Apple released Grand Central as open-source, they mentioned that it can run *without* kernel support.

If you want to look deeper, the source to grand central itself is available here, the source for the blocks runtime it depends upon is here, and you can pull the kernel support from the XNU project...

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Somewhat related.
Quote:there's io_getevents


As others have said, there are a number of possible mechanisms. That's not the point I was trying to make, though.

Here's the point: none of them is *the model* that most software was written to expect. UNIX has run for 40 years on the blocking read()/write() model, and it's pervasive throughout the ecosystem. The closest thing to it that goes beyond simple blocking I/O is select(), which only gives you non-blocking operation, not truly asynchronous operation (and it doesn't even prevent stalls on disk I/O).

Changing the UNIX ecosystem is hard, and takes time. In fact, very hard, and takes lots of time (as in "decades").
enum Bool { True, False, FileNotFound };
Quote:Original post by CoffeeMug
Also, the AIO mechanism can communicate with the process through an eventfd
Out of curiosity (not that I'm likely to use KAIO again any time soon, but I'd like to know), do you have any *usable* documentation regarding this?
All I'm aware of is a rather useless example of how to use eventfd as a poor man's IPC thingie, much like a pipe, and some "rumour" that it might be used to signal IO completion. The glibc man pages aren't any help, and neither is the documentation on the io_* syscalls.
The only thing close to "usable" that I was able to find was some kind of proof-of-concept test program for a kernel patch, and I'm not even sure whether that patch is mainline.

The only well-documented thing I know is having it fire a signal and have a signalfd catch it, but that isn't the same, obviously.
"madvise(MADV_WILLNEED) approach"

Thank you. I couldn't for the life of me remember what the call was named. I'm getting too old for this :-)


"While a team of hundreds can do this for the Windows NT kernel"

What amazes me about core kernel performance like this is not the level of difference between the two options, but that Microsoft actually ISN'T a generation ahead; that a bunch of hippies and communists working for free are actually keeping up with a company spending millions of dollars on it.

I can only conclude that this is because dozens of groups of people (academics, hobbyists etc.) can all set off and explore their ideas fairly easily with Linux without needing to get "approval", and the community gets to pick the best end results of all that fiddling.

{I'm still agog, for example, that not only did someone think it would be a nice idea to be able to change paging algorithms on a running machine, but that they actually implemented it and made it work... but of course once they had, using it is a no-brainer.}

"do you have any *usable* documentation"

We never found any. I resorted to reading the kernel source and wildly experimenting.

This topic is closed to new replies.
