Quote: Do you remember if the nr_events parameter to io_setup affects this behavior?
No, the only thing I remember is that it happened even with as few as a dozen or so requests, if each request was supposed to read a few dozen megabytes of data. I can only guess why that happened: probably one big request gets broken up into several smaller bits that the hardware can chew, and the number of those bits is probably limited as well.
I didn't investigate too deeply, as this scared me off pretty fast, and the mmap/madvise thing worked really well.
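For reference, the mmap/madvise approach might look roughly like the C sketch below. The helper names map_and_prefetch and resident_pages are made up for illustration, and the sketch assumes Linux with glibc. madvise(MADV_WILLNEED) is only a hint to the kernel, and mincore merely reports which pages happen to be resident at the moment of the call.

```c
#define _DEFAULT_SOURCE          /* madvise/mincore declarations on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file read-only and ask the kernel to start
 * reading it in. Returns the mapping (or MAP_FAILED) and stores the
 * mapped length in *len_out. */
void *map_and_prefetch(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return MAP_FAILED;
    }

    size_t len = (size_t)st.st_size;
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                   /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return MAP_FAILED;

    /* Advisory only, so a failure here is deliberately ignored. */
    (void)madvise(p, len, MADV_WILLNEED);

    *len_out = len;
    return p;
}

/* Hypothetical helper: count how many pages of the mapping are resident
 * right now. Returns -1 on error. */
long resident_pages(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;

    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }

    long n = 0;
    for (size_t i = 0; i < npages; i++)
        n += vec[i] & 1;         /* lowest bit set means "resident" */
    free(vec);
    return n;
}
```

A caller would map the file once, go on with other work, and only touch the data (or poll resident_pages) later, so that the read-ahead has a chance to overlap with computation.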
Quote: If I submit more requests than the kernel can handle, I get an -EAGAIN return value without a delay.
That must have changed since I tried it then, which is a good thing. I mean, an error code is perfectly acceptable, no doubt. Just, blocking for 10-12 ms without warning is a no-go :-)
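To illustrate the EAGAIN case, here is a rough C sketch of a submit loop that stops when the queue is full instead of waiting. It uses the raw Linux system call rather than libaio, which is an assumption on my part: the raw call returns -1 with errno set to EAGAIN, while the libaio wrapper returns -EAGAIN directly. submit_some is a made-up name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/aio_abi.h>       /* aio_context_t, struct iocb */
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw io_submit system call (no libaio). */
static long sys_io_submit(aio_context_t ctx, long n, struct iocb **iocbs)
{
    return syscall(SYS_io_submit, ctx, n, iocbs);
}

/* Hypothetical helper: submit as many of the n requests as the kernel
 * accepts right now. Returns the number submitted (possibly fewer than
 * n when the queue is full), or -1 if nothing was submitted because of
 * an error other than EAGAIN. */
long submit_some(aio_context_t ctx, struct iocb **iocbs, long n)
{
    assert(n >= 0);

    long done = 0;
    while (done < n) {
        long r = sys_io_submit(ctx, n - done, iocbs + done);
        if (r > 0) {
            done += r;           /* partial submission is possible */
            continue;
        }
        if (r < 0 && errno == EAGAIN)
            break;               /* queue full: reap completions, retry later */
        if (r < 0 && errno == EINTR)
            continue;
        return done > 0 ? done : -1;
    }
    return done;
}
```

The caller keeps the unsubmitted tail of the array and retries it after collecting some completions with io_getevents.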
Quote: I'm implementing a high performance database, so I can't use the page cache anyway. Since I have to implement my own caching, having to use O_DIRECT to make io_submit work isn't a problem.
But even that should probably be possible with memory mapping and using the page cache. I'm not sure if anyone has ever tried any such thing for a database (at least, if you need transaction safety), but I think it should work just fine.
You can use msync(..., MS_ASYNC) to initiate page writeback, and later msync(..., MS_SYNC) before considering a transaction complete. For loading, use madvise (and maybe mincore), as proposed above.
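A minimal C sketch of that write path, under a few assumptions: the database file is mapped MAP_SHARED, the whole mapping is synced rather than just the dirty range (msync wants a page-aligned address, so this keeps the example short), and the names map_file_rw, txn_write and txn_commit are invented for illustration. Note that on Linux since 2.6.19 the man page describes MS_ASYNC as effectively a no-op, because the kernel tracks dirty pages by itself, so the early call is mostly a portable hint there.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create or open a file, size it to len bytes and
 * map it read/write and shared. Returns NULL on failure. */
char *map_file_rw(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                   /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : (char *)p;
}

/* Copy a record into the mapping and ask for writeback to start.
 * Returns 0 on success, -1 on error (as msync does). */
int txn_write(char *base, size_t map_len, size_t off,
              const void *rec, size_t rec_len)
{
    assert(off + rec_len <= map_len);
    memcpy(base + off, rec, rec_len);
    return msync(base, map_len, MS_ASYNC);   /* does not wait for the disk */
}

/* Block until all dirty pages of the mapping have been written out.
 * Only after this returns 0 is the transaction considered complete. */
int txn_commit(char *base, size_t map_len)
{
    return msync(base, map_len, MS_SYNC);
}
```

A transaction would call txn_write for each modified record and txn_commit once at the end. Ordering between individual pages (for example, log before data) is not addressed by this sketch.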
Shame I don't have time for writing a database server, would be interesting to explore how well that works.