Quote:Original post by Jan Wassenberg
Thanks for the feedback :) I am aware of completion ports (using them for directory change notification), but can't see how they would help with the current usage. These aio routines are called by a synchronous IO splitter that does caching and decompression on a block level. There are no threads involved and a maximum of 16 IOs in flight at any given time (which is already quite high, not even a Fusion ioDrive card needs that much). How can completion ports improve things here?
Well, if you search the literature on IOCP, you won't find many details on exactly what makes it faster and more scalable than event-driven overlapped I/O. All you'll find is everyone saying "IOCP is the fastest and most scalable way to do async I/O in Windows". One reason that comes to mind is that IOCP requires no user-mode synchronization primitives: no mutexes, no critical sections, no events. Synchronization still happens, of course, but it happens inside the kernel using kernel synchronization primitives, which should be faster.
It has some flexibility advantages over traditional overlapped I/O as well, since it has direct API support for performing arbitrary computations asynchronously, not just I/O. For example, you could have a thread pool with as many threads as CPUs on the system. Without using any user-mode synchronization primitives, you can post a message to this thread pool to encrypt or decrypt a block of binary data, and upon completion it can post a message back to the main thread, again without any user-mode synchronization primitives such as events.
A basic IOCP loop would look something like this:
//You can put other user-defined info in here if you wish.
struct RequestPacket : public OVERLAPPED
{
    RequestPacket(LPVOID buf) : buffer(buf) { hEvent = NULL; }
    LPVOID buffer;
};

#define KEY_READ    0
#define KEY_WRITE   1
#define KEY_ENCRYPT 2

void iocp_loop()
{
    //Create a handle for reading from. FILE_FLAG_NO_BUFFERING is required
    //for optimal throughput, but imposes alignment / request size restrictions.
    HANDLE hRead = CreateFile(path, GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
        FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED | FILE_FLAG_SEQUENTIAL_SCAN,
        NULL);

    //Create a handle for writing to. FILE_FLAG_NO_BUFFERING is required
    //for optimal throughput, but imposes alignment / request size restrictions.
    //FILE_FLAG_WRITE_THROUGH is also required for optimal throughput.
    HANDLE hWrite = CreateFile(path2, GENERIC_WRITE,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
        FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED,
        NULL);

    //Create a new IOCP and associate it with the reading handle and read key.
    HANDLE hiocp = CreateIoCompletionPort(hRead, NULL, KEY_READ, 0);
    //Associate the previous IOCP with writing as well, using the write handle
    //and a different key.
    CreateIoCompletionPort(hWrite, hiocp, KEY_WRITE, 0);

    LARGE_INTEGER size;
    LARGE_INTEGER nextReadOffset;
    LARGE_INTEGER nextWriteOffset;
    GetFileSizeEx(hRead, &size);
    nextReadOffset.QuadPart = 0;
    nextWriteOffset.QuadPart = 0;
    int readsOutstanding = 0;
    int writesOutstanding = 0;
    const int maxOutstandingIo = 16;

    //Required since we're using FILE_FLAG_NO_BUFFERING. You can use
    //FSCTL_GET_NTFS_VOLUME_DATA to fetch this number for real.
    int blockSize = GetVolumeBlockSize();
    LPVOID lpBuffer = VirtualAlloc(NULL, 65536, MEM_COMMIT | MEM_RESERVE,
                                   PAGE_READWRITE);
    std::vector<RequestPacket*> packets;
    for (int i = 0; i < maxOutstandingIo; ++i)
        packets.push_back(
            new RequestPacket(static_cast<char*>(lpBuffer) + i * blockSize));

    //Force some reads to kick off the process.
    for (int i = 0; i < maxOutstandingIo; ++i)
    {
        RequestPacket* packet = packets[i];
        packet->Offset = nextReadOffset.LowPart;
        packet->OffsetHigh = nextReadOffset.HighPart;
        ReadFile(hRead, packet->buffer, blockSize, NULL, packet);
        ++readsOutstanding;
        nextReadOffset.QuadPart += blockSize;
    }

    while ((readsOutstanding > 0) || (writesOutstanding > 0))
    {
        DWORD bytes;
        ULONG_PTR key;
        OVERLAPPED* overlapped;
        RequestPacket* packet;
        //Error and end-of-file handling omitted for brevity.
        GetQueuedCompletionStatus(hiocp, &bytes, &key, &overlapped, INFINITE);
        packet = static_cast<RequestPacket*>(overlapped);
        switch (key)
        {
        case KEY_READ:
            //A read finished; write that block out at the next write offset.
            readsOutstanding--;
            packet->Offset = nextWriteOffset.LowPart;
            packet->OffsetHigh = nextWriteOffset.HighPart;
            WriteFile(hWrite, packet->buffer, bytes, NULL, packet);
            writesOutstanding++;
            nextWriteOffset.QuadPart += bytes;
            break;
        case KEY_WRITE:
            //A write finished; reuse the packet to read the next block.
            writesOutstanding--;
            packet->Offset = nextReadOffset.LowPart;
            packet->OffsetHigh = nextReadOffset.HighPart;
            ReadFile(hRead, packet->buffer, blockSize, NULL, packet);
            readsOutstanding++;
            nextReadOffset.QuadPart += blockSize;
            break;
        }
    }
}
I left out some details regarding manual buffering, but you get the idea. I also defined a key for encryption but never used it. In theory, you could have a thread pool whose threads are all blocked in a call to GetQueuedCompletionStatus on a *different* I/O completion port object, created with a dwConcurrency value equal to the number of threads in the pool. The main thread could call PostQueuedCompletionStatus(hiocp2, KEY_ENCRYPT, ...). One of the threads from the pool would pick it up, do the encryption, and when done call PostQueuedCompletionStatus(hiocp, KEY_ENCRYPT, ...), sending it back to the main thread (since the main thread waits on hiocp while the thread pool waits on hiocp2). Then add a switch case for KEY_ENCRYPT and put under it the code that's currently under KEY_READ.
This kind of hints at the flexibility advantage of IOCP. The entire asynchronous pipeline is managed through a single place. Furthermore, the WinSock API directly supports IOCP, so you call WSASend() or whatever the function name is, and it will gladly operate on an overlapped socket in exactly the same way. You'll get notification of the network completion through GetQueuedCompletionStatus().
All of this is abstracted for you in boost::asio, but boost::asio relies on an OS-specific backend. It provides a Windows implementation that uses IOCP internally for both disk I/O and sockets, and a Linux implementation for sockets, but it provides no Linux implementation for disk I/O, which would ultimately be a mapping of the required interface onto the AIO API.
I implemented such a system at work for some high performance disk backup software. Using the IOCP approach was the only way I could achieve high enough performance that the actual physical disk became the bottleneck. I can now read/write literally as fast as the disk allows, which is surprisingly hard to achieve.