Asynchronous file loading (aka streaming)

We're using async I/O due to significant performance benefits and went with the POSIX aio interface for portability. Since it's not available on Windows, we've implemented a wrapper on top of the Windows APIs that does the job.
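For reference, the core of the POSIX aio interface in question looks roughly like this (a minimal sketch; error handling is omitted and the path/size are placeholders):

#include <aio.h>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>
// On Linux, link with -lrt.

// Issue one async read, overlap it with other work, then collect the result.
void read_async(const char* path, char* buffer, size_t bytes)
{
    int fd = open(path, O_RDONLY);

    aiocb cb;
    std::memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buffer;
    cb.aio_nbytes = bytes;
    cb.aio_offset = 0;

    aio_read(&cb);  // returns immediately; the kernel reads in the background

    while (aio_error(&cb) == EINPROGRESS)
    {
        // ... do per-frame work here instead of blocking ...
    }

    ssize_t transferred = aio_return(&cb);  // bytes actually read
    std::printf("read %zd of %zu bytes\n", transferred, bytes);
    close(fd);
}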

Quote:Anyways, what is the biggest drawback of threading things on your own apart from the low-level optimizations AIO will give me?

As frob has said, threading requires careful study to ensure you don't have any race conditions.

Quote:For instance, what will happen if I load quite a big file in its own thread? Will it noticeably slow down the framerate if I don't load it in chunks?

No, IO happens via DMA nowadays, so it doesn't need lots of CPU time the way PIO did. Large or small IOs in a separate thread shouldn't slow down your main thread. However, splitting your IOs into chunks makes a lot of sense (it allows prioritizing IOs, simplifies a file caching mechanism, and also speeds up aio).
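As a sketch of that chunk splitting (kChunkSize is an assumed tuning value, and completion polling via aio_error/aio_return is left to the caller):

#include <aio.h>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// One request per fixed-size chunk instead of a single huge read; each chunk
// can then be cached, reprioritized, or cancelled (aio_cancel) on its own.
// The aiocb structs must stay at a stable address until the kernel is done
// with them, hence the caller-owned vector.
const size_t kChunkSize = 256 * 1024;

void issue_chunked_reads(int fd, char* dest, size_t totalBytes,
                         std::vector<aiocb>& requests)
{
    const size_t numChunks = (totalBytes + kChunkSize - 1) / kChunkSize;
    requests.resize(numChunks);
    for (size_t i = 0; i < numChunks; ++i)
    {
        aiocb& cb = requests[i];
        std::memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = dest + i * kChunkSize;
        cb.aio_offset = static_cast<off_t>(i * kChunkSize);
        cb.aio_nbytes = (i + 1 == numChunks) ? totalBytes - i * kChunkSize
                                             : kChunkSize;
        aio_read(&cb);  // queue this chunk; caller polls for completion
    }
}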
Quote:Original post by mokaschitta

Anyways, what is the biggest drawback of threading things on your own apart from the low-level optimizations AIO will give me?


The fact that, under Windows, the OS, file system, and drivers can organize disk access with knowledge of the physical data layout. They could theoretically take into account disk rotation speed and head positions.

As a more general advantage, it allows read, process, and write operations to overlap one another.

In general, kernel-assisted asynchronous IO has the potential for lower overhead, since it can be tuned to the specifics of that particular OS and/or file system.


In practice results will vary, but considering this is a facility present in most OSes, it makes sense to use it.

Quote:For instance, what will happen if I load quite a big file in its own thread? Will it noticeably slow down the framerate if I don't load it in chunks?

Disk IO is always done in chunks; a disk cannot transfer 16 gigabytes in a single call.

As far as processing goes - if you have idle cores, and if there is no memory contention, and if there are no scheduling conflicts, and if the disk controller doesn't cause any stalls, and if the disk transfer doesn't interfere with the OS, then there will be absolutely no impact from file loading.

Unfortunately, that is a lot of ifs, and a lot of them depend on hardware and other factors.

It is not uncommon for disk IO to cause annoying stalls to the entire system, threads or not, but it depends on very specific circumstances.


But first, and by far the most important performance factor: avoid access to multiple files. Just tar-ing everything into a single file and seeking inside of it will offer drastic performance improvements, especially under Windows and NTFS. *nixes are less affected by this, but still, if you have only a single file (or a handful), you remove many file lookups that would otherwise need to be performed. At the very least, those lookups introduce a serial dependency, which is undesirable due to the added latency.
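For illustration, reading one asset out of such a pack file might look like this (AssetEntry and load_asset are hypothetical names; in practice the offsets would come from a table of contents written when the archive was built):

#include <fcntl.h>
#include <unistd.h>

// Hypothetical table-of-contents entry for one asset inside the pack file.
struct AssetEntry
{
    off_t  offset;  // position of the asset within the archive
    size_t size;    // size in bytes
};

// One open() for the whole archive replaces a per-asset open() and its
// directory lookup. pread() seeks and reads in a single call and is safe
// to call from multiple threads on the same descriptor.
bool load_asset(int archiveFd, const AssetEntry& e, char* dest)
{
    return pread(archiveFd, dest, e.size, e.offset) ==
           static_cast<ssize_t>(e.size);
}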
Okay, thanks - aio is my choice now. Since I will mostly use the engine for testing and, in the future, maybe exhibitions, Mac and Linux support is enough for me right now!
Quote:Original post by Jan Wassenberg
We're using async I/O due to significant performance benefits and went with the POSIX aio interface for portability. Since it's not available on Windows, we've implemented a wrapper on top of the Windows APIs that does the job.


I took a quick look at your code. What you wrote is definitely better than a threaded approach, but you should really look into I/O completion ports, as it's significantly faster and more scalable.
Thanks for the feedback :) I am aware of completion ports (using them for directory change notification), but can't see how they would help with the current usage. These aio routines are called by a synchronous IO splitter that does caching and decompression on a block level. There are no threads involved and a maximum of 16 IOs in flight at any given time (which is already quite high, not even a Fusion ioDrive card needs that much). How can completion ports improve things here?
Quote:Original post by cache_hit


I took a quick look at your code. What you wrote is definitely better than a threaded approach, but you should really look into I/O completion ports, as it's significantly faster and more scalable.


Did you mean overlapped IO?
Quote:Original post by Jan Wassenberg
Thanks for the feedback :) I am aware of completion ports (using them for directory change notification), but can't see how they would help with the current usage. These aio routines are called by a synchronous IO splitter that does caching and decompression on a block level. There are no threads involved and a maximum of 16 IOs in flight at any given time (which is already quite high, not even a Fusion ioDrive card needs that much). How can completion ports improve things here?


Well, if you search around the literature surrounding IOCP, you won't find a lot of details on exactly what makes it faster and more scalable than event-driven overlapped I/O. All you'll find is everyone saying "IOCP is the fastest and most scalable way to do async i/o in windows". One reason that comes to mind is that IOCP requires no user-mode synchronization primitives. No mutexes, no critical sections, no events. Of course synchronization happens, but it happens inside the kernel using kernel synchronization primitives, which should be faster.

It has some flexibility advantages over traditional overlapped I/O as well, since it has direct API support for performing arbitrary computations asynchronously, not just I/O. For example, you could have a thread pool with as many threads as there are CPUs on the system. Without using any user-mode synchronization primitives, you can post a message to this thread pool to perform encryption / decryption of binary data; upon completion, it can post a message back to the main thread, again without any user-mode synchronization primitives such as events.


A basic IOCP loop would look something like this:

#include <windows.h>
#include <vector>

//You can put other user-defined info in here if you wish.
struct RequestPacket : public OVERLAPPED
{
    explicit RequestPacket(LPVOID buf)
        : buffer(buf)
    {
        hEvent = NULL;
    }
    LPVOID buffer;
};

//Renamed from KEY_READ/KEY_WRITE to avoid clashing with the registry
//access-right macros that <windows.h> already defines.
#define KEY_IO_READ  0
#define KEY_IO_WRITE 1
#define KEY_ENCRYPT  2

int GetVolumeBlockSize();  //placeholder - see the comment at its call site

void iocp_loop(LPCTSTR path, LPCTSTR path2)
{
   //Create a handle for reading from.  FILE_FLAG_NO_BUFFERING is required
   //for optimal throughput, but imposes alignment / request size restrictions
   HANDLE hRead = CreateFile(path, GENERIC_READ,
      FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
      FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED | FILE_FLAG_SEQUENTIAL_SCAN, NULL);

   //Create a handle for writing to.  FILE_FLAG_NO_BUFFERING is required
   //for optimal throughput, but imposes alignment / request size restrictions.
   //FILE_FLAG_WRITE_THROUGH is also required for optimal throughput.
   HANDLE hWrite = CreateFile(path2, GENERIC_WRITE,
      FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
      FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED, NULL);

   //Create a new IOCP and associate it with the reading handle and read key
   HANDLE hiocp = CreateIoCompletionPort(hRead, NULL, KEY_IO_READ, 0);
   //Associate the previous IOCP with writing as well, using the write handle and a different key.
   CreateIoCompletionPort(hWrite, hiocp, KEY_IO_WRITE, 0);

   LARGE_INTEGER size;
   LARGE_INTEGER nextReadOffset;
   LARGE_INTEGER nextWriteOffset;
   GetFileSizeEx(hRead, &size);
   nextReadOffset.QuadPart = 0;
   nextWriteOffset.QuadPart = 0;

   int readsOutstanding = 0;
   int writesOutstanding = 0;
   const int maxOutstandingIo = 16;

   //Required since we're using FILE_FLAG_NO_BUFFERING.  You can use
   //FSCTL_GET_NTFS_VOLUME_DATA to fetch this number for real.
   int blockSize = GetVolumeBlockSize();

   char* lpBuffer = static_cast<char*>(VirtualAlloc(NULL,
      maxOutstandingIo * blockSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));

   std::vector<RequestPacket*> packets;
   for (int i = 0; i < maxOutstandingIo; ++i)
      packets.push_back(new RequestPacket(lpBuffer + i * blockSize));

   //Force some reads to kick off the process.
   for (int i = 0; i < maxOutstandingIo; ++i)
   {
      if (nextReadOffset.QuadPart >= size.QuadPart)
         break;  //file is smaller than the full request window
      RequestPacket* packet = packets[i];
      packet->Offset = nextReadOffset.LowPart;
      packet->OffsetHigh = nextReadOffset.HighPart;
      ReadFile(hRead, packet->buffer, blockSize, NULL, packet);
      ++readsOutstanding;
      nextReadOffset.QuadPart += blockSize;
   }

   while ((readsOutstanding > 0) || (writesOutstanding > 0))
   {
      DWORD bytes;
      ULONG_PTR key;
      OVERLAPPED* overlapped;
      RequestPacket* packet;
      //Error and end-of-file handling omitted for brevity.
      GetQueuedCompletionStatus(hiocp, &bytes, &key, &overlapped, INFINITE);
      packet = static_cast<RequestPacket*>(overlapped);

      switch (key)
      {
      case KEY_IO_READ:
         //A block arrived; write it out at the next write offset.
         readsOutstanding--;
         packet->Offset = nextWriteOffset.LowPart;
         packet->OffsetHigh = nextWriteOffset.HighPart;
         WriteFile(hWrite, packet->buffer, bytes, NULL, packet);
         writesOutstanding++;
         nextWriteOffset.QuadPart += bytes;
         break;

      case KEY_IO_WRITE:
         //A block was flushed; reuse its packet for the next read,
         //unless the whole input file has been consumed.
         writesOutstanding--;
         if (nextReadOffset.QuadPart < size.QuadPart)
         {
            packet->Offset = nextReadOffset.LowPart;
            packet->OffsetHigh = nextReadOffset.HighPart;
            ReadFile(hRead, packet->buffer, blockSize, NULL, packet);
            readsOutstanding++;
            nextReadOffset.QuadPart += blockSize;
         }
         break;
      }
   }
}



I left out some details regarding manual buffering, but you get the idea. I also defined a key for encryption but never used it. In theory, you could have a thread pool whose threads are all blocked in a call to GetQueuedCompletionStatus on a *different* I/O completion port object, created with a dwConcurrency value equal to the number of threads in the pool. The main thread could call PostQueuedCompletionStatus(hiocp2, ..., KEY_ENCRYPT, ...). One of the threads from the thread pool would pick it up, do the encryption, and when done call PostQueuedCompletionStatus(hiocp, ..., KEY_ENCRYPT, ...). This sends it back to the main thread (because the main thread is waiting on hiocp while the thread pool waits on hiocp2). Then add a switch handler for KEY_ENCRYPT, and under it put the code that's currently under KEY_IO_READ.
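For concreteness, a minimal sketch of one such pool worker (the Ports struct and Encrypt() are placeholders invented here; RequestPacket and KEY_ENCRYPT come from the loop above):

struct Ports { HANDLE pool; HANDLE main; };
void Encrypt(LPVOID data, DWORD size);  //placeholder for the real transform

DWORD WINAPI encrypt_worker(LPVOID param)
{
   Ports* ports = static_cast<Ports*>(param);
   for (;;)
   {
      DWORD bytes;
      ULONG_PTR key;
      OVERLAPPED* overlapped;
      //Block until the main thread posts a packet to the pool's port.
      GetQueuedCompletionStatus(ports->pool, &bytes, &key, &overlapped, INFINITE);
      RequestPacket* packet = static_cast<RequestPacket*>(overlapped);

      Encrypt(packet->buffer, bytes);

      //Hand the finished block back to the main thread's port - note that
      //the key is the third argument. No user-mode locks or events involved.
      PostQueuedCompletionStatus(ports->main, bytes, KEY_ENCRYPT, packet);
   }
}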



This kind of hints at the flexibility advantage of IOCP. The entire asynchronous pipeline is managed through a single place. Furthermore, the WinSock API directly supports IOCP, so you call WSASend() or whatever the function name is, and it will gladly operate on an overlapped socket in exactly the same way. You'll get notification of the network completion through GetQueuedCompletionStatus().
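As a rough sketch of the socket side (KEY_NET and post_socket_read are names invented here; this assumes a connected overlapped SOCKET and the hiocp port from above):

#include <winsock2.h>  //link with ws2_32.lib

#define KEY_NET 3

void post_socket_read(SOCKET s, HANDLE hiocp, OVERLAPPED* packet,
                      char* buf, ULONG len)
{
   //Associate the socket with the existing port. In real code this is done
   //once at connect time, not per request.
   CreateIoCompletionPort(reinterpret_cast<HANDLE>(s), hiocp, KEY_NET, 0);

   WSABUF wsaBuf;
   wsaBuf.buf = buf;
   wsaBuf.len = len;
   DWORD flags = 0;
   //Completion arrives in the same GetQueuedCompletionStatus() loop,
   //under the KEY_NET key.
   WSARecv(s, &wsaBuf, 1, NULL, &flags, packet, NULL);
}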



All of this is abstracted out for you in boost::asio, but boost::asio requires an OS-specific interface to be implemented. It provides the Windows implementation that uses IOCP internally for both disk I/O and sockets, and it provides the Linux implementation for sockets, but it doesn't provide any Linux implementation for disk I/O, which would ultimately be a mapping of the required boost interface onto the AIO API.


I implemented such a system at work for some high performance disk backup software. Using the IOCP approach was the only way I could achieve high enough performance that the actual physical disk became the bottleneck. I can now read/write literally as fast as the disk allows, which is surprisingly hard to achieve.
Quote:Original post by Antheus
Quote:Original post by cache_hit


I took a quick look at your code. What you wrote is definitely better than a threaded approach, but you should really look into I/O completion ports, as it's significantly faster and more scalable.


Did you mean overlapped IO?


IOCP is a specific type of overlapped I/O that Microsoft recommends for the most performance-intensive and demanding applications.
hm, IOCPs certainly are nice, but again, in this case, I don't see any relevant gains to be had. Overhead due to WaitForMultipleObjects+ResetEvent is absolutely negligible for the few active requests. The code manages to max out an ioDrive SSD already (700 MB/s IIRC).
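(For reference, the event-based completion mechanism in question is roughly the following; hFile and the per-slot ReadFile calls are assumed to be set up elsewhere.)

#include <windows.h>

//One event per in-flight request, reused once its request completes.
void wait_for_one(HANDLE hFile, HANDLE (&events)[16], OVERLAPPED (&requests)[16])
{
   DWORD i = WaitForMultipleObjects(16, events, FALSE, INFINITE) - WAIT_OBJECT_0;

   DWORD bytes;
   GetOverlappedResult(hFile, &requests[i], &bytes, FALSE);  //completed request

   ResetEvent(events[i]);  //slot i is now free to carry the next request
}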

Quote:Using the IOCP approach was the only way I could achieve high enough performance that the actual physical disk became the bottleneck. I can now read/write literally as fast as the disk allows, which is surprisingly hard to achieve.

I'm curious: were you seeing a bottleneck with overlapped completion mechanisms other than IOCP? Which one was it - events or callbacks?

The IOCP hype probably comes from applications with tons of requests and multiple threads (web servers), and it definitely makes sense in that context.
Quote:Original post by Jan Wassenberg
hm, IOCPs certainly are nice, but again, in this case, I don't see any relevant gains to be had. Overhead due to WaitForMultipleObjects+ResetEvent is absolutely negligible for the few active requests. The code manages to max out an ioDrive SSD already (700 MB/s IIRC).

What about mechanical disks with multiple platters and read/write heads? Ironically, it's harder to max out performance on these because the ordering of reads and writes matters more.

Also, you mentioned doing compression / decompression. In your application, which is the slowest among { reading, writing, compression } for a single block of data? Is compression happening asynchronously as well? In theory, if compression is the bottleneck, then you should see no effect from introducing a slower disk, and if the disk is the bottleneck, you should see no effect from removing compression. Is this the case for you? If so, maybe there is little reason to use IOCP.


The main reason I like it is that it provides an arbitrary asynchronous computational model that happens to be extremely easy to use for I/O, both on the network and to/from disk, while remaining just as usable for arbitrary asynchronous computation. The event model loses scalability as your pipeline grows and you're waiting on hundreds of handles simultaneously (WaitForMultipleObjects also caps out at 64 handles per call).

