
## Recommended Posts

##### Share on other sites
There is a much easier way to do this. Introducing threading to your system adds a ton of complexity.

Even if you do decide to use threading in your app, the method you described is still not ideal.

The OS provides asynchronous IO functions. Use them. The OS is (potentially) able to do smart things such as reordering the disk reads to reduce load times, choose smarter buffer settings, and otherwise make it faster. You can have multiple requests in flight at the same time, and the OS can (potentially) reorder them to run faster than running the same requests sequentially.

Google shows a few tutorials on AIO, including this one that looks pretty good.

##### Share on other sites
While I wouldn't have said it was simpler (and I'm assuming the Mac has something very much like it), the 'best' solution is probably to go with low-level file access.

On Win32 it's possible to open a file with the Win32 file IO functions, find out its size, then tell the IO subsystem to load that data into a chunk of memory in an async manner. The file IO is queued by the file system, and you can sleep the loading thread until the data is loaded, at which point you wake up and deal with the loaded file as required.

I would assume the Mac's native file IO system has something very much like it. I also recall someone mentioning Boost's ASIO library as a possible solution for this, however I've not looked into it myself.

The advantage of this method is that you hand off worrying about file IO timing etc. to the OS, and you don't end up busy-waiting a CPU core while the data loads. The IO subsystem can also queue up multiple files to load, so it's just a matter of firing off the request and waiting until your data turns up.

##### Share on other sites
Okay, thanks guys, I will look into that. I was just hoping that if I pulled off something simple myself I could use it cross-platform.

##### Share on other sites
Well, as I said, Boost::ASIO might be able to help you out and would be cross-platform, but other than that you have to hit the OS at the low level, as most languages have no concept of async IO; heck, last I checked it was a bit of a pain to do in .Net and required unsafe code.

##### Share on other sites
To some extent, I'd second what Frob said. To another extent I can't.

Sure, the OS is able to do smart things about the file reads, but that is under the assumption that you don't have limits on what you are reading, and that you are willing to abide by the limits of the async IO calls on the OS.

Consider trying to stream data behind a movie. It is important that the video gets its X MB/s of data to keep the video running. Whatever is happening in the background, who cares.

I prefer a setup like the following:
*Every read is issued as a "job" with a size, a priority, and a bandwidth. I read files into buffers (or at least partially) and parse the buffers, instead of calling

*Jobs are sorted by "credits". Being serviced removes all credits. Every loop through the servicing thread, all jobs get "priority" number of credits.

*Each loop through the service thread, X jobs are serviced (one for each free AIO slot I reserve). Servicing a job means issuing a read of at least MIN size (e.g. 64 KB), up to bandwidth*time_from_last_service+fudgefactor bytes. The service thread then waits for one of the AIO calls to return from its callback, then loops again.

This ensures that each file that needs X bandwidth gets close to that much data throughput, the X AIO calls give the OS a way to schedule disk reads efficiently, and the priority credit system ensures that all files get serviced eventually. Breaking up each large read into several small ones of "bandwidth"-based size keeps one file read from stalling out service to other file reads. Overall the system provides stable throughput and behaves better than just issuing random read calls from random subsystems, since each read carries some metadata about how important it is. It keeps item A from stalling out item B, which was the main concern when writing the system in the first place: the motivation came from issues seen in another project, where background streaming would make foreground streaming choppy (a level load behind a movie making the movie choppy).

The hard part about all of this is syncing everything without introducing too much overhead or complexity, because now your main thread needs to either ignore-until-callback or poll each frame to see if the data is loaded.

##### Share on other sites
boost::asio won't work without significant work on your part. It works beautifully for Windows disk IO, and for cross-platform network IO, but non-Windows disk IO just isn't supported. If you want it to work you have to implement a class conforming to the RandomAccessHandle Boost.Asio concept whose methods delegate to the underlying OS aio API, for which frob linked some tutorials above. It's not easy and requires a lot of work.

##### Share on other sites
Thanks, I think I will look into that aio stuff, I also found another good link I'd like to share: LINK

##### Share on other sites
I've used POSIX aio on Linux in the past, and it contains a subtle but annoying gotcha that you need to pay attention to at initial design time.

Works like this -- you send a request and tag it with a callback function which comes back to tell you that your request is complete. Peachy.

The obvious way of OOing the interface is to put an interface pointer into the aiocb structure, and have the callback thunk through to the interface and dispatch the call to the object to say its task is now done.

So you might have terrain zone objects created when a player is nearby (and potentially might enter them), which schedule loading their textures on creation and get prodded by the callback when the data has appeared. If the work goes away (for example, the player takes a turn that means you no longer need a zone), the obvious thing to do is delete the terrain block, which in turn either cancels the AIO task if it hasn't completed or deletes the texture.

And this works. *Almost* all of the time.

The problem is that AIO tasks can become uncancellable: call aio_cancel() and it returns AIO_NOTCANCELED. When it's in that state, the callback WILL get called at some point in the future and cannot be changed... and therefore you can no longer safely delete anything that callback will refer to.

In the end, we had to fix this by making queues of control objects, and tagging them to say whether their data was actually needed or not, to handle the spurious completion notifications from cancels which didn't work.

The reason for this is that underneath, the AIO system is effectively just a worker thread pool doing regular IO[1]. The task becomes uncancellable when one of the threads is actually blocked doing the IO.

So you need to make sure you handle that special case and be aware that a task whose attempted cancellation fails will still call the callback function.

[1] On Linux 2.6 there are just kernel threads doing the work for you -- you can see them on process listings.

##### Share on other sites
Hey, thanks for the info! AIO overall sounds pretty useful and I will definitely give it a try in the future.

Anyways, what is the biggest drawback of threading things on your own, apart from the low-level optimizations AIO will give me?

For instance, what will happen if I load quite a big file in its own thread; will it noticeably slow down the framerate if I don't load it in chunks?

These might be stupid questions, but I never seriously used threading and I would like to understand things a little better!

Thanks so far, that cleared up a lot of things already!

##### Share on other sites
We're using async I/O due to significant performance benefits and went with the POSIX aio interface for portability. Since it's not available on Windows, we've implemented a wrapper on top of the Windows APIs that does the job.

Quote:
 Anyways, what is the biggest drawback of threading things on your own, apart from the low-level optimizations AIO will give me?

As frob has said, threading requires careful study to ensure you don't have any race conditions.

Quote:
 For instance, what will happen if I load quite a big file in its own thread; will it noticeably slow down the framerate if I don't load it in chunks?

No, IO happens via DMA nowadays, so it doesn't need lots of CPU time like PIO did. Large or small IOs in a separate thread shouldn't slow down your main thread. However, splitting your IOs into chunks makes a lot of sense (it allows prioritizing IOs, simplifies a file caching mechanism, and also speeds up aio).

##### Share on other sites
Quote:
 Original post by mokaschitta: Anyways, what is the biggest drawback of threading things on your own, apart from the low-level optimizations AIO will give me?

The fact that, under Windows, the OS, file system, and drivers have the ability to organize disk access with knowledge of the physical data layout. They could theoretically take into account disk rotation speed and head positions.

As a more general advantage, it allows for overlapped operations. It becomes possible to overlap read/process/write operations.

In general, kernel-assisted asynchronous IO has the potential of less overhead since it can be tuned to specifics of that particular OS and/or file system.

In practice results will vary, but considering this is a facility present in most OSes, it makes sense to use it.

Quote:
 For instance, what will happen if I load quite a big file in its own thread; will it noticeably slow down the framerate if I don't load it in chunks?

Disk IO is always done in chunks; a disk cannot transfer 16 gigabytes in one call.

As far as processing goes: if you have idle cores, and if there is no memory contention, and if there are no scheduling conflicts, and if the disk controller doesn't cause any stalls, and if the disk transfer doesn't interfere with the OS, then there will be absolutely no impact from file loading.

Unfortunately, that is a lot of ifs, and a lot of them depend on hardware and other factors.

It is not uncommon for disk IO to cause annoying stalls to the entire system, threads or not, but it depends on very specific circumstances.

But first, and by far the most important performance factor: avoid accessing multiple files. Just .tar-ing everything into a single file and seeking inside of it will offer drastic performance improvements, especially under Windows and NTFS. *nixes are less affected by this, but still, if you have only a single file (or a handful) you remove many file lookups that would otherwise need to be performed. At the very least, they introduce a serial dependency, which is undesirable due to added latency.

##### Share on other sites
Okay thanks, aio is my choice now. Since I will mostly use the engine for testing and maybe exhibitions in the future, Mac and Linux support is enough for me right now!

##### Share on other sites
Quote:
 Original post by Jan Wassenberg: We're using async I/O due to significant performance benefits and went with the POSIX aio interface for portability. Since it's not available on Windows, we've implemented a wrapper on top of the Windows APIs that does the job.

I took a quick look at your code. What you wrote is definitely better than a threaded approach, but you should really look into I/O completion ports, as it's significantly faster and more scalable.

##### Share on other sites
Thanks for the feedback :) I am aware of completion ports (using them for directory change notification), but can't see how they would help with the current usage. These aio routines are called by a synchronous IO splitter that does caching and decompression on a block level. There are no threads involved and a maximum of 16 IOs in flight at any given time (which is already quite high, not even a Fusion ioDrive card needs that much). How can completion ports improve things here?

##### Share on other sites
Quote:
 Original post by cache_hit: I took a quick look at your code. What you wrote is definitely better than a threaded approach, but you should really look into I/O completion ports, as it's significantly faster and more scalable.

Did you mean overlapped IO?

##### Share on other sites
Quote:
 Original post by Jan Wassenberg: Thanks for the feedback :) I am aware of completion ports (using them for directory change notification), but can't see how they would help with the current usage. These aio routines are called by a synchronous IO splitter that does caching and decompression on a block level. There are no threads involved and a maximum of 16 IOs in flight at any given time (which is already quite high, not even a Fusion ioDrive card needs that much). How can completion ports improve things here?

Well, if you search around the literature surrounding IOCP, you won't find a lot of details on exactly what makes it faster and more scalable than event-driven overlapped I/O. All you'll find is everyone saying "IOCP is the fastest and most scalable way to do async i/o in windows". One reason that comes to mind is that IOCP requires no user-mode synchronization primitives. No mutexes, no critical sections, no events. Of course synchronization happens, but it happens inside the kernel using kernel synchronization primitives, which should be faster.

It has some flexibility advantages over traditional overlapped I/O as well since it has direct API support for performing arbitrary computations asynchronously, not just I/O. For example, you could have a thread pool with number of threads equal to number of CPUs on the system. Without using any user-mode synchronization primitives, you can post a message to this thread pool to perform encryption / decryption of binary data, which upon completion can post a message back to the main thread that it's complete, again without any user-mode synchronization primitives such as events.

A basic IOCP loop would look something like this:

```cpp
// You can put other user-defined info in here if you wish.
struct RequestPacket : public OVERLAPPED
{
    RequestPacket(LPVOID buf)
        : buffer(buf)
    {
        memset(static_cast<OVERLAPPED*>(this), 0, sizeof(OVERLAPPED));
    }
    LPVOID buffer;
};

#define KEY_READ    0
#define KEY_WRITE   1
#define KEY_ENCRYPT 2

void iocp_loop()
{
    // Create a handle for reading from. FILE_FLAG_NO_BUFFERING is required
    // for optimal throughput, but imposes alignment / request size restrictions.
    HANDLE hRead = CreateFile(path, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE,
                              NULL, OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED | FILE_FLAG_SEQUENTIAL_SCAN,
                              NULL);

    // Create a handle for writing to. FILE_FLAG_NO_BUFFERING is required
    // for optimal throughput, but imposes alignment / request size restrictions.
    // FILE_FLAG_WRITE_THROUGH is also required for optimal throughput.
    HANDLE hWrite = CreateFile(path2, GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED,
                               NULL);

    // Create a new IOCP and associate it with the reading handle and read key.
    HANDLE hiocp = CreateIoCompletionPort(hRead, NULL, KEY_READ, 0);
    // Associate the same IOCP with writing as well, using the write handle and a different key.
    CreateIoCompletionPort(hWrite, hiocp, KEY_WRITE, 0);

    LARGE_INTEGER size;
    LARGE_INTEGER nextReadOffset;
    LARGE_INTEGER nextWriteOffset;
    GetFileSizeEx(hRead, &size);
    nextReadOffset.QuadPart = 0;
    nextWriteOffset.QuadPart = 0;

    int readsOutstanding = 0;
    int writesOutstanding = 0;
    const int maxOutstandingIo = 16;

    // Required since we're using FILE_FLAG_NO_BUFFERING. You can use
    // FSCTL_GET_NTFS_VOLUME_DATA to fetch this number for real.
    int blockSize = GetVolumeBlockSize();

    LPVOID lpBuffer = VirtualAlloc(NULL, 65536, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    std::vector<RequestPacket*> packets;
    for (int i = 0; i < maxOutstandingIo; ++i)
        packets.push_back(new RequestPacket(static_cast<char*>(lpBuffer) + i * blockSize));

    // Force some reads to kick off the process.
    for (int i = 0; i < maxOutstandingIo; ++i)
    {
        RequestPacket* packet = packets[i];
        packet->Offset = nextReadOffset.LowPart;
        packet->OffsetHigh = nextReadOffset.HighPart;
        ReadFile(hRead, packet->buffer, blockSize, NULL, packet);
        ++readsOutstanding;
        nextReadOffset.QuadPart += blockSize;
    }

    // EOF and error handling omitted for brevity.
    while ((readsOutstanding > 0) || (writesOutstanding > 0))
    {
        DWORD bytes;
        ULONG_PTR key;
        OVERLAPPED* overlapped;
        RequestPacket* packet;
        GetQueuedCompletionStatus(hiocp, &bytes, &key, &overlapped, INFINITE);
        packet = static_cast<RequestPacket*>(overlapped);
        switch (key)
        {
        case KEY_READ:
            // A read finished: turn the same packet around as a write.
            readsOutstanding--;
            packet->Offset = nextWriteOffset.LowPart;
            packet->OffsetHigh = nextWriteOffset.HighPart;
            WriteFile(hWrite, packet->buffer, bytes, NULL, packet);
            writesOutstanding++;
            nextWriteOffset.QuadPart += bytes;
            break;
        case KEY_WRITE:
            // A write finished: reuse the packet's buffer for the next read.
            writesOutstanding--;
            packet->Offset = nextReadOffset.LowPart;
            packet->OffsetHigh = nextReadOffset.HighPart;
            ReadFile(hRead, packet->buffer, blockSize, NULL, packet);
            readsOutstanding++;
            nextReadOffset.QuadPart += blockSize;
            break;
        }
    }
}
```

This kind of hints at the flexibility advantage of IOCP. The entire asynchronous pipeline is managed through a single place. Furthermore, the WinSock API directly supports IOCP, so you call WSASend() or whatever the function name is, and it will gladly operate on an overlapped socket in exactly the same way. You'll get notification of the network completion through GetQueuedCompletionStatus().

All of this is abstracted for you in boost::asio, but boost::asio requires an OS-specific interface to be implemented. It provides the Windows implementation that uses IOCP internally for both disk IO and sockets, and it provides the Linux implementation for sockets, but it doesn't provide any Linux implementation for disk IO, which would ultimately be a mapping of the required Boost interface onto the AIO API.

I implemented such a system at work for some high performance disk backup software. Using the IOCP approach was the only way I could achieve high enough performance that the actual physical disk became the bottleneck. I can now read/write literally as fast as the disk allows, which is surprisingly hard to achieve.

##### Share on other sites
Quote:
Original post by Antheus
Quote:
 Original post by cache_hit: I took a quick look at your code. What you wrote is definitely better than a threaded approach, but you should really look into I/O completion ports, as it's significantly faster and more scalable.

Did you mean overlapped IO?

IOCP is a specific type of overlapped I/O that Microsoft recommends for the most performance-intensive and demanding applications.

##### Share on other sites
hm, IOCPs certainly are nice, but again, in this case, I don't see any relevant gains to be had. Overhead due to WaitForMultipleObjects+ResetEvent is absolutely negligible for the few active requests. The code manages to max out an ioDrive SSD already (700 MB/s IIRC).

Quote:
 Using the IOCP approach was the only way I could achieve high enough performance that the actual physical disk became the bottleneck. I can now read/write literally as fast as the disk allows, which is surprisingly hard to achieve.

I'm curious: were you seeing a bottleneck with overlapped completion mechanisms other than IOCP? Which one was it - events or callbacks?

The IOCP hype is probably coming from applications where you have tons of requests and multiple threads (web server), and definitely makes sense in that context.

##### Share on other sites
Quote:
 Original post by Jan Wassenberg: hm, IOCPs certainly are nice, but again, in this case, I don't see any relevant gains to be had. Overhead due to WaitForMultipleObjects+ResetEvent is absolutely negligible for the few active requests. The code manages to max out an ioDrive SSD already (700 MB/s IIRC).

What about mechanical disks with multiple platters and read/write heads? Ironically, it's harder to max out performance on these because the ordering of reads and writes matters more.

Also, you mentioned doing compression / decompression. In your application, which is the slowest among { reading, writing, compression } of a single block of data? Is compression happening asynchronously as well? In theory if compression is the bottleneck, then you should notice 0 effect by introducing a slower disk, and if the disk is the bottleneck you should notice 0 effect by removing compression. Is this the case for you? If so maybe there is little reason to use IOCP.

The main reason I like it is that it provides an arbitrary asynchronous computational model that happens to be extremely easy to use for IO, both on the network and to/from disk. The event model loses scalability when your pipeline grows and you're waiting on hundreds of handles simultaneously.

##### Share on other sites
Quote:
 Original post by Jan Wassenberg: The IOCP hype is probably coming from applications where you have tons of requests and multiple threads (web server), and definitely makes sense in that context.

IOCP gives you a framework for overlapping work by a kernel-managed threadpool. The "hype" is unwarranted, it actually is a good way to do it, especially if you need to interleave read/process/write operations.

The basic idea is a generic task dispatcher over a thread pool, similar to the Mac's GCD, but somewhat more consistent than Linux aio (for the reasons listed above).

IOCP received considerable criticism when it first appeared, but since then some details have been improved, and multi-core became the norm. For its purpose and intent, it does the job.

Quote:
 and if the disk is the bottleneck you should notice 0 effect by removing compression.

With slow media, removing compression decreases throughput. When full-disk compression first appeared, it often improved read/write performance for many disks.

##### Share on other sites
Quote:
 Original post by Antheus: For its purpose and intent, it does the job.
That's the understatement of the century :-)
IOCP are really one of the best implemented things in Windows. If they got one thing right, it's IOCP.

They are what epoll tries to be under Linux but embarrassingly fails to be. And not only do IOCPs get right what's wrong with epoll, they are fast, really fast, too. Posting a message onto an IOCP and retrieving it only takes about 2-3 times longer than doing the same thing on a lock-free queue, except you have none of the pain and limitations and all of the functionality, it doesn't burn CPU cycles spinning, and the IOCP will reduce context switches and keep caches warm by waking threads LIFO.

##### Share on other sites
Quote:
 What about mechanical disks with multiple platters and read/write heads? Ironically it's harder to max out performance on these due to the fact that ordering of read/writes is more important.

I think the performance results on mid-2008 hardware were 1016 MB/s on a 16-disk SAS array (limited by the interface) and ~700 MB/s for 8 disks. (For those tests, I had one core driving each disk and IIRC 8-deep reads of about 256 KB blocks.)
Unfortunately, those numbers don't help much because I don't know the disk specs offhand. An earlier version of this code using completion routines and only 2-deep reads maxed out the ST380021A (2003, 42 MB/s).

Quote:
 Also, you mentioned doing compression / decompression. In your application, which is the slowest among { reading, writing, compression } of a single block of data? Is compression happening asynchronously as well? In theory if compression is the bottleneck, then you should notice 0 effect by introducing a slower disk, and if the disk is the bottleneck you should notice 0 effect by removing compression. Is this the case for you? If so maybe there is little reason to use IOCP.

Our case differs from your backup scenario: we either read or write, but not both, so compression reduces the amount of data to be transferred.
The basic scheme looks like this:
```
queue up N IOs
repeat
    wait until the next one is done
    queue up another IO
    [de]compress the [finished] block
until done
```

A test in 2006 showed that zlib was decompressing at 94 MB/s (far faster than IO), and therefore it costs zero additional time because it is perfectly overlapped with the IO. However, since modern disks manage that kind of throughput and single-core performance hasn't kept up, it may be worthwhile to switch to faster LZ variants.

Quote:
 The event model loses scalability when your pipeline grows and you're waiting on hundreds of handles simultaneously.

Yes, however, that is not the case in my usage.

Quote:
 The "hype" is unwarranted, it actually is a good way to do it, especially if you need to interleave read/process/write operations.

I agree that IOCPs are fine and good, but there is definitely too much hype and too little understanding concerning their real advantages. In fact, cache_hit said it well:
Quote:
 Well, if you search around the literature surrounding IOCP, you won't find a lot of details on exactly what makes it faster and more scalable than event-driven overlapped I/O.

I see two main benefits:
- avoiding oversubscription by capping thread count and integrating with scheduler;
- avoiding fairness issues and overhead due to round-robin polling.
Neither is the least bit applicable here: no worker threads are needed, and you actually WANT unfair treatment (i.e. the first pending handle to be serviced first, since compressed streams must be processed in order).

Since IOCP-based code would be a bit more complex, I conclude that using them here would be a net loss (but servers with thousands of sockets are an entirely different story).

##### Share on other sites
Quote:
 They are what epoll tries to be under Linux but embarrassingly fails.

I've never found anything particularly slow or troublesome with epoll, and integration into application loops has got even easier now with the addition of the eventfd and signalfd helpers; what is it that IOCP (which I've not yet looked at) has over epoll?

{Apart from the obvious of which OS supports which thing :-}

##### Share on other sites
Quote:
 Original post by Katie: what is it that IOCP (which I've not yet looked at) has over epoll?
Ow... :( I only saw your post today, sorry for the late answer.

The most important thing that epoll does not have, and which makes it suck in my opinion, is that it is single-threaded. Now, before you start "but of course you can...", remember that if you use one epoll with a group of threads and a readiness change happens, epoll will wake up every single one of them. The first one will get hold of the file descriptor, and the rest will be put on the wait queue again. Of course it is still possible to construct something that makes use of multiple cores, but if you do it the simple, obvious way, you get the same thundering herd that you were using epoll to avoid in the first place; you could just as well fire up a bunch of threads and have them all block on read() or recvfrom(), no real difference.

IOCP, on the other hand, can take any number of threads and will wake up exactly one of them as an event arrives, and it will wake the most recently run thread first, which is a good thing. It also lets you specify the maximum number of threads from your pool that you want active at one time (this is only an approximate figure; you will occasionally get one more thread running, but it nevertheless works very well in practice). It's lightning fast, too.

Add to that some minor details like AIO + eventfd being entirely undocumented, or the generally bad documentation of nearly every Linux-specific feature. Sadly, documentation is the one thing where Linux always loses big time.
Sure enough, you can usually find everything in the headers somewhere if you keep searching long enough, and you can always dig through kernel sources... I'm sure some people see this as adequate, but except for the coolness factor of being a geek, digging through the kernel sources for lack of documentation doesn't add much value. Well, not to me, anyway.
Also, you sometimes find that the available documentation has been lying to you after looking at the sources, which doesn't really make things better (take msync as an example).

Lastly, IOCP works pretty much identically on the most recent version of Windows and on 10-year-old Windows, which is pretty cool.
The same cannot be said about epoll + eventfd + signalfd + timerfd (for example, timerfd will not work with the still quite common Ubuntu 8.04 LTS distro).