samoth

Posted 02 April 2012 - 09:00 AM

@Madhed:
With respect to the generally very interesting paper by Jan Wassenberg, one should note that it contains a lot of very useful information for some cases, and a lot worth considering in general. If one develops for a console or considers streaming data from a CD, the paper hits the spot 100%. Some of the techniques described (e.g. duplicating blocks) are a big win when you read from a medium where seeking is the end of the world (such as a DVD), or when you can't afford to clobber some RAM.
On the other hand, if one targets a typical Windows desktop PC with "normal" present-day hardware, almost all of the claims and assumptions are debatable or wrong (that was already the case in 2006 when the paper was written).

What is indisputably right is that it's generally a good idea to have one (or few) big files rather than a thousand small ones.
Other than that, one needs to be very careful about which assumptions are true for the platform one develops on.

On a typical desktop machine, which usually has half a gigabyte or a gigabyte of unused memory (often more like 2-4 GiB nowadays, or more), you absolutely do not want to bypass the file cache. If speed (and latency, and worst-case behaviour) is of any concern, you also absolutely do not want to use overlapped IO.

Overlapped IO rivals memory mapping in raw disk throughput if the file cache is disabled and if no pages are in cache. This is cool if you want to stream in data that you've never seen and that you don't expect to use again. It totally sucks otherwise, because the data is gone forever once you stop using it. With memory mapping, you pull the pages from the cache the next time you use the data. Even with some seeks in between (if only part of a large file is in the cache), pulling the data from the cache is no slower and usually faster (much to my surprise; this is counterintuitive, but I've spent considerable time benchmarking it).

Ironically, overlapped IO runs at about 50% of the speed of synchronous IO if it is allowed to use the cache (unlike under e.g. Linux, this is actually possible under Windows). Pulling data from the cache into the working set synchronously peaks at around 2 GiB/s on my system (this is surprisingly slow for "doing nothing", a memcpy at worst, but it beats anything else by an order of magnitude).
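
If you want to get a rough feel for cached read throughput on your own machine, something like the following Python sketch may help. It is only an illustration, not the benchmark behind the numbers above: the function name `read_throughput` and the 1 MiB chunk size are my own assumptions, and it makes no attempt to flush the file cache, so you cannot be sure the first pass is really cold.

```python
import os
import tempfile
import time


def read_throughput(path, chunk_size=1 << 20):
    """Read the file sequentially with plain synchronous reads.

    Returns (bytes_read, MiB_per_second). The name and the chunk size are
    illustrative assumptions, not taken from any paper or API.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 skips Python's own buffer; the OS file cache is still used.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    # Guard against a zero interval on very small files.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, total / elapsed / (1 << 20)


if __name__ == "__main__":
    # Write a throwaway 8 MiB file, then read it twice.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, os.urandom(8 << 20))
        os.close(fd)
        for attempt in (1, 2):
            size, rate = read_throughput(path)
            print(f"pass {attempt}: {size} bytes at {rate:.0f} MiB/s")
    finally:
        os.remove(path)
```

A freshly written file is usually still in the cache, so both passes in this demo will most likely be cache hits. To see a difference between cold and warm reads you would have to point it at a large file that has not been touched since boot.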

Asynchronous IO will silently and undetectably revert to synchronous operation, unreliably and differently between operating systems and versions, and depending on user configuration. Also, if anything "unexpected" happens, queueing an overlapped request can suddenly block for 20 or 40 milliseconds or more (so much for threadless IO: your render thread stalls during that time). This is not unique to Windows; Linux has the exact same problem. If the command queue is full or some other obscure limit (that you don't know about and that you cannot query!) is hit, your io_submit blocks. Surprise, you're dead.

What you ideally want is to memory map the entire data file and prefault as much of it as you can linearly at application start (from a worker thread).

If you, like me, own a "normal, inexpensive" 3-4 year old hard disk, you can observe that this will suck a 200 MiB data file into RAM in 2 seconds, with few or no seeks at all. If you, like me, also have an SSD, you can verify that the same thing will happen in well under a second. Either way, it's fast and straightforward. If your users, like pretty much everyone, have half a gigabyte of unused memory, the actual read later will be "zero time" without ever accessing the disk.
This is admittedly the best case, not the worst case. But the good news is that the worst case is no worse than otherwise. The best (and average) case, on the other hand, is much better.
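
For illustration, here is a minimal Python sketch of the "map everything, prefault linearly from a worker thread" idea. It is a sketch under assumptions, not production code: the helper names `map_file`, `prefault` and `start_prefault` are made up for this example, and touching one byte per page is just one simple way to force the pages in. A native implementation would use the platform's own mapping calls instead.

```python
import mmap
import os
import tempfile
import threading

PAGE_SIZE = mmap.PAGESIZE  # page size reported by the OS


def map_file(path):
    """Map the whole (non-empty) file read-only.

    The mapping stays valid after the file object is closed.
    """
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def prefault(mapping, result):
    """Touch one byte per page, front to back, so pages are faulted in linearly."""
    checksum = 0
    for offset in range(0, len(mapping), PAGE_SIZE):
        checksum ^= mapping[offset]  # the read itself triggers the page fault
    result["pages"] = (len(mapping) + PAGE_SIZE - 1) // PAGE_SIZE
    result["checksum"] = checksum  # kept only so the reads have a visible effect


def start_prefault(mapping):
    """Run the prefault pass on a worker thread; returns (thread, result dict)."""
    result = {}
    worker = threading.Thread(target=prefault, args=(mapping, result), daemon=True)
    worker.start()
    return worker, result


if __name__ == "__main__":
    # Hypothetical stand-in for a real data file: 16 MiB of random bytes.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, os.urandom(16 << 20))
        os.close(fd)
        data = map_file(path)
        worker, result = start_prefault(data)
        # ... the main thread would carry on with startup work here ...
        worker.join()
        print("prefaulted pages:", result["pages"])
        data.close()
    finally:
        os.remove(path)
```

The main thread never waits on the disk here. If it reads a page the worker has not reached yet, it simply takes the page fault itself, which is the "no worse than otherwise" worst case.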
