
### cr88192

Posted 06 February 2013 - 04:15 PM

samoth, on 06 Feb 2013 - 09:17, said:

Quote
I am writing mostly from personal experience, which tends to be that HDD's aren't very fast.

More precisely, one should say:
Harddisks are fast. Fast enough for pretty much everything. SSDs are faster, in fact, ridiculously fast. They're becoming more and more common (and thus cheaper, which is a positive feedback loop), and I'm hoping that in maybe 3-5 years from now you'll be able to assume a SSD as "standard" in every computer. Harddisk seeks are slow, but if you pay only a little attention (e.g. not ten thousand files), not so much that they're ever an issue.

seeks are basically why data gets packaged up: reading a single larger package takes the place of reading a lot of little files.

as for HDD speed, it depends:
I am currently using 5400RPM 1TB drives, which have a max IO speed of roughly 100MB/s (though 50-70MB/s is typical for OS file-IO).

although, yes, 7200RPM drives are also common, they are more expensive for the size, and SSDs are still rather expensive (like $140 for a 128GB drive and similar), vs like $89 for 2TB or similar for a 5400RPM drive (vs like $109 for a similar-size 7200RPM drive).

Quote
DVDs are... oh well... about 1/3 to 1/5 the speed of a harddisk, but you normally need not worry. DVD seeks, on the other hand, are the devil.

Which boils down to something as simple as: Try and avoid seeks, and you're good. If you ship on DVD, avoid them as if your life depended on it.

A slow harddisk (some 3-4 year old 7200 rpm disk) realistically delivers upwards of 50MiB/s if you don't have too many seeks. That isn't so bad. With NCQ, you often don't even notice a few dozen seeks at all, and even without, the impact is very small if you have several concurrent reads.

4 year old 5400RPM drives (3.5 inch internal), in my case...

things are slow enough to stall trying to do screen-capture from games if using uncompressed video, or so help you if you try to do much of anything involving the HDD if a virus scan is running...

better capture is generally gained via a simple codec, like MJPEG or the FFDShow codec, then transcoding into something better later.

I have a 10 year old laptop drive (60GB 4800RPM), which apparently pulls off a max read speed of around 12MB/s, and a newer laptop drive (320GB 5400RPM) which pulls off around 70MB/s (or around 30-40MB/s inside an OS).

Quote
A SSD delivers so fast that decompression can be the actual bottleneck. Seriously. You just don't care about its speed or about seek times.

Now consider a DVD. A typical, standard 16x drive will, in the best case, deliver something close to 20 MiB/s, so between 1/3 and 1/2 of what you get from a slow harddisk. In the worst case (sectors close to the middle), you'll have about half that.
Now throw in 30 seeks, which on the average take around 150ms each on a DVD drive. That's a full 4.5 seconds spent in seeking. Compared to these 4.5 seconds during which nothing at all happens (nothing useful from your point of view, anyway), pretty much every other consideration regarding speed becomes insignificant.

my experience is roughly pulling in a few hundred MB of data over the course of a few seconds or so.
more time is generally spent reading the data than decompressing it.

like, with deflate:
I have previously measured decompression speeds of several GB/s (and, in a few past tests, namely with readily compressible data, hit the RAM speed limit), mostly because the actual "work" of inflating data consists largely of memory-copies and RLE flood-fills.

implemented sanely, decoding a Huffman symbol is mostly a matter of something like:
v=hufftab[(win>>pos)&32767];
pos+=lentab[v];
if(pos>=8) { pos-=8; win=(win>>8)|((*cs++)<<24); }

which isn't *that* expensive (and, more so, it is only a small part of the activity except in poorly-compressible data).

( EDIT/ADD: more so, the poorly-compressible-data edge case is generally handled because the deflate encoder is like "hey, this data doesn't compress for crap" and falls back to "store" encoding, which amounts to essentially a straight memory copy in the decoder. side note: modern CPUs also include dedicated silicon to optimize special memory-copying and flood-fill related instruction sequences, so these are often fairly fast. )

granted, I am using a custom inflater rather than zlib (it makes a few optimizations under the assumption of being able to compress/decompress an entire buffer at once, rather than working piecewise on a stream).

this doesn't necessarily apply to things like LZMA, which involves a much more costly encoding and has a hard time breaking 60-100MB/s decoding speeds IME; it is more specific to deflate (which can be pretty fast on current HW).

for things like PNG, most of the time goes into running the filters over the decompressed buffers, which has led to some amount of optimization (generally, dedicated functions which apply a single filter over a scanline with a fixed format). this is why the Paeth optimization trick became important: the few conditionals inside the predictor itself became a large bottleneck. this is because (unlike deflate) it is necessary to execute some logic for every pixel (like adding the prior pixel, or a prediction based on the adjacent pixels).

JPEG is also something that would normally be expected to be dead-slow, but isn't actually all that bad (more so, as my codec also assumes the ability to work a whole image at once), and most operations boil down to a "modest" amount of fixed-point arithmetic (like in the DCT, no cosines or floating-point is actually involved, just a gob of fixed-point). sort of like with PNG, the main time-waster tends to become the final colorspace transformation (YUV -&gt; RGB), but this can be helped along in a few ways:
writing logic specifically for configurations like 4:2:0 and 4:4:4, which can allow more fixed-form logic;
generally transforming the image in a block-by-block manner (related to the above, often it is an 8x8 or 16x16 macroblock which is broken down into 2x2 or 4x4 sub-blocks for the color-conversion, partly to allow reusing math between nearby pixels);
using special logic to largely skip over costly checks (like individually range-clamping pixels, ..., which can be done per-block rather than per-pixel);
for faster operation, one can also skip over time-wasters like rounding and similar;
...

with a lot of tricks like this, one can get JPEG decoding speeds of ~100 Mpix/s or so (somehow...), ironically not too far off from a PNG decoder (or, FWIW, a traditional video codec).

getting stuff like this fed into OpenGL efficiently is a little bit more of an issue, but, the driver-supplied texture compression was at least moderately fast, and in my own experiments I was able to write faster texture compression code.

the main (general) tricks mostly seem to be:
avoid conditionals where possible (straight-through arithmetic is often faster, as pipeline stalls will often cost more than the arithmetic);
if at all possible, avoid using floating point in per-pixel calculations (the conversions between bytes and floats can kill performance, so straight fixed-point is usually better here).

SIMD / SSE can also help here, but has to be balanced with its relative ugliness and reduced portability.

as for an SSD: dunno...

I suspect that, at present, SSDs are more of a novelty though...

even if the SSD is very fast, the speed of the SATA bus will still generally limit them to about 400MB/s or so, so using compression for speedups still isn't completely ruled out (though, granted, it will make much less of a difference than it does with a 50MB/s or 100MB/s disk-IO speed, and avoidance of "costly" encodings may make more sense).

granted, it seems with the newly optimized PNG Paeth filter, and faster DXT encoding, the main time waster (besides disk-IO) during loading is now... apparently... the code for scrolling the console buffer...
(this happens whenever a console print message prints a newline character, which involves moving the entire console up by 1 line, and is basically just a memory copy).
(side note: this console operation just naively uses "for()" loops to copy memory...).

and this shaved a few seconds off the startup time (down to about 3 seconds to start up the engine, and around 9 seconds to load the world).
