Modern filesystems are very good at their job. Let them do it for you; don't reinvent them. The difference between reading a handful of files off disk separately versus reading one file and chunking it yourself is minimal in terms of actual performance. Unless you are dealing with well in excess of 100,000 files, you're better off trusting the OS to do what it was built to do.
4 minutes ago, KarimIO said:
Haha very clever. I was referring to minimizing disk lookups by storing everything in one file. While I was originally looking to compress it, that was because I believed it would decrease loading time in exchange for minimal decompression time. Now I'm just looking to load a set of files in one uncompressed archive.
It has been debated in the past whether loading compressed files could be faster, because file access is usually slower than CPU decompression. Modern solid state drives have diminished this advantage, but there is still a trade-off.
The variety of media types that your program will encounter means you can hardly make any assumptions about how to optimize the process. Physical drives could be slow and fragmented, SSDs might not care how your files are laid out, etc. I don't think there is much to gain by optimizing a low-level process that takes a while anyhow.
There is LZ4, which is slow to compress at its highest level (so compression should be done offline), but according to its authors decompression is about 10 times faster than zlib (and roughly half the speed of memcpy).
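A minimal sketch of that offline-compress / load-time-decompress split using the LZ4 C API (LZ4_compress_HC and LZ4_decompress_safe are the library's real entry points; the payload here is just a placeholder):

    #include <lz4.h>
    #include <lz4hc.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char src[] = "asset bytes would go here...";
        const int src_size = (int)sizeof(src);

        /* Worst-case compressed size, so the destination buffer always fits. */
        char compressed[LZ4_COMPRESSBOUND(sizeof(src))];

        /* Offline step: high-compression mode is slow but yields the smallest output. */
        int c_size = LZ4_compress_HC(src, compressed, src_size,
                                     (int)sizeof(compressed), LZ4HC_CLEVEL_MAX);
        if (c_size <= 0) return 1;

        /* Load-time step: decompression takes the same fast path regardless of level. */
        char restored[sizeof(src)];
        int d_size = LZ4_decompress_safe(compressed, restored, c_size,
                                         (int)sizeof(restored));
        if (d_size != src_size || memcmp(src, restored, src_size) != 0) return 1;

        printf("%d -> %d bytes and back\n", src_size, c_size);
        return 0;
    }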
11 minutes ago, ApochPiQ said:
Modern filesystems are very good at their job. Let them do it for you; don't reinvent them. The difference between reading a handful of files off disk separately versus reading one file and chunking it yourself is minimal in terms of actual performance. Unless you are dealing with well in excess of 100,000 files, you're better off trusting the OS to do what it was built to do.
But between id's megatextures and everyone from Valve to Squenix to DICE chunking their data into larger files, isn't there a clear performance increase from tying together larger files? Even GPUs work better when smaller buffers are chunked together into larger ones. And not everyone has an SSD (statistics show HDDs are still about twice as popular in terms of shipments, not to mention older machines).
The major advantage of packaging everything into an archive is distribution. Users only need to download a single file, and as long as that file is intact and its checksums match, you can have a good degree of confidence that they have the correct content.
A simple collapse of a directory structure to a single file, with the inclusion of a table of contents, can satisfy this requirement.
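For illustration, such a format can be as small as a header followed by an entry table and then the raw file data. This is a hypothetical sketch (the struct names and field sizes are made up, not any particular engine's format):

    #include <stdint.h>

    /* Hypothetical single-file archive: a header, then a table of
       contents, then every file's bytes packed back to back. */
    typedef struct {
        char     magic[4];     /* e.g. "PACK", identifies the format */
        uint32_t entry_count;  /* number of entries in the table below */
    } PackHeader;

    typedef struct {
        char     name[56];     /* original file path, NUL-terminated */
        uint64_t offset;       /* byte offset of this file's data in the archive */
        uint64_t size;         /* length of the data in bytes */
    } PackEntry;

A loader reads the header, reads entry_count entries, and can then seek straight to any file's data.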
After that it depends on how fancy you wish to go. One example is that you might place resources that are loaded together - such as a map and its textures - contiguously in the file so that they can load with less jumping around the disk. That kind of very specific fine-tuning is something that no format will give you; you need to do this yourself.
Another thing that a package can give you is the ability to override content. To use the example of id's old PAK system, a file in pak1.pak will override the same file in pak0.pak. Again, this isn't something you get from the format; you code it up in your file loading subsystem.
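A sketch of that override rule, assuming packs are mounted in load order (pak0 first, pak1 next, and so on); the types and the find_file helper here are hypothetical:

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        const char *name;   /* path of the file inside the pack */
        /* offset/size fields omitted for brevity */
    } Entry;

    typedef struct {
        Entry  *entries;
        size_t  count;
    } Pack;

    /* Later packs win: search newest-first, return the first match. */
    const Entry *find_file(const Pack *packs, size_t pack_count, const char *name)
    {
        for (size_t p = pack_count; p-- > 0; ) {
            for (size_t i = 0; i < packs[p].count; ++i) {
                if (strcmp(packs[p].entries[i].name, name) == 0)
                    return &packs[p].entries[i];
            }
        }
        return NULL; /* not present in any mounted pack */
    }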
Some food for thought there, but again it highlights that the format is not the important thing, it's what you do with it, how you use it, that matters.
1 hour ago, ApochPiQ said:
Modern filesystems are very good at their job. Let them do it for you; don't reinvent them. The difference between reading a handful of files off disk separately versus reading one file and chunking it yourself is minimal in terms of actual performance. Unless you are dealing with well in excess of 100,000 files, you're better off trusting the OS to do what it was built to do.
6 minutes ago, mhagain said:
The major advantage of packaging everything into an archive is distribution. Users only need to download a single file, and as long as that file is intact and its checksums match, you can have a good degree of confidence that they have the correct content.
This is true in the world where your game lives on a hard drive. Some of us still remember the world where the game lives on optical media which practically mandates this type of packaging to have any semblance of control over read patterns. If you can block out a package to load a bunch of assets in one long sequential read to get through loading from optical, it makes a world of difference.
Probably not relevant to the question at hand of course, but I don't want to lose the plot of why some of these things were developed in the first place.
1 hour ago, Promit said:
Probably not relevant to the question at hand of course, but I don't want to lose the plot of why some of these things were developed in the first place.
I assumed traditional hard drive or solid-state storage, yes. I personally always felt that laying out optical disks was black magic, and if you're publishing on optical media, you're probably far beyond having to ask about the benefits (or costs) of file archives.
4 hours ago, KarimIO said:
Now I'm just looking to load a set of files in one uncompressed archive.
The obvious one that springs to my mind is TAR.
They are simply file archives (no compression), but in the wild they are usually compressed with gzip and given a .tar.gz file extension.
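For reference, a tar archive is just a sequence of 512-byte header blocks, each followed by the file's data padded up to a 512-byte boundary, with numeric fields stored as ASCII octal. A minimal reader-side sketch of the ustar header layout:

    #include <stdlib.h>

    /* ustar header block: exactly 512 bytes; numeric fields are
       NUL/space-terminated ASCII octal strings. */
    typedef struct {
        char name[100];     /* member file name */
        char mode[8];
        char uid[8];
        char gid[8];
        char size[12];      /* file size in octal ASCII */
        char mtime[12];
        char chksum[8];
        char typeflag;
        char linkname[100];
        char magic[6];      /* "ustar" for POSIX tar */
        char version[2];
        char uname[32];
        char gname[32];
        char devmajor[8];
        char devminor[8];
        char prefix[155];
        char padding[12];   /* pads the header out to 512 bytes */
    } TarHeader;

    /* Decode the octal size field; the member's data follows the
       header, rounded up to the next 512-byte block. */
    static long tar_size(const TarHeader *h)
    {
        return strtol(h->size, NULL, 8);
    }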
20 hours ago, KarimIO said:
But between id's megatextures and everyone from Valve to Squenix to DICE chunking their data into larger files, isn't there a clear performance increase from tying together larger files?
Those solve different problems.
Megatextures solve a graphics problem: texture state changes have a cost in both data bandwidth and graphics processing time, and using one large image reduces both of those costs. They made a time/space tradeoff that is common in computer science: by using more space, in the form of an enormous block of video memory, they got a reduced time cost.
As partially discussed by others already, a larger file might give better results. It is all a matter of tradeoffs.
Packed files have the same time/space tradeoffs. A large file might result in fewer time-expensive IO calls, such as opening a file, in exchange for more seeking around within the file. Opening files has a cost, which many operating systems reduce or eliminate through directory caching. Hopping around in files also has a cost: data must be transferred, and buffering windows get invalidated.
The same tradeoff applies to compression. Compression might result in fewer expensive data transfers from the disk, but it adds expensive calls into the decompression library. The balance over which is faster is always shifting and depends on context, so there is no absolute answer.
These are always tradeoffs. Understand both sides and when each is better. Over your career you will find many of these, where something that was taken as a performance necessity one year becomes a terrible performance concern a few years later. If you don't understand the tradeoffs being made you will struggle to follow why the performance guidelines are constantly changing.
On 6/27/2017 at 10:38 AM, KarimIO said:
In that case, is there any non-compressed data format for me to package all my resources into? I could make one, but I figure using standards is usually better.
zlib (specifically contrib/minizip) can create zip files. Use a zero compression level if you really want the files in the zip file to be uncompressed. PAK files are usually just zip files.
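A sketch of that with minizip; error handling is trimmed, and the archive name and file contents are placeholders. Passing method 0 (stored) with Z_NO_COMPRESSION writes the bytes into the zip exactly as-is:

    #include <string.h>
    #include <zlib.h>
    #include "zip.h"   /* from zlib's contrib/minizip */

    int main(void)
    {
        /* APPEND_STATUS_CREATE starts a fresh archive. */
        zipFile zf = zipOpen("assets.pak", APPEND_STATUS_CREATE);
        if (!zf) return 1;

        const char data[] = "file contents go here";

        /* method 0 = stored (no deflate), level 0 = no compression. */
        if (zipOpenNewFileInZip(zf, "textures/example.txt", NULL,
                                NULL, 0, NULL, 0, NULL,
                                0, Z_NO_COMPRESSION) != ZIP_OK)
            return 1;

        zipWriteInFileInZip(zf, data, (unsigned)strlen(data));
        zipCloseFileInZip(zf);
        zipClose(zf, NULL);   /* second argument: optional global comment */
        return 0;
    }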