Compressing assets (boost::iostreams)

Regarding the compression of one's own asset formats (models, textures, etc.), I am trying to decide between the following approaches:

1. Use zlib (through Boost.Iostreams) to compress the individual assets (which are all serialized with Boost anyway).
2. Leave them uncompressed and just store them in one large file (à la Quake's PAK files).
3. Do both of the above.

So my question really is: which approach would you more experienced people recommend, and why? Perhaps speed is an issue? Additionally, any tips on using Boost to compress streams would be appreciated.
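For reference, here is roughly what I have in mind for the Boost.Iostreams part. This is an untested sketch (the file name is a placeholder, and it needs linking against Boost.Iostreams and zlib): you push a zlib_compressor or zlib_decompressor onto a filtering stream in front of the file.

    #include <boost/iostreams/filtering_stream.hpp>
    #include <boost/iostreams/filter/zlib.hpp>
    #include <fstream>
    #include <string>

    namespace io = boost::iostreams;

    int main()
    {
        // Writing: chain a zlib_compressor in front of the file sink,
        // then write (or serialize) into the stream as usual.
        {
            std::ofstream sink("asset.bin.z", std::ios::binary);
            io::filtering_ostream out;
            out.push(io::zlib_compressor());
            out.push(sink);
            out << "hello, compressed world\n";
        }   // destroying `out` flushes and finalizes the zlib stream

        // Reading: mirror the chain with a zlib_decompressor.
        std::ifstream source("asset.bin.z", std::ios::binary);
        io::filtering_istream in;
        in.push(io::zlib_decompressor());
        in.push(source);
        std::string line;
        std::getline(in, line);  // yields "hello, compressed world"
    }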

Thanks in advance.

Tim
I am going with option 3, with option 2 slightly modified to allow compression: I just write the files sequentially into one big file, with header information for each one saying whether it is compressed or not. This is a very simple file format to create.

I test the compression ratio: if it is not good enough, I leave the file uncompressed. You can easily build this check into your packing tool.
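A minimal sketch of that check, assuming Boost.Iostreams for the compression (the function name is mine, not from any library):

    #include <boost/iostreams/filtering_stream.hpp>
    #include <boost/iostreams/filter/zlib.hpp>
    #include <boost/iostreams/device/back_inserter.hpp>
    #include <vector>

    namespace io = boost::iostreams;

    // Compress `raw` into `compressed`; report whether the compressed
    // form is actually smaller and therefore worth storing.
    bool compressIfWorthIt(const std::vector<char>& raw,
                           std::vector<char>& compressed)
    {
        compressed.clear();
        {
            io::filtering_ostream out;
            out.push(io::zlib_compressor());
            out.push(io::back_inserter(compressed));
            out.write(raw.data(), raw.size());
        }   // destructor flushes and finalizes the zlib stream
        return compressed.size() < raw.size();
    }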

I also suggest that your tool detect duplicate files, and either stop the process or store a single copy and create multiple entries in the package's file list when the same data shows up more than once.

I think it is really best to store the files in one big file, for performance. You only need one file handle from the operating system: one 'open' operation and one 'close' operation. Also consider that anti-virus programs may monitor file usage, so the more files you touch, the more work an anti-virus program may do, and that certainly won't help performance.

Touching the minimal number of file-system-level files is the best way to go.
reptor,

All good advice, and all things I hadn't thought of. I'll certainly consider one big file as my preferred choice now.

Thanks.

T
With the big-file approach, how do you locate individual assets that are requested at runtime? Let's say I have some code that wants the "tree01" model; does the big file have a table at the beginning with the offset to each asset?
Yes, it does. The big file has a "header" table, which is basically a list of all the files contained in the package. The actual file data comes sequentially after this header table.

Together with each file name, you can store other information, such as the offset from the beginning of the package to the start of the data block, and the size of the data block.

You can list two sizes for each data block: 1) the uncompressed size and 2) the compressed size. You can also have a flag indicating whether the data block is compressed (or use the compressedSize variable to figure that out).
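In code, each header entry might look something like this (the field names and the fixed-size name buffer are just one possible choice, not a fixed format):

    #include <cstdint>

    #pragma pack(push, 1)   // fixed on-disk layout; see the padding note below
    struct PackEntry
    {
        char     name[64];          // asset name, null-terminated
        uint64_t offset;            // from start of package to data block
        uint64_t uncompressedSize;
        uint64_t compressedSize;    // 0 = stored uncompressed
    };
    #pragma pack(pop)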

Then you can read the header table, jump to whichever file data block you want to read, and decompress it if necessary.


Some people may prefer to store more information, such as the last write time.

I suggest writing the number of files stored in the package before the header table; that way you can read the header table in one go, since you will know how many entries it has. Watch out for padding issues if you write and read structures directly.
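Putting that together, reading the table back and pulling one asset out might look like this sketch (it reuses the hypothetical PackEntry layout above; error handling omitted):

    #include <cstdint>
    #include <fstream>
    #include <vector>

    // Read the entry count, then the whole header table in one go.
    std::vector<PackEntry> loadHeader(std::ifstream& pak)
    {
        uint32_t count = 0;
        pak.read(reinterpret_cast<char*>(&count), sizeof(count));
        std::vector<PackEntry> entries(count);
        pak.read(reinterpret_cast<char*>(entries.data()),
                 count * sizeof(PackEntry));
        return entries;
    }

    // Jump straight to one asset's data block.
    std::vector<char> loadBlock(std::ifstream& pak, const PackEntry& e)
    {
        std::vector<char> block(e.compressedSize ? e.compressedSize
                                                 : e.uncompressedSize);
        pak.seekg(e.offset);
        pak.read(block.data(), block.size());
        return block;   // decompress if e.compressedSize != 0
    }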
I was just thinking that having to do a linear search through asset names to find something wouldn't be the most efficient thing in the world (although I have no idea how the OS locates a file, but I would guess it's a lot more sophisticated than that). I guess you could build a hash table at runtime to expedite the process.
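Something like this, say, built once after reading the header (std::unordered_map is just one option, and PackEntry is the hypothetical entry type sketched above):

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Build an in-memory index once; lookups are then O(1) on average.
    std::unordered_map<std::string, PackEntry> buildIndex(
        const std::vector<PackEntry>& entries)
    {
        std::unordered_map<std::string, PackEntry> index;
        for (const PackEntry& e : entries)
            index.emplace(e.name, e);
        return index;
    }

    // usage: const PackEntry& tree = index.at("tree01");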
The benefit of having to open only one file instead of many makes such name-searching concerns very insignificant. The overhead of opening a file at the file-system level is significant.
Making something like a Quake PAK file is a good idea.
As far as compression goes, there are two beneficial things you can do:
1) Compress each file in the pak individually. This way, you can store lots of files in a few paks and read them whenever you need.
2) Compress the WHOLE pak file. Each pak may contain some repeated resources, but you can bundle things up on a per-level basis: player assets could go in one pak, and global stuff like the HUD in another, just to reduce duplication.

Opening a single file, as mentioned above, is best for OS resources. But separating assets into per-level paks makes it even better, because you can load the whole pak into memory in one step, which reduces the need to seek around the disk. That may not be as important in the day of SSDs, but many games benefit from it because the data sits on a DVD in your console, and seeking a DVD is SLOW.

As far as finding your file in the pak goes, there are two good approaches. The first: sort the header entries by name and binary-search on the names. The second: make a hash of the names and search by name hash. The second one has the added benefit of taking up a lot less space (about 4 bytes per name instead of up to MAX_PATH bytes). Searching by hash can also be a lot faster, but finding file names in your pak shouldn't be a performance bottleneck anyway.
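For the hash approach, any decent string hash will do; FNV-1a is a common choice (my pick here, not something the format requires):

    #include <cstdint>
    #include <string>

    // FNV-1a: a simple, well-known 32-bit string hash.
    uint32_t hashName(const std::string& name)
    {
        uint32_t h = 2166136261u;
        for (unsigned char c : name)
        {
            h ^= c;          // xor in each byte...
            h *= 16777619u;  // ...then multiply by the FNV prime
        }
        return h;
    }

    // Store the header entries sorted by hash; lookup is then a
    // binary search (e.g. std::lower_bound) on hashName("tree01").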

Quote:
(although I have no idea how the OS locates a file, but I would guess it's a lot more sophisticated than that)

The OS will cache some data, but for the most part it has to seek around the disk a lot, picking up inode-type markers to find all the files in a directory. This can result in a lot of seeking. (Some file systems store them at the front of the disk, so the head seeks far away from your data; others intersperse them with the data, so the seeks are shorter.)

