nstrg

archive / file system in a file -- format design


Howdy everyone! I'm currently designing an archive file format (more precisely, a file system inside a file) and was hoping to get some feedback on what I have written out so far. The goal is to store files in a compressed and encrypted archive for streaming access in a real-time game project. Below is a very quick thought dump of the layout:

HEADER
-ID
-Version
-FileCount
-DirectoryCount
DIRECTORY LISTING (repeated for every directory)
-DirectoryName
-Index (used so you can reconstruct the full file path)
FILE LISTING (repeated for every file)
-FileName (path stripped off)
-DirectoryIndex
FILE DATA
DIGITAL SIGNATURE

The reason I split the path and file information into two separate areas is to allow empty directories to be stored inside the archive for future use, and to quickly tell whether a file exists in an archive when you don't know its full path. Unfortunately, this creates the issue of allowing only a single file with a given name in the entire archive. (I have considered storing a file hash when names collide; I'm not sure whether that's the solution I'll go with.)

Based on the information and thoughts above, does anyone have feedback or suggestions? Sorry for the short post; I'm still at work and my lunch is almost over. :)

Thanks!
Nate S.
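P.S. In rough struct form, the same layout might look like this. Just a sketch: the field widths are placeholders, the fixed name arrays stand in for what would be length-prefixed strings on disk, and I've added offset/size fields to the file entries since something like them is needed to locate the data:

#include <cstdint>

struct Header {
    char     id[4];           // magic identifier
    uint32_t version;
    uint32_t fileCount;
    uint32_t directoryCount;
};

struct DirectoryEntry {       // repeated directoryCount times
    char     name[64];
    uint32_t index;           // used to rebuild the full path
};

struct FileEntry {            // repeated fileCount times
    char     name[64];        // file name with the path stripped off
    uint32_t directoryIndex;  // which directory this file belongs to
    uint64_t offset;          // needed to locate the data within the archive
    uint64_t size;
};
// ...followed by the raw file data, then the digital signature.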

First off, take a look at existing formats like PAK or ZIP, and try to implement a reader for those. Tools to create and read those files already exist, so you can save yourself a lot of debugging headaches.

Look into something like PhysFS as a solution. It lets you use both your plain folder structure and an assortment of archives as sources, which allows you to quickly edit files in your folder structure while still using archives for speed.
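For instance, mounting a patch archive, the main archive, and a loose directory into one search path looks roughly like this with the PhysFS C API (2.1 or later; error handling omitted, and the archive/file names are made up):

#include <cstdlib>
#include <physfs.h>

int main(int argc, char** argv) {
    (void)argc;
    PHYSFS_init(argv[0]);
    // Earlier mounts are searched first, so patch files shadow the main
    // archive, which in turn shadows the loose development directory.
    PHYSFS_mount("patch_01.pak", NULL, 1);
    PHYSFS_mount("main.pak", NULL, 1);
    PHYSFS_mount("assets/", NULL, 1);

    PHYSFS_File* f = PHYSFS_openRead("cat/model/fur.tga");
    if (f) {
        PHYSFS_sint64 len = PHYSFS_fileLength(f);
        char* buf = (char*)std::malloc((size_t)len);
        PHYSFS_readBytes(f, buf, (PHYSFS_uint64)len);
        // ...use the data...
        std::free(buf);
        PHYSFS_close(f);
    }
    PHYSFS_deinit();
    return 0;
}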

Some random notes from my experience with this, though:

Storing a tree structure in the file will save a lot of space if you have text paths.
Storing a sorted list of files, where each file has its full path stored with it, is much easier to set up.
Storing only a hash of the file paths is more complicated, but means much faster lookups at runtime (hopefully you aren't searching for files all the time anyway!), and a lot less space wasted on string paths (a few hundred KB of strings might not mean much on a PC, but it certainly can on a console). A sketch of this hashed-path approach follows.
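Something like this, assuming FNV-1a (any decent string hash works) and a table sorted by hash at build time so lookup is a binary search:

#include <algorithm>
#include <cstdint>
#include <string_view>
#include <vector>

// FNV-1a: a simple, well-known string hash.
constexpr uint64_t fnv1a(std::string_view path) {
    uint64_t h = 14695981039346656037ull;
    for (unsigned char c : path) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

struct TableEntry {
    uint64_t pathHash;  // hash of the full path; the string itself is not stored
    uint64_t offset;    // where the file's data starts in the archive
    uint64_t size;
};

// Entries were sorted by pathHash at build time, so lookup is O(log n).
const TableEntry* find(const std::vector<TableEntry>& table, std::string_view path) {
    uint64_t h = fnv1a(path);
    auto it = std::lower_bound(table.begin(), table.end(), h,
        [](const TableEntry& e, uint64_t key) { return e.pathHash < key; });
    return (it != table.end() && it->pathHash == h) ? &*it : nullptr;
}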

Storing asset metadata in the archive can be advantageous. Consider marking nodes as "sound", "effect", "model", "editor_data", "level", "creature_data", etc. You can then pack all related nodes together in the same location in the archive for faster access. This keeps your data logically located on disk:
/imp
----/model/imp.ma
----/sound/growl.ogg
----/stats/editor_data.lua
/cat
----/model/cat.ma
----/sound/purr.ogg
----/design/editor_data.lua

but grouped by usage inside the archive:
#tag sound#
-HASH(/imp/sound/growl.ogg)
-HASH(/cat/sound/purr.ogg)
#tag editor_data#
-HASH(/imp/stats/editor_data.lua)
-HASH(/cat/design/editor_data.lua)
#tag model#
-HASH(/imp/model/imp.ma)
-HASH(/cat/model/cat.ma)

Find out what your target platform's disk block boundaries are. It can be advantageous to align the start of each block you might seek to with a disk block (i.e. a 4 KB boundary). Some platforms, such as the PS2, only allow a disk read to start/stop on such boundaries, so aligning data can significantly impact performance.
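At pack time the round-up is one line, assuming a power-of-two block size:

#include <cstdint>

// Round 'offset' up to the next multiple of 'alignment' (a power of two).
constexpr uint64_t alignUp(uint64_t offset, uint64_t alignment = 4096) {
    return (offset + alignment - 1) & ~(alignment - 1);
}
// e.g. alignUp(5000) == 8192, alignUp(4096) == 4096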

Consider storing processed data instead of raw data. For instance, store a built .dds texture instead of the source .tga file. If you can "load in place", it is much faster than parsing a text file or other format to generate loadable data on the fly. This also helps reduce file size, since binary formats take much less space than their text sources.
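A minimal load-in-place sketch; it assumes the blob was built with the same endianness and struct packing as the runtime, and the MeshHeader layout and file name are made up:

#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

// Layout written by the offline build tool; must match the runtime
// struct exactly (same endianness, packing, and field order).
struct MeshHeader {
    uint32_t vertexCount;
    uint32_t indexCount;
    // vertex data follows immediately after the header in the blob
};

std::vector<char> readAll(const char* path) {
    std::ifstream f(path, std::ios::binary);
    return { std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>() };
}

int main() {
    // One read, no parsing: interpret the bytes in place.
    std::vector<char> blob = readAll("imp.mesh");
    auto* mesh  = reinterpret_cast<const MeshHeader*>(blob.data());
    auto* verts = reinterpret_cast<const float*>(blob.data() + sizeof(MeshHeader));
    (void)mesh; (void)verts;  // ...hand straight to the renderer...
}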

Quote:

for streaming access

Your archiver is really going to need a good way to specify exactly where everything goes. On top of asset tagging, you might want "chunk" tagging, so that you can block off segments to say "all of these files belong to sector (x, y)" or "this is the 'global' data chunk". You will want all the data in each chunk stored contiguously in the archive, even if that means duplicating file data, so that you don't have to seek the disk as much. The archive would then probably store a top-level directory of chunks, while each chunk would start with a header listing the files in that chunk.
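In struct form, the chunk directory might look something like this (names are illustrative):

#include <cstdint>

// Top-level directory: one entry per chunk, e.g. "sector(3,7)" or "global".
struct ChunkDirEntry {
    uint64_t chunkTagHash;  // hash of the chunk's tag
    uint64_t offset;        // where the chunk begins in the archive
    uint64_t size;          // total size, so one sequential read loads it all
};

// Each chunk begins with its own header and file listing, so a whole chunk
// streams in after a single seek, even if files are duplicated across chunks.
struct ChunkHeader {
    uint32_t fileCount;
    // followed by fileCount entries of { pathHash, offsetWithinChunk, size }
};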

Quote:

compressed and encrypted

1) Leave these for last. Get the archive packing files first. Compression and encryption make your file look like garbage, which makes it impossible to see other problems.
Being able to open the file in a file carver, hex editor, or notepad and check that all your data is in there the way you expected is very useful.
2) Leave these as options. Diffing and patching your archive can be useful, and that doesn't work so well on a compressed archive.
3) If you are going to the trouble of compression, try to implement the decompression and loading in parallel using asynchronous file reads, as it will drastically speed up load times. A double-buffered sketch of that idea follows.
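Something like this, using std::async as a stand-in for a platform's native asynchronous file API; decompress() is a hypothetical placeholder for whatever codec you choose, and a real streamer would take per-block sizes from the archive's table of contents:

#include <cstddef>
#include <fstream>
#include <future>
#include <utility>
#include <vector>

// Hypothetical placeholder for your codec (zlib, LZ4, ...).
void decompress(const char* data, std::size_t size, std::vector<char>& out);

// Double-buffered load: read block i+1 from disk while block i is being
// decompressed, so the CPU and the drive work in parallel.
void loadCompressedBlocks(std::ifstream& file, std::size_t blockSize,
                          std::size_t blockCount, std::vector<char>& out)
{
    std::vector<char> front(blockSize), back(blockSize);
    auto readBlock = [&file](std::vector<char>& buf) -> std::size_t {
        file.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        return static_cast<std::size_t>(file.gcount());
    };

    std::size_t bytes = readBlock(front);  // prime the pipeline
    for (std::size_t i = 0; i < blockCount; ++i) {
        std::future<std::size_t> next;
        if (i + 1 < blockCount)
            next = std::async(std::launch::async, readBlock, std::ref(back));
        decompress(front.data(), bytes, out);  // overlaps the disk read
        if (next.valid())
            bytes = next.get();
        std::swap(front, back);
    }
}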



Also, many of my "speed"-based comments might soon go out of fashion with the advent of consumer SSDs.

I've never really seen the point of having a directory layout inside a resource file, other than to slow things down at runtime. The way I see it, arrange your source data in whatever directory structure you like, but compile it down into a flat layout for the big resource file. This should make loading much faster for the game.

Packing files into an archive doesn't necessarily slow down loading. On Darkest of Days I was in charge of adding this functionality, and PC load times went from ~30 seconds to less than 10 seconds, with the data compressed. The reason is that the HDD does not need to seek for the next file; it just keeps reading. Seeking is a major portion of the slowdown, so doing one seek and then reading chunk after chunk worked miracles.

Now, this was implemented in a specific way to achieve that goal. Simply storing files in an archive with random access to the contents is still useful for compression, encryption, and keeping things cleaner/hidden from users. Beyond that, you are correct: the random-access approach would slow things down, unless the entire archive is loaded into RAM and then accessed, which in some sense defeats the point.

My main point is that it's not really worth the effort it takes to implement a directory structure. You can implement various optimizations to make seeking faster, but I don't see the point in implementing a full directory structure/file system.

Having a "real" directory structure can make things easier during development time, because it's easier/faster to update things incrementally rather than rebuilding the whole archive for every minor change. It can also be convenient in the field if you have to do updating/patching.

Your point is still valid though.

Quote:

Having a "real" directory structure can make things easier during development time

Again mentioning tools like PhysFS: having a real directory structure is nice because you can mirror assets in your game. The engine will load "cat/model/fur.tga" from patch_20.pak first if it exists there, then from main.pak, and finally from developmentdrive:/devdir/ if it isn't in any pak file.

I forget the name of it, but there was a rather popular chunk-based container format back in the day (I'm thinking it originated on the Amiga). It had gained popularity for a specific use, but the setup was such that it was essentially a file meant to contain other files in a flat format. If you can find information on it, it would be good to look at.

Another option I've been meaning to look at lately is to separate storage from format by using a .VHD (Virtual Hard Disk) as the base, and then implementing the format (essentially a file system) on top of that. But maybe that's overkill :)

As an alternative to hashing, you could consider truncating directory and file names to their shortest unique prefixes (which is basically a variable-length perfect hash, when you think about it). The only real downsides are that it won't be as fast as an actual hash, and that if the archive is later patched with additional files or directories you'll need a way to resolve conflicts... A patch file might first contain commands like "rename directory 's' to 'so'" (sound) when a new directory 'sc' is introduced for scripts. A sketch of the truncation follows.
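Computing those truncations over a sorted name list is simple, assuming no name is a full prefix of another (prefixes can't disambiguate that case); the helper names here are made up:

#include <algorithm>
#include <string>
#include <vector>

static std::size_t commonPrefixLen(const std::string& a, const std::string& b) {
    std::size_t n = std::min(a.size(), b.size()), i = 0;
    while (i < n && a[i] == b[i]) ++i;
    return i;
}

// For each name, keep just enough characters to distinguish it from its
// lexicographic neighbours (in a sorted list, those are the worst cases).
std::vector<std::string> shortestUniquePrefixes(std::vector<std::string> names) {
    std::sort(names.begin(), names.end());
    std::vector<std::string> out(names.size());
    for (std::size_t i = 0; i < names.size(); ++i) {
        std::size_t keep = 1;
        if (i > 0)
            keep = std::max(keep, commonPrefixLen(names[i], names[i - 1]) + 1);
        if (i + 1 < names.size())
            keep = std::max(keep, commonPrefixLen(names[i], names[i + 1]) + 1);
        out[i] = names[i].substr(0, std::min(keep, names[i].size()));
    }
    return out;
}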

I'd definitely recommend these features:
metadata
versioning
physical alignment
chunking (grouping related files together physically, separate from their logical location)

In the interest of generality, I'd do away with the idea that FileCount and DirectoryCount are part of the header, and instead simply assume that every archive contains at least one directory descriptor (the root directory, if you will) and make the counts part of its structure. Directories then become lists of directory structures and lists of file names with their offsets within the archive. This makes the system truly hierarchical.
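In in-memory form, a sketch of that idea, with the counts implied per directory rather than stored in the header:

#include <cstdint>
#include <string>
#include <vector>

struct FileRecord {
    std::string name;    // length-prefixed on disk
    uint64_t    offset;  // where this file's data starts in the archive
    uint64_t    size;
};

// The header points at a single root Directory; because every directory
// carries its own counts and children, the format recurses naturally.
struct Directory {
    std::string             name;
    std::vector<Directory>  subdirectories;
    std::vector<FileRecord> files;
};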

