Cornstalks

Are Pack Files (PAK, ZIP, WAD, etc) Worth It?



Here's a question I've been mulling over. Are pack files worth using in a game? I've looked into using PhysicsFS, but I'm not liking its global state so much. I don't need any compression in my pack files, as just about everything will already be compressed (PNG for images, Vorbis for audio, VP8 for video, protobuf binary objects for units/objects/maps, etc.), and I'm more concerned about read/write times. If I used a pack file, I'd just need a format with random access (like ZIP). Here are the pros and cons of having an uncompressed pack file, as I see them:

Pros

  • (Slightly) harder for users to muck with
  • It's only one file (it's kinda nice having things grouped in one file)

Cons

  • (Slightly) harder for me to work with
  • Increased save times when modifying the file (the game won't modify it, but my editor will, so it's a con for me, though users won't experience it)

???

  • Faster read times? (I've heard pack files can help avoid thrashing the hard drive so much, but is that really much of a concern on modern operating systems, and does it really help a significant amount?)

Does anyone have much experience with the pros/cons of using pack files? Are there any significant pros to using pack files, and are there any significant cons to just using the normal file system?
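
For concreteness, here's roughly the kind of uncompressed pack I'm imagining: an index up front (name, offset, size per asset) followed by the raw file bytes. Just a sketch with made-up names and no error handling:

#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Write an uncompressed pack: entry count, then an index of
// (name length, name, offset, size) records, then the raw data.
void writePack(const std::string& packPath,
               const std::vector<std::pair<std::string, std::vector<char>>>& assets)
{
    // Pre-pass: compute the index size so data offsets are known up front.
    uint64_t indexSize = sizeof(uint32_t);
    for (const auto& a : assets)
        indexSize += sizeof(uint32_t) + a.first.size() + 2 * sizeof(uint64_t);

    std::ofstream out(packPath, std::ios::binary);
    uint32_t count = static_cast<uint32_t>(assets.size());
    out.write(reinterpret_cast<const char*>(&count), sizeof(count));

    uint64_t offset = indexSize;
    for (const auto& a : assets) {
        uint32_t nameLen = static_cast<uint32_t>(a.first.size());
        uint64_t size = a.second.size();
        out.write(reinterpret_cast<const char*>(&nameLen), sizeof(nameLen));
        out.write(a.first.data(), nameLen);
        out.write(reinterpret_cast<const char*>(&offset), sizeof(offset));
        out.write(reinterpret_cast<const char*>(&size), sizeof(size));
        offset += size;
    }
    for (const auto& a : assets)
        out.write(a.second.data(), static_cast<std::streamsize>(a.second.size()));
}

The save-time con above comes from exactly this: replacing one asset means rewriting everything after it.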

Wow, thanks a ton for the great insights, Hodgman! I'm definitely looking at implementing a data compiler like that. The auto-asset-refresh sounds *really* nice. That, and it lets the artists keep their normal workflow when updating assets. And I think I'll do what you do: use pack files in release builds and the filesystem in development builds. Abstracting the data storage behind a swappable loading class would be nice.
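
Something like this is what I have in mind for the swappable part (just a sketch; the class names and the SHIPPING_BUILD flag are made up):

#include <fstream>
#include <iterator>
#include <memory>
#include <string>
#include <vector>

// The game only ever sees IDataSource; a build flag picks the implementation.
class IDataSource {
public:
    virtual ~IDataSource() = default;
    virtual std::vector<char> load(const std::string& name) = 0;
};

// Development builds: read straight from the loose files on disk.
class LooseFileSource : public IDataSource {
public:
    std::vector<char> load(const std::string& name) override {
        std::ifstream in("data/" + name, std::ios::binary);
        return {std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>()};
    }
};

// Release builds: look the name up in the pack index (stubbed here).
class PackFileSource : public IDataSource {
public:
    std::vector<char> load(const std::string& name) override {
        (void)name;
        return {};  // real version: seek to the entry's offset, read its bytes
    }
};

std::unique_ptr<IDataSource> makeDataSource() {
#ifdef SHIPPING_BUILD  // hypothetical build flag
    return std::make_unique<PackFileSource>();
#else
    return std::make_unique<LooseFileSource>();
#endif
}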

Properly packed, you can reduce load times. That is by far the most compelling reason. Ideally packed, you have a small pointer table up front followed by all the data, which gets memory-mapped and copied into place as fast as the OS streams it in. However, do it wrong and it will be SLOWER than a traditional load. Profile and proceed with careful measurements.
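
A minimal sketch of that kind of layout (the PackHeader/PackEntry names and the fixed-size name field are just illustrative; error handling and endianness are ignored):

#include <cstdint>
#include <cstring>

// Illustrative layout: header, then the pointer table, then raw data.
struct PackHeader {
    char     magic[4];    // e.g. "PAK0"
    uint32_t entryCount;  // number of PackEntry records that follow
};

struct PackEntry {
    char     name[56];    // fixed-size, NUL-terminated asset name
    uint64_t offset;      // byte offset of the asset's data from file start
    uint64_t size;        // size of the asset's data in bytes
};

// Given a pointer to the memory-mapped pack, find an asset by name.
// The table sits right behind the header, so a lookup never seeks.
const void* findAsset(const uint8_t* mapped, const char* name, uint64_t* sizeOut)
{
    const PackHeader* header  = reinterpret_cast<const PackHeader*>(mapped);
    const PackEntry*  entries = reinterpret_cast<const PackEntry*>(mapped + sizeof(PackHeader));
    for (uint32_t i = 0; i < header->entryCount; ++i) {
        if (std::strcmp(entries[i].name, name) == 0) {
            *sizeOut = entries[i].size;
            return mapped + entries[i].offset;
        }
    }
    return nullptr;
}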

Making it harder for end users to reverse engineer is perhaps the weakest reason. If that is your motivation, then stop.

Properly packed, you can have independent resource bundles that can be worked on and replaced as individual components. A great example of this is The Sims, where you can download tiny packs of clothes, people, home lots, and more. People generate custom content all the time and upload their hair models, body models, clothing models, the associated textures, and whatnot, all in their own little bundle.

Many comprehensive systems use a dual-load approach, first checking the packaged resources and then checking the file system for updated resources. That enables you to make changes without rebuilding all the packages. Even better systems watch the file system and automatically update when changes are detected. This is extremely useful when there are external tools, such as string editors, tuning editors, and various resource editors, so you can see your changes immediately in game.
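
A rough sketch of the dual-load idea (loadFromPack is a stand-in for whatever pack lookup you have, and the "data" directory is a made-up convention):

#include <filesystem>
#include <fstream>
#include <iterator>
#include <optional>
#include <string>
#include <vector>

// Stand-in for the real pack lookup; assume it consults the pack index.
std::optional<std::vector<char>> loadFromPack(const std::string& name)
{
    return std::nullopt;  // stub
}

// Check the packaged copy first, then prefer an updated loose file on disk.
std::optional<std::vector<char>> loadResource(const std::string& name)
{
    std::optional<std::vector<char>> packed = loadFromPack(name);

    std::filesystem::path loose = std::filesystem::path("data") / name;
    if (std::filesystem::exists(loose)) {
        std::ifstream in(loose, std::ios::binary);
        return std::vector<char>{std::istreambuf_iterator<char>(in),
                                 std::istreambuf_iterator<char>()};
    }
    return packed;
}

A file watcher can then invalidate already-loaded resources when a loose file changes, which is where the "see your changes immediately" part comes from.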

I'm very interested in this.
I initially started by referring to resources by name (possibly the same thing as "asset names"); however, I had a few collisions here and there, and I later switched to using file names directly. I didn't like it then and I don't like it now. I want to go back to asset names in the future, but I'm still unsure how to deal with naming collisions and, in general, how to provide a fine degree of flexibility (a sketch of the kind of collision I mean is below).
Perhaps it would just be better to adopt better naming conventions?
Any suggestions on rules for resource->file mappings?
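
For concreteness, here's roughly what the problem looks like if asset names are hashed to IDs at build time (FNV-1a and the names here are purely illustrative); at least the build step can fail loudly instead of silently loading the wrong resource:

#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// FNV-1a, just as an illustrative hash; any stable hash works.
uint64_t hashName(const std::string& name)
{
    uint64_t h = 1469598103934665603ull;
    for (unsigned char c : name) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// Build-time check: map every asset name to its hashed ID and fail
// loudly on a collision instead of shipping an ambiguous pack.
void registerAsset(std::unordered_map<uint64_t, std::string>& table,
                   const std::string& name)
{
    const uint64_t id = hashName(name);
    auto it = table.find(id);
    if (it != table.end() && it->second != name)
        throw std::runtime_error("asset ID collision: '" + name +
                                 "' vs '" + it->second + "'");
    table[id] = name;
}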


Making it harder for end users to reverse engineer is perhaps the weakest reason. If that is your motivation, then stop.

It's not. My primary goal is load times (though I wanted to confirm that was still an issue, as the last time I considered this topic was years and years ago).

The bundles idea is a cool concept I hadn't thought of. While I don't plan on my current game being very moddable, it's definitely something I'd like to do if I make a more moddable game.

Keep the good input flowing! This has all helped me a lot.

@Madhed:
Regarding the generally very interesting paper by Jan Wassenberg, one should note that it contains a lot of very useful information for some cases, and a lot of sound general considerations. If one develops for a console or considers streaming data from CD, the paper hits the spot 100%. Some of the techniques described (e.g. duplicating blocks) are a big win when you read from a medium where seeking is the end of the world (such as a DVD), or when you can't afford clobbering some RAM.
On the other hand, if one targets a typical Windows desktop PC with "normal" present-day hardware, almost all of the claims and assumptions are debatable or wrong (that was already the case in 2006, when the paper was written).

What is indisputably right is that it's generally a good idea to have one (or a few) big files rather than a thousand small ones.
Other than that, one needs to be very careful about which assumptions hold on the platform one develops for.

On a typical desktop machine, which typically has half a gigabyte or a gigabyte of unused memory (often 2-4 GiB nowadays, or more), you absolutely do not want to bypass the file cache. If speed (and latency, and worst-case behaviour) is of any concern, you also absolutely do not want to use overlapped IO.

Overlapped IO rivals memory mapping in raw disk throughput if the file cache is disabled and no pages are in the cache. This is cool if you want to stream in data that you've never seen and that you don't expect to use again. It totally sucks otherwise, because the data is gone forever once you stop using it. With memory mapping, you pull the pages from the cache the next time you use the data. Even with some seeks in between (if only part of a large file is in the cache), pulling the data from the cache is no slower and usually faster (much to my surprise; this is counterintuitive, but I've spent considerable time benchmarking it).

Ironically, overlapped IO runs at about 50% of the speed of synchronous IO if it is allowed to use the cache (which, unlike under e.g. Linux, is actually possible under Windows). Pulling data from the cache into the working set synchronously peaks at around 2 GiB/s on my system (surprisingly slow for "doing nothing", a memcpy at worst, but it beats anything else by an order of magnitude).

Asynchronous IO will revert to synchronous operation silently, undetectably, unreliably, and differently between operating systems, versions, and user configurations. Also, if anything "unexpected" happens, queueing an overlapped request can suddenly block for 20 or 40 milliseconds or more (so much for threadless IO; your render thread stalls during that time). This is not unique to Windows; Linux has the exact same problem. If the command queue is full or some other obscure limit (one you don't know about and cannot query!) is hit, your io_submit blocks. Surprise, you're dead.

What you ideally want is to memory-map the entire data file and prefault as much of it as you can, linearly, at application start (from a worker thread); see the sketch below.

If you, like me, own a "normal, inexpensive" 3-4 year old hard disk, you can observe that this will suck a 200 MiB data file into RAM in about 2 seconds, with few or no seeks at all. If you, like me, also have an SSD, you can verify that the same thing happens in well under a second. Either way, it's fast and straightforward. If your users, like pretty much everyone, have half a gigabyte of unused memory, the actual read later will take "zero time", without ever touching the disk.
This is admittedly the best case, not the worst case. But the good news is that the worst case is no worse than it would be otherwise. The best (and average) case, on the other hand, is much better.
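
A minimal POSIX sketch of that approach (on Windows you'd use CreateFileMapping/MapViewOfFile instead; the 4 KiB page size is an assumption, and you could additionally hint with madvise(MADV_WILLNEED)):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <thread>

struct MappedPack {
    const unsigned char* data = nullptr;
    std::size_t size = 0;
};

// Map the whole pack read-only, then touch every page linearly from a
// worker thread so later reads come out of the file cache, not the disk.
MappedPack mapAndPrefault(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {};
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return {}; }

    void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                   PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping keeps the file open
    if (p == MAP_FAILED) return {};

    MappedPack pack{static_cast<const unsigned char*>(p),
                    static_cast<std::size_t>(st.st_size)};

    std::thread([pack] {
        volatile unsigned char sink = 0;
        const std::size_t page = 4096;  // assumption: 4 KiB pages
        for (std::size_t off = 0; off < pack.size; off += page)
            sink = pack.data[off];  // fault the page in
        (void)sink;
    }).detach();

    return pack;
}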

@samoth
Fair point. I just wanted to point out the paper since it was the first thing that sprang to mind when reading the thread title. I haven't actually implemented or verified its results, but I found the paper interesting enough to share.

Cheers

I use PhysFS myself and I think it works great. It allows you to mount an actual folder instead of an archive.

This means that during development you can still use PhysFS while working with the resources on disk directly, and then create an archive and switch over by mounting the .pak file or whatever (see the sketch at the end of this post).

PhysFS has really nice file IO functions too.

I also find it pretty easy to write a little batch or shell script that creates the archive from a folder in one click, for when you add something and want to see the changed results.
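
Here's roughly what that looks like (a sketch: "data.pak" and "data/" are placeholder paths, and PHYSFS_readBytes assumes PhysFS 2.1; on 2.0 you'd use PHYSFS_read):

#include <physfs.h>
#include <cstddef>
#include <vector>

// Mount either the loose data folder (development) or the packed
// archive (release); everything below loads through the same path.
bool initFileSystem(const char* argv0, bool useArchive)
{
    if (!PHYSFS_init(argv0)) return false;
    const char* source = useArchive ? "data.pak" : "data/";
    return PHYSFS_mount(source, "/", 1) != 0;  // append to the search path
}

std::vector<char> loadFile(const char* name)
{
    std::vector<char> bytes;
    PHYSFS_File* file = PHYSFS_openRead(name);
    if (!file) return bytes;
    bytes.resize(static_cast<std::size_t>(PHYSFS_fileLength(file)));
    PHYSFS_readBytes(file, bytes.data(), bytes.size());
    PHYSFS_close(file);
    return bytes;
}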
