Cornstalks

Are Pack Files (PAK, ZIP, WAD, etc) Worth It?


Here's a question I've been mulling over: are pack files worth using in a game? I've looked into PhysicsFS, but I don't much like its global state. I don't need any compression in my pack files, as just about everything will already be compressed (PNG for images, Vorbis for audio, VP8 for video, Protobuf binary objects for units/objects/maps, etc.), and I'm more concerned about read/write times. If I used a pack file, I'd just need a format with random access, like zip (see the sketch after the lists below). Here are the pros and cons as I see them of an uncompressed pack file:

[b]Pros[/b][list]
[*](Slightly) harder for users to muck with
[*]It's only one file (it's kinda nice having things grouped in one file)
[/list]

[b]Cons[/b][list]
[*](Slightly) harder for me to work with
[*]Increased save times when modifying the file (the game won't modify it, but my editor will, so this is a con for me, though users won't experience it)
[/list]

[b]???[/b][list]
[*]Faster read times? (I've heard it can help not thrash the hard drive so much, but is this really much of a concern today on modern operating systems, and does it really help a significant amount?)
[/list]
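
For concreteness, here's roughly the layout I'm picturing (just a sketch with made-up names): an index table up front and raw blobs after it, so any entry can be read with a single seek.
[code]#include <cstdint>

// Hypothetical on-disk layout for an uncompressed, random-access pack:
// [PackHeader][PackEntry * entryCount][raw blob data ...]
#pragma pack(push, 1)
struct PackHeader {
    char     magic[4];    // e.g. "PAK1"
    uint32_t entryCount;  // number of entries in the index
};
struct PackEntry {
    char     name[56];    // asset name (or store a hash instead)
    uint64_t offset;      // absolute byte offset of the blob
    uint64_t size;        // blob size in bytes
};
#pragma pack(pop)

// Reading an asset: load the index once, look up the name, then
// fseek(file, entry.offset) + fread(entry.size) -- one seek per asset.[/code]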

Does anyone have much experience with the pros/cons of using pack files? Are there any significant pros to using pack files, and are there any significant cons to just using the normal file system?

Wow, thanks a ton for the great insights, Hodgman! I'm definitely looking at implementing a data compiler like that. The auto-asset-refresh sounds *really* nice, and it lets the artists keep their normal workflow when updating assets. I think I'll also do what you do: use pack files in release builds and the filesystem in development builds. Abstracting the data storage behind a swappable loading class would be nice.

After a long search, I found the paper I read long ago:

[url="http://wassenberg.dreamhosters.com/articles/study_thesis.pdf"]http://wassenberg.dreamhosters.com/articles/study_thesis.pdf[/url]

It's by Jan Wassenberg of Wildfire Games (0 A.D.). He describes how packing assets resulted in a massive reduction in load times.

Properly packed, you can reduce load times. That is by far the most compelling reason. Ideally packed, you have a small pointer table up front followed by all the data, which gets memory-mapped and copied into place as fast as the OS streams it in. Do it wrong, however, and it will be SLOWER than a traditional load. Profile and proceed with careful measurements.

Making it harder for end users to reverse engineer is perhaps the most invalid reason. If that is your motivation then stop.

Properly packed, you can have independent resource bundles that can be worked on and replaced as individual components. A great example is The Sims, where you can download tiny packs of clothes, people, home lots, and more. People generate custom content all the time and upload their hair models, body models, clothing models, the associated textures, and so on, each in its own little bundle.

Many comprehensive systems use a dual-load scheme: first checking the packaged resources, then checking the file system for updated resources. That lets you make changes without rebuilding all the packages. Even better systems watch the file system and automatically update when changes are detected. This is extremely useful when there are external tools, such as string editors, tuning editors, and various resource editors, so you can see your changes immediately in game.
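
A minimal sketch of that dual-load lookup, written so an updated loose file wins over the packaged copy (the Pack class here is hypothetical):
[code]#include <filesystem>
#include <fstream>
#include <iterator>
#include <optional>
#include <string>
#include <vector>

// Hypothetical pack reader: Read() returns the blob if the entry exists.
struct Pack { std::optional<std::vector<char>> Read(const std::string& name); };

// Check the loose file system first (so edited assets win), then fall
// back to the packaged resources.
std::optional<std::vector<char>> LoadResource(Pack& pack, const std::string& name)
{
    std::filesystem::path loose = std::filesystem::path("content") / name;
    if (std::filesystem::exists(loose)) {
        std::ifstream f(loose, std::ios::binary);
        return std::vector<char>(std::istreambuf_iterator<char>(f), {});
    }
    return pack.Read(name);
}[/code]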

I'm very interested in this.
I initially started by referring to resources by name (possibly the same thing as "asset names"), but I had a few collisions here and there, so I later switched to using file names directly. I didn't like that then and I don't like it now. I want to go back to asset names eventually, but I'm still unsure how to deal with naming collisions while keeping a fine degree of flexibility.
Perhaps it would be better just to adopt stricter naming conventions?
Any suggestions on rules for resource->file mappings?

[quote name='frob' timestamp='1333310062' post='4927254']
Making it harder for end users to reverse engineer is perhaps the most invalid reason. If that is your motivation then stop.
[/quote]
It's not. My primary goal is load times (though I wanted to confirm that this is still an issue, as the last time I looked into this topic was years and years ago).

The bundles idea is a cool concept I hadn't thought of. While I don't plan on my current game being very moddable, it's definitely something I'd like to do if I make a more moddable game.

Keep the good input flowing! This has all helped me a lot.

@Madhed:
Regarding Jan Wassenberg's generally very interesting paper: note that it contains a lot of very useful information for some cases, and a lot of careful consideration in general. If you develop for a console or consider streaming data from a CD, the paper hits the spot 100%. Some of the techniques described (e.g. duplicating blocks) are a big win when you read from a medium where seeking is the end of the world (such as a DVD), or when you can't afford to clobber some RAM.
On the other hand, if you target a typical Windows desktop PC with "normal" present-day hardware, almost all of the claims and assumptions are debatable or wrong (and that was already the case in 2006, when the paper was written).

What is indisputably right is that it's generally a good idea to have one (or a few) big files rather than a thousand small ones.
Other than that, you need to be very careful about which assumptions hold for the platform you're developing for.

On a typical desktop machine, which typically has half a gigabyte to a gigabyte of unused memory (often more like 2-4 GiB nowadays), you absolutely do not want to bypass the file cache. If speed (and latency, and worst-case behaviour) is of any concern, you also absolutely do not want to use overlapped IO.

Overlapped IO rivals memory mapping in raw disk throughput if the file cache is disabled and no pages are in cache. This is great if you want to stream in data that you've never seen and don't expect to use again. Otherwise it totally sucks, because the data is gone forever once you stop using it. With memory mapping, you pull the pages from the cache the next time you use the data. Even with some seeks in between (if only part of a large file is in the cache), pulling the data from the cache is no slower and usually faster (much to my surprise; this is counterintuitive, but I've spent considerable time benchmarking it).

Ironically, overlapped IO runs at about 50% of the speed of synchronous IO if it is allowed to use the cache (which, unlike under e.g. Linux, is actually possible under Windows). Pulling data from the cache into the working set synchronously peaks at around 2 GiB/s on my system (surprisingly slow for "doing nothing", a memcpy at worst, but it beats everything else by an order of magnitude).

Asynchronous IO will silently, undetectably, and unreliably revert to synchronous operation, differently between operating systems and versions, and depending on user configuration. Also, if anything "unexpected" happens, queueing an overlapped request can suddenly block for 20 or 40 milliseconds or more (so much for threadless IO; your render thread stalls during that time). This is not unique to Windows; Linux has the exact same problem. If the command queue is full or some other obscure limit (one you don't know about and cannot query!) is hit, your io_submit blocks. Surprise, you're dead.

What you ideally want is to memory map the entire data file and prefault as much of it as you can linearly at application start (from a worker thread).

If you, like me, own a "normal, inexpensive" 3-4 year old hard disk, you can observe that this will suck a 200 MiB data file into RAM in about 2 seconds, with few or no seeks at all. If you, like me, also have an SSD, you can verify that the same thing happens in well under a second. Either way, it's fast and straightforward. If your users, like pretty much everyone, have half a gigabyte of unused memory, the actual read later will take "zero time" without ever touching the disk.
This is admittedly the best case, not the worst case. But the good news is that the worst case is no worse than otherwise, while the best (and average) case is much better.
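
In Win32 terms, that boils down to something like this (a sketch; error handling omitted, and the prefault loop belongs on a worker thread):
[code]#include <windows.h>

// Map the whole data file, then touch one byte per page so the OS
// streams it in linearly; later reads are then served from the cache.
const char* g_base = nullptr;
SIZE_T      g_size = 0;

void MapAndPrefault(const wchar_t* path)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    g_base = static_cast<const char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    g_size = static_cast<SIZE_T>(size.QuadPart);

    // Prefault: run this loop on a worker thread at application start.
    volatile char sink = 0;
    for (SIZE_T i = 0; i < g_size; i += 4096)
        sink = g_base[i];
}[/code]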

@samoth
Fair point. I just wanted to point out the paper, since it was the first thing that sprang to mind when I read the thread title. I haven't actually implemented or verified the results, but I found the paper interesting enough to share.

Cheers

I use PhysFS myself and I think it works great. It allows you to not use an archive at all and instead mount an actual folder.

This means that during development you can still use PhysFS while working with the resources on disk directly, and then create an archive and switch to it by mounting the .pak file or whatever.

PhysFS has really nice file IO functions too.
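
For example, the dev/release switch is just a different mount call. A minimal sketch (assuming PhysFS 2.1+ for PHYSFS_readBytes; error checks omitted):
[code]#include <physfs.h>

// Dev builds mount the loose content folder; release builds mount the
// archive. Everything afterwards reads through the same PhysFS API.
void InitVfs(const char* argv0)
{
    PHYSFS_init(argv0);
#ifdef NDEBUG
    PHYSFS_mount("data.pak", /*mountPoint=*/NULL, /*appendToPath=*/1);
#else
    PHYSFS_mount("content/", NULL, 1);
#endif
}

// Reading is identical regardless of where the file actually lives:
//   PHYSFS_File* f = PHYSFS_openRead("textures/foo.png");
//   PHYSFS_sint64 len = PHYSFS_fileLength(f);
//   PHYSFS_readBytes(f, buffer, (PHYSFS_uint64)len);
//   PHYSFS_close(f);[/code]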

I find it's also pretty easy to write a little batch script or shell script that creates the archive from a folder in one click if you add something and want to see changed results.

[quote name='Krohm' timestamp='1333371606' post='4927466']I initially started by referring to resources by name (possibly the same thing as "asset names"), but I had a few collisions here and there, so I later switched to using file names directly. I didn't like that then and I don't like it now. I want to go back to asset names eventually, but I'm still unsure how to deal with naming collisions while keeping a fine degree of flexibility.
Perhaps it would be better just to adopt stricter naming conventions?
Any suggestions on rules for resource->file mappings?[/quote]
My build tool scans the entire content directory and builds a map of filenames to paths. If the same filename appears at multiple paths ([i]e.g. [font=courier new,courier,monospace]content/lvl1/foo.png[/font], [font=courier new,courier,monospace]content/lvl2/foo.png[/font][/i]), then a boolean is set in the map, indicating that this name->path mapping is conflicted.

When evaluating build rules, this table is used to locate input files on disk. If an entry with its conflict flag set is used, the tool spits out an error ([i]describing the two paths[/i]) and refuses to build your data. This is similar to bad code spitting out assertion failures and refusing to run.
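
Sketched out, that map is nothing fancy (hypothetical names):
[code]#include <string>
#include <unordered_map>

// filename -> (full path, conflicted?), built by scanning the content dir.
struct ContentEntry { std::string path; bool conflicted = false; };
std::unordered_map<std::string, ContentEntry> g_contentMap;

void RegisterFile(const std::string& filename, const std::string& fullPath)
{
    auto [it, inserted] = g_contentMap.try_emplace(filename, ContentEntry{fullPath});
    if (!inserted)
        it->second.conflicted = true;  // same filename found at two paths
}[/code]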

Because my particular data compiler is designed to always be running, it listens for changes to the content directory; if you create a duplicate file, I can pop up one of those annoying system-tray bubbles letting you know you've just created a potential conflict, before you even try to build your data.


Regarding naming conventions, I can somewhat enforce these by specifying them in my build rules.
For example, if I wanted to disallow "[font=courier new,courier,monospace]foo.texture[/font]" and enforce "[font=courier new,courier,monospace]foo_type.texture[/font]", where "[font=courier new,courier,monospace]type[/font]" is some kind of abbreviation, I can set up only rules that contain "[font=courier new,courier,monospace]_type[/font]". Let's say one of my "types" is "colour+alpha", and that the artists want to author colour and alpha separately.[code]Rule(TexCombiner, "temp/(.*)_ca.tga", {"$1_c.tga", "$1_a.tga"})
Rule(TexCompiler, "data/(.*)_ca.texture", "temp/$1_ca.tga" )[/code]
Then, if someone sets up a material to link to "foo_ca.texture", the data compiler follows this sequence:
[font=courier new,courier,monospace]Build[/font]: [font=courier new,courier,monospace]data/foo_ca.texture[/font]
*[font=courier new,courier,monospace]Matches rule[/font]: "[font=courier new,courier,monospace]data/foo_ca.texture[/font]", inputs are "[font=courier new,courier,monospace]temp/foo_ca.tga[/font]"
**[font=courier new,courier,monospace]Build[/font]: [font=courier new,courier,monospace]temp/foo_ca.tga[/font]
***[font=courier new,courier,monospace]Matches rule[/font]: "[font=courier new,courier,monospace]temp/(.*)_ca.tga[/font]", inputs are "[font=courier new,courier,monospace]foo_c.tga[/font]" and "[font=courier new,courier,monospace]foo_a.tga[/font]"
***[font=courier new,courier,monospace]Search content map[/font] for "[font=courier new,courier,monospace]foo_c.tga[/font]" and "[font=courier new,courier,monospace]foo_a.tga[/font]"
***[font=courier new,courier,monospace]Run plugin[/font]: [font=courier new,courier,monospace]TexCombiner[/font], inputs {"[font=courier new,courier,monospace]content/bar/foo_c.tga[/font]", "[font=courier new,courier,monospace]content/bar/foo_a.tga[/font]"}, output "[font=courier new,courier,monospace]temp/foo_ca.tga[/font]"
*[font=courier new,courier,monospace]Run plugin[/font]: [font=courier new,courier,monospace]TexCompiler[/font], input "[font=courier new,courier,monospace]temp/foo_ca.tga[/font]", output "[font=courier new,courier,monospace]data/foo_ca.texture[/font]"

Whereas, if someone sets up a material to link to something like "foo.texture", which doesn't follow the convention, the data compiler follows this sequence:
[font=courier new,courier,monospace]Build[/font]: data/foo.texture
*[font=courier new,courier,monospace]Error[/font]: No rule matches "data/foo.texture"
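
In pseudo-C++, that recursive resolution is roughly (a sketch, not my actual code; plugin invocation elided):
[code]#include <regex>
#include <string>
#include <vector>

struct Rule {
    std::regex outputPattern;                 // e.g. "data/(.*)_ca\\.texture"
    std::vector<std::string> inputTemplates;  // e.g. { "temp/$1_ca.tga" }
    // Plugin* plugin;                        // TexCombiner, TexCompiler, ...
};

// Hypothetical: the filename->path table from earlier; returns false if
// the entry is missing or has its conflict flag set.
bool LookupInContentMap(const std::string& name);

// Recursively build `target`: find a rule whose output pattern matches,
// expand and build its inputs first, then run the rule's plugin.
bool Build(const std::string& target, const std::vector<Rule>& rules)
{
    for (const Rule& rule : rules) {
        std::smatch m;
        if (!std::regex_match(target, m, rule.outputPattern))
            continue;
        bool ok = true;
        for (const std::string& tmpl : rule.inputTemplates)
            ok &= Build(m.format(tmpl), rules);  // "$1" -> captured stem
        // if (ok) RunPlugin(rule.plugin, inputs, target);
        return ok;
    }
    // No rule matched: either a leaf source file (resolved via the content
    // map) or a naming-convention violation (reported as an error there).
    return LookupInContentMap(target);
}[/code]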

Thank you very much; I think I understand the basic principles. I'm currently doing something similar to the example you describe. It now seems that preventing conflicts from happening at all is better than fixing them after the fact.

I use a system similar to Java classpaths/JARs (or PhysFS?). In essence, I have a virtual filesystem onto which multiple layers of archives or directories can be mounted. The important part is that the archives are layered. For example:

data-archive:
/data/texture/tex1.png (Version 1.0)
/data/texture/tex2.png (Version 1.0)
/data/scripts/script1.lua (Version 1.0)

patch-archive:
/data/texture/tex1.png (Version 1.1)
/data/scripts/script1.lua (Version 1.1)

directory:
/data/texture/tex1.png (Version 1.2)

When I mount the archives/directories in the order data -> patch -> directory, I get this final virtual filesystem:
/data/texture/tex1.png (Version 1.2)
/data/texture/tex2.png (Version 1.0)
/data/scripts/script1.lua (Version 1.1)

This comes in really handy when delivering patches or swapping out single files for debugging purposes (at least for a hobby dev).
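
The lookup itself is trivial; whichever layer was mounted last wins. A sketch (Layer and Blob are placeholder types):
[code]#include <optional>
#include <string>
#include <vector>

// Placeholder types: a Layer is a mounted archive or directory that can
// try to read a virtual path; a Blob is the loaded file contents.
struct Blob { std::vector<char> bytes; };
struct Layer { std::optional<Blob> Read(const std::string& virtualPath); };

std::vector<Layer> g_layers;  // in mount order: data, patch, directory

// Search from the most recently mounted layer back to the base archive,
// so /data/texture/tex1.png resolves to version 1.2 in the example above.
std::optional<Blob> Resolve(const std::string& virtualPath)
{
    for (auto it = g_layers.rbegin(); it != g_layers.rend(); ++it)
        if (auto blob = it->Read(virtualPath))
            return blob;
    return std::nullopt;
}[/code]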

Yes, they typically are.

The main motivations are quicker load times and more convenient distribution.

Suppose you have 10,000 files. This is a fairly low number of individual objects, even for a game without much content.

On a desktop OS, the user's on-access AV program must scan every file. This is usually very time-consuming.

You'll probably distribute your game as an archive (e.g. zip) anyway, so packing makes no difference to distribution size. Your packer may be less efficient than zip, or more, but it doesn't really matter.

The overhead of having large numbers of files in the OS filesystem is quite significant, particularly when you remember that EVERYONE has an on-access AV scanner!

Development convenience can be provided by having dev-builds search the filesystem first (and the resources.zip second) for files.

---

There is no security / reverse-engineering benefit, because it is just as easy for a cracker to modify your big zip file as it would be to modify individual files. If you want to discourage casual reverse engineering (or graphics ripping, etc.), then rename your .zip file to .zpi or something :)

Thousands of files in one directory will simply kill a Windows-based machine, regardless of whether it's NTFS or FAT, SSD or spinning disk, AV or not.

I don't know the reason, but this has been a pathological worst case, perhaps due to CreateFile() overhead or something similar.

At minimum, put all those files into an uncompressed zip and access times will improve dramatically, despite the same amount of data.
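
For instance, with a small zip library like miniz, entries stored uncompressed can be pulled straight out of the pack (a sketch; error checks omitted):
[code]#include "miniz.h"
#include <cstddef>

// Open the pack and extract one entry by name. With entries stored
// uncompressed, extraction is essentially a file-table lookup plus a copy.
void* LoadAsset(const char* packPath, const char* assetName, size_t* outSize)
{
    mz_zip_archive zip = {};
    if (!mz_zip_reader_init_file(&zip, packPath, 0))
        return NULL;
    void* data = mz_zip_reader_extract_file_to_heap(&zip, assetName, outSize, 0);
    mz_zip_reader_end(&zip);
    return data;  // caller frees with mz_free()
}[/code]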

Another point: many third-party resources (textures, sounds, models, etc.) come with a license that requires you to deliver them in a protected form. A resource file (!= a simple zip) is at least basic protection.

[quote name='Ashaman73' timestamp='1334664051' post='4932101']
Another point: many third-party resources (textures, sounds, models, etc.) come with a license that requires you to deliver them in a protected form. A resource file (!= a simple zip) is at least basic protection.
[/quote]
Ha, no it's not. Any game that gains popularity is not protected (I mean really... Spore, anyone?), and any game that doesn't gain popularity isn't worth overcomplicating in the name of unnecessary and ineffective protection. And I highly doubt media licenses would seriously force me to deliver them in a "protected" pack file (I could be wrong, but I'd be surprised)... The reason I listed "(slightly) harder for users to muck with" as a pro is not so much that it would stop Average Joe from replacing a texture (if Pro-Hacker Henry hacks the file anyway, all he has to do is release a program, and then Average Joe can do everything Pro-Hacker Henry can), but that I think some modders get a kick out of reverse engineering things, and in a way that helps develop a modding community for the game, which (if done correctly) can be a good thing. [edit: Hodgman has pointed out that this came across as quite arrogant; please read my post below, as that was not my intention (and note that I am not talking about the legal issues of fulfilling a contract here; I'm talking more about what frob said above)]

Anyway, sorry, I'm not trying to start a holy war here. You've all brought up some great points. I hadn't thought about anti-virus programs, and I didn't realize Windows struggled with lots of files in one folder (I'd probably organize them into subfolders anyway, but it's good to know).

Sorry to whoever voted down Cornstalks, but I had to undo your vote.
I am the author of MHS (Memory Hacking Software) and I have a very informed view on this topic.

Hackers can't really be prevented. Instead, what we anti-cheat specialists aim for is simply to minimize the spread of cheats, and Cornstalks was basically trying to illustrate this.

I have models that I use as test material for my own engine, which I got from a site, but in the back of my mind I know for a fact that they were illegally ripped out of a Final Fantasy game. The model is a raw hack of the data.

Fine. I am still going to use that data as test material so I can gauge the progress of my engine. Some of my data I know to be illegally ripped from Halo as well, but I was not the one who ripped it.

But this is peanuts compared to some of the things I myself have ripped from games, which include entire levels, not just individual models.
If you were to proclaim that you had any form of unhackable resource in any product you made, I would simply laugh. The digital age plus protection? Give me a break.

No, we can't stop hackers. But you really don't realize how effective the small stuff is.
As Cornstalks mentioned, the "slightly harder for users to muck with" situation is actually extremely effective when combating hackers. I have first-hand experience talking with hackers of all levels, and I can personally confirm that hackers without much skill give up very easily.
In my own engine I have a custom compression system that acts as a deterrent to hackers. Why?
Only a little has changed from the standard libraries, but to handle that change you still have to rewrite the entire decompression system.
Even among those who realize that, very few are willing to actually do it. "Meh, I'll just hack something else," is how most will reply.

They end up creating more basic-level hacks and then keeping them to themselves. Why? Because there is no prestige in releasing a hack that everyone else can make.

The benefits of deterring the basic and obvious cheats are actually quite huge and very, very frequently underestimated.


L. Spiro

[quote name='L. Spiro' timestamp='1334679765' post='4932176']Sorry to whoever voted down Cornstalks, but I had to undo your vote.[/quote]I downvoted it because of this arrogant dismissal of the reality that lawyers don't understand technology. I can paraphrase it as "[i]Ashaman73, I've no experience with such legal requirements, so I'll declare that they don't exist[/i]".
[quote name='Cornstalks' timestamp='1334675205' post='4932151'][quote name='Ashaman73' timestamp='1334664051' post='4932101']Another point: many third-party resources (textures, sounds, models, etc.) come with a license that requires you to deliver them in a protected form. A resource file (!= a simple zip) is at least basic protection.[/quote]Ha, no it's not. Any game that gains popularity is not protected (I mean really... Spore, anyone?), and any game that doesn't gain popularity isn't worth overcomplicating in the name of unnecessary and ineffective protection. And[b] I highly doubt media licenses would seriously force me to deliver them in a "protected" pack file[/b] (I [s]could be[/s] [i]am [/i]wrong, [s]but I'd be surprised[/s])...[/quote]The point was that if you're legally obliged to use 'protected' files, then you're forced to jump through this hoop. Yes, your 'protected' files can easily be opened, but that doesn't change the fact that if there's a legal requirement to use 'protected' files, you may have to comply. And yes, such legal hoops do exist and are an important detail in the real world.

For example, anyone can rip a copy-protected DVD easily, but the fact that the DVD has weak anti-copying measures means you've crossed a particular legal line in the sand, which makes the lawyer's job much easier when prosecuting pirates. Even though this copy protection is useless at stopping copies from being made, publishers use it anyway because it becomes a legal weapon ([i]the data was "protected" and you "broke" that protection[/i]).

[quote name='Hodgman' timestamp='1334719087' post='4932350']I downvoted it because of this arrogant dismissal of the reality that lawyers don't understand technology. [...] And yes, such legal hoops do exist and are an important detail in the real world.[/quote]
I will apologize for the arrogance; I honestly didn't intend it to come across as arrogantly as it evidently did (when I read "protection" I immediately thought of Spore, which has always been a funny example of epic failure to me, hence the "ha" part). I am indeed sorry for that.

I will say I am surprised to hear (from Hodgman and Ashaman) that it's apparently not uncommon for contracts to require assets to be packed and protected... Of course, if a contract states that, it's what needs to be done; I wasn't disagreeing with that. My point was more in line with frob's: "Making it harder for end users to reverse engineer is perhaps the most invalid reason." Sure, packing can keep newbies from messing with things, but if that were my goal I'd use a simple encryption/obfuscation scheme, which could sit on top of either a pack file or raw files on the file system. I see "packing things into a file" and "encrypting/obfuscating them" as two different problems with two different goals, though they can be combined.
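
(By "simple obfuscation" I mean something as trivial as a keyed XOR over the stream; a deterrent, not security:)
[code]#include <cstddef>
#include <cstdint>

// Trivial keyed XOR: enough to stop casual hex editing, trivially broken
// by anyone determined. It's symmetric: apply it again to decode.
void XorObfuscate(uint8_t* data, size_t size, const uint8_t* key, size_t keyLen)
{
    for (size_t i = 0; i < size; ++i)
        data[i] ^= key[i % keyLen];
}[/code]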

I feel the need to add a good example of the media industry understanding just enough about technology to screw you over.

Over here, we pay a small fee on every blank medium (from printer paper to DVDs) that can be used to copy things. At one point they even tried to introduce it for RAM, arguing that "temporary copies" are created in RAM. In return, we have the right to create "private copies" and distribute small numbers of them to friends (rule of thumb says around 7, and you must personally know the people you give them to).

Then the lobbying to change copyright law began, and just as Hodgman said, it became illegal to copy anything with a copy protection (there was wording about "effective copy protection", which I bet has been removed by now, because obviously, if you can copy it anyway, it wasn't that "effective"). As a result, everything is copy protected, we still pay that fee on everything, and the "private copy" right has become purely hypothetical.

In short: don't underestimate the amount of legal machinery companies and lawyers can throw at things. Even if they don't have a clue about the technology, they will still try to drown it in licenses, just to be legally on the safe side. Particularly if you use licensed art, expect the licensor to demand some way of making it non-trivial to just extract and redistribute it.

