Asset Format and Loading Schemes

4 comments, last by vreality 11 years, 11 months ago
Over the years I’ve been pretty happy with the solutions I’ve seen to lots of engineering problems. But a couple of areas in video games seem resistant to solutions that don’t leave you thinking there just has to be a cleaner, easier, less error-prone, lower-maintenance approach.

Probably the longest-lived offender has been asset formats and loading. I would like to know your favorite asset format and loading schemes. I’m not talking about packaging, compression, streaming, etc. I’m talking about the binary layout and structure of asset data, and how loaded data becomes usable by the run-time code.

How does your favorite approach work? What are its strong and weak points?
In-place data - assume that data-structures and file-formats are the exact same thing. Once you've described a "file-format", you can use it on disk and in memory, which means you can load assets in/out of memory without needing the words "parser" or "serializer". There's also no language boundary - anything with a byte-stream can read/write your files/data-structures once it implements the "format".

The catch is you've got to avoid using pointers inside data-structures. You can still pass pointers around and use them in run-time only data-structures, but you should avoid using them in any persistent data-structure.
If a data-structure does need pointers internally, you can add Deserialize/Serialize-esque calls that allow it to convert IDs to pointers (and back) during loading/saving.
N.B. you can almost entirely avoid pointers in practice. Instead you can use indexes, keys, relative byte-offsets (added to the address of the offset) and byte-addresses (added to a base pointer from the runtime).
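
To make the offset idea concrete, here is a minimal sketch of a self-relative "pointer" (the names are invented for illustration, not any particular engine's format):

```cpp
// Minimal sketch of a self-relative "pointer": the stored value is a byte
// offset from the offset's own address, so the data stays valid wherever
// the file happens to be loaded in memory.
#include <cstdint>

template <typename T>
struct RelPtr
{
    int32_t offset;   // 0 means "null"; otherwise target = this + offset

    T* get()
    {
        if (offset == 0) return nullptr;
        return reinterpret_cast<T*>(reinterpret_cast<char*>(this) + offset);
    }

    void set(T* target)
    {
        offset = target
            ? static_cast<int32_t>(reinterpret_cast<char*>(target) -
                                   reinterpret_cast<char*>(this))
            : 0;
    }
};

// Example in-place structure: the file on disk and the in-memory layout are
// the same bytes, so "loading" is just reading the blob and casting.
struct MeshHeader
{
    uint32_t      vertexCount;
    RelPtr<float> positions;    // points elsewhere inside the same blob
};
```

Loading then amounts to reading the blob and casting its start to MeshHeader*; there is no parse step and no fix-up pass, at the cost of the data builder having to write those offsets correctly.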

Knock-on benefits to the runtime include:
* No parsing, no runtime bloat. N.B. this requires platform-specific data builders, e.g. Xbox final data files will be different to PC final data files.
* In-place dynamic allocation -- if, for example, a particular level spawns up to 5 monsters at a time, then its file can include a pool-allocator data-structure with a capacity of 5 (see the sketch after this list). This will be wasted space on the disc, but let's assume the streaming/compression layer takes care of that.
When the runtime code for that level spawns a monster, it doesn't need to allocate memory for it with new/malloc; it can instead grab memory from the pre-allocated pool.
Game middleware has to at least support the overriding/hooking of new/malloc calls by the user, but middleware that doesn't even make new/malloc calls outside of loading is even better.
* All structures (except runtime-only or Deserialize/Serialize-esque ones) are POD, which means you can memcpy them, treat them like values, cache/parallelise/distribute them, etc...
* Variable-size structures are supported (i.e. structures whose actual in-memory footprint varies per instance rather than being fixed at sizeof(Class)).
* If you describe your formats using a language with reflection, you can easily inspect/debug any data-structure, whether it's in the tools, or on disk or in the game's memory.
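
Here is a rough sketch of the in-place pool mentioned in the "in-place dynamic allocation" point above; again the layout and names are invented, not taken from any real engine:

```cpp
// Rough sketch of a pool that lives inside the loaded level blob.
// The data builder writes 'capacity' zeroed slots into the file; at runtime,
// spawning a monster just claims a slot - no malloc/new involved.
#include <cassert>
#include <cstdint>

struct Monster
{
    float   x, y, z;
    int32_t health;
};

struct MonsterPool          // serialized as part of the level file
{
    uint32_t capacity;      // e.g. 5, decided at data-build time
    uint32_t liveCount;     // written as 0 in the file
    Monster  slots[1];      // classic struct-hack: 'capacity' entries follow in the blob

    Monster* spawn()
    {
        assert(liveCount < capacity);
        return &slots[liveCount++];
    }
};
```

A real pool would also track freed slots; the point is only that spawn() never touches new/malloc, because the file already carried the storage (and this doubles as an example of a variable-size structure).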
What Hodgman said.

Since I use Lua almost exclusively these days, most of my assets and data structures are saved to file as compiled Lua code. Loading a table becomes as easy as a dofile() call, or similar. Structured correctly, you can save your entire game state simply by iterating the set of tables comprising your game and saving the compiled code. It's fast, it's robust, it's easy. It's not quite as desirable as a straight "memory dump", but it's the next best thing. I take advantage of Lua's own parser to parse my format files, instead of writing a parser for each file type.
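
For anyone curious what the native side of that looks like, here is a hedged sketch of a C++ host loading such an asset through the standard Lua C API (the file contents and field names are just an example, not JTippetts' actual format):

```cpp
// Sketch of the C++ host side of "assets as (compiled) Lua code".
// The asset file is a Lua chunk that returns a table, e.g. the tool emits:
//     return { name = "imp", health = 100, speed = 2.5 }
// optionally pre-compiled with luac / string.dump so loading skips the parse.
extern "C" {
#include <lua.h>
#include <lauxlib.h>
#include <lualib.h>
}

bool loadMonsterAsset(lua_State* L, const char* path)
{
    // luaL_dofile loads and runs the chunk; its return value stays on the stack.
    if (luaL_dofile(L, path) != 0 || !lua_istable(L, -1))
    {
        lua_pop(L, 1);              // discard the error message / non-table value
        return false;
    }

    lua_getfield(L, -1, "health");
    int health = (int)lua_tointeger(L, -1);
    lua_pop(L, 1);

    lua_getfield(L, -1, "speed");
    double speed = lua_tonumber(L, -1);
    lua_pop(L, 2);                  // pop speed and the asset table

    (void)health; (void)speed;      // hand these off to the game-side object
    return true;
}
```

If the file is plain (uncompiled) Lua it loads exactly the same way; pre-compiling just removes the parse step.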
On 4/29/2012 at 6:47 AM, Hodgman said:

In-place data - assume that data-structures and file-formats are the exact same thing.

This is what I call "live" or "frozen" objects. Load 'em up and pretend like you constructed 'em. The tool has to know the binary layout of the live run-time objects. The first asset system I designed was like this.

Added features:

  • Arbitrary pointers within an asset file. They were written as relative pointers, and a fix-up list was added to the file so the loader could find them and convert them back to real pointers on load. (This and the asset-handle table below are sketched in code after the list.)

    IIRC this was done with serializable pointer objects which registered themselves in tool code. When the address range in which a pointer resided, and the address range to which it pointed, got written to disk, the system could automatically generate a fix-up record for it (and could automatically fail, if the second range wasn't written). In run-time code, it was a plain pointer.
     
  • Cross-file references via Asset Handles. The run-time kept a hash table of Asset ID/pointer pairs. Each file got an asset manifest, so each asset got registered on load. An asset handle held a pointer to an entry in the hash table. References to assets came in as IDs and got fixed up as handles.

    Hash table entries could be generated with IDs alone, so handles could be created to assets not yet registered. Those hash table entries would be fixed up when the asset was registered. By the same mechanism, assets could be swapped out while code was holding handles to them.
     
  • Virtual functions for top level asset classes. Since we had a manifest of asset instances, it wasn't a stretch to add a type ID and call placement new on each one, which would plug in its virtual function table pointer, if any (and wouldn't trash the object, as long as it used a constructor that didn't do anything).
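
For concreteness, here is a condensed sketch of the first two mechanisms (load-time pointer fix-ups and the handle table), with invented names and a 64-bit pointer assumption; it is not the original implementation:

```cpp
// Condensed sketch, not the original code. Assumes 64-bit pointers.
#include <cstdint>
#include <unordered_map>

// --- Pointer fix-ups -------------------------------------------------------
// The tool writes each pointer as an offset into the blob and appends a list
// of the locations that need patching. The loader walks that list and turns
// each stored offset back into a real address.
struct FixupRecord
{
    uint32_t pointerOffset;   // where in the blob the pointer field lives
};

void applyFixups(char* blob, const FixupRecord* fixups, uint32_t count)
{
    for (uint32_t i = 0; i < count; ++i)
    {
        uint64_t* slot   = reinterpret_cast<uint64_t*>(blob + fixups[i].pointerOffset);
        uint64_t  target = *slot;                        // offset written by the tool
        *slot = reinterpret_cast<uint64_t>(blob + target);
    }
}

// --- Asset handles ---------------------------------------------------------
// Cross-file references arrive as IDs. The table maps ID -> pointer; a handle
// points at the table entry, so the asset can be registered (or swapped out)
// after handles to it already exist.
struct AssetEntry { void* asset = nullptr; };

class AssetTable
{
public:
    AssetEntry* resolve(uint64_t id)          // get-or-create: handles may be
    {                                         // made before the asset loads
        return &m_entries[id];
    }
    void registerAsset(uint64_t id, void* asset)
    {
        m_entries[id].asset = asset;          // "fixes up" every existing handle
    }
private:
    std::unordered_map<uint64_t, AssetEntry> m_entries;
};

struct AssetHandle
{
    AssetEntry* entry = nullptr;
    void* get() const { return entry ? entry->asset : nullptr; }
};
```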

I can't remember exactly why, but I didn't like that the game and tool shared code in our implementation. I remember trying to reverse engineer undocumented debug symbol files so the tool could perform reflection on C++ object layout, like the debugger does. Maybe it was the way the code was shared that was a problem. I can't remember.

The one issue that seems to come up often is what happens when data format changes. How do you approach versioning and format change roll-out?

 

On 4/29/2012 at 7:37 AM, JTippetts said:

most of my assets and data structures are saved to file as compiled Lua code.

Crazy as it sounds, I've contemplated trying to do the same thing with C++. I figure, why worry about binary layout and fix-ups when that's what the compiler and dynamic linker already do? I could theoretically make a tool spit out C++ code which defines a bunch of static data, then compile it to a DLL and load it with the dynamic linker. But I think that not being able to load a DLL into memory which I manage has kept me from trying this, even just for kicks. Most dev teams are very picky about controlling memory management.
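
For what it's worth, the generated data might look something like the following (entirely hypothetical; MonsterDef and the symbol names are invented), with the "loader" being nothing more than the platform's dynamic linker:

```cpp
// Hypothetical output of an asset -> C++ code generator, compiled into a
// DLL/.so per asset package. Nothing here is from the thread.
struct MonsterDef            // invented shared struct, known to tool and game
{
    const char* name;
    int         health;
    float       speed;
};

// extern "C" keeps the symbol names unmangled so the runtime can look them up.
// (On Windows they would also need __declspec(dllexport) or a .def file.)
extern "C" const MonsterDef g_monsters[] =
{
    { "imp",   100, 2.5f },
    { "golem", 300, 1.0f },
};
extern "C" const unsigned g_monsterCount = 2;
```

The runtime would pull these in with LoadLibrary/GetProcAddress (or dlopen/dlsym), and that's exactly the sticking point: the dynamic linker, not the game, decides where that memory lives.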


vreality said:

I can't remember exactly why, but I didn't like that the game and tool shared code in our implementation. Maybe it was the way the code was shared that was a problem. I can't remember.

In my current implementation I chose not to share code at all - I'd rather keep the data builders and the runtime code bases as decoupled as possible, so I'm only sharing the format/specification between the two, not the code. E.g. C++ has some struct layouts, and C# has some BinaryWriter routines.

vreality said:

The one issue that seems to come up often is what happens when data format changes. How do you approach versioning and format change rollout?

You change both the structs and the BinaryWriter routines (or sync down the latest code) and compile. Then you run the data build pipeline as usual; it detects that your files were built with an old C# plugin and rebuilds the out-of-date files. This really shouldn't be a problem assuming you've got a data build pipeline, instead of a directory that you manually put game data files into.
Compiled data files aren't kept in a version control system (they might be kept in an unversioned network cache as an optimisation) - they're built from source content files, which are kept in a version control system. When compiling a data file, the inputs are hashed (including the source file and the compilation code version), allowing you to automatically recompile data files when either the source file or the data-builder code changes.
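
A rough sketch of that check, with an invented builder-version constant and a throwaway hash, just to show the shape of the key:

```cpp
// Sketch of the "rebuild if anything changed" check. The FNV-1a hash and the
// builder-version constant are stand-ins for whatever a real pipeline uses.
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>

constexpr uint64_t kBuilderVersion = 7;   // bumped whenever the data-builder code changes

uint64_t hashBytes(const std::string& bytes, uint64_t seed)
{
    uint64_t h = seed ^ 14695981039346656037ull;     // FNV-1a offset basis
    for (unsigned char c : bytes)
        h = (h ^ c) * 1099511628211ull;
    return h;
}

// The key covers the source content *and* the builder version, so changing
// either one makes previously compiled files look out of date.
uint64_t buildKey(const std::string& sourcePath)
{
    std::ifstream in(sourcePath, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
    return hashBytes(bytes, kBuilderVersion);
}

bool needsRebuild(const std::string& sourcePath, uint64_t keyStoredInCompiledFile)
{
    return buildKey(sourcePath) != keyStoredInCompiledFile;
}
```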
On 4/30/2012 at 1:43 AM, Hodgman said:

Compiled data files aren't kept in a version control system [...] - they're built from source content files, which are kept in a version control system.

Yes, that addresses the issue. However, most studios I've been at didn't like everyone building the whole project. They wanted at least an asset-engine split, in which there's a "published" version of the engine used by asset creators who don't build it, and there's a "published" version of the assets used by engineers who don't build them (except as they need to, for changing asset formats, prototyping new assets, etc.). This sort of firewall tends to reduce the impact of bad check-ins.

