Serializing assets

12 comments, last by Zipster 3 years, 9 months ago

I'm facing a problem in my framework right now, which has to do with storing and reading assets in my own format in order to create a data-driven asset pipeline that won't feel bad to work with. I'm at a stage where loading models by file name has far overstayed its welcome: going through the whole preprocessing setup, setting materials and parameters in code, etc., every time, and wasting my CPU/HDD whenever I change anything, recompile and restart… you get the point. I got pretty far building a lot of UI for the editor, a custom file format to define my project and all that, but now I'm stuck.

I'm trying to serialize my assets to a binary format. I'm also trying to have an editor that can lazy-load them when I open one for editing. Think something similar to the UE4 content browser, but at the scope of a single (confused) developer. Some work on serialization is shown below (no versioning or endianness code, this is still an early prototype):

struct MemChunk
{
	std::unique_ptr<char[]> _ptr;
	UINT _size;

	MemChunk() : _size(0) {};

	MemChunk(std::unique_ptr<char[]> ptr, UINT size) : _ptr(std::move(ptr)), _size(size) {}

	MemChunk(UINT size) : _ptr(std::unique_ptr<char[]>(new char[size]())), _size(size) {}

	template <typename Datum> 
	inline void add(const Datum* datum, UINT& offset)
	{
		write(datum, sizeof(Datum), offset);
	}

	template <typename VecData>
	inline void add(const std::vector<VecData>& data, UINT& offset)
	{
		write(data.data(), sizeof(VecData) * data.size(), offset);
	}

	void write(const void* data, UINT cpySize, UINT& offset)
	{
		UINT newSize = offset + cpySize;

		if (newSize > _size)
		{
			char errStr[150];
			snprintf(errStr, sizeof(errStr),
				"Serializing overflow. Available: %u; \n"
				"Attempted to add: %u \n"
				"For a total of: %u \n",
				_size, cpySize, newSize);
			fputs(errStr, stderr);	// the message was built but never printed before

			exit(7645);	// Let's pretend I actually have an error code table
		}

		memcpy(_ptr.get() + offset, data, cpySize);
		offset = newSize;
	}

	bool isFull(UINT offset)
	{
		return (offset == _size);
	}
};



class SerializableAsset
{
public:

	virtual MemChunk Serialize() = 0;
};

MemChunk is similar to a small stack allocator of sorts: a byte array wrapped in a unique_ptr. Every serializable asset class (currently model, mesh, texture, animation, skeleton, material) overrides the Serialize method of SerializableAsset. Is it necessary to use inheritance here? No, but I think it's just a nice interface to have, since it expresses the intention and the fact that the class can do this. You might be wondering "where's your read method?" and you'd be right, I don't have one yet. Now, here's the code above applied to a concrete type, SkeletalModel:

	MemChunk Serialize() override
	{
		UINT numMeshes = _meshes.size();
		UINT numAnims = _anims.size();
		UINT skelIndex = 0u;
		UINT headerSize = 3 * 4 + 64;

		UINT meshIndex = 0u;
		UINT animIndex = 0u;
		UINT dataSize = (numMeshes + numAnims) * 4;

		UINT totalSize = headerSize + dataSize;

		UINT offset = 0u;
		MemChunk byterinos(totalSize);

		byterinos.add(&numMeshes, offset);
		byterinos.add(&numAnims, offset);
		byterinos.add(&skelIndex, offset);

		for (UINT i = 0; i < numMeshes; ++i)
			byterinos.add(&meshIndex, offset);

		for (UINT i = 0; i < numAnims; ++i)
			byterinos.add(&animIndex, offset);

		return byterinos;
	}
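For reference, the missing read path mentioned above could mirror write() like this. This is only a sketch on a trimmed-down MemChunk; the get() helpers and the assert-based error handling are illustrative choices, not part of the engine:

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <vector>

// Trimmed-down MemChunk with a read() mirroring write(): both advance a
// caller-owned offset, so deserialization walks the buffer the same way
// serialization filled it.
struct MemChunk
{
	std::unique_ptr<char[]> _ptr;
	unsigned _size = 0;

	explicit MemChunk(unsigned size)
		: _ptr(std::make_unique<char[]>(size)), _size(size) {}

	void write(const void* data, unsigned cpySize, unsigned& offset)
	{
		assert(offset + cpySize <= _size && "serializing overflow");
		std::memcpy(_ptr.get() + offset, data, cpySize);
		offset += cpySize;
	}

	void read(void* out, unsigned cpySize, unsigned& offset) const
	{
		assert(offset + cpySize <= _size && "deserializing overrun");
		std::memcpy(out, _ptr.get() + offset, cpySize);
		offset += cpySize;
	}

	template <typename T>
	void get(T* out, unsigned& offset) const { read(out, sizeof(T), offset); }

	template <typename T>
	void get(std::vector<T>& out, unsigned count, unsigned& offset) const
	{
		out.resize(count);	// the caller supplies the element count (e.g. from a header)
		read(out.data(), sizeof(T) * count, offset);
	}
};
```

A Deserialize() on SkeletalModel would then read numMeshes/numAnims first and use them to size the index vectors.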

How do people handle this? I've read some threads here but mostly it seems like there's some sort of array for every object type that you can index. I can't wrap my head around this. In a game, allocating everything in their own array could make sense and I guess is good for preventing fragmentation and all that (I'm aware of, and have used, memory pools and I could get behind this), however…
If we are in the editor environment, objects aren't even serialized yet; they might not be available (since we are just importing the bunch). Serializing components first seems like a solution, so that's OK. But how do you refer to them: by asset name, or by some ID (hash, incremental…)? I feel names might be the best way during editing.

I actually started doing this from higher-level static functions which could order all assets and move things around, but the per-class approach seemed cleaner (after all, serializing private members is more natural that way). Now I think I still need something higher-level to serialize such "aggregate" assets which depend on other assets, and fix up these indices after the fact.


My main problem is that I'm not sure about the system which holds all these assets in a lazy-load limbo, between the engine being aware of them and not necessarily having them in memory. It really looks like a database of project assets would be a great fit for this, but do people actually use them? If I knew what kind of system editors generally use to access other assets, I feel this would be a lot easier to understand. Do I give assets names that are not dependent on the file path (I really would like it not to break on files moving, even if that move has to be done from the editor only) and keep a table of asset name → path pairs? And do you think my serialization function is too bare and I'll need more params? (It feels a little too simple the way it is, but I don't have a feel for it yet.)



J. Rakocevic said:
If we are in the editor environment, objects aren't even serialized yet; they might not be available (since we are just importing the bunch). Serializing components first seems like a solution, so that's OK. But how do you refer to them: by asset name, or by some ID (hash, incremental…)? I feel names might be the best way during editing.

I'm just going to comment on the "how to reference?" part. I think we have to make a few distinctions along the way, as this question (for me) falls into separate categories:

  1. How do we reference assets in serialization?
  2. How do we reference/display assets for editing?
  3. How do we reference assets, if we ever have to do so, in code?

Now I've probably already given away the answer with this list, but yeah, I do think those all need separate systems. So let me just give you a brief rundown of how I'm handling those things in my engine:

  1. Every time an asset is written out, be it binary or in a text format, it is referenced via a semi-randomly generated, unique 64-bit number. This "guid" is generated when the asset is first created and imported, and stays the same for as long as the asset exists. In truth, it is stored in a sort of "meta" file alongside the actual asset file, like it would be in Unity. This allows assets to be renamed and moved, even between projects, rather easily. You need some sort of collision avoidance, but that's usually not the most problematic part.
  2. Now on to editing. How do you refer to assets there? Well, what you display to the user can range from minimal to complex, but I've settled on something like this:

So what I show the user in the end is the name of the asset, alongside a preview (only for assets where previews are available). But internally, those are all still the same guids as before, only remapped to a displayable string when shown. Of course humans work best with names, not some random-ass number, but you will want to offer that as an interface to the internal storage system, not as an "I save the name of the asset here" type of deal.

  3. For the case of referencing assets in code directly, there are many ways to go about it. Unity prevents it upfront, requiring assets to always be assigned via their reflection system, or only when stored at designated paths. This is mostly so they can auto-remove unused assets (which, spoiler alert, I don't think is a good idea, because you always end up with unnecessary assets being included over 3 corners, but maybe that's more of a detail thing). Unreal allows loading assets in constructors by specifying their path (the ctor restriction is, again, so they can refcount assets to see if they need to be included in the build).
I personally allow accessing assets in code anywhere by using a specialized shortened path, which is something like MODULE/PATH1/PATH2/…/ASSET_NAME (without file extension). This is obviously not portable when you rename assets or something, but it makes the code easier to understand than some opaque ID would; and I've never had an issue with it, since usually code only references assets within core/plugin implementations, where assets are not created and moved that often.

Hope this gives you a rough idea.

Man, that's exactly what I needed! I also realized you need different ways to address assets at different times (development vs. game runtime), and it was bugging me how to do it right. I like the approach with both ID and name, because I swear to god I'm not remembering that some tree's diffuse map is #536104 or whatnot. Keeping ID + name in the editor is light enough not to care.

Thanks, I'll look into generating something unique on the fly with a hash. Still want to let people move things around but that should be easy enough with a look up table in a file or DB.
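For what it's worth, the "generate something unique on the fly" part can be very small. A hedged sketch along the lines of the 64-bit guid described above (GuidGenerator and its collision check are hypothetical names, not from either engine):

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <unordered_set>

// Hypothetical sketch: semi-random 64-bit asset GUIDs with a simple
// collision check against the set of IDs already in the project.
class GuidGenerator
{
	std::mt19937_64 _rng{std::random_device{}()};
	std::unordered_set<uint64_t>& _existing;	// project-wide ID registry

public:
	explicit GuidGenerator(std::unordered_set<uint64_t>& existing)
		: _existing(existing) {}

	uint64_t next()
	{
		uint64_t id;
		// Retry on the (astronomically unlikely) collision; 0 is reserved
		// here as an "invalid GUID" sentinel.
		do { id = _rng(); } while (id == 0 || _existing.count(id));
		_existing.insert(id);
		return id;
	}
};
```

The registry would be persisted with the meta-files, so moving assets never changes their ID.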

One more question though. Let's say I have a GUID and I want to retrieve the asset object. Do you provide an interface to your editor that queries some kind of project configuration file with a list of assets and searches the file, or are these small definitions/GUIDs available in a map somewhere in RAM? The latter would be significantly faster, but memory-heavy if everything was in. This is primarily why I was considering a database.
I was thinking about making a cache that keeps whatever you open loaded for some time, or just until you close the editor (I think UE4 does that). But then again, I might just pay for the memory (projects that I work on have like three and a half assets in total lol…)

J. Rakocevic said:
One more question though. Let's say I have a GUID and I want to retrieve the asset object. Do you provide an interface to your editor that queries some kind of project configuration file with a list of assets and searches the file, or are these small definitions/GUIDs available in a map somewhere in RAM? The latter would be significantly faster, but memory-heavy if everything was in. This is primarily why I was considering a database.

Personally, on project load-up (let's just keep it that simple for this example), I just scan the entire assets folder, load the content of each meta-file and store them in one map (actually it's multiple maps, as I require lookup by path as well, but yeah). My game project is not that big, but not that small either. It consists of about 1500 asset files right at the moment. Project load time is 3s in debug and 1s in release, but there's a lot going on aside from asset loading. RAM usage is about 200MB for the entire thing, which contains a massive editor by now. Not trying to flex, just saying that unless you are going very, very big, don't try to overengineer for performance. I'm even doing "stupid" things like loading ALL textures on load-up, and there is no problem with that. I mean, what's the cost of a map with a bunch of GUIDs? 8 bytes * 1.5k = 12KB; the map overhead is harder to measure, but are you really worried about spending a few MBs on an asset directory?
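The scan-everything-into-maps approach described here could be sketched roughly like this (AssetMeta and AssetDirectory are illustrative names; the real engine keys multiple maps the same way):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the "load every meta-file into RAM on project
// load" approach: one map keyed by GUID, a second index keyed by path.
struct AssetMeta
{
	uint64_t guid;
	std::string path;	// current on-disk location, free to change
	std::string name;	// display name for the editor UI
};

class AssetDirectory
{
	std::unordered_map<uint64_t, AssetMeta> _byGuid;
	std::unordered_map<std::string, uint64_t> _byPath;

public:
	void add(AssetMeta meta)
	{
		_byPath[meta.path] = meta.guid;	// read path before moving the struct
		_byGuid[meta.guid] = std::move(meta);
	}

	const AssetMeta* find(uint64_t guid) const
	{
		auto it = _byGuid.find(guid);
		return it == _byGuid.end() ? nullptr : &it->second;
	}

	const AssetMeta* find(const std::string& path) const
	{
		auto it = _byPath.find(path);
		return it == _byPath.end() ? nullptr : find(it->second);
	}
};
```

Moving or renaming an asset then only means updating the path index; every serialized reference still points at the GUID.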


I once had a small side project which consisted of 50k icons, which made me write a more optimized YAML parser instead of XML before I dropped the project; but if I were to do something like that again, I could easily optimize the system. Other than that, no need to complicate things. Load everything upfront, be able to look up all the data at any time, great! (OK, just for the record, I have a thing going on where assets that are only required for certain scenes can be placed directly in those scenes. That saves about ⅓ of all assets from having to be loaded upfront.)

Also, one last thought. Doing optimized stuff like streaming-loading is fine for textures, meshes, audio, etc., but there is always going to be a bunch of stuff that has to be loaded regardless. Think scripts mostly: whether C# or blueprints or whatever, they all have to be there to compile. You can try some optimizations (like I do with my scenes, like Unity does with ASMDEFs), but in the end there is always a base set, and I think you'll just make life harder by trying to avoid having to know it. I tend to make a quick indexing pass for most of my systems now, instead of loading things on demand, where possible. It makes life easier in a way. Like, if I load all my scene GUIDs upfront, then I can detect when a scene is really missing or when some logic is broken. If I just search on demand, then I always have to load some cross-reference which, at that point, I cannot know is valid or not (since to check it I would have to load everything instantly again, if you get what I mean).

I get you. In fact this is exactly the reason I ask questions, because people that already did it and know the pitfalls of it will give me advice. I didn't even consider scripts (don't have any support for it yet) or validating whether there's missing stuff in the scene.
Big help, thanks.

J. Rakocevic said:

	MemChunk Serialize() override
	{
		...
		UINT totalSize = headerSize + dataSize;
		...
		MemChunk byterinos(totalSize);
		byterinos.add(...);
		return byterinos;
	}

Having to know the size of your MemChunk in advance is likely to be inefficient, because you need to scan recursive data structures twice (once before allocating the buffer, to compute sizes, and again after it is allocated, to write the data).

Usually a more automatically managed and more stream-like buffer abstraction doesn't need to know the total size in advance and allows writing data during the first and only visit of data structures.

You might be wondering “where's your read method?” and you'd be right, I don't have one yet.

Another popular design pattern, particularly in C++, is using some combination of templates, overloading and macros to use the same code to describe the serialized data structure both for serialization and deserialization, avoiding any duplication or discrepancy between the two. Flagship example: Boost::Serialization.
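The pattern can be demonstrated with a toy pair of archives. This is not cereal or Boost.Serialization itself, just an illustration of the "one function describes the layout for both directions" idea; all names here are made up:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Writing archive: operator& appends the raw bytes of a value.
class OutArchive
{
	std::vector<char>& _buf;
public:
	explicit OutArchive(std::vector<char>& buf) : _buf(buf) {}
	template <typename T>
	OutArchive& operator&(const T& v)
	{
		const char* p = reinterpret_cast<const char*>(&v);
		_buf.insert(_buf.end(), p, p + sizeof(T));
		return *this;
	}
};

// Reading archive: the same operator& syntax fills the value back in.
class InArchive
{
	const std::vector<char>& _buf;
	size_t _pos = 0;
public:
	explicit InArchive(const std::vector<char>& buf) : _buf(buf) {}
	template <typename T>
	InArchive& operator&(T& v)
	{
		std::memcpy(&v, _buf.data() + _pos, sizeof(T));
		_pos += sizeof(T);
		return *this;
	}
};

struct SkeletonHeader
{
	unsigned numMeshes = 0, numAnims = 0, skelIndex = 0;

	template <typename Archive>
	void serialize(Archive& ar)	// one layout description for both directions
	{
		ar & numMeshes & numAnims & skelIndex;
	}
};
```

Because serialization and deserialization share the same member function, the two can never drift out of sync.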

Omae Wa Mou Shindeiru

The only way I see that working would be reusing a memory pool big enough to have any asset allocated in it.
Otherwise I'd end up having some array resize itself (like a vector would when push is called past its capacity) and therefore copy, several times over, what could be a lot of data.
Is there another option I'm not aware of or do you suggest a pool?
(By the way, this isn't going to be used for writing some heavily nested, deep structure like a scene with its components, just assets, which are mostly local data plus references to other things. The biggest cost would be models, I guess, because I do plan to store individual mesh data serialized within a model; I don't really see the need to "reuse" meshes outside of their models, as they usually aren't useful standalone.)

@J. Rakocevic You are assuming you need to write to a single contiguous, dedicated memory buffer containing the serialized representation of your whole data structure (and, presumably, load all the data into a buffer before deserialization begins). This is an unnecessary constraint:

  • The buffer can be fragmented, allocated in segments as needed. The buffer data structure can easily translate between “virtual” positions (the logical offset where a read or write operation is taking place) and the physical address of where that object is in the data structure. Only individual primitive values, no larger than a few bytes, need a contiguous memory area.
  • You don't need to hold all data in memory: you can begin writing the initial part of the serialized representation to a file as soon as you finish building it and constructing objects as soon as you load their representation, possibly in multiple threads and in both cases reusing buffers as you go.
  • You don't have to make a copy to resize a buffer, you can allocate or reuse a buffer segment to obtain a place for new items.
    It isn't an array or a std::vector, there is no need to maintain invariants that compromise performance in order to support operations you don't need.

Also, you are going to have tree and graph data structures of unpredictable depth and variable size data, think about how to support any object from the beginning. For the purpose of designing a correct software architecture “mostly” and “usually” mean “no”.
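A rough sketch of the fragmented-buffer idea from the first bullet, with logical offsets translated to segment-relative addresses (the segment size and all names are arbitrary choices for illustration):

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <vector>

// Fixed-size segments allocated on demand; a logical ("virtual") offset
// is translated to (segment index, offset within segment), so no single
// contiguous allocation and no copying on growth.
class SegmentedBuffer
{
	static constexpr size_t kSegmentSize = 4096;
	std::vector<std::unique_ptr<char[]>> _segments;
	size_t _end = 0;	// logical end of written data

	char* at(size_t logical)
	{
		size_t seg = logical / kSegmentSize;
		while (seg >= _segments.size())
			_segments.push_back(std::make_unique<char[]>(kSegmentSize));
		return _segments[seg].get() + logical % kSegmentSize;
	}

public:
	void write(const void* data, size_t n)	// append, splitting across segments
	{
		const char* src = static_cast<const char*>(data);
		while (n > 0)
		{
			size_t room = kSegmentSize - _end % kSegmentSize;
			size_t chunk = n < room ? n : room;
			std::memcpy(at(_end), src, chunk);
			_end += chunk; src += chunk; n -= chunk;
		}
	}

	void read(size_t logical, void* out, size_t n)	// random-access read
	{
		char* dst = static_cast<char*>(out);
		while (n > 0)
		{
			size_t room = kSegmentSize - logical % kSegmentSize;
			size_t chunk = n < room ? n : room;
			std::memcpy(dst, at(logical), chunk);
			logical += chunk; dst += chunk; n -= chunk;
		}
	}

	size_t size() const { return _end; }
};
```

Only individual primitive values ever need to be contiguous; anything larger is split transparently at segment boundaries.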

Omae Wa Mou Shindeiru

I might be steering away from the initial topic of the discussion, but having considered what @lorenzogatti said and realized it might be better to use a more flexible way to serialize objects rather than a memory block of size known in advance, I installed cereal.

It's good because it has all the low-level details already sorted out, and I can write my own serialization functions for custom types, which was easy enough. It writes straight to files, so it's flexible in some ways (I don't have to count up the size ahead of time). But I think I ran into the exact same problem that made me consider my own solution initially.

What I'm currently working on is serializing a ledger of all assets in the game project. It is basically a std::unordered_set of ResourceDef, which contains the following:

struct ResourceDef
{
	uint32_t _ID;
	std::string _path;
	std::string _assetName;
	ResType _resType;

	// Functions omitted for clarity
	...
};

// and then for the whole set it's as simple as

template <typename Archive>
void serialize(Archive& ar)
{
	ar(_assetMap);
}

I'm not a big fan of strings in the engine, and I do think this needs work eventually, as a set with strings would cause some fragmentation and cold caches. However, the lifetime of this ledger is during editing, not gameplay, so for now I'd rather keep it as simple as possible.

But consider how big this can get. Not huge by any means; however, the way cereal works is that you create an archive out of a stream (in my case a file stream) and then you just pass stuff in.

Serializing/deserializing the whole set probably wouldn't take that long; in practice this isn't much of an issue. But I would highly prefer being able to do partial updates instead of rewriting the whole file (which I am fairly certain cereal does). I know changing files in the middle is not really possible (elegantly, that is; I know how to do it but it kind of sucks) unless the update is the same size as the previous content. My question, then, is whether this will bite me in the ass in the long run. What if I have heavier content that I'd actually need partial updates for, since the performance would otherwise be abysmal? Currently all heavy assets are in separate files, so that's not an issue, but you never know.

The reason I opted for my initial MemChunk was so that I could gather up all small parts of memory into their own MemChunks and then organize writing them out in whatever way I wanted. I guess, similarly, cereal could work with a stream to a buffer rather than a file. But the needless rewriting of the whole thing remains.

To me this looks like a schoolbook example of a problem solved by a database… Any thoughts?
(Also, yes, there would be a separate set for every asset type, for several good reasons; this was an example but that's the plan.)

P.S. I thought about not serializing it on every asset change, and instead having a separate save button to commit the changes to some file, which would make this rewrite a lot less common, and I guess that's how it's done. But I would value a word from someone more experienced as to whether that method would be good enough.

J. Rakocevic said:
But I would highly prefer being able to do partial updates instead of rewriting the whole file

J. Rakocevic said:
The reason I opted for my initial MemChunk was so that I could gather up all small parts of memory into their own MemChunks and then organize writing them out in whatever way I wanted.

J. Rakocevic said:
To me this looks like a schoolbook example of a problem solved by a database… Any thoughts?

Partial updates require a data structure that keeps objects separate enough to write them individually, and writing many "MemChunks" together is effectively do-it-yourself transaction handling to make several updates happen atomically. Since you are effectively implementing a database, why not base your editor on a proper DBMS that has already solved this kind of problem, with higher quality and less effort than a homemade solution?

  • You can serialize objects into BLOB columns in tables that can be searched by appropriate keys and, in all likelihood, evolve your database by moving data from serialized objects that need to be loaded as a unit to specific columns that can be queried directly.
  • You can do the opposite, starting with a database schema that covers your needs with nice queries, and using serialized objects in BLOB columns only if you add something that doesn't need to be searched (e.g. binary files or uninteresting data structures).
  • You don't need to use the database with your game: the editor can also export appropriate consolidated files that can be used directly (presumably, a serialized dump of the full data set).

SQLite is particularly suitable because it is easy to embed in an application.
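As a sketch of what such a schema might look like (table and column names are hypothetical, matching the ResourceDef fields from earlier in the thread):

```sql
-- Hypothetical SQLite schema: asset metadata in queryable columns,
-- serialized payload in a BLOB.
CREATE TABLE assets (
    guid    INTEGER PRIMARY KEY,   -- the 64-bit asset GUID
    name    TEXT NOT NULL,         -- editor-facing display name
    path    TEXT NOT NULL UNIQUE,  -- current on-disk / logical location
    type    INTEGER NOT NULL,      -- ResType stored as an integer
    payload BLOB                   -- serialized asset data
);

-- A partial update then touches one row instead of rewriting the file:
UPDATE assets SET payload = ?1 WHERE guid = ?2;
```

SQLite gives you the transactions and partial updates for free, and the editor can still export a flat consolidated file for the game build.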

Omae Wa Mou Shindeiru

