Low-level serialisation strategies


I've been surprised to see that quite a few developers are still using serialisation strategies that are equivalent to fread/fwriting structs. Sometimes they do fancy things to change pointers into offsets at save time and then fix them up at load time, but the emphasis is still on minimising memory allocations and being able to write directly into the final structure with no extra copying. Having been out in Unity/C# land for a few years, I found this surprising when I got back to working with C++ code.
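To be concrete, I mean something along these lines (a hypothetical sketch - the struct, the "vertex data follows the header" layout, and the function names are all made up to illustrate the pointer-to-offset trick, not taken from any particular engine):

```cpp
#include <cstdio>
#include <cstdint>
#include <cstdlib>

struct Mesh
{
    uint32_t vertexCount;
    float*   vertices;   // real pointer at runtime, byte offset on disk
};

void save_mesh(std::FILE* f, const Mesh& m)
{
    Mesh onDisk = m;
    // The vertex data (3 floats per vertex in this toy example) is written
    // immediately after the header, so the "pointer" on disk is just the
    // offset of that data within the blob.
    onDisk.vertices = reinterpret_cast<float*>(static_cast<uintptr_t>(sizeof(Mesh)));
    std::fwrite(&onDisk, sizeof(Mesh), 1, f);
    std::fwrite(m.vertices, sizeof(float) * 3, m.vertexCount, f);
}

Mesh* load_mesh(std::FILE* f, size_t fileSize)
{
    // One read straight into one allocation, then patch offsets back to pointers.
    char* blob = static_cast<char*>(std::malloc(fileSize));
    std::fread(blob, 1, fileSize, f);
    Mesh* m = reinterpret_cast<Mesh*>(blob);
    m->vertices = reinterpret_cast<float*>(blob + reinterpret_cast<uintptr_t>(m->vertices));
    return m;
}
```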

My main question (especially for people who've worked on shipped games) is - do you see this often? Or are you using safer (and easier to debug) methods, whether a full serialisation system (e.g. with each field getting read or written individually), or a 3rd party serialisation system like Protocol Buffers, FlatBuffers, Cap'n Proto, etc? The latter seem to have their own limitations, such as including the schema in the data being transmitted, or expecting you to use their generated data structures rather than working from your own. Are people optimising for debuggability, or deserialisation speed, or size on disk, or compatibility with 3rd party tools, or...?


I use a simple set of functions that I hand-wrote in C for writing all kinds of things to a buffer called a 'message'.

I have ones for writing 32-bit integers and floats, and others for writing a float as a single byte with a min/max range specified for the value. So if I have a float that's expected to be between -1 and 1, it will go ahead and pack it down into one byte; then on the other end a corresponding read function reconstitutes the original -1 to 1 float from that single byte.
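Roughly, the write/read pair looks something like this (a simplified sketch rather than my exact code):

```cpp
#include <algorithm>
#include <cstdint>

// Pack a float known to lie in [minVal, maxVal] into one byte, and
// reconstitute an approximation of it on the other end.
uint8_t write_ranged_float(float value, float minVal, float maxVal)
{
    float t = (value - minVal) / (maxVal - minVal);   // normalise to [0,1]
    t = std::clamp(t, 0.0f, 1.0f);
    return static_cast<uint8_t>(t * 255.0f + 0.5f);   // round to the nearest of 256 steps
}

float read_ranged_float(uint8_t packed, float minVal, float maxVal)
{
    return minVal + (static_cast<float>(packed) / 255.0f) * (maxVal - minVal);
}
```

For a -1 to 1 value that gives 256 evenly spaced steps across the range, which is plenty for a lot of gameplay data.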

I also have read/write functions for quaternions and angle-axis rotation velocities at different bit counts, so that in cases where I want to trade some accuracy for a smaller amount of data, I can. A quat, for example, can be conveyed as 24, 32 or 64 bits. An angle-axis velocity can be conveyed in 40 or 64 bits, etc.
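For the quaternion case the usual approach is some variant of "smallest three". The sketch below assumes a 32-bit layout with 2 bits for the index of the dropped component and 10 bits for each of the other three; the bit counts and layout here are just one possible choice, not necessarily exactly what I use:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

uint32_t pack_quat32(const float q[4])
{
    // Find the component with the largest magnitude; it won't be stored.
    int largest = 0;
    for (int i = 1; i < 4; ++i)
        if (std::fabs(q[i]) > std::fabs(q[largest])) largest = i;

    // q and -q represent the same rotation, so flip signs to make the
    // dropped component non-negative and avoid storing its sign.
    const float sign = (q[largest] < 0.0f) ? -1.0f : 1.0f;

    uint32_t packed = static_cast<uint32_t>(largest);   // 2 bits: which component was dropped
    int shift = 2;
    const float maxAbs = 0.7071068f;   // remaining components lie in [-1/sqrt(2), 1/sqrt(2)]
    for (int i = 0; i < 4; ++i)
    {
        if (i == largest) continue;
        float t = (sign * q[i] / maxAbs) * 0.5f + 0.5f;                 // map to [0,1]
        t = std::clamp(t, 0.0f, 1.0f);
        packed |= static_cast<uint32_t>(t * 1023.0f + 0.5f) << shift;   // 10 bits each
        shift += 10;
    }
    return packed;
}
```

The receiver reads the three stored components back, then reconstructs the dropped one from the unit-length constraint.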

I can also read/write 3D vectors at varying bit depths, specifying min/max values to map each vector component onto the bits allowed.
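Something along these lines, generalised to an arbitrary bit count per component (again a simplified sketch; the 10-10-10 packing at the end is just an example layout):

```cpp
#include <algorithm>
#include <cstdint>

// Quantise one component into 'bits' bits given its expected range.
uint32_t quantise(float value, float minVal, float maxVal, int bits)
{
    const uint32_t maxQ = (1u << bits) - 1u;   // e.g. 1023 for 10 bits
    float t = std::clamp((value - minVal) / (maxVal - minVal), 0.0f, 1.0f);
    return static_cast<uint32_t>(t * static_cast<float>(maxQ) + 0.5f);
}

float dequantise(uint32_t q, float minVal, float maxVal, int bits)
{
    const uint32_t maxQ = (1u << bits) - 1u;
    return minVal + (static_cast<float>(q) / static_cast<float>(maxQ)) * (maxVal - minVal);
}

// A vector at 10 bits per component then fits in a single 32-bit word.
uint32_t pack_vec3_101010(float x, float y, float z, float minVal, float maxVal)
{
    return  quantise(x, minVal, maxVal, 10)
         | (quantise(y, minVal, maxVal, 10) << 10)
         | (quantise(z, minVal, maxVal, 10) << 20);
}
```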

I don't think I will ever just read/write full-precision data structures, unless there's a case where that's absolutely necessary.

Being able to block copy something from disk directly into memory was pretty much the only thing I missed when transitioning from C++ to C#. You miss out on things like forward/backward compatibility, but the loading/saving code is so much simpler and load times are really fast. There's usually no debugging - either your entire set of structs comes in perfectly, or you have the wrong endianness (which was only really relevant when building data on a little-endian PC and then reading it on a big-endian console), or the pragma pack is different.
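A bare-bones sketch of what that loading path tends to look like (the header layout and names are made up; the real code obviously depends on your own formats):

```cpp
#include <cstdio>
#include <cstdint>
#include <vector>

// Made-up header layout, just to show the shape of the approach.
#pragma pack(push, 1)   // packing must match whatever tool wrote the file
struct AssetHeader
{
    uint32_t magic;
    uint32_t version;
    uint32_t recordCount;
};
#pragma pack(pop)

static_assert(sizeof(AssetHeader) == 12, "AssetHeader layout changed");

bool load_blob(const char* path, std::vector<char>& out)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    out.resize(static_cast<size_t>(size));
    const bool ok = std::fread(out.data(), 1, out.size(), f) == out.size();
    std::fclose(f);
    // The caller just casts out.data() to the structs it expects; either the
    // whole thing comes in perfectly or the layout/endianness was wrong.
    return ok;
}
```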

We only used that technique for data loaded from disk, and not for network serialization.

Or are you using safer (and easier to debug) methods


I've seen so many more complicated, over-engineered, _far harder_ to debug, and practically-speaking no safer serialization systems out there.

Sometimes simpler is better.

Sean Middleditch – Game Systems Engineer – Join my team!

My main question (especially for people who've worked on shipped games) is - do you see this often?


Yes. Every title I've shipped has ultimately relied on a fairly low-level serialization strategy, even if it's only for "release" builds.

...a full serialisation system (e.g. with each field getting read or written individually)...


Never found a need for this personally. If you have multiple fields that need to be read/written, what's the argument against doing it in a single pass?


...or a 3rd party serialisation system like Protocol Buffers, FlatBuffers, Cap'n Proto, etc? The latter seem to have their own limitations, such as including the schema in the data being transmitted, or expecting you to use their generated data structures rather than working from your own.


I've researched similar systems several times over the years. They never manage to really live up to our needs. Packing schema data into a message is totally inappropriate for realtime binary communication protocols, for instance. It's also often a no-go for shipping assets because of desires to keep the size down and the format obscure.

Are people optimising for debuggability, or deserialisation speed, or size on disk, or compatibility with 3rd party tools, or...?


All of the above, in varying combinations. Debugging is important for tools and pipelines. Speed is important for network protocols. Size is important for network traffic as well as disk storage. Sometimes we need to interop with things which usually means foregoing binary serialization and just using an interchange format.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

I haven't switched from simple serialization methods since I learnt them long ago. It's basic. It's understood. Both in C++ and C#.

You understand exactly what you are writing out. Block transfer methods are quick, but you really need to be clear on what is in the block of data. As the other guys attest, have a tried and true method you know works and stick with it. But there are always other ways.

I use the most basic of serialisation to write my objects out. I try to keep this simple because I like to keep it in atomic steps to see what I'm writing.

The drawback comes when you are writing out 50 MB or so. Having said that, it's only really a problem when serialising to files: finding a specific problem element can be an issue.

Indie game developer - Game WIP

Strafe (Working Title) - Currently in need of another developer and modeler/graphic artist (professional & amateur's artists welcome)

Insane Software Facebook

I have my own engine for my indie game, and I contract on another game engine at the moment. Both make use of this strategy extensively for data files that are compiled by the asset pipeline (so they are automatically rebuilt if the format changes).

Often we don't even fix up pointers on load. I have a template library containing pointer-as-relative-offset, pointer-as-absolute-offset, fixed array, fixed hash table, fixed string (with an optional length prepended to the chars, and the hash appearing next to the offset to the chars), etc.
Instead of doing a pass over the data to deserialize these, they're just left as is, and pointers are computed on the fly in operator->, etc. This can be a big win where you know a 2-byte offset is enough but a pointer would be 8 bytes.
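A minimal sketch of the relative-offset idea (not my actual library; it assumes the offset is relative to the address of the offset field itself, and skips null handling and the absolute-offset variants for brevity):

```cpp
#include <cstdint>

template <typename T, typename OffsetType = int16_t>
struct RelativePtr
{
    OffsetType offset;   // 2 bytes in the file instead of an 8-byte pointer

    T* get() const
    {
        // Compute the real pointer on the fly - no fix-up pass at load time.
        const char* base = reinterpret_cast<const char*>(this);
        return reinterpret_cast<T*>(const_cast<char*>(base) + offset);
    }

    T* operator->() const { return get(); }
    T& operator*()  const { return *get(); }
};
```

The data compiler just emits the stored offsets directly, so the loaded file is usable as-is.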

As above, I find that this KISS solution costs a lot less stress and time than the more over-engineered solutions.
Also as above, I don't spend too much time debugging this stuff at all, and it either works fine or breaks spectacularly. Leaving a few unnecessary offsets to strings in the data can be useful if you do have to debug something.
I usually generate my data with just a C# BinaryWriter (plus extension classes to make writing things like offsets and fixed-size primitives clearer), and use assertions when writing structures that the number of bytes written equals some hard-coded magic number. The C++ code also contains static assertions that sizeof(struct) equals a magic number. If you update a structure and forget to update these assertions, the compiler reminds you very quickly.
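The C++ side of that guard is just a static assertion like the one below (the structure and its expected size are made-up examples):

```cpp
#include <cstdint>

struct ParticleEmitterData
{
    uint32_t flags;
    float    spawnRate;
    float    lifetime;
    uint16_t maxParticles;
    uint16_t textureIndex;
};

// If someone changes the structure and forgets the corresponding writer (and
// this constant), the build fails immediately instead of loading garbage.
static_assert(sizeof(ParticleEmitterData) == 16, "update the asset writer and this size check");
```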

Save game files, user generated content, and online data tend to use more heavyweight serialisation systems/databases that can deal with version/schema changes, as these don't go through the asset compiler.

I just finished writing an article on this exact topic - it's currently pending approval - hopefully some of the members interested in this topic will participate in the peer review!
I have to agree that a KISS strategy is best. A well-organized suite of serialization methods reads nearly as cleanly as any of the declarative frameworks like Protocol Buffers, reduces dependencies, simplifies the build process (again, Protocol Buffers), and is far simpler to debug. Serialization isn't rocket science; there's no reason to make it that way with some opaque abstraction.
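To illustrate, a hand-rolled suite along these lines stays perfectly readable (the names are hypothetical, and endianness and versioning are ignored for brevity):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Writer
{
    std::vector<uint8_t> bytes;

    void u32(uint32_t v)           { append(&v, sizeof(v)); }
    void f32(float v)              { append(&v, sizeof(v)); }
    void str(const std::string& s) { u32(static_cast<uint32_t>(s.size())); append(s.data(), s.size()); }

private:
    void append(const void* p, std::size_t n)
    {
        const uint8_t* b = static_cast<const uint8_t*>(p);
        bytes.insert(bytes.end(), b, b + n);
    }
};

// Each field is written individually, in one obvious place, which is what
// keeps it easy to step through and debug.
struct PlayerProfile
{
    std::string name;
    uint32_t    level;
    float       playTimeHours;

    void serialize(Writer& w) const
    {
        w.str(name);
        w.u32(level);
        w.f32(playTimeHours);
    }
};
```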

...a full serialisation system (e.g. with each field getting read or written individually)...


Never found a need for this personally. If you have multiple fields that need to be read/written, what's the argument against doing it in a single pass?


1. Because 99% of the objects in the engines I've worked with are not Plain Old Data
2. Because the data is padded or aligned differently on different platforms

Issue 1 I've seen approached with wacky pointer-mangling tricks. Then if one bit of data is wrong on the way in, everything's completely wrong. It seems to be complicated by requiring all the data to be coalesced into one contiguous chunk, or perhaps using several chunks each marked with their former location. Messy. I also have no idea if anyone ever got this to work on standard library objects.

Issue 2 I've seen approached with a variety of brittle attempts at manual padding, switching member ordering around, macro'd types that add padding depending on the platform, etc. This seems to be a massive source of bugs because you don't always notice data that's getting splashed over the wrong fields, offset 4 bytes earlier (for example).

I can appreciate the hypothetical speed benefits of these approaches, but given how error-prone they are, I wonder whether there is any real benefit.

