
Protodata—a language for binary data

I recently rewrote and improved some old software of mine, and figured the next step is to put it in the hands of people who might use it.


Protodata lets you write binary data using a textual markup language. I’ve found this really useful in game development when I want a custom format, would rather not use plain text or XML, and don’t want to invest the time to make a good custom editor. Protodata supports signed and unsigned integers, floating-point numbers, and Unicode strings, all with a choice of bit width and endianness.


Edit: here’s an example document describing a cube mesh.


# File endianness and magic number
big u8 "MESH"

# Mesh name, null-terminated
utf8 "Cube" u8 0

# Using an arbitrary-width integer for zero padding
u24 0

# Vertex count
u32 8

# Vertex data (x, y, z)
+1.0 +1.0 -1.0
+1.0 -1.0 -1.0
-1.0 -1.0 -1.0
-1.0 +1.0 -1.0
+1.0 +1.0 +1.0
-1.0 +1.0 +1.0
-1.0 -1.0 +1.0
+1.0 -1.0 +1.0

# Number of faces
u32 6

# Face data (vertex count, vertex indices)
4 { u16 0 1 2 3 } # Back
4 { u16 4 5 6 7 } # Front
4 { u16 0 4 7 1 } # Right
4 { u16 1 7 6 2 } # Bottom
4 { u16 2 6 5 3 } # Left
4 { u16 4 0 3 5 } # Top
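
To check a converter's output against the example, it can help to build the first few fields by hand. The following is a minimal Python sketch of the 16 header bytes the document above appears to describe. It assumes that `u8 "MESH"` emits one byte per character, that `utf8 "Cube"` emits the UTF-8 bytes with no terminator of its own, and that `big` applies to every multi-byte integer that follows. Those readings are inferred from the comments in the example, not from Protodata's documentation, and the name `build_header` is hypothetical.

```python
import struct

def build_header(name: str, vertex_count: int) -> bytes:
    """Hand-build the header bytes the example appears to describe (big-endian)."""
    out = bytearray()
    out += b"MESH"                          # magic number, one byte per character
    out += name.encode("utf-8") + b"\x00"   # mesh name, null-terminated
    out += (0).to_bytes(3, "big")           # u24 0: three bytes of zero padding
    out += struct.pack(">I", vertex_count)  # u32 vertex count, big-endian
    return bytes(out)

header = build_header("Cube", 8)
print(header.hex(" "))
# 4d 45 53 48 43 75 62 65 00 00 00 00 00 00 00 08
```

The three padding bytes bring the name field up to eight bytes, so the vertex count starts at offset 12.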


Please tell me what you think and offer suggestions for improvement. :)

Edited by EvincarOfAutumn

C0lumbo    4411

Have you considered including support for pointers?


In the textual form you would need some syntax for creating a label (a thing you can point at) and some syntax for adding a pointer (a thing that points at a label). You would probably also have to define whether pointers are 64-bit or not, either at the top of the text file or through a command-line parameter when you convert text to binary.


In the binary form, the labels would be converted into values representing their offset in bytes from the start of the binary blob, and the blob would also store a table of the pointer locations. The load code could then iterate through the pointer table and fix up the pointer offsets.


That way you'll have support for in-place loading of potentially quite complicated structures.
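
The fix-up step described above can be sketched in a few lines. This is a minimal Python illustration, not a proposed Protodata feature: it assumes 64-bit little-endian pointer fields, passes the pointer table in as a plain list instead of reading it from the blob, and uses a made-up base address. The name `fix_up` is hypothetical. In C the same loop would add the blob's load address to each stored offset.

```python
import struct

PTR = "<Q"  # assumption: 64-bit little-endian pointer fields

def fix_up(blob: bytearray, pointer_table: list[int], base_address: int) -> None:
    """Rewrite each stored offset into an absolute address, in place."""
    for location in pointer_table:
        (offset,) = struct.unpack_from(PTR, blob, location)
        struct.pack_into(PTR, blob, location, base_address + offset)

# A tiny blob: one pointer field at byte 0 that refers to a label at byte 8.
blob = bytearray(16)
struct.pack_into(PTR, blob, 0, 8)   # the pointer field holds the label's offset
blob[8:12] = b"Cube"                # the labelled data
pointer_table = [0]                 # location of every pointer field in the blob

fix_up(blob, pointer_table, base_address=0x1000)
(address,) = struct.unpack_from(PTR, blob, 0)
print(hex(address))  # 0x1008
```

Because only the fields listed in the table are touched, the rest of the blob can be used exactly as it was read from disk.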


Have you considered including support for pointers?


I hadn’t really, but that’s a good idea. Macros and expressions are in the works—the main challenge there is just designing something nice, not so much the actual implementation. I’ll be sure to include a way to make absolute and relative references.


Support for something like typedef, and perhaps structs.


Yeah, there definitely needs to be some way to avoid repetition. Both your examples would be covered by a macro system. For now the C preprocessor works:


#define int s32

int 1 2 3

#define something(foo, bar, text) \
  u16 foo \
  u32 bar \
  utf8 text \
  u8 0

something(0, 1, "x")


But I intend to add something more concise and typesafe. In particular, something that feels more native to the language.

Aardvajk    13205

I wrote something like this a few years back, primarily as an assembler for virtual machines but then ended up using it to prototype level data for games prior to writing an editor.


I found in the end that it was kind of useful as a temporary tool, but that the time invested in writing the scripts (correctly) was actually much more overall than the time needed to write a quick-and-dirty tool specific to the job I needed the binary data for.


Still, nice to see someone else ended up thinking the same way I did back then. :)

ApochPiQ    22999
I think conceptually this is a good idea, but it's really hard to make something like this into a "killer app" for basically the reasons that Aardvajk already articulated.

I would see this sort of thing being far more powerful in an environment where you have a lot of platforms that need to speak a common protocol, and having a readable spec of the protocol sitting in front of you is very valuable versus trying to reverse engineer the protocol from some code.

Unfortunately, someone else beat you to the killer app in that space.

I honestly have a hard time coming up with things I would change about protocol buffers for the specific purpose they're used for; but if nothing else, you could start by looking over their work and seeing what sorts of enhancements scratch your itches.


I think conceptually this is a good idea, but it's really hard to make something like this into a "killer app" for basically the reasons that Aardvajk already articulated.


Yeah, I’m well aware of the limitations. Just trying to decide where to go with it.


I think what I’d really like for Protodata to become is the killer DSL for two things:

  1. Writing a legible, executable file format spec; and
  2. Doing simple reporting and transforming of data in existing files.

Think “binary sed with external modules”.

ApochPiQ    22999
Well I'd see format version control as just a special case of data transformation. It's also a common enough need that I could see it being very cool to have a system that can automatically version up old data into new data (or backwards if needed).

Where that gets tricky in your current model is that format metadata is interleaved with the actual content. IMHO what makes protocol buffers so nice is that the two are cleanly separated; versioning is trivial to compute based on just metadata inspection.
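
A rough Python sketch of what that separation buys, under loud assumptions: the schemas, field names, and helper functions below are all invented for illustration, and this is not how protocol buffers encode data. The point is only that when the format description lives apart from the values, the difference between two versions can be computed from the descriptions alone.

```python
import struct

# Hypothetical schemas: field names and struct codes, kept apart from the values.
SCHEMA_V1 = [("vertex_count", "I"), ("face_count", "I")]
SCHEMA_V2 = [("vertex_count", "I"), ("face_count", "I"), ("material_id", "H")]

def pack(schema, values, defaults=None):
    """Serialize values according to schema, big-endian, filling gaps from defaults."""
    defaults = defaults or {}
    fmt = ">" + "".join(code for _, code in schema)
    return struct.pack(fmt, *(values.get(name, defaults.get(name)) for name, _ in schema))

def unpack(schema, data):
    """Read the fields named by schema back into a dictionary."""
    fmt = ">" + "".join(code for _, code in schema)
    return dict(zip((name for name, _ in schema), struct.unpack(fmt, data)))

def added_fields(old, new):
    """Versioning by metadata inspection alone: which fields does new add?"""
    old_names = {name for name, _ in old}
    return [name for name, _ in new if name not in old_names]

old_bytes = pack(SCHEMA_V1, {"vertex_count": 8, "face_count": 6})
record = unpack(SCHEMA_V1, old_bytes)
new_bytes = pack(SCHEMA_V2, record, defaults={"material_id": 0})

print(added_fields(SCHEMA_V1, SCHEMA_V2))  # ['material_id']
print(new_bytes.hex())                     # 00000008000000060000
```

With the type annotations interleaved in the content, the same upgrade would require parsing every document to discover its layout first.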

Honestly I can only see a relatively small niche case for having the two interleaved as in your approach - which is rapid prototyping. For long-lived and stable systems, I see the interleaving as more of a liability than anything, although it'd still be handy to support transformations between formats for on-the-fly "dynamic" compatibility/interoperation.

Hiwas    5807

I have also done something very much like this in the past.  The reason for it was a bit different but the explanation may be useful for further thought.  First off, I wrote the binary data as text just like this but I did so in a manner which made data tables compatible with Lua.  Eep, a script language for my binary data.  Oh how useful it is though...  I had three goals:


1)  Source control.  Need I mention that binary sucks for source control? :)

2)  Rebuild from source.  I personally think of two levels of source.  There is the DCC content (max/maya/etc files) and then the "exported" data.  If that exported data is in a basic text format and you change just how you pack it into a file, you can avoid re-exporting through the third-party DCC tools and speed up turnaround considerably by simply rebuilding the binary outputs (and not having to implement versioned readers).

3)  Variable input data formats.  Max and Maya at the time didn't have similar curve types, bits of data were exported in different orders, and so on: all the hassles you generally have.  (And FBX wasn't really a common format at the time.)  So I just tagged the file with "exported from Max" or "exported from Maya" and wrote Lua scripts, loaded during conversion to binary, which dealt with the variations.  As such the tools pipeline was solid and fixed; 90+% of the time, when a problem was found or a format was changed, only the little bits of Lua here and there were modified.  I.e. using a Lua state, you load the exported file, do a single table lookup "data { exporter="Max" }", then read in "Max.lua"/"Maya.lua"/etc. to create the accessors which translate from text to binary.
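
The tag-then-dispatch idea can be sketched as follows. The original pipeline used Lua; this is a Python stand-in, and everything specific in it is an assumption made for illustration: the record layout, the accessor names, and in particular the idea that the two exporters differ in axis order. The fixed part is `to_binary`; only the small accessor functions would change when an exporter's quirks change.

```python
import struct

def read_max(record):
    # Assumption for illustration: this exporter already writes x, y, z.
    return tuple(record["position"])

def read_maya(record):
    # Assumption for illustration: this exporter writes x, z, y.
    x, z, y = record["position"]
    return (x, y, z)

# One accessor per exporter tag, standing in for "Max.lua" and "Maya.lua".
ACCESSORS = {"Max": read_max, "Maya": read_maya}

def to_binary(exported: dict) -> bytes:
    """Pick the accessor from the file's exporter tag, then pack big-endian floats."""
    read_position = ACCESSORS[exported["exporter"]]
    out = bytearray(struct.pack(">I", len(exported["vertices"])))
    for record in exported["vertices"]:
        out += struct.pack(">3f", *read_position(record))
    return bytes(out)

from_max = {"exporter": "Max", "vertices": [{"position": [1.0, 2.0, 3.0]}]}
from_maya = {"exporter": "Maya", "vertices": [{"position": [1.0, 3.0, 2.0]}]}

# Both inputs describe the same vertex, so the binary output is identical.
print(to_binary(from_max) == to_binary(from_maya))  # True
```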


In this way I was able to do everything I believe you intend in a manner which was not only simple to read/modify by hand, but also able to deal with the long tail of overall maintenance of the data.  Not to mention that some metadata recorded where the Lua file came from, so if artist X didn't check in the DCC source, the nightly build screamed at them when it realized it had only a Lua file and not the DCC source data.


I do suggest looking into Lua or other script languages for the format of the text because it is of huge benefit to later have a "language" available when going from text to binary.  I personally like Lua due to its freeform nature of mixing maps and vectors into a single data structure.  For reasons of access later it just simplified things not to have to worry about order of data and such.  (Not to mention it is damned fast for what it is. :))
