• entries
146
436
• views
198529

# XML and Binary formats, designing for flexibility.

457 views

Superpig (yes, he can fly), asked an interesting question in a reply to my previous post on XML, his question concerned how one might create a binary format that was as flexible as XML. I thought I would spend some time examining this question and look into a few solutions.

First of all, we need to figure out a way of uniquely naming nodes in our binary format. Now, since this is binary, the first thing that should leap right out at you are IDs. So, lets assume that we'll be using ID numbers instead of node names (like XML does). But how long should an ID be? Well, the first thought would be to use a 32bit integer. But the chances of you needing 4.2 billion unique identifiers is close to zero. So, how about a byte? A byte allows for 256 unique identifiers, but I can probably come up with cases where I need a few thousand identifiers. So, lets aim for flexibility: We'll use a 7 bit encoded integer value. As long as the node ID is below 128 it will be 1 byte, as long as the node ID is below 32768 it will be 2 bytes long, and as long as the ID is below 2147483648 it will be 3 bytes long.

With naming out of the way, we come across another problem, how do we delineate the start of a node and the end of a node? Now, here we come to a crossroads. On the one hand we can use length encoded nodes. Each node would have a length that specified how long it was. Now, if you are thinking of inserting into the middle of a file...don't. Technically speaking, you can't do it anyways. You have to rewrite the file moving the data around as you need it anyways. Since you would probably have the node structure in memory as a tree, the saving process would easily be able to calculate the size of each node, and hence this structure is flexible. On the other hand you can use chunk start and end identifiers. These would be special values that you would use to mark the start and end of a chunk. Of course, the problem then comes in: 'How do we guarantee that these values are unique?' The simple solution would be to encode all other occurrences of the identifiers such that you don't get duplicates. A simple example would be to use 0x3C3C as our start identifier. Whenever we encounter a 0x3C in the rest of the data, we would append a 0x00 onto it, so that if we had the data value: 0x3C3F, it would end up as: 0x3C003F. The value 0x3C00 would end up being 0x3C0000, obviously. This does complicate the loading and saving of the datafile however.

We also have a problem with name collisions. If we wish to assembly documents that are composed of other documents, we may get collisions with the ID values, where one ID was used in one document to mean one node, and the same ID was used in another document to mean another ID. To solve this, we'll use a namespace ID as well, with the default namespace being 0. Again, this will be 7bit encoded to save space.

Lets assume that we go with the length encoded system. How large should the length be? Well, since we want to store an arbitrary amount of data in each node, we should be able to support any length. So, we'll use another 7 bit encoded integer. Since it's 7 bit encoded, we can (in theory) support any length integer we want. It could be stored in a 32bit integer, a 64bit integer, heck even a 128bit integer.

Our current node format looks like this: [namespace][ID][length][data].

Document structure is also important. Do we want to allow for multiple root nodes? Or would we rather just allow one root node? How about special nodes that tell us something about the document? I would probably only allow a single root node, the exception being a header node that allows me to specify information about the document. The header node would have the special ID of 0, and can contain any information that pertains to the documents format or encoding (i.e. No data). The ID values between 1 and 16 should be reserved for header nodes.

Given this, we can now write up a simple model, which is just one triangle with some normals.
[0][0][13][ <-- Header  [0][1][4][1.0] <-- version  [0][2][5][utf-8] <-- Text is encoded in UTF8 format.][0][17][94][ <-- Model root node  [0][19][45][ <-- Model Vertices    [0][18][12][0,0,0] <-- Vector<0,0,0>    [0][18][12][1,0,0] <-- Vector<1,0,0>    [0][18][12][0,1,0] <-- Vector<0,1,0>  ]  [0][20][45][ <-- Model Normals    [0][18][12][0,0,1] <-- Vector<0,0,1>    [0][18][12][0,0,1] <-- Vector<0,0,1>    [0][18][12][0,0,1] <-- Vector<0,0,1>  ]]

In XML this might look something like:
"1.0" encoding="utf-8" ?>			"0" y="0" z="0" />		"1" y="0" z="0" />		"0" y="1" z="0" />				"0" y="0" z="1" />		"0" y="0" z="1" />		"0" y="0" z="1" />

Obviously to load either file we would need to know something about the contents of it. In this case we opted out of using schemas, and so we assume that the one loading the file knows what the contents should look like. In loading either of these files, I would generate a tree containing the data in each node. This also simplifies loading and saving of both files as well, as you can just iterate over the tree.

So, the second part would be to figure out how to describe the data that each node can hold, this would involve schemas, of course, but I'll leave that up as an exercise for the reader, and perhaps a later journal entry.

Guess we'll get to XSLT next time.

Interesting post.

It would be interesting to "cross compare" this against some regular binary formats - I know that a few of my games have used formats that, to some extent, follow a similar heirarchical pattern.

I read about it a while ago, but lost touch, but there seems to be some consideration into a "Binary XML" standard. Google popped up this article, but I can't find the better "overview" that I used to have [headshake]

Cheers,
Jack

This might be of some interest.

Wow Washu,
Some of these journal's of yours should be made into articles. I would never have read these journals if I hadn't read a recent reply to a topic by evolutional suggesting to supply a link to it in the For Beginners FAQ. These are excellent articles you're writing here...everyone should read them.

ArchG

## Create an account

Register a new account