# Design Considerations for Custom Binary Formats and Data Compression (Part 1)

By Brendan G Bohannon | Published Sep 06 2013 09:54 AM in General Programming

Tags: file formats, data compression, entropy coding, huffman coding

While often there are general purpose file-formats or data representations to achieve a goal, sometimes it may make sense to create something a little more specialized, in cases where encoded size may be significant.

I will attempt to address some of the relative tradeoffs based on my own experiences, and experiences with various file-formats.

I will not address the matter of ASCII Text vs Binary file-formats here, and will assume the case where the data in question could not be viably represented in an ASCII format.

Similarly, I will focus more here on formats that can be processed "reasonably efficiently", rather than those focusing on maximizing compression. The primary use cases here are assumed to be for things like network communication and storage of game assets.

Note: I am not presently certain whether this is a "good" topic, or if it is presented adequately. Maybe it should be split up into multiple parts? (As-is, it is a little long).

# Byte-Oriented Formats

These are generally much simpler than bit-oriented formats.

In the vast majority of cases, a byte-oriented format is likely to be sufficient for general data storage needs, with bit-oriented formats reserved for cases where either the data is overly bulky or there is a significant need to minimize storage.

Generally, handling byte-oriented data tends to be faster than handling bit-oriented data.

Within the confines of bytes, there are still a few things that can be done to reduce storage requirements:
• Variable-length numbers;
• Encoding more items in fewer bytes;
• Flexibly representing data structures and allowing for future expansion;
• Allowing for seeking or resynchronization in stream-oriented formats.

Additionally, byte-oriented formats are more fundamental in the sense that most bit-oriented formats will still usually require some sort of byte-oriented packaging format (such as using a TLV format to hold things like headers and similar).

## Variable Length Numbers

One simple trick for saving bytes is based on a pattern:
Small numbers tend to be more common than big numbers.

So, rather than always using 2, 4, or 8 bytes for a number, it may be practical to use fewer bytes when the value is small enough to allow it.

A fairly common scheme is to have small values represented as a single byte, with the high-order bit(s) used to indicate when the value is larger than a single byte.

For Example:
0-127 -> 0xxxxxxx
128-16383 -> 10xxxxxx xxxxxxxx
16384-2M -> 110xxxxx xxxxxxxx xxxxxxxx
...

Another variant is using the high bit in each byte to indicate that the following byte contains more bits:
0-127 -> 0xxxxxxx
128-16383 -> 1xxxxxxx 0xxxxxxx
16384-2M -> 1xxxxxxx 1xxxxxxx 0xxxxxxx
...

Either way, it must be specified whether the bytes come in high-to-low or low-to-high order.

The advantage of the former scheme is that the size of the value can be seen directly from the first byte, which can be used for a more efficient decoder, though with the tradeoff that the encoding/decoding logic tends to be more complex.
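As a rough sketch (not from the article), an encoder/decoder pair for the first scheme might look like the following in C. The function names are made up for illustration, and only values up to about 2M (three bytes) are handled:

```c
#include <assert.h>
#include <stdint.h>

/* Encoder for the length-prefixed scheme: the run of leading 1-bits in
   the first byte tells the total size, bytes in high-to-low order. */
static int vln_encode(uint32_t val, uint8_t *buf)
{
    if (val < 128) {                      /* 0xxxxxxx */
        buf[0] = (uint8_t)val;
        return 1;
    }
    if (val < 16384) {                    /* 10xxxxxx xxxxxxxx */
        buf[0] = 0x80 | (uint8_t)(val >> 8);
        buf[1] = (uint8_t)val;
        return 2;
    }
    /* 110xxxxx xxxxxxxx xxxxxxxx, values up to 2M */
    buf[0] = 0xC0 | (uint8_t)(val >> 16);
    buf[1] = (uint8_t)(val >> 8);
    buf[2] = (uint8_t)val;
    return 3;
}

/* Decoder: the first byte alone tells us how many bytes to read. */
static uint32_t vln_decode(const uint8_t *buf, int *len)
{
    if (!(buf[0] & 0x80)) { *len = 1; return buf[0]; }
    if (!(buf[0] & 0x40)) { *len = 2; return ((uint32_t)(buf[0] & 0x3F) << 8) | buf[1]; }
    *len = 3;
    return ((uint32_t)(buf[0] & 0x1F) << 16) | ((uint32_t)buf[1] << 8) | buf[2];
}
```

Note how the decoder never needs to loop: it branches once on the first byte, which is the efficiency advantage mentioned above.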

The advantage of the second is that the decoding logic is fairly simple, for example:

i=ReadByte(); val=0;
while(i&0x80)
{
    val=(val<<7)|(i&0x7F);
    i=ReadByte();
}
val=(val<<7)|i;


The encoding logic is usually a little more involved, but the level of complexity is typically fairly similar with either strategy (usually an if/else tree or a match/encode loop).
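For instance, a matching encoder for the continuation-bit scheme, written against the decode loop above, might look like this. The name `vlq_encode` and the 32-bit limit are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Encoder for the continuation-bit scheme, high-to-low group order
   (matching the decode loop): find the highest nonzero 7-bit group,
   then emit groups with bit 7 set on all but the last. */
static int vlq_encode(uint32_t val, uint8_t *buf)
{
    int shift = 28, n = 0;                /* up to 5 groups for 32 bits */
    while (shift > 0 && !((val >> shift) & 0x7F))
        shift -= 7;                       /* skip leading zero groups */
    while (shift > 0) {
        buf[n++] = 0x80 | ((val >> shift) & 0x7F);
        shift -= 7;
    }
    buf[n++] = val & 0x7F;                /* final byte: high bit clear */
    return n;
}
```

So 128 encodes as 0x81 0x00, which the article's decode loop turns back into (1&lt;&lt;7)|0 = 128.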

Another issue is how to best handle encoding signed numbers.

Typically, a few common options here are:
• Sign-extending the most significant bit of the number;
• Sign folding, mapping signed values onto unsigned equivalents.

Sign extension has the advantage that small positive signed and unsigned numbers look equivalent, but it requires dealing with the sign directly when encoding and decoding the numbers. This may in turn lead to a bit more logic where both signed and unsigned variations are needed (since the encoding and decoding logic may need to be repeated for each variation), or to less efficient encoding if we simply encode all unsigned numbers as if they were signed (only 0-63 may fit in a single byte, regardless of whether the value represents a signed or unsigned quantity).

Sign folding has the advantage that it allows reusing the same number encoding/decoding logic for both signed and unsigned numbers by mapping all signed integers to unsigned equivalents. For example, we may map things such that the unsigned quantities map to signed quantities as:
0, -1, 1, -2, 2, -3, 3, ...

This may be done fairly straightforwardly:

//to encode:
uval=(sval>=0)?(sval<<1):(((-sval)<<1)-1);
//to decode:
sval=(uval&1)?(-((uval+1)>>1)):(uval>>1);

//or, expressed alternatively:
if(sval>=0) { uval=sval<<1; } else { uval=((-sval)<<1)-1; }
if(uval&1) { sval=-((uval+1)>>1); } else { sval=uval>>1; }


The downside is, granted, that signed and unsigned quantities naturally have different representations.

A few downsides of variable-length numbers: they are typically slightly more expensive to encode and decode, and their value generally needs to be known before they are encoded. They are less amenable to the strategy of emitting a placeholder and patching the value later, which typically requires reserving space for the largest possible (or reasonable) value.

## Packing more values into less bytes

This is basically using one or more of these techniques:
• Using different bits within a byte or number to represent different values;
• Assigning special meanings to various value ranges, and switching to alternate representations based on these ranges;
• Optionally including or omitting things based on flags or value ranges.

A very simple and common example of this sort of thing is a "flags" field, where different bits are assigned meanings, and may be set or cleared, or potentially small multi-bit fields are encoded within a flags field.

A simple example of using some of these would be encoding a sparse series of typically small integers as bytes:
• High 4 bits give the number of preceding zeroes;
• Low 4 bits give the numeric value.

So, we may encode the series of numbers 1,0,0,0,3,0,0,0,0,0,9,... as 0x01,0x33,0x59,...

However, it may not escape notice that the maximum run-length and value-range is very limited, and we may also want to be able to encode the end of a list.

The end-of-list case is easy: 0x00. We simply disallow encoding a zero value with a run of zero preceding zeroes, freeing that byte to mark the end of the list.

The other case can be handled via reserving part of the value range for trap-encodings, for example: 0-14=value is encoded directly, 15=value is encoded as a variable-length number. Now, suddenly, we can moderately efficiently encode a sparse collection of arbitrary sized numbers, and without wasting space for long runs of zeroes.
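Putting the pieces together, a hypothetical encoder for this nibble scheme (with the 0x00 end-of-list byte and the value-15 trap) might look like this. For simplicity, it assumes runs of at most 15 zeroes and trap values small enough to fit the single-byte case of a variable-length number:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the nibble scheme above: high 4 bits = count of preceding
   zeroes, low 4 bits = value, 0x00 = end of list, value nibble 15 =
   trap (the real value follows as a variable-length number; only the
   single-byte case, values under 128, is handled here).
   Assumes runs of at most 15 zeroes between nonzero values. */
static int sparse_encode(const uint32_t *vals, int n, uint8_t *out)
{
    int i, o = 0, zeroes = 0;
    for (i = 0; i < n; i++) {
        if (vals[i] == 0) { zeroes++; continue; }
        if (vals[i] < 15) {
            out[o++] = (uint8_t)((zeroes << 4) | vals[i]);
        } else {
            out[o++] = (uint8_t)((zeroes << 4) | 15);   /* trap encoding */
            out[o++] = (uint8_t)vals[i];                /* single-byte VLN */
        }
        zeroes = 0;
    }
    out[o++] = 0x00;                                    /* end of list */
    return o;
}
```

Running this over the series 1,0,0,0,3,0,0,0,0,0,9 reproduces the bytes 0x01,0x33,0x59 from the example above, plus a terminating 0x00.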

Another possible variant is that rather than the high bits giving zeroes, the bits indicate how many times to repeat a given value. This will slightly increase the cost of runs of zero (since they are now an explicit value, and we also lose the nice EOL special case), however, this will make runs of most other values significantly cheaper.

For example: 1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,0,0,4,63,63,63,...
Could be encoded as: 0x61,0x32,0x43,0x10,0x04,0x2F,0x3F,...

## Tagged Data and Escape Coding

Tagged Data is basically data where some form of tagging scheme is intermixed with the data. Commonly this is accomplished by designating one or more byte values as "special" and not allowing them to appear in normal data, or allowing them only via an escape coding.

For example, we could consider 0xFF to be special and not allow it within ordinary data (example: JPEG). Whenever an 0xFF would appear, we may instead replace it with, say, 0xFF followed by 0x00, with 0xFF followed by other values being used to indicate various markers or tags.

Often, these formats rely on the ability to, when in doubt, scan forwards to the next marker.

This escape-coding case comes at a cost though, namely in that it typically results in a slight expansion of the encoded data, and in some performance-sensitive applications, may incur additional costs related to detecting and handling the escape-byte. It may also potentially hinder the use of fixed-size structures or file offsets, or in some cases result in additional use of temporary buffers.
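A minimal sketch of the JPEG-style byte stuffing described above, where every 0xFF in the payload is emitted as 0xFF 0x00 so that any 0xFF followed by a nonzero byte can safely be interpreted as a marker (the function name is illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Escape-encode a payload: each 0xFF is followed by a stuffed 0x00,
   so 0xFF plus a nonzero byte is reserved for markers. Worst case,
   the output is twice the input size. */
static size_t escape_encode(const uint8_t *src, size_t n, uint8_t *dst)
{
    size_t i, o = 0;
    for (i = 0; i < n; i++) {
        dst[o++] = src[i];
        if (src[i] == 0xFF)
            dst[o++] = 0x00;    /* stuffed byte: "this is not a marker" */
    }
    return o;
}
```

This makes the expansion cost concrete: data full of 0xFF bytes doubles in size, while typical data grows only slightly.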

It is sometimes an option to relax this requirement though, such as relying on heuristic measures to identify markers (example: ZIP, BZ2). In these cases, usually the marker will be longer, and may include check-values in order to help reduce the chances of a false positive.

These types of formats may also allow for optional extension markers, where if such a marker is encountered and is not recognized, the marker's data will be skipped over.

## TLV Formats

TLV formats are another common strategy, where TLV stands for Tag Length Value. The Tag is a value that indicates what type of data is contained, the Length tells how big it is, and the Value is all the data for the tag. This construction is often referred to as a Lump or Chunk.

TLV formats often have an advantage in that they can be fairly readily extended without breaking existing implementations, and can easily hold a wide range of data. Very often, TLV formats are used for building the packaging for another file-format.

Depending on the format, unknown lumps may either be ignored or may potentially result in the data being rejected, and some formats will indicate whether or not the decoder should reject unknown lumps on a lump-by-lump basis.

A common example of a TLV format is RIFF, which is used to implement formats such as WAV and AVI (among others).
PNG and MKV are examples of formats which make use of TLV for their packaging (MKV being notable in that it uses variable-length encodings for both tags and lengths).

It is also possible to use both escape-coded markers and TLV, such as to allow for things like resynchronization.

For example, RIFF has a structure like:
tag:FOURCC, length:DWORD, value:BYTE[(length+1)&(~1)]
Where, some chunks (namely: 'RIFF' and 'LIST') may be nested, containing another FOURCC, followed by any number of other chunks. Most other chunks will simply contain data.
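As an illustration, walking the top-level chunks of a RIFF buffer in memory might look roughly like this. It assumes a little-endian host (RIFF lengths are little-endian) and applies the even-padding rule from the structure above:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Walk top-level RIFF-style chunks (FOURCC tag + DWORD length + data,
   padded to an even boundary), calling 'visit' for each one.
   Assumes a little-endian host for the memcpy of the length. */
static void riff_walk(const uint8_t *buf, uint32_t size,
                      void (*visit)(const char *tag, const uint8_t *data,
                                    uint32_t len))
{
    uint32_t pos = 0;
    while (pos + 8 <= size) {
        char tag[5] = {0};
        uint32_t len;
        memcpy(tag, buf + pos, 4);
        memcpy(&len, buf + pos + 4, 4);   /* little-endian host assumed */
        if (pos + 8 + len > size)
            break;                         /* truncated chunk: stop */
        visit(tag, buf + pos + 8, len);
        pos += 8 + ((len + 1) & ~1u);      /* pad data to even length */
    }
}
```

A real reader would additionally recurse into 'RIFF' and 'LIST' chunks, whose payload begins with another FOURCC followed by nested chunks.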

Another possible TLV format is:
0xFF, 0xEC, length:WORD, tag:FOURCC, value:BYTE[length-6]
Where FF EC is a magic escape-sequence, and the data uses an escape-coding scheme.

We might use an escape-case for the length, say:
0xFF, 0xEC, 0:WORD, tag:FOURCC, length:DWORD, value:BYTE[length-10]

Or, alternately, allow the tag to be 0, at which point lumps are limited to 64k but are appended together to encode a larger payload.

A few other possible variants:
length:WORD, tag:TWOCC, value:BYTE[length-4]
Where chunk headers are a little more compact.

tag:VLI, length:VLI, value:BYTE[length]
Where tag and length are variable-length. This allows very compact headers in some cases, at the cost that they are more difficult to work with.

Generally, random access to data is achieved via the use of some sort of index structure, which may give the offset of each item within the file.

For example, in a video container format, we might encode the offset of every frame in the video file (such as AVI), allowing a video player to easily jump to any position in the file as is desired.

Something like a region-file in a voxel-engine might instead have an index table holding the offset of every chunk in the region, allowing those parts of the world to be decoded as they come into view.

In other cases, we may rely on the size of the data to calculate where to look. An example is in an array of fixed-size structures, where one can easily calculate where each member is, making an explicit index unnecessary.

Another example is a CBR (constant bitrate) format. In this case, we may have a format, such as for an audio or video file, and may calculate the offset within the file based on its time-value. As with the array, if every frame (or group of frames) is encoded within a fixed-size box, then we can easily calculate where the frame is based on the desired time-value and a known reference point.
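The offset calculation for such a fixed-size-box layout is simple arithmetic; the function and parameter names here are illustrative, not part of any real format:

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of the frame box covering a given time, for a stream of
   fixed-size frame boxes following a fixed-size header. */
static uint64_t cbr_offset(double time_sec, double frame_rate,
                           uint64_t header_size, uint64_t frame_box_size)
{
    uint64_t frame = (uint64_t)(time_sec * frame_rate); /* frame index */
    return header_size + frame * frame_box_size;
}
```

For example, seeking to 2.0 seconds at 25 frames/sec with 4096-byte boxes and a 64-byte header lands at byte 64 + 50*4096.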

In either case, to allow for such random access, it is important that the data can be decoded without reliance on preceding data. This may come at a (typically minor) cost to the compression efficiency, since data within the preceding chunk can't be used by the chunk in question, and as a result will need to be repeated.

## Packages with Named Members

Sometimes, it may also make sense for a format to use names or keys to refer to data members.

Historical examples of these include the Doom WAD and Quake WAD2 formats, which used a flat directory of lumps with 8 or 16 character names.

A related use-case is the Quake PACK format, which was structurally similar to a WAD file, with the primary practical difference of having a much larger name field (56 characters), and storing a file-path (rather than a simple lump name).

Some formats also use a ZIP file to package up their contents, which has the advantage of being a fairly standardized container supporting both named members and Deflate. Examples: ODT, DOCX, and ORA (OpenRaster).

The advantage is that members can be referred to dynamically via names or paths, which can be useful when storing data that is naturally accessed by a name or path.

Sometimes it may also make sense to combine the use of named members with a TLV format, where certain lumps or chunks may be named, rather than by using an explicit directory. This may make sense when the number of lumps is small and the data is all closely related.

## Data Compression via Deflate

This has been moved mostly to a new draft.

One simple and common option I will mention here is that of using something like Deflate (for example, via the common zlib library) to compress the data. This can significantly reduce the size for many types of data, and the data can be decoded reasonably quickly.

Many file formats (including, for example, PNG) have used this strategy with good success.

# Conclusion

Designing a file format can be useful but, like anything, has pitfalls: the time and effort involved, or ending up with a format that no other software can read or write, ... So, it may make sense to balance these concerns against the needs of the project.

Most of these topics can be found in more detail elsewhere, such as on Wikipedia.

For those who want to know in-depth how various file formats work, it might be interesting to find and read the various file-format specifications.

Also note that file-format specifications may often be a source of nifty tips or tricks, or possible design ideas, ...

While this is far from a complete (or even adequate) coverage of the topic, maybe it is enough to hopefully at least be informative.

## Article Update Log

27 Aug 2014: Writing Article
28 Aug 2014: Still writing Article
11 Sep 2014: Split article, data compression moved to new draft.

Brendan G Bohannon is a hobbyist programmer with experience in areas including 3D engine and tools development, compilers/interpreter/VM development, digital signal processing, data compression, ...

I only read the first few paragraphs and skimmed over the article (but I've bookmarked this for further reading) and I noticed it seems a little chaotic at first glance. Maybe you could improve a little on the formatting, such as putting lists or the like, this should make the article more readable as well.

These are only formal complaints though - the info provided here seems like a very helpful overview of available (compressed) binary formats. Will definitely have a go at some of them!

it is mostly broken up by topic. I ended up also trying to conserve vertical space some as it was getting longer than ideal, and after writing this I ended up having some doubts about the contents and presentation, ...

may look into the formatting issue though.

I like what you wrote here. Very interesting to get ideas of how to start designing a file format. I'm fairly new to this topic and the article is worth a lot to me because I now have a not too detailed overview and some keywords I can google to get deeper into the stuff.

Thanks for that!

As with most things I read from cr88192, the dude tries to be accurate and ends up writing something which I cannot understand. This has been a recurring pattern and, as I said, I begin to be seriously worried by this guy.

But to the topic.

I cannot understand the whole point. I seriously hope it's not saving a few % of disk footprint.

I am also worried by the fact that the article seems to go into great detail about how to compress. It does not seem to spend as much attention on why to get the data there in the first place.

This is a recurring pattern even in the BGB engine documentation. The documentation apparently covers all the details, yet I cannot infer a strong connection to the real-world problem.

I am also scared by the fact this article still contains ellipsis.

As such, I'm currently marking this for more work.

there are reasons for why a person would do this stuff.

not everyone needs to use most of it (or, for that matter, to create any of their own file-formats), and not for every format, but a lot of this stuff does tend to matter, sometimes.

a lot of it is a general attempt at a summary of the topic matter though.

but, yes, a person wouldn't use most of this in most cases (nor probably in any single file-format either, for that matter).

there are cases where compression matters though, for example, things like YouTube, and practical real-time video storage and playback in general, are only really possible due to the powers of (fairly aggressive) data compression. the raw video would be far too large otherwise.

and, for many other use-cases, it doesn't matter so much.

This article contains a lot of great information, but it loses itself in the overwhelming number of topics and would be better suited being at least broken into two articles (one on format, and another on compression)

In particular I think the binary format article would be the most valuable, as understanding how to organize data in binary, and the pros and cons of various patterns, can help developers understand why certain designs are more desirable than others for a given set of required features.


yeah, seems fair enough.

I probably would have done it this way, except I ended up hitting "publish" probably a little earlier than ideal (I was getting rushed by other things at the time, and didn't think about it enough), and was like "yeah, probably too late now" (but, AFAICT, there is no "oh wait, nevermind, give me more time to work on it" button, so most changes after that were fairly minor editing/clarification changes).

not sure how drastic of changes are OK at this stage, like just yanking out the whole section on data compression and putting it into a new draft?...


In this case you would email/message me to revert it back to draft status.

If you yank out a section, that's fine as long as it's something no one would obviously miss reading the rest of the article. Seems like a separate but related topic so pulling it would be fine IMO


Ok, most of the stuff related to data compression (and bit-oriented formats) has been moved to its own article.

Did still leave mention of Deflate, as this option is still fairly simple/useful, and does not involve all the usual complexities of writing a compressor.

I like your article, but I would suggest splitting it into parts. A natural split (for me) would be general compression techniques, which could either encompass both byte and bit-based systems or be split into two articles again. And then a separate article on usages for more specific file formats, e.g. tagged systems, random access, etc.

It may be worthwhile covering compression techniques such as Huffman, LZW, static dictionaries, etc. And it may be worth covering other topics like container formats and file format versioning. But it really depends if it's relevant to what you want to say.

My other suggestion is to try to make use of graphics or link out to references for lengthy topics. It has a bit of a wall of text feeling.

Thanks for the hard work!
