Binary vs Text file formats

Started by
13 comments, last by ApochPiQ 17 years, 7 months ago
Recently I've been doing a lot of work creating file formats for my engine, but everywhere around me I see people moving over to text-based formats, particularly XML and XML-esque structures. I've never understood this myself. While text-based formats kick ass for debugging (and in some cases extensibility), I just find text formats to be too slow (and large) to deal with in a commercial environment.

The main problems I see with text-based formats are as follows:

- Loading times: In a binary format, you can read a primitive value directly into memory, with at most a file read, an assignment and an endian swap if required. With text-based formats, you need to convert the text to the required type, which is a comparatively expensive operation. The number of digits you require can take its toll as well: in binary you'll normally be reading 4 or 8 bytes regardless, whilst with a text-based format you'll be reading a variable amount depending on the size of the value. You also have to deal with skipping over whitespace, parsing block headers and the like. This can take a substantial toll on loading times, and now that streaming is becoming the norm (which requires lightning-fast reads to be effective), it seems like a step back.

- File size: Pretty much the same as above. Storing numbers as text generally takes up more space than if you just wrote out the value directly. You also have to deal with plain-text headers, whitespace and syntax characters. For a comparison, I've got a dual binary/text file format, and exporting the same model in binary was about half the size of the (incredibly vanilla) text-based alternative. While hard drives and storage media are getting a whole lot bigger, a saving of ~25-50% is certainly nothing to be scoffed at.

- Accuracy: Floating-point numbers can be a real pain, but at least with a binary format you'll always get out the exact same value that was put in. With text-based formats, you'll probably have to limit the number of digits you write out so you don't have to read in 30 characters for a tiny value. Whilst the difference between 0.000000005 and 0.00000001 is pretty damned minimal, it can have an effect on objects with a particularly small scale or where extreme accuracy is required (very, very rare, but I have seen it happen).

Although that's not to say I don't recognise the benefits of text-based formats:

- Easy to debug: Debugging file formats can be a total pain in the ass, especially when a lot of floating-point numbers are involved (converting hex to IEEE in your head is something I doubt many people have mastered). Text formats are a lot easier to verify and to change during testing.

- Extensibility: While it's not an implicit property of text-based file formats, their nature promotes a more ordered structure where objects can be identified by unique strings, which allows programmers to add new elements and information without breaking the format.

- Community-friendly: For games that thrive on active mod communities (such as Doom 3), a text-based format allows even fairly inexperienced programmers to write importers/exporters for their favorite programs with reasonably little effort. Binary formats, on the other hand, can be pretty much illegible without a Rosetta stone.

So text-based formats certainly have their uses, but I don't believe they're the way forward for all formats. Config files should always be text (unless of course the system is exceptionally tight on memory/storage) so that non-programmers can tweak the game settings easily, and users can customise the game to their liking. 3D file formats, level files and animations, on the other hand, have little to gain from moving to a text-only approach. They give larger file sizes and increased loading times, and the main benefit of a text-based approach (easy editing) is fairly moot, as it's incredibly hard for the human mind to understand a complex object given only numerical coordinates, let alone edit it with any real comprehension of what it's doing.

So what are your thoughts on this? Do you believe the push for text-based formats is a good thing? If so, why?
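To make the loading, size and accuracy points concrete, here is a minimal Python sketch. The sample values, the six-decimal text layout and the little-endian 32-bit float layout are all assumptions chosen for illustration, not the poster's actual format.

```python
import struct

# Hypothetical vertex data; any list of floats would do.
values = [0.1, 1.0 / 3.0, 12345.678, 5e-9]

# Binary: each value is a 4-byte little-endian IEEE 754 single (16 bytes total).
binary_blob = struct.pack("<%df" % len(values), *values)

# Text: values written with six decimal places, separated by spaces.
text_blob = " ".join("%.6f" % v for v in values).encode("ascii")

# Reading back: binary is one unpack call, text needs a split and a parse per value.
from_binary = list(struct.unpack("<%df" % len(values), binary_blob))
from_text = [float(tok) for tok in text_blob.split()]

print(len(binary_blob), len(text_blob))
print(from_binary[3], from_text[3])  # the tiny value survives in binary only
```

With a fixed number of decimal places the text version is both larger and loses the smallest value entirely, which is the situation described in the accuracy bullet. A different text layout (for example, more significant digits) changes that trade-off.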
Well, text based formats are certainly way, way better for debugging, hand edits, etc. But at runtime or at least for production code, binary formats should be in use. So the ideal case is that your content build compiles the text formats into a binary format and packages that. Your game then loads the binary format. (Optionally your game can load the text format directly if necessary.)

Since most people (most hobbyists, anyway) don't bother with proper content build pipelines, they simply lack the step that converts to a binary representation and simply stick to text. That's my theory, anyway.

Binary formats rock at runtime.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Quote:Original post by Promit
Well, text based formats are certainly way, way better for debugging, hand edits, etc. But at runtime or at least for production code, binary formats should be in use. So the ideal case is that your content build compiles the text formats into a binary format and packages that. Your game then loads the binary format.

Agreed, all text formats should be able to be compiled into a binary version. I personally see no benefit to XML unless you need to share text data with outside sources. Otherwise a script (or even an ini file) format can easily be made into a superior alternative.

Text for debugging, binary for release.
Programming since 1995.
Personally, I think the move to XML is a mistake, because XML is _way_ too verbose. If you want a text-based format, use something based on S-expressions (as used by Lisp). It's just as flexible, but it doesn't require repeating everything for an end tag, and if you make your scripting language also use S-expressions, all your data can be stored as regular scripts (that possibly have access to different native functions).

Personally, I use binary formats for most things simply because they're easier to create with existing editors for images, models, levels, etc.

As far as community friendly, editing files manually is not at all community friendly, whatever information they store. You really would be better served by creating editors for things that don't already have editors.

For extensibility, binary formats can be just as extensible by using some kind of 'tagged format' somewhat like the TIFF image format or WAV audio format.
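As a rough illustration of the tagged idea, here is a minimal Python sketch. The chunk layout (4-byte ASCII tag, 4-byte little-endian length, payload) and the tag names are assumptions for the example; real RIFF/WAV files add details such as padding chunks to even lengths, which this sketch leaves out.

```python
import io
import struct

def write_chunk(stream, tag, payload):
    """Write one chunk: 4-byte ASCII tag, 4-byte little-endian length, payload."""
    assert len(tag) == 4
    stream.write(tag)
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)

def read_chunks(stream, known_tags):
    """Return {tag: payload} for known tags, skipping chunks we do not recognise."""
    chunks = {}
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break  # end of data
        tag, length = struct.unpack("<4sI", header)
        if tag in known_tags:
            chunks[tag] = stream.read(length)
        else:
            stream.seek(length, io.SEEK_CUR)  # unknown chunk: jump over it
    return chunks

# A newer exporter adds an "XTRA" chunk; an older reader simply ignores it.
buf = io.BytesIO()
write_chunk(buf, b"VERT", struct.pack("<3f", 0.0, 1.0, 2.0))
write_chunk(buf, b"XTRA", b"data added by a newer exporter")
write_chunk(buf, b"NAME", b"Screen1")
buf.seek(0)
old_reader = read_chunks(buf, {b"VERT", b"NAME"})
print(sorted(old_reader))
```

Because every chunk carries its own length, a reader can skip anything it does not understand, which is what gives a binary format the same kind of extensibility usually credited to text formats.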
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
Quote:Original post by Promit
So the ideal case is that your content build compiles the text formats into a binary format and packages that. Your game then loads the binary format. (Optionally your game can load the text format directly if necessary.)


A common improvement of that scheme is to load both text and binary formats (check for binary format first, then for text format), and to have the game output the binary equivalent of a text resource upon loading.

This has the advantage of allowing the game to output binary content highly compatible with its internal representation without code duplicates in an external compiler, and also allowing you to bundle the game with smaller, compressed text files that are turned into bulkier but faster-loading binary files when the game is first run (ideal when you want a small downloadable installer).
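A minimal Python sketch of that scheme follows. The ".bin" cache suffix, the whitespace-separated float text format and the count-prefixed binary layout are all hypothetical, and the sketch does not check whether the text file is newer than the cached binary.

```python
import os
import struct
import tempfile

def parse_text(path):
    """Parse a whitespace-separated list of floats (the hypothetical text format)."""
    with open(path, "r") as f:
        return [float(tok) for tok in f.read().split()]

def load_resource(text_path):
    """Load the cached binary if present, else parse the text and write the cache."""
    bin_path = text_path + ".bin"
    if os.path.exists(bin_path):
        with open(bin_path, "rb") as f:
            (count,) = struct.unpack("<I", f.read(4))
            values = list(struct.unpack("<%dd" % count, f.read(8 * count)))
        return values, "binary"
    values = parse_text(text_path)
    with open(bin_path, "wb") as f:
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack("<%dd" % len(values), *values))
    return values, "text"

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.txt")
    with open(path, "w") as f:
        f.write("0.5 1.25 -3.0")
    first = load_resource(path)   # parses the text and writes model.txt.bin
    second = load_resource(path)  # finds the cache and reads the binary

print(first, second)
```

In a real engine you would probably also compare timestamps or a version field so that an edited text file invalidates a stale binary.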

Quote:
- Accuracy: Floating-point numbers can be a real pain, but at least with a binary format you'll always get out the exact same value that was put in. With text-based formats, you'll probably have to limit the number of digits you write out so you don't have to read in 30 characters for a tiny value. Whilst the difference between 0.000000005 and 0.00000001 is pretty damned minimal, it can have an effect on objects with a particularly small scale or where extreme accuracy is required (very, very rare, but I have seen it happen).


Please read what you wrote; it is totally wrong. You don't seem to understand how floating-point numbers work! Every number in memory can be represented by a text description. The 0.000000005 tolerance does NOT come from the conversion; it is what is actually stored in the data.
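A small Python sketch of the point, assuming 32-bit IEEE 754 floats: writing 9 significant decimal digits is enough to get the identical single-precision value back (17 digits do the same for doubles), while writing too few digits is what loses information. The helper name to_float32 is made up for the example.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

stored = to_float32(1.0 / 3.0)    # what a 32-bit float actually holds for 1/3

exact_text = "%.9g" % stored      # 9 significant digits
round_trip = to_float32(float(exact_text))

lossy_text = "%.3g" % stored      # too few digits: information is dropped
lossy = to_float32(float(lossy_text))

print(exact_text, round_trip == stored)
print(lossy_text, lossy == stored)

# For doubles, Python's repr() already emits enough digits to round-trip.
assert float(repr(0.1)) == 0.1
```

So any loss comes from choosing to write fewer digits, not from the text representation as such.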


You just need to know when to use text files and when not to. If the file will be updated and tweaked often, use text; if it is a 3D model or texture, use binary. It is as simple as that. And try not to use XML, because it is overkill even for what it was originally intended for.

Projects: Top Down City: http://mathpudding.com/

As long as you only use attributes in an element, you don't need an end tag, so you can make it quite small actually. I'm doing this for my own project file format and find it about as compact as you can get without losing the self-describing nature of the format.
<Nodes><Node Type="Screen" Name="Screen1" Viewport="0 0 1 1" /><Node Type="RenderObject" Name="RenderObject1" Col="1 1 1 1" Pos="0 0 0" Rot="0 0 0" Scale="1 1 1" CamPos="0 0 -9" CamRot="0 0 0" CamFov="60" ScaleByAspect="false" /></Nodes>

It's still a lot more than a binary format, of course. What I would like to see is a binary form of XML, so you could use the exact same interface to your XML file and just exchange the file. I think they are working on that. Not sure how it's going, though.
Plane9 - Home of the free scene based music visualizer and screensaver
In most cases, I think it's desirable to sacrifice some performance in order to debug better.

Although it's only reading and writing performance that is lost ... if you want, you can make cache files out of your XML data, in order to load them faster the next time.

As for the numbers: saving them in hex makes them pretty fast to load into a program (at least faster than base 10).

And if HD speed is your bottleneck, try zlib.
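As a rough sketch of the zlib suggestion, here is what compressing a repetitive XML file looks like in Python. The scene content is made up for the example, and the actual ratio depends entirely on the data.

```python
import zlib

# A hypothetical, repetitive XML scene description.
node = '<Node Type="RenderObject" Name="Obj%d" Pos="0 0 0" Rot="0 0 0" Scale="1 1 1" />'
xml_text = ("<Nodes>" + "".join(node % i for i in range(200)) + "</Nodes>").encode("utf-8")

compressed = zlib.compress(xml_text, 9)   # level 9: smallest output, slowest to write
restored = zlib.decompress(compressed)

print(len(xml_text), len(compressed))
```

Verbose, repetitive markup tends to compress well, so less data comes off the disk, but the text still has to be parsed after decompression.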
What I don't like about XML is its verbosity and that it has too many ways of doing the same thing (attributes are not really needed; they're just sugar, complicating the syntax). Most data that you need to store in a game is structured as a tree, which XML can represent, but in a very verbose way. A much simpler language for expressing tree structures can easily be thought up, such as the following:
nodes(
    node(
        type:Screen;
        name:Screen1;
        viewport(
            from(x:0; y:0;)
            to(x:1; y:1;)
        )
    )
)
I.e. two types of node: one that can contain other nodes (the ones with parentheses) and one that can contain a string (the ones with the colon/semicolon). The result is much less syntactic noise and fewer special characters to escape (just one), making it easier to read, write and parse. The only functionality that is lost is the ability to do text markup easily, such as specifying that a range of text should be bold (since strings cannot contain nodes describing this).
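To show how little code such a format needs, here is a minimal recursive-descent parser sketch in Python. The grammar is inferred from the example above, and the nested-tuple output and the lack of any escaping for ';' inside values are assumptions of this sketch.

```python
def parse(text):
    """Parse 'name(children...)' and 'name:value;' nodes into nested tuples."""
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def read_name():
        nonlocal pos
        skip_ws()
        start = pos
        while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
            pos += 1
        if start == pos:
            raise ValueError("expected a node name at offset %d" % pos)
        return text[start:pos]

    def read_node():
        nonlocal pos
        name = read_name()
        skip_ws()
        if pos >= len(text):
            raise ValueError("unexpected end of input after %r" % name)
        if text[pos] == ":":                  # leaf node: name:value;
            end = text.index(";", pos)        # raises ValueError if ';' is missing
            value = text[pos + 1:end].strip()
            pos = end + 1
            return (name, value)
        if text[pos] == "(":                  # branch node: name(children)
            pos += 1
            children = []
            skip_ws()
            while pos < len(text) and text[pos] != ")":
                children.append(read_node())
                skip_ws()
            if pos >= len(text):
                raise ValueError("missing ')' for %r" % name)
            pos += 1
            return (name, children)
        raise ValueError("expected ':' or '(' after %r" % name)

    node = read_node()
    skip_ws()
    if pos != len(text):
        raise ValueError("trailing input at offset %d" % pos)
    return node

tree = parse("nodes( node( type:Screen; name:Screen1; viewport( from(x:0; y:0;) to(x:1; y:1;) ) ) )")
print(tree)
```

Leaf values come back as strings; converting them to numbers is left to whatever code consumes the tree.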

ronnybrendel: Saving floating point values in hexadecimal kinda defeats the purpose of XML being easy to read. And besides I don't think the processing time required to parse a number means anything compared to the disk access.
In case you were wondering what to put in your next christian game; don't ask me, because I'm an atheist, an infidel, and all in all an antitheist. If that doesn't bother you, visit my site that has all kinds of small utilities and widgets.
Quote:Original post by Joakim_ar
ronnybrendel: Saving floating point values in hexadecimal kinda defeats the purpose of XML being easy to read. And besides I don't think the processing time required to parse a number means anything compared to the disk access.


You're right. For integers it is a bit faster, and it requires less space in most cases.

But you're flexible with the size of your number, e.g. 1-8 bytes for a 32-bit value ... I mean it could take up less space than binary (so you could say that loading a full-width binary integer may be as slow or as fast as loading a hex value, depending on the number itself), and you can read it with your own eyes ;)

