
Binary vs Text file formats


14 replies to this topic

#1 PlayfulPuppy   Members   -  Reputation: 419


Posted 09 September 2006 - 08:03 PM

Recently I've been doing a lot of work creating file formats for my engine, but everywhere around me I see people moving over to text-based formats, particularly XML and XML-esque structures.

I've never understood this myself. While text-based formats kick ass for debugging (and in some cases extensibility), I just find text formats too slow (and too large) to deal with in a commercial environment.

The main problems I see with text-based formats are as follows:

- Loading times: In a binary format, you can read a primitive value directly into memory with at most a file read, an assignment and an endian swap if required. With text-based formats, you need to convert the text to the required type, which is a comparatively expensive operation. The number of digits takes its toll as well: in binary you'll normally be reading 4 or 8 bytes regardless, whilst in a text-based format you'll be reading a variable amount depending on the size of the value. You also have to deal with skipping over whitespace, parsing block headers and the like. This can take a substantial toll on loading times, and now that streaming is becoming the norm (which requires lightning-fast reads to be effective), it seems like a step back.

- File size: Pretty much the same as above; storing numbers as text generally takes up more space than writing out the value directly. You also have to deal with plain-text headers, whitespace and syntax characters. For comparison, I've got a dual binary/text file format, and exporting the same model in binary gave a file about half the size of the (incredibly vanilla) text-based alternative. While hard drives and storage media are getting a whole lot bigger, a saving of ~25-50% is certainly nothing to be scoffed at.

- Accuracy: Floating-point numbers can be a real pain, but at least with a binary format you'll always get out the exact same value that was put in. With text-based formats, you'll probably have to limit the number of digits you write out so you don't have to read in 30 characters for a tiny value. Whilst the difference between 0.000000005 and 0.00000001 is pretty damned minimal, it can have an effect on objects with a particularly small scale or where extreme accuracy is required (very, very rare, but I have seen it happen).

That's not to say I don't recognise the benefits of text-based formats:

- Easy to debug: Debugging file formats can be a total pain in the ass, especially when a lot of floating-point numbers are involved (converting hex to IEEE in your head is something I doubt many people have mastered). Text formats are a lot easier to verify, and to change during testing.

- Extensibility: While it's not an inherent property of text-based file formats, their nature promotes a more ordered structure where objects are identified by unique strings, which allows programmers to add new elements and information without breaking the format.

- Community-friendly: For games that thrive on active mod communities (such as Doom 3), a text-based format allows even fairly inexperienced programmers to write importers/exporters for their favorite programs with reasonably little effort. Binary formats, on the other hand, can be pretty much illegible without a Rosetta stone.

So text-based formats certainly have their uses, but I don't believe they're the way forward for all formats. Config files should always be text (unless of course the system is exceptionally tight on memory/storage) so that non-programmers can tweak the game settings easily, and users can customise the game to their liking.

3D file formats, level files and animations, on the other hand, have little to gain from moving to a text-only approach. They give larger file sizes and increased loading times, and the main benefit of a text-based approach (easy editing) is fairly moot, as it's incredibly hard for the human mind to understand a complex object given only numerical coordinates, let alone edit it with any real comprehension of what it's doing.

So what are your thoughts on this? Do you believe the push for text-based formats is a good thing? If so, why?
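To make the size and parsing points concrete, here is a minimal Python sketch (not the poster's format). The vertex values, the little-endian layout and the choice of 9 significant digits in the text form are all assumptions made for illustration.

```python
import struct

# Hypothetical vertex data: three (x, y, z) positions.
vertices = [(0.125, -1.5, 3.0), (1024.75, 0.0009765625, -2.25), (0.1, 0.2, 0.3)]

def to_binary(verts):
    """Pack a vertex count followed by little-endian 32-bit floats."""
    out = struct.pack("<I", len(verts))
    for v in verts:
        out += struct.pack("<3f", *v)
    return out

def from_binary(data):
    """Read the count, then unpack every vertex straight from the bytes."""
    (count,) = struct.unpack_from("<I", data, 0)
    return [struct.unpack_from("<3f", data, 4 + 12 * i) for i in range(count)]

def to_text(verts):
    """One vertex per line, 9 significant digits per component."""
    return "\n".join(" ".join(f"{c:.9g}" for c in v) for v in verts)

def from_text(text):
    """Split on whitespace and convert each token with float()."""
    return [tuple(float(tok) for tok in line.split()) for line in text.splitlines()]

binary = to_binary(vertices)
text = to_text(from_binary(binary))  # text form of the values actually stored
print(len(binary), "bytes as binary")
print(len(text.encode("ascii")), "bytes as text")
```

The binary size is fixed by the record layout, while the text size depends on how many digits each value happens to need. The text reader also has to tokenize and convert every number, which is the extra work described above.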


#2 Promit   Moderators   -  Reputation: 6777


Posted 09 September 2006 - 08:06 PM

Well, text based formats are certainly way, way better for debugging, hand edits, etc. But at runtime or at least for production code, binary formats should be in use. So the ideal case is that your content build compiles the text formats into a binary format and packages that. Your game then loads the binary format. (Optionally your game can load the text format directly if necessary.)

Since most people (most hobbyists, anyway) don't bother with proper content build pipelines, they simply lack the step that converts to a binary representation and simply stick to text. That's my theory, anyway.

Binary formats rock at runtime.
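A content build step of this kind can be sketched in a few lines of Python. Everything here is hypothetical: the `name = value` source syntax, the `PRM1` signature and the record layout are invented for the example, not taken from any real pipeline.

```python
import struct

MAGIC = b"PRM1"  # hypothetical 4-byte file signature

def compile_params(source: str) -> bytes:
    """Build step: turn 'name = value' lines into a packed binary blob."""
    entries = []
    for line in source.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        entries.append((name.encode("utf-8"), float(value)))
    blob = MAGIC + struct.pack("<I", len(entries))
    for name, value in entries:
        # Each entry: 1-byte name length, the name, then a 32-bit float.
        blob += struct.pack("<B", len(name)) + name + struct.pack("<f", value)
    return blob

def load_params(blob: bytes) -> dict:
    """Runtime step: read the blob back without any text parsing."""
    if blob[:4] != MAGIC:
        raise ValueError("not a compiled parameter file")
    (count,) = struct.unpack_from("<I", blob, 4)
    offset, params = 8, {}
    for _ in range(count):
        (name_len,) = struct.unpack_from("<B", blob, offset)
        offset += 1
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        (params[name],) = struct.unpack_from("<f", blob, offset)
        offset += 4
    return params

source = """
# hand-edited source file
gravity = -9.5
jump_height = 2.25   # metres
"""
print(load_params(compile_params(source)))
```

Only `compile_params` needs to understand comments and whitespace; the game would ship with `load_params` alone.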

#3 T1Oracle   Members   -  Reputation: 100


Posted 09 September 2006 - 09:05 PM

Quote:
Original post by Promit
Well, text based formats are certainly way, way better for debugging, hand edits, etc. But at runtime or at least for production code, binary formats should be in use. So the ideal case is that your content build compiles the text formats into a binary format and packages that. Your game then loads the binary format.

Agreed, all text formats should be able to be compiled into a binary version. I personally see no benefit to XML unless you need to share text data with outside sources. Otherwise a script (or even an ini file) format can easily be made into a superior alternative.

Text for debugging, binary for release.
Programming since 1995.

#4 Extrarius   Members   -  Reputation: 1412


Posted 09 September 2006 - 09:20 PM

Personally, I think the move to XML is a mistake, because XML is _way_ too verbose. If you want a text-based format, use something based on S-expressions (used by Lisp). It's just as flexible, but it doesn't require repeating everything for an end tag, and if you make your scripting language also use S-expressions, all your data can be stored as regular scripts (that possibly have access to different native functions).

Personally, I use binary formats for most things simply because they're easier to create with existing editors for images, models, levels, etc.

As for being community-friendly, editing files manually is not at all community-friendly, whatever information they store. You really would be better served by creating editors for things that don't already have editors.

For extensibility, binary formats can be just as extensible by using some kind of 'tagged format' somewhat like the TIFF image format or WAV audio format.
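A minimal sketch of such a tagged layout in Python, loosely in the spirit of chunked formats but not the actual TIFF or WAV layout. The tag names `NAME`, `VERT` and `GLOW` are made up for the example.

```python
import struct

def write_chunk(tag: bytes, payload: bytes) -> bytes:
    """A chunk is a 4-byte tag, a 32-bit payload length, then the payload."""
    assert len(tag) == 4
    return tag + struct.pack("<I", len(payload)) + payload

def read_chunks(data: bytes):
    """Yield (tag, payload) pairs; the length field lets us step over anything."""
    offset = 0
    while offset < len(data):
        tag = data[offset:offset + 4]
        (size,) = struct.unpack_from("<I", data, offset + 4)
        yield tag, data[offset + 8:offset + 8 + size]
        offset += 8 + size

def load_model(data: bytes) -> dict:
    """An 'old' loader that only understands NAME and VERT chunks."""
    model = {}
    for tag, payload in read_chunks(data):
        if tag == b"NAME":
            model["name"] = payload.decode("utf-8")
        elif tag == b"VERT":
            model["verts"] = list(struct.iter_unpack("<3f", payload))
        # Unknown tags are ignored, so newer files still load.
    return model

# A 'new' exporter adds a GLOW chunk the old loader has never heard of.
blob = (write_chunk(b"NAME", b"crate")
        + write_chunk(b"GLOW", struct.pack("<f", 0.5))
        + write_chunk(b"VERT", struct.pack("<6f", 0.0, 0.0, 0.0, 1.0, 0.0, 0.0)))
print(load_model(blob))
```

Because every chunk carries its own length, a reader can skip data it does not recognise, which is what lets the format grow without re-exporting old assets.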

#5 ToohrVyk   Members   -  Reputation: 1591


Posted 09 September 2006 - 09:29 PM

Quote:
Original post by Promit
So the ideal case is that your content build compiles the text formats into a binary format and packages that. Your game then loads the binary format. (Optionally your game can load the text format directly if necessary.)


A common improvement of that scheme is to load both text and binary formats (check for binary format first, then for text format), and to have the game output the binary equivalent of a text resource upon loading.

This has the advantage of allowing the game to output binary content highly compatible with its internal representation without code duplicates in an external compiler, and also allowing you to bundle the game with smaller, compressed text files that are turned into bulkier but faster-loading binary files when the game is first run (ideal when you want a small downloadable installer).
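The "binary first, then text, then write the binary" lookup might look roughly like this in Python. The file names, the `.bin` suffix and the flat list-of-doubles payload are assumptions for the sketch.

```python
import os
import struct
import tempfile

def parse_text(text: str) -> list:
    """Slow path: parse whitespace-separated numbers from the text resource."""
    return [float(tok) for tok in text.split()]

def load_resource(text_path: str) -> list:
    """Prefer '<name>.bin'; otherwise parse the text file and write the cache."""
    bin_path = os.path.splitext(text_path)[0] + ".bin"
    if os.path.exists(bin_path):
        with open(bin_path, "rb") as f:
            data = f.read()
        return list(struct.unpack(f"<{len(data) // 8}d", data))
    with open(text_path, "r", encoding="utf-8") as f:
        values = parse_text(f.read())
    with open(bin_path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}d", *values))
    return values

def demo() -> tuple:
    """Load the same resource twice: first from text, then from the cache."""
    with tempfile.TemporaryDirectory() as tmp:
        text_path = os.path.join(tmp, "heights.txt")
        with open(text_path, "w", encoding="utf-8") as f:
            f.write("0.5 1.25 -3.0 0.1")
        first = load_resource(text_path)
        cached = os.path.exists(os.path.join(tmp, "heights.bin"))
        second = load_resource(text_path)
        return first, cached, second

print(demo())
```

This sketch never invalidates the cache; a fuller version would probably compare modification times so that an edited text file replaces a stale binary.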



#6 Delfi   Members   -  Reputation: 106


Posted 09 September 2006 - 09:42 PM

Quote:

- Accuracy: Floating-point numbers can be a real pain, but at least with a binary format you'll always get out the exact same value that was put in. With text-based formats, you'll probably have to limit the number of digits you write out so you don't have to read in 30 characters for a tiny value. Whilst the difference between 0.000000005 and 0.00000001 is pretty damned minimal, it can have an effect on objects with a particularly small scale or where extreme accuracy is required (very, very rare, but I have seen it happen).


Please read what you wrote; it is totally wrong. You don't seem to understand how floating-point numbers work! Every number in memory can be represented with a text description. The 0.000000005 tolerance is NOT from conversion; it is what is actually stored in the data.


You just need to know when to use text files and when not to. If the file will be updated and tweaked often, use text; if it is a 3D model or texture, use binary. It is as simple as that. And try not to use XML, because it is overkill even for what it was originally intended for.
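Whether a value survives a trip through text depends on how many digits are written. The Python sketch below checks this for 32-bit floats; the sample values and the two digit counts are chosen only for illustration. It relies on the known property that 9 significant decimal digits are enough to round-trip any finite IEEE 754 single.

```python
import struct

def f32_bits(x: float) -> bytes:
    """The 4 bytes a binary format would store for a 32-bit float."""
    return struct.pack("<f", x)

def as_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def survives(x: float, digits: int) -> bool:
    """True if printing with `digits` significant digits gives back the same bits."""
    stored = as_f32(x)
    text = f"{stored:.{digits}g}"
    return f32_bits(float(text)) == f32_bits(stored)

samples = [0.1, 1.0 / 3.0, 5e-9, 1e-8, 16777215.0, 3.1415927]
for digits in (6, 9):
    print(digits, [survives(x, digits) for x in samples])
```

So any loss comes from truncating the printed digits (or from a sloppy parser), not from text being unable to describe the stored value.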


Projects: Top Down City: http://mathpudding.com/

#7 Xetick   Members   -  Reputation: 243


Posted 09 September 2006 - 09:46 PM

As long as you only use attributes in an element, you don't need an end tag, so you can make it quite small actually. I'm doing this for my own project file format and find it about as compact as you can get without losing the self-describing nature that comes with it.

<Nodes>
<Node Type="Screen" Name="Screen1" Viewport="0 0 1 1" />
<Node Type="RenderObject" Name="RenderObject1" Col="1 1 1 1" Pos="0 0 0" Rot="0 0 0" Scale="1 1 1" CamPos="0 0 -9" CamRot="0 0 0" CamFov="60" ScaleByAspect="false" />
</Nodes>

It's still a lot more than a binary format, of course. What I would like to see is a binary format of XML, so you could use the exact same interface to your XML file and just exchange the file. I think they are working on that. Not sure how it's going, though.
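For reference, an attribute-only file like the one above is short to read with a stock parser. This Python sketch uses the standard `xml.etree.ElementTree` module; which attributes count as numeric vectors is a guess made for the example.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<Nodes>
  <Node Type="Screen" Name="Screen1" Viewport="0 0 1 1" />
  <Node Type="RenderObject" Name="RenderObject1" Col="1 1 1 1" Pos="0 0 0"
        Rot="0 0 0" Scale="1 1 1" CamPos="0 0 -9" CamRot="0 0 0"
        CamFov="60" ScaleByAspect="false" />
</Nodes>
"""

def floats(attr: str) -> tuple:
    """Turn a space-separated attribute such as "0 0 -9" into a tuple of floats."""
    return tuple(float(tok) for tok in attr.split())

def load_nodes(xml_text: str) -> dict:
    """Index every <Node> by its Name attribute, converting known numeric fields."""
    nodes = {}
    for elem in ET.fromstring(xml_text).iter("Node"):
        node = dict(elem.attrib)
        for key in ("Viewport", "Col", "Pos", "Rot", "Scale", "CamPos", "CamRot"):
            if key in node:
                node[key] = floats(node[key])
        nodes[node["Name"]] = node
    return nodes

nodes = load_nodes(DOCUMENT)
print(nodes["RenderObject1"]["CamPos"])
```

Note that the vectors still need a second, hand-written parsing pass, since XML itself only delivers attribute values as strings.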

#8 ronnybrendel   Members   -  Reputation: 100


Posted 09 September 2006 - 10:07 PM

In most cases, I think it's desirable to sacrifice some performance in order to debug better.

Although it's only reading and writing performance that is lost... if you want, you can make cache files out of your XML data in order to load them faster the next time.

As for the numbers: saving in hex makes it pretty fast to load into a program (at least faster than with base 10).

And if HD speed is your bottleneck, try zlib.
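What "saving in hex" could look like, sketched in Python. The post does not say which hex form is meant, so this shows three possibilities: raw float bits as hex digits, a hex float literal, and plain base-16 integers. No speed claim is implied by the example.

```python
import struct

def f32_to_hex(x: float) -> str:
    """Write the raw IEEE 754 bits of a 32-bit float as 8 hex digits."""
    return struct.pack(">f", x).hex()

def hex_to_f32(text: str) -> float:
    """Read 8 hex digits back into the float with exactly those bits."""
    return struct.unpack(">f", bytes.fromhex(text))[0]

print(f32_to_hex(1.0))            # raw-bits form
print((0.1).hex())                # hex float literal for a double
print(int("ff", 16), f"{255:x}")  # integers in base 16
```

All three forms are exact, since hex digits map directly onto bits, but the raw-bits form is about as unreadable as the binary it replaces.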

#9 Joakim_ar   Members   -  Reputation: 192


Posted 09 September 2006 - 10:16 PM

What I don't like about XML is its verbosity, and that it has too many ways of doing the same thing (attributes are not really needed; they're just sugar, complicating the syntax). Most data that you need to store in a game is structured as a tree, which XML can represent, but in a very verbose way. A much simpler language for expressing tree structures can easily be thought up, such as the following:

nodes(
    node(
        type:Screen;
        name:Screen1;
        viewport(
            from(x:0; y:0;)
            to(x:1; y:1;)
        )
    )
)

I.e. two types of node: one that can contain other nodes (the ones with parentheses) and one that can contain a string (the ones with the colon/semicolon). The result is much less syntactic noise and fewer (one) special characters to 'escape', making it easier to read, write and parse. The only functionality that is lost is the ability to do text markup easily, such as specifying that a range of text should be bold (since strings cannot contain nodes describing this).
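A parser for this syntax is indeed small. The Python sketch below is one possible reading of the grammar: names are assumed to be word characters, and leaf values are assumed never to contain `;`, so no escaping is implemented.

```python
import re

# Either 'name(' / 'name:' or a lone ')', with leading whitespace skipped.
TOKEN = re.compile(r"\s*(?:(\w+)\s*([(:])|(\)))")

def parse(text: str) -> list:
    """Parse the sketch syntax into nested (name, value) pairs.

    A branch node `name( ... )` becomes (name, [children]); a leaf
    `name:value;` becomes (name, "value").
    """
    stack = [[]]          # stack of child lists; stack[0] is the top level
    pos = 0
    while text[pos:].strip():
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at offset {pos}")
        name, opener, closer = match.groups()
        pos = match.end()
        if closer:                       # ')' closes the current branch
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        elif opener == "(":              # 'name(' opens a new branch
            children = []
            stack[-1].append((name, children))
            stack.append(children)
        else:                            # 'name:' reads text up to the next ';'
            end = text.index(";", pos)
            stack[-1].append((name, text[pos:end].strip()))
            pos = end + 1
    if len(stack) != 1:
        raise ValueError("missing ')'")
    return stack[0]

EXAMPLE = """
nodes(
    node(
        type:Screen;
        name:Screen1;
        viewport(
            from(x:0; y:0;)
            to(x:1; y:1;)
        )
    )
)
"""
tree = parse(EXAMPLE)
print(tree)
```

All leaf values come back as strings; converting `x:0` to a number is left to whoever consumes the tree.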

ronnybrendel: Saving floating point values in hexadecimal kinda defeats the purpose of XML being easy to read. And besides I don't think the processing time required to parse a number means anything compared to the disk access.

#10 ronnybrendel   Members   -  Reputation: 100


Posted 09 September 2006 - 10:30 PM

Quote:
Original post by Joakim_ar
ronnybrendel: Saving floating point values in hexadecimal kinda defeats the purpose of XML being easy to read. And besides I don't think the processing time required to parse a number means anything compared to the disk access.


You're right. For integers it is a bit faster, and it requires less space in most cases.

But you're flexible with the size of your number: 1 to 8 hex digits for a 32-bit value, for example. I mean it could take up less space than binary (so loading a full-width binary integer may be as slow or as fast as loading a hex value, depending on the number itself), and you can read it with your own eyes ;)

#11 Extrarius   Members   -  Reputation: 1412


Posted 09 September 2006 - 10:34 PM

Quote:
Original post by Xetick
As long as you only use attributes in an element, you don't need an end tag, so you can make it quite small actually. I'm doing this for my own project file format and find it about as compact as you can get without losing the self-describing nature that comes with it.[...]

You entirely lose the ability to nest elements, though, which you don't do if you use s-expression format:
(Root
    :Type "Dialog"
    :SpeechFiles "/speech/bob/chapter2/"
    (Dialog
        :Name "Bob"
        (Statement
            :File "Hello"
            :Text "Hello. How are you?"
        )
        (Choice
            :FileSet "HelloResponse"
            :TimeLimit 10
            :Default :Neutral
            '(
                :Good "Good. How are you?"
                :Neutral "Fine."
                :Evil "Shut Up!"
            )
        )
    )
)
and if you're using Lisp, Code Is Data Is Code, so you only have to write one scripting system.
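Outside of Lisp, a reader for data like this still fits in a page. The Python sketch below is a simplified reader, not a full Lisp reader: the `Symbol` class, the `properties` helper, and the treatment of `'x` as `(quote x)` are choices made for the example, and string escapes are left undecoded.

```python
import re

# One token: a paren or quote mark, a double-quoted string, or a bare atom.
TOKEN = re.compile(r"""\s*(?:([()'])|"((?:[^"\\]|\\.)*)"|([^\s()'"]+))""")
CLOSE = re.compile(r"\s*\)")

class Symbol(str):
    """A bare name such as Root or :Type, as opposed to a quoted string."""

def read(text: str, pos: int = 0):
    """Return (value, next_pos) for the S-expression that starts at pos."""
    match = TOKEN.match(text, pos)
    if not match:
        raise ValueError(f"unexpected input at offset {pos}")
    punct, string, atom = match.groups()
    pos = match.end()
    if punct == "(":
        items = []
        while not CLOSE.match(text, pos):
            item, pos = read(text, pos)
            items.append(item)
        return items, CLOSE.match(text, pos).end()
    if punct == "'":                     # 'x is read as (quote x)
        quoted, pos = read(text, pos)
        return [Symbol("quote"), quoted], pos
    if punct == ")":
        raise ValueError("unbalanced ')'")
    if string is not None:
        return string, pos
    for convert in (int, float):         # numbers first, symbols as a fallback
        try:
            return convert(atom), pos
        except ValueError:
            pass
    return Symbol(atom), pos

def properties(form: list) -> dict:
    """Collect the ':Key value' pairs that follow the head symbol of a form."""
    props, i = {}, 1
    while i + 1 < len(form):
        if isinstance(form[i], Symbol) and form[i].startswith(":"):
            props[form[i][1:]] = form[i + 1]
            i += 2
        else:
            i += 1
    return props

DIALOG = """
(Dialog
  :Name "Bob"
  (Choice
    :FileSet "HelloResponse"
    :TimeLimit 10
    :Default :Neutral
    '(:Good "Good. How are you?" :Neutral "Fine." :Evil "Shut Up!")))
"""
tree, _ = read(DIALOG)
choice = tree[3]
print(properties(choice))
```

Nested forms come back as nested lists, so the nesting that attribute-only XML gives up is kept here at no extra cost to the syntax.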

Quote:
Original post by Delfi
[...]Please read what you wrote; it is totally wrong. You don't seem to understand how floating-point numbers work! Every number in memory can be represented with a text description. The 0.000000005 tolerance is NOT from conversion; it is what is actually stored in the data.[...]

Don't forget that converting text to binary involves a lot of math (for floating-point numbers), and thus without a _LOT_ of effort, you will actually get less precision than is available in your floating-point format.

#12 Kambiz   Members   -  Reputation: 758


Posted 09 September 2006 - 10:42 PM

Read this:
POV-Ray : Documentation : 1.4.4.7 Why are triangle meshes in ASCII format?

"...meshes use floating point numbers.
It might come as a bit of surprise that it is far from easy to represent them in binary format so that they can be read in every possible system...
...In order to store floating point numbers so that they can be read in any system, you have to store them in an universal format. ASCII is as good as any other."

#13 PlayfulPuppy   Members   -  Reputation: 419


Posted 09 September 2006 - 11:09 PM

Quote:
Original post by Extrarius
As far as community friendly, editing files manually is not at all community friendly, whatever information they store. You really would be better served by creating editors for things that don't already have editors.


Well, what I should have pointed out is that it's a lazy form of community-friendly. It doesn't really require any effort from programmers or marketing, whilst putting out a toolset post-release (usually) requires some additional clean-up by the programmers, as internal-only dev tools are generally raw, ugly and error-prone. Then it normally needs a look-over by the legal department to make sure nothing incriminating is released with the dev tools (how many of you, even briefly for testing, have had an error dialog or log entry that contained profanity?).

It's more effort than most developers are willing to spend/publishers are willing to give, so most games either don't have toolsets released at all, or only have shitty little half-working plugins released. Text formats allow the community to take a stab at writing their own tools, which will probably get far more support than dev-released tools would.

(I'm not saying I agree with this practice, it's just the way it is a great deal of the time)

Quote:

For extensibility, binary formats can be just as extensible by using some kind of 'tagged format' somewhat like the TIFF image format or WAV audio format.


For sure, as a matter of fact one of the primary goals of my (Binary) file format is to be extensible without breaking the format (My previous format was really getting on my nerves as I had to modify the exporter, synchronise the loader and re-export ALL the assets whenever I wanted to add a tiny new feature).

#14 Xai   Crossbones+   -  Reputation: 1436


Posted 09 September 2006 - 11:23 PM

Also, there is a big difference between 100% internal data that you do not intend users, modders, or 3rd party tools to read or manipulate and data that might be customizable, etc.

And in favor of XML, for data that is editable by external sources, tool or human: ONE benefit is its readability, which can be thought of as a small amount of self-description... although it is also a minus, since the words can be poorly chosen and confuse or deceive the user. The real benefit comes when you create a DTD / schema. That way you can tell the world, in a 100% agreed-upon manner, a very fundamental description of your data's organization and rules. Sure, not everything that passes the schema is valid data (because schemas hold almost no business rules), but everything that fails the schema is invalid data. So it's a good starting point. And your modding tools can perform schema validation when loading or reading files, to let you know when you have broken anything.

Such checks can also be performed at game load, and in the case of my config files I embed a valid "default" file in the executable to be used in cases of invalid custom files. Obviously this is for small files you have few of, not large bodies of assets, which just have to be right or the program can't do whatever relies on them... but that's the case no matter what you use.
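The "validate at load, fall back to an embedded default" idea can be sketched in Python as follows. Python's standard library does not validate against a DTD or XML Schema, so the hand-written `RULES` table here is only a stand-in for a real schema, and the `Config`/`Video` element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Embedded fallback, standing in for a default file baked into the executable.
DEFAULT_CONFIG = '<Config><Video Width="1280" Height="720" Fullscreen="false" /></Config>'

# Hypothetical rules: element name -> {attribute: checker}.
RULES = {
    "Video": {
        "Width": lambda v: v.isdecimal() and int(v) > 0,
        "Height": lambda v: v.isdecimal() and int(v) > 0,
        "Fullscreen": lambda v: v in ("true", "false"),
    }
}

def validate(root: ET.Element) -> list:
    """Return a list of problems; an empty list means the rules are satisfied."""
    problems = []
    if root.tag != "Config":
        problems.append(f"root element is <{root.tag}>, expected <Config>")
    for name, attrs in RULES.items():
        elem = root.find(name)
        if elem is None:
            problems.append(f"missing <{name}>")
            continue
        for attr, ok in attrs.items():
            value = elem.get(attr)
            if value is None or not ok(value):
                problems.append(f"<{name}> has bad {attr}: {value!r}")
    return problems

def load_config(xml_text: str) -> ET.Element:
    """Use the custom file if it parses and validates, else the embedded default."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return ET.fromstring(DEFAULT_CONFIG)
    return root if not validate(root) else ET.fromstring(DEFAULT_CONFIG)

custom = '<Config><Video Width="-5" Height="600" Fullscreen="maybe" /></Config>'
print(validate(ET.fromstring(custom)))
print(load_config(custom).find("Video").get("Width"))
```

As the post says, passing such checks does not prove the data is sensible; it only catches files that are definitely broken.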

#15 ApochPiQ   Moderators   -  Reputation: 15190


Posted 09 September 2006 - 11:57 PM

This question should not be answered with "always" or "never" type responses. There are cases where text-based formats (even XML) are extremely advantageous, and cases where binary is simply the only acceptable medium.

The equivalent-binary scheme that Promit mentioned is in very widespread use, and pretty much solves all the issues involved. An even simpler method is to store all the data in a purely binary format, and create an editor tool that expands the binary data to a textual equivalent representation. This allows trivial hand-editing of data, without introducing a potential point of failure in the content build pipeline, and without requiring duplicated code (i.e. game code has to be able to read the textual format). We're currently migrating to that approach.
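An editor tool that expands binary data to text and packs it back could be sketched like this in Python. The record layout (an id plus three 32-bit floats) is an assumption for illustration and says nothing about the format the poster's team actually uses.

```python
import struct

RECORD = struct.Struct("<I3f")   # hypothetical record: id, x, y, z

def expand(blob: bytes) -> str:
    """Editor side: show each binary record as an editable line of text."""
    lines = []
    for ident, x, y, z in RECORD.iter_unpack(blob):
        # 9 significant digits are enough to restore a 32-bit float exactly.
        lines.append(f"{ident} {x:.9g} {y:.9g} {z:.9g}")
    return "\n".join(lines)

def collapse(text: str) -> bytes:
    """Editor side: pack the (possibly hand-edited) lines back into binary."""
    blob = b""
    for line in text.splitlines():
        ident, x, y, z = line.split()
        blob += RECORD.pack(int(ident), float(x), float(y), float(z))
    return blob

original = RECORD.pack(1, 0.1, 0.2, 0.3) + RECORD.pack(2, -1.5, 0.0, 1e-8)
as_text = expand(original)
print(as_text)
edited = collapse(as_text.replace("-1.5", "-2.5"))
print(RECORD.unpack_from(edited, RECORD.size))
```

Both functions live in the tool, so the game itself only ever needs the binary reader.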


There's a nice heuristic question that can be used to determine whether to take this route, or to use a format like XML: does an editor for this type of data already exist? I don't think this is by any means a hard-and-fast rule, but it'll handle the majority of the cases. (Note that I personally find sexprs to be more elegant than XML, but infinitely less practical, due to the lack of schema and editor support.)

Consider a real-world case: 3D models have plenty of existing editing tools. All of the decent modelling packages have plugin infrastructures that support custom file-format exporters. Therefore, using XML is not really useful here, since you can export from the editor directly to the needed binary format. (This gets subtly more complicated when including portability formats like Collada, but that's beyond the scope of discussion I think.)

On the flip side, consider a data format that defines animated cutscene/movie sequences to be rendered in-engine. There is no editor for such things, at least not one that is guaranteed to suit our requirements. This is where XML becomes an attractive option: XML parsing libraries are ubiquitous, so the cost of integration with the game code is going to be minimal; once the tree is loaded in-memory it can be trivially converted to any equivalent tree structure desired, meaning that there's no reason to worry about the XML format affecting the rest of your code architecture if you don't want it to; and a good schema combined with a good XML editor (like VS2005) gives you a validation tool and file format documentation all in the same place.

XML in this case truly helps eliminate duplication of information to a huge extent: since schemas can have embedded human-readable documentation, you don't need separate doc files; schemas instantly eliminate the need for sophisticated validation, and in the worst case additional runtime validation on the tree (post-loading) is trivial; and you've totally removed the need for a custom editor tool without precluding the introduction of one at a later date.


Use the right tool for the job; don't mangle the job so you can do it with one particular tool.



