# Why is XML all the rage now?


## Recommended Posts

Whatever happened to craftsmanship? Anyway, rant over.

I feel like I give this talk daily in my workplace. Sometimes people even listen.

##### Share on other sites
reworking to a slightly more concise syntax:

[obj type="someobject"
  [argument name="arg0" value="value"]
  [argument name="arg1" value="value"]
  [subobjects
    [obj type="subobjecttype"
      [argument name="arg0" value="value"]]
    [obj type="subobjecttype"
      [argument name="arg0" value="anothervalue"]]]]

so, yeah, not a huge difference...

a bigger saving (for performance) is eliminating things like free-floating text, omitting support for full namespaces, and adding support for explicit numeric values (this is essentially what my XML-based compiler AST formats did, though retaining the normal external syntax). also sometimes useful are options for encoding raw binary data (in ASCII form, generally dumped out as Base64 or a Base85 variant). a lot here depends on the exact in-memory node representation, ...
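as a quick illustration of the binary-embedding point, Python's standard base64 module can produce both encodings (the 16-byte blob below is just made-up stand-in data, not any of the formats mentioned here):

```python
import base64

blob = bytes(range(16))  # stand-in for some raw binary payload

b64 = base64.b64encode(blob).decode("ascii")  # 4 output chars per 3 input bytes
b85 = base64.b85encode(blob).decode("ascii")  # 5 output chars per 4 input bytes

# Base85 is the denser of the two for the same payload.
print(len(b64), len(b85))  # 24 20
```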

as for reducing size (via compression/serialization), there are a few options:
• XML+Deflate: relatively straightforward, and compresses fairly well, but is slightly more expensive to encode/decode;
• WBXML: basically works, but has some limitations, and results in bigger files than XML+Deflate;
• EXI: never got around to fully evaluating it; compresses well, but I found the spec difficult to make much sense of;
...
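the XML+Deflate option is easy to sketch with Python's zlib; repetitive markup compresses heavily (toy data below, not any real document format):

```python
import zlib

xml = ('<obj type="someobject">'
       + '<argument name="arg0" value="value"/>' * 50
       + '</obj>').encode("utf-8")

packed = zlib.compress(xml, 9)         # zlib-wrapped Deflate stream
assert zlib.decompress(packed) == xml  # lossless round trip
assert len(packed) < len(xml) // 2     # the markup redundancy mostly vanishes
```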

I had a few of my own variants, one example was SBXE, which was a "slightly improved" alternative to WBXML (slightly more compact, and more features).
SBXE+Deflate was generally slightly more compact than XML+Deflate, but the difference was fairly small.

another was related to the (never fully implemented or used) XML-coding mode of my "BSXRP" protocol, which would have used Huffman compression and VLC coding for values. (as-is, the protocol is mostly used for encoding S-Expression like data...).

both formats were loosely based (in concept) on LZP; in particular, the encoding tries to predict the following tag or attribute (based on recent history), allowing this case to be coded more efficiently (and without depending on the use of a schema). otherwise (should this prediction fail) there is the option of reusing a recently-coded value, or (as needed) explicitly encoding the tag or attribute name (as a string). SBXE used an LZP variant for strings, whereas BSXRP used LZ77 (and an otherwise Deflate-like data representation).
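the prediction idea can be sketched in a few lines; this toy encoder is only an illustration of the concept, not the actual SBXE/BSXRP wire format (it also ignores escaping, so a tag literally named "P" would collide):

```python
def encode_tags(tags):
    """Emit 'P' when the predicted tag matches, else the literal name."""
    pred, prev, out = {}, None, []
    for t in tags:
        out.append("P" if pred.get(prev) == t else t)
        pred[prev] = t  # learn: after `prev`, expect `t` next time
        prev = t
    return out

def decode_tags(tokens):
    """Inverse of encode_tags: resolve 'P' from the same running model."""
    pred, prev, out = {}, None, []
    for tok in tokens:
        t = pred[prev] if tok == "P" else tok
        out.append(t)
        pred[prev] = t
        prev = t
    return out

tags = ["obj", "argument", "obj", "argument", "obj", "argument"]
encoded = encode_tags(tags)
print(encoded)                        # ['obj', 'argument', 'obj', 'P', 'P', 'P']
assert decode_tags(encoded) == tags   # round trip
```

once the model has seen a transition once, every repeat of it costs a single predicted token instead of a spelled-out name, which is exactly the case that dominates in markup-heavy data.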

##### Share on other sites

And that kind of thinking is what results in modern computers feeling just as crap as early ones, even though they're thousands of times more powerful (or in the case of memory, millions of times). I know some stuff does indeed require more power, but this idea that we should waste resources just because we can is just plain stupid.

QFE.

My quad-core i7 should be able to launch Microsoft Word faster than a 386 in the mid-'90s. And yet... it takes 10x longer.

How much of that is due to picking inferior approaches just "because"?

But that's not because some guy in the content pipeline used a tool that uses XML to lay out a dialog or such. Even the fact that they use zip-compressed XML in their office documents now doesn't bog things down (except if you use LibreOffice, which for some reason totally stinks at importing these).

It's because Office first compiles a ton of C#, then loads three dozen libraries, half of which probably aren't needed at all, while it shows an animation that nobody wants to see, then connects to Live.Microsoft.com and Facebook, and whatnot, and because every single thing goes through 4 or 5 layers of legacy code and libraries.

And you know what? Nobody cares. Companies buy Office, and will continue to buy Office, so all is good. The next incarnation (Office 365) will be even worse when everything is "cloud only". And again, nobody will care, because "cloud is cool", and nobody wants to be less cool than everyone else.

It's like Windows 8, which is by all accounts much worse than Windows 7. Nobody cares. People will buy.

##### Share on other sites

The next incarnation (Office 365) will be even worse when everything is "cloud only".

Maybe. Google docs launches in 3.5 seconds for me, though...

##### Share on other sites

YAML is taking off? I have yet to see anything use it. JSON though, yes, it seems that lately everybody and their dog is using JSON. Probably because we're in full HTML5 mode, and JSON data is valid JavaScript, so using it is a no-brainer as you don't even need a parser (I wonder if anybody understands the implications of loading data as code, though).
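That implication is worth spelling out: evaluating JSON text as code executes whatever is in it, while a real parser only builds data. A small Python illustration (the sample object is made up for the example):

```python
import json

text = '{"name": "player", "hp": 100}'
data = json.loads(text)  # a parser only builds data structures
print(data["hp"])        # 100

# By contrast, feeding untrusted text to an evaluator runs it as code;
# e.g. in Python, eval('__import__("os").system("...")') would happily
# execute -- and JavaScript's eval() of raw JSON has the same problem.
```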

Personally, I prefer INI files anyway (well, INI-like at least). Yeah, call me old-fashioned, but they're a lot easier to deal with. XML is good when you need tree-style nesting, but most of the time you don't, really (and even then, those using XML more often than not abuse it, resulting in ridiculously complex formats for no real reason).

Added systems for localization AND fairly comprehensive theming in a single night for our next release... using INI files. Still need a good sit-down or two to link the remaining text, GUI elements, etc. to these systems, but the functionality itself is complete and works great.

Honestly, I've been paranoid that the design choice was naive or missing something critical because it was so darn simple and INI files are all but unheard of these days - but it works like a charm! Not to mention it takes like 30 lines of code to have a cross-platform INI parser.
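For reference, in Python you don't even need the 30 lines, since the standard library ships an INI reader (the section and key names below are made up for the example):

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[theme]
background = #202020
accent = #ff8800

[strings.en]
quit = Quit
""")

print(config["theme"]["accent"])     # #ff8800
print(config["strings.en"]["quit"])  # Quit
```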

I can definitely see the need to support human readable hierarchies in some cases, but it seems to happen way too often when it is unneeded.

I hold the firm belief that given time, the open-source world will achieve its ultimate goal of reducing every piece of software in the world down to operations on a key/value store (see the rise of plist, JSON, Lua, and NoSQL).

Then we can resurrect the INI file, and be done with it.

Haha.

##### Share on other sites
<obj type="someobject">
  <argument name="arg0" value="value"/>
  <argument name="arg1" value="value"/>
  <subobjects>
    <obj type="subobjecttype">
      <argument name="arg0" value="value"/>
    </obj>
    <obj type="subobjecttype">
      <argument name="arg0" value="anothervalue"/>
    </obj>
  </subobjects>
</obj>

In YAML:

someobject:
  arg0: value
  arg1: value
  subobjects:
    - arg0: value
    - arg0: anothervalue


In JSON:

{ "someobject" : { "arg0": "value", "arg1": "value", "subobjects": [ { "arg0": "value" }, { "arg0": "anothervalue" } ] } }


##### Share on other sites
And just to reinforce my point, in INI:
[obj]
id=someobject
arg0=value
arg1=value

[obj]
parent=someobject
arg0=value

[obj]
parent=someobject
arg0=anothervalue
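One wrinkle: stock INI readers (Python's configparser, for instance) reject the repeated [obj] sections used above, so this convention needs a small custom reader - which is still only a dozen-odd lines. A sketch (parse_ini is my own illustrative helper, not a standard function):

```python
def parse_ini(text):
    """Read INI text, keeping repeated sections as (name, dict) pairs."""
    sections, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith((";", "#")):
            continue                    # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = (line[1:-1], {})  # a new (possibly repeated) section
            sections.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[1][key.strip()] = value.strip()
    return sections

objs = parse_ini("""
[obj]
id=someobject
arg0=value

[obj]
parent=someobject
arg0=anothervalue
""")
print([props for name, props in objs if props.get("parent") == "someobject"])
```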


##### Share on other sites

And just to reinforce my point, in INI:

Just when I thought "maybe I should do the INI too!"

##### Share on other sites

In YAML:

someobject:
  arg0: value
  arg1: value
  subobjects:
    - arg0: value
    - arg0: anothervalue


In JSON:

{ "someobject" : { "arg0": "value", "arg1": "value", "subobjects": [ { "arg0": "value" }, { "arg0": "anothervalue" } ] } }


Hm, that doesn't seem right. Shouldn't it be:

arguments:
  arg0: value
  arg1: value
subobjects:
  - arguments:
      arg0: value
  - arguments:
      arg0: anothervalue

Edited by patrrr

##### Share on other sites

Points about JSON > XML aside...

As a game developer, the big deal to me is that it's a flexible standardized text format which means:

• I don't have to create my own libraries to read, write, or navigate it.
• At least for the purposes of developing and debugging tools that use, generate, or convert it, it's human readable.
• It's diff-able and potentially merge-able, which to me makes it first-class revisionable.

^^ THIS!

It is a shame this is a Lounge post and we cannot upvote.

As a person who remembers the bad old days of the '90s and late '80s, I easily recall when most tools and technologies relied entirely on binary formats.

I don't care one bit if the language is XML, JSON, YAML, or Maya Ascii or Collada or anything else.  That does not matter.

I absolutely care about factors like those above.

I absolutely care that as a developer I can understand and interpret the file without a binary-file parser.  I can crack open a file, run a text search, and find the data I need.  Even better, sometimes I can run find-and-replace on that file using simple tools.

I absolutely care that I can run diff against two versions of the file and see the difference. I can open a diff of two revisions with an artist sitting next to me, and we can both glance at the file and see that they moved a joint a little to the left.

I don't care about the exact format, but I strongly care that these human-readable, standardized, diff-able files are used.  There was a time when storage space was expensive and everything needed to be encoded.  We are past those days.

Edited by frob

##### Share on other sites

Add some newlines to that JSON example O_O

##### Share on other sites

In YAML:
someobject:
  arg0: value
  arg1: value
  subobjects:
    - arg0: value
    - arg0: anothervalue
In JSON:
{ "someobject" : { "arg0": "value", "arg1": "value", "subobjects": [ { "arg0": "value" }, { "arg0": "anothervalue" } ] } }

Yeah, and in XML, the sane definition would be:

<someobject arg0="value" arg1="value">
  <subobject arg0="value"/>
  <subobject arg0="anothervalue"/>
</someobject>

Granted, there is some repetition of "subobject" and "someobject", but the bloat isn't really that much, is it?
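And that attribute-heavy shape is also the easiest to consume; with Python's stdlib parser, for instance, the whole read is a few lines:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<someobject arg0="value" arg1="value">'
    '  <subobject arg0="value"/>'
    '  <subobject arg0="anothervalue"/>'
    '</someobject>'
)

print(doc.get("arg0"))                                    # value
print([s.get("arg0") for s in doc.findall("subobject")])  # ['value', 'anothervalue']
```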

##### Share on other sites

In YAML:
someobject:
  arg0: value
  arg1: value
  subobjects:
    - arg0: value
    - arg0: anothervalue
In JSON:
{ "someobject" : { "arg0": "value", "arg1": "value", "subobjects": [ { "arg0": "value" }, { "arg0": "anothervalue" } ] } }

Yeah, and in XML, the sane definition would be:

<someobject arg0="value" arg1="value">
  <subobject arg0="value"/>
  <subobject arg0="anothervalue"/>
</someobject>

Granted, there is some repetition of "subobject" and "someobject", but the bloat isn't really that much, is it?

this is why i like xml over yaml.  json is interesting, but i'm sticking to xml personally.  yeah, there's a bit of bloat, but imo xml's minor bloat is made up for in terms of readability.  of course xml is obviously more easily abusable (as was pointed out above), but that shouldn't be a reason not to use something.

##### Share on other sites

Granted, there is some repetition of "subobject" and "someobject", but the bloat isn't really that much, is it?

Nah, and as a bonus it'll also be easier to parse for the program, so everybody wins (it's easier to hand-edit and the program is easier to maintain).

The problem is that XML tends to be abused in the worst ways possible x_x; it's like programmers see it's a tree, so they need to turn everything into as deep a tree as possible no matter what, just in case (some may argue it's for expandability). Ugh. Sadly, I wouldn't be surprised if that also reflects the complexity of the programs themselves... (one reason why I always have trouble with third-party code; more often than not it's way more complex than it needed to be)

##### Share on other sites

Being lazy isn't necessarily a bad thing; it can lead to the programmer writing simpler code for the sake of having to do less work (both at the moment and later during maintenance). The problem is being incompetent (seriously, most of the horrible code actually takes lots of effort to make, so it's hard to argue it's lazy), and in some cases, not knowing the implications of what the code does, so things are done in a suboptimal way (high-level languages sometimes can get rather bad about this).

But yeah, incompetent managers are way too common and a serious source of creep. Or maybe they think that by being too hard they can get higher pay or something. Or possibly both. (depends on who's in charge really, as well as company culture)

##### Share on other sites

Being lazy isn't necessarily a bad thing; it can lead to the programmer writing simpler code for the sake of having to do less work (both at the moment and later during maintenance). The problem is being incompetent (seriously, most of the horrible code actually takes lots of effort to make, so it's hard to argue it's lazy), and in some cases, not knowing the implications of what the code does, so things are done in a suboptimal way (high-level languages sometimes can get rather bad about this).

There's a lot of truth in this, in my opinion, not only in respect of "managers" but also in respect of the original topic, "why use XML".

Being "lazy" by doing less work means nothing but showing competence in using the available work time. That's at least true as long as the end-user-observable result is identical (which is the case).

Now, XML may be unsuitable for your tasks; then you should indeed use something different (for example, I would not use it to serialize data that goes over the network, even though even this "works fine", as has been proven). But on the other hand, it might just be good enough, with no real and serious disadvantage other than being less pretty than you'd like. You have working libraries that you know by heart to handle the format, it plays well with your revision control system, and in the final product it's either compiled into a binary format anyway, or the load time doesn't matter. Maybe you don't like one or the other feature, but seriously, so what.

In the rather typical case of "no visible difference in end product", one needs to ask which one shows more competency. Using something that works or investing extra time so one can use something that... works.

##### Share on other sites

Being lazy isn't necessarily a bad thing, it can lead to the programmer writing simpler code for the sake of having to do less work (both at the moment and later during maintenance). The problem is being incompetent (seriously, most of the horrible code actually takes lots of effort to make, so it's hard to argue it's lazy), and in some cases, not knowing the implications of what the code does so things are done in a suboptimal way (high level languages sometimes can get rather bad about this).

There's a lot of truth in this, in my opinion, not only in respect of "managers" but also in respect of the original topic, "why use XML".

Being "lazy" by doing less work means nothing but showing competence in using the available work time. That's at least true as long as the end-user-observable result is identical (which is the case).

Now, XML may be unsuitable for your tasks; then you should indeed use something different (for example, I would not use it to serialize data that goes over the network, even though even this "works fine", as has been proven). But on the other hand, it might just be good enough, with no real and serious disadvantage other than being less pretty than you'd like. You have working libraries that you know by heart to handle the format, it plays well with your revision control system, and in the final product it's either compiled into a binary format anyway, or the load time doesn't matter. Maybe you don't like one or the other feature, but seriously, so what.

In the rather typical case of "no visible difference in end product", one needs to ask which one shows more competency. Using something that works or investing extra time so one can use something that... works.

yes.

this is partly a reason behind the current funkiness of using both XML and S-Expressions for a lot of stuff...

a lot comes back to my interpreter projects, as most of the other use-cases had been "horizontal outgrowths" of these, and most unrelated systems had typically ended up using line-oriented text-files (partly because, in simple cases, these tend to be the least implementation effort).

note that in my case, JSON is mostly treated as a horizontal side-case of the S-Expression system (different parser/printer interface, but they map to the same basic underlying data representations).

the main practical difference (in-program) is then which types dominate:

S-Expression data primarily uses lists, and generally avoids maps/objects (non-standard extension);

JSON primarily uses maps/objects, and avoids lists, symbols, keywords, ... (non-standard extensions).

secondarily, this also means my "S-Expression" based network protocol naturally handles serializing JSON style data as well (it doesn't really care too much about the differences).

for largish data structures, the relative costs of the various options tend to weigh in as well, and (in my case) objects with only a few fields tend to be a bit more expensive than using a list or array (though objects are a better choice if there are likely to be a lot of fields, or if the data just naturally maps better to an object than to a list or array).

this leads to a drawback for JSON in this case: it tends (by convention) to rely fairly heavily on these more-expensive object types, and, for my (list-heavy) data-sets, tends to produce slightly more "noisy" output (lots of extra / unnecessary characters). both formats can be either "dumped" or printed using formatting.

brief history (approximately of the past 14 years):

at one point, I wrote a Scheme interpreter, and it naturally uses S-Expressions.

later on, this project partly imploded (at the time, the code became mostly unmaintainable, and Scheme fell a bit short in a few areas).

by its later stages, it had migrated to a form of modified S-Expressions, where essentially:

macros were expanded; built-in operations used operation-numbers rather than symbols; lexical variables were replaced with variable-indices; ...

there was also a backend which would spit out Scheme code compiled to globs of C.

elsewhere, I had implemented XML-RPC, and a simplistic DOM-like system to go along with it.

I had also created a type-system initially intended to assist with data serialization, and partly also to add some dynamic type-facilities needed to work effectively with XML-RPC. pretty much all types were raw pointers to heap-allocated values, with an object header just before the data, and it was initially separate from the memory manager (later on, they were merged). (in this system, if you wanted an integer value, you would get an individually-allocated integer, ...).

later on, I implemented the first BGBScript interpreter (BS.A) (as a half-assed JavaScript knock-off), using essentially a horridly hacked/expanded version of the XML-RPC logic as the back-end (it was actually initially a direct interpreter working by walking over the XML trees, and was *slow*...). later on, it had sort of halfway moved to bytecode, but in a lame way (it was actually using 16-bit "word code", and things like loops and similar were handled using recursion and stacks). the type-system was reused from the above. (it also generated garbage at an absurd rate... you couldn't do "i++;" on an integer variable without the thing spewing garbage... making it ultimately almost unusable even really for light-duty scripting...).

the second BGBScript interpreter (BS.B) was built mostly by copying a bunch of the lower-end compiler and interpreter logic from the Scheme interpreter, and essentially just using a mutilated version of Scheme as the AST format, while retaining a fairly similar high-level syntax to the original. it used ("proper") bytecode up-front, and later experimented with a JIT. it ended up inheriting some of the Scheme interpreter's problems (and notably problematic was the use of precise reference-counted references from C code, which involved a lot of boilerplate, pain, and performance overhead).

the C compiler sub-project mostly used the parser from BS.A and parts of the bytecode and JIT from BS.B. it kept the use of an XML-based AST format. this sub-project turned out to have a few ultimately fatal flaws (though some parts remain and were later re-purposed). this fate also befell my (never completed) Java and C# compiler efforts, which were built on the same infrastructure as the C compiler.

the 3rd BGBScript interpreter (BS.C) was basically just reworking BS.B onto the type-system from BS.A, mostly as it was significantly less of a PITA to work with. this resulted in some expansion of the BS.A type-system (such as to include lists and cons cells, ...). (and, by this time, some of the worse offenses had already been cleaned up...).

the changes made broke the JIT in some fairly major ways (so, for the most part, it was interpreter only).

the BS VM has not undergone any single major rewrites since BS.C, but several notable internal changes have been made:

migration of the interpreter to threaded-code;

migration of the interpreter (and language) mostly to using static types;

implementation of a new JIT;

migrating to a new tagged-reference scheme, away from raw pointers (*1);

...

what would be a 4th BS rewrite has been considered, which would essentially be moving the VM primarily to static types and using a Dalvik-like backend (Register IR). this could potentially help with performance, but would take a lot of time/effort and would likely not be bytecode compatible with the current VM.

*1: unlike the prior type-system changes, this preserves 1:1 compatibility with the pointer-based system (via direct conversion), though there are some cases of conversion inefficiencies (mostly due to differences in terms of value ranges). both systems use conservative references and do not use reference-counting (avoiding a lot of the pain and overhead these bring).

or such...

Edited by cr88192

##### Share on other sites

Would like to throw my two cents in here as well.

I understand that people may not be crazy about XML and it was used, overused and abused to no end for many, many years. But, I personally find it a very useful format for encoding basic data that doesn't need to be in binary and is never really intended to be sent over a network. Effectively I use it to define animation states and object properties in games. I also use it to great effect for localization strings.

I find JSON problematic for these cases and frankly, YAML isn't as easy to put together particularly when you have a number of sub objects (not as intuitive, but that could simply be because it hasn't been in as great a use as XML).

Not to mention, you have really great libraries that are well tested and mature. I'm using TinyXML to great effect -- no need for the extra stuff like schemas and validation and whatnot, I just handle that myself because the definitions I'm using are so basic in nature.

##### Share on other sites

brief history (approximately of the past 14 years):

at one point, I wrote a Scheme interpreter, and it naturally uses S-Expressions.

later on, this project partly imploded (at the time, the code became mostly unmaintainable, and Scheme fell a bit short in a few areas).

by its later stages, it had migrated to a form of modified S-Expressions, where essentially:

macros were expanded; built-in operations used operation-numbers rather than symbols; lexical variables were replaced with variable-indices; ...

there was also a backend which would spit out Scheme code compiled to globs of C.

Quite the array of language projects you have there!

I too am fond of the use of S-expressions over that of XML, and have had experience using them for data and DSLs in a number of projects. You can't beat the terseness and expressive power, and it's not hard to roll your own parser to handle them.

I share many of the opinions from: http://c2.com/cgi/wiki?XmlIsaPoorCopyOfEssExpressions

As for my own projects, I've also built a custom R6RS parser in C++, and have done some interesting things with it. For specifying data as maps/sets/vectors, I added support for handling special forms which yield new data-structure semantics, added Clojure-like syntactic sugar to the lexer/parser where braces and square brackets can be used to define such data structures, and added a quick tree-rewriting pass to the data compiler to convert from the internal list AST node representation to the appropriate container type.

For simple data, sometimes I just go with simple key-value text files if I can get away with it (less is more! strtok_r does the job well enough), and I've recently been experimenting with using parsing expression grammar generators to quickly create parser combinators for custom DSLs that generate more complex data or code as s-expressions or C++.

A shame that many of the "big iron" game studios still use XML for a lot of things, although I've managed to convince a number of people that it's time to move on. I dread the days where I am tasked with working on anything touching the stuff.

In short, if you're still using XML, you're needlessly wading through an endless swamp of pain, suffering and obtuse complexity. Things can be better.

##### Share on other sites

I understand that people may not be crazy about XML and it was used, overused and abused to no end for many, many years. But, I personally find it a very useful format for encoding basic data that doesn't need to be in binary and is never really intended to be sent over a network. Effectively I use it to define animation states and object properties in games. I also use it to great effect for localization strings.

I find JSON problematic for these cases and frankly, YAML isn't as easy to put together particularly when you have a number of sub objects (not as intuitive, but that could simply be because it hasn't been in as great a use as XML).

S-expressions are just as powerful, yet more terse. Naughty Dog uses them in the Uncharted Engine for similar things.

##### Share on other sites
If the data format is going to be read, 99% of the time, by a tools pipeline and not a human, then I don't consider terseness a virtue, to be honest.

If your pipeline/tools are based around .Net then with XML, between the XDocument/XElement classes and LINQ, you've got 99% of your processing/tree-walking requirements covered - writing a bit of LINQ to parse an XDocument is pretty trivial.

##### Share on other sites

If the data format is going to, 99% of the time, be read in a tools pipeline and not a human then I don't consider terseness a virtue to be honest.

My experience has been that even when that is the intention, it's not the end result.

We seem to spend a lot of time hand-tweaking the (XML) output of our pipeline.

##### Share on other sites
Then, honestly, I'd argue your pipeline is a tad broken ;)

I can say with absolute certainty that no tool-produced XML needs tweaking in our setup. We do have one config file which is XML, but that is "legacy" as much as anything ("it works, we aren't going to change it"). The only other hand-edited config file we have is the renderer setup one, which is in JSON - although I'm not convinced that was the right call; I wanted to use a 'JSON/Python-inspired syntax', but that's a whole other barrel of bitterness.