Why is XML all the rage now?


Plenty of talented people out there are still trying to do the best they can but don't get a chance, because higher up the chain 'good enough' is what they want and it's on to the next feature. I've lost track of the number of things I've had to check in where I know I could have improved it, but the time wasn't there because 'feature Y' needs to be done in a week now. You fight the battle; sometimes you win, and more often than not you lose.


They don't even really want "good enough." They want "Runs for me in the sales demo."

Having had to deal with those kinds of people for a long time, I can honestly say that I've never "lazied" code. But I have fudged it and written shit just to get it "working" and had to leave that code behind. Feature creep is damn annoying, and happens on all projects. Having features change entirely? Happens all the time. Having to have had stuff done LAST WEEK that was just brought up TODAY? Yep. Happens all the time.

Decent programmers aren't lazy, just swamped with a hundred other things on their plate.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.


Being lazy isn't necessarily a bad thing; it can lead to the programmer writing simpler code for the sake of having to do less work (both in the moment and later during maintenance). The problem is being incompetent (seriously, most of the horrible code actually takes lots of effort to make, so it's hard to argue it's lazy), and in some cases, not knowing the implications of what the code does, so things are done in a suboptimal way (high-level languages can sometimes get rather bad about this).

But yeah, incompetent managers are way too common and a serious source of creep. Or maybe they think that by pushing too hard they can get higher pay or something. Or possibly both (depends on who's in charge really, as well as company culture).

Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

Being lazy isn't necessarily a bad thing; it can lead to the programmer writing simpler code for the sake of having to do less work (both in the moment and later during maintenance). The problem is being incompetent (seriously, most of the horrible code actually takes lots of effort to make, so it's hard to argue it's lazy), and in some cases, not knowing the implications of what the code does, so things are done in a suboptimal way (high-level languages can sometimes get rather bad about this).

There's a lot of truth in this, in my opinion, not only in respect of "managers" but also in respect of the original topic, "why use XML".

Being "lazy" for doing less work means nothing but showing competence in using the available work time. That's at least true as long as the end user observable result is identical (which is the case).

Now, XML may be unsuitable for your tasks; in that case you should indeed use something different (for example, I would not use it to serialize data that goes over the network, even though even this "works fine", as has been proven). But on the other hand, it might just be good enough, with no real and serious disadvantage other than being less pretty than you'd like. You have working libraries that you know by heart to handle the format, it plays well with your revision control system, and in the final product it's either compiled into a binary format anyway, or the load time doesn't matter. Maybe you don't like one or the other feature, but seriously, so what.

In the rather typical case of "no visible difference in end product", one needs to ask which one shows more competency. Using something that works or investing extra time so one can use something that... works.


yes.

this is partly a reason behind the current funkiness of using both XML and S-Expressions for a lot of stuff...

a lot comes back to my interpreter projects, as most of the other use-cases had been "horizontal outgrowths" of these, and most unrelated systems had typically ended up using line-oriented text-files (partly because, in simple cases, these tend to be the least implementation effort).

note that in my case, JSON is mostly treated as a horizontal side-case of the S-Expression system (different parser/printer interface, but they map to the same basic underlying data representations).

the main practical difference then (in a program) is the dominance of types:

S-Expression data primarily uses lists, and generally avoids maps/objects (non-standard extension);

JSON primarily uses maps/objects, and avoids lists, symbols, keywords, ... (non-standard extensions).

secondarily, this also means my "S-Expression" based network protocol naturally handles serializing JSON style data as well (it doesn't really care too much about the differences).

for largish data structures, the relative costs of various options tend to weigh in as well, and (in my case) objects with only a few fields tend to be a bit more expensive than using a list or array (though objects are a better choice if there are likely to be a lot of fields, or if the data just naturally maps better to an object than it does to a list or array).

this leads to a drawback for JSON in this case: it tends to (by convention) rely fairly heavily on these more-expensive object types, and for my (list-heavy) data-sets it also tends to produce slightly more "noisy" output (lots of extra / unnecessary characters). both formats can be either "dumped" or printed using formatting.
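as a small made-up example (not data from any of the projects here), the same list-heavy record in both notations:

    ; S-expression form: lists all the way down, few delimiters
    (sprite "player" (pos 10 20) (frames 0 1 2 3))

    // JSON form: objects by convention, more punctuation for the same data
    {"sprite": {"name": "player", "pos": [10, 20], "frames": [0, 1, 2, 3]}}

the JSON version spends extra characters on quotes, colons, and key names, which is the kind of "noise" (and the object-type overhead) being described.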

brief history (approximately of the past 14 years):

at one point, I wrote a Scheme interpreter, and it naturally uses S-Expressions.

later on, this project partly imploded (at the time, the code became mostly unmaintainable, and Scheme fell a bit short in a few areas).

by its later stages, it had migrated to a form of modified S-Expressions, where essentially:

macros were expanded; built-in operations used operation-numbers rather than symbols; lexical variables were replaced with variable-indices; ...

there was also a backend which would spit out Scheme code compiled to globs of C.

elsewhere, I had implemented XML-RPC, and a simplistic DOM-like system to go along with it.

I had also created a type-system initially intended to assist with data serialization, and partly also to add some dynamic type-facilities needed to work effectively with XML-RPC. pretty much all types were raw pointers to heap-allocated values, with an object header just before the data; it was initially separate from the memory manager (later on, they were merged). (in this system, if you wanted an integer value, you would get an individually-allocated integer, ...).
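roughly, that kind of layout can be sketched like this (all names invented here, not the actual code): the header sits just before the payload, and the pointer handed out points at the payload itself:

    #include <cstdlib>

    // hypothetical header placed just before every heap-allocated value
    struct ObjHeader {
        int typeId;   // dynamic type tag
        int size;     // payload size in bytes
    };

    // allocate a value; the caller only ever sees the payload pointer
    void *objAlloc(int typeId, int size) {
        ObjHeader *h = (ObjHeader *)std::malloc(sizeof(ObjHeader) + size);
        h->typeId = typeId;
        h->size = size;
        return h + 1;   // pointer just past the header
    }

    // recover the dynamic type of any value from its raw pointer
    int objTypeId(void *p) {
        return ((ObjHeader *)p - 1)->typeId;
    }

this is why every value (even a lone integer) costs a heap allocation in such a scheme.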

later on, I implemented the first BGBScript interpreter (BS.A) (as a half-assed JavaScript knock-off), using essentially a horridly hacked/expanded version of the XML-RPC logic as the back-end (it was actually initially a direct interpreter working by walking over the XML trees, and was *slow*...). later on, it had sort of half-way moved to bytecode, but in a lame way (it was actually using 16-bit "word code", and things like loops and similar were handled using recursion and stacks). the type-system was reused from the above. (it also generated garbage at an absurd rate... you couldn't do "i++;" on an integer variable without the thing spewing garbage... making it ultimately almost unusable even for light-duty scripting...).

the second BGBScript interpreter (BS.B) was built mostly by copying a bunch of the lower-end compiler and interpreter logic from the Scheme interpreter, and essentially just using a mutilated version of Scheme as the AST format, while retaining a fairly similar high-level syntax to the original. it used ("proper") bytecode up-front, and later experimented with a JIT. it ended up inheriting some of the Scheme interpreter's problems (and notably problematic was the use of precise reference-counted references from C code, which involved a lot of boilerplate, pain, and performance overhead).

the C compiler sub-project mostly used the parser from BS.A and parts of the bytecode and JIT from BS.B. it kept the use of an XML based AST format. this sub-project turned out to have a few ultimately fatal flaws (though some parts remain and were later re-purposed). this fate also befell my (never completed) Java and C# compiler efforts, which were built on the same infrastructure as the C compiler.

the 3rd BGBScript interpreter (BS.C) was basically just a reworking of BS.B to work on top of the type-system from BS.A, mostly as it was significantly less of a PITA to work with. this resulted in some expansion of the BS.A type-system (such as to include lists and cons cells, ...). (and, by this time, some of the worse offenses had already been cleaned up...).

the changes made broke the JIT in some fairly major ways (so, for the most part, it was interpreter only).

the BS VM has not undergone any single major rewrites since BS.C, but several notable internal changes have been made:

migration of the interpreter to threaded-code (see the sketch after this list);

migration of the interpreter (and language) mostly to using static types;

implementation of a new JIT;

migrating to a new tagged-reference scheme, away from raw pointers (*1);

...
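to illustrate what the threaded-code migration means (a generic sketch, not the actual BS VM internals; all names invented): instead of one big switch over opcodes, each decoded instruction carries a direct pointer to its handler, so dispatch is just an indirect call:

    #include <cstdio>

    struct VMContext { int acc; };
    struct Op;                               // forward declaration
    typedef Op *(*OpFn)(VMContext *, Op *);  // a handler returns the next op
    struct Op { OpFn fn; int arg; };

    // each opcode is a function; it does its work and hands back the next op
    Op *opLoadConst(VMContext *c, Op *op) { c->acc = op->arg;  return op + 1; }
    Op *opAddConst (VMContext *c, Op *op) { c->acc += op->arg; return op + 1; }
    Op *opHalt     (VMContext *, Op *)    { return 0; }

    void run(VMContext *c, Op *op) {
        while (op) op = op->fn(c, op);       // no switch, just indirect calls
    }

    int main() {
        Op prog[] = { {opLoadConst, 2}, {opAddConst, 3}, {opHalt, 0} };
        VMContext c = {0};
        run(&c, prog);
        printf("%d\n", c.acc);               // prints 5
        return 0;
    }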

what would be a 4th BS rewrite has been considered, which would essentially be moving the VM primarily to static types and using a Dalvik-like backend (Register IR). this could potentially help with performance, but would take a lot of time/effort and would likely not be bytecode compatible with the current VM.

*1: unlike the prior type-system changes, this preserves 1:1 compatibility with the pointer-based system (via direct conversion), though there are some cases of conversion inefficiencies (mostly due to differences in terms of value ranges). both systems use conservative references and do not use reference-counting (avoiding a lot of the pain and overhead these bring).
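a common way to build such tagged references looks something like the following (a generic sketch; the actual BGB tagging layout isn't spelled out here). the low bit distinguishes inline fixnums from aligned heap pointers, so small integers never touch the heap:

    #include <cstdint>

    typedef uintptr_t Ref;   // one pointer-sized tagged reference word

    // low bit 1 = inline fixnum, low bit 0 = (2-byte-aligned) heap pointer
    inline Ref      fixnumToRef(intptr_t i) { return ((uintptr_t)i << 1) | 1; }
    inline bool     refIsFixnum(Ref r)      { return (r & 1) != 0; }
    inline intptr_t refToFixnum(Ref r)      { return (intptr_t)r >> 1; }  // arithmetic shift

    inline Ref   ptrToRef(void *p) { return (uintptr_t)p; }
    inline void *refToPtr(Ref r)   { return (void *)r; }

pointers convert 1:1 both ways, which matches the compatibility point above; the conversion inefficiencies show up for values that fall outside the fixnum range and need boxing.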

or such...

Would like to throw my two cents in here as well.

I understand that people may not be crazy about XML and it was used, overused and abused to no end for many, many years. But I personally find it a very useful format for encoding basic data that doesn't need to be in binary and is never really intended to be sent over a network. Effectively I use it to define animation states and object properties in games. I also use it to great effect for localization strings.

I find JSON problematic for these cases and frankly, YAML isn't as easy to put together, particularly when you have a number of sub-objects (not as intuitive, but that could simply be because it hasn't been in as great a use as XML).

Not to mention, you have really great libraries that are well tested and mature. I'm using TinyXML to great effect -- no need for the extra stuff like schemas and validation and whatnot, I just handle that myself because the definitions I'm using are so basic in nature.
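For a sense of how little code that takes, here's a minimal sketch using TinyXML-2 (the file name and element layout are invented for the example):

    #include <cstdio>
    #include "tinyxml2.h"

    // hypothetical strings.xml:
    //   <strings><string id="hello">Hello, world!</string></strings>
    int main() {
        tinyxml2::XMLDocument doc;
        if (doc.LoadFile("strings.xml") != tinyxml2::XML_SUCCESS)
            return 1;
        tinyxml2::XMLElement *root = doc.FirstChildElement("strings");
        if (!root)
            return 1;
        for (tinyxml2::XMLElement *e = root->FirstChildElement("string");
             e; e = e->NextSiblingElement("string")) {
            printf("%s = %s\n", e->Attribute("id"), e->GetText());
        }
        return 0;
    }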

-Lead developer for OutpostHD

http://www.lairworks.com


brief history (approximately of the past 14 years):

at one point, I wrote a Scheme interpreter, and it naturally uses S-Expressions.

later on, this project partly imploded (at the time, the code became mostly unmaintainable, and Scheme fell a bit short in a few areas).

by its later stages, it had migrated to a form of modified S-Expressions, where essentially:

macros were expanded; built-in operations used operation-numbers rather than symbols; lexical variables were replaced with variable-indices; ...

there was also a backend which would spit out Scheme code compiled to globs of C.

Quite the array of language projects you have there!

I too am fond of using S-expressions over XML, and have had experience using them for data and DSLs in a number of projects. You can't beat the terseness and expressive power, and it's not hard to roll your own parser to handle them.

I share many of the opinions from: http://c2.com/cgi/wiki?XmlIsaPoorCopyOfEssExpressions

As for my own projects, I've also built a custom R6RS parser in C++, and have done some interesting things with it. For specifying data as maps/sets/vectors, I added support for handling special forms which yield new data-structure semantics, added Clojure-like syntactic sugar to the lexer/parser where braces and square brackets can be used to define such data structures, and added a quick tree-rewriting pass to the data compiler to convert from the internal list AST node representation to the appropriate container type.
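As a made-up illustration of that sugar (borrowing Clojure's notation; the exact syntax here is invented), the rewrite pass might expand brace/bracket literals into plain list forms:

    {width 800 height 600}            ; sugared map literal
    [1 2 3]                           ; sugared vector literal

    (hash-map width 800 height 600)   ; what the tree-rewriting pass could expand them to
    (vector 1 2 3)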

For simple data, sometimes I just go with simple key-value text files if I can get away with it (less is more! strtok_r does the job well enough), and I've recently been experimenting with using parsing expression grammar generators to quickly create parser combinators for custom DSLs that generate more complex data or code as s-expressions or C++.
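A sketch of that kind of minimal key-value parsing (the format and keys are invented; note strtok_r is POSIX rather than standard C++):

    #include <cstdio>
    #include <cstring>

    // parse lines like "key value" from a mutable buffer, in place
    void parseConfig(char *buf) {
        char *saveLine;
        for (char *line = strtok_r(buf, "\n", &saveLine);
             line; line = strtok_r(0, "\n", &saveLine)) {
            char *saveTok;
            char *key = strtok_r(line, " \t", &saveTok);
            char *val = strtok_r(0, " \t", &saveTok);
            if (key && val)
                printf("%s -> %s\n", key, val);   // hand off to the game instead
        }
    }

    int main() {
        char buf[] = "width 800\nheight 600\nfullscreen 1\n";
        parseConfig(buf);
        return 0;
    }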

A shame that many of the "big iron" game studios still use XML for a lot of things, although I've managed to convince a number of people that it's time to move on. I dread the days where I am tasked with working on anything touching the stuff.

In short, if you're still using XML, you're needlessly wading through an endless swamp of pain, suffering and obtuse complexity. Things can be better.


I understand that people may not be crazy about XML and it was used, overused and abused to no end for many, many years. But, I personally find it a very useful format for encoding basic data that doesn't need to be in binary and is never really intended to be sent over a network. Effectively I use it to define animation states and object properties in games. I also use it to great effect for localization strings.

I find JSON problematic for these cases and frankly, YAML isn't as easy to put together particularly when you have a number of sub objects (not as intuitive, but that could simply be because it hasn't been in as great a use as XML).

S-expressions are just as powerful, yet more terse. Naughty Dog uses them in the Uncharted Engine for similar things.

If the data format is going to, 99% of the time, be read by a tools pipeline and not by a human, then I don't consider terseness a virtue to be honest.

If your pipeline/tools are based around .Net then with XML, between the XDocument/XElement classes and LINQ, you've got 99% of your processing/tree walking requirements there - writing a bit of LINQ to parse an XDocument is pretty trivial.


If the data format is going to, 99% of the time, be read by a tools pipeline and not by a human, then I don't consider terseness a virtue to be honest.

My experience has been that even when that is the intention, it's not the end result.

We seem to spend a lot of time hand-tweaking the (XML) output of our pipeline.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Then, honestly, I'd argue your pipeline is a tad broken ;)

I can say with absolute certainty that no tool-produced XML needs tweaking in our setup. We do have one config file which is XML, but that is "legacy" as much as anything ("it works, we aren't going to change it"). The only other hand-edited config file we have is the renderer setup one, which is in JSON - although I'm not convinced that was the right call and wanted to use a 'JSON/Python inspired syntax', but that's a whole other barrel of bitterness :)

