Why XML is all the rage now?

Started by
61 comments, last by TechnoGoth 10 years, 1 month ago


brief history (approximately of the past 14 years):

at one point, I wrote a Scheme interpreter, and it naturally uses S-Expressions.

later on, this project partly imploded (at the time, the code became mostly unmaintainable, and Scheme fell a bit short in a few areas).

by its later stages, it had migrated to a form of modified S-Expressions, where essentially:

macros were expanded; built-in operations used operation-numbers rather than symbols; lexical variables were replaced with variable-indices; ...

there was also a backend which would spit out Scheme code compiled to globs of C.

Quite the array of language projects you have there!

I too am fond of the use of S-expressions over that of XML, and have had experience using them for data and DSLs in a number of projects. You can't beat the terseness and expressive power, and it's not hard to roll your own parser to handle them.

I share many of the opinions from: http://c2.com/cgi/wiki?XmlIsaPoorCopyOfEssExpressions

As for my own projects, I've also built a custom R6RS parser in C++, and have done some interesting things with it. For specifying data as maps/sets/vectors, I added support for handling special forms which yield new data-structure semantics, added Closure-like syntactic sugar to the lexer/parser where braces and square brackets can be used to define such data structures, and added a quick tree-rewriting pass to the data compiler to convert from the internal list AST node representation to the appropriate container type.

For simple data, sometimes I just go with simple key-value text files if I can get away with it (less is more! strtok_r does the job good enough), and I've recently been experimenting with using parsing expression grammar generators to quickly create parser combinators for custom DSLs that generate more complex data or code as s-expressions or C++.

A shame that many of the "big iron" game studios still use XML for a lot of things, although I've managed to convince a number people that it's time to move on. I dread the days where I am tasked with working on anything touching the stuff.

In short, if you're still using XML, you're needlessly wading through an endless swamp of pain, suffering and obtuse complexity. Things can be better.

I was working at the time with R5RS.

by the time R6RS came out, I had mostly stopped using Scheme, and looking at it briefly, it looked like a bit of a jump from what R5RS was.

the AST format later used for BGBScript was based partly on R5RS, but differs in a lot of ways, namely in those which make it a better fit for an HLL with a more C/JS/AS3/... like syntax, like different special-forms for defining things, ones representing control-flow constructs (for/while/switch/...), ...

also, generally it moved to the use of explicit special-forms for things like function calls and using operators, ...

some elements of Scheme also were worked into the HLL design as well (tail-calls / tail-position, implicit return values, lists, ...).

early on, both Self and Erlang were also influences for the language design.

later on, Java, C#, and AS3 became influences.

basically, while it started out dynamic and prototype based, static-types, classes, packages, ... were later glued on, partly for performance reasons, and also because they are more effective for a lot of use-cases (can do stronger compile-time checking, ...).

though, the language still retains most of its dynamic funkiness (including a Self-derived scoping model, scoping semantics are fun in my language...). not going to try to explain the type-system and scoping model here though.

for parsers, I have most often used hand-written recursive-descent.

I started out with RD, and pretty much every non-trivial syntax I have encountered seems to work fine with RD.

XML and S-Expressions both have some use-cases.

granted, my XML APIs have since diverged somewhat from DOM, becoming generally a lot more operation-centric, and much less about treating XML nodes as objects (and generally, the "Document" metaphor is all but absent in-use). basically, the API focuses a lot more on composition and decomposition of data, rather than on node manipulation. ironically, it isn't used much at all with external tools (typically about the only time most of this is actually seen is in debugging dumps).

theoretically, it could also matter if/when I needed to interact with other things which use XML, or if by some off-chance I decide to use XML-RPC again (currently unlikely...).

granted, from an ease-of-use perspective, lists are hard to beat, as they are generally a lot easier to work with with a lot less code.

granted, my approach to this (C-side) has been to build a big chunk of Lisp-like APIs in C (basically, a bunch of Lisp and CLOS-like stuff glued onto C).

granted, it took several iterations before really settling on a usable set of tradeoffs (getting something that is both usable and performs well).

a lot of the infrastructure is shared between my script-language and C parts of the project.

I had considered (binary) XML for my network protocol, but ended opting instead with lists.

basically, my network protocol consists of basically large nested list structures, generally passed along to/from specific "targets" (such as between client-side and server-side versions of an entity, ...). initial versions had used Deflated textual serializations, but I later implemented a direct entropy-coded binary serialization.

this protocol is also used for my voxel-terrain, though it is sort of a hybrid (generally, the actual voxel-chunk data is passed using large byte arrays, with the chunk-data being flattened out and RLE compressed). partly this is because passing every voxel as a list-based message would be a bit of a stretch...

(chunk-delta (origin -240 416 48) (size 16 16 16) ... (voxeldata (voxel :type dirt :aux 0 :slight 240 :vlight 0 ...) (voxel :type dirt ...) ...))

it is basically a problem of 16x16x16 * 32*32*8 * 4 * ... which would take some fairly absurd numbers of cons-cells...

so, passing the chunk data in a byte-serialized format seemed like a "reasonable" compromise here.

so, instead it is something more like:

(wdelta

...

(voxdelta ...

(rgndelta ... #Ah( ... ))

(rgndetla ...)

...)

...)

where wdeta=world-delta, voxdelta=voxel-delta, rgndelta=region-delta, and #Ah(...) is a 1D byte array.

Advertisement


Then, honestly, I'd argue your pipeline is a tad broken ;)

It's more a matter of our process being broken :(

We don't sit next to our artists (nor even in the same time zone), so if an asset comes through buggy, you either learn to use the DCC tool, wait 6 hours for a fresh edition of the asset, or patch the XML up by hand. The latter option wins surprisingly often (hint: programmers mostly don't like using DCC tools).

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Honestly, I think XML might be hated a bit too religiously these days. Many of the biggest things make the biggest targets for criticism. Unless you're going full-featured in your XML usage, it's plenty readable and if you make proper use of attributes, it isn't that much bigger than things like JSON.

XML was the CSV of Y2K

tone

I consider XML to be in the same category as COLLADA. ie, designed to reliably convey data between different systems, and nothing else.

  • They are both text-based and inefficient at storing data when compared to binary formats.
  • They can have complex structures that lead to slow parsing especially on larger files.
  • Despite what people say (and the original design intention), XML is absolutely not human readable.

The only difference I see is that people realised what COLLADA was intended for and treated it accordingly, whereas XML was (and still is) abused.

I've worked on a project where the original authors thought it would be a good idea to create an entire XML document on the fly using string concatentation, pass it to a stored procedure and query it as a table to pull out a few parameters.

Guess what brought down the entire system?... "&".

Granted, that's not XML's fault, but still...

[size="2"]Currently working on an open world survival RPG - For info check out my Development blog:[size="2"] ByteWrangler

Well, COLLADA is an XML application so it makes sense they are similar. I agree very much that in most respects XML is best left on the near side of your build, with the exception being when you actually need human-readable markup as part of your program's content (or arguably, its not the worst choice you could make for configuration data *if* you've already taken the dependency anyways).

I tend to disagree that XML isn't human readable though -- the language itself is plenty readable, but many of its applications are too complex and/or verbose for that to be true in practice. Another sin some XML applications commit is not using the language correctly -- using attributes when children would be more apropos (or vice-versa), introducing too-many/not-enough "container" elements, improper use of namespaces, or failing to provide a means of validation for the application.

A straight-forward, well-designed, and well-supported XML application is usually a joy to use, modify, and build tooling around.

throw table_exception("(? ???)? ? ???");

Edit: Nevermind. Accidentally replied to something months old. I'm an idiot.

Honestly, JSON is my preferred data serializer. I use it in everything in lieu of XML.

The existing YAML parsers are, from experience of trying to integrate with C++, pretty bad or incomplete haha but its also a good format when its working. YAML 1.2 actually falls back on JSON which is cool (but I havent tested it). I tend to prefer YAML for config type files, and JSON for just about everything else (data stores, data transfer, web services, etc)

Signed: Redacted

From a strictly professional/commercial standpoint - I do EDI development as my day job. Things such as EDIFACT, X12, HL7, FiX, etc.

Back several years ago, many of these large, business-type data standards decided to try and push the market from using Length-Encoded textual files to markup files via XML tags. It went horrible. Those that implemented it probably wished they hadn't, and those that didn't still have to deal with those that did. Here's an example of HL7v2 and HL7v3(XML-based). Can you pick which one you'd rather try and troubleshoot and view data in? I pick Option #1. I honestly wish XML would die.

HL7v2


MSH|^~\&|GHH LAB|ELAB-3|GHH OE|BLDG4|200202150930||ORU^R01|CNTRL-3456|P|2.4
PID|||555-44-4444||EVERYWOMAN^EVE^E^^^^L|JONES|19620320|F|||153 FERNWOOD DR.^
^STATESVILLE^OH^35292||(206)3345232|(206)752-121||||AC555444444||67-A4335^OH^20030520
OBR|1|845439^GHH OE|1045813^GHH LAB|15545^GLUCOSE|||200202150730|||||||||
555-55-5555^PRIMARY^PATRICIA P^^^^MD^^|||||||||F||||||444-44-4444^HIPPOCRATES^HOWARD H^^^^MD
OBX|1|SN|1554-5^GLUCOSE^POST 12H CFST:MCNC:PT:SER/PLAS:QN||^182|mg/dl|70_105|H|||F<cr>

HL7v3


 <POLB_IN224200 ITSVersion="XML_1.0" xmlns="urn:hl7-org:v3"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> 
<id root="2.16.840.1.113883.19.1122.7" extension="CNTRL-3456"/>
<creationTime value="200202150930-0400"/>
<!-- The version of the datatypes/RIM/vocabulary used is that of May 2006 -->
<versionCode code="2006-05"/>
<!-- interaction id= Observation Event Complete, w/o Receiver Responsibilities -->
<interactionId root="2.16.840.1.113883.1.6" extension="POLB_IN224200"/>
<processingCode code="P"/>
<processingModeCode nullFlavor="OTH"/>
<acceptAckCode code="ER"/>
<receiver typeCode="RCV">
   <device classCode="DEV" determinerCode="INSTANCE">
     <id extension="GHH LAB" root="2.16.840.1.113883.19.1122.1"/>
     <asLocatedEntity classCode="LOCE">
       <location classCode="PLC" determinerCode="INSTANCE">
         <id root="2.16.840.1.113883.19.1122.2" extension="ELAB-3"/>
       </location>
     </asLocatedEntity>
   </device>
</receiver>
<sender typeCode="SND">
   <device classCode="DEV" determinerCode="INSTANCE">
     <id root="2.16.840.1.113883.19.1122.1" extension="GHH OE"/>
     <asLocatedEntity classCode="LOCE">
       <location classCode="PLC" determinerCode="INSTANCE">
         <id root="2.16.840.1.113883.19.1122.2" extension="BLDG24"/>
       </location>
     </asLocatedEntity>
   </device>
</sender>
<! –- Trigger Event Control Act & Domain Content -- >
</POLB_IN224200>

From a strictly professional/commercial standpoint - I do EDI development as my day job. Things such as EDIFACT, X12, HL7, FiX, etc.

Back several years ago, many of these large, business-type data standards decided to try and push the market from using Length-Encoded textual files to markup files via XML tags. It went horrible. Those that implemented it probably wished they hadn't, and those that didn't still have to deal with those that did. Here's an example of HL7v2 and HL7v3(XML-based). Can you pick which one you'd rather try and troubleshoot and view data in? I pick Option #1. I honestly wish XML would die.

HL7v2


MSH|^~\&|GHH LAB|ELAB-3|GHH OE|BLDG4|200202150930||ORU^R01|CNTRL-3456|P|2.4
PID|||555-44-4444||EVERYWOMAN^EVE^E^^^^L|JONES|19620320|F|||153 FERNWOOD DR.^
^STATESVILLE^OH^35292||(206)3345232|(206)752-121||||AC555444444||67-A4335^OH^20030520
OBR|1|845439^GHH OE|1045813^GHH LAB|15545^GLUCOSE|||200202150730|||||||||
555-55-5555^PRIMARY^PATRICIA P^^^^MD^^|||||||||F||||||444-44-4444^HIPPOCRATES^HOWARD H^^^^MD
OBX|1|SN|1554-5^GLUCOSE^POST 12H CFST:MCNC:PT:SER/PLAS:QN||^182|mg/dl|70_105|H|||F<cr>

HL7v3


 <POLB_IN224200 ITSVersion="XML_1.0" xmlns="urn:hl7-org:v3"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> 
<id root="2.16.840.1.113883.19.1122.7" extension="CNTRL-3456"/>
<creationTime value="200202150930-0400"/>
<!-- The version of the datatypes/RIM/vocabulary used is that of May 2006 -->
<versionCode code="2006-05"/>
<!-- interaction id= Observation Event Complete, w/o Receiver Responsibilities -->
<interactionId root="2.16.840.1.113883.1.6" extension="POLB_IN224200"/>
<processingCode code="P"/>
<processingModeCode nullFlavor="OTH"/>
<acceptAckCode code="ER"/>
<receiver typeCode="RCV">
   <device classCode="DEV" determinerCode="INSTANCE">
     <id extension="GHH LAB" root="2.16.840.1.113883.19.1122.1"/>
     <asLocatedEntity classCode="LOCE">
       <location classCode="PLC" determinerCode="INSTANCE">
         <id root="2.16.840.1.113883.19.1122.2" extension="ELAB-3"/>
       </location>
     </asLocatedEntity>
   </device>
</receiver>
<sender typeCode="SND">
   <device classCode="DEV" determinerCode="INSTANCE">
     <id root="2.16.840.1.113883.19.1122.1" extension="GHH OE"/>
     <asLocatedEntity classCode="LOCE">
       <location classCode="PLC" determinerCode="INSTANCE">
         <id root="2.16.840.1.113883.19.1122.2" extension="BLDG24"/>
       </location>
     </asLocatedEntity>
   </device>
</sender>
<! –- Trigger Event Control Act & Domain Content -- >
</POLB_IN224200>

I'd have to pick none of the above. Honestly, that first one just looks like gibberish. Not that XML is any better, but still... Just total gibberish.

This topic is closed to new replies.

Advertisement