
Why is XML all the rage now?


 


A brief history (covering roughly the past 14 years):

At one point I wrote a Scheme interpreter, which naturally used S-expressions.

Later on, this project partly imploded (by then the code had become mostly unmaintainable, and Scheme fell a bit short in a few areas).

By its later stages, it had migrated to a form of modified S-expressions, where essentially: macros were expanded; built-in operations used operation numbers rather than symbols; lexical variables were replaced with variable indices; and so on.

There was also a backend which would spit out Scheme code compiled to globs of C.

 

Quite the array of language projects you have there!

 

I too am fond of S-expressions over XML, and have experience using them for data and DSLs in a number of projects. You can't beat the terseness and expressive power, and it's not hard to roll your own parser to handle them.

 

I share many of the opinions from: http://c2.com/cgi/wiki?XmlIsaPoorCopyOfEssExpressions

 

As for my own projects, I've also built a custom R6RS parser in C++, and have done some interesting things with it. For specifying data as maps/sets/vectors, I added support for special forms which yield new data-structure semantics, added Clojure-like syntactic sugar to the lexer/parser so that braces and square brackets can be used to define such data structures, and added a quick tree-rewriting pass to the data compiler to convert from the internal list AST node representation to the appropriate container type.

 

For simple data, sometimes I just go with simple key-value text files if I can get away with it (less is more! strtok_r does the job well enough), and I've recently been experimenting with parsing-expression-grammar generators to quickly create parsers for custom DSLs that generate more complex data or code as S-expressions or C++.
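
For what it's worth, the kind of thing strtok_r handles nicely is a loop like this (a minimal sketch; the config format and names here are hypothetical, not from the poster's project; quoting and error handling are omitted):

#include <stdio.h>
#include <string.h>

/* parse lines of the form "key = value" from a mutable buffer */
void parse_config(char *buf)
{
    char *saveptr1, *line;
    for (line = strtok_r(buf, "\n", &saveptr1); line;
         line = strtok_r(NULL, "\n", &saveptr1))
    {
        char *saveptr2;
        char *key = strtok_r(line, " =\t", &saveptr2);
        char *val = strtok_r(NULL, " =\t", &saveptr2);
        if (key && val)
            printf("%s -> %s\n", key, val);
    }
}

int main(void)
{
    char buf[] = "width = 1280\nheight = 720\nfullscreen = 0\n";
    parse_config(buf);
    return 0;
}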

 

A shame that many of the "big iron" game studios still use XML for a lot of things, although I've managed to convince a number of people that it's time to move on. I dread the days when I'm tasked with working on anything touching the stuff.

 

In short, if you're still using XML, you're needlessly wading through an endless swamp of pain, suffering and obtuse complexity. Things can be better.

 

 

I was working with R5RS at the time.

By the time R6RS came out, I had mostly stopped using Scheme; looking at it briefly, it seemed like a bit of a jump from what R5RS was.

 

The AST format later used for BGBScript was based partly on R5RS, but differs in a lot of ways, namely ones which make it a better fit for an HLL with a more C/JS/AS3-like syntax: different special forms for defining things, forms representing control-flow constructs (for/while/switch/...), and so on.

It also generally moved to explicit special forms for things like function calls and operator use.
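
For illustration only (these forms are hypothetical; the actual BGBScript AST isn't shown in this thread), a statement like "while(i<10)i=i+1;" might lower to explicit special forms along these lines:

(while (binop "<" (var i) (int 10))
    (assign (var i) (binop "+" (var i) (int 1))))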

 

Some elements of Scheme were also worked into the HLL design (tail calls / tail position, implicit return values, lists, ...).

Early on, Self and Erlang were also influences on the language design.

Later on, Java, C#, and AS3 became influences.

 

Basically, while it started out dynamic and prototype-based, static types, classes, packages, and so on were later glued on, partly for performance reasons, and also because they are more effective for a lot of use cases (stronger compile-time checking, ...).

 

Though the language still retains most of its dynamic funkiness (including a Self-derived scoping model; scoping semantics are fun in my language...). I'm not going to try to explain the type system and scoping model here, though.

 

 

For parsers, I have most often used hand-written recursive descent.

I started out with RD, and pretty much every non-trivial syntax I have encountered seems to work fine with it.
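
As a rough sketch of the approach (not the poster's actual code), a minimal recursive-descent S-expression reader in C fits in a few dozen lines; this one just prints the tree, where a real reader would build list nodes:

#include <ctype.h>
#include <stdio.h>

/* minimal RD reader: atoms and nested lists only; a real reader
   would also handle strings, quoting, and error recovery */
static const char *p;  /* cursor into the input */

static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

static void parse_expr(int depth)
{
    skip_ws();
    if (*p == '(') {
        p++;                                /* consume '(' */
        printf("%*slist:\n", depth * 2, "");
        skip_ws();
        while (*p && *p != ')') {
            parse_expr(depth + 1);          /* recurse per element */
            skip_ws();
        }
        if (*p == ')') p++;                 /* consume ')' */
    } else {
        const char *start = p;
        while (*p && !isspace((unsigned char)*p) && *p != '(' && *p != ')')
            p++;
        printf("%*satom: %.*s\n", depth * 2, "", (int)(p - start), start);
    }
}

int main(void)
{
    p = "(chunk-delta (origin -240 416 48) (size 16 16 16))";
    parse_expr(0);
    return 0;
}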

 

 

XML and S-Expressions both have some use-cases.

 

Granted, my XML APIs have since diverged somewhat from DOM, becoming a lot more operation-centric and much less about treating XML nodes as objects (the "Document" metaphor is all but absent in use). Basically, the API focuses much more on composition and decomposition of data than on node manipulation. Ironically, it isn't used much at all with external tools; typically about the only place most of this is actually seen is in debugging dumps.

 

Theoretically, it could also matter if/when I need to interact with other things which use XML, or if by some off chance I decide to use XML-RPC again (currently unlikely...).

 

 

Granted, from an ease-of-use perspective, lists are hard to beat, as they are generally a lot easier to work with, with a lot less code.

My approach to this (C-side) has been to build a big chunk of Lisp-like APIs in C (basically, a bunch of Lisp- and CLOS-like stuff glued onto C).
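
As a minimal sketch of the general idea (the actual APIs described are much richer, with symbols, GC, dynamic typing, and CLOS-style objects; names here are hypothetical):

#include <stdio.h>
#include <stdlib.h>

/* the classic cons cell: a pair of car (value) and cdr (rest of list) */
typedef struct Cons Cons;
struct Cons { void *car; Cons *cdr; };

static Cons *cons(void *car, Cons *cdr)
{
    Cons *c = malloc(sizeof(Cons));
    c->car = car;
    c->cdr = cdr;
    return c;
}

int main(void)
{
    /* build the list ("a" "b" "c") and walk it */
    Cons *lst = cons("a", cons("b", cons("c", NULL)));
    for (Cons *c = lst; c; c = c->cdr)
        printf("%s ", (const char *)c->car);
    printf("\n");
    return 0;
}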

 

It took several iterations before really settling on a usable set of tradeoffs (getting something that is both usable and performs well).

A lot of this infrastructure is shared between the script-language and C parts of the project.

 

 

I had considered (binary) XML for my network protocol, but ended up opting for lists instead.

 

Basically, my network protocol consists of large nested list structures, generally passed along to/from specific "targets" (such as between the client-side and server-side versions of an entity, ...). Initial versions used Deflate-compressed textual serialization, but I later implemented a direct entropy-coded binary serialization.

 

This protocol is also used for my voxel terrain, though there it is sort of a hybrid (generally, the actual voxel-chunk data is passed using large byte arrays, with the chunk data flattened out and RLE-compressed). This is partly because passing every voxel as a list-based message would be a bit of a stretch; it would look something like:

 

(chunk-delta (origin -240  416 48) (size 16 16 16) ... (voxeldata (voxel :type dirt :aux 0 :slight 240 :vlight 0 ...) (voxel :type dirt ...) ...))

 

It is basically a problem of scale: 16x16x16 voxels per chunk (4096), times 32x32x8 chunks per region (8192), times 4 regions or so, works out to over 130 million voxels, which would take some fairly absurd numbers of cons cells...

 

So passing the chunk data in a byte-serialized format seemed like a "reasonable" compromise here.
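
For illustration, a byte-oriented RLE encoder of the textbook (count, value) sort (the poster's actual wire format isn't shown in the thread; this is just the general idea applied to a flattened slice of voxel types):

#include <stdio.h>
#include <stddef.h>

/* encode src[0..n) as (run-length, value) byte pairs; returns bytes written */
static size_t rle_encode(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t i = 0, o = 0;
    while (i < n) {
        unsigned char v = src[i];
        size_t run = 1;
        while (i + run < n && src[i + run] == v && run < 255)
            run++;
        dst[o++] = (unsigned char)run;  /* run length (1..255) */
        dst[o++] = v;                   /* run value */
        i += run;
    }
    return o;
}

int main(void)
{
    /* voxel data is mostly long runs of the same value, so RLE does well */
    unsigned char chunk[16] = {1,1,1,1,1,0,0,0,2,2,2,2,2,2,2,2};
    unsigned char out[32];
    printf("16 bytes -> %zu bytes\n", rle_encode(chunk, sizeof(chunk), out));
    return 0;
}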

 

So instead it is something more like:

(wdelta
    ...
    (voxdelta ...
        (rgndelta ... #Ah( ... ))
        (rgndelta ...)
        ...)
    ...)

where wdelta=world-delta, voxdelta=voxel-delta, rgndelta=region-delta, and #Ah(...) is a 1D byte array.

Edited by cr88192



Then, honestly, I'd argue your pipeline is a tad broken ;)

It's more a matter of our process being broken :(

 

We don't sit next to our artists (nor even in the same time zone), so if an asset comes through buggy, you either learn to use the DCC tool, wait 6 hours for a fresh version of the asset, or patch the XML up by hand. The latter option wins surprisingly often (hint: programmers mostly don't like using DCC tools).


Honestly, I think XML is hated a bit too religiously these days; the biggest things make the biggest targets for criticism. Unless you're going full-featured in your XML usage, it's plenty readable, and if you make proper use of attributes, it isn't that much bigger than something like JSON (see the comparison below).
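
For instance (a hypothetical record, not from the thread), attribute-style XML lands in roughly the same size range as the JSON equivalent:

<monster id="1" name="Goblin" hp="250" attack="20" defense="5"/>

{"id": 1, "name": "Goblin", "hp": 250, "attack": 20, "defense": 5}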


I consider XML to be in the same category as COLLADA, i.e. designed to reliably convey data between different systems, and nothing else.

  • They are both text-based and inefficient at storing data when compared to binary formats.
  • They can have complex structures that lead to slow parsing, especially on larger files.
  • Despite what people say (and the original design intention), XML is absolutely not human readable.

The only difference I see is that people realised what COLLADA was intended for and treated it accordingly, whereas XML was (and still is) abused.

 

I've worked on a project where the original authors thought it would be a good idea to create an entire XML document on the fly using string concatenation, pass it to a stored procedure, and query it as a table to pull out a few parameters.

 

Guess what brought down the entire system?... "&".

 

Granted, that's not XML's fault, but still...
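
The usual fix is to escape text as you concatenate. A minimal sketch in C (the function name is hypothetical; any real XML library provides this for you):

#include <stdio.h>

/* write s to stdout with the five XML special characters escaped */
static void xml_escape(const char *s)
{
    for (; *s; s++) {
        switch (*s) {
        case '&':  fputs("&amp;",  stdout); break;
        case '<':  fputs("&lt;",   stdout); break;
        case '>':  fputs("&gt;",   stdout); break;
        case '"':  fputs("&quot;", stdout); break;
        case '\'': fputs("&apos;", stdout); break;
        default:   fputc(*s, stdout);       break;
        }
    }
}

int main(void)
{
    printf("<param value=\"");
    xml_escape("Smith & Sons");  /* the "&" that would otherwise break the document */
    printf("\"/>\n");
    return 0;
}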


Well, COLLADA is an XML application, so it makes sense that they are similar. I agree very much that in most respects XML is best left on the near side of your build, the exception being when you actually need human-readable markup as part of your program's content (or arguably, it's not the worst choice you could make for configuration data *if* you've already taken the dependency anyway).

 

I tend to disagree that XML isn't human readable, though -- the language itself is plenty readable, but many of its applications are too complex and/or verbose for that to hold in practice. Another sin some XML applications commit is not using the language well: using attributes when child elements would be more apropos (or vice versa, as sketched below), introducing too many or too few "container" elements, improper use of namespaces, or failing to provide a means of validating documents against the application.
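
For example (hypothetical snippets): simple scalar properties read naturally as attributes, while structured or repeated content belongs in child elements:

<!-- scalars as attributes -->
<texture id="grass" width="256" height="256"/>

<!-- structured/repeated content as child elements -->
<material id="terrain">
    <pass shader="diffuse"/>
    <pass shader="detail"/>
</material>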

 

A straightforward, well-designed, and well-supported XML application is usually a joy to use, modify, and build tooling around.


Edit: Nevermind. Accidentally replied to something months old. I'm an idiot.

Edited by ambershee


Honestly, JSON is my preferred serialization format. I use it in everything in lieu of XML.

 

The existing YAML parsers are, in my experience of trying to integrate them with C++, pretty bad or incomplete, haha, but it's also a good format when it's working. YAML 1.2 is actually a superset of JSON, which is cool (though I haven't tested that). I tend to prefer YAML for config-type files, and JSON for just about everything else (data stores, data transfer, web services, etc.).


From a strictly professional/commercial standpoint: I do EDI development as my day job, with things such as EDIFACT, X12, HL7, FIX, etc.

 

Back several years ago, many of these large, business-type data standards decided to try to push the market from length-encoded textual files to XML-tagged markup files. It went horribly. Those that implemented it probably wish they hadn't, and those that didn't still have to deal with those that did. Here's an example of HL7v2 and HL7v3 (XML-based). Can you pick which one you'd rather troubleshoot and view data in? I pick option #1. I honestly wish XML would die.

 

HL7v2

MSH|^~\&|GHH LAB|ELAB-3|GHH OE|BLDG4|200202150930||ORU^R01|CNTRL-3456|P|2.4
PID|||555-44-4444||EVERYWOMAN^EVE^E^^^^L|JONES|19620320|F|||153 FERNWOOD DR.^
^STATESVILLE^OH^35292||(206)3345232|(206)752-121||||AC555444444||67-A4335^OH^20030520
OBR|1|845439^GHH OE|1045813^GHH LAB|15545^GLUCOSE|||200202150730|||||||||
555-55-5555^PRIMARY^PATRICIA P^^^^MD^^|||||||||F||||||444-44-4444^HIPPOCRATES^HOWARD H^^^^MD
OBX|1|SN|1554-5^GLUCOSE^POST 12H CFST:MCNC:PT:SER/PLAS:QN||^182|mg/dl|70_105|H|||F<cr>

HL7v3

 <POLB_IN224200 ITSVersion="XML_1.0" xmlns="urn:hl7-org:v3"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> 
<id root="2.16.840.1.113883.19.1122.7" extension="CNTRL-3456"/>
<creationTime value="200202150930-0400"/>
<!-- The version of the datatypes/RIM/vocabulary used is that of May 2006 -->
<versionCode code="2006-05"/>
<!-- interaction id= Observation Event Complete, w/o Receiver Responsibilities -->
<interactionId root="2.16.840.1.113883.1.6" extension="POLB_IN224200"/>
<processingCode code="P"/>
<processingModeCode nullFlavor="OTH"/>
<acceptAckCode code="ER"/>
<receiver typeCode="RCV">
   <device classCode="DEV" determinerCode="INSTANCE">
     <id extension="GHH LAB" root="2.16.840.1.113883.19.1122.1"/>
     <asLocatedEntity classCode="LOCE">
       <location classCode="PLC" determinerCode="INSTANCE">
         <id root="2.16.840.1.113883.19.1122.2" extension="ELAB-3"/>
       </location>
     </asLocatedEntity>
   </device>
</receiver>
<sender typeCode="SND">
   <device classCode="DEV" determinerCode="INSTANCE">
     <id root="2.16.840.1.113883.19.1122.1" extension="GHH OE"/>
     <asLocatedEntity classCode="LOCE">
       <location classCode="PLC" determinerCode="INSTANCE">
         <id root="2.16.840.1.113883.19.1122.2" extension="BLDG24"/>
       </location>
     </asLocatedEntity>
   </device>
</sender>
<!-- Trigger Event Control Act & Domain Content -->
</POLB_IN224200>
Edited by DocBrown


 

Here's an example of HL7v2 and HL7v3 (XML-based). Can you pick which one you'd rather troubleshoot and view data in? I pick option #1. I honestly wish XML would die. [...]
 I'd have to pick none of the above. Honestly, that first one just looks like gibberish. Not that XML is any better, but still... Just total gibberish.


I'd have to pick none of the above. Honestly, that first one just looks like gibberish. Not that XML is any better, but still... Just total gibberish.

 

It's actually rather simple, and all your information is viewable in a tight, concise manner.

 

Each message that comes across a network interface has an MSH segment, a Message Header. The message header carries information like who sent the message, who's supposed to receive it, what type of information you're going to find in the message, when it was sent, etc.

 

MSH|^~\&|GHH LAB|ELAB-3|GHH OE|BLDG4|200202150930||ORU^R01|CNTRL-3456|P|2.4
 

 

Each line of the message begins with a 3-letter identifier that tells you what information is housed in that line, such as MSH (Message Header), PID (Patient Identification), OBR (Observation Request), or OBX (Observation Result). These lines are known as Segments.

 

Each segment contains a host of fields, subfields, sub-subfields, etc.

 

Each field is separated by the | delimiter, with each field containing a particular set of standardized data.

 

 

MSH|^~\&|GHH LAB|ELAB-3|GHH OE|BLDG4|200202150930||ORU^R01|CNTRL-3456|P|2.4

 

Take for instance the MSH segment, field number 9. This field states the Message Type (what type of information you can expect to see in the rest of the message). MSH-9 in this example is ORU^R01, with the field and subfield delimited by the ^ (caret) symbol.

So MSH-9.1 is the Message Type (ORU), and MSH-9.2 (the subfield) is the Message Event (R01).

 

MSH|^~\&|GHH LAB|ELAB-3|GHH OE|BLDG4|200202150930||ORU^R01|CNTRL-3456|P|2.4

 

The message structure looks like this (a small parsing sketch follows below):

[Message (Type & Event)]
  [Segment]
    [Field]
      [Subfield]
    [Field]
      [Subfield]
  [Segment]
    [Field]
      [Subfield]
    [Field]
      [Subfield]
  [Segment]
    [Field]
      [Subfield]
    [Field]
      [Subfield]
  ...and so on...
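
To make that concrete, here is a minimal C sketch that splits the MSH segment above on the | field separator (not a real HL7 parser: it ignores the ~ repetition separator and the \ escape character, and composite fields would be split on ^ the same way):

#include <stdio.h>

int main(void)
{
    const char *seg = "MSH|^~\\&|GHH LAB|ELAB-3|GHH OE|BLDG4|"
                      "200202150930||ORU^R01|CNTRL-3456|P|2.4";
    const char *start = seg, *p = seg;
    int idx = 0;  /* 0 = segment ID; for MSH, the '|' itself counts as MSH-1 */

    for (;; p++) {
        if (*p == '|' || *p == '\0') {
            if (idx == 0)
                printf("segment: %.*s\n", (int)(p - start), start);
            else
                printf("MSH-%d: %.*s\n", idx + 1, (int)(p - start), start);
            idx++;
            if (*p == '\0')
                break;
            start = p + 1;
        }
    }
    return 0;  /* prints MSH-9: ORU^R01, among the other fields */
}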

 

Anyways, figured I'd share. It's not often I get to speak about my profession, as it's so niche. :P

Edited by DocBrown


So you have to memorize a bunch of three-letter acronyms as well as memorizing standard field layouts in order to make sense of it? I can see how that would be somewhat easier for the expert, but not so much for anyone who hasn't spent their career memorizing such things.


So you have to memorize a bunch of three-letter acronyms as well as memorizing standard field layouts in order to make sense of it? I can see how that would be somewhat easier for the expert, but not so much for anyone who hasn't spent their career memorizing such things.

 

Not quite. There are a load of tools out there that do this for you; most of the tools you use to build the interfaces have this built in. Much like trying to figure out where everything is in the .NET BCL, IntelliSense works wonders, as does the Library Explorer.

 

That being said, most people fall back on the standard specifications released by HL7.org, as this positional data tends to change slightly depending on the version of the standard you're using (again, much like the different .NET Framework versions).

Edited by DocBrown


 

So you have to memorize a bunch of three-letter acronyms as well as memorizing standard field layouts in order to make sense of it? [...]

Not quite. There are a load of tools out there that do this for you. [...]

There... there are also XML tools that can present the data in a clean hierarchical fashion, minus all the tags, and group related items together, so I'm not sure there's really any difference. :P I'm with JT on this one.

I've gotta agree with JT and bact: the v2 is complete gibberish without some form of documentation to assist you in reading it. At least the XML version has human-readable tags that can somewhat assist in figuring out what's going on.

Also, the tool argument is pretty moot, since the purpose of a tool is to abstract away from the underlying serialization format.


Those records reminded me of my days as an Ab Initio developer. In Ab Initio you have record files that define the fields, along with each field's type and whether it is fixed-length or delimited. You could probably do the same thing in C# with annotations and a parser.

 

A record description file might look like this:

 

record monster
{
    int id(3)
    string name("\0")
    int hp(5)
    int attack(3)
    int defense(3)
}("\n")

 

then in the data file you get the following:

001Goblin\000250020005\n
002Spider\000150010007\n
003Rat\000050002002\n
004Goblin King\059000350135\n
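
A minimal sketch of reading one such record, in C for illustration (the post suggests C# with annotations; the helper below is hypothetical, but the layout is the one described above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* read a fixed-width decimal field of 'width' chars and advance the cursor */
static int read_int(const char **p, int width)
{
    char buf[8];
    memcpy(buf, *p, width);
    buf[width] = '\0';
    *p += width;
    return atoi(buf);
}

int main(void)
{
    /* id(3) name("\0") hp(5) attack(3) defense(3), record ends with '\n' */
    const char rec[] = "001Goblin\0" "00250" "020" "005" "\n";
    const char *p = rec;

    int id = read_int(&p, 3);
    const char *name = p;        /* runs to the embedded NUL */
    p += strlen(name) + 1;
    int hp      = read_int(&p, 5);
    int attack  = read_int(&p, 3);
    int defense = read_int(&p, 3);

    printf("id=%d name=%s hp=%d attack=%d defense=%d\n",
           id, name, hp, attack, defense);
    return 0;
}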

 

Generally I find that the data storage format you use depends on your needs. XML was popular because it was well defined and well structured. XML readers and writers are common and easy to use, so you can quickly parse out the data you want with LINQ or XPath. You can also use schemas to define the data and validate documents against them, which makes XML useful for APIs and other B2B applications. It's also useful if you only need a subset of the data in the file.

 

But these days JSON is pretty popular and widely used on the web, which means there are plenty of tools to read and write it. The files are also much smaller than XML, which is useful when you have to worry about the size of the data.
