XML parser question

3 comments, last by thre3dee 15 years, 9 months ago
Hi all, I'm writing an XML parser for my framework, and I'm doing it myself so that it ties in to my memory management and is consistent with the rest of my framework. I'm using TinyXML as a guide to how I should resolve any XML ambiguities. The only one I have at the moment is this: if a text element has nothing but whitespace, should it be included in the DOM? For example, take a simple chunk of XML like the following:
<root>
     <element>
     </element>
     <element2>Hello</element2>
</root>
Should I have:
+ $root
   - "         "
   + element
      - "  "
   - "   "
   + element2
      - "Hello"
Or should the blank 'gaps' in elements be discarded? I have a feeling that the ActionScript 2.0 XML parser left them in, which was a pain in the ass.
Quote:Original post by thre3dee
The only one I have at the moment is if a text element has nothing but whitespace, should it be included in the DOM?


Yes. Whitespace is a perfectly legitimate and useful piece of text. The fact that you're using XML to specify a DOM as opposed to a piece of marked up text isn't really relevant to the XML parser, so it can't assume that whitespace is insignificant. However, if you want to strip out empty text nodes after parsing, because you know it has no semantic meaning to the consumer of the data, go for it.
However, it is kind of suspicious that, in your example, the whitespace happens to make the element opening and closing line up. Normally you'd see them on the same line, as in:

<element></element>

for a truly empty string.
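
To make the "keep it while parsing, strip it afterwards if you know it is meaningless" idea concrete, here is a minimal sketch. It uses Python's standard xml.dom.minidom purely as a stand-in for a whitespace-preserving parser; it is not the poster's parser or TinyXML, and the helper names describe and strip_blank_text are made up for this illustration.

```python
from xml.dom import minidom

# The example document from the question, indentation included.
XML = (
    "<root>\n"
    "     <element>\n"
    "     </element>\n"
    "     <element2>Hello</element2>\n"
    "</root>"
)

XML_WS = " \t\r\n"  # the four characters XML itself treats as white space


def describe(node):
    """Return the direct children of `node` as readable strings."""
    out = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            out.append(repr(child.data))        # text shown verbatim
        elif child.nodeType == child.ELEMENT_NODE:
            out.append("<" + child.tagName + ">")
    return out


def strip_blank_text(node):
    """Recursively remove text nodes that hold nothing but XML white space.

    This is an application-level pass run after parsing, not parser behaviour.
    """
    for child in list(node.childNodes):         # copy: the list is mutated
        if child.nodeType == child.TEXT_NODE:
            if not child.data.strip(XML_WS):
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_blank_text(child)


root = minidom.parseString(XML).documentElement
before = describe(root)    # whitespace-only text nodes sit between the elements
strip_blank_text(root)
after = describe(root)     # only <element> and <element2> are left

# A truly empty element has no text child at all, blank or otherwise.
empty = minidom.parseString("<element></element>").documentElement
```

The point of keeping the stripping in a separate function is that the consumer of the data decides whether to call it; the parser itself never has to guess.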

If you decide to preserve the whitespace, you have to do it verbatim, with any included tabs and newlines.
Discarding whitespace, or deciding that an element is in some sense "empty" and should be treated differently, is an obviously application-specific and element-specific decision: a parser should conservatively preserve whitespace in order to support any possible usage.

As a partial solution, you might want to support the xml:space attribute (§2.10, "White Space Handling", in the XML recommendation):

Quote:
The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute.


Collapsing or removing whitespace-only text nodes might be the default for your library, with "preserve" allowing for an override.
Note that you can often add "xml:space='preserve'" to the appropriate elements implicitly with a DTD or XML Schema, without altering and bloating the documents.
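
A rough sketch of the "strip by default, let xml:space='preserve' override" policy described above. It is written against Python's standard xml.dom.minidom only for illustration; strip_blank_text and its preserve parameter are hypothetical names, and stripping as the default mode is this library-design assumption, not something the XML recommendation mandates.

```python
from xml.dom import minidom

XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace bound to the xml: prefix
XML_WS = " \t\r\n"                               # XML white space characters


def strip_blank_text(node, preserve=False):
    """Drop whitespace-only text nodes unless xml:space="preserve" applies.

    `preserve` is the setting inherited from the ancestors. An element's own
    xml:space attribute overrides it for that element and its descendants.
    """
    for child in list(node.childNodes):          # copy: the list is mutated
        if child.nodeType == child.TEXT_NODE:
            if not preserve and not child.data.strip(XML_WS):
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            mode = child.getAttributeNS(XML_NS, "space")  # "" when absent
            if mode == "preserve":
                child_preserve = True
            elif mode == "default":
                child_preserve = False
            else:
                child_preserve = preserve        # inherit from the parent
            strip_blank_text(child, child_preserve)


doc = minidom.parseString(
    "<root>\n"
    "  <note xml:space='preserve'>  <b>hi</b>  </note>\n"
    "  <item>\n  </item>\n"
    "</root>"
)
strip_blank_text(doc)      # start at the document so the root's attribute counts

root = doc.documentElement
note, item = [c for c in root.childNodes if c.nodeType == c.ELEMENT_NODE]
# note keeps its blank text nodes around <b>; item and root lose theirs.
```

Starting the walk at the document node rather than the root element means an xml:space attribute on the root itself is honoured by the same code path as every other element.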

Omae Wa Mou Shindeiru

Thanks. Yeah I should have a good look at the XML spec.

