benutne

Writing my own XML parser?

For a project I'm working on, I'd like to write my own XML parser. It really only needs to support inner text and attributes (along with node names, of course), and it needs to understand XPath for inner-text and attribute lookups. I'm using XML to describe a series of locations for buttons (top, left, width, height, UV map, etc.), and I don't want to require a specific XML parser with it (MSXML 3.0, blah, blah), so I figured I'd just compile one into my program. Maybe someone can suggest another way to describe and store such data?

Guest Anonymous Poster
There are a bunch of alternatives. My question to you is: have you looked at other XML parsers (besides MSXML) first? Xerces and TinyXML come immediately to mind. Writing your own is an interesting exercise, but I wouldn't recommend it except as a learning experience.

For a major project these files can get very large, as XML is quite wordy. On one of ours, the total set of UI layouts was over 25 MB in XML (okay, it was a really big project). At that point we switched to a compressed binary format. For small projects, though, XML can work out very well.

Yes. Someone else suggested the binary format. Problem is, I have NO IDEA how to go about doing that. I'm comfortable with XML. This will be a commercial project, so anything I use will need to be pretty much free to use.

Tell me more about this "compressed binary format".

Quote:
Original post by petewood
If it doesn't parse all of XML it's not an XML parser. It is not a trivial thing to write.



Maybe I'll pass on writing my own then. Anything else constructive to say?

Guest Anonymous Poster
XML is overrated. It has all the limitations of a text file and none of the advantages of a binary file.

I have to weigh in here ...

1. If it doesn't do ALL of XML it is not an XML 1.0 (or later) compliant parser ... but that doesn't mean it is not USEFUL.

2. It is an extremely large project (HUGE) to make a compliant parser. But making a little dinky parser that loads basic child / value and attribute nodes is not that hard (my friend wrote one in 4 days that suits HIS - and only his - needs).

3. However, the time spent writing your own could equally well be spent learning the abilities and usage of some existing ones, which would give you more features in the long run. (A good exercise, though, could be to spend 2-3 days writing one, THEN switch to a "real" one, to improve your programming a little and gain some knowledge of what goes on inside a parser.)

4. The XPath part of what you want to accomplish is not hard, but it is not absolutely trivial either, which makes me lean more toward a prepackaged solution.

Personally, I used to use Xerces back at version 1.2 or so ... and it was hard to learn at the time, and lacked some features - but it was better than the custom parser we had at my company at the time.

Then I used .NET for the last year, and in 2 days I had tons more features at my fingertips than with any custom solutions.

Then I went back to a cross platform C++ project - and tried to get my friend and me to use Xerces again ... he simply refused because we wasted 2 hours just getting the damn thing loading a file for traversing ... it is NOT easy to learn to start with - and does NOT let you ignore aspects of XML you don't care about at the moment.

If you try TinyXML please report back your opinion of it after just one half to 2 days ... I'd like to know if it's easy enough to try to recommend to others.

Quote:
Original post by petewood
If it doesn't parse all of XML it's not an XML parser. It is not a trivial thing to write.


I suppose that's one way to look at it. But FWIW, on the last game I worked on we used a home grown XML parser that favored speed over being fully XML compliant. It worked just fine for parsing all our in-house XML files.

-John

You should decide whether you're going to want a SAX API, a DOM API, or both for parsing and writing. Both have their advantages but are orthogonal concepts, meaning you'd have to implement them independently, which can double your implementation and testing time.
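
To make the two API styles concrete, here is a rough C++ sketch. It is not taken from any real parser: the names Handler, scan, Node and DomBuilder are made up for illustration, and the scanner only accepts a tiny XML-like subset (no comments, entities, CDATA, or '>' inside attribute values). In the SAX style the scanner fires callbacks as it reads; in the DOM style a tree is kept in memory. The sketch assumes the tree is built by a handler sitting on top of the event scanner, which is one possible way to keep the duplicated work small, not the only way to structure it.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

typedef std::map<std::string, std::string> Attrs;

// SAX style: the scanner pushes events at a handler and keeps no tree.
struct Handler {
    virtual ~Handler() {}
    virtual void startElement(const std::string& name, const Attrs& attrs) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void text(const std::string& chars) = 0;
};

// Scans a restricted XML-like subset: <a k="v">, </a>, <a/> and text.
// Returns false on input it cannot handle. Hypothetical helper, not a real API.
bool scan(const std::string& s, Handler& h) {
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] != '<') {  // character data runs up to the next tag
            std::size_t j = s.find('<', i);
            if (j == std::string::npos) j = s.size();
            h.text(s.substr(i, j - i));
            i = j;
            continue;
        }
        const std::size_t close = s.find('>', i);
        if (close == std::string::npos) return false;
        std::string tag = s.substr(i + 1, close - i - 1);
        i = close + 1;
        if (!tag.empty() && tag[0] == '/') {  // closing tag
            h.endElement(tag.substr(1));
            continue;
        }
        const bool selfClosing = !tag.empty() && tag[tag.size() - 1] == '/';
        if (selfClosing) tag.erase(tag.size() - 1);
        std::size_t p = 0;
        while (p < tag.size() && !std::isspace(static_cast<unsigned char>(tag[p]))) ++p;
        const std::string name = tag.substr(0, p);
        if (name.empty()) return false;
        Attrs attrs;
        for (;;) {  // attributes must look exactly like key="value"
            while (p < tag.size() && std::isspace(static_cast<unsigned char>(tag[p]))) ++p;
            if (p >= tag.size()) break;
            const std::size_t eq = tag.find('=', p);
            if (eq == std::string::npos || eq + 1 >= tag.size() || tag[eq + 1] != '"') return false;
            const std::size_t q = tag.find('"', eq + 2);
            if (q == std::string::npos) return false;
            attrs[tag.substr(p, eq - p)] = tag.substr(eq + 2, q - eq - 2);
            p = q + 1;
        }
        h.startElement(name, attrs);
        if (selfClosing) h.endElement(name);
    }
    return true;
}

// DOM style: a handler that turns the same events into a tree.
struct Node {
    std::string name, text;
    Attrs attrs;
    std::vector<std::unique_ptr<Node> > children;
};

struct DomBuilder : Handler {
    Node root;                // synthetic document node
    std::vector<Node*> open;  // stack of currently open elements
    DomBuilder() { open.push_back(&root); }
    void startElement(const std::string& name, const Attrs& attrs) override {
        std::unique_ptr<Node> n(new Node);
        n->name = name;
        n->attrs = attrs;
        open.back()->children.push_back(std::move(n));
        open.push_back(open.back()->children.back().get());
    }
    void endElement(const std::string&) override {
        if (open.size() > 1) open.pop_back();  // never pop the document node
    }
    void text(const std::string& chars) override {
        assert(!open.empty());
        open.back()->text += chars;
    }
};
```

A caller that only wants to react to elements as they stream past would write its own Handler; a caller that wants random access would pass a DomBuilder to scan and then walk root.children. Note that the builder does not check that closing tags match their opening tags, which a real parser would have to do.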


Unless you have a good reason to write your own, using an existing parser will save you development time, giving you more time for implementing the "fun things".

Cheers
Chris

Quote:
Original post by petewood
Quote:
Original post by benutne
Anything else constructive to say?


I'm not sure you would like it.



Don't see why I wouldn't. I wasn't being an ass or anything. I'm open to suggestions.

I'm using TinyXML, although I made some modifications to it. It's not XML-compliant anymore; I made it case-insensitive because I want my XML-ish files to be editable and easy to understand for anyone.

The real disadvantages of TinyXML are, AFAIK, that it's feature-limited and that it parses the entire XML file before you can manipulate it. For little configuration files that's not a problem, but for big files parsing could take a while and eat lots of memory.

Quote:
Original post by Kurioes
For little configuration files that's not a problem but for big files parsing could take a while and eat lots of memory.

That's what I'm worried about. The minimum system specs here are pretty low: 64 MB of RAM, to be exact. Does anyone know anything about serializing a binary file as was suggested earlier, or where to look to learn?

Quote:
Original post by benutne
Does anyone know anything about serializing a binary file as was suggested earlier or where to look to learn?


Well, it's really very simple to serialize data. It just means taking an object and writing its data out to some stream (e.g. a file stream or other output stream) in such a way that the contents can be retrieved later.

Deserializing from a file is then just reading the file back in and restoring the objects. So, for instance, you could organize your master data file like so:

int numObjects
object1
object2
etc...

Then to read it in, you would read numObjects and either create an array of that size or just set a counter variable that controls the termination of a loop. Then loop, reading one object-sized chunk at a time from the file. If you are doing some kind of database program in which the total data is larger than the available memory, you can reconstruct one object at a time: parse it, see if it's the one you need, and if not, discard it and read in the next one. That way you only ever have one object resident in memory at any time.
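
As a rough illustration of that layout, here is a C++ sketch that writes a count followed by fixed-size records and reads them back. The Button struct and the function names are hypothetical (the fields are borrowed from the original question), and the sketch assumes the file is written and read on the same kind of machine, since it dumps the struct in host byte order with no versioning.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical fixed-size record for one button.
struct Button {
    std::int32_t top, left, width, height;
};

// Write the object count, then each object as one raw fixed-size chunk.
void saveButtons(std::ostream& out, const std::vector<Button>& buttons) {
    assert(buttons.size() <= 0xFFFFFFFFu);  // the count must fit in 32 bits
    const std::uint32_t count = static_cast<std::uint32_t>(buttons.size());
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    for (const Button& b : buttons)
        out.write(reinterpret_cast<const char*>(&b), sizeof b);
}

// Read the count, then loop that many times, reading one object per pass.
std::vector<Button> loadButtons(std::istream& in) {
    std::vector<Button> buttons;
    std::uint32_t count = 0;
    if (!in.read(reinterpret_cast<char*>(&count), sizeof count)) return buttons;
    for (std::uint32_t i = 0; i < count; ++i) {
        Button b{};
        if (!in.read(reinterpret_cast<char*>(&b), sizeof b)) break;  // truncated file
        buttons.push_back(b);
    }
    return buttons;
}

// Streaming variant: only one object is resident in memory while searching.
bool findByWidth(std::istream& in, std::int32_t width, Button& found) {
    std::uint32_t count = 0;
    if (!in.read(reinterpret_cast<char*>(&count), sizeof count)) return false;
    for (std::uint32_t i = 0; i < count; ++i) {
        Button b{};
        if (!in.read(reinterpret_cast<char*>(&b), sizeof b)) return false;
        if (b.width == width) { found = b; return true; }
    }
    return false;
}
```

With a real file you would pass a std::ofstream or std::ifstream opened with std::ios::binary instead of a string stream.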

Does this help at all, or have I totally misunderstood your question?

I'm not going to comment on the usefulness of XML other than to say that its original intent was to let humans read, understand, and edit data that is directly usable by a machine. If you have no need for humans to edit the data by hand, I honestly don't see the use of XML. For me, it's always been more efficient to create a little app through which people can edit the data (which is saved off as a binary file). XML is sometimes a nice way to get up and running quickly (i.e. you can always replace the data with a binary format later, so you can skip writing the data entry app in the short term). XML is just a bit of a performance hog: with binary data you go from data to object in a very direct process, while with XML you have to parse the text first. Text parsing is evil. In the end it just means slightly longer load times (depending on what data you are storing, this can range from completely insignificant to a horrible app killer).

-me

Thanks Palidine. That's EXACTLY what I was looking for. There will be an app that someone can enter information into, and since I know XML, I can manually construct the data for the first few objects to see how they work.

That's pretty much the only reason I looked at XML. And yes, what you said did confuse the hell out of me, but don't let it worry you. A lot of the more abstract concepts do the first time I see them.

So the data file will have like a "header" to tell me how many objects it contains and how large each object is, and then the objects themselves. Right?

Basic approach (edit to suit your needs)...

#include <stdint.h>

/* Fixed-width fields keep each header at exactly 8 bytes on every platform;
   a plain unsigned long is 8 bytes by itself on many 64-bit systems. */
struct mybinaryheader {
    uint32_t headerSize; /* How many bytes is this header? */
    uint32_t nodeFirst;  /* Seek to here to find first node header. */
};

struct mybinarynodeheader {
    uint32_t nodeSize; /* Number of bytes in this node */
    uint32_t nodeNext; /* Seek to here to find next node header */
};



Note the nice and regular format. Read 8 bytes from the file, and you have a header. A node header tells you how large the element is and where to find the next one. The element's data can follow its header directly in the file. My convention is that if "nodeNext" is zero, we are looking at the last node in the file.

I use this format for structures that have to be somewhat forward-compatible; no matter how large the data structure of each node/element is, they always have the same first 8 bytes, which are these header fields. No matter how large the element is, we can reliably find the next element.
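
A possible reader and writer for that layout could look like the C++ sketch below. Several details are assumptions rather than anything stated in the post: the fields are fixed at 32 bits, nodeSize is taken to count payload bytes only (not the header), a nodeFirst of zero is taken to mean "no nodes", offsets are absolute positions from the start of the file, and everything is in host byte order. buildFile and readNodes are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct FileHeader { std::uint32_t headerSize; std::uint32_t nodeFirst; };
struct NodeHeader { std::uint32_t nodeSize;   std::uint32_t nodeNext;  };
static_assert(sizeof(NodeHeader) == 8, "node header must be exactly 8 bytes");

// Build a file image: one file header, then each payload behind its own
// node header. Assumption: nodeSize counts payload bytes only.
std::string buildFile(const std::vector<std::string>& payloads) {
    FileHeader fh;
    fh.headerSize = sizeof(FileHeader);
    fh.nodeFirst = payloads.empty() ? 0 : fh.headerSize;
    std::string out(reinterpret_cast<const char*>(&fh), sizeof fh);
    for (std::size_t i = 0; i < payloads.size(); ++i) {
        NodeHeader nh;
        nh.nodeSize = static_cast<std::uint32_t>(payloads[i].size());
        const std::size_t next = out.size() + sizeof nh + payloads[i].size();
        // A nodeNext of zero marks the last node, as described in the post.
        nh.nodeNext = (i + 1 < payloads.size()) ? static_cast<std::uint32_t>(next) : 0;
        out.append(reinterpret_cast<const char*>(&nh), sizeof nh);
        out += payloads[i];
    }
    return out;
}

// Walk the chain: seek to each node header, read its payload, follow nodeNext.
std::vector<std::string> readNodes(std::istream& in) {
    std::vector<std::string> result;
    FileHeader fh = {0, 0};
    if (!in.read(reinterpret_cast<char*>(&fh), sizeof fh)) return result;
    std::uint32_t pos = fh.nodeFirst;
    while (pos != 0) {
        in.seekg(static_cast<std::streamoff>(pos), std::ios::beg);
        NodeHeader nh = {0, 0};
        if (!in.read(reinterpret_cast<char*>(&nh), sizeof nh)) break;
        std::string payload(nh.nodeSize, '\0');
        if (nh.nodeSize > 0 && !in.read(&payload[0], nh.nodeSize)) break;
        result.push_back(payload);
        pos = nh.nodeNext;  // works even if this reader ignores the payload
    }
    return result;
}
```

Because the reader always jumps to nodeNext rather than relying on how many payload bytes it consumed, an older reader can step over node contents it does not understand.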

Quote:
Original post by Wyrframe
Basic approach (edit to suit your needs)...
*** Source Snippet Removed ***

Note the nice and regular format. Read 8 bytes from the file, and you have a header. A node header tells you how large the element is and where to find the next one. The element's data can follow its header directly in the file. My convention is that if "nodeNext" is zero, we are looking at the last node in the file.

I use this format for structures that have to be somewhat forward-compatible; no matter how large the data structure of each node/element is, they always have the same first 8 bytes, which are these header fields. No matter how large the element is, we can reliably find the next element.



I think I see now. Still a little confusing, but I guess I just need to start coding something to get the hang of it.

Instead of having "nodeNext" encapsulate the recursion, you could go the RIFF way and store a type with each node. Then nodes of type LIST have a "size" that's the sum of all their children, and the contents of the chunk are a sub-type plus the children. Other node types are not recursive, and thus they just contain their content. This is how WAV and AVI files are defined.
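
A simplified C++ sketch of that idea follows. It uses the RIFF conventions of a 4-character id, a 32-bit little-endian size, and a pad byte after odd-sized data, but it is not a complete RIFF reader: it works on an in-memory buffer, treats only LIST as a container, and the chunk ids in the usage note ("btn ", "posn", "note") are invented. makeChunk and walkChunks are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Read a 32-bit little-endian value, the byte order RIFF uses for sizes.
static std::uint32_t readLE32(const std::string& d, std::size_t pos) {
    std::uint32_t v = 0;
    for (int i = 3; i >= 0; --i)
        v = (v << 8) | static_cast<unsigned char>(d[pos + i]);
    return v;
}

// Build one chunk: 4-char id, little-endian size, body, pad byte if odd.
std::string makeChunk(const std::string& id, const std::string& body) {
    assert(id.size() == 4);
    std::string c = id;
    const std::uint32_t n = static_cast<std::uint32_t>(body.size());
    for (int i = 0; i < 4; ++i)
        c.push_back(static_cast<char>((n >> (8 * i)) & 0xFF));
    c += body;
    if (n & 1) c.push_back('\0');
    return c;
}

// Walk the chunks in data[begin, end), recursing into LIST chunks and
// appending one indented line per chunk to out.
void walkChunks(const std::string& data, std::size_t begin, std::size_t end,
                int depth, std::vector<std::string>& out) {
    std::size_t pos = begin;
    while (pos + 8 <= end) {
        const std::string id = data.substr(pos, 4);
        const std::uint32_t size = readLE32(data, pos + 4);
        const std::size_t body = pos + 8;
        if (size > end - body) break;  // malformed: chunk overruns its parent
        const std::string indent(static_cast<std::size_t>(depth) * 2, ' ');
        if (id == "LIST" && size >= 4) {
            // A LIST body is a 4-char sub-type followed by child chunks.
            out.push_back(indent + "LIST(" + data.substr(body, 4) + ")");
            walkChunks(data, body + 4, body + size, depth + 1, out);
        } else {
            out.push_back(indent + id);  // leaf chunk: just content, skipped here
        }
        pos = body + size + (size & 1);  // step over the data and any pad byte
    }
}
```

For example, wrapping two leaf chunks with makeChunk("LIST", "btn " + children) and calling walkChunks(file, 0, file.size(), 0, out) yields one entry for the list and one indented entry per child. Unknown leaf types cost nothing to skip, since the size alone tells the reader where the next chunk starts.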

Anyway, when it comes to small, dinky, FAST and EASY XML parsers, I put mine on the web just the other day (touched up the README today): XMLSCAN.

You can use it in a forward-scanning way, or in a DOM-building way, and it's fast both ways. It's NOT a full XML parser; in fact, the goal was to make something that's simpler and smaller than TinyXML -- it works, it's fast, and it's simple, so mission accomplished. It took me one evening (!) to write and another evening to test, debug, and use for my particle system.

A common binary representation involves writing out a type, a length of the following data, and then the data. The format of the data is dictated by the type. This lets you add additional commands without breaking software that can't recognize it (just make sure you code your parser to skip unrecognized types). This is similar to Wyrframe's example but suppresses the "next node" position, making it suitable for network transmission.

So your parser would do the following:

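// Assumes #include <fstream> and a std::ifstream named file, opened with std::ios::binary.
unsigned short tag = 0;
file.read(reinterpret_cast<char*>(&tag), sizeof tag);

unsigned short len = 0;
file.read(reinterpret_cast<char*>(&len), sizeof len);

// tellg() now points at the first data byte, so this is where the data ends.
std::streampos dataEnd = file.tellg() + std::streamoff(len);
// read data however you want

// at the end fix yourself up to where you should be:
file.seekg(dataEnd);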
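
Wrapped in a loop, the same steps give a parser that skips record types it does not know. The following C++ sketch assumes 16-bit tag and length fields in host byte order, as in the snippet above; the tag value TAG_NAME, the Parsed struct, and the function names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tag value; a real format would define its own set.
const std::uint16_t TAG_NAME = 1;

struct Parsed {
    std::vector<std::string> names;  // payloads of TAG_NAME records
    int skipped = 0;                 // records with a tag we do not recognize
};

// Write one record: type, length of the following data, then the data.
void writeRecord(std::ostream& out, std::uint16_t tag, const std::string& data) {
    assert(data.size() <= 0xFFFF);  // the length field is only 16 bits
    const std::uint16_t len = static_cast<std::uint16_t>(data.size());
    out.write(reinterpret_cast<const char*>(&tag), sizeof tag);
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(data.data(), len);
}

// Read records until the stream runs out, skipping unrecognized types.
Parsed parseRecords(std::istream& in) {
    Parsed result;
    std::uint16_t tag = 0, len = 0;
    while (in.read(reinterpret_cast<char*>(&tag), sizeof tag) &&
           in.read(reinterpret_cast<char*>(&len), sizeof len)) {
        const std::streampos dataEnd = in.tellg() + std::streamoff(len);
        if (tag == TAG_NAME) {
            std::string name(len, '\0');
            if (len > 0 && !in.read(&name[0], len)) break;  // truncated record
            result.names.push_back(name);
        } else {
            ++result.skipped;  // unknown type: ignore its data entirely
        }
        in.seekg(dataEnd);  // land on the next record whatever was read
    }
    return result;
}
```

Seeking to dataEnd after every record is what keeps the reader in step even when a known record type grows extra trailing fields in a later version of the format.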


Quote:
Original post by Teknofreek
Quote:
Original post by petewood
If it doesn't parse all of XML it's not an XML parser. It is not a trivial thing to write.


I suppose that's one way to look at it. But FWIW, on the last game I worked on we used a home grown XML parser that favored speed over being fully XML compliant. It worked just fine for parsing all our in-house XML files.


You had a file format which happened to look like XML.

Was the speed issue really important?

Pete, you're nitpicking, mate. So they use the word "XML" liberally, but hey man, chill out. Bad day at work or something?
<offtopic>i had a shocker, cut my nose on the edge of a coffee mug *DON'T ASK*</offtopic>

peace bro :)
-Dan

Quote:
Original post by petewood
You had a file format which happened to look like XML.
Was the speed issue really important?


Speed was important. And to be quite honest, I personally think it should've been more important. I would have preferred using binary files instead, but alas, that decision was not up to me.

If you're asking if we could have shipped our game successfully using a more elaborate parser then, of course, the answer would be yes. If you're asking if that would've had a substantial effect on our development iteration time or on the end user's load times, then the answer would be "I'm not sure".

In any event, I did understand what you were trying to get at in your initial post, but the way you said it made it sound a little grim. It made it sound as if using anything less than a fully-compliant XML parser for your game files would lead to inevitable failure. I just wanted to point out that, on the pragmatic side, it's certainly possible to ship a game that uses 'XML' files with a parser that's less than fully compliant. As long as you're in control of both the creation and parsing of all your files you can restrict yourself however you like :)

-John

Guest Anonymous Poster
I used XML in my strategy game for its rules, meaning what units there are, how fast they are, how strong, how they attack and so on. I wanted to support different eras of combat, and I wanted to enable mods. Now anyone with a text editor has all the tools, and if I write a schema, they should be able to validate their changes before running the game.

My saved games are still in binary format. I've thought about adding an XML format simply to make loading old games easy. Changing data structures basically hoses my simple binary format.

I followed the "write my own parser" route because the existing ones all seemed too hard to use. I wanted scanf-like simplicity. And the big boys dwarf my game, which runs on a Palm OS PDA. Eventually I realized I should just extract the parser and make it available to save other people time. It's called Ali and it's on SourceForge at http://ali.sourceforge.net.

With it, to read a monster's hp stored in the element <hp>100</hp> you can write something like

ali_in(doc, monsterN, "^e%d", 0, "hp", &monster->hp);
