# parsing text?

## Recommended Posts

McZ    139
well... I can't figure out a good way to parse a text file. I'm now working on a simple XML parser.. and yes, I do know there are free libs on the net, but I still want to make my own for the purpose of learning how to parse a text file. Maybe I'll extend the parser later so it can parse other text files too.

How do I scan the text file for the things to look at, and how do I know when I have found them? e.g.

```xml
<?xml version="1.0" ?>
<test>
    <testchild ID="4" />
    <testchild2>something</testchild2>
</test>
```

I know I should first look for the < sign and then store the characters after it until I find the > sign, but the testchild tag also contains 'ID="4"'. How do I pick that out? And how do I know whether I should convert the ID attribute's value to an integer, a float, or a text string?

##### Share on other sites
jods    367
Well, parsing (and doing it well) is a complex subject; you will find a lot of books about it. Let me give you a few tips to get you started (an XML parser is not that difficult):

You should use a layered architecture.

Layer 1 manages the input stream. It's responsible for getting the chars out of the file on disk, or out of the string in memory. (If you work with files on disk, don't forget to buffer your reads.)

Layer 2 is a tokenizer. It transforms the input stream of chars into a stream of tokens. Example XML tokens are '<', '>', '=', identifiers, '<?', '?>'. Usually the tokens can be described with regular expressions, which implies that they can be recognized using a deterministic finite state machine (DFSM). You can code it by hand if the tokens are simple (XML should be ok) or use tables for it (there are tools to generate such tables).
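A hand-coded layer-2 tokenizer along these lines can be a small switch over the current character. The sketch below is my own illustration (the class and token names are made up, and the two-character '<?' and '?>' tokens are omitted for brevity):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical token kinds for a minimal XML-ish tokenizer.
enum class TokenKind { LAngle, RAngle, Slash, Equals, Identifier, End };

struct Token {
    TokenKind kind;
    std::string text;  // the matched characters (useful for identifiers)
};

class Tokenizer {
public:
    explicit Tokenizer(std::string input) : src_(std::move(input)) {}

    Token next() {
        // Skip whitespace between tokens.
        while (pos_ < src_.size() && std::isspace((unsigned char)src_[pos_]))
            ++pos_;
        if (pos_ >= src_.size()) return {TokenKind::End, ""};

        char c = src_[pos_];
        switch (c) {
            case '<': ++pos_; return {TokenKind::LAngle, "<"};
            case '>': ++pos_; return {TokenKind::RAngle, ">"};
            case '/': ++pos_; return {TokenKind::Slash, "/"};
            case '=': ++pos_; return {TokenKind::Equals, "="};
            default: {
                // Accumulate an identifier: letters/digits up to a delimiter.
                std::string text;
                while (pos_ < src_.size() && std::isalnum((unsigned char)src_[pos_]))
                    text += src_[pos_++];
                if (text.empty()) { ++pos_; return next(); }  // skip unknown char
                return {TokenKind::Identifier, text};
            }
        }
    }

private:
    std::string src_;
    std::size_t pos_ = 0;
};
```

This is exactly the hand-coded DFSM case: the switch is the state dispatch, and the identifier loop is the one state that consumes more than one character.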

Layer 3 is the semantic analyzer. It transforms the stream of tokens into a data structure that fits your DOM. In the case of XML you would transform the tokens into a tree (probably, or maybe you would just visit the tree but not generate it). Here the technique to use depends on the language to recognize. XML is simple enough to be recognized with a DFSM with a stack, which you should be able to code by hand. More complex languages (programming languages for example) use LL, LR or LALR parsers. Google it if you are interested, it's too big for me to sum it up here.
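As a tiny illustration of the "DFSM with a stack" idea, here is a made-up minimal check that a stream of already-tokenized tag names is properly nested. It is just the stack discipline, not a DOM builder:

```cpp
#include <stack>
#include <string>
#include <vector>

// Minimal well-formedness check over an already-tokenized tag stream.
// Each entry is a tag name; closing tags are prefixed with '/' (an
// illustrative convention for this sketch, not a real XML API).
bool tagsBalanced(const std::vector<std::string>& tags) {
    std::stack<std::string> open;
    for (const std::string& t : tags) {
        if (!t.empty() && t[0] == '/') {
            // Closing tag: must match the most recently opened tag.
            if (open.empty() || open.top() != t.substr(1)) return false;
            open.pop();
        } else {
            open.push(t);  // Opening tag: remember it until it is closed.
        }
    }
    return open.empty();  // Every tag must eventually be closed.
}
```

A real layer 3 would build (or visit) a tree node on each push/pop instead of only checking the nesting.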

Layer 4 is application specific. Now you have parsed your text, do what you want with it!

Layer 2 and 3 can be generated from a language description, using tools. Maybe the best known examples are lex and yacc. Google it if you want to know more about it (but be aware that if you have no idea what these tools are, they're not easy to get used to at first).

One difficult part is the error handling (which means: reporting and maybe trying to continue the parsing). I'll let you think about it by yourself!

##### Share on other sites
joelmartinez    338
Quote:
 Original post by jods: One difficult part is the error handling (which means: reporting and maybe trying to continue the parsing). I'll let you think about it by yourself!

Just remember not to break XML by continuing to parse if you find a malformed XML construct ... if it's malformed, throw an exception and get out.

:-D

##### Share on other sites
McZ    139
Thanks a lot, jods :) That was a really good explanation; now I understand how to start :)

just one question

when I send the buffer to the tokenizer, how does that one work? And the layer 3 thing?

something like this?

```
Tokens:   '<' '>' '<?' '?>' and so on..
StartTag: <SomeLabel someattribute="0">
EndTag:   </SomeLabel>
Value:    the string between the start tag and the end tag
Entry:    StartTag Value EndTag
```

```cpp
struct xml_attribute {
    string name;
    string value;
};

struct xml_tag {
    string name;
    list<xml_attribute*> attributes;
};

struct xml_entry {
    xml_tag tag_name;
    string value;
    list<xml_entry*> children;
};

xml_entry *getEntry( char *buffer )
{
    cTokenizer *tokenizer = new cTokenizer( buffer, tokens );

    // find the first token
    while( !tokenizer->findtoken() ) {
        tokenizer->next();
    }

    // check if token is a correct one
    if( bad token ) {
        // error
    }

    // create the xml entry to return
    xml_entry *entry = new xml_entry;

    // read the tag string
    char tagbuffer[256];
    while( !tokenizer->findtoken() ) {
        // store the character until the next token
        tagbuffer[i] = tokenizer->next();
    }

    // get the name of the tag and the attributes and their values
    analyzetag( tagbuffer, &entry->tag_name );

    // store the value string
    while( !tokenizer->findtoken() ) {
        // store the character until the next token
        entry->value = tokenizer->next();
    }

    // read in the new tag
    ...

    // check the tag
    if( tag is a start tag of a child to this entry ) {
        // read child entries
        ...
    }

    if( tag is the correct endtag for the current start tag ) {
        // return the entry
    } else {
        // return error
    }
}
```

##### Share on other sites
Verg    450
Since it appears you're using C++, why not check out boost::spirit? It would take care of the grunt work of parsing for you, allowing you to concentrate on the XML portion.

##### Share on other sites
agh. Do not.

Boost.spirit is all well and fabulous when it works, but if something in your code is wrong, the error message will more often than not be completely unreadable.

If anybody knows of a straightforward, drop-in regular expression library, that would probably be much more helpful. (boost.regex's documentation also seems to be written exclusively for consumption by programmers with a thorough understanding of C++'s nuances)

##### Share on other sites
evillive2    779
IMHO, if you are doing this to learn, then it might be a good idea to do some of it using standard C file I/O until you understand what is going on behind the scenes. For the most part a text file is made up of single words (or tokens), strings (multiple words), and numbers (integer, float, whatever). Let's take a word, or token, for example: a word is a run of characters with whitespace on each side. Using standard C file I/O we can use the getc function to read a single character at a time; the isspace function helps out too. Take a look at this, which was modified from a text-based MUD that uses text files for all of its data storage:

```c
/*
 * Read one word (make sure not to overrun the static sized buffer).
 */
char *fread_word( FILE *fp )
{
    static char word[MAX_WORD_LENGTH];
    char *pword;
    char cEnd;

    /*
     * skip white space to find the first letter.
     */
    do
    {
        cEnd = getc( fp );
    }
    while ( isspace( cEnd ) );

    /*
     * This part makes it so a word can be multiple words
     * as long as it is surrounded by ' or ":
     * example:
     * 'one word'
     * "user name"
     * This could of course be changed to include < & >
     */
    if ( cEnd == '\'' || cEnd == '"' )
    {
        pword   = word;
    }
    else
    {
        word[0] = cEnd;
        pword   = word+1;
        cEnd    = ' ';
    }

    /*
     * read in one char at a time until we find the
     * ending char. again, easily changed to < & >
     */
    for ( ; pword < word + MAX_WORD_LENGTH; pword++ )
    {
        *pword = getc( fp );
        if ( cEnd == ' ' ? isspace(*pword) : *pword == cEnd )
        {
            if ( cEnd == ' ' )
                ungetc( *pword, fp );
            *pword = '\0';
            return word;
        }
    }

    // error: report that the word is too long and probably dump out
    // of the file. Hope and pray you don't get here :)

    return NULL;
}
```

Now, I know there are better ways to do this that use dynamically sized strings etc., but this will work for getting to know what's going on. Reading in a variable-length string can be done very similarly: just use a larger buffer and decide on a string-terminating character. The text-based MUD uses ~ to terminate its strings in the files. Once the string is read into the buffer, it allocates memory for it, copies the string into that memory, and returns a pointer to the newly allocated space. Using this information, you could have a text file that looks like:
```
START
USER your name here~
PASSWORD your password here~
UID 10928
END
```

Obviously you could use atoi/atol etc. to convert the UID to a number depending on the variable type of the UID or you could write your own number parsing functions.

Hope that helps a little. Text-based data files can be a pain to implement because of their size, but I think they are rewarding in the end because, in my experience, they tend to be more flexible than binary files. With text you can have a token followed by a value and let the program figure out where that value belongs, or throw it out if it doesn't recognize the token. With binary, you tend to lock yourself into a file format that is rigid and unforgiving of errors or of changes from one version to the next. But that is purely opinion, and I don't know how factual it actually is.

##### Share on other sites
antareus    576
OP:
In one of my junior-level classes we had to implement a top-down parser. You're given a grammar, e.g.:
```
// bad example...I blame my friend's wedding...
name := firstName lastName | lastName
firstName := alphanumericChar [alphanumericChar*]
lastName := alphanumericChar alphanumericChar [alphanumericChar*]
```

So to parse that you'd write three functions, one for each production (rule) in the grammar. Each production's function is very short and simple:
```cpp
string ParseName(string& s)
{
    string firstName = ParseFirstName(s);
    string lastName = ParseLastName(s);
    return firstName + " " + lastName;
}

string ParseFirstName(string& s)
{
    string result = "";
    char c;
    while (isalnum(c = GetNextToken()))
    {
        result += c;
    }
    if (result.empty())
        throw ParseException();
    return result;
}
```

The tokenizer moves forward one token at a time (this could be a character, or it could be an entire word -- it depends on what you're parsing). You can write functions to branch based on what is waiting in the tokenizer's buffer to handle more complex scenarios. Top-down parsing isn't always appropriate but it is an excellent introduction to the rather interesting (IMO) world of parsing.
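A self-contained sketch of that recursive-descent idea for the name grammar above could look like this. The class and function names are mine, and the single-name alternative of the grammar is omitted to keep it short:

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Recursive-descent parser for the toy grammar above:
//   name      := firstName lastName
//   firstName := one or more alphanumeric chars
//   lastName  := two or more alphanumeric chars
class NameParser {
public:
    explicit NameParser(std::string input) : src_(std::move(input)) {}

    // name := firstName lastName  (one function per production)
    std::pair<std::string, std::string> ParseName() {
        std::string first = ParseWord(1);
        SkipSpaces();
        std::string last = ParseWord(2);
        return {first, last};
    }

private:
    // Shared body of the word productions: a run of alphanumeric
    // characters that must be at least minLen long.
    std::string ParseWord(std::size_t minLen) {
        std::string result;
        while (pos_ < src_.size() && std::isalnum((unsigned char)src_[pos_]))
            result += src_[pos_++];
        if (result.size() < minLen)
            throw std::runtime_error("parse error: word too short");
        return result;
    }

    void SkipSpaces() {
        while (pos_ < src_.size() && std::isspace((unsigned char)src_[pos_]))
            ++pos_;
    }

    std::string src_;
    std::size_t pos_ = 0;
};
```

Each production maps to one function, and failure is signalled with an exception, which is the usual shape of a hand-written top-down parser.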

evillive2:
It is 2004 and you're suggesting people parse text while opening up the possibility of buffer overflows?

On a lighter note, I need to buy the Dragon book.

##### Share on other sites
evillive2    779
Quote:
 It is 2004 and you're suggesting people parse text while opening up the possibility of buffer overflows?

I only mentioned it because it is a good way to learn what's going on, and he mentioned he wants to do this himself as a learning experience. While I imagine he will end up using an XML parser, since they tend to be quite easy to use and much safer, at least this will give him some insight into what may be going on behind the scenes. I didn't mean to misinform anyone or give bad information; I know this has worked for me in the past and helped me understand a lot about file I/O, so I figured I would share. I did mention the possibility of a buffer overrun, but I also mentioned that there are better ways to do the same thing. Hope I didn't ruin your whole year by posting that; it is still 2004, you know.

##### Share on other sites
furby100    102
Quote:
 Original post by antareus: It is 2004 and you're suggesting people parse text while opening up the possibility of buffer overflows?

I fail to see how the year is relevant. And, err, crapface.

##### Share on other sites
Matei    190

Sorry, didn't read the post carefully enough :(.

[Edited by - Matei on August 8, 2004 12:09:36 PM]

##### Share on other sites
persil    199
Didn't he say he wanted to learn by doing it?

I personally find it useful to implement simple systems like that; once they work, you have gained some interesting experience.

But if I'm not mistaken, that lib suggestion is good :)