html parser

Started by
7 comments, last by Roboguy 18 years, 5 months ago
Theorically speaking, how is an html parser writen. I have my ideas about it, but was wondering if anyone can give a general high level explanation of how one functions.
Advertisement
A parser receives as input a list of tokens, and keeps internally a stack of states and tokens. It can only perform two actions: to shift, or to reduce.

To shift means to read the next token from the input, and push it on top of the stack. To reduce means to remove the top states from the stack and replace them by a new state.

The parser has a complete series of rules that tell it what to do in any situation, based only on the list of states on the stack (and usually, only the top ones). If the top states allow a reduction, then it happens, otherwise the parser shifts, expecting the next token to solve the situation a little bit.

Whenever a reduction occurs, the parser executes some code (specified by the parser builder) that reads the data associated to the removed states, performs operations on this data, and attaches it to the newly added state.

HTML documents follow a very simple structure (and even more so for XHTML), which allows parsers to be easier to create. But the general principle is still the same (push tokens onto a stack, once you can group tokens into a tag/text/whatever, reduce them, otherwise shift).
Cool thanks, that is what I was thinking. Same way the equation 2 * (4 - 5) would be handled. I am guessing I should put the rules in an xml type of file? (I am building my own html parser)
Quote:Original post by Kommi10
Cool thanks, that is what I was thinking. Same way the equation 2 * (4 - 5) would be handled, only with a list of rules.


2 * (4 - 5) would probably handled with rules also.
yeah just realised that and edited my post
Quote:Original post by Kommi10
Cool thanks, that is what I was thinking. Same way the equation 2 * (4 - 5) would be handled. I am guessing I should put the rules in an xml type of file? (I am building my own html parser)


What do you mean? XML is not HTML.
You could take a look at how existing parser generators (such as yacc) work to transform an LR(1) grammar into a working parser (basically by creating the rules).
I sort of doubt HTML is LR(1). Or even LR(x) for any value of x. An ad-hoc parser would probably be more appropriate for something like HTML.
--AnkhSVN - A Visual Studio .NET Addin for the Subversion version control system.[Project site] [IRC channel] [Blog]
Quote:Original post by Arild Fines
I sort of doubt HTML is LR(1). Or even LR(x) for any value of x. An ad-hoc parser would probably be more appropriate for something like HTML.


Why? IIRC, even C can be parsed by an LARL parser (which, I believe, can't parse as many grammars as LR(1)).

This topic is closed to new replies.

Advertisement