Sign in to follow this  
Kommi10

html parser

Recommended Posts

A parser receives as input a list of tokens, and keeps internally a stack of states and tokens. It can only perform two actions: to shift, or to reduce.

To shift means to read the next token from the input, and push it on top of the stack. To reduce means to remove the top states from the stack and replace them by a new state.

The parser has a complete series of rules that tell it what to do in any situation, based only on the list of states on the stack (and usually, only the top ones). If the top states allow a reduction, then it happens, otherwise the parser shifts, expecting the next token to solve the situation a little bit.

Whenever a reduction occurs, the parser executes some code (specified by the parser builder) that reads the data associated to the removed states, performs operations on this data, and attaches it to the newly added state.

HTML documents follow a very simple structure (and even more so for XHTML), which allows parsers to be easier to create. But the general principle is still the same (push tokens onto a stack, once you can group tokens into a tag/text/whatever, reduce them, otherwise shift).

Share this post


Link to post
Share on other sites
Cool thanks, that is what I was thinking. Same way the equation 2 * (4 - 5) would be handled. I am guessing I should put the rules in an xml type of file? (I am building my own html parser)

Share this post


Link to post
Share on other sites
Quote:
Original post by Kommi10
Cool thanks, that is what I was thinking. Same way the equation 2 * (4 - 5) would be handled, only with a list of rules.


2 * (4 - 5) would probably handled with rules also.

Share this post


Link to post
Share on other sites
Quote:
Original post by Kommi10
Cool thanks, that is what I was thinking. Same way the equation 2 * (4 - 5) would be handled. I am guessing I should put the rules in an xml type of file? (I am building my own html parser)


What do you mean? XML is not HTML.

Share this post


Link to post
Share on other sites
Quote:
Original post by Arild Fines
I sort of doubt HTML is LR(1). Or even LR(x) for any value of x. An ad-hoc parser would probably be more appropriate for something like HTML.


Why? IIRC, even C can be parsed by an LARL parser (which, I believe, can't parse as many grammars as LR(1)).

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this