Back to For Beginners

html parser

For Beginners

Started by Kommi10 November 04, 2005 12:45 PM

7 comments, last by Roboguy 18 years, 5 months ago

Kommi10

122

Author

November 04, 2005 12:45 PM

Theorically speaking, how is an html parser writen. I have my ideas about it, but was wondering if anyone can give a general high level explanation of how one functions.

ToohrVyk

1,596

November 04, 2005 01:03 PM

A parser receives as input a list of tokens, and keeps internally a stack of states and tokens. It can only perform two actions: to shift, or to reduce.

To shift means to read the next token from the input, and push it on top of the stack. To reduce means to remove the top states from the stack and replace them by a new state.

The parser has a complete series of rules that tell it what to do in any situation, based only on the list of states on the stack (and usually, only the top ones). If the top states allow a reduction, then it happens, otherwise the parser shifts, expecting the next token to solve the situation a little bit.

Whenever a reduction occurs, the parser executes some code (specified by the parser builder) that reads the data associated to the removed states, performs operations on this data, and attaches it to the newly added state.

HTML documents follow a very simple structure (and even more so for XHTML), which allows parsers to be easier to create. But the general principle is still the same (push tokens onto a stack, once you can group tokens into a tag/text/whatever, reduce them, otherwise shift).

Blog — Facebook

Kommi10

122

Author

November 04, 2005 01:25 PM

Cool thanks, that is what I was thinking. Same way the equation 2 * (4 - 5) would be handled. I am guessing I should put the rules in an xml type of file? (I am building my own html parser)

Roboguy

794

November 04, 2005 01:27 PM

Quote:Original post by Kommi10
Cool thanks, that is what I was thinking. Same way the equation 2 * (4 - 5) would be handled, only with a list of rules.

2 * (4 - 5) would probably handled with rules also.

http://www.roboguy.net(WIP) - lisperati - SICP - Haskell - Python - OCaml - Lambda the Ultimate - Good Math, Bad Math - Wiki (not Wikipedia) - Pure - Term-Rewriting Functional Language

Kommi10

122

Author

November 04, 2005 01:28 PM

yeah just realised that and edited my post

Roboguy

794

November 04, 2005 04:07 PM

Quote:Original post by Kommi10
Cool thanks, that is what I was thinking. Same way the equation 2 * (4 - 5) would be handled. I am guessing I should put the rules in an xml type of file? (I am building my own html parser)

What do you mean? XML is not HTML.

http://www.roboguy.net(WIP) - lisperati - SICP - Haskell - Python - OCaml - Lambda the Ultimate - Good Math, Bad Math - Wiki (not Wikipedia) - Pure - Term-Rewriting Functional Language

ToohrVyk

1,596

November 04, 2005 04:17 PM

You could take a look at how existing parser generators (such as yacc) work to transform an LR(1) grammar into a working parser (basically by creating the rules).

Blog — Facebook

Arild Fines

968

November 05, 2005 01:25 PM

I sort of doubt HTML is LR(1). Or even LR(x) for any value of x. An ad-hoc parser would probably be more appropriate for something like HTML.

--AnkhSVN - A Visual Studio .NET Addin for the Subversion version control system.[Project site] [IRC channel] [Blog]

Roboguy

794

November 05, 2005 01:57 PM

Quote:Original post by Arild Fines
I sort of doubt HTML is LR(1). Or even LR(x) for any value of x. An ad-hoc parser would probably be more appropriate for something like HTML.

Why? IIRC, even C can be parsed by an LARL parser (which, I believe, can't parse as many grammars as LR(1)).

http://www.roboguy.net(WIP) - lisperati - SICP - Haskell - Python - OCaml - Lambda the Ultimate - Good Math, Bad Math - Wiki (not Wikipedia) - Pure - Term-Rewriting Functional Language

html parser

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

html parser

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines