HTML Parser

Started by
13 comments, last by blanky 18 years, 7 months ago
Does anyone know of a good HTML parser library for C/C++, if there even is one? Something like TinyXML. I'd like to do what Counter-Strike does: when you join a server, it shows you an HTML file. Thanks.
I'd look at Gecko, Mozilla's layout engine.
You can use CRegExp regular expressions library. In some cases it can help you.
ai-blog.org: AI is discussed here.
Writing an HTML parser is quite easy in theory; in practice you have to take into account pages that are not well formed (that is, pages full of errors such as unclosed tags or mismatched pairs). Incredibly, modern browsers will accept almost anything!

As you can imagine, a bad page can be interpreted in different ways.

On the other hand, a valid XHTML Strict document (the right standard for web pages) is itself an XML document, so you could implement your XHTML parser on top of an XML parser.

If you are only searching for a specific section, you can use regular expressions, but I don't know of any good libraries in C++ (perhaps boost::).
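To make the "specific section" case concrete, here is a minimal sketch that pulls the page title out with a regular expression. It assumes a C++11 compiler and uses the standard `std::regex` rather than any of the libraries named in this thread; `extract_title` is a hypothetical helper name, not part of any library.

```cpp
#include <regex>
#include <string>

// Returns the text between <title> and </title>, or an empty string if
// no title element is found. Hypothetical helper for illustration only.
std::string extract_title(const std::string& html) {
    // [\s\S] matches any character including newlines; *? keeps the match
    // as short as possible so it stops at the first closing tag.
    static const std::regex title_re(R"(<title[^>]*>([\s\S]*?)</title\s*>)",
                                     std::regex::icase);
    std::smatch m;
    if (std::regex_search(html, m, title_re)) {
        return m[1].str();
    }
    return std::string();
}
```

This only works because a title cannot contain another title. For elements that nest, a pattern like this is not enough.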
CRegExp is just good. Sources 25kB, plain C++.
Thanks. However, I've searched my butt off and I can't find where to download either library. Can you please give me links? I would really appreciate it. I'm interested in both Gecko and CRegExp.
HTML is recursive, so at best CRegExp could tokenize it. You still have to parse the tokens into a tree and deal with poorly formed documents, which leaves a LOT of work.
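A rough sketch of those two stages, assuming C++14: a hand-written tokenizer (standing in for whatever regex library you pick) and a tree builder that keeps a stack of open elements. All names here (`Token`, `Node`, `tokenize`, `build_tree`) are made up for illustration. The recovery rule (a close tag closes the nearest matching open element, a stray one is ignored) is just one simple choice, not what browsers actually do. Attributes, comments, and void elements such as `<br>` are not handled.

```cpp
#include <cctype>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Token {
    enum Kind { Open, Close, Text } kind;
    std::string value;  // lower-cased tag name, or the raw text
};

// Splits the input into open tags, close tags, and text runs.
std::vector<Token> tokenize(const std::string& html) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] == '<') {
            std::size_t end = html.find('>', i);
            if (end == std::string::npos) {  // unterminated tag: keep as text
                out.push_back({Token::Text, html.substr(i)});
                break;
            }
            bool closing = (i + 1 < end && html[i + 1] == '/');
            std::size_t p = i + (closing ? 2 : 1);
            std::string name;
            while (p < end && std::isalnum(static_cast<unsigned char>(html[p]))) {
                name += static_cast<char>(
                    std::tolower(static_cast<unsigned char>(html[p])));
                ++p;
            }
            // Tags without a name (comments, doctype) are silently dropped.
            if (!name.empty())
                out.push_back({closing ? Token::Close : Token::Open, name});
            i = end + 1;
        } else {
            std::size_t end = html.find('<', i);
            if (end == std::string::npos) end = html.size();
            out.push_back({Token::Text, html.substr(i, end - i)});
            i = end;
        }
    }
    return out;
}

struct Node {
    std::string name;  // element name, "#text", or "#root"
    std::string text;  // only used by "#text" nodes
    std::vector<std::unique_ptr<Node>> children;
};

// Builds a tree from the tokens, tolerating unclosed and stray tags.
std::unique_ptr<Node> build_tree(const std::vector<Token>& tokens) {
    auto root = std::make_unique<Node>();
    root->name = "#root";
    std::vector<Node*> open{root.get()};  // stack of currently open elements
    for (const Token& t : tokens) {
        if (t.kind == Token::Close) {
            // Close the nearest matching element and everything opened
            // after it; if nothing matches, ignore the stray close tag.
            for (std::size_t k = open.size(); k-- > 1;) {
                if (open[k]->name == t.value) { open.resize(k); break; }
            }
            continue;
        }
        auto n = std::make_unique<Node>();
        Node* raw = n.get();
        if (t.kind == Token::Text) { n->name = "#text"; n->text = t.value; }
        else                       { n->name = t.value; }
        open.back()->children.push_back(std::move(n));
        if (t.kind == Token::Open) open.push_back(raw);
    }
    return root;
}
```

For example, `build_tree(tokenize("<b><i>hi</b>after"))` puts the unclosed `i` inside `b` and attaches the trailing text to the root. Even this much is a long way from rendering a page, which is the "LOT of work" part.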
Quote:Original post by cpp forever
You can use CRegExp regular expressions library. In some cases it can help you.

Quote:Original post by cpp forever
CRegExp is just good. Sources 25kB, plain C++.

HTML is not a regular language.
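A small illustration of what that means in practice, using the standard `std::regex` as a stand-in (I am assuming CRegExp behaves like any other regex engine here). The function names are hypothetical. Neither the lazy nor the greedy pattern can pair a `<div>` with its own `</div>` in general, while a few lines that count nesting depth can.

```cpp
#include <regex>
#include <string>

// Captures the content of the first <div> with a regex, lazy or greedy.
std::string div_by_regex(const std::string& html, bool greedy) {
    const std::regex re(greedy ? "<div>(.*)</div>" : "<div>(.*?)</div>");
    std::smatch m;
    return std::regex_search(html, m, re) ? m[1].str() : std::string();
}

// Captures the content of the first <div> by tracking nesting depth.
std::string div_by_counting(const std::string& html) {
    const std::string open = "<div>", close = "</div>";
    std::size_t start = html.find(open);
    if (start == std::string::npos) return std::string();
    start += open.size();
    int depth = 1;  // we are inside the first <div>
    std::size_t pos = start;
    while (pos < html.size()) {
        if (html.compare(pos, open.size(), open) == 0) {
            ++depth;
            pos += open.size();
        } else if (html.compare(pos, close.size(), close) == 0) {
            if (--depth == 0) return html.substr(start, pos - start);
            pos += close.size();
        } else {
            ++pos;
        }
    }
    return std::string();  // unbalanced input
}
```

On `<div>a<div>b</div>c</div>` the lazy regex stops at the inner close tag and captures `a<div>b`. On `<div>a</div><div>b</div>` the greedy regex runs through to the last close tag and captures `a</div><div>b`. The counting version returns `a<div>b</div>c` and `a` respectively.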
Quote:Original post by Dobbs
HTML is recursive, so at best CRegExp could tokenize it. You still have to parse the tokens into a tree and deal with poorly formed documents, which leaves a LOT of work.


Try boost::spirit. Closures take care of recursion. Very simple and elegant. If you read the docs and play with trivial examples first, it's quite easy to learn.

Okay, thanks. However, I'm quite lost; I don't understand what all of this is. I want a parser so that I can display an HTML page in a window, kind of like a browser, but one that just shows the page. What I'm thinking of might be a control, but I'm not sure. Also, is boost::spirit an HTML parser? And where can I find Gecko and CRegExp?
