Find Elements of a HTML List using TinyXML and C++

Engines and Middleware Programming

Started by CrimsonGT April 27, 2010 05:24 AM

3 comments, last by savagelook 13 years, 11 months ago

130

Author

April 27, 2010 05:24 AM

I just added TinyXml to my project, but I am not very familiar with XML and parsing. I am downloading the website source to a std::string (done). The website is basically one large HTML list (http://www.sourcemod.net/smdrop/1.3) and I need the last two elements from the list. All of the examples and tutorials I have found for TinyXml are based around reading XML out of a file, or writing to one. I am really at a loss here as to how I can find the last two elements of the list (which are webpage links). Any help is appreciated!

Omid Ghavami

1,007

April 27, 2010 06:25 AM

I'm not entirely sure which part you're having problems with. Is it parsing the XML string?

The TiXmlDocument class has a method called Parse so you should be able to do something like:
TiXmlDocument doc;
doc.Parse(myString.c_str());

Note: You may need to remove
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
from the XML string before attempting to parse it, I'm not sure if it can handle that part.
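
As a rough sketch of that clean-up step, the declaration can be cut out of the downloaded string with plain std::string operations before it is handed to Parse. The helper name stripDoctype is made up for this example and is not part of TinyXML. It assumes the declaration is written in upper case, as in the line quoted above, and that it contains no nested '>' character.

```cpp
#include <string>

// Hypothetical helper (not part of TinyXML): returns a copy of `html`
// with the first <!DOCTYPE ...> declaration removed, if one is present.
// Assumes an upper-case "<!DOCTYPE" and no '>' inside the declaration.
std::string stripDoctype(const std::string& html)
{
    const std::string marker = "<!DOCTYPE";
    const std::string::size_type start = html.find(marker);
    if (start == std::string::npos)
        return html;                       // no declaration, nothing to strip

    const std::string::size_type end = html.find('>', start);
    if (end == std::string::npos)
        return html;                       // unterminated declaration, leave as-is

    std::string result = html;
    result.erase(start, end - start + 1);  // drop "<!DOCTYPE ... >" inclusive
    return result;
}
```

With this in place, the call above would become doc.Parse(stripDoctype(myString).c_str());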

Once you've parsed the string you can use the available methods (FirstChild, IterateChildren, etc) to traverse the XML-tree just as usual.

Best regards, Omid

CrimsonGT

130

Author

April 27, 2010 06:58 AM

Hey, Thanks for the response.

My problem is that I am extremely unfamiliar with this, and I am just not sure where to go after I parse. What exactly are the child elements? I am assuming all html tags such as li, a href, etc are child elements.

Qt has something like this: QDomNodeList e = d.elementsByTagName("li");

So I am just wondering after I parse the string, what do I need to do to find the last two elements of the list? I see FirstChild(), LastChild(), etc. Just not really sure on how to use them to do this.

This is what the html source looks like...

http://ampaste.net/m54367016 (Sorry, I tried to put it here but the code tags didn't work; they kept formatting the HTML tags.)

I basically need to get sourcemod-1.3.2-hg2947.zip and sourcemod-1.3.2-hg2947.tar.gz from that page. (These change every day, which is why I need to do this.)

Omid Ghavami

1,007

April 27, 2010 08:46 AM

Ok, let's see, you should be able to do something like this (I can't test this at the moment, so it may not work):

The path through the XML tree that we are interested in is:

<html>
  <body>
    <ul>
      <li>
        <a>
          @href

TiXmlDocument doc;
doc.Parse(myString.c_str());

/* Traverse down the tree html->body->ul. TiXmlHandle lets us chain the
   calls without checking for NULL at every step. */
TiXmlHandle hDoc(&doc);
TiXmlElement* pList = hDoc.FirstChildElement("html").FirstChildElement("body").FirstChildElement("ul").ToElement();

if (pList)
{
    /* Get the last <li> element under <ul>. */
    TiXmlNode* pLastNode = pList->LastChild("li");

    /* Now that we have the last one, the previous sibling named "li"
       gives us the second to last one. */
    TiXmlNode* pSecondToLastNode = pLastNode ? pLastNode->PreviousSibling("li") : NULL;

    if (pLastNode && pSecondToLastNode)
    {
        /* The first <a> child of each <li> carries the "href" attribute. */
        TiXmlElement* pLastLink = pLastNode->FirstChildElement("a");
        TiXmlElement* pSecondToLastLink = pSecondToLastNode->FirstChildElement("a");

        /* Attribute() returns NULL if the attribute is missing. */
        const char* lastUrl = pLastLink ? pLastLink->Attribute("href") : NULL;
        const char* secondToLastUrl = pSecondToLastLink ? pSecondToLastLink->Attribute("href") : NULL;
    }
}

Something along those lines should work. I hope you manage to solve it!

Best regards, Omid

savagelook

138

May 17, 2010 12:04 PM

Using an XML parser like TinyXML is not a great idea for parsing HTML. HTML is not necessarily well-formed XML, so it will likely break any normal XML parser, unless the person who created the site chose to write it as well-formed markup.

For example, the <br> tag does not need to be closed in HTML, and this will screw up parsing. Any bare ampersands or <> signs will likely break it as well, unless they are escaped or contained in CDATA blocks. Your best bet will likely be to use a simple regex or something similar to pull out the data you need.
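
A minimal sketch of the regex route, using std::regex from C++11 (which may not be available in older compilers; Boost.Regex offers a similar interface). The helper names extractHrefs and lastLinks are made up for this example. The pattern assumes every link is written as an <a> tag with a double-quoted href attribute, which has not been checked against the actual page.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the href value of every <a> tag in `html`,
// in document order. Only double-quoted href attributes are recognised.
std::vector<std::string> extractHrefs(const std::string& html)
{
    static const std::regex linkPattern(
        R"re(<a\s[^>]*href\s*=\s*"([^"]*)")re", std::regex::icase);

    std::vector<std::string> hrefs;
    for (std::sregex_iterator it(html.begin(), html.end(), linkPattern), end;
         it != end; ++it)
    {
        hrefs.push_back((*it)[1].str());  // capture group 1 is the URL
    }
    return hrefs;
}

// Hypothetical helper: returns the last `count` links on the page,
// or all of them if the page has fewer than `count`.
std::vector<std::string> lastLinks(const std::string& html, std::size_t count)
{
    const std::vector<std::string> all = extractHrefs(html);
    const std::size_t first = all.size() > count ? all.size() - count : 0;
    return std::vector<std::string>(
        all.begin() + static_cast<std::ptrdiff_t>(first), all.end());
}
```

Calling lastLinks(myString, 2) on the downloaded page source would then return the final two href values in page order, without requiring the page to be well-formed XML.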

Pro C++ programmer looking to expand his horizons into the field of 3d graphics.


This topic is closed to new replies.
