Find Elements of an HTML List using TinyXML and C++

3 comments, last by savagelook 13 years, 11 months ago
I just added TinyXML to my project, but I am not very familiar with XML and parsing. I am downloading the website source into a std::string (done). The website is basically one large HTML list (http://www.sourcemod.net/smdrop/1.3) and I need the last two elements from the list. All of the examples and tutorials I have found for TinyXML are based around reading XML from a file or writing to one. I am really at a loss as to how I can find the last two elements of the list (which are webpage links). Any help is appreciated!
I'm not entirely sure which part you're having problems with, is it parsing the XML string?

The TiXmlDocument class has a method called Parse so you should be able to do something like:
TiXmlDocument doc;
doc.Parse(myString.c_str());


Note: You may need to remove
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
from the XML string before attempting to parse it; I'm not sure whether TinyXML can handle that part.
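If the DOCTYPE line does turn out to be a problem, one way to drop it before parsing is sketched below. This is only an illustration using std::string: stripDoctype is a hypothetical helper name (it is not part of TinyXML), and it assumes the declaration is written in upper case as shown above and contains no nested '>' characters.

```cpp
#include <string>

// Hypothetical helper (not part of TinyXML): returns a copy of `html`
// with the first <!DOCTYPE ...> declaration removed, if one is present.
// Assumes an upper-case "<!DOCTYPE" and no '>' inside the declaration.
std::string stripDoctype(const std::string& html)
{
    const std::string marker = "<!DOCTYPE";
    const std::string::size_type start = html.find(marker);
    if (start == std::string::npos)
        return html;                       // no declaration, nothing to strip

    const std::string::size_type end = html.find('>', start);
    if (end == std::string::npos)
        return html;                       // unterminated declaration, leave untouched

    std::string result = html;
    result.erase(start, end - start + 1);  // remove "<!DOCTYPE ... >" inclusive
    return result;
}
```

The cleaned string could then be handed to the parser in place of the original, e.g. doc.Parse(stripDoctype(myString).c_str());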

Once you've parsed the string you can use the available methods (FirstChild, IterateChildren, etc) to traverse the XML-tree just as usual.
Best regards, Omid
Hey, Thanks for the response.

My problem is that I am extremely unfamiliar with this, and I am just not sure where to go after I parse. What exactly are the child elements? I am assuming all html tags such as li, a href, etc are child elements.

Qt has something like this: QDomNodeList e = d.elementsByTagName("li");

So I am just wondering after I parse the string, what do I need to do to find the last two elements of the list? I see FirstChild(), LastChild(), etc. Just not really sure on how to use them to do this.

This is what the html source looks like...

http://ampaste.net/m54367016 (Sorry, I tried to put it here but the code tags didn't work; the forum kept formatting the HTML tags)

I basically need to get sourcemod-1.3.2-hg2947.zip and sourcemod-1.3.2-hg2947.tar.gz from that page. (These change every day, which is why I need to do this.)
Ok, let's see, you should be able to do something like this (I can't test this at the moment, so it may not work):

The tree path of the XML we are interested in is:
<html>
  <body>
    <ul>
      <li>
        <a>
          @href


TiXmlDocument doc;
doc.Parse(myString.c_str());

/* Traverse down the tree html->body->ul and then get the last <li> element under <ul>.
   Each call returns a pointer (NULL if the node is missing), so real code should check
   every step before dereferencing it. */
TiXmlNode* pLastNode = doc.FirstChild("html")->FirstChild("body")->FirstChild("ul")->LastChild("li");

/* Now that we have the last one, we can get the previous <li> sibling,
   which gives us the second to last one. */
TiXmlNode* pSecondToLastNode = pLastNode->PreviousSibling("li");

/* Now that we have the <li> elements, we get the <a> child element of each,
   and then we get the attribute "href" on that element. */
const char* lastUrl = pLastNode->FirstChildElement("a")->Attribute("href");
const char* secondToLastUrl = pSecondToLastNode->FirstChildElement("a")->Attribute("href");



Something along those lines should work. I hope you manage to solve it!
Best regards, Omid
Using an XML parser like TinyXML is not a great idea for parsing HTML. HTML is not required to be well-formed XML, so it will likely break any normal XML parser unless the person who created the site chose to write it as strict, well-formed markup.

For example, the <br> tag does not need to be closed in HTML, and this will break parsing. Any unescaped ampersands or < > signs in the text will likely break it as well, unless they are contained in CDATA blocks. Your best bet will likely be to use a simple regex or something similar to pull out the data you need.
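A rough sketch of the regex route, assuming a C++11 compiler with std::regex and assuming every link on the page is written as a double-quoted href attribute. The helper names extractHrefs and lastN are made up for this illustration, and the pattern is deliberately simple rather than a general HTML parser.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the value of every href="..." attribute
// found in `html`, in document order. Assumes double-quoted values.
std::vector<std::string> extractHrefs(const std::string& html)
{
    static const std::regex hrefPattern("href\\s*=\\s*\"([^\"]*)\"",
                                        std::regex::icase);
    std::vector<std::string> hrefs;
    for (std::sregex_iterator it(html.begin(), html.end(), hrefPattern), end;
         it != end; ++it)
    {
        hrefs.push_back((*it)[1].str());   // capture group 1 is the URL text
    }
    return hrefs;
}

// Hypothetical helper: returns the last `count` items,
// or all of them if fewer than `count` exist.
std::vector<std::string> lastN(const std::vector<std::string>& items,
                               std::size_t count)
{
    const std::size_t n = items.size() < count ? items.size() : count;
    return std::vector<std::string>(
        items.end() - static_cast<std::ptrdiff_t>(n), items.end());
}
```

With the downloaded page in myString, lastN(extractHrefs(myString), 2) would give the final two links in the order they appear on the page, which is what the original question asks for if the newest files are listed last.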
Pro C++ programmer looking to expand his horizons into the field of 3d graphics.
