CrimsonGT

Find Elements of an HTML List using TinyXML and C++

I just added TinyXml to my project, but I am not very familiar with XML and parsing. I am downloading the website source into a std::string (done). The website is basically one large HTML list (http://www.sourcemod.net/smdrop/1.3), and I need the last two elements from the list. All of the examples and tutorials I have found for TinyXml are based around reading XML from a file or writing to one. I am really at a loss as to how I can find the last two elements of the list (which are webpage links). Any help is appreciated!

I'm not entirely sure which part you're having problems with. Is it parsing the XML string?

The TiXmlDocument class has a method called Parse so you should be able to do something like:
TiXmlDocument doc;
doc.Parse(myString.c_str());


Note: You may need to remove
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
from the XML string before attempting to parse it; I'm not sure whether the parser can handle that part.
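
If the declaration does turn out to be a problem, it can be stripped from the string before calling Parse. Here is a minimal sketch using only the standard library. The name stripDoctype is a hypothetical helper, not part of TinyXML, and it assumes the declaration is written in upper case as `<!DOCTYPE` and contains no `>` before its closing bracket:

```cpp
#include <string>

// Hypothetical helper (not part of TinyXML): returns a copy of `html`
// with the first <!DOCTYPE ...> declaration removed, if there is one.
// Assumes the keyword is upper case and the declaration has no nested '>'.
std::string stripDoctype(const std::string& html)
{
    const std::string marker = "<!DOCTYPE";
    const std::string::size_type start = html.find(marker);
    if (start == std::string::npos)
        return html;                       // no declaration, nothing to strip

    const std::string::size_type end = html.find('>', start);
    if (end == std::string::npos)
        return html;                       // unterminated declaration, leave as-is

    std::string result = html;
    result.erase(start, end - start + 1);  // drop "<!DOCTYPE ... >" inclusive
    return result;
}
```

With that in place, the parse call would become something like doc.Parse(stripDoctype(myString).c_str());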

Once you've parsed the string you can use the available methods (FirstChild, IterateChildren, etc) to traverse the XML-tree just as usual.

Hey, Thanks for the response.

My problem is that I am extremely unfamiliar with this, and I am just not sure where to go after I parse. What exactly are the child elements? I am assuming all HTML tags, such as li and a href, are child elements.

Qt has something like this: QDomNodeList e = d.elementsByTagName("li");

So I am just wondering: after I parse the string, what do I need to do to find the last two elements of the list? I see FirstChild(), LastChild(), etc., but I'm just not really sure how to use them to do this.

This is what the HTML source looks like:

http://ampaste.net/m54367016 (Sorry, I tried to put it here, but the code tags didn't work; they keep formatting the HTML tags.)

I basically need to get sourcemod-1.3.2-hg2947.zip and sourcemod-1.3.2-hg2947.tar.gz from that page. (These change every day, which is why I need to do this.)

Ok, let's see, you should be able to do something like this (I can't test this at the moment, so it may not work):

The path through the XML tree that we are interested in is:

<html>
  <body>
    <ul>
      <li>
        <a>
          @href

TiXmlDocument doc;
doc.Parse(myString.c_str());

/* Traverse down the tree html->body->ul and then get the last <li> element under <ul>.
FirstChild() and LastChild() return pointers, so the calls are chained with ->.
Real code should check each result for NULL before dereferencing it. */
TiXmlNode* pLastNode = doc.FirstChild("html")->FirstChild("body")->FirstChild("ul")->LastChild("li");

/* Now that we have the last one, we can get the previous <li> sibling,
which gives us the second to last one */
TiXmlNode* pSecondToLastNode = pLastNode->PreviousSibling("li");

/* Now that we have the <li> elements we get the <a> child of each,
and then we get the attribute "href" on that element */
const char* lastUrl = pLastNode->FirstChild("a")->ToElement()->Attribute("href");
const char* secondToLastUrl = pSecondToLastNode->FirstChild("a")->ToElement()->Attribute("href");




Something along those lines should work. I hope you manage to solve it!

Using an XML parser like TinyXML is not a great idea for parsing HTML. HTML is not necessarily well-formed XML, so it will likely break any normal XML parser unless the person who created the site chose to write it as strict, well-formed markup.

For example, the <br> tag does not need to be closed in HTML, and this will screw up parsing. Any unescaped ampersands or <> signs will likely break it as well unless they are contained in CDATA blocks. Your best bet will likely be to use a simple regex or something similar to pull out the data you need.
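
As a rough sketch of the regex route, something like the following could pull out the links with just the standard library (C++11 or later). The function names extractHrefs and lastLinks are made up for illustration, and the pattern assumes every link is written as href="..." with double quotes, which may not hold for every page:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the value of every href="..." attribute
// in `html`, in document order. Assumes double-quoted attribute values.
std::vector<std::string> extractHrefs(const std::string& html)
{
    static const std::regex hrefPattern("href\\s*=\\s*\"([^\"]*)\"",
                                        std::regex::icase);

    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), hrefPattern), end;
         it != end; ++it)
    {
        links.push_back((*it)[1].str());  // capture group 1 is the URL
    }
    return links;
}

// Hypothetical helper: returns the last `count` links found in `html`,
// or all of them if the page contains fewer than `count`.
std::vector<std::string> lastLinks(const std::string& html, std::size_t count)
{
    std::vector<std::string> links = extractHrefs(html);
    if (links.size() > count)
        links.erase(links.begin(), links.begin() + (links.size() - count));
    return links;
}
```

Calling lastLinks(myString, 2) on the downloaded page should then give the two most recent entries, provided the newest files really are listed last and the page has no other links after them.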
