Find Elements of an HTML List using TinyXML and C++
I just added TinyXML to my project, but I am not very familiar with XML and parsing. I am downloading the website source into a std::string (done). The website is basically one large HTML list (http://www.sourcemod.net/smdrop/1.3) and I need the last two elements from the list.
All of the examples and tutorials I have found for TinyXml are based around reading XML out of a file, or writing to one.
I am really at a loss here as to how I can find the last two elements of the list (which are webpage links).
Any help is appreciated!
I'm not entirely sure which part you're having problems with. Is it parsing the XML string?
The TiXmlDocument class has a method called Parse so you should be able to do something like:
TiXmlDocument doc;
doc.Parse(myString.c_str());
Note: You may need to remove
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
from the XML string before attempting to parse it, I'm not sure if it can handle that part.
Once you've parsed the string you can use the available methods (FirstChild, IterateChildren, etc) to traverse the XML-tree just as usual.
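If the DOCTYPE line does turn out to be a problem, one way to drop it before calling Parse is a small helper like the one below. This is only a sketch using the standard library: stripDoctype is a made-up name (it is not part of TinyXML), and it assumes the declaration contains no '>' character other than the one that closes it.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of TinyXML: returns a copy of `html` with the
// first <!DOCTYPE ...> declaration removed. The search is case-insensitive.
// Assumption: the declaration has no '>' before its closing '>'.
std::string stripDoctype(const std::string& html)
{
    const std::string marker = "<!doctype";
    for (std::string::size_type i = 0; i + marker.size() <= html.size(); ++i) {
        // Compare the text at position i with the marker, ignoring case.
        bool match = true;
        for (std::string::size_type j = 0; j < marker.size(); ++j) {
            const int c = std::tolower(static_cast<unsigned char>(html[i + j]));
            if (c != marker[j]) {
                match = false;
                break;
            }
        }
        if (!match) {
            continue;
        }
        const std::string::size_type end = html.find('>', i);
        if (end == std::string::npos) {
            return html; // unterminated declaration: leave the input untouched
        }
        std::string result = html;
        result.erase(i, end - i + 1); // drop everything from '<' through '>'
        return result;
    }
    return html; // no DOCTYPE found
}
```

You would then call doc.Parse(stripDoctype(myString).c_str()); instead of passing myString directly.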
Hey, Thanks for the response.
My problem is that I am extremely unfamiliar with this, and I am just not sure where to go after I parse. What exactly are the child elements? I am assuming all html tags such as li, a href, etc are child elements.
Qt has something like this QDomNodeList e = d.elementsByTagName("li");
So I am just wondering after I parse the string, what do I need to do to find the last two elements of the list? I see FirstChild(), LastChild(), etc. Just not really sure on how to use them to do this.
This is what the html source looks like...
http://ampaste.net/m54367016 (Sorry, I tried to put it here but the code tags didn't work; the forum kept formatting the HTML tags)
I basically need to get sourcemod-1.3.2-hg2947.zip and sourcemod-1.3.2-hg2947.tar.gz from that page. (These change every day, which is why I need to do this)
Ok, let's see, you should be able to do something like this (I can't test this at the moment, so it may not work):
The tree path of the XML we are interested in is:
<html> <body> <ul> <li> <a> @href
TiXmlDocument doc;
doc.Parse(myString.c_str());
/* Traverse down the tree html->body->ul and then get the last <li> element under <ul>.
   Each call returns NULL if the node is missing, so check the pointers in real code. */
TiXmlNode* pLastNode = doc.FirstChildElement()->FirstChild("body")->FirstChild("ul")->LastChild("li");
/* Now that we have the last one, we can get the previous <li> sibling, which gives us the second to last one */
TiXmlNode* pSecondToLastNode = pLastNode->PreviousSibling("li");
/* Now that we have the <li> elements we get the <a> element under each, and then we get the attribute "href" on that element */
const char* lastUrl = pLastNode->FirstChildElement("a")->Attribute("href");
const char* secondToLastUrl = pSecondToLastNode->FirstChildElement("a")->Attribute("href");
Something along those lines should work. I hope you manage to solve it!
Using an XML parser like TinyXML is not a great idea for parsing HTML. HTML is not required to be well-formed XML, so it will likely break any normal XML parser unless the person who created the site wrote it as strict, well-formed markup.
For example, the <br> tag does not need to be closed in HTML, and this will screw up parsing. Any unescaped ampersands or <> signs will likely break it as well unless they are contained in CDATA blocks. Your best bet will likely be to use a simple regex or something similar to pull out the data you need.
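To make the regex suggestion concrete, here is a rough sketch using std::regex from C++11. The names extractHrefs and lastLinks are invented for this example, and it assumes every link on the page is written as href="..." with double quotes. If the page uses single quotes or unquoted values, the pattern would need adjusting.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the value of every href="..." attribute,
// in document order. Assumption: attribute values are double-quoted.
std::vector<std::string> extractHrefs(const std::string& html)
{
    static const std::regex hrefPattern("href\\s*=\\s*\"([^\"]*)\"",
                                        std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), hrefPattern), end;
         it != end; ++it) {
        links.push_back((*it)[1].str()); // capture group 1 is the URL
    }
    return links;
}

// Hypothetical helper: returns the last `count` links found in `html`,
// or all of them if there are fewer than `count`.
std::vector<std::string> lastLinks(const std::string& html, std::size_t count)
{
    std::vector<std::string> links = extractHrefs(html);
    if (links.size() > count) {
        links.erase(links.begin(),
                    links.end() - static_cast<std::ptrdiff_t>(count));
    }
    return links;
}
```

Calling lastLinks(myString, 2) on the downloaded page source should then give the two final links in the list, without depending on the markup being well-formed.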