barakat

Extract data from an HTML file

This topic is 4858 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


How can I extract data from .html files? I am looking for the data inside tables (<TABLE ...>), using C++. Thanks in advance.

  • Open the file (fopen or fstream)
  • Read in data until you discover "<table"
  • Buffer what you're reading until "</table>"

    If that doesn't help, what exactly are you trying to do?
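
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a real HTML parser: the helper names (`to_lower`, `extract_tables`, `read_file`) are made up for this example, and it assumes tables are not nested and that no "<table" text is hiding inside comments or scripts.

```cpp
#include <algorithm>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: lower-case copy, so matching ignores case (<TABLE> vs <table>).
std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: returns every "<table ...> ... </table>" span found in html.
// Assumption: tables are not nested and the markup is not hidden in comments/scripts.
std::vector<std::string> extract_tables(const std::string& html) {
    const std::string lower = to_lower(html);  // same length as html, so offsets match
    const std::string open = "<table";
    const std::string close = "</table>";
    std::vector<std::string> tables;
    std::size_t pos = 0;
    while ((pos = lower.find(open, pos)) != std::string::npos) {
        const std::size_t end = lower.find(close, pos);
        if (end == std::string::npos) break;  // unterminated table: stop scanning
        tables.push_back(html.substr(pos, end + close.size() - pos));
        pos = end + close.size();
    }
    return tables;
}

// Reads a whole file into memory; yields an empty string if it cannot be opened.
std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}
```

Usage would be something like `extract_tables(read_file("page.html"))`, where `page.html` is a placeholder file name. Reading the whole file first keeps the search simple; for very large files you would buffer incrementally instead.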

    I already know this.

    I am just trying to do it in a good, efficient way, or rather I am looking for something standard, because I haven't parsed files before and want to get started.

    HTML is very complex to parse; XHTML is slightly easier. Your best bet is the method I described (as far as I know, that's the standard way). I don't know whether any third-party libraries exist for parsing HTML files.

    You could use a library like 'TinyXML' to parse the HTML, since XML is pretty much a general structure. I've had good experience with doing this myself. You'd have to make certain that your HTML is well formed though (make sure all opened element tags are closed etc...)

    Quote:
    Original post by FReY
    You could use a library like 'TinyXML' to parse the HTML, since XML is pretty much a general structure. I've had good experience with doing this myself. You'd have to make certain that your HTML is well formed though (make sure all opened element tags are closed etc...)
    Only if it's XHTML. HTML can't be parsed with an XML parser because of tags like <br> that don't need a closing tag.
    Even with XHTML, you'll still have to insert your own header somehow, and strip comments, I think. I'm not sure whether [!-- blah --] is valid XML (written with less-than and greater-than instead of brackets; the forum parses those right out, even when using &lt;...).
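
To illustrate the <br> problem, here is a rough sketch of the kind of open/close matching an XML parser enforces. The function name `tags_balanced` is made up for this example, and the check is deliberately naive: it assumes no '>' inside attribute values and no CDATA sections. For what it's worth, the XML specification does allow comments written with less-than, `!--`, `--` and greater-than, so the sketch simply skips them.

```cpp
#include <string>
#include <vector>

// Hypothetical checker: true if every opened element is closed in the right order.
// Illustration only; assumes no '>' inside attribute values and no CDATA sections.
bool tags_balanced(const std::string& markup) {
    std::vector<std::string> open_tags;
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        if (markup.compare(pos, 4, "<!--") == 0) {  // skip a comment
            const std::size_t end = markup.find("-->", pos + 4);
            if (end == std::string::npos) return false;
            pos = end + 3;
            continue;
        }
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos) return false;
        const std::string tag = markup.substr(pos + 1, end - pos - 1);  // text between < and >
        pos = end + 1;
        if (tag.empty() || tag[0] == '!' || tag[0] == '?') continue;  // doctype or declaration
        const bool closing = tag[0] == '/';
        const bool self_closing = tag.back() == '/';
        const std::size_t name_start = closing ? 1 : 0;
        const std::size_t name_end = tag.find_first_of(" \t\r\n/", name_start);
        // substr clamps the count, so a missing delimiter means "to the end of tag".
        const std::string name = tag.substr(name_start, name_end - name_start);
        if (closing) {
            if (open_tags.empty() || open_tags.back() != name) return false;
            open_tags.pop_back();
        } else if (!self_closing) {
            open_tags.push_back(name);
        }
    }
    return open_tags.empty();
}
```

With this, `tags_balanced("<td>a<br/>b</td>")` passes while `tags_balanced("<td>a<br>b</td>")` fails, which is the reason plain HTML tends to trip up an XML parser.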

    I have had good results with JTidy. It does a good job of handling really bad HTML, and even that crap MS Word passes off as HTML, and builds a DOM tree that you can extract data from.
