Writing screen-scrapers

This topic is 3484 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

I've never written one of these, and I'd like to try populating a DB with data pulled from a bunch of web pages. Are there any useful tools out there to make it a bit easier? The site I'm looking at doesn't make it easy by having nice ids on all HTML elements or anything like that... is it possible to scrape what's actually rendered on the screen and avoid navigating a lot of the HTML crap? Additionally, the site requires me to be logged in to access these pages. I've got some experience with server-side tech like JSP/ASP, but security/authentication isn't an area I've ever worked on. Is it a problem to log in programmatically and then make HTTP requests? My preference is to use Java/C# for this.

Quote:
The site I'm looking at doesn't make it easy by having nice ids on all HTML elements
Not sure I understood what exactly you want to do. You can simply ignore attributes such as id inside tags if you're not interested in them. In fact, if you don't care about attributes at all, you can skip everything from the first whitespace inside a tag to its closing >.
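
To make the "skip from the first whitespace" idea concrete, here is a minimal Java sketch. The class and method names (TagNames, nameOf) are made up for illustration, and it assumes the input is a single complete tag that starts with < and ends with >.

```java
import java.util.Locale;

final class TagNames {
    /** Returns the element name of one tag, e.g. "div" for <div id="x">, ignoring attributes. */
    static String nameOf(String tag) {
        // Drop the leading '<' (or "</" for a closing tag) and the trailing '>' (or "/>").
        int start = tag.startsWith("</") ? 2 : 1;
        int end = tag.endsWith("/>") ? tag.length() - 2 : tag.length() - 1;
        String inner = tag.substring(start, end);
        // Everything from the first whitespace onward is attributes, so stop there.
        for (int i = 0; i < inner.length(); i++) {
            if (Character.isWhitespace(inner.charAt(i))) {
                return inner.substring(0, i).toLowerCase(Locale.ROOT);
            }
        }
        return inner.toLowerCase(Locale.ROOT);
    }
}
```

It does not handle comments or doctype declarations; those would need their own checks.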

If you want to strip all (or most) HTML tags from the text, a regular expression is probably the easiest way. Alternatively, and faster, you can of course scan the text for < and >. Since a literal < in the text of an HTML document has to be escaped as &lt; (and > is conventionally written as &gt;), any unescaped occurrence delimits a tag, unless the page is invalid. Just skip whatever is between them. Watch out for comments and script blocks, though, since their contents can legitimately contain these characters.
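
Both approaches can be sketched in a few lines of Java. This is only an illustration under the assumption stated above (every < opens a tag and the next > closes it); the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // Matches '<', then any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Regular-expression variant: delete everything that looks like a tag. */
    static String stripWithRegex(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    /** Scanning variant: copy characters only while outside a '<' ... '>' pair. */
    static String stripByScanning(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;          // start skipping
            } else if (c == '>' && insideTag) {
                insideTag = false;         // tag finished, resume copying
            } else if (!insideTag) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Neither variant decodes entities such as &amp;amp; or removes the contents of script and style elements, so real pages will usually need a bit more work than this.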

PHP even has a dedicated function (strip_tags) just for the purpose of stripping off HTML, and it even lets you specify a list of "allowable" tags.

Logging into a site programmatically is usually not a big problem. Most of the time it involves a POST and storing a cookie or forwarding a session identifier that you get from the URL. Most languages with "bigger" standard libraries let you do that without much trouble.
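
As a rough sketch of the POST-and-cookie case in Java (11 or later, using the standard java.net.http client): the class name FormLogin, the URLs, and the form field names are all hypothetical, and a real site may also want hidden form fields or tokens that you would have to read from the login page first.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

final class FormLogin {
    // The CookieManager stores cookies set by the login response and
    // sends them back on later requests made through this same client.
    private final HttpClient client = HttpClient.newBuilder()
            .cookieHandler(new CookieManager())
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Encodes fields as application/x-www-form-urlencoded, e.g. a=1&b=2. */
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** POSTs the login form and returns the HTTP status code of the response. */
    int login(String loginUrl, Map<String, String> fields)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    /** GETs a page, sending along whatever cookies login() collected. */
    String fetch(String pageUrl) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(pageUrl)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

You would call login once with something like Map.of("username", ..., "password", ...) against the form's action URL, then call fetch for each page you want. The actual field names have to be taken from the site's login form.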

On a different note, did you consider that storing content from a password-protected website in a database is most likely illegal? If the content was in the public domain, they surely wouldn't require you to log in to access it. :-)

Quote:
Original post by samoth
Quote:
The site I'm looking at doesn't make it easy by having nice ids on all HTML elements
Not sure I understood what exactly you want to do.
I mean that if they used ids on their HTML elements, I could look up those elements directly rather than having to literally scrape everything.
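
For illustration, when there are no ids to look up, one fallback is to key on the surrounding markup instead, for example pulling the text out of every table cell. A minimal Java sketch, assuming the data sits in plain td elements with no nested tables (the class name CellScraper is invented for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CellScraper {
    // Captures whatever sits between <td ...> and the next </td>.
    private static final Pattern CELL = Pattern.compile(
            "<td\\b[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text of each table cell, in document order. */
    static List<String> cells(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            // Remove any tags nested inside the cell, then trim whitespace.
            result.add(m.group(1).replaceAll("<[^>]*>", "").trim());
        }
        return result;
    }
}
```

This is brittle if the site changes its layout, so a proper HTML parser is likely the better choice for anything beyond a quick job.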

Quote:
On a different note, did you consider that storing content from a password-protected website in a database is most likely illegal? If the content was in the public domain, they surely wouldn't require you to log in to access it. :-)
I have a login and am able to access the data in my browser by making HTTP requests... I just want to do the same thing from my own custom app rather than through a web browser. I could manually load all the pages and strip the data into a DB but I want to automate this task.

There was this great interpreter for a functional language called WebL, written by Compaq.

For some reason however, it no longer seems to be available, although the articles describing it are still around.

Quote:
Original post by samoth
Logging into a site programmatically is usually not a big problem. Most of the time it involves a POST and storing a cookie or forwarding a session identifier that you get from the URL. Most languages with "bigger" standard libraries let you do that without much trouble.
Can anyone give me any further information on that? I don't know what to search for... I guess both Java and C# should have "bigger standard libraries" such as you speak of?
