Extracting information from a website?

I would just like to get a general idea of how one would do this. For example, given this page, http://www.allmusic.com/cg/amg.dll?p=amg&sql=10:pwx8b5m4tsqe, how would you extract the artist, album, track list, etc. and then write them out to a local file? What languages would I need to know? Would it be very difficult to do? This is just one of my many questions about programming. Thanks in advance! -Gabe

The kind of software you are looking for is called a "screen scraper". Use that as a Google search term to begin with.

As far as languages go, you can really do it in anything that can access the internet (which is almost anything). Preferably use something that is good at parsing and quick to develop in. Perhaps Python or C#? Otherwise go with what you know.
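
For instance, a minimal C# sketch of the fetch-and-save step (untested; the output filename is just a placeholder):

using System.IO;
using System.Net;

class PageFetcher
{
    static void Main()
    {
        // The page from the original post; swap in whatever URL you need.
        string url = "http://www.allmusic.com/cg/amg.dll?p=amg&sql=10:pwx8b5m4tsqe";

        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString(url); // fetch the raw HTML
            File.WriteAllText("page.html", html);     // save it locally for parsing
        }
    }
}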

In order to load a web page, your browser sends the server a GET request.
In response, the server will return the requested page (if all is well).

Your program can do the same thing on its own.
Doing this would involve sockets, the server's address and... that's it, actually. :)
Then you can parse the data and extract what you want. Not complicated at all.
I recommend that you read a little bit about the HTTP protocol, it will clear some things up (RFC 2616). I hope ;)
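
A bare-bones sketch of that in C#, writing the request by hand against the page from the original post (untested, no error handling):

using System;
using System.IO;
using System.Net.Sockets;

class RawHttpGet
{
    static void Main()
    {
        // Connect to the web server on port 80 and speak HTTP by hand.
        using (TcpClient client = new TcpClient("www.allmusic.com", 80))
        using (NetworkStream stream = client.GetStream())
        {
            StreamWriter writer = new StreamWriter(stream);
            writer.Write("GET /cg/amg.dll?p=amg&sql=10:pwx8b5m4tsqe HTTP/1.1\r\n");
            writer.Write("Host: www.allmusic.com\r\n");
            writer.Write("Connection: close\r\n\r\n"); // close lets us read to end-of-stream
            writer.Flush();

            StreamReader reader = new StreamReader(stream);
            Console.WriteLine(reader.ReadToEnd());     // response headers + HTML body
        }
    }
}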

Quote:
Original post by Andrew Russell
Preferably use something that is good at parsing and quick to develop in. Perhaps Python or C#? Otherwise go with what you know.


One word: Perl. This is where Perl REALLY shines.

The approach that odi suggests is right on: sockets and the HTTP protocol. In essence you're building a super-specialized browser (although you probably don't need links, images, plugins, etc.).

There are two problems you need to overcome.

One: what you want to do is almost certainly against the site's terms of use, and possibly illegal. Now while I'm not one to worry too much about the legality of things (and certainly would not think less of someone who has done such things, considering I have), the site's operators know scraping happens, and a good webmaster will be able to detect suspicious activity. I got caught scraping stock prices from Yahoo pages. The first step to avoid this (and I hope I don't get in trouble for saying so) is to send a real User-Agent in your GET request, such as the one sent by IE or Firefox, although this still might not save you.
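
In C#, for example, that's a single property on the request (a rough sketch; the User-Agent string below is just an illustrative example, and of course only scrape where the site's terms allow it):

using System;
using System.IO;
using System.Net;

class BrowserLikeRequest
{
    static void Main()
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
            "http://www.allmusic.com/cg/amg.dll?p=amg&sql=10:pwx8b5m4tsqe");

        // Present a browser-style User-Agent instead of the default.
        request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/1.5";

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd()); // same HTML a browser would receive
        }
    }
}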

The second problem is that the site operators protect against leeching of their information when generating their HTML. It is going to be really hard to parse the resulting HTML because you won't know what to look for consistently. For instance, you might start by finding an <a> tag that encompasses every track entry. The problem is that the <a> tags for track listings resemble nearly every other link on the page. Worse than that, even for one specific page the links change.

For instance, on your example page the link that surrounds "Everybody's Talkin'" is /cg/amg.dll?p=amg&sql=33:szp1z88a6yv5; when you refresh the page and look at that link again, it is /cg/amg.dll?p=amg&sql=33:5fnsa9tgr23g. This is not to say it can't be parsed (lots of Perl and fancy regexes could do the trick), it will just be more difficult, and since this is a dynamically generated page the owners can switch up the result syntax at any time, possibly rendering your program useless...
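
To illustrate the regex approach in C# (the pattern below is a guess based on the two links above; it is not verified against the actual page):

using System;
using System.IO;
using System.Text.RegularExpressions;

class TrackLinkParser
{
    static void Main()
    {
        string html = File.ReadAllText("page.html"); // previously saved page source

        // Guess: track links point at the amg.dll handler with a "33:" id,
        // as in the two example links quoted above.
        Regex trackLink = new Regex(
            @"<a href=""(/cg/amg\.dll\?p=amg&sql=33:[^""]+)""[^>]*>([^<]+)</a>",
            RegexOptions.IgnoreCase);

        foreach (Match m in trackLink.Matches(html))
            Console.WriteLine(m.Groups[2].Value); // the link text, e.g. a track title
    }
}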

But good luck.

I've done almost exactly what you want.
I used the InternetOpenUrl function from the WinINet API.
This gives you the HTML source of the page (the same text shown when choosing View Source from the context menu).
Then you need to parse the source to extract the information you need.
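
The poster doesn't say which language, but the same WinINet calls are reachable from C# via P/Invoke. A rough, untested sketch (no error handling; the agent string is arbitrary):

using System;
using System.Runtime.InteropServices;
using System.Text;

class WinInetFetch
{
    [DllImport("wininet.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr InternetOpen(string agent, int accessType,
        string proxy, string proxyBypass, int flags);

    [DllImport("wininet.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr InternetOpenUrl(IntPtr hInternet, string url,
        string headers, int headersLength, int flags, IntPtr context);

    [DllImport("wininet.dll", SetLastError = true)]
    static extern bool InternetReadFile(IntPtr hFile, byte[] buffer,
        int bytesToRead, out int bytesRead);

    [DllImport("wininet.dll")]
    static extern bool InternetCloseHandle(IntPtr handle);

    const int INTERNET_OPEN_TYPE_PRECONFIG = 0; // use the machine's proxy settings

    static void Main()
    {
        IntPtr session = InternetOpen("MyScraper", INTERNET_OPEN_TYPE_PRECONFIG,
            null, null, 0);
        IntPtr page = InternetOpenUrl(session,
            "http://www.allmusic.com/cg/amg.dll?p=amg&sql=10:pwx8b5m4tsqe",
            null, 0, 0, IntPtr.Zero);

        // Pull the HTML down in 4 KB chunks until the handle runs dry.
        byte[] buffer = new byte[4096];
        StringBuilder html = new StringBuilder();
        int read;
        while (InternetReadFile(page, buffer, buffer.Length, out read) && read > 0)
            html.Append(Encoding.ASCII.GetString(buffer, 0, read));

        InternetCloseHandle(page);
        InternetCloseHandle(session);
        Console.WriteLine(html.ToString());
    }
}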

Quote:
Original post by random_acts
I got caught scraping stock prices from Yahoo pages.

May I ask why the heck Yahoo cared? Was it because you were using their site without viewing their ads?

Quote:
Original post by Daniel Miller
May I ask why the heck Yahoo cared? Was it because you were using their site without viewing their ads?


Probably. In any case, if you decide to write a spidering/scraping bot, you should obey the site's TOS as well as its robots.txt.
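
A naive robots.txt check in C# might just fetch the file and scan the Disallow lines (untested sketch; a real check must also honor the User-agent sections, which this skips):

using System;
using System.Net;

class RobotsCheck
{
    static void Main()
    {
        using (WebClient client = new WebClient())
        {
            string robots = client.DownloadString("http://www.allmusic.com/robots.txt");

            // Print every Disallow rule; a real bot would match these
            // against the paths it intends to request.
            foreach (string line in robots.Split('\n'))
                if (line.Trim().StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
                    Console.WriteLine(line.Trim());
        }
    }
}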

Quote:
Original post by Konfusius
Probably. In any case, if you decide to write a spidering/scraping bot, you should obey the site's TOS as well as its robots.txt.

Yep, I'm not saying you should break their rules. I was wondering why Yahoo would make that rule.

Thanks for the responses. Unfortunately, I only really know C++ and some C#. InternetOpenUrl seems like what I need, but I have no knowledge of the Win32 API. I guess what I'll end up doing is saving the HTML source of the page I want and then parsing it for the data I want, since that's pretty much all I can do at the moment. Freedb.org also seems like a nice service; I may read more into it.

I'm still unclear: what's the most common way to retrieve information from the web? Can C# do it easily? Do I really need to know sockets and the HTTP protocol thoroughly to do this? After reading this thread I think I'm more confused about what to do, so I'm just going to save the source.

Thanks
-Gabe
