v0dKA

Polite data mining

This topic is 3747 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


I'm working on a project that requires some data mining. I'll be sending well over a thousand requests to a single website. I understand this is probably looked down upon -- if I operated a web server, I certainly would not want anyone to bombard it with two dozen requests per second. For this reason, I'm trying to be as polite as I can while still getting the data I need. I'm wondering what the proper "etiquette" is for programs of this nature. Suppose the goal is to retrieve dictionary definitions from dictionary.com for a long list of words (in actuality, the goal is very similar to this, but not quite). What is the maximum number of HTTP requests I could send per second without being rude? Should I include longer pauses between batches of words? Is there anything else I should do? Thanks in advance for any advice.

I'm not sure what the "right thing to do" is with regard to request frequency. If you start getting HTTP 500 responses, then your frequency is too high though ;)
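One way to act on that signal is to slow down whenever the server starts returning errors. This is only a rough sketch: the function name `next_delay`, the 2-second base, the 60-second cap, and the doubling/halving policy are all assumptions for illustration, and treating HTTP 429 ("Too Many Requests") the same as a 5xx response is my own addition.

```python
def next_delay(current_delay, status_code, base_delay=2.0, max_delay=60.0):
    """Return the pause (in seconds) to use before the next request.

    Hypothetical policy: double the pause after a 429 or any 5xx response
    (capped at max_delay); otherwise ease back toward base_delay by halving.
    """
    if status_code == 429 or 500 <= status_code < 600:
        # The server looks overloaded or is pushing back: slow down.
        return min(current_delay * 2, max_delay)
    # The request went through: relax, but never go below the base pause.
    return max(base_delay, current_delay / 2)
```

You would call this after every response and sleep for the returned number of seconds before sending the next request.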

It is polite to obey the robots.txt file if one exists (e.g. http://dictionary.reference.com/robots.txt). If the robots.txt specifies a Crawl-delay value, then that's the amount of time the webmaster wants you to wait.
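For what it's worth, Python's standard library can read these rules for you. Below is a minimal sketch using `urllib.robotparser`; the robots.txt content is a made-up sample inlined so the snippet runs offline (it is not the real file from that site), and the helper names, the `my-bot` user agent, and the 2-second fallback are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_delay(parser, user_agent, default_delay=2.0):
    """Use the site's Crawl-delay if it declares one, else an assumed default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_delay


rules = load_rules(SAMPLE_ROBOTS_TXT)
```

Against a live site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`, then check `rules.can_fetch("my-bot", url)` before each request and pause for `polite_delay(rules, "my-bot")` seconds between requests.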

Most sites where this is a problem will actually just ban you. Google bans your IP from searches until you've entered a captcha.

We data-mined a very large company's website for another very large company and their legal department gave us a rule of 1 request every 2 seconds.
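A fixed rule like that is easy to enforce in code. Here is a small sketch of a throttle that guarantees a minimum gap between requests; the class name `Throttle` and the injectable `clock` and `sleep` parameters are my own choices (they only exist to make it testable), not anything that legal department specified.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one `Throttle(2.0)` per site and call `wait()` right before every request; the first call returns immediately and later calls sleep only for whatever is left of the 2 seconds.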
