Web Crawler

Started by rajend3
10 comments, last by swiftcoder 14 years, 10 months ago
I'm trying to program a web crawler in Java, but I've run into a problem: every time I request Google's index page I get a 302 response. This is my code:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.Socket;
import java.net.UnknownHostException;

final int HTTP_PORT = 80;

Socket socket;
try
{
   // Open a plain TCP connection to port 80.
   socket = new Socket("www.google.com", HTTP_PORT);

   BufferedWriter out = new BufferedWriter(
      new OutputStreamWriter(socket.getOutputStream()));

   BufferedReader in = new BufferedReader(
      new InputStreamReader(socket.getInputStream()));

   // Send a bare HTTP/1.0 GET request.
   out.write("GET /index.html HTTP/1.0\n\n");
   out.flush();

   // Dump the raw response (status line, headers, and body) to stdout.
   String line;
   while((line = in.readLine()) != null)
   {
      System.out.println(line);
   }

   out.close();
   in.close();
   socket.close();
}
catch (UnknownHostException e)
{
   e.printStackTrace();
}
catch (IOException e)
{
   e.printStackTrace();
}
And this is the output:

HTTP/1.1 302 Found
Location: http://www.google.ca/index.html
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=a2755879d5ff1604:TM=1245776068:LM=1245776068:S=2dpHFT7Wtee_HGUJ; expires=Thu, 23-Jun-2011 16:54:28 GMT; path=/; domain=.google.com
Date: Tue, 23 Jun 2009 16:54:28 GMT
Server: gws
Content-Length: 228

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved <A HREF="http://www.google.ca/index.html">here</A>.
</BODY></HTML>

[Edited by - rajend3 on June 23, 2009 12:06:56 PM]
I am quite certain I remember reading in Google's terms and conditions that it is forbidden to use software to interact with its site other than through the provided API.
I just wanted a simple site to test my code. I don't know if I made an invalid GET request or not.

EDIT: It seems to work with yahoo.com and monster.com, but I get a different problem with redcross.ca:

HTTP/1.1 400 Bad Request
Content-Type: text/html
Date: Tue, 23 Jun 2009 17:28:04 GMT
Connection: close
Content-Length: 39

<h1>Bad Request (Invalid Hostname)</h1>


But it works in my browser, where the address changes to http://www.redcross.ca/article.asp?id=000005&tid=003

Any reasons why?
Quote:Original post by rajend3
I just wanted a simple site to test my code. I don't know if I made an invalid GET request or not.
You are getting the exact response you should - a redirect to the actual index page. Webcrawlers have to handle redirect responses just like a browser would.
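For what it's worth, handling the simple case isn't much code. Here's a rough sketch, reusing the BufferedReader in from the snippet above (the status-line parsing and header loop are my own illustration, not anything Google requires):

// Rough sketch: read the status line, pull the Location header out of the
// response headers, then issue a fresh GET against that URL.
String statusLine = in.readLine();                     // e.g. "HTTP/1.1 302 Found"
int statusCode = Integer.parseInt(statusLine.split(" ")[1]);

String location = null;
String headerLine;
while ((headerLine = in.readLine()) != null && headerLine.length() > 0)
{
   if (headerLine.toLowerCase().startsWith("location:"))
   {
      location = headerLine.substring("location:".length()).trim();
   }
}

if (statusCode >= 300 && statusCode < 400 && location != null)
{
   // Here you would open a new socket to the host named in 'location'
   // and repeat the GET for the new path.
   System.out.println("Redirected to " + location);
}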

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Use an HTTP/1.1 request, and specify the host:

GET /index.html HTTP/1.1
Host: www.google.ca
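In the code from the first post, that request would look roughly like this (the Connection: close header is my addition, so the server ends the response instead of holding the connection open for keep-alive):

// Sketch: an HTTP/1.1 request with the required Host header.
// Header lines end with CRLF, and a blank line terminates the request.
out.write("GET /index.html HTTP/1.1\r\n");
out.write("Host: www.google.com\r\n");
out.write("Connection: close\r\n");   // otherwise HTTP/1.1 keeps the connection alive
out.write("\r\n");
out.flush();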


And yes, HTTP is far from trivial; browsers hide a lot of the underlying complexity. Google and most other large service providers make extensive use of various HTTP and DNS mechanisms to direct their traffic.
If I change it to HTTP/1.0, I won't need the Host header, right?
HTTP/1.0 is very old and mostly not supported anymore. You need to use HTTP/1.1, and you need the Host: header. All of the current "shared web hosting" providers use 1.1 with Host: headers, for example -- if you tried to go to www.enchantedage.com or www.kwxport.org (my two sites) using simply the IP, you'd end up at the DreamHost main page.
enum Bool { True, False, FileNotFound };
Quote:Original post by rajend3
If I change it to HTTP/1.0, I won't need the Host header, right?


That's correct, but if you do, the server will likely still send you an HTTP/1.1 reply, as Google did. Honestly, adding the Host: header to your GET is one of the more trivial issues. If you roll your own, you'll also need to parse those responses and handle things like chunked transfer encoding and gzip compression. Then you'll need to parse at least HTML 4.01 so your crawler can follow links and actually crawl.
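The chunked/gzip part, for instance, boils down to checking two response headers before you read the body. A rough sketch (the headers map and the ChunkedInputStream wrapper are placeholders I made up for illustration; GZIPInputStream is the standard JDK class):

// Hypothetical sketch: 'headers' is a Map<String, String> of lower-cased
// header names built while reading the response headers.
String transferEncoding = headers.get("transfer-encoding");
String contentEncoding  = headers.get("content-encoding");

InputStream body = socket.getInputStream();
if ("chunked".equalsIgnoreCase(transferEncoding))
{
   body = new ChunkedInputStream(body);               // placeholder de-chunking wrapper
}
if ("gzip".equalsIgnoreCase(contentEncoding))
{
   body = new java.util.zip.GZIPInputStream(body);    // real JDK class
}
// 'body' can now be read as the decoded document.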

RFC 2616 HTTP 1.1
RFC 822
Patrick
Instead of using a raw socket, you should let Java handle all of that for you, for many reasons. See if this will do what you need:

protected void doRedirect(HttpServletRequest req,
                          HttpServletResponse res)
   throws IOException, ServletException {

   String name = req.getParameter("name");

   // Look up the site by name
   String url = (String)_p.get(name);
   if (url == null) {
      url = "errorPage.html";
   }

   // Redirect request
   res.sendRedirect(url);
}

http://java.sun.com/developer/EJTechTips/2003/tt0513.html


Here is another one: http://www.javapractices.com/topic/TopicAction.do?Id=181
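For the crawler (client) side of things, the same "let Java handle it" advice points at java.net.HttpURLConnection, which adds the Host header and follows ordinary redirects for you. A minimal sketch along those lines (my own example, not taken from either link above):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class Fetch
{
   public static void main(String[] args) throws IOException
   {
      // HttpURLConnection sends the Host header and follows same-protocol
      // redirects automatically, so the 302 from google.com is handled for you.
      URL url = new URL("http://www.google.com/index.html");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestMethod("GET");

      BufferedReader in = new BufferedReader(
         new InputStreamReader(conn.getInputStream()));
      String line;
      while ((line = in.readLine()) != null)
      {
         System.out.println(line);
      }
      in.close();
      conn.disconnect();
   }
}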

[EDIT]
By the way, when you write the redirect code, don't forget to accommodate a few of the gotchas out there. For one, some sites (Microsoft's, for example) check the User-Agent header and will block you if it isn't one they recognize. Secondly, some sites set traps that redirect you an unlimited number of times to crash your bot, so set a maximum redirect count and bail out if your program hits it (see the sketch below). Finally, make sure you respect each site's robots.txt file! Some sites will ban your IP if your bot accesses certain links (like this one: http://danielwebb.us/software/bot-trap/).
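As a rough illustration of the redirect cap (fetch(), isRedirect(), process(), and Response are hypothetical stand-ins for whatever request code you end up with):

// Hypothetical sketch: stop after a fixed number of redirects so a
// redirect trap can't bounce the crawler around forever.
final int MAX_REDIRECTS = 5;

String url = startUrl;
for (int i = 0; i <= MAX_REDIRECTS; i++)
{
   Response response = fetch(url);            // hypothetical request helper
   if (!isRedirect(response.statusCode))      // 301, 302, 303, 307...
   {
      process(response);                      // hypothetical: hand off the page
      break;
   }
   if (i == MAX_REDIRECTS)
   {
      System.err.println("Too many redirects, giving up on " + startUrl);
      break;
   }
   url = response.locationHeader;             // follow the Location header
}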

Have fun!
Quote:Original post by prh99
Quote:Original post by rajend3
If I change it to HTTP/1.0, I won't need the Host header, right?


That's correct, but if you do, the server will likely still send you an HTTP/1.1 reply, as Google did. Honestly, adding the Host: header to your GET is one of the more trivial issues. If you roll your own, you'll also need to parse those responses and handle things like chunked transfer encoding and gzip compression. Then you'll need to parse at least HTML 4.01 so your crawler can follow links and actually crawl.

RFC 2616 HTTP 1.1
RFC 822


Writing an HTML parser isn't too difficult, but it can be tricky (accommodating lazy developers, developers who don't follow the standard, etc.). A simple regular expression can pull out the links relatively easily, for example:
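Something along these lines, roughly (double-quoted href attributes only; a real crawler would also have to resolve relative URLs and handle single quotes and unquoted values):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor
{
   // Very rough: grabs the value of href="..." attributes, case-insensitively.
   private static final Pattern HREF =
      Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

   public static List<String> extractLinks(String html)
   {
      List<String> links = new ArrayList<String>();
      Matcher m = HREF.matcher(html);
      while (m.find())
      {
         links.add(m.group(1));
      }
      return links;
   }
}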

This topic is closed to new replies.
