Web Crawler

Started by rajend3
10 comments, last by swiftcoder 14 years, 10 months ago
I'm trying to program a web crawler in Java, but I've run into a problem: every time I request Google's index page I get a 302 response. This is my code:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.Socket;
import java.net.UnknownHostException;

final int HTTP_PORT = 80;

Socket socket;
try
{
   // Open a plain TCP connection to port 80.
   socket = new Socket("www.google.com", HTTP_PORT);

   BufferedWriter out = new BufferedWriter(
      new OutputStreamWriter(socket.getOutputStream()));

   BufferedReader in = new BufferedReader(
      new InputStreamReader(socket.getInputStream()));

   // Send a bare HTTP/1.0 GET request.
   out.write("GET /index.html HTTP/1.0\n\n");
   out.flush();

   // Dump the raw response (status line, headers, and body) to stdout.
   String line;
   while((line = in.readLine()) != null)
   {
      System.out.println(line);
   }

   out.close();
   in.close();
   socket.close();
}
catch (UnknownHostException e)
{
   e.printStackTrace();
}
catch (IOException e)
{
   e.printStackTrace();
}
And this is the output:

HTTP/1.1 302 Found
Location: http://www.google.ca/index.html
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=a2755879d5ff1604:TM=1245776068:LM=1245776068:S=2dpHFT7Wtee_HGUJ; expires=Thu, 23-Jun-2011 16:54:28 GMT; path=/; domain=.google.com
Date: Tue, 23 Jun 2009 16:54:28 GMT
Server: gws
Content-Length: 228

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved <A HREF="http://www.google.ca/index.html">here</A>.
</BODY></HTML>

[Edited by - rajend3 on June 23, 2009 12:06:56 PM]
I am quite certain I remember reading in Google's terms and conditions that it is forbidden to use software to interact with its site other than through the provided API.
I just wanted a simple site to test my code. I don't know if I made an invalid GET request or not.

EDIT: It seems to work with yahoo.com and monster.com, but I get a different problem with redcross.ca:

HTTP/1.1 400 Bad Request
Content-Type: text/html
Date: Tue, 23 Jun 2009 17:28:04 GMT
Connection: close
Content-Length: 39

<h1>Bad Request (Invalid Hostname)</h1>


But it works in my browser, where the address changes to http://www.redcross.ca/article.asp?id=000005&tid=003

Any reasons why?
Quote:Original post by rajend3
I just wanted a simple site to test my code. I don't know if I made an invalid GET request or not.
You are getting the exact response you should - a redirect to the actual index page. Webcrawlers have to handle redirect responses just like a browser would.
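For what it's worth, handling the simple case isn't much code. Here's a rough sketch, reusing the BufferedReader in from the snippet above (the status-line parsing and header loop are my own illustration, not anything Google requires):

// Rough sketch: read the status line, pull the Location header out of the
// response headers, then issue a fresh GET against that URL.
String statusLine = in.readLine();                     // e.g. "HTTP/1.1 302 Found"
int statusCode = Integer.parseInt(statusLine.split(" ")[1]);

String location = null;
String headerLine;
while ((headerLine = in.readLine()) != null && headerLine.length() > 0)
{
   if (headerLine.toLowerCase().startsWith("location:"))
   {
      location = headerLine.substring("location:".length()).trim();
   }
}

if (statusCode >= 300 && statusCode < 400 && location != null)
{
   // Here you would open a new socket to the host named in 'location'
   // and repeat the GET for the new path.
   System.out.println("Redirected to " + location);
}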

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Use an HTTP/1.1 request, and specify the host:

GET /index.html HTTP/1.1
Host: www.google.ca
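In the code from the first post, that request would look roughly like this (the Connection: close header is my addition, so the server ends the response instead of holding the connection open for keep-alive):

// Sketch: an HTTP/1.1 request with the required Host header.
// Header lines end with CRLF, and a blank line terminates the request.
out.write("GET /index.html HTTP/1.1\r\n");
out.write("Host: www.google.com\r\n");
out.write("Connection: close\r\n");   // otherwise HTTP/1.1 keeps the connection alive
out.write("\r\n");
out.flush();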


And yes, HTTP is far from trivial; browsers hide a lot of the underlying complexity. Google and most other large service providers make extensive use of various HTTP and DNS mechanisms to direct their traffic.
If I change it to HTTP/1.0, I won't need the Host header, right?
HTTP/1.0 is very old and mostly not supported anymore. You need to use HTTP/1.1, and you need the Host: header. All of the current "shared web hosting" providers use 1.1 with Host: headers, for example -- if you tried to go to www.enchantedage.com or www.kwxport.org (my two sites) using simply the IP, you'd end up at the DreamHost main page.
enum Bool { True, False, FileNotFound };
Quote:Original post by rajend3
If I change it to HTTP/1.0, I won't need the Host header, right?


That's correct, but if you do, the server will likely still send you an HTTP/1.1 reply, as Google did. Honestly, adding the Host: header to your GET is one of the more trivial issues. If you roll your own, you'll also need to parse those responses and handle things like chunked transfer encoding and gzip compression. Then you'll need to parse at least HTML 4.01 so your crawler can follow links and actually crawl.
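The chunked/gzip part, for instance, boils down to checking two response headers before you read the body. A rough sketch (the headers map and the ChunkedInputStream wrapper are placeholders I made up for illustration; GZIPInputStream is the standard JDK class):

// Hypothetical sketch: 'headers' is a Map<String, String> of lower-cased
// header names built while reading the response headers.
String transferEncoding = headers.get("transfer-encoding");
String contentEncoding  = headers.get("content-encoding");

InputStream body = socket.getInputStream();
if ("chunked".equalsIgnoreCase(transferEncoding))
{
   body = new ChunkedInputStream(body);               // placeholder de-chunking wrapper
}
if ("gzip".equalsIgnoreCase(contentEncoding))
{
   body = new java.util.zip.GZIPInputStream(body);    // real JDK class
}
// 'body' can now be read as the decoded document.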

RFC 2616 HTTP 1.1
RFC 822
Patrick
Instead of using a raw socket, you should let Java handle all of that for you, for many reasons. See if this will do what you need:

protected void doRedirect(HttpServletRequest req,
                          HttpServletResponse res)
   throws IOException, ServletException {

   String name = req.getParameter("name");

   // Look up the site by name
   String url = (String)_p.get(name);
   if (url == null) {
      url = "errorPage.html";
   }

   // Redirect request
   res.sendRedirect(url);
}

http://java.sun.com/developer/EJTechTips/2003/tt0513.html


Here is another one: http://www.javapractices.com/topic/TopicAction.do?Id=181
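For the crawler (client) side of things, the same "let Java handle it" advice points at java.net.HttpURLConnection, which adds the Host header and follows ordinary redirects for you. A minimal sketch along those lines (my own example, not taken from either link above):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class Fetch
{
   public static void main(String[] args) throws IOException
   {
      // HttpURLConnection sends the Host header and follows same-protocol
      // redirects automatically, so the 302 from google.com is handled for you.
      URL url = new URL("http://www.google.com/index.html");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestMethod("GET");

      BufferedReader in = new BufferedReader(
         new InputStreamReader(conn.getInputStream()));
      String line;
      while ((line = in.readLine()) != null)
      {
         System.out.println(line);
      }
      in.close();
      conn.disconnect();
   }
}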

[EDIT]
By the way, when you write the redirect code, don't forget to accommodate a few of the gotchas out there. For one, some sites (Microsoft's, for example) check the User-Agent header and will block you if it isn't one they recognize. Secondly, some sites set traps that redirect you an unlimited number of times to crash your bot, so set a maximum redirect count and bail out if your program hits it (see the sketch below). Finally, make sure you respect each site's robots.txt file! Some sites will ban your IP if your bot accesses certain links (like this one: http://danielwebb.us/software/bot-trap/).
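As a rough illustration of the redirect cap (fetch(), isRedirect(), process(), and Response are hypothetical stand-ins for whatever request code you end up with):

// Hypothetical sketch: stop after a fixed number of redirects so a
// redirect trap can't bounce the crawler around forever.
final int MAX_REDIRECTS = 5;

String url = startUrl;
for (int i = 0; i <= MAX_REDIRECTS; i++)
{
   Response response = fetch(url);            // hypothetical request helper
   if (!isRedirect(response.statusCode))      // 301, 302, 303, 307...
   {
      process(response);                      // hypothetical: hand off the page
      break;
   }
   if (i == MAX_REDIRECTS)
   {
      System.err.println("Too many redirects, giving up on " + startUrl);
      break;
   }
   url = response.locationHeader;             // follow the Location header
}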

Have fun!
Quote:Original post by prh99
Quote:Original post by rajend3
If I change it to HTTP/1.0, I won't need the Host header, right?


That's correct, but if you do, the server will likely still send you an HTTP/1.1 reply, as Google did. Honestly, adding the Host: header to your GET is one of the more trivial issues. If you roll your own, you'll also need to parse those responses and handle things like chunked transfer encoding and gzip compression. Then you'll need to parse at least HTML 4.01 so your crawler can follow links and actually crawl.

RFC 2616 HTTP 1.1
RFC 822


Writing an HTML parser isn't too difficult, but it can be tricky (accommodating lazy developers, developers who don't follow the standard, etc.). A simple regular expression can pull out the links relatively easily, for example:
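Something along these lines, roughly (double-quoted href attributes only; a real crawler would also have to resolve relative URLs and handle single quotes and unquoted values):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor
{
   // Very rough: grabs the value of href="..." attributes, case-insensitively.
   private static final Pattern HREF =
      Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

   public static List<String> extractLinks(String html)
   {
      List<String> links = new ArrayList<String>();
      Matcher m = HREF.matcher(html);
      while (m.find())
      {
         links.add(m.group(1));
      }
      return links;
   }
}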

This topic is closed to new replies.
