Web Crawler

Original post by rajend3


I'm trying to program a web crawler in Java, but I've run into a problem: every time I try to get the index page of Google, I get a 302 response. This is my code:
final int HTTP_PORT = 80;

Socket socket;
try
{
   socket = new Socket("www.google.com", HTTP_PORT);

   BufferedWriter out = new BufferedWriter(
      new OutputStreamWriter(socket.getOutputStream()));

   BufferedReader in = new BufferedReader(
      new InputStreamReader(socket.getInputStream()));

   // Send a bare HTTP/1.0 request for the index page
   out.write("GET /index.html HTTP/1.0\n\n");
   out.flush();

   // Print everything the server sends back
   String line;
   while((line = in.readLine()) != null)
   {
      System.out.println(line);
   }

   out.close();
   in.close();
   socket.close();
}
catch (UnknownHostException e)
{
   e.printStackTrace();
}
catch (IOException e)
{
   e.printStackTrace();
}
And this is the output:

HTTP/1.1 302 Found
Location: http://www.google.ca/index.html
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=a2755879d5ff1604:TM=1245776068:LM=1245776068:S=2dpHFT7Wtee_HGUJ; expires=Thu, 23-Jun-2011 16:54:28 GMT; path=/; domain=.google.com
Date: Tue, 23 Jun 2009 16:54:28 GMT
Server: gws
Content-Length: 228

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.ca/index.html">here</A>.
</BODY></HTML>

I just wanted a simple site to test my code. I don't know if I made an invalid GET request or not.

EDIT: It seems to work with yahoo.com and monster.com, but I get a different problem with redcross.ca:

HTTP/1.1 400 Bad Request
Content-Type: text/html
Date: Tue, 23 Jun 2009 17:28:04 GMT
Connection: close
Content-Length: 39

<h1>Bad Request (Invalid Hostname)</h1>


But it works in my browser, with the link location changed to http://www.redcross.ca/article.asp?id=000005&tid=003

Any reasons why?

Quote:
Original post by rajend3
I just wanted a simple site to test my code. I don't know if I made an invalid get request or not.
You are getting exactly the response you should: a redirect to the actual index page. Web crawlers have to handle redirect responses just like a browser would.
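
With the raw-socket approach from the original post, handling the redirect can be as simple as scanning the response headers for the Location line and then issuing a new GET to that URL. A minimal sketch (not the poster's code; the class and method names are just placeholders):

import java.io.BufferedReader;
import java.io.IOException;

public class RedirectHelper
{
   // Scan the response headers for a Location header. Headers end at the
   // first blank line; a null return means the response was not a redirect.
   public static String findLocation(BufferedReader in) throws IOException
   {
      String location = null;
      String line;
      while ((line = in.readLine()) != null && line.length() > 0)
      {
         if (line.toLowerCase().startsWith("location:"))
         {
            location = line.substring("location:".length()).trim();
         }
      }
      return location;
   }
}

If findLocation returns a URL, open a new socket to that host and request the new path, and keep a cap on how many redirects you will follow in a row.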

Use an HTTP/1.1 request, and specify the host:

 GET /index.html HTTP/1.1
Host: www.google.ca


And yes, HTTP is far from trivial; browsers hide a lot of the underlying complexity. Google and most other large service providers make extensive use of various HTTP and DNS mechanisms to direct their traffic.
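
In the original snippet (which connects to www.google.com), the write call would become something like this. Note that HTTP lines are supposed to end with \r\n, and Connection: close tells the server to close the socket when it's done, since HTTP/1.1 keeps connections open by default:

// HTTP/1.1 requires the Host header; \r\n is the line ending HTTP expects.
out.write("GET /index.html HTTP/1.1\r\n");
out.write("Host: www.google.com\r\n");
out.write("Connection: close\r\n");
out.write("\r\n");   // blank line ends the request headers
out.flush();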

HTTP/1.0 is very old and mostly not supported anymore. You need to use HTTP/1.1, and you need the Host: header. All of the current "shared web hosting" providers rely on HTTP/1.1 Host: headers, for example: if you tried to go to www.enchantedage.com or www.kwxport.org (my two sites) using just the IP address, you'd end up at the DreamHost main page.

Quote:
Original post by rajend3
If I changed it to HTTP/1.0 I won't need the Host part, right?


That's correct, but if you do, the server will likely still send you an HTTP/1.1 reply, as Google did. Honestly, adding the Host: header to your GET is one of the more trivial issues. If you roll your own client, you'll also need to parse those responses and handle things like chunked transfer encoding and gzip compression. Then you'll need to parse at least HTML 4.01 so your crawler can follow links and actually crawl.

RFC 2616 (HTTP/1.1)
RFC 822 (the message/header format HTTP reuses)
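
If you would rather not deal with all of that by hand, java.net.HttpURLConnection already handles chunked transfer encoding and follows redirects by default, so gzip is about the only thing here you have to unwrap yourself. A rough sketch (class name and User-Agent string are made up; a real crawler should also read the charset from the Content-Type header instead of assuming UTF-8):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class SimpleFetcher
{
   // Fetch a page, letting HttpURLConnection handle chunked encoding and
   // redirects, and unwrapping gzip if the server compressed the body.
   public static String fetch(String address) throws IOException
   {
      HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
      conn.setRequestProperty("Accept-Encoding", "gzip");
      conn.setRequestProperty("User-Agent", "MyCrawler/0.1"); // placeholder name

      InputStream body = conn.getInputStream();
      if ("gzip".equalsIgnoreCase(conn.getContentEncoding()))
      {
         body = new GZIPInputStream(body);
      }

      BufferedReader in = new BufferedReader(new InputStreamReader(body, "UTF-8"));
      StringBuilder page = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null)
      {
         page.append(line).append('\n');
      }
      in.close();
      conn.disconnect();
      return page.toString();
   }
}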

Instead of using a raw socket, you should let Java handle all of that for you. See if this will do what you need:


protected void doRedirect(HttpServletRequest req, HttpServletResponse res)
      throws IOException, ServletException {

   String name = req.getParameter("name");

   // Look up the site by name (_p holds the name-to-URL mappings)
   String url = (String) _p.get(name);
   if (url == null) {
      url = "errorPage.html";
   }

   // Redirect the request
   res.sendRedirect(url);
}



http://java.sun.com/developer/EJTechTips/2003/tt0513.html


Here is another one: http://www.javapractices.com/topic/TopicAction.do?Id=181

[EDIT]
By the way, when you are writing the redirect code, don't forget to accommodate a few of the gotchas out there. For one, some sites check the User-Agent header (Microsoft's, for example) and will block you if it isn't one they expect. Secondly, some sites have traps set up to redirect you an unlimited number of times to crash your bot, so set a maximum redirect count and bail out once your program hits it. Finally, make sure you respect each site's robots.txt file! Some sites will ban your IP if your bot accesses certain links (like this one: http://danielwebb.us/software/bot-trap/).
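
For the robots.txt part, a very rough sketch (the class and method names are made up): it collects the Disallow rules under "User-agent: *" and refuses any path that starts with one of them. A real crawler should also honor agent-specific sections, Allow lines, and Crawl-delay.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsTxt
{
   // Collect the Disallow rules that apply to every crawler ("User-agent: *").
   public static List<String> fetchDisallowed(String host) throws IOException
   {
      List<String> disallowed = new ArrayList<String>();
      BufferedReader in = new BufferedReader(new InputStreamReader(
         new URL("http://" + host + "/robots.txt").openStream()));
      boolean appliesToUs = false;
      String line;
      while ((line = in.readLine()) != null)
      {
         line = line.trim();
         if (line.toLowerCase().startsWith("user-agent:"))
         {
            appliesToUs = line.substring("user-agent:".length()).trim().equals("*");
         }
         else if (appliesToUs && line.toLowerCase().startsWith("disallow:"))
         {
            String path = line.substring("disallow:".length()).trim();
            if (path.length() > 0)
            {
               disallowed.add(path);
            }
         }
      }
      in.close();
      return disallowed;
   }

   // A path is off limits if it starts with any disallowed prefix.
   public static boolean isAllowed(List<String> disallowed, String path)
   {
      for (String prefix : disallowed)
      {
         if (path.startsWith(prefix))
         {
            return false;
         }
      }
      return true;
   }
}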

Have fun!

Quote:
Original post by prh99
Then you'll need to parse at least HTML 4.01 so your crawler can follow links and actually crawl.


Writing an HTML parser isn't too difficult, but it can be tricky (accommodating lazy developers, developers who don't follow the standard, etc.). A simple regular expression can pull out the links relatively easily.
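
For example, a throwaway regex like this pulls out double-quoted href values (a sketch only; the class name is made up, and it will miss single-quoted and unquoted attributes, which is part of the trickiness mentioned above):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor
{
   // Matches href="..." attributes, case-insensitively.
   private static final Pattern HREF =
      Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

   public static List<String> extractLinks(String html)
   {
      List<String> links = new ArrayList<String>();
      Matcher m = HREF.matcher(html);
      while (m.find())
      {
         links.add(m.group(1));
      }
      return links;
   }
}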

Quote:
Original post by UltimaX
Writing an HTML parser isn't too difficult, but it can be tricky (accommodating lazy developers, developers who don't follow the standard, etc.).


This is one of those things with huge geek appeal, especially if one tries to go for a 100% standards-compliant parser.

But if it was difficult several years ago, it's basically impossible today. The HTTP part (even after accounting for all the variations) is a small part in practice. A large portion of sites today are problematic to parse without JavaScript and proper session handling, and once you add that, you have a browser.

The most that makes sense is to parse the plain DOM, then work with the tags. Even that can be complicated by various encodings...
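
If tag-level access is all you need, the HTML parser that ships with the JDK (javax.swing.text.html) is fairly tolerant of sloppy markup. A sketch of pulling anchors out of a page with it (the class name here is made up):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TagWalker
{
   // Collect the href attribute of every <a> tag the parser finds.
   public static List<String> hrefs(String html) throws IOException
   {
      final List<String> links = new ArrayList<String>();
      HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback()
      {
         public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos)
         {
            if (tag == HTML.Tag.A)
            {
               Object href = attrs.getAttribute(HTML.Attribute.HREF);
               if (href != null)
               {
                  links.add(href.toString());
               }
            }
         }
      };
      new ParserDelegator().parse(new StringReader(html), callback, true);
      return links;
   }
}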


Then again, Chrome and WebKit (Firefox too, but its code is annoying) are both state-of-the-art, open-source projects. Why not use them?

Perhaps write the spider in JavaScript running inside a browser. It comes with full DOM access, and if there are any security restrictions in place, just hack your own copy of the browser to get around them.


Firefox would work as well; after all, the ScrapBook add-on works like that.

Quote:
Original post by Antheus
Perhaps write the spider in JavaScript running inside a browser. It comes with full DOM access, and if there are any security restrictions in place, just hack your own copy of the browser to get around them.
On the other hand, most of the custom spiders I have seen could be replaced by a simple driver program that runs wget in a sub-process.
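
For example, something along these lines (a sketch; the class name and the particular wget flags are just for illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class WgetDriver
{
   // Let wget handle redirects, retries and recursion, then parse the files
   // it saves to disk afterwards.
   public static void mirror(String url, String outputDir)
      throws IOException, InterruptedException
   {
      Process p = new ProcessBuilder(
         "wget", "--recursive", "--level=2", "--wait=1",
         "--directory-prefix=" + outputDir, url)
         .redirectErrorStream(true)
         .start();

      // Drain wget's output so the child process can't block on a full pipe.
      BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
      String line;
      while ((line = out.readLine()) != null)
      {
         System.out.println(line);
      }
      p.waitFor();
   }
}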
