Do you know the exact URL for Google's search?

Started by
4 comments, last by Ron AF Greve 13 years, 6 months ago
Hello

I am writing a program that picks a term like "Pony", searches Google for that term, and then obtains the number of search results and the time the search took.

But I have a problem:
I am unsure whether I have the correct URL to search Google for a term. I have been using the URL...
http://www.google.com/search?q="search term goes here"


But this URL doesn't work WHEN I try to retrieve the HTML source code so I can parse it for the search result data.

If you place this URL in a browser, it works, BUT if you use it inside a program (I have tried in Python & Java) I get an IOException (in Java) or an IOError (in Python).

So my question is... does Google maybe stop programs from querying their engine like this so they don't get their search results stolen (another engine might piggyback off Google instead of coming up with its own search algorithms)
OR
is there a proper Google URL that I don't know about?


To see that it doesn't work, you can run this Python code & you will see that the IOError gets thrown:
import urllib2

def get_source(URL):
    """ Retrieve & return HTML source code from website URL """
    try:
        source_buffer = urllib2.urlopen(URL)
        source_code = source_buffer.read()
        source_buffer.close()
        print source_code
    except IOError:
        print """
              'get_source()' Function Failed:
                  Reasons could be:
                  - Invalid URL name
                  OR
                  - HTML protocol message transfer failure;
                    Internet Connection does not exist."""
        return None

get_source( "http://www.google.com/search?q='pony'" )
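The `except IOError` branch above prints the same message whatever went wrong, so it cannot tell a refused request from a dead connection. A minimal sketch of one way to surface the actual reason follows. It assumes Python 3 (`urllib.request` and `urllib.error` instead of `urllib2`), and `describe_failure` is a hypothetical helper name, not something from the original post.

```python
import urllib.error
import urllib.request


def describe_failure(exc):
    """Turn a urllib exception into a one-line, human-readable reason."""
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, but with an error status (4xx/5xx).
        return "HTTP error %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        # No usable answer at all: DNS failure, refused connection, etc.
        return "Connection problem: %s" % (exc.reason,)
    return "Unexpected error: %r" % (exc,)


def get_source(url):
    """Return the page body as bytes, or None after printing the failure reason."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read()
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        print(describe_failure(exc))
        return None
```

Seeing the status code in the message makes it easier to tell whether the server rejected the request or the request never arrived.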
Quote:Original post by gretty
So my question is... does Google maybe stop programs from querying their engine like this so they don't get their search results stolen (another engine might piggyback off Google instead of coming up with its own search algorithms)
I'm guessing this is what's going on. They're detecting that you're not a browser, and ignoring you. This makes sense, as you're asking them to generate a whole web-page, just to get a small bit of info from them.

They'd probably prefer you used one of their APIs directly, like the AJAX Search API.

Alternatively, you can adjust your HTTP request headers to include a browser's User-Agent string, etc., to fool them into thinking that your app is Firefox/IE/Chrome/etc...
Thanks :)

PS, that changing the header information to mimic a browser sounds really cool, but do you really think google...THE GOOGLE... and its programmers would have missed/allowed such a giant hole?

I'd be interested to try it tho :P What header flag/part would I need to change? Something like "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10" or something?


gonna look into the api

[Edited by - gretty on October 24, 2010 5:00:10 AM]
If your app behaves like a browser in terms of headers, then there's nothing Google can do to prevent this.
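As a rough illustration of the header change being discussed, here is a minimal sketch that attaches the User-Agent string quoted in this thread to a request. It assumes Python 3 (`urllib.request` rather than `urllib2`). The names `BROWSER_UA`, `build_browser_request` and `fetch` are hypothetical, and whether the server accepts the request is not guaranteed.

```python
import urllib.request

# User-Agent string quoted earlier in this thread; any real browser string
# could be substituted here.
BROWSER_UA = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) "
              "Gecko/20100914 Firefox/3.6.10")


def build_browser_request(url, user_agent=BROWSER_UA):
    """Return a Request carrying a browser-like User-Agent (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Send the request and return the body as bytes; raises URLError on failure."""
    with urllib.request.urlopen(build_browser_request(url), timeout=10) as resp:
        return resp.read()
```

Building the request separately from sending it makes it easy to inspect the headers before anything goes over the network.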
Quote:Original post by gretty
Thanks :)

PS, that changing the header information to mimic a browser sounds really cool, but do you really think google...THE GOOGLE... and its programmers would have missed/allowed such a giant hole?

I'd be interested to try it tho :P What header flag/part would I need to change? Something like "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10" or something?


gonna look into the api


Since it happens over a network, there is no way to avoid that giant hole: Google can only see the information you give them, and if you give them exactly the same information a normal web browser would, there is nothing they can do about it. (They will most likely block your IP if you start spamming them with requests, though.)
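To avoid looking like spam, a program could space its requests out. The sketch below shows one simple way to do that in Python 3. The `Throttle` class is a hypothetical helper, and a suitable interval is an assumption you would have to pick yourself; nothing in this thread says what rate Google tolerates.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time at which the previous call was released

    def wait(self):
        """Block until min_interval has passed since the last call; return the delay."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` right before each request keeps consecutive requests at least `min_interval` seconds apart.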
I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!
Hi,

This is what I get using just the command line (i.e. no information about the client is transferred to Google at all). I just get an answer. Note that the result is 'chunked', i.e. you do not get one large block per request but several smaller ones:

informationsuperhighway:~$ telnet www.google.com 80
Trying 74.125.79.99...
Connected to www.l.google.com.
Escape character is '^]'.
GET http://www.google.com/search?q='pony' HTTP/1.1
host: www.google.com

HTTP/1.1 200 OK
Cache-Control: private, max-age=0
Date: Sun, 24 Oct 2010 10:36:15 GMT
Expires: -1
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=e5ab8452cb36de6b:FF=0:TM=1287916575:LM=1287916575:S=pOIMdm4TonlLb5gT; expires=Tue, 23-Oct-2012 10:36:15 GMT; path=/; domain=.google.com
Set-Cookie: NID=40=Sp5NwQOwn8A7NTyJP6kyjN3-dkI-_jKp3DIzaTC1clkwfeZ9BsO0gKnUmP753QNMaS_NcRkM2Q7gKAeEC7IuBJjgatpcZXSyEqorusekcHkUxXFyiGiJKrNGbNzu1kzf; expires=Mon, 25-Apr-2011 10:36:15 GMT; path=/; domain=.google.com; HttpOnly
Server: gws
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked

1000
<!doctype html><head><title>'pony' - Google Search</title><script>window.google={kEI:"HwzETJ32J4GWOpj6oP0L",kEXPI:"25907,26637,26992,27095,27178",kCSI:{e:"25907,26637,26992,27095,27178",ei:"HwzETJ32J4GWOpj6oP0L",expi:"25907,26637,26992,27095,27178"},ml:function(){},kHL:"en",time:function(){return(new Date).getTime()},log:function(b,d,c){var a=new Image,e=google,g=e.lc,f=e.li;a.o

Last part trimmed (because there was a lot of data)
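On the 'chunked' point: in the session above, the `1000` just before the HTML is a chunk size in hexadecimal (4096 bytes). Each chunk is sent as a size line followed by that many bytes, and a chunk of size 0 ends the body. A minimal sketch of decoding such a body by hand follows, assuming Python 3 and assuming the whole raw body is already in memory. `decode_chunked` is a hypothetical helper, and HTTP libraries such as `urllib` normally do this for you.

```python
def decode_chunked(raw):
    """Decode an HTTP/1.1 chunked body given as bytes; return the joined payload."""
    body = bytearray()
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        # The chunk size is hexadecimal; anything after ';' is a chunk extension.
        size = int(raw[pos:line_end].split(b";", 1)[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # the zero-size chunk marks the end of the body
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)
```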



Edit:

Ok, wasted some more time on it:
@informationsuperhighway:~/test$ cat z
import urllib2

def get_source(URL):
    try:
        Req = urllib2.Request(URL)
        Req.add_header( 'q','pony' );
        f = urllib2.urlopen( Req );
        print f.read(100000)
    except IOError:
        print "Exception"
        return None

get_source( "http://www.google.com/search" )


Works :-)
e@informationsuperhighway:~/test$ python z
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><title>Google</title><script>window.google={kEI:"wBTETImYL5CB-gaQzZnpCw",kEXPI:"26637,26992,27095",kCSI:{e:"26637,26992,27095",ei:"wBTETImYL5CB-gaQzZnpCw",expi:"26637,26992,27095"},ml:function(){},kHL:"en",time:function(){return(new Date).getTime()},log:function(b,d,c){var a=new Image,e=google,g=e.lc,f=e.li;a.onerror=(a.onload=(a.onabort=function(){delete g[f]}));g[f]=a;c=c||"/gen_204?atyp=i&ct="+b+"&cad="+d+"&zx="+google.time();a.src=c;e.li=f+1},lc:[],li:0,Toolbelt:{}};window.google.sn="webhp";window.google.timers={load:{t:{start:(new Date).getTime()}}};try{}catch(u){}window.google.jsrt_kill=1;var _gjwl=location;function _gjuc(){var e=_gjwl.href.indexOf("#");if(e>=0){var a=_gjwl.href.substring(e);if(a.indexOf("&q=")>0||a.indexOf("#q=")>=0){a=a.substring(1);if(a.indexOf("#")==-1){for(var c=0;c<a.length;){var d=c;if(a.charAt(d)=="&")++d;var b=a.indexOf("&",d);if(b==-1)b=a.length;var f=a.substring(d,b);if(f.indexOf("fp=")==0){a=a.substring(0,c)+a.substring(b,a.length);b=c}else if(f=="cad=h")return 0;c=b}_gjwl.href="/search?"+a+"&cad=h";return 1}}}return 0}function _gjp(){!(window._gjwl.hash&&
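One thing to watch in the output above: the page title is just "Google" and it contains `window.google.sn="webhp"`, which suggests the home page came back rather than results for 'pony'. That would fit with `add_header( 'q','pony' )` sending `q` as an HTTP header rather than as part of the query string. A minimal sketch of putting the term into the URL instead follows, assuming Python 3 (`urllib.parse`). `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_search_url(term, base="http://www.google.com/search"):
    """Return the search URL with the term percent-encoded into the q parameter."""
    # urlencode escapes spaces, quotes and other unsafe characters in the term.
    return base + "?" + urlencode({"q": term})
```

For example, `build_search_url("my little pony")` gives `http://www.google.com/search?q=my+little+pony`, which can then be passed to whatever function fetches the page.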


[Edited by - Ron AF Greve on October 24, 2010 6:38:41 AM]
Ron AF Greve

This topic is closed to new replies.
