gretty

Do you know the exact URL for Google's search?



Hello

I am writing a program that picks a term like "Pony", searches Google for it, and then reads off the number of search results and the time the search took.

But I have a problem:
I am unsure whether I have the correct URL for searching Google. I have been using the URL...
http://www.google.com/search?q="search term goes here"


But this URL doesn't work WHEN I try to retrieve the HTML source code so I can parse it for the search result data.

If you paste this URL into a browser it works, BUT if you use it inside a program (I have tried Python & Java) I get an IOException (in Java) or an IOError (in Python).

So my question is... does Google maybe stop programs from querying their engine like this so their search results don't get stolen (another engine could piggyback off Google instead of coming up with its own search algorithms)
OR
is there a proper Google URL that I don't know about?


To see that it doesn't work, you can run this Python code & you will see that the IOError gets thrown:


import urllib2


def get_source(URL):
    """ Retrieve & return HTML source code from website URL """

    try:
        source_buffer = urllib2.urlopen(URL)
        source_code = source_buffer.read()
        source_buffer.close()
        print source_code
        return source_code

    except IOError:
        # urllib2.URLError and urllib2.HTTPError are both subclasses of IOError
        print """
'get_source()' Function Failed:
Reasons could be:
- Invalid URL name
OR
- HTTP protocol message transfer failure;
Internet Connection does not exist."""
        return None


get_source( "http://www.google.com/search?q='pony'" )
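
One thing worth noting about the snippet above: `except IOError` also catches HTTP error responses, because `urllib2.HTTPError` derives from `URLError`, which derives from `IOError`. So the failure may well be the server refusing the request rather than a bad URL or a dead connection. Below is a minimal sketch of how to tell the cases apart. It assumes Python 3, where `urllib2` became `urllib.request` and `urllib.error`; the helper name `describe_failure` is made up for illustration.

```python
import urllib.error
import urllib.request


def describe_failure(exc):
    """Turn an exception raised by urlopen() into a readable message."""
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, but with an error status such as 403.
        return "HTTP error %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        # No usable HTTP response at all (DNS failure, refused connection, ...).
        return "Connection problem: %s" % (exc.reason,)
    return "Other I/O error: %s" % (exc,)


def get_source(url):
    """Return the page body as bytes, or None after printing why it failed."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except OSError as exc:  # IOError is an alias of OSError in Python 3
        print(describe_failure(exc))
        return None
```

Printing the status code this way should show whether the request was rejected by the server or never got through at all.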



Quote:
Original post by gretty
So my question is... does Google maybe stop programs from querying their engine like this so their search results don't get stolen (another engine could piggyback off Google instead of coming up with its own search algorithms)
I'm guessing this is what's going on. They're detecting that you're not a browser and ignoring you. This makes sense, as you're asking them to generate a whole web page just to get a small bit of info from them.

They'd probably prefer you used one of their APIs directly, like the AJAX Search API.

Alternatively, you can adjust your HTTP request headers to include a browser's User-Agent string, etc., to fool them into thinking that your app is Firefox/IE/Chrome/etc...
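
A minimal sketch of the header approach, assuming Python 3's `urllib.request` (the successor of `urllib2`). The User-Agent value and the names `BROWSER_USER_AGENT`, `build_request` and `fetch` are placeholders invented for this example, and whether a given server accepts the request is still up to that server.

```python
import urllib.request

# Hypothetical value for illustration; any browser-like string could be used.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Create a Request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch url with the custom User-Agent and return the body as bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Without this, `urllib` announces itself with its own default User-Agent, which is easy for a server to single out.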

Thanks :)

PS, changing the header information to mimic a browser sounds really cool, but do you really think Google... THE GOOGLE... and its programmers would have missed/allowed such a giant hole?

I'd be interested to try it though :P Which header would I need to change? Something like "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10" or something?


Gonna look into the API.

[Edited by - gretty on October 24, 2010 5:00:10 AM]

Quote:
Original post by gretty
Thanks :)

PS, changing the header information to mimic a browser sounds really cool, but do you really think Google... THE GOOGLE... and its programmers would have missed/allowed such a giant hole?

I'd be interested to try it though :P Which header would I need to change? Something like "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10" or something?


Gonna look into the API.


Since it happens over a network, there is no way to close that giant hole. Google can only see the information you give them, and if you send exactly the same information a normal web browser would, there is nothing they can do about it. (They will most likely block your IP if you start spamming them with requests, though.)
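
To avoid hammering the server, one option is to space requests out. Here is a rough sketch in Python 3; the class name `Throttle` and the interval are made up for illustration, and nothing here reflects a limit Google actually publishes.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before each request keeps them at least `min_interval` seconds apart. The `clock` and `sleep` parameters exist only so the behaviour can be checked without real waiting.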

Hi,

This is what I get using just the command line (i.e. no information about the client is sent to Google at all). I simply get an answer. Note that the result is 'chunked', i.e. you do not get one large block per request but several smaller ones:


informationsuperhighway:~$ telnet www.google.com 80
Trying 74.125.79.99...
Connected to www.l.google.com.
Escape character is '^]'.
GET http://www.google.com/search?q='pony' HTTP/1.1
host: www.google.com

HTTP/1.1 200 OK
Cache-Control: private, max-age=0
Date: Sun, 24 Oct 2010 10:36:15 GMT
Expires: -1
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=e5ab8452cb36de6b:FF=0:TM=1287916575:LM=1287916575:S=pOIMdm4TonlLb5gT; expires=Tue, 23-Oct-2012 10:36:15 GMT; path=/; domain=.google.com
Set-Cookie: NID=40=Sp5NwQOwn8A7NTyJP6kyjN3-dkI-_jKp3DIzaTC1clkwfeZ9BsO0gKnUmP753QNMaS_NcRkM2Q7gKAeEC7IuBJjgatpcZXSyEqorusekcHkUxXFyiGiJKrNGbNzu1kzf; expires=Mon, 25-Apr-2011 10:36:15 GMT; path=/; domain=.google.com; HttpOnly
Server: gws
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked

1000
<!doctype html><head><title>'pony' - Google Search</title><script>window.google={kEI:"HwzETJ32J4GWOpj6oP0L",kEXPI:"25907,26637,26992,27095,27178",kCSI:{e:"25907,26637,26992,27095,27178",ei:
"HwzETJ32J4GWOpj6oP0L",expi:"25907,26637,26992,27095,27178"},ml:function(){},kHL:"en",time:function(){return(new Date).getTime()}
,log:function(b,d,c){var a=new Image,e=google,g=e.lc,f=e.li;a.o






Last part trimmed (because there was a lot of data)
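
For reference, the `1000` line in the response above is a chunk size in hexadecimal (0x1000 = 4096 bytes). `urllib2` and `urllib.request` decode chunked bodies for you, so this only matters when reading the raw socket as in the telnet session. A minimal Python 3 sketch of the decoding, with a made-up function name and no handling of trailers or truncated input:

```python
def decode_chunked(raw):
    """Decode an HTTP/1.1 chunked body (bytes) into the original payload."""
    body = bytearray()
    pos = 0
    while True:
        # Each chunk starts with a size line terminated by CRLF;
        # index() raises ValueError if the input is cut off.
        line_end = raw.index(b"\r\n", pos)
        # The size is hexadecimal and may carry ";extension" suffixes.
        size = int(raw[pos:line_end].split(b";")[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # a zero-length chunk ends the body
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)
```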



Edit:

Ok, wasted some more time on it:

@informationsuperhighway:~/test$ cat z
import urllib2

def get_source(URL):
    try:
        Req = urllib2.Request(URL)
        # Note: this sets an HTTP request header named 'q', not the ?q= query parameter
        Req.add_header('q', 'pony')
        f = urllib2.urlopen(Req)
        print f.read(100000)
    except IOError:
        print "Exception"
    return None


get_source( "http://www.google.com/search" )






Works :-)

e@informationsuperhighway:~/test$ python z
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><title>Google</title><script>window.google={kEI:"wBTETImYL5CB-gaQzZnpCw",kEXPI:"26637,26992,27095",kCSI:{e:"26637,26992,27095",ei:"wBTETImYL5CB-gaQzZnpCw",expi:"26637,26992,27095"},ml:function(){},kHL:"en",time:function(){return(new Date).getTime()},log:function(b,d,c){var a=new Image,e=google,g=e.lc,f=e.li;a.onerror=(a.onload=(a.onabort=function(){delete g[f]}));g[f]=a;c=c||"/gen_204?atyp=i&ct="+b+"&cad="+d+"&zx="+google.time();a.src=c;e.li=f+1},lc:[],li:0,Toolbelt:{}};
window.google.sn="webhp";window.google.timers={load:{t:{start:(new Date).getTime()}}};try{}catch(u){}window.google.jsrt_kill=1;
var _gjwl=location;function _gjuc(){var e=_gjwl.href.indexOf("#");if(e>=0){var a=_gjwl.href.substring(e);if(a.indexOf("&q=")>0||a.indexOf("#q=")>=0){a=a.substring(1);if(a.indexOf("#")==-1){for(var c=0;c<a.length;){var d=c;if(a.charAt(d)=="&")++d;var b=a.indexOf("&",d);if(b==-1)b=a.length;var f=a.substring(d,b);if(f.indexOf("fp=")==0){a=a.substring(0,c)+a.substring(b,a.length);b=c}else if(f=="cad=h")return 0;c=b}_gjwl.href="/search?"+a+"&cad=h";return 1}}}return 0}function _gjp(){!(window._gjwl.hash&&
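
Note that the page returned above has the title "Google" and `window.google.sn="webhp"`, which suggests it is the plain home page rather than a results page: `add_header('q', 'pony')` sets an HTTP header, not the `q` query parameter. A hedged sketch of building the search URL with a properly encoded query string in Python 3 (the helper name `build_search_url` is invented for this example):

```python
from urllib.parse import urlencode


def build_search_url(term, base="http://www.google.com/search"):
    """Return base with term attached as a URL-encoded q= query parameter."""
    return base + "?" + urlencode({"q": term})
```

For example, `build_search_url("my little pony")` gives `http://www.google.com/search?q=my+little+pony`, and quote characters such as `'` are percent-encoded instead of being sent raw.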




[Edited by - Ron AF Greve on October 24, 2010 6:38:41 AM]
