giugio

crawler c++


Hi. I'm still working on a crawler. It is composed of three layers: the first builds the address strings of the pages to load, the second creates a socket and fetches the HTML content of each page, and the last parses the pages. The bottleneck is creating the request and waiting for the page to download. I have a 4-core machine, so parsing should not be a problem. For loading the pages, I am thinking of using several threads (how many?) that make the requests and save the responses in a circular list. Can the circular list be accessed both from the download layer (for inserts) and from the parser layer (for reads)? Would this give a significant increase in performance? How else can I increase performance? Right now it is very slow.

Quote:
Original post by giugio
The bottleneck is creating the request and waiting for the page to download.


Of course.

Quote:
For loading the pages, I am thinking of using several threads (how many?) that make the requests and save the responses in a circular list.


Threading can't help you very much; you only have one internet connection, and any download threads will have to share it.

Try checking to make sure you don't download the same page more than once.
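A minimal sketch of that check, assuming URLs are kept as plain `std::string`s. The `VisitedSet` class and its `mark_if_new` method are made-up names for illustration, and the normalization is deliberately crude (it only drops a `#fragment` and one trailing slash), so treat it as a starting point rather than a complete solution:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical helper: remembers which URLs have already been queued.
// Not thread-safe; guard it with a mutex if several threads share it.
class VisitedSet {
public:
    // Returns true the first time a URL is seen, false on repeats.
    bool mark_if_new(const std::string& url) {
        return seen_.insert(normalize(url)).second;
    }

    std::size_t size() const { return seen_.size(); }

private:
    // Rough normalization (an assumption, not a full URL canonicalizer):
    // strip any "#fragment" and a single trailing slash.
    static std::string normalize(std::string url) {
        const auto hash = url.find('#');
        if (hash != std::string::npos) url.erase(hash);
        if (url.size() > 1 && url.back() == '/') url.pop_back();
        return url;
    }

    std::unordered_set<std::string> seen_;
};
```

The first layer would call `mark_if_new(url)` before handing an address to the download layer and skip the URL when it returns false.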

Quote:
Would this give a significant increase in performance?
How else can I increase performance? Right now it is very slow.


Crawlers are supposed to be slow. The owner of the web server you are crawling will be annoyed if you crawl too fast ;) And some servers will automatically limit you if you try to check too many pages in a short amount of time.
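One way to stay polite is to enforce a minimum gap between requests to the same host. The sketch below is only an illustration: `HostThrottle`, `reserve`, and the idea of passing the gap in milliseconds are my own assumptions, and a real crawler would probably also want to honor `robots.txt`:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Hypothetical per-host throttle: keeps requests to one host at least
// `min_gap` apart. Not thread-safe; protect it with a mutex if shared.
class HostThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit HostThrottle(std::chrono::milliseconds min_gap) : min_gap_(min_gap) {}

    // Returns how long the caller should sleep before contacting `host`,
    // and books the slot so the next caller for that host waits longer.
    std::chrono::milliseconds reserve(const std::string& host,
                                      Clock::time_point now = Clock::now()) {
        Clock::time_point start = now;
        const auto it = next_allowed_.find(host);
        if (it != next_allowed_.end() && it->second > now) start = it->second;
        next_allowed_[host] = start + min_gap_;
        return std::chrono::duration_cast<std::chrono::milliseconds>(start - now);
    }

private:
    std::chrono::milliseconds min_gap_;
    std::unordered_map<std::string, Clock::time_point> next_allowed_;
};
```

A download thread would call `std::this_thread::sleep_for(throttle.reserve(host))` before opening the socket. Requests to different hosts do not delay each other.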

Some time ago I wrote a tool to search through Google results using regular expressions, and I came across the same problem as you. I solved it by using a thread pool and increasing the number of threads until the bandwidth limit was reached. I think I ended up using between 10 and 20 threads. So yes, simultaneous downloads are the way to go, and they can significantly increase performance.
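This is not the code from that tool, just a rough sketch of the idea using `std::thread`. The `download_all` function, the `Page` struct, and the `fetch` callback are hypothetical names; `fetch` stands in for whatever blocking socket code already downloads one page, and it is assumed not to throw:

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// One downloaded page: the URL and whatever `fetch` returned for it.
struct Page {
    std::string url;
    std::string body;
};

// Downloads every URL using `worker_count` threads. Pages come back in
// completion order, not in the order of `urls`.
std::vector<Page> download_all(
    const std::vector<std::string>& urls,
    std::size_t worker_count,
    const std::function<std::string(const std::string&)>& fetch) {
    if (worker_count == 0) worker_count = 1;

    std::vector<Page> pages;
    std::mutex mutex;      // guards `next` and `pages`
    std::size_t next = 0;  // index of the next URL nobody has claimed yet

    auto worker = [&] {
        for (;;) {
            std::size_t index;
            {
                std::lock_guard<std::mutex> lock(mutex);
                if (next == urls.size()) return;  // nothing left to claim
                index = next++;
            }
            // The slow network wait happens outside the lock, so the
            // other workers keep downloading in the meantime.
            std::string body = fetch(urls[index]);
            std::lock_guard<std::mutex> lock(mutex);
            pages.push_back(Page{urls[index], std::move(body)});
        }
    };

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < worker_count; ++i) workers.emplace_back(worker);
    for (auto& w : workers) w.join();
    return pages;
}
```

Start with a small `worker_count` and raise it while total throughput keeps improving. Because the threads spend most of their time waiting on the network, the useful number can be well above the number of CPU cores.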

Quote:
Original post by giugio
The bottleneck is creating the request and waiting for the page to download.

Run multiple copies of your application.

Otherwise, look into boost::asio, which is an asynchronous library intended for efficient networking.

Or better yet - why not use an existing application, such as wget?
