crawler c++

Started by giugio
2 comments, last by Antheus 14 years, 10 months ago
Hi. I'm writing a crawler in C++. It is composed of three layers: the first layer builds the strings with the addresses of the pages to be loaded, the second layer creates a socket and fetches the HTML page content, and the last layer parses the pages. The bottleneck is creating the request and waiting for the download of the page. I have a 4-core machine and parsing cannot be the problem, but for loading the pages I am thinking of some threads (how many?) that make the requests and save the responses in a circular list. Can the circular list be accessed both from the download layer (for inserting) and from the parser layer (for reading)? Would that significantly increase performance? How can I increase performance? Because right now it is very slow.
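To make the idea concrete, something like this is what I have in mind for the shared list. This is only a minimal sketch using C++11 threading primitives; the name PageQueue is just illustrative, not real code from my crawler:

// Sketch of a bounded, thread-safe queue shared between downloader
// threads (producers) and the parser thread (consumer).
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

class PageQueue {
public:
    explicit PageQueue(std::size_t capacity) : capacity_(capacity) {}

    // Called by download threads: blocks while the queue is full.
    void push(std::string page) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return pages_.size() < capacity_; });
        pages_.push(std::move(page));
        not_empty_.notify_one();
    }

    // Called by the parser thread: blocks while the queue is empty.
    std::string pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !pages_.empty(); });
        std::string page = std::move(pages_.front());
        pages_.pop();
        not_full_.notify_one();
        return page;
    }

private:
    std::size_t capacity_;
    std::queue<std::string> pages_;
    std::mutex mutex_;
    std::condition_variable not_empty_, not_full_;
};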
Quote:Original post by giugio
The bottleneck is creating the request and waiting for the download of the page.


Of course.

Quote:but for loading the pages I am thinking of some threads (how many?) that make the requests and save the responses in a circular list.


Threading can't help you very much; you only have one internet connection, and any download threads will have to share it.

Try checking to make sure you don't download the same page more than once.
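For example, a small sketch of keeping a set of already-visited URLs (std::unordered_set here, but any set type works):

// Sketch: fetch each URL at most once.
#include <string>
#include <unordered_set>

std::unordered_set<std::string> visited;

// Returns true the first time a URL is seen, false afterwards.
bool should_download(const std::string& url) {
    return visited.insert(url).second; // insert fails if already present
}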

Quote:Would that significantly increase performance?
How can I increase performance? Because right now it is very slow.


Crawlers are supposed to be slow. The owner of the web server you are crawling will be annoyed if you crawl too fast ;) And some servers will automatically limit you if you try to check too many pages in a short amount of time.
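If you still want to go faster while staying polite, you could enforce a minimum delay between requests to the same host. A rough sketch, with an arbitrary one-second gap:

// Sketch: minimum per-host delay between requests.
// Not thread-safe; guard with a mutex if called from download threads.
#include <chrono>
#include <map>
#include <string>
#include <thread>

using clock_type = std::chrono::steady_clock;
std::map<std::string, clock_type::time_point> last_request;

void wait_for_host(const std::string& host) {
    const auto min_gap = std::chrono::seconds(1); // arbitrary choice
    auto it = last_request.find(host);
    if (it != last_request.end()) {
        auto next_allowed = it->second + min_gap;
        if (clock_type::now() < next_allowed)
            std::this_thread::sleep_until(next_allowed);
    }
    last_request[host] = clock_type::now();
}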
Some time ago I wrote a tool to search through Google results using regular expressions, and I came across the same problem as you. I solved it by using a thread pool and increasing the number of threads until the bandwidth limit was reached. I think I ended up using between 10 and 20 threads. So yes, simultaneous downloads are the way to go, and they can significantly increase performance.
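In outline, the pool looked something like this. It's a simplified sketch: fetch_page is a stand-in for whatever socket code you already have, and the thread count is just a starting point to tune:

// Sketch of a fixed pool of downloader threads pulling URLs
// from a shared work queue.
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Stand-in for your existing socket/HTTP code.
std::string fetch_page(const std::string& url) {
    return "<html>" + url + "</html>"; // replace with a real GET request
}

void crawl(std::queue<std::string> urls, std::vector<std::string>& pages,
           int thread_count = 16) {
    std::mutex m; // guards both 'urls' and 'pages'

    auto worker = [&] {
        for (;;) {
            std::string url;
            {
                std::lock_guard<std::mutex> lock(m);
                if (urls.empty()) return; // no more work
                url = std::move(urls.front());
                urls.pop();
            }
            std::string html = fetch_page(url); // network I/O, no lock held
            std::lock_guard<std::mutex> lock(m);
            pages.push_back(std::move(html));
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < thread_count; ++i)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
}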
Quote:Original post by giugio
The bottleneck is creating the request and waiting for the download of the page.

Run multiple copies of your application.

Otherwise, look into boost::asio, which is an asynchronous library intended for efficient networking.
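For illustration, a minimal sketch of a single asynchronous GET with Boost.Asio. A recent Boost is assumed, error handling and connection lifetime management are stripped down, and example.com is a placeholder; a real crawler would run many such connections on one io_context:

// Sketch: one asynchronous HTTP GET with Boost.Asio.
#include <boost/asio.hpp>
#include <iostream>
#include <string>

using boost::asio::ip::tcp;

int main() {
    boost::asio::io_context io;
    tcp::resolver resolver(io);
    tcp::socket socket(io);
    std::string request = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
    boost::asio::streambuf response;

    resolver.async_resolve("example.com", "80",
        [&](boost::system::error_code ec, tcp::resolver::results_type results) {
            if (ec) return;
            boost::asio::async_connect(socket, results,
                [&](boost::system::error_code ec, const tcp::endpoint&) {
                    if (ec) return;
                    boost::asio::async_write(socket, boost::asio::buffer(request),
                        [&](boost::system::error_code ec, std::size_t) {
                            if (ec) return;
                            boost::asio::async_read_until(socket, response, "\r\n\r\n",
                                [&](boost::system::error_code ec, std::size_t) {
                                    if (!ec)
                                        std::cout << &response; // headers (plus any extra bytes read)
                                });
                        });
                });
        });

    io.run(); // drives all pending asynchronous operations to completion
    return 0;
}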

Or better yet - why not use an existing application, such as wget?

