crawler c++

Started by giugio
2 comments, last by Antheus 14 years, 10 months ago
Hi. I'm writing a crawler in C++. It is composed of three layers: the first layer builds the strings with the addresses of the pages to be loaded, the second layer creates a socket and fetches the HTML page content, and the last layer parses the pages. The bottleneck is creating the request and waiting for the download of the page. I have a 4-core machine and parsing cannot be the problem, but for loading the pages I am thinking of some threads (how many?) that make the requests and save the responses in a circular list. Can the circular list be accessed both from the download layer (for inserting) and from the parser layer (for reading)? Would that significantly increase performance? How can I increase performance? Because right now it is very slow.
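To make the idea concrete, something like this is what I have in mind for the shared list. This is only a minimal sketch using C++11 threading primitives; the name PageQueue is just illustrative, not real code from my crawler:

// Sketch of a bounded, thread-safe queue shared between downloader
// threads (producers) and the parser thread (consumer).
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

class PageQueue {
public:
    explicit PageQueue(std::size_t capacity) : capacity_(capacity) {}

    // Called by download threads: blocks while the queue is full.
    void push(std::string page) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return pages_.size() < capacity_; });
        pages_.push(std::move(page));
        not_empty_.notify_one();
    }

    // Called by the parser thread: blocks while the queue is empty.
    std::string pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !pages_.empty(); });
        std::string page = std::move(pages_.front());
        pages_.pop();
        not_full_.notify_one();
        return page;
    }

private:
    std::size_t capacity_;
    std::queue<std::string> pages_;
    std::mutex mutex_;
    std::condition_variable not_empty_, not_full_;
};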
Quote:Original post by giugio
The bottleneck is creating the request and waiting for the download of the page.


Of course.

Quote:but for loading the pages I am thinking of some threads (how many?) that make the requests and save the responses in a circular list.


Threading can't help you very much; you only have one internet connection, and any download threads will have to share it.

Try checking to make sure you don't download the same page more than once.
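For example, a small sketch of keeping a set of already-visited URLs (std::unordered_set here, but any set type works):

// Sketch: fetch each URL at most once.
#include <string>
#include <unordered_set>

std::unordered_set<std::string> visited;

// Returns true the first time a URL is seen, false afterwards.
bool should_download(const std::string& url) {
    return visited.insert(url).second; // insert fails if already present
}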

Quote:Would that significantly increase performance?
How can I increase performance? Because right now it is very slow.


Crawlers are supposed to be slow. The owner of the web server you are crawling will be annoyed if you crawl too fast ;) And some servers will automatically limit you if you try to check too many pages in a short amount of time.
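If you still want to go faster while staying polite, you could enforce a minimum delay between requests to the same host. A rough sketch, with an arbitrary one-second gap:

// Sketch: minimum per-host delay between requests.
// Not thread-safe; guard with a mutex if called from download threads.
#include <chrono>
#include <map>
#include <string>
#include <thread>

using clock_type = std::chrono::steady_clock;
std::map<std::string, clock_type::time_point> last_request;

void wait_for_host(const std::string& host) {
    const auto min_gap = std::chrono::seconds(1); // arbitrary choice
    auto it = last_request.find(host);
    if (it != last_request.end()) {
        auto next_allowed = it->second + min_gap;
        if (clock_type::now() < next_allowed)
            std::this_thread::sleep_until(next_allowed);
    }
    last_request[host] = clock_type::now();
}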
Some time ago I wrote a tool to search through Google results using regular expressions, and I came across the same problem as you. I solved it by using a thread pool and increasing the number of threads until the bandwidth limit was reached. I think I ended up using between 10 and 20 threads. So yes, simultaneous downloads are the way to go, and they can significantly increase performance.
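In outline, the pool looked something like this. It's a simplified sketch: fetch_page is a stand-in for whatever socket code you already have, and the thread count is just a starting point to tune:

// Sketch of a fixed pool of downloader threads pulling URLs
// from a shared work queue.
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Stand-in for your existing socket/HTTP code.
std::string fetch_page(const std::string& url) {
    return "<html>" + url + "</html>"; // replace with a real GET request
}

void crawl(std::queue<std::string> urls, std::vector<std::string>& pages,
           int thread_count = 16) {
    std::mutex m; // guards both 'urls' and 'pages'

    auto worker = [&] {
        for (;;) {
            std::string url;
            {
                std::lock_guard<std::mutex> lock(m);
                if (urls.empty()) return; // no more work
                url = std::move(urls.front());
                urls.pop();
            }
            std::string html = fetch_page(url); // network I/O, no lock held
            std::lock_guard<std::mutex> lock(m);
            pages.push_back(std::move(html));
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < thread_count; ++i)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
}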
Quote:Original post by giugio
The bottleneck is creating the request and waiting for the download of the page.

Run multiple copies of your application.

Otherwise, look into boost::asio, which is an asynchronous library intended for efficient networking.
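For illustration, a minimal sketch of a single asynchronous GET with Boost.Asio. A recent Boost is assumed, error handling and connection lifetime management are stripped down, and example.com is a placeholder; a real crawler would run many such connections on one io_context:

// Sketch: one asynchronous HTTP GET with Boost.Asio.
#include <boost/asio.hpp>
#include <iostream>
#include <string>

using boost::asio::ip::tcp;

int main() {
    boost::asio::io_context io;
    tcp::resolver resolver(io);
    tcp::socket socket(io);
    std::string request = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
    boost::asio::streambuf response;

    resolver.async_resolve("example.com", "80",
        [&](boost::system::error_code ec, tcp::resolver::results_type results) {
            if (ec) return;
            boost::asio::async_connect(socket, results,
                [&](boost::system::error_code ec, const tcp::endpoint&) {
                    if (ec) return;
                    boost::asio::async_write(socket, boost::asio::buffer(request),
                        [&](boost::system::error_code ec, std::size_t) {
                            if (ec) return;
                            boost::asio::async_read_until(socket, response, "\r\n\r\n",
                                [&](boost::system::error_code ec, std::size_t) {
                                    if (!ec)
                                        std::cout << &response; // headers (plus any extra bytes read)
                                });
                        });
                });
        });

    io.run(); // drives all pending asynchronous operations to completion
    return 0;
}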

Or better yet - why not use an existing application, such as wget?

