giugio

crawler c++



Hi. I'm still building a crawler. It is composed of three layers: the first layer builds the address strings of the pages to load, the second creates a socket and fetches the HTML page content, and the last parses the pages. The bottleneck is creating the request and waiting for the page to download. I have a 4-core machine, so parsing should not be a problem. For loading pages, I am thinking of using several threads (how many?) that send the requests and save the responses in a circular list. Can the circular list be accessed both from the download layer (for inserts) and from the parser layer (for reads)? Would that give a significant performance increase? How else can I increase performance? Right now it is very slow.

Quote:
Original post by giugio
The bottleneck is creating the request and waiting for the page to download.


Of course.

Quote:
For loading pages, I am thinking of using several threads (how many?) that send the requests and save the responses in a circular list.


Threading can't help you very much; you only have one internet connection, and any download threads will have to share it.

Try checking to make sure you don't download the same page more than once.
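
One way to avoid repeat downloads is to keep a set of the addresses already scheduled and check it before each request. The sketch below is only an illustration: the `VisitedSet` class and its `mark_if_new` method are hypothetical names, and stripping the `#fragment` is the only normalization it attempts. A real crawler would probably need more (letter case of the host, trailing slashes, and so on).

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical helper: remembers which URLs have already been scheduled.
// The mutex makes it safe to call from several download threads.
class VisitedSet {
public:
    // Returns true only the first time a given URL is seen.
    bool mark_if_new(const std::string& url) {
        std::string key = strip_fragment(url);
        std::lock_guard<std::mutex> lock(mutex_);
        return seen_.insert(key).second;
    }

private:
    // "page.html#top" and "page.html" name the same document, so the
    // fragment is dropped before the lookup.
    static std::string strip_fragment(const std::string& url) {
        return url.substr(0, url.find('#'));
    }

    std::mutex mutex_;
    std::unordered_set<std::string> seen_;
};
```

The first layer would call `mark_if_new(url)` and only hand the address to the download layer when it returns true.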

Quote:
Would that give a significant performance increase?
How else can I increase performance? Right now it is very slow.


Crawlers are supposed to be slow. The owner of the web server you are crawling will be annoyed if you crawl too fast ;) And some servers will automatically limit you if you try to check too many pages in a short amount of time.
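
A crawler can slow itself down on purpose by enforcing a minimum gap between two requests to the same host. The following is a rough sketch under assumptions: `PolitenessTimer` and `reserve` are made-up names, and a suitable gap depends on the server and is not something this thread specifies.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Hypothetical helper: enforces a minimum gap between requests to one host.
// Not thread-safe; guard it with a mutex if several threads share it.
class PolitenessTimer {
public:
    using Clock = std::chrono::steady_clock;

    explicit PolitenessTimer(std::chrono::milliseconds min_gap)
        : min_gap_(min_gap) {}

    // Returns how long the caller should sleep before contacting `host`
    // and books the slot, so the next caller is pushed further back.
    std::chrono::milliseconds reserve(const std::string& host,
                                      Clock::time_point now = Clock::now()) {
        Clock::time_point send_at = now;
        auto it = next_allowed_.find(host);
        if (it != next_allowed_.end() && it->second > now) {
            send_at = it->second;  // host was contacted too recently
        }
        next_allowed_[host] = send_at + min_gap_;
        return std::chrono::duration_cast<std::chrono::milliseconds>(send_at - now);
    }

private:
    std::chrono::milliseconds min_gap_;
    std::unordered_map<std::string, Clock::time_point> next_allowed_;
};
```

A download thread would call `std::this_thread::sleep_for(timer.reserve(host))` before opening the socket. Requests to different hosts are not delayed by each other.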

Some time ago I wrote a tool to search through Google results using regular expressions, and I came across the same problem as you. I solved it by using a thread pool and increasing the number of threads until the bandwidth limit was reached. I think I ended up using between 10 and 20 threads. So yes, simultaneous downloads are the way to go, and they can significantly increase performance.
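
A minimal sketch of that arrangement, using only the C++11 standard library, might look like the code below. It is not the poster's tool. `BoundedQueue`, `download_all`, and `FetchFn` are hypothetical names, the capacity of 16 is arbitrary, and `fetch` stands in for the real socket code. The bounded queue plays the role of the circular list from the original question: download threads insert, the parser side reads, and a mutex with two condition variables keeps both sides safe.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Bounded, thread-safe FIFO shared by the download and parser layers.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(T item) {  // blocks while the queue is full
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(std::move(item));
        not_empty_.notify_one();
    }

    T pop() {  // blocks while the queue is empty
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        not_full_.notify_one();
        return item;
    }

private:
    std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
    std::queue<T> items_;
};

// Placeholder for the real download code (assumed not to throw).
using FetchFn = std::function<std::string(const std::string&)>;

// Fetches every URL with `thread_count` workers and returns the page
// bodies in completion order, which may differ from the input order.
std::vector<std::string> download_all(const std::vector<std::string>& urls,
                                      std::size_t thread_count,
                                      const FetchFn& fetch) {
    if (thread_count == 0) thread_count = 1;
    BoundedQueue<std::string> responses(16);
    std::mutex next_mutex;
    std::size_t next = 0;  // index of the next URL nobody has claimed yet

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < thread_count; ++i) {
        workers.emplace_back([&] {
            for (;;) {
                std::size_t index;
                {
                    std::lock_guard<std::mutex> lock(next_mutex);
                    if (next == urls.size()) return;  // nothing left to claim
                    index = next++;
                }
                responses.push(fetch(urls[index]));
            }
        });
    }

    // The calling thread acts as the parser layer and drains the queue.
    std::vector<std::string> bodies;
    for (std::size_t i = 0; i < urls.size(); ++i) {
        bodies.push_back(responses.pop());
    }
    for (auto& worker : workers) worker.join();
    return bodies;
}
```

The thread count is a parameter so it can be raised step by step until throughput stops improving, as described above. Because the workers spend most of their time waiting on the network, the count can reasonably exceed the number of CPU cores.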

Quote:
Original post by giugio
The bottleneck is creating the request and waiting for the page to download.

Run multiple copies of your application.

Otherwise, look into boost::asio, which is an asynchronous library intended for efficient networking.

Or better yet - why not use an existing application, such as wget?
