Hi,
I'm developing an app to crawl the web using C# and the HtmlAgilityPack DLL. It basically loads an HTML document, scans it for links, then loads all the linked documents, and this happens recursively.
Initially it was all single-threaded (just to prove the concept worked), but it was only processing around 600 pages per hour. The obvious bottleneck was that it took, say, 2 seconds to download a page and then 2 seconds to process the data, and it would be far more efficient to download the next page while the current one is being processed. This led me to do a little multithreading.
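To make the overlap idea concrete, here is a minimal sketch for a flat list of URLs (not the recursive crawl). The OverlapSketch class and the download/process delegates are made-up names for illustration, not part of HtmlAgilityPack or my real code, and it assumes Task.Run is available (.NET 4.5 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not part of HtmlAgilityPack.
public static class OverlapSketch
{
    // Processes pages in order while the next download runs in the background.
    public static List<string> Run(
        IList<string> urls,
        Func<string, string> download,   // assumed: blocking fetch, returns HTML
        Func<string, string> process)    // assumed: blocking parse, returns a result
    {
        var results = new List<string>();
        if (urls.Count == 0) return results;

        // Kick off the first download.
        Task<string> pending = Task.Run(() => download(urls[0]));

        for (int i = 0; i < urls.Count; i++)
        {
            // Block until the current page has arrived.
            string html = pending.Result;

            // Start downloading the next page before processing this one.
            if (i + 1 < urls.Count)
            {
                string next = urls[i + 1];
                pending = Task.Run(() => download(next));
            }

            results.Add(process(html));
        }
        return results;
    }
}
```

If downloading and processing really take about the same time, the download of page N+1 should be mostly hidden behind the processing of page N, so in the best case the time per page would be roughly halved. Only one extra download is in flight at any moment, so the thread count stays small.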
OK, so attempt 1 was something like this
// Pseudo-code
class document
{
    public document(string url)
    {
        // Every document starts its own thread
        thread = new Thread(() => load(url));
        thread.Start();
    }

    public void load(string url)
    {
        LoadDocumentFromUrl;
        ScanDocForLinks;
        foreach (link in document_links)
        {
            // Each link creates another document, and so another thread
            document doc = new document(link.url);
        }
    }
}
Now this was way faster. BUT, as you can imagine, this spawned loads of unterminated threads, and around the 1500 mark I get an exception saying that Start could not start a new thread. So I modified the code as follows:
// Pseudo-code
class document
{
    public document(string url)
    {
        thread = new Thread(() => load(url));
        thread.Start();
        // Busy-wait until the load thread has finished
        while (thread.IsAlive);
        // Only then create the linked documents
        foreach (link in document_links)
        {
            document doc = new document(link.url);
        }
    }

    public void load(string url)
    {
        LoadDocumentFromUrl;
        ScanDocForLinks;
    }
}
The difference is that each thread finishes before new ones are spawned. Obviously this will still ultimately spawn a lot of threads, but at least they terminate, and the rate of self-destruction is much slower. However, written like this the app is no faster than the single-threaded version, because the calling thread does nothing but wait for each load thread to end.
After using the first attempt (and seeing the possible speed), I'm not really convinced I can live with the second version or even a single-threaded version, but I can't seem to find a way to make it work. What would be really nice is to raise an event when the thread ends (I was going to try this by inheriting from the Thread class, but it's sealed). I've also tried putting Abort() in the load() method; the exception is raised, but the thread still seems to carry on (judging by stepping through and monitoring the thread count in Task Manager).
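Roughly what I'm after with the "event when the thread ends" idea is something like the sketch below. Since Thread is sealed, it wraps a thread instead of inheriting from it. The NotifyingWorker class and its Finished event are names I've made up for illustration; this is an untested idea, not a known-good pattern.

```csharp
using System;
using System.Threading;

// Hypothetical wrapper: Thread is sealed, so this composes rather than inherits.
public class NotifyingWorker
{
    private readonly Thread thread;

    // Raised on the worker thread once the work has returned or thrown.
    public event EventHandler Finished;

    public NotifyingWorker(Action work)
    {
        thread = new Thread(() =>
        {
            try
            {
                work();
            }
            finally
            {
                // Copy the delegate first so a late unsubscribe cannot null it.
                EventHandler handler = Finished;
                if (handler != null) handler(this, EventArgs.Empty);
            }
        });
    }

    public void Start() { thread.Start(); }

    // Blocks the caller until the worker thread has exited.
    public void Join() { thread.Join(); }
}
```

Handlers would have to be attached before calling Start, and they run on the worker thread, so anything they touch (a shared queue of links, a running-thread counter) would need its own locking.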
Does anyone have any ideas?
Neil