[.net] Threading

7 comments, last by thedo 18 years, 7 months ago
Hi, I'm developing an app to crawl the web using C# and the HtmlAgilityPack DLL. It loads an HTML document, scans it for links, then loads all the linked documents, and this happens recursively. Initially it was all single-threaded (just to prove the concept worked), but it was only processing around 600 pages per hour. The obvious bottleneck was that it took, say, 2 seconds to download a page and then 2 seconds to process the data, and it would be far more efficient to download the next page while the current one is being processed. This led me to do a little multithreading. OK, so attempt 1 was something like this:

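//Pseudo-code
class Document
{
  public Document(string url)
  {
       // One new thread per document; nothing limits how many exist at once
       Thread thread = new Thread(() => Load(url));
       thread.Start();
  }

  public void Load(string url)
  {
    LoadDocumentFromUrl(url);
    ScanDocForLinks();
    foreach (Link link in documentLinks)
    {
        // Each link creates another Document, and so another thread
        Document doc = new Document(link.Url);
    }
  }
}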
Now this was waaaaaaaaay faster. BUT as you can imagine this spawned loads of unterminated threads, and around the 1500 mark I got an exception saying that Start could not start a new thread. So I modified the code above thus:

//Pseudo-code
class Document
{
  public Document(string url)
  {
       Thread thread = new Thread(() => Load(url));
       thread.Start();
       while (thread.IsAlive);   // spin until the load thread has finished
       foreach (Link link in documentLinks)
       {
          Document doc = new Document(link.Url);
       }
  }

  public void Load(string url)
  {
    LoadDocumentFromUrl(url);
    ScanDocForLinks();
  }
}
The difference being that each thread finishes before new ones are spawned. Obviously this will still ultimately spawn a lot of threads, but at least they will terminate, and the rate of self-destruction will be much slower. However, like this the app runs at the same speed as the single-threaded version, because the main thread spends its time waiting for each worker thread to end. After using the 1st attempt (and seeing the possible speed) I'm not really convinced I can live with the 2nd version, or even a single-threaded version, BUT I can't seem to find a way to make it work. What would be really nice would be if I could raise an event when the thread ends (I was going to try this by inheriting from the Thread class, but it's sealed). I've also tried putting an Abort() call in the load() method; the exception is raised, but the thread still seems to carry on (stepping through and monitoring the thread count in Task Manager). Does anyone have any ideas? Neil
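For later readers, here is a minimal sketch of the "event when the thread ends" idea. Since Thread is sealed, it wraps a thread rather than inheriting from it. The names NotifyingWorker and Finished are made up for illustration, and the event is raised on the worker thread itself, so any handler has to be thread-safe.

```csharp
using System;
using System.Threading;

// Hypothetical wrapper: Thread is sealed, so wrap it instead of inheriting.
public sealed class NotifyingWorker
{
    private readonly Action _work;
    private readonly Thread _thread;

    // Raised on the worker thread after the work returns normally.
    public event EventHandler Finished;

    public NotifyingWorker(Action work)
    {
        _work = work ?? throw new ArgumentNullException(nameof(work));
        _thread = new Thread(Run) { IsBackground = true };
    }

    public void Start() => _thread.Start();

    // Blocks the caller without spinning, unlike while (thread.IsAlive);
    public void Join() => _thread.Join();

    private void Run()
    {
        _work();
        Finished?.Invoke(this, EventArgs.Empty);
    }
}
```

A caller would subscribe to Finished before calling Start, and could use Join in place of the busy-wait loop in the second attempt.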
WHATCHA GONNA DO WHEN THE LARGEST ARMS IN THE WORLD RUN WILD ON YOU?!?!
I would make only two threads.

Make two queues: one for the URLs and one for the downloaded documents.

One thread takes new URLs from the URL queue, downloads them, and puts the downloaded documents in the document queue.

One thread takes new documents from the document queue and adds the URLs it finds to the URL queue.


It sounds like you get a big improvement from downloading several documents at the same time. Just keep adding threads of the first kind until you stop seeing an improvement.

Use lock( ... ) when accessing the queues to prevent errors or downloading the same document more than once.


Well, this is how I use multithreading: moving the bottleneck into its own thread so it doesn't slow down the rest.
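A rough sketch of the lock-protected URL queue from the suggestion above. The UrlQueue class and its method names are assumptions for illustration, not anything from the original code. The set of already-seen URLs is what stops the same document from being queued twice.

```csharp
using System.Collections.Generic;

// Hypothetical helper: a URL queue that several threads can share safely.
public sealed class UrlQueue
{
    private readonly object _gate = new object();
    private readonly Queue<string> _pending = new Queue<string>();
    private readonly HashSet<string> _seen = new HashSet<string>();

    // Returns false if the URL has been queued before,
    // so no page is scheduled for download twice.
    public bool TryEnqueue(string url)
    {
        lock (_gate)
        {
            if (!_seen.Add(url)) return false;
            _pending.Enqueue(url);
            return true;
        }
    }

    // Returns false when there is nothing to download right now.
    public bool TryDequeue(out string url)
    {
        lock (_gate)
        {
            if (_pending.Count == 0)
            {
                url = null;
                return false;
            }
            url = _pending.Dequeue();
            return true;
        }
    }
}
```

The document queue could be built the same way, minus the seen-set.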
OK, this seems to work a treat, BUT (and this is a limitation on my part, not on the part of the proposed solution).

In my original MT solution each thread ran for a small slice of time then died. This meant that the PC was nice and responsive when there was little to do. In this solution I do the following in each thread:

void worker_thread()
{
   while(brunning)
   {
      //Do interesting things
   }
}


This pushes the processor usage up to 100% all the time. Is there a more efficient way to accomplish what I have done? I've not done a huge amount of MT in C#, and the little I have done has been running very short threads, so this has never been an issue before.

Cheers

Neil
Throw a Thread.Sleep(0) into your loop to give some control back to the processor.
Damn, I am glad that this works for you. Otherwise I would have had to review a lot of code ;)


Sleep will solve your remaining problem. But I feel that Sleep(0) does not free up a lot of time. Don't be afraid to use Sleep(100). Most of the time the processing thread will just find an empty queue and do nothing but eat processor time.
I mean, the delays do not add up. A document will never sit waiting for much more than 100 ms, which is negligible for this purpose.


I guess you could analyze it statistically and use a smart dynamic delay time, but that is well beyond what's reasonable ;)
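A minimal sketch of the Sleep approach, assuming the worker sleeps only when it finds the queue empty. PollingWorker and its members are hypothetical names, and the "processing" here is just a counter standing in for the real work.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical worker: polls a queue and sleeps only when it is empty.
public sealed class PollingWorker
{
    private readonly object _gate = new object();
    private readonly Queue<string> _documents = new Queue<string>();
    private volatile bool _running = true;
    private int _processed;

    public int Processed => Volatile.Read(ref _processed);

    public void Add(string doc)
    {
        lock (_gate) _documents.Enqueue(doc);
    }

    public void Stop() => _running = false;

    public void Run()
    {
        while (_running)
        {
            string doc = null;
            lock (_gate)
            {
                if (_documents.Count > 0) doc = _documents.Dequeue();
            }

            if (doc == null)
            {
                // Nothing to do: give up the CPU instead of spinning.
                Thread.Sleep(100);
                continue;
            }

            // Stand-in for the real document processing.
            Interlocked.Increment(ref _processed);
        }
    }
}
```

While there is work in the queue the loop never sleeps, so the 100 ms is only paid when the worker would otherwise be idle.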
You could also just have an event that you clear when the queue is empty and set when the queue has an element in it. The worker thread would then wait on the event, which puts the thread into the wait state while the queue is empty. It also allows you to scale a bit better, because you can extend the solution to multiple queues across multiple threads easily enough. (Also, a 100 ms sleep is not guaranteed to be 100 ms; you just surrender your chunk of the CPU's time for AT LEAST that amount of time. It could be significantly longer.)
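One way that event idea might look, as a sketch only. SignalledQueue is a made-up name, and ManualResetEvent is just one of several primitives that would do the job. The event is set and reset under the same lock that guards the queue, so its state always matches whether the queue has items.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical queue whose consumers block on an event instead of polling.
public sealed class SignalledQueue<T>
{
    private readonly object _gate = new object();
    private readonly Queue<T> _items = new Queue<T>();

    // Set while the queue holds at least one item, reset while it is empty.
    private readonly ManualResetEvent _hasItems = new ManualResetEvent(false);

    public void Enqueue(T item)
    {
        lock (_gate)
        {
            _items.Enqueue(item);
            _hasItems.Set();
        }
    }

    // Blocks in the wait state until an item is available.
    public T Dequeue()
    {
        while (true)
        {
            _hasItems.WaitOne();
            lock (_gate)
            {
                // Another consumer may have taken the item first; wait again.
                if (_items.Count == 0) continue;

                T item = _items.Dequeue();
                if (_items.Count == 0) _hasItems.Reset();
                return item;
            }
        }
    }
}
```

A worker loop would then just call Dequeue repeatedly, with no Sleep needed.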

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

What about one thread pulling documents out of a queue and finding all the links, and using 'System.Threading.ThreadPool.QueueUserWorkItem(...)' to follow each link and add its results to the queue? That should work fairly well, yeah? Just use a ManualResetEvent to stop and start the worker thread depending on whether the queue has anything in it?
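A small sketch of the ThreadPool route, under the assumption that each link becomes one pool work item. PoolCrawl and FetchAll are hypothetical names, and the fetch delegate stands in for the real download step. This version blocks until every item has finished, which a real crawler would probably not want.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class PoolCrawl
{
    // Queues one pool work item per URL and blocks until all have run.
    // 'fetch' is a stand-in for the real download-and-parse step.
    public static List<string> FetchAll(IList<string> urls, Func<string, string> fetch)
    {
        var results = new List<string>();
        if (urls.Count == 0) return results;

        int remaining = urls.Count;
        using (var allDone = new ManualResetEvent(false))
        {
            foreach (string url in urls)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    string page = fetch((string)state);
                    lock (results) results.Add(page);

                    // The last work item to finish wakes the caller.
                    if (Interlocked.Decrement(ref remaining) == 0) allDone.Set();
                }, url);
            }
            allDone.WaitOne();
        }
        return results;
    }
}
```

The pool decides how many threads actually run at once, so this avoids creating a thread per link by hand.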
I think I'm going to redesign it a little again. From running it last night, the downloads were a real bottleneck. The queue of documents waiting to be processed was fully processed 99% of the time, but the list of documents waiting to be downloaded was still huge. Also, because it was only downloading one doc at a time, I don't think I was getting maximum efficiency.

I'm thinking of going back to short download threads, as one page can set off X links all downloading simultaneously (much like my 1st MT implementation, but this time they won't fire off sub-threads, so hopefully they'll eventually quit). These threads will feed the queue of downloaded documents waiting to be processed, and as they're downloading multiple docs simultaneously, the downloaded doc queue should be much more plump.

I'll report back to say if this works OK.

Neil
Well the final idea worked a treat.

The main program thread processes HTML documents put into a queue by downloader threads. I have had to put in some manual thread management, as outside of the debugger the app was running way too fast and I had 1000 downloads running simultaneously. I've pretty much expanded it as far as I can as a simple app, as the memory requirements go up to 2 GB of virtual memory after a few seconds, so I'm going to have to sit down and think about a sensible database design now.
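The post doesn't say how the manual thread management was done, so the following is only one possible sketch: a SemaphoreSlim that caps how many short-lived download threads run at once. ThrottledDownloads, Run and maxConcurrent are hypothetical names, and the download delegate stands in for the real download code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class ThrottledDownloads
{
    // Runs 'download' for each URL on its own short-lived thread, but lets
    // at most 'maxConcurrent' run at once. Returns the peak number observed.
    public static int Run(IEnumerable<string> urls, int maxConcurrent, Action<string> download)
    {
        int active = 0, peak = 0;
        var gate = new object();
        var threads = new List<Thread>();

        using (var slots = new SemaphoreSlim(maxConcurrent))
        {
            foreach (string url in urls)
            {
                slots.Wait(); // blocks the spawning loop while all slots are busy
                string current = url;
                var worker = new Thread(() =>
                {
                    try
                    {
                        lock (gate) { active++; if (active > peak) peak = active; }
                        download(current);
                    }
                    finally
                    {
                        lock (gate) active--;
                        slots.Release(); // free the slot for the next download
                    }
                });
                threads.Add(worker);
                worker.Start();
            }

            foreach (Thread thread in threads) thread.Join();
        }
        return peak;
    }
}
```

With maxConcurrent set to a few dozen rather than unlimited, the number of simultaneous downloads stays bounded however many links a page contains.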

Thanks to all who helped me along the way!

Neil