Performance

Performance evaluation of the automated subject classification component is treated in section 5. Throughput, measured as the number of URLs handled per minute, depends heavily on circumstances such as network load, machine capacity, the selection of URLs to crawl, configuration details, and the number of crawlers used. In general, within rather wide limits, you can expect the Combine system to handle up to 200 URLs per minute. Handling here covers everything from scheduling URLs, fetching pages over the network, parsing them, automated subject classification, and recycling new links, to storing the structured record in a relational database. This holds both for small, simple crawls starting from scratch and for large, complicated topic-specific crawls with millions of records.
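To get a feel for what the 200 URLs per minute figure means for planning, the arithmetic can be sketched as below. This is only a rough lower bound on crawl time, assuming the peak rate is sustained for the whole crawl, which real crawls will rarely achieve. The function name estimated_crawl_minutes and its parameters are illustrative and not part of Combine.

```python
def estimated_crawl_minutes(num_urls: int, urls_per_minute: float = 200.0) -> float:
    """Return the minimum crawl time in minutes at a sustained rate.

    The default of 200 URLs per minute is the upper figure quoted in the
    text; treating it as a constant rate is an assumption of this sketch.
    """
    if num_urls < 0:
        raise ValueError("num_urls must not be negative")
    if urls_per_minute <= 0:
        raise ValueError("urls_per_minute must be positive")
    return num_urls / urls_per_minute


if __name__ == "__main__":
    for n in (10_000, 1_000_000):
        minutes = estimated_crawl_minutes(n)
        print(f"{n} URLs: at least {minutes:.0f} min ({minutes / 60:.1f} h)")
```

At the quoted peak rate, a crawl of one million URLs would thus need at least 5000 minutes, or roughly three and a half days.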

Figure 4: Combine crawler performance.
\includegraphics[height=0.4\textheight]{CrawlerSpeed.ps}

The prime way of increasing performance is to use more than one crawler for a job. This is handled by the -harvesters switch used together with the combineCtrl start command. For example,
combineCtrl -jobname MyCrawl -harvesters 5 start
will start 5 crawlers working together on the job 'MyCrawl'. The effect of using more than one crawler on crawling speed is illustrated in figure 4.
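When crawls are started from a script, the same command line can be assembled programmatically. The following is a minimal sketch that uses only the -jobname and -harvesters switches and the start command shown above; the helper name build_start_command is hypothetical, and the sketch assumes combineCtrl is available on the PATH when the command is actually run.

```python
import shlex


def build_start_command(jobname: str, harvesters: int) -> list[str]:
    """Build the argument list for starting a job with several crawlers.

    Only the switches shown in the text (-jobname, -harvesters, start)
    are used; the helper itself is hypothetical, not part of Combine.
    """
    if harvesters < 1:
        raise ValueError("at least one harvester is required")
    return [
        "combineCtrl",
        "-jobname", jobname,
        "-harvesters", str(harvesters),
        "start",
    ]


if __name__ == "__main__":
    cmd = build_start_command("MyCrawl", 5)
    # Print the command as it would be typed in a shell.
    print(shlex.join(cmd))
    # To run it, pass the list to subprocess.run(cmd, check=True).
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a job name contains unusual characters.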

Configuration also has an effect on performance. Figure 5 shows the performance improvements obtained from configuration changes. The choice of algorithm for automated classification turns out to have the biggest influence on performance: algorithm 2 (classifyPlugIn = Combine::PosCheck_record, shown as Pos in figure 5) is much faster than algorithm 1 (classifyPlugIn = Combine::Check_record, shown as Std in figure 5). Tweaking other configuration variables also has an effect on performance, but to a lesser degree. The tweaking consisted of not using Tidy to clean HTML (useTidy = 0) and not storing the original page in the database (saveHTML = 0).
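Taken together, the speed-oriented settings discussed above would look as follows in a job's configuration. This is a sketch that simply collects the three variables named in the text. Whether the combination suits a given crawl is left to the reader, since it trades HTML cleaning and stored page copies for speed.

```
classifyPlugIn = Combine::PosCheck_record
useTidy = 0
saveHTML = 0
```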

Figure 5: Effect of configuration changes on focused crawler performance.
\includegraphics[height=0.4\textheight]{Config.ps}

root 2006-11-08