The Combine system is an open, free, and highly configurable system for focused crawling of Internet resources.
Main features include:
- obeys the Robots Exclusion Protocol and behaves politely towards Web servers;
- supports focused crawls that generate topic-specific databases;
- supports configurable rules, based on regular expressions over URLs, that decide what is crawled (the URL focus filter);
- is designed to run continuously, in order to keep the topic-specific database as up to date as possible.
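The idea behind the URL focus filter can be illustrated with plain regular expressions. The sketch below is not Combine's actual configuration syntax (the patterns and host names are invented for illustration); it only shows the allow/exclude filtering principle applied to a list of candidate URLs.

```shell
# Keep URLs matching an allow pattern, then drop those matching an
# exclude pattern -- the same shape as a URL focus filter.
allow='^https?://([^/]+\.)?example\.org/'   # crawl only this site (example)
exclude='\.(jpg|png|gif|zip)$'              # skip binary resources (example)
printf '%s\n' \
  'http://www.example.org/papers/index.html' \
  'http://www.example.org/logo.png' \
  'http://other.net/page.html' \
| grep -E "$allow" | grep -Ev "$exclude"
# prints: http://www.example.org/papers/index.html
```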
The operation of Combine (overview in figure 1) as a focused crawler is based on a combination of a general Web crawler and an automated subject classifier. The topic focus is provided by a focus filter using a topic definition implemented as a thesaurus, where each term is connected to a topic class.
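The term-to-class mechanism can be sketched in a few lines of shell. This is a toy illustration only (the terms, the page text, and the cutoff are invented, and Combine's actual classifier is more elaborate): occurrences of thesaurus terms connected to a topic class are counted, and pages scoring above a cutoff are treated as on-topic.

```shell
# Toy topic focus: count matches of thesaurus terms in a page's text.
page='Carnivorous plants such as sundew and Venus flytrap trap insects.'
score=$(printf '%s\n' "$page" \
  | grep -oiE 'carnivorous plant|sundew|venus flytrap' \
  | wc -l | tr -d ' ')                 # one output line per term match
[ "$score" -ge 2 ] && echo on-topic || echo off-topic
# prints: on-topic
```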
Crawled data are stored as structured records in a local relational database.
Section 2 outlines how to download, install and test the Combine system and includes use scenarios.
Section 3 discusses the configuration structure and highlights a few important configuration variables.
Section 4 describes policies and methods used by the crawler.
The system consists of a number of components. The main user-visible ones are combineCtrl, which is used to start and stop crawling and to view crawler status, and combineExport, which extracts crawled data from the internal database and exports it as XML records.
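A typical session might look like the sketch below. The job name 'mytopic' and the exact actions and options are assumptions, so consult the combineCtrl and combineExport documentation for the options supported by your installation; the commands are echoed through a small wrapper so the sketch can be run safely even without Combine installed (drop the wrapper to run them for real).

```shell
# Dry-run wrapper: prints each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

run combineCtrl start --jobname mytopic          # start crawler processes
run combineCtrl kill --jobname mytopic           # stop the crawl
run combineExport --jobname mytopic              # export records as XML
```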
Further details (lots and lots of them) can be found in 'Gory details' and in the Appendix.
root 2006-11-08