The focused crawler has been restructured and packaged as a Debian package in order to ease distribution and installation. The package contains dependency information to ensure that all software needed to run the crawler is installed at the same time. In connection with this we have also packaged a number of necessary Perl modules as Debian packages.
All software and packages are available from a number of places:
In addition to the distribution sites there is a public discussion list at SourceForge.
This distribution is developed and tested on Linux systems. It is implemented entirely in Perl and uses the MySQL database system, both of which are supported on many other operating systems. Porting to other UNIX dialects should be easy.
The system is distributed either as source or as a Debian package.
Unless you are on a system supporting Debian packages (in which case look at Automated installation (section 2.1.3)), you should download and unpack the source. The following command sequence will then install Combine:
perl Makefile.PL
make
make test
make install
mkdir /etc/combine
cp conf/* /etc/combine/
mkdir /var/run/combine
Test that it all works (run as root)
./doc/InstallationTest.pl
To port the system to another platform, you first have to verify that the two main systems it depends on (Perl and the MySQL database) are available for that platform. If they are, you stand a good chance of porting the system successfully.
Furthermore, the external Perl modules (listed in section 10.3) should be verified to work on the new platform.
Perl modules are most easily installed using the automated CPAN system (perl -MCPAN -e shell).
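For example, a single module can be installed non-interactively from the command line; the module name used here (DBD::mysql, the Perl MySQL driver) is only an illustration:

perl -MCPAN -e 'install DBD::mysql'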
The following external programs are optional and will be used if they are installed on your system:
This also installs all dependencies, such as MySQL and a number of necessary Perl modules.
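A typical automated installation, assuming the Debian package is named combine and the project's repository has been added to your APT sources (see the distribution sites above for exact instructions), might then be as simple as:

apt-get update
apt-get install combine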
Download the latest distribution.
Install all software that Combine depends on (see above).
Unpack the archive with tar zxf combine-XX.tar.gz. This will create a directory named combine-XX with a number of subdirectories, including bin, Combine, doc, and conf.
’bin’ contains the executable programs.
’Combine’ contains needed Perl modules. They should be copied to where Perl will find them, typically /usr/share/perl5/Combine/.
’conf’ contains the default configuration files. Combine looks for them in /etc/combine/ so they need to be copied there.
’doc’ contains documentation.
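The command sequence below performs these copies for you, but for reference a purely manual installation (run as root from inside the combine-XX directory) would look roughly like this; the target directory for the executables is an assumption:

mkdir /etc/combine
cp -r Combine /usr/share/perl5/    # Perl modules
cp bin/* /usr/local/bin/           # executable programs (assumed target directory)
cp conf/* /etc/combine/            # default configuration files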
The following command sequence will install Combine:
perl Makefile.PL
make
make test
make install
mkdir /etc/combine
cp conf/* /etc/combine/
mkdir /var/run/combine
A simple way to test your newly installed Combine system is to crawl just one Web page and export it as an XML document. This exercises much of the code and verifies that basic focused crawling works.
sudo combineINIT --jobname aatest --topic /etc/combine/Topic_carnivor.txt
combine --jobname aatest --harvest http://combine.it.lth.se/CombineTests/InstallationTest.html
combineExport --jobname aatest --profile dc
The last command should produce an XML document similar to:
<?xml version="1.0" encoding="UTF-8"?>
<documentCollection version="1.1" xmlns:dc="http://purl.org/dc/elements/1.1/">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:format>text/html</dc:format>
  <dc:format>text/html; charset=iso-8859-1</dc:format>
  <dc:subject>Carnivorous plants</dc:subject>
  <dc:subject>Drosera</dc:subject>
  <dc:subject>Nepenthes</dc:subject>
  <dc:title transl="yes">Installation test for Combine</dc:title>
  <dc:description></dc:description>
  <dc:date>2006-05-19 9:57:03</dc:date>
  <dc:identifier>http://combine.it.lth.se/CombineTests/InstallationTest.html</dc:identifier>
  <dc:language>en</dc:language>
</metadata>
Alternatively, run the script ./doc/InstallationTest.pl as root (see A.1 in the Appendix), which does essentially the same thing.
A simple example workflow for a trivial crawl job named ’aatest’ might look like:
Once a job is initialized it is controlled using combineCtrl. Crawled data is exported using combineExport.
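As a sketch, a complete cycle for the ’aatest’ job could look as follows. The combineCtrl actions and the seed file are assumptions used for illustration; check the combineCtrl documentation for the authoritative syntax.

combineINIT --jobname aatest --topic /etc/combine/Topic_carnivor.txt
combineCtrl load --jobname aatest < seedURLs.txt    # queue seed URLs (seedURLs.txt is a hypothetical file, one URL per line)
combineCtrl start --jobname aatest                  # start crawling
combineCtrl kill --jobname aatest                   # stop the crawler processes when done
combineExport --jobname aatest --profile dc > aatest.xml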
The latest and most detailed documentation is always available online.
Use the same procedure as in section 2.2. This way of crawling is not recommended for the Combine system, since it will generate very large databases without any focus.
Create a focused database with all pages from a Web-site. In this use scenario we will crawl the Combine site and the ALVIS site. The database is to be continuously updated, i.e. all pages have to be regularly tested for changes, deleted pages should be removed from the database, and newly created pages added.
#use either URL or HOST: (note the ’:’) to match regular expressions to either the
#full URL or the HOST part of a URL.
<allow>
#Allow crawl of URLs or hostnames that matches these regular expressions
HOST: .*$
</allow>
to
#use either URL or HOST: (note the ’:’) to match regular expressions to either the
#full URL or the HOST part of a URL.
<allow>
#Allow crawl of URLs or hostnames that matches these regular expressions
HOST: www\.alvis\.info$
HOST: combine\.it\.lth\.se$
</allow>
The escaping of ’.’ by writing ’\.’ is necessary since the patterns actually are Perl regular expressions. Similarly the ending ’$’ indicates that the host string should end here, so for example a Web server on www.alvis.info.com (if such exists) will not be crawled.
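Under the same assumptions about combineCtrl as in the earlier sketch, setting up and starting this job (here called ’alvistest’, a name chosen only for the example) might look like:

combineINIT --jobname alvistest
# edit the job configuration under /etc/combine/alvistest/ as shown above
combineCtrl load --jobname alvistest < seedURLs.txt   # seeds: http://www.alvis.info/ and http://combine.it.lth.se/
combineCtrl start --jobname alvistest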
Create and maintain a topic specific crawled database for the topic ’Carnivorous plants’.
http://www.sarracenia.com/faq.html
http://dmoz.org/Home/Gardening/Plants/Carnivorous_Plants/
http://www.omnisterra.com/bot/cp_home.cgi
http://www.vcps.au.com/
http://www.murevarn.se/links.html
This enables topic checking and focused crawl mode by setting the configuration variable doCheckRecord = 1 and copying a topic definition file (cpTopic.txt) to /etc/combine/cptest/topicdefinition.txt.
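In analogy with the installation test above, initializing the job with a topic definition should take care of both of these steps; a minimal sketch, assuming cpTopic.txt is in the current directory:

sudo combineINIT --jobname cptest --topic cpTopic.txt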
Running this crawler for an extended period will result in more than 200 000 records.
Use the same procedure as in section 2.4.3 (Focused crawling – topic specific) except for the last point. Exporting should be done incrementally into an Alvis pipeline (in this example listening at port 3333 on the machine nlp.alvis.info):
combineExport --jobname cptest --pipehost nlp.alvis.info --pipeport 3333 --incremental
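Since the export is incremental, it presumably only transfers records that are new or changed since the last run, so it can be scheduled to run regularly; a cron entry along these lines (schedule and file location are assumptions) would keep the pipeline fed:

# /etc/cron.d/combine-export (hypothetical): export once per hour
0 * * * *  root  combineExport --jobname cptest --pipehost nlp.alvis.info --pipeport 3333 --incremental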
This scenario requires the crawler to:
That is, all of http://my.targetsite.com/*, plus any other URL that is linked to from a page in http://my.targetsite.com/*.
#use either URL or HOST: (note the ’:’) to match regular expressions to either the
#full URL or the HOST part of a URL.
<allow>
#Allow crawl of URLs or hostnames that matches these regular expressions
HOST: my\.targetsite\.com$
</allow>
<allow>
#Allow crawl of URLs or hostnames that matches these regular expressions
HOST: .*$
</allow>
and maybe (depending on your other requirements) change:
#User agent handles redirects (1) or treat redirects as new links (0)
UserAgentFollowRedirects = 0
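Putting the pieces together, a rough two-phase sketch of this scenario (the job name ’targettest’ is chosen for the example, and the combineCtrl actions, including recyclelinks for re-queuing links that have been found but not yet crawled, are assumptions):

# Phase 1: with <allow> restricted to my\.targetsite\.com$, crawl the whole site
combineINIT --jobname targettest
combineCtrl load --jobname targettest < seedURLs.txt   # seed: http://my.targetsite.com/
combineCtrl start --jobname targettest
# Phase 2: widen <allow> to HOST: .*$ as shown above, then queue the links found
# during phase 1 and crawl them (one step away from the site); depending on your
# configuration you may also need to stop further automatic recycling of new links
combineCtrl recyclelinks --jobname targettest
combineCtrl start --jobname targettest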