Open Source distribution, Installation

The focused crawler has been restructured and packaged as a Debian package to ease distribution and installation. The package contains dependency information to ensure that all software needed to run the crawler is installed at the same time. In connection with this we have also packaged a number of required Perl modules as Debian packages.

All software and packages are available from the Combine focused crawler Web-site.

Installation

This distribution is developed and tested on Linux systems. It is implemented entirely in Perl and uses the MySQL database system, both of which are supported on many other operating systems. Porting to other UNIX dialects should be easy.

The system is distributed either as source or as a Debian package.

Installation from source for the impatient

Unless you are on a system that supports Debian packages (in which case see Automated Debian/Ubuntu installation below), download and unpack the source. The following command sequence will then install Combine:
perl Makefile.PL
make
make test
make install
mkdir /etc/combine
cp conf/* /etc/combine/
mkdir /var/run/combine

Test that it all works (run as root)
./doc/InstallationTest.pl

Porting to unsupported operating systems - dependencies

In order to port the system to another platform, you have to verify that the two main systems it relies on, Perl and MySQL, are available for that platform. If they are, you stand a good chance of porting the system.

Furthermore, the external Perl modules should be verified to work on the new platform.
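
As a first sanity check on a new platform you can run something like the following; the last line uses DBI only as an illustrative example of a Perl module used for the database access:
perl -v
mysql --version
perl -MDBI -e 'print "DBI $DBI::VERSION\n"'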

Perl modules are most easily installed using the Perl CPAN automated system
(perl -MCPAN -e shell).
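
A single module can also be installed non-interactively with a one-liner; the module name below is only an illustration:
perl -MCPAN -e 'install HTML::Parser'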

A number of optional external programs will also be used if they are installed on your system.


Automated Debian/Ubuntu installation

Installing the Combine Debian package also installs all dependencies, such as MySQL and a number of required Perl modules.
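
Assuming the Combine package repository has been added to your APT sources and that the package is simply named combine (check the Combine Web-site for the exact repository line and package name), installation is a matter of:
# the package name 'combine' is an assumption; see the Combine Web-site
sudo apt-get update
sudo apt-get install combine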

Manual installation

Download the latest distribution.

Install all software that Combine depends on (see above).

Unpack the archive with tar zxf. This will create a directory named combine-XX with a number of subdirectories, including bin, Combine, doc, and conf.

'bin' contains the executable programs.

'Combine' contains the required Perl modules. They should be copied to somewhere Perl will find them, typically /usr/share/perl5/Combine/.

'conf' contains the default configuration files. Combine looks for them in /etc/combine/ so they need to be copied there.

'doc' contains documentation.
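
If you prefer to copy the files into place by hand instead of using the Makefile sequence below, the steps correspond roughly to the following sketch; the destination for the executables in bin is an assumption, the other paths are the ones named above:
cd combine-XX
mkdir -p /usr/share/perl5/Combine /etc/combine /var/run/combine
cp -r Combine/* /usr/share/perl5/Combine/
cp bin/* /usr/local/bin/    # destination for the executables is an assumption
cp conf/* /etc/combine/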

The following command sequence will install Combine:

perl Makefile.PL
make
make test
make install
mkdir /etc/combine
cp conf/* /etc/combine/
mkdir /var/run/combine

Installation test

Harvest 1 URL by doing:
sudo combineINIT --jobname aatest --topic /etc/combine/Topic_carnivor.txt 
combine --jobname aatest --harvest http://combine.it.lth.se/CombineTests/InstallationTest.html
combineExport --jobname aatest --profile dc
and verify that the output, except for dates and order, looks like
<?xml version="1.0" encoding="UTF-8"?>
<documentCollection version="1.1" xmlns:dc="http://purl.org/dc/elements/1.1/">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>text/html</dc:format>
<dc:format>text/html; charset=iso-8859-1</dc:format>
<dc:subject>Carnivorous plants</dc:subject>
<dc:subject>Drosera</dc:subject>
<dc:subject>Nepenthes</dc:subject>
<dc:title transl="yes">Installation test for Combine</dc:title>
<dc:description></dc:description>
<dc:date>2006-05-19 9:57:03</dc:date>
<dc:identifier>http://combine.it.lth.se/CombineTests/InstallationTest.html</dc:identifier>
<dc:language>en</dc:language>
</metadata>

Alternatively, run the script ./doc/InstallationTest.pl as root; it does essentially the same thing.


Getting started

A simple example work-flow for a trivial crawl job named 'aatest' might look like:

  1. Initialize database and configuration (needs root privileges)
    sudo combineINIT --jobname aatest
  2. Load some seed URLs (you can repeat this command with different URLs as many times as you wish)
    echo 'http://combine.it.lth.se/' | combineCtrl load --jobname aatest
  3. Start 2 harvesting processes
    combineCtrl start --jobname aatest --harvesters 2

  4. Let it run for some time. Status and progress can be checked using the program 'combineCtrl --jobname aatest' with various parameters.

  5. When satisfied, kill the crawlers
    combineCtrl kill --jobname aatest
  6. Export data records in the ALVIS XML format
    combineExport --jobname aatest --profile alvis

  7. If you want to schedule a recheck for all the crawled pages stored in the database
    combineCtrl reharvest --jobname aatest
  8. Go back to step 3 for continuous operation.

Once a job is initialized it is controlled using combineCtrl. Crawled data is exported using combineExport.
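
The work-flow above can also be collected into a small shell script. This is only a sketch built from the commands already shown; the seed URL, the number of harvesters, the one-hour crawl time and the output file name are arbitrary choices:
#!/bin/sh
# Sketch of the 'aatest' work-flow above; adjust job name, seed URLs and timing.
JOB=aatest
sudo combineINIT --jobname $JOB
echo 'http://combine.it.lth.se/' | combineCtrl load --jobname $JOB
combineCtrl start --jobname $JOB --harvesters 2
sleep 3600          # let the harvesters work for an hour
combineCtrl kill --jobname $JOB
# combineExport writes the records to standard output (cf. the installation
# test above), so redirect it to a file of your choice
combineExport --jobname $JOB --profile alvis > aatest-records.xml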

Detailed documentation

The latest, updated, detailed documentation is always available online.

Use scenarios

General crawling without restrictions

Same procedure as in section 2.2. This way of crawling is not recommended for the Combine system, since it will generate very large databases without any focus.

Focused crawling - domain restrictions

Create a focused database with all pages from a Web-site. In this use scenario we will crawl the Combine site and the ALVIS site. The database is to be continuously updated, i.e. all pages have to be regularly tested for changes, deleted pages should be removed from the database, and newly created pages should be added.
  1. Initialize database and configuration
    sudo combineINIT --jobname focustest

  2. Edit the configuration to provide the desired focus
    Change the <allow> part in /etc/combine/focustest/combine.cfg from
    #use either URL or HOST: (obs ':') to match regular expressions to either the
    #full URL or the HOST part of a URL.
    <allow>
    #Allow crawl of URLs or hostnames that matches these regular expressions
    HOST: .*$
    </allow>
    
    to
    #use either URL or HOST: (obs ':') to match regular expressions to either the
    #full URL or the HOST part of a URL.
    <allow>
    #Allow crawl of URLs or hostnames that matches these regular expressions
    HOST: www\.alvis\.info$
    HOST: combine\.it\.lth\.se$
    </allow>
    
    The escaping of '.' by writing '\.' is necessary since the patterns are actually Perl regular expressions. Similarly, the trailing '$' indicates that the host string must end here, so, for example, a Web server on www.alvis.info.com (if such a server exists) will not be crawled. (A quick way to test such patterns is sketched after this list.)

  3. Load seed URLs
    echo 'http://combine.it.lth.se/' | combineCtrl load --jobname focustest
    echo 'http://www.alvis.info/' | combineCtrl load --jobname focustest

  4. Start 1 harvesting process
    combineCtrl start --jobname focustest

  5. Export all data records daily in the ALVIS XML format (a sample cron entry for this step is sketched below the list)
    combineExport --jobname focustest --profile alvis
    and schedule all pages for re-harvesting
    combineCtrl reharvest --jobname focustest
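
The anchoring behaviour described in step 2 can be tested with a plain Perl one-liner outside of Combine; the host names are simply the examples discussed above:
perl -e 'for my $h (qw(www.alvis.info www.alvis.info.com)) { print "$h: ", ($h =~ /www\.alvis\.info$/ ? "allowed" : "rejected"), "\n" }'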
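
For the daily export and re-scheduling in step 5, a crontab entry along the following lines could be used. This is only a sketch: the time of day and the output path are arbitrary, and redirecting combineExport to a file relies on it writing the records to standard output, as in the installation test above.
# run from the crawler user's crontab; the output path is only an example
0 3 * * * combineExport --jobname focustest --profile alvis > /var/lib/combine/focustest.xml && combineCtrl reharvest --jobname focustest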

Focused crawling - topic specific

Create and maintain a topic-specific crawled database for the topic 'Carnivorous plants'.

  1. Create a topic definition (see section 4.5.1) in a local file named cpTopic.txt. (This can be done by copying /etc/combine/Topic_carnivor.txt, since it happens to be just that file.)

  2. Create a file named cpSeedURLs.txt with seed URLs for this topic, containing the URLs:
    http://www.sarracenia.com/faq.html
    http://dmoz.org/Home/Gardening/Plants/Carnivorous_Plants/
    http://www.omnisterra.com/bot/cp_home.cgi
    http://www.vcps.au.com/
    http://www.murevarn.se/links.html
    

  3. Initialization
    sudo combineINIT --jobname cptest --topic cpTopic.txt

    This enables topic checking and focused crawl mode by setting the configuration variable doCheckRecord = 1 and copying the topic definition file (cpTopic.txt) to
    /etc/combine/cptest/topicdefinition.txt. (A quick way to verify this is sketched after this list.)

  4. Load seed URLs
    combineCtrl load --jobname cptest < cpSeedURLs.txt

  5. Start 3 harvesting processes
    combineCtrl start --jobname cptest --harvesters 3

  6. Regularly export all data records in the ALVIS XML format
    combineExport --jobname cptest --profile alvis
    and schedule all pages for re-harvesting
    combineCtrl reharvest --jobname cptest
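
To check that step 3 really enabled topic checking, the job configuration can be inspected. This assumes that the per-job configuration lives in /etc/combine/cptest/combine.cfg, in the same way as for the focustest job above, and that the option appears literally as doCheckRecord:
grep -i doCheckRecord /etc/combine/cptest/combine.cfg
ls -l /etc/combine/cptest/topicdefinition.txt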
