Configuration

Configuration files use a simple format consisting of either name/value pairs or complex variables in sections. Name/value pairs are encoded as single lines formatted like 'name = value'. Complex variables are encoded as multiple lines in named sections delimited as in XML, using '<name> ... </name>'. Sections may be nested to group related configuration variables. Empty lines and lines starting with '#' (comments) are ignored.
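To make the format concrete, the following is a minimal sketch of a reader for it, written in Python. It is not Combine's own parser. The sample text, the pattern values, and the '_lines' key used to hold section lines that are not name/value pairs are assumptions made for illustration; only the 'name = value' form, the '<name> ... </name>' sections, nesting, and the handling of comments and empty lines come from the description above.

```python
import re

# A name/value pair: a simple name, '=', then the rest of the line.
PAIR = re.compile(r"^([\w-]+)\s*=\s*(.*)$")


def parse_config(text):
    """Parse 'name = value' lines and nested '<name> ... </name>' sections."""
    root = {}
    stack = [(None, root)]  # (section name, dict being filled)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comments are ignored
        current = stack[-1][1]
        if line.startswith("</") and line.endswith(">"):
            name = line[2:-1].strip()
            if stack[-1][0] != name:
                raise ValueError("unexpected closing tag: " + line)
            stack.pop()
        elif line.startswith("<") and line.endswith(">"):
            name = line[1:-1].strip()
            child = current.setdefault(name, {})
            stack.append((name, child))
        else:
            match = PAIR.match(line)
            if match:
                current[match.group(1)] = match.group(2)
            else:
                # Assumption: other lines in a section are kept verbatim.
                current.setdefault("_lines", []).append(line)
    if len(stack) != 1:
        raise ValueError("unclosed section: " + stack[-1][0])
    return root


# Hypothetical sample; the address and patterns are illustrative only.
SAMPLE = r"""
# comment lines and empty lines are skipped
Operator-Email = crawler-admin@example.org

<url>
  <allow>
    example\.org/
  </allow>
  <exclude>
    \.(gif|jpg)$
  </exclude>
</url>
"""

config = parse_config(SAMPLE)
```

With this sample, config["Operator-Email"] holds the address and config["url"]["allow"]["_lines"] holds the single allow pattern.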

The most important configuration variables are the complex variables <url><allow> (allows certain URLs to be harvested) and <url><exclude> (excludes certain URLs from harvesting). Together they limit your crawl to just a section of the WWW, based on the URL. When URLs to be crawled are loaded into the system, each URL is first checked against the Perl regular expressions of <url><allow>. If it matches, it is then checked against <url><exclude>: a URL that matches there is discarded, otherwise it is scheduled for crawling (see 'URL filtering').
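The two-stage check can be sketched as follows. This is an illustration in Python, not Combine's implementation: it uses Python's re module, whose syntax differs from Perl regular expressions in some details, and it assumes that a URL matching no allow pattern is not scheduled. The function name and the patterns are hypothetical.

```python
import re


def should_crawl(url, allow_patterns, exclude_patterns):
    """Return True if the URL passes the allow check and then the exclude check."""
    # Stage 1: the URL must match at least one allow pattern.
    if not any(re.search(p, url) for p in allow_patterns):
        return False
    # Stage 2: a URL matching any exclude pattern is discarded.
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    return True


# Hypothetical patterns restricting a crawl to one site, minus images.
ALLOW = [r"^https?://www\.example\.org/"]
EXCLUDE = [r"\.(gif|jpe?g|png)$", r"/private/"]

candidates = [
    "http://www.example.org/docs/index.html",
    "http://www.example.org/logo.png",
    "http://other.example.com/",
]
scheduled = [u for u in candidates if should_crawl(u, ALLOW, EXCLUDE)]
```

Only the first candidate ends up in scheduled: the second is removed by the exclude patterns, and the third never passes the allow check.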

Configuration files

All configuration files are stored in the /etc/combine/ directory tree. All configuration variables have reasonable defaults (section 9).

Templates

job_default.cfg
contains job-specific defaults. It is copied to a subdirectory named after the job by combineINIT.

SQLstruct.sql
contains the structure of the internal SQL database used both for administration and for holding data records.

Topic_*
contains various contributed topic definitions.

Global configuration files

These files hold global parameters for all crawler jobs.

default.cfg
contains the global defaults. It is loaded first. Consult 'Configuration variables' and 'Default configuration files' for details. Values can be overridden from the job-specific configuration file combine.cfg.

tidy.cfg
contains the configuration for Tidy cleaning of HTML code.

Job specific configuration files

The program combineINIT creates a job-specific subdirectory in /etc/combine and populates it with some files, including combine.cfg, which is initialized with a copy of job_default.cfg. You should always change the value of the variable Operator-Email in this file and set it to something reasonable. It is used by Combine to identify you to the crawled Web servers.

The job name has to be given to all programs when they are started, using the switch.

combine.cfg
is the job-specific configuration. It is loaded second and overrides the global defaults. Consult 'Configuration variables' and 'Default configuration files' for details.

topicdefinition.txt
contains the topic definition for a focused crawl if the switch is given to combineINIT. The format of this file is described in 'Topic definition'.

stopwords.txt
a file with words to be excluded from the automatic topic classification processing, one word per line. It can be empty (the default) but must be present.

config_exclude
contains more exclude patterns. Optional, automatically included by combine.cfg. Updated by combineUtil.

config_serveralias
contains patterns for resolving Web server aliases. Optional, automatically included by combine.cfg. Updated by combineUtil.

sitesOK.txt
optionally used by the built-in automated classification algorithms to bypass the topic filter for certain sites.
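
The load order described above (default.cfg first, then combine.cfg overriding it) can be pictured with a small Python sketch. It only models flat name/value pairs and is not how Combine itself merges its files. Apart from Operator-Email, the variable names and all values are hypothetical.

```python
from collections import ChainMap

# Stand-in for values read from the global default.cfg (loaded first).
# 'Example-Limit' is a made-up variable name used only for illustration.
global_defaults = {
    "Operator-Email": "unset@example.invalid",
    "Example-Limit": "10",
}

# Stand-in for values read from the job's combine.cfg (loaded second).
job_config = {
    "Operator-Email": "crawler-admin@example.org",
}

# Lookups consult the job-specific values first, then the global defaults.
settings = ChainMap(job_config, global_defaults)

operator = settings["Operator-Email"]   # taken from the job configuration
limit = settings["Example-Limit"]       # falls back to the global default
```

A variable set in the job configuration wins, and anything the job leaves unset falls through to the global default.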

Details and default values

Further details are found in 'Configuration variables' which lists all variables and their default values.
root 2008-10-02