The following simple script is available in the file doc/InstallationTest.pl. It must be run as 'root' and tests that the basic functions of the Combine installation work.
Basically, it creates and initializes a new jobname, crawls one specific test page, and exports it as XML. This XML is then compared to a known-good XML record for that page.
use strict;

if ( $> != 0 ) { die("You have to run this test as root"); }

my $orec='';
while (<DATA>) { chop; $orec .= $_; }
$orec =~ s|<checkedDate>.*</checkedDate>||;
$orec =~ tr/\n\t //d;
my $olen=length($orec);
my $onodes=0;
while ( $orec =~ m/</g ) { $onodes++; }
print "ORIG Nodes=$onodes; Len=$olen\n";

our $jobname;
require './t/defs.pm';

system("combineINIT --jobname $jobname --topic /etc/combine/Topic_carnivor.txt >& /dev/null");
system("combine --jobname $jobname --harvest http://combine.it.lth.se/CombineTests/InstallationTest.html");

open(REC,"combineExport --jobname $jobname |");
my $rec='';
while (<REC>) { chop; $rec .= $_; }
close(REC);
$rec =~ s|<checkedDate>.*</checkedDate>||;
$rec =~ tr/\n\t //d;
my $len=length($rec);
my $nodes=0;
while ( $rec =~ m/</g ) { $nodes++; }
print "NEW Nodes=$nodes; Len=$len\n";

my $OK=0;
if ($onodes == $nodes) { print "Number of XML nodes match\n"; }
else { print "Number of XML nodes does NOT match\n"; $OK=1; }

if ($olen == $len) { print "Size of XML match\n"; }
else {
  $orec =~ s|<originalDocument.*</originalDocument>||s;
  $rec =~ s|<originalDocument.*</originalDocument>||s;
  if (length($orec) == length($rec)) { print "Size of XML match (after removal of 'originalDocument')\n"; }
  else { print "Size of XML does NOT match\n"; $OK=1; }
}

if (($OK == 0) && ($orec eq $rec)) { print "All tests OK\n"; }
else { print "There might be some problem with your Combine Installation\n"; }

__END__
<?xml version="1.0" encoding="UTF-8"?>
<documentCollection version="1.1" xmlns="http://alvis.info/enriched/">
 <documentRecord id="17E4E1138D15C3C19866D0C563E7F35F">
  <acquisition>
   <acquisitionData>
    <modifiedDate>2006-05-19 9:57:03</modifiedDate>
    <checkedDate>2006-10-03 9:06:42</checkedDate>
    <httpServer>Apache/1.3.29 (Debian GNU/Linux) PHP/4.3.3</httpServer>
    <urls>
     <url>http://combine.it.lth.se/CombineTests/InstallationTest.html</url>
    </urls>
   </acquisitionData>
   <originalDocument mimeType="text/html" compression="gzip" encoding="base64" charSet="UTF-8">
H4sIAAAAAAAAA4WQsU7DMBCG9zzF4bmpBV2QcDKQVKJSKR2CEKObXBSrjm3sSyFvT0yCQGJgusG/
//u+E1flU1G9HrfwUD3u4fh8v98VwFLOXzYF52VVzg+b9Q3n2wPLE9FRr+NA2UyDFGnMdyaQ1FqS
sgYIA0FrPRS2PymDgs+hRPRIEozsMWNnHN+tbwKD2hpCQxkrpDfqYr0dAjgtDYUVlN4G9HIFB3RT
qMPAvns6Ipfi26Au09e5I61Gh78aCT+IR947qDvpA1I2UJvexg6+CJxsM0ad6/8kpkQiXB5XSWUC
BNsj/GGG4LBWrarhSw+0OiOIidZjmzGPeh15WL6ICS7zFUjT/AiuBXeRbwHj870/AeRYaTupAQAA
   </originalDocument>
   <canonicalDocument>
    <section>
     <section title="Installation test for Combine">
      <section>Installation test for Combine</section>
      <section>Contains some Carnivorous plant specific words like <ulink url="rel.html">Drosera </ulink>, and Nepenthes.</section>
     </section>
    </section>
   </canonicalDocument>
   <metaData>
    <meta name="title">Installation test for Combine</meta>
    <meta name="dc:format">text/html</meta>
    <meta name="dc:format">text/html; charset=iso-8859-1</meta>
    <meta name="dc:subject">Carnivorous plants</meta>
    <meta name="dc:subject">Drosera</meta>
    <meta name="dc:subject">Nepenthes</meta>
   </metaData>
   <links>
    <outlinks>
     <link type="a">
      <anchorText>Drosera</anchorText>
      <location>http://combine.it.lth.se/CombineTests/rel.html</location>
     </link>
    </outlinks>
   </links>
   <analysis>
    <property name="topLevelDomain">se</property>
    <property name="univ">1</property>
    <property name="language">en</property>
    <topic absoluteScore="1000" relativeScore="110526">
     <class>ALL</class>
    </topic>
    <topic absoluteScore="375" relativeScore="41447">
     <class>CP.Drosera</class>
     <terms>drosera</terms>
    </topic>
    <topic absoluteScore="375" relativeScore="41447">
     <class>CP.Nepenthes</class>
     <terms>nepenthe</terms>
    </topic>
    <topic absoluteScore="250" relativeScore="27632">
     <class>CP</class>
     <terms>carnivorous plant</terms>
     <terms>carnivor</terms>
    </topic>
   </analysis>
  </acquisition>
 </documentRecord>
</documentCollection>
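To run the test, invoke the script as root from the top of the Combine distribution so that the require of './t/defs.pm' resolves; the exact invocation below is a sketch based on the file location given above:

  perl doc/InstallationTest.pl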
This example gives more details on how to write a topic filter Plug-In.
#Template for writing a classify PlugIn for Combine
#See documentation at http://combine.it.lth.se/documentation/

package classifyPlugInTemplate;  #Change to your own module name

use Combine::XWI;     #Mandatory
use Combine::Config;  #Optional if you want to use the Combine configuration system

#API:
#  a subroutine named 'classify' taking a XWI-object as input parameter
#  return values: 0/1
#    0: record fails to meet the classification criteria, ie ignore this record
#    1: record is OK and should be stored in the database, and links followed by the crawler

sub classify {
  my ($self,$xwi) = @_;

  #utility routines to extract information from the XWI-object

  #URL (can be several):
  #  $xwi->url_rewind;
  #  my $url_str="";
  #  my $t;
  #  while ($t = $xwi->url_get) { $url_str .= $t . ", "; }

  #Metadata:
  #  $xwi->meta_rewind;
  #  my ($name,$content);
  #  while (1) {
  #    ($name,$content) = $xwi->meta_get;
  #    last unless $name;
  #    next if ($name eq 'Rsummary');
  #    next if ($name =~ /^autoclass/);
  #    $meta .= $content . " ";
  #  }

  #Title:
  #  $title = $xwi->title;

  #Headings:
  #  $xwi->heading_rewind;
  #  my $this;
  #  while (1) {
  #    $this = $xwi->heading_get or last;
  #    $head .= $this . " ";
  #  }

  #Text:
  #  $this = $xwi->text;
  #  if ($this) {
  #    $text = $$this;
  #  }

  ###############################
  #Apply your classification algorithm here
  #  assign $result a value (0/1)
  ###############################

  #utility routines for saving detailed results (optional) in the database.
  #These data may appear in exported XML-records

  #Topic takes 5 parameters
  #  $xwi->topic_add(topic_class_notation, topic_absolute_score, topic_normalized_score, topic_terms, algorithm_id);
  #  topic_class_notation, topic_terms, and algorithm_id are strings
  #    max length topic_class_notation: 50, algorithm_id: 25
  #  topic_absolute_score and topic_normalized_score are integers
  #  topic_normalized_score and topic_terms are optional and may be replaced with 0, '' respectively

  #Analysis takes 2 parameters
  #  $xwi->robot_add(name,value);
  #  both are strings with max length name: 15, value: 20

  #return true (1) if you want to keep the record
  #otherwise return false (0)
  return $result;
}
1;
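A minimal concrete plug-in derived from this template might keep only pages that mention a given keyword. The following is a sketch, not part of the Combine distribution; it uses only the XWI accessors and the topic_add call shown in the template, and the module name, class notation and algorithm id are made up for illustration:

  package classifyKeywordExample;   #hypothetical example module

  use Combine::XWI;   #Mandatory, as in the template

  sub classify {
      my ($self,$xwi) = @_;

      #Collect the title and the plain text using the accessors shown in the template
      my $text = '';
      $text .= $xwi->title . ' ' if defined($xwi->title);
      my $t = $xwi->text;
      $text .= $$t if $t;

      #Keep only pages that mention 'drosera' (case-insensitive keyword check)
      if ($text =~ /drosera/i) {
          #Optionally record a topic score; notation and algorithm id are invented for this sketch
          $xwi->topic_add('CP.Drosera', 1000, 0, 'drosera', 'KeywordExample');
          return 1;   #keep the record and follow its links
      }
      return 0;       #ignore the record
  }
  1;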
#@#Default configuration values Combine system

#Direct connection to Zebra indexing - for SearchEngine-in-a-box (default no connection)
#@#ZebraHost = NoDefaultValue
ZebraHost =

#Use a proxy server if this is defined (default no proxy)
#@#httpProxy = NoDefaultValue
httpProxy =

#Enable(1)/disable(0) automatic recycling of new links
AutoRecycleLinks = 1

#User agent handles redirects (1) or treats redirects as new links (0)
UserAgentFollowRedirects = 0

#Number of pages to process before restarting the harvester
HarvesterMaxMissions = 500

#Logging level (0 (least) - 10 (most))
Loglev = 10

#Enable(1)/disable(0) analysis of genre, language
doAnalyse = 1

#How long the summary should be. Use 0 to disable the summarization code
SummaryLength = 0

#Store(1)/do not store(0) the raw HTML in the database
saveHTML = 1

#Use(1)/do not use(0) Tidy to clean the HTML before parsing it
useTidy = 1

#Use(1)/do not use(0) OAI record status keeping in SQL database
doOAI = 1

#Extract(1)/do not extract(0) links from plain text
extractLinksFromText = 1

#Enable(1)/disable(0) topic classification (focused crawling)
#Generated by combineINIT based on --topic parameter
doCheckRecord = 0

#Which topic classification PlugIn module algorithm to use
#Combine::Check_record and Combine::PosCheck_record included by default
#see classifyPlugInTemplate.pm and documentation to write your own
classifyPlugIn = Combine::Check_record

###Parameters for Std topic classification algorithm
###StdTitleWeight = 10    #
###StdMetaWeight = 4      #
###StdHeadingsWeight = 2  #
###StdCutoffRel = 10      #Class score must be above this % to be counted
###StdCutoffNorm = 0.2    #normalised cutoff for summed normalised score
###StdCutoffTot = 90      #non normalised cutoff for summed total score

###Parameters for Pos topic classification algorithm
###PosCutoffRel = 1       #Class score must be above this % to be counted
###PosCutoffNorm = 0.002  #normalised cutoff for summed normalised score
###PosCutoffTot = 1       #non normalised cutoff for summed total score

HarvestRetries = 5
SdqRetries = 5

#Maximum length of a URL; longer will be silently discarded
maxUrlLength = 250

#Time in seconds to wait for a server to respond
UAtimeout = 30

#If we have seen this page before use Get-If-Modified (1) or not (0)
UserAgentGetIfModifiedSince = 1

WaitIntervalExpirationGuaranteed = 315360000
WaitIntervalHarvesterLockNotFound = 2592000
WaitIntervalHarvesterLockNotModified = 2592000
WaitIntervalHarvesterLockRobotRules = 2592000
WaitIntervalHarvesterLockUnavailable = 86400
WaitIntervalRrdLockDefault = 86400
WaitIntervalRrdLockNotFound = 345600
WaitIntervalRrdLockSuccess = 345600

#Time in seconds after successful download before allowing a page to be downloaded again (around 11 days)
WaitIntervalHarvesterLockSuccess = 1000000

#Time in seconds to wait before making a new reschedule if a reschedule results in an empty ready queue
WaitIntervalSchedulerGetJcf = 20

#Minimum time between accesses to the same host. Must be positive
WaitIntervalHost = 60

#Identifies MySQL database name, user and host
MySQLdatabase = NoDefaultValue

#Base directory for configuration files; initialized by Config.pm
#@#baseConfigDir = /etc/combine

#Directory for job specific configuration files; taken from 'jobname'
#@#configDir = NoDefaultValue

<binext>
#Extensions of binary files
ps jpg jpeg pdf tif tiff mpg mpeg mov wav au hqx gz z tgz exe zip sdd doc rtf shar mat raw wmz arff rar
</binext>

<converters>
#Configure which converters can be used to produce a XWI object
#Format:
# 1 line per entry
# each entry consists of 3 ';' separated fields
#
#Entries are processed in order and the first match is executed
# external converters have to be found via PATH and be executable to be considered a match
# the external converter command should take a filename as parameter and convert that file
# the result should come on STDOUT
#
# mime-type ; External converter command ; Internal converter
text/html ; ; GuessHTML
#Check this
www/unknown ; ; GuessHTML
text/plain ; ; GuessText
text/x-tex ; tth -g -w1 -r < ; TeXHTML
application/x-tex ; tth -g -w1 -r < ; TeXHTML
text/x-tex ; untex -a -e -giso ; TeXText
application/x-tex ; untex -a -e -giso ; TeXText
text/x-tex ; ; TeX
application/x-tex ; ; TeX
application/pdf ; pdftohtml -i -noframes -nomerge -stdout ; HTML
application/pdf ; pstotext ; Text
application/postscript ; pstotext ; Text
application/msword ; antiword -t ; Text
application/vnd.ms-excel ; xlhtml -fw ; HTML
application/vnd.ms-powerpoint ; ppthtml ; HTML
application/rtf ; unrtf --nopict --html ; HTML
image/gif ; ; Image
image/jpeg ; ; Image
image/tiff ; ; Image
</converters>

<url>
<exclude>
#Exclude URLs or hostnames that match these regular expressions
#Malformed hostnames
HOST: http:\/\/\.
HOST: \@
</exclude>
</url>
#Please change
Operator-Email = "YourEmailAdress@YourDomain"

#Password not used yet. (Please change)
Password = "XxXxyYzZ"

<converters>
#Configure which converters can be used to produce a XWI object
#Format:
# 1 line per entry
# each entry consists of 3 ';' separated fields
#
#Entries are processed in order and the first match is executed
# external converters have to be found via PATH and be executable to be considered a match
# the external converter command should take a filename as parameter and convert that file
# the result should come on STDOUT
#
# mime-type ; External converter command ; Internal converter
application/pdf ; MYpdftohtml -i -noframes -nomerge -stdout ; HTML
</converters>

<url>
#The list of servernames that are aliases is in the file ./config_serveralias
# (automatically updated by other programs)
#use one server per line
#example
#www.100topwetland.com www.100wetland.com
# means that www.100wetland.com is replaced by www.100topwetland.com during URL normalization
<serveralias>
<<include config_serveralias>>
</serveralias>

#use either URL or HOST: (note the ':') to match regular expressions to
# either the full URL or the HOST part of a URL.
<allow>
#Allow crawl of URLs or hostnames that match these regular expressions
HOST: .*$
</allow>

<exclude>
#Exclude URLs or hostnames that match these regular expressions
# default: CGI and maps
URL cgi-bin|htbin|cgi|\?|\.map$|_vti_
# default: binary files
URL \.exe$|\.zip$|\.tar$|\.tgz$|\.gz$|\.hqx$|\.sdd$|\.mat$|\.raw$
URL \.EXE$|\.ZIP$|\.TAR$|\.TGZ$|\.GZ$|\.HQX$|\.SDD$|\.MAT$|\.RAW$
# default: Unparsable documents
URL \.shar$|\.rmx$|\.rmd$|\.mdb$
URL \.SHAR$|\.RMX$|\.RMD$|\.MDB$
# default: images
URL \.gif$|\.jpg$|\.jpeg$|\.xpm$|\.tif$|\.tiff$|\.mpg$|\.mpeg$|\.mov$|\.wav$|\.au$|\.pcx$|\.xbm$|\.tga$
URL \.GIF$|\.JPG$|\.JPEG$|\.XPM$|\.TIF$|\.TIFF$|\.MPG$|\.MPEG$|\.MOV$|\.WAV$|\.AU$|\.PCX$|\.XBM$|\.TGA$
# default: other binary formats
URL \.pdb$|\.class$|\.ica$|\.ram$|\.wmz$|\.arff$|\.rar$|\.vo$|\.fig$
URL \.PDB$|\.CLASS$|\.ICA$|\.RAM$|\.WMZ$|\.ARFF$|\.RAR$|\.VO$|\.FIG$
#more excludes in the file config_exclude (automatically updated by other programs)
<<include config_exclude>>
</exclude>

<sessionids>
#patterns to recognize and remove sessionids in URLs
sessionid lsessionid jsessionid SID PHPSESSID SessionID BV_SessionID
</sessionids>

#url is just a container for all URL related configuration patterns
</url>
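As a hypothetical example of job-specific tuning, the allow/exclude blocks above could be narrowed to a single domain using the same URL/HOST: syntax (example.org and /private/ are placeholders, not part of any shipped configuration):

  <allow>
  #Only crawl servers whose hostname contains example.org
  HOST: example\.org
  </allow>
  <exclude>
  #Additionally skip any URL containing /private/
  URL /private/
  </exclude>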
DROP DATABASE IF EXISTS $database;
CREATE DATABASE $database DEFAULT CHARACTER SET utf8;
USE $database;
All tables use UTF-8
Summary of tables ('^' = primary key, '*' = key):
TABLE hdb: recordid^, type, dates, server, title, ip, ...
TABLE links: recordid*, mynetlocid*, urlid*, netlocid*, linktype, anchor (netlocid for urlid!!)
TABLE meta: recordid*, name, value
TABLE html: recordid^, html
TABLE analys: recordid*, name, value
TABLE topic: recordid*, notation*, absscore, relscore, terms, algorithm
(TABLE netlocalias: netlocid*, netlocstr^)
(TABLE urlalias: urlid*, urlstr^)
TABLE topichierarchy: node^, father*, notation*, caption, level
TABLE netlocs: netlocid^, netlocstr^, retries
TABLE urls: netlocid*, urlid^, urlstr^, path
TABLE urldb: netlocid*, urlid^, urllock, harvest*, retries, netloclock
TABLE newlinks: urlid^, netlocid
TABLE recordurl: recordid*, urlid^, lastchecked, md5*, fingerprint*^
TABLE admin: status, queid, schedulealgorithm
TABLE log: pid, id, date, message
TABLE que: queid^, urlid, netlocid
TABLE robotrules: netlocid*, rule, expire
TABLE oai: recordid, md5^, date*, status
TABLE exports: host, port, last
CREATE TABLE hdb ( recordid int(11) NOT NULL default '0', type varchar(50) default NULL, title text, mdate timestamp NOT NULL, expiredate datetime default NULL, length int(11) default NULL, server varchar(50) default NULL, etag varchar(25) default NULL, nheadings int(11) default NULL, nlinks int(11) default NULL, headings mediumtext, ip mediumblob, PRIMARY KEY (recordid) ) ENGINE=MyISAM AVG_ROW_LENGTH = 20000 MAX_ROWS = 10000000 DEFAULT CHARACTER SET=utf8;
CREATE TABLE html ( recordid int(11) NOT NULL default '0', html mediumblob, PRIMARY KEY (recordid) ) ENGINE=MyISAM AVG_ROW_LENGTH = 20000 MAX_ROWS = 10000000 DEFAULT CHARACTER SET=utf8;
CREATE TABLE links ( recordid int(11) NOT NULL default '0', mynetlocid int(11) default NULL, urlid int(11) default NULL, netlocid int(11) default NULL, anchor text, linktype varchar(50) default NULL, KEY recordid (recordid), KEY urlid (urlid), KEY mynetlocid (mynetlocid), KEY netlocid (netlocid) ) ENGINE=MyISAM MAX_ROWS = 1000000000 DEFAULT CHARACTER SET=utf8;
CREATE TABLE meta ( recordid int(11) NOT NULL default '0', name varchar(50) default NULL, value text, KEY recordid (recordid) ) ENGINE=MyISAM MAX_ROWS = 1000000000 DEFAULT CHARACTER SET=utf8;
CREATE TABLE analys ( recordid int(11) NOT NULL default '0', name varchar(15) NOT NULL, value varchar(20), KEY recordid (recordid) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE topic ( recordid int(11) NOT NULL default '0', notation varchar(50) default NULL, abscore int(11) default NULL, relscore int(11) default NULL, terms text default NULL, algorithm varchar(25), KEY notation (notation), KEY recordid (recordid) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE netlocalias ( netlocid int(11), netlocstr varchar(150) NOT NULL, KEY netlocid (netlocid), PRIMARY KEY netlocstr (netlocstr) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE urlalias ( urlid int(11), urlstr tinytext, KEY urlid (urlid), PRIMARY KEY urlstr (urlstr(255)) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
The topichierarchy table has to be initialized manually
CREATE TABLE topichierarchy ( node int(11) NOT NULL DEFAULT '0', father int(11) DEFAULT NULL, notation varchar(50) NOT NULL DEFAULT '', caption varchar(255) DEFAULT NULL, level int(11) DEFAULT NULL, PRIMARY KEY node (node), KEY father (father), KEY notation (notation) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE netlocs ( netlocid int(11) NOT NULL auto_increment, netlocstr varchar(150) NOT NULL, retries int(11) NOT NULL DEFAULT 0, PRIMARY KEY (netlocstr), UNIQUE INDEX netlockid (netlocid) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE urls ( netlocid int(11) NOT NULL DEFAULT '0', urlid int(11) NOT NULL auto_increment, urlstr tinytext, path tinytext, PRIMARY KEY urlstr (urlstr(255)), INDEX netlocid (netlocid), UNIQUE INDEX urlid (urlid) ) ENGINE=MyISAM MAX_ROWS = 1000000000 DEFAULT CHARACTER SET=utf8;
CREATE TABLE urldb ( netlocid int(11) NOT NULL default '0', netloclock int(11) NOT NULL default '0', urlid int(11) NOT NULL default '0', urllock int(11) NOT NULL default '0', harvest tinyint(1) NOT NULL default '0', retries int(11) NOT NULL default '0', PRIMARY KEY (urlid), KEY netlocid (netlocid), KEY harvest (harvest) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE newlinks ( urlid int(11) NOT NULL, netlocid int(11) NOT NULL, PRIMARY KEY (urlid) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE recordurl ( recordid int(11) NOT NULL auto_increment, urlid int(11) NOT NULL default '0', lastchecked timestamp NOT NULL, md5 char(32), fingerprint char(50), KEY md5 (md5), KEY fingerprint (fingerprint), PRIMARY KEY (urlid), KEY recordid (recordid) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE admin ( status enum('closed','open','paused','stopped') default NULL, schedulealgorithm enum('default','bigdefault','advanced') default 'default', queid int(11) NOT NULL default '0' ) ENGINE=MEMORY DEFAULT CHARACTER SET=utf8;
Initialise admin to 'open' status
INSERT INTO admin VALUES ('open','default',0);
CREATE TABLE log ( pid int(11) NOT NULL default '0', id varchar(50) default NULL, date timestamp NOT NULL, message varchar(255) default NULL ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE que ( netlocid int(11) NOT NULL default '0', urlid int(11) NOT NULL default '0', queid int(11) NOT NULL auto_increment, PRIMARY KEY (queid) ) ENGINE=MEMORY DEFAULT CHARACTER SET=utf8;
CREATE TABLE robotrules ( netlocid int(11) NOT NULL default '0', expire int(11) NOT NULL default '0', rule varchar(255) default '', KEY netlocid (netlocid) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE oai ( recordid int(11) NOT NULL default '0', md5 char(32), date timestamp, status enum('created', 'updated', 'deleted'), PRIMARY KEY (md5), KEY date (date) ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
CREATE TABLE exports ( host varchar(30), port int, last timestamp DEFAULT '1999-12-31' ) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8;
GRANT SELECT,INSERT,UPDATE,DELETE,CREATE,CREATE TEMPORARY TABLES, ALTER,LOCK TABLES ON $database.* TO $dbuser;
GRANT SELECT,INSERT,UPDATE,DELETE,CREATE,CREATE TEMPORARY TABLES, ALTER,LOCK TABLES ON $database.* TO $dbuser\@localhost;
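As a sketch of how this schema can be queried directly, outside the Combine programs, the following Perl snippet uses DBI against the tables defined above; the database name, user and password are placeholders and must match your job configuration:

  #!/usr/bin/perl
  # Hypothetical example: list the ten most common topic classes in a job database.
  use strict;
  use DBI;

  # Placeholders: database 'aatest', user 'combine', password 'secret'
  my $dbh = DBI->connect('DBI:mysql:database=aatest;host=localhost',
                         'combine', 'secret', { RaiseError => 1 });

  my $sth = $dbh->prepare(
      'SELECT notation, COUNT(*) AS cnt FROM topic GROUP BY notation ORDER BY cnt DESC LIMIT 10');
  $sth->execute;
  while (my ($notation, $cnt) = $sth->fetchrow_array) {
      print "$notation\t$cnt\n";
  }
  $dbh->disconnect;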
combineCtrl - controls a Combine crawling job
combineCtrl action -jobname name
where action can be one of start, kill, load, recyclelinks, reharvest, stat, howmany, records, hosts, initMemoryTables, open, stop, pause, continue
jobname is used to find the appropriate configuration (mandatory)
start: takes an optional switch -harvesters n, where n is the number of crawler processes to start
kill: kills all active crawlers (and their associated combineRun monitors) for jobname
load: reads a list of URLs from STDIN (one per line) and schedules them for crawling
recyclelinks: schedules all newly found links (since the last invocation of recyclelinks) in crawled pages for crawling
reharvest: schedules all pages in the database for crawling again (in order to check if they have changed)
open: opens the database for URL scheduling (for example after a stop)
stop: stops URL scheduling
pause: pauses URL scheduling
continue: continues URL scheduling after a pause
stat: prints out rudimentary status of the ready queue (ie URLs eligible for crawling now)
howmany: prints out rudimentary status of all URLs to be crawled
records: prints out the number of records in the SQL database
hosts: prints out rudimentary status of all hosts that have URLs to be crawled
initMemoryTables: initializes the administrative MySQL tables that are kept in memory
Implements various control functionality to administer a crawling job, like starting and stopping crawlers, injecting URLs into the crawl queue, scheduling newly found links for crawling, controlling scheduling, etc.
This is the preferred way of controlling a crawl job.
Seed the crawling job aatest with a URL
Start 3 crawling processes for job aatest
Schedule all newly found links for crawling
See how many URLs are eligible for crawling right now
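The invocations matching the four examples above might look like the following. This is a sketch based on the actions and switches documented in this page; aatest is the job name from the examples and the seed URL is a placeholder:

  echo 'http://www.example.org/' | combineCtrl load --jobname aatest
  combineCtrl start --jobname aatest --harvesters 3
  combineCtrl recyclelinks --jobname aatest
  combineCtrl stat --jobname aatest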
combine
Combine configuration documentation in /usr/share/doc/combine/.
Anders Ardö, anders.ardo@it.lth.se
Copyright (C) 2005 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/
combine - main crawling machine in the Combine system
combine -jobname name -logname id
jobname is used to find the appropriate configuration (mandatory)
logname is used as identifier in the log (in MySQL table log)
Does crawling, parsing, optional topic-checking, and stores the results in a MySQL database. Normally started with the combineCtrl command. Briefly, it gets a URL from the MySQL database, which acts as a common coordinator for a Combine job. The Web page is fetched, provided it passes the robot exclusion protocol. The HTML is cleaned using Tidy and parsed into metadata, headings, text, links and link anchors. The record is then stored in the MySQL database in a structured form, optionally only if a topic-check is passed, to keep the crawler focused.
A simple workflow for a trivial crawl job might look like:
Initialize database and configuration:
  combineINIT --jobname aatest
Enter some seed URLs from a file with a list of URLs:
  combineCtrl load --jobname aatest < seedURLs.txt
Start 2 crawl processes:
  combineCtrl start --jobname aatest --harvesters 2
For some time, occasionally schedule new links for crawling:
  combineCtrl recyclelinks --jobname aatest
or look at the size of the ready queue:
  combineCtrl stat --jobname aatest
When satisfied, kill the crawlers:
  combineCtrl kill --jobname aatest
Export data records in a highly structured XML format:
  combineExport --jobname aatest
For more complex jobs you have to edit the job configuration file.
combineINIT, combineCtrl
Combine configuration documentation in /usr/share/doc/combine/.
Anders Ardö, anders.ardo@it.lth.se
Copyright (C) 2005 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/
combineExport - export records in XML from Combine database
combineExport -jobname name [ -profile alvis|dc|combine -charset utf8|isolatin -number n -recordid n -md5 MD5 -pipehost server -pipeport n -incremental ]
jobname is used to find the appropriate configuration (mandatory)
Three profiles are available: alvis, dc, and combine. alvis and combine are similar XML formats.
The 'alvis' profile format is defined by the Alvis enriched document format DTD. It uses the charset UTF-8 by default.
'combine' is more compact with less redundancy.
'dc' is XML encoded Dublin Core data.
Selects a specific character set, either UTF-8 or iso-latin-1. Overrides the -profile settings.
Specifies the server-name and port to connect to and export data using the Alvis Pipeline. Exports incrementally, ie all changes since last call to combineExport with the same pipehost and pipeport.
The maximum number of records to be exported
Export just the one record with this recordid
Export just the one record with this MD5 checksum
Exports incrementally, ie all changes since last call to combineExport using -incremental
Generates records in Combine native format and converts them using this XSLT script before output. See example scripts in /etc/combine/*.xsl
Export all records in Alvis XML format to the file recs.xml:
  combineExport --jobname atest > recs.xml
Export 10 records to STDOUT:
  combineExport --jobname atest --number 10
Export all records in UTF-8 using the Combine native format:
  combineExport --jobname atest --profile combine --charset utf8 > Zebrarecs.xml
Incremental export of all changes since the last call, using localhost at port 6234 and the default profile (alvis):
  combineExport --jobname atest --pipehost localhost --pipeport 6234
Combine configuration documentation in /usr/share/doc/combine/.
Alvis XML schema (-profile alvis) at http://project.alvis.info/alvis_docs/enriched-document.xsd
Anders Ardö, anders.ardo@it.lth.se
Copyright (C) 2005 - 2006 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/
combineRun - starts, monitors and restarts a combine harvesting process
combineRun pidfile <combine command to run>
Starts a program and monitors it in order to make sure there is always a copy running. If the program dies it will be restarted with the same parameters. Used by combineCtrl when starting combine crawling.
combineCtrl
Anders Ardö, anders.ardo@it.lth.se
Copyright (C) 2005 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/
combineUtil - various operations on the Combine database
combineUtil action -jobname name
where action can be one of stats, termstat, classtat, sanity, all, serveralias, resetOAI, restoreSanity, deleteNetLoc, deletePath, deleteMD5, deleteRecordid, addAlias
jobname is used to find the appropriate configuration (mandatory)
stats: global statistics about the database
termstat: generates statistics about the terms from the topic ontology matched in documents (can be long output)
classtat: generates statistics about the topic classes assigned to documents
sanity: performs various sanity checks on the database
restoreSanity: deletes records that the sanity checks find insane
resetOAI: removes all history (ie 'deleted' records) from the OAI table. This is done by removing the OAI table and recreating it from the existing database.
all: does the statistics generation actions: stats, sanity, classtat, termstat
deleteNetLoc: deletes all records matching the ','-separated list of server net-locations (server-names optionally with port) in the switch -netlocstr. Net-locations can include SQL wild cards ('%').
deletePath: deletes all records matching the ','-separated list of URL paths (excluding net-locations) in the switch -pathsubstrs. Paths can include SQL wild cards ('%').
deleteMD5: deletes the record which has the MD5 given in the switch -md5
deleteRecordid: deletes the record which has the recordid given in the switch -recordid
serveralias: detects server aliases in the current database and does an 'addAlias' on each detected alias
addAlias: manually adds a serveralias to the system. Requires the switches -aliases and -preferred
Do various statistics generation and perform sanity checks on the database
Generate matched term statistics
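The corresponding invocations might look like this (a sketch based on the synopsis and actions above; aatest is a placeholder job name):

  combineUtil all --jobname aatest
  combineUtil termstat --jobname aatest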
combine
Combine configuration documentation in /usr/share/doc/combine/.
Anders Ardö, anders.ardo@it.lth.se
Copyright (C) 2005 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/
Combine::FromHTML.pm - HTML parser in combine package
Yong Cao tsao@munin.ub2.lu.se
v0.06 1997-03-19
Anders Ardö 1998-07-18
  added <AREA ... HREF=link ...>
  fixed <A ... HREF=link ...> regexp to be more general
Anders Ardö 2002-09-20
  added 'a' as a tag not to be replaced with space
  added removal of Cntrl-chars and some punctuation marks from IP
  added <style> ... </style> as something to be removed before processing
  beefed up compression of sequences of blanks to include 240 (non-breakable space)
  changed 'remove head' before text extraction to handle multiline matching (which can be introduced by decoding html entities)
  added compress blanks and remove CRs to metadata-content
Anders Ardö 2004-04
  Changed extraction process dramatically
Combine::FromTeX.pm - TeX parser in combine package
Anders Ardö 2000-06-11
HTMLExtractor
Adapted from HTML::LinkExtractor - Extract links from an HTML document - by D.H (PodMaster)
D.H (PodMaster)
Copyright (c) 2003 by D.H. (PodMaster). All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. The LICENSE file contains the full text of the license.
LoadTermList
This is a module in the DESIRE automatic classification system. Copyright 1999.
LoadTermList - A class for loading and storing a stoplist with single words, and a termlist with classifications and weights
Subroutines: LoadStopWordList(StopWordListFileName) loads a list of stopwords, one per line, from the file StopWordListFileName.
EraseStopWordList clears the stopword list
Subroutines: LoadTermList(TermListFileName) - loads TermClass from file LoadTermListStemmed(TermListFileName) - same plus stems terms
Input: A formatted term-list including weights and classifications.
Format:
  <weight>: <term_reg_exp>=[<classification>, ]+
where weight can be a positive or negative number, and term_reg_exp can be words, phrases, boolean expressions (with @and as operator) on term_reg_exp, or Perl regular expressions.
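A hypothetical term-list fragment in this format, reusing the carnivorous-plant classes that appear in the installation test record earlier in this document (the weights and the negative entry are invented for illustration):

  10: carnivorous plant=CP
  10: carnivor=CP
  15: drosera=CP, CP.Drosera
  15: nepenthe=CP, CP.Nepenthes
  20: drosera@and nepenthe=CP
  -50: drosera rock=CP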
Anders Ardö Anders.Ardo@it.lth.se
Copyright (C) 2005,2006 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/
Matcher
This is a module in the DESIRE automatic classification system. Copyright 1999. Modified in the ALVIS project. Copyright 2004.

Exported routines:

1. Fetching text. These routines all extract texts from a document (either a Combine XWI datastructure or a WWW-page identified by a URL). They all return: $meta, $head, $text, $url, $title, $size
  $meta: Metadata from document
  $head: Important text from document
  $text: Plain text from document
  $url: URL of the document
  $title: HTML title of the document
  $size: The size of the document

Common input parameters:
  $DoStem: 1=do stemming; 0=no stemming
  $stoplist: object pointer to a LoadTermList object with a stoplist loaded
  $simple: 1=do simple loading; 0=advanced loading (might induce errors)

getTextXWI parameters: $xwi, $DoStem, $stoplist, $simple
  $xwi is a Combine XWI datastructure

getTextURL parameters: $url, $DoStem, $stoplist, $simple
  $url is the URL for the page to extract text from

2. Term matcher. Accepts a text as a (reference) parameter and matches each term in the termlist against the text. Matches are recorded in an associative array with class as key and summed weight as value.
Match parameters: $text, $termlist
  $text: text to match against the termlist
  $termlist: object pointer to a LoadTermList object with a termlist loaded
Output:
  %score: an associative array with classifications as keys and scores as values
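A usage sketch, under stated assumptions: it presumes the modules are installed as Combine::Matcher and Combine::LoadTermList, that LoadTermList has a new() constructor, and that getTextURL and Match are exported with the parameter lists documented above; the file paths are placeholders. None of these names are confirmed by this page, so treat the example as illustrative only:

  use Combine::Matcher;       #assumed module name
  use Combine::LoadTermList;  #assumed module name

  my $stoplist = Combine::LoadTermList->new;                       #assumed constructor
  $stoplist->LoadStopWordList('/etc/combine/stopwords.txt');       #placeholder path

  my $termlist = Combine::LoadTermList->new;
  $termlist->LoadTermList('/etc/combine/Topic_carnivor.txt');      #term list used in the installation test

  #Extract text from a page (no stemming, simple loading)
  my ($meta, $head, $text, $url, $title, $size) =
      getTextURL('http://combine.it.lth.se/CombineTests/InstallationTest.html', 0, $stoplist, 1);

  #Match the extracted text against the term list; %score maps class notation to summed weight
  my %score = Match($text, $termlist);
  for my $class (sort { $score{$b} <=> $score{$a} } keys %score) {
      print "$class: $score{$class}\n";
  }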
Anders Ardö anders.ardo@it.lth.se
Copyright (C) 2005,2006 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/
PosMatcher
This is a module in the DESIRE automatic classification system. Copyright 1999.

Exported routines:

1. Fetching text. These routines all extract texts from a document (either a Combine record, a Combine XWI datastructure or a WWW-page identified by a URL). They all return: $meta, $head, $text, $url, $title, $size
  $meta: Metadata from document
  $head: Important text from document
  $text: Plain text from document
  $url: URL of the document
  $title: HTML title of the document
  $size: The size of the document

Common input parameters:
  $DoStem: 1=do stemming; 0=no stemming
  $stoplist: object pointer to a LoadTermList object with a stoplist loaded
  $simple: 1=do simple loading; 0=advanced loading (might induce errors)

getTextMD5 parameters: $md5, $hdb_top, $DoStem, $stoplist, $simple
  $md5 is a key into a Combine hdb-directory
  $hdb_top is the path to the top of the Combine hdb-directory

getTextXWI parameters: $xwi, $DoStem, $stoplist, $simple
  $xwi is a Combine XWI datastructure

getTextURL parameters: $url, $DoStem, $stoplist, $simple
  $url is the URL for the page to extract text from

2. Term matcher. Accepts a text as a (reference) parameter and matches each term in the termlist against the text. Matches are recorded in an associative array with class as key and summed weight as value.
Match parameters: $text, $termlist
  $text: text to match against the termlist
  $termlist: object pointer to a LoadTermList object with a termlist loaded
Output:
  %score: an associative array with classifications as keys and scores as values

3. Heuristics: sum scores down the classification tree to the leaves.
cleanEiTree parameters: %res - an associative array from Match
Output: %res - same array
Anders Ardö, anders.ardo@it.lth.se
Copyright (C) 2005,2006 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/
RobotRules.pm
Anders Ardö version 1.0 2004-02-19
SD_SQL
Reimplementation of sd.pl, SD.pm and SDQ.pm using MySQL; contains both recyc and guard.
The basic idea is to have a table (urldb) that contains most URLs ever
inserted into the system together with a lock (the guard function) and
a boolean harvest-flag. Also in this table is the host part together with
its lock. URLs are selected from this table based on urllock, netloclock and
harvest and inserted into a queue (table que). URLs from this queue
are then given out to harvesters. The queue is implemented as:
# The admin table can be used to generate sequence numbers like this:
#   mysql> update admin set queid=LAST_INSERT_ID(queid+1);
# and used to extract the next URL from the queue:
#   mysql> select host,url from que where queid=LAST_INSERT_ID();
#
When the queue is empty it is filled from table urldb. Several different
algorithms can be used to fill it (round-robin, most urls, longest time
since harvest, ...). Since the harvest-flag and guard-lock are not updated
until the actual harvest is done it is OK to delete the queue and
regenerate it anytime.
##########################
#Questions, ideas, TODOs, etc
#Split table urldb into 2 tables - one for urls and one for hosts???
#Less efficient when filling que; more efficient when updating netloclock
#Data structure TABLE hosts:
create table hosts(
  host varchar(50) not null default '',
  netloclock int not null,
  retries int not null default 0,
  ant int not null default 0,
  primary key (host),
  key (ant),
  key (netloclock)
);
#############
#Handle too many retries?
One algorithm takes a URL from the host that was accessed longest ago:
  ($hostid,$url) = SELECT host,url,id FROM hosts,urls
    WHERE hosts.hostlock < UNIX_TIMESTAMP() AND hosts.host=urls.host
      AND urls.urllock < UNIX_TIMESTAMP() AND urls.harvest=1
    ORDER BY hostlock LIMIT 1;

Another algorithm takes a URL from the host with the most URLs:
  ($hostid,$url) = SELECT host,url,id FROM hosts,urls
    WHERE hosts.hostlock < UNIX_TIMESTAMP() AND hosts.host=urls.host
      AND urls.urllock < UNIX_TIMESTAMP() AND urls.harvest=1
    ORDER BY hosts.ant DESC LIMIT 1;

A third algorithm takes a URL from any available host:
  ($hostid,$url) = SELECT host,url,id FROM hosts,urls
    WHERE hosts.hostlock < UNIX_TIMESTAMP() AND hosts.host=urls.host
      AND urls.urllock < UNIX_TIMESTAMP() AND urls.harvest=1
    LIMIT 1;
Anders Ardö anders.ardo@it.lth.se
Copyright (C) 2005,2006 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/
XWI.pm - class for interfacing to various web-index format translators
2002-09-30 AA: added robot section in analogy with meta
Yong Cao tsao@munin.ub2.lu.se
v0.05 1997-03-13
Anders Ardö, anders.ardo@it.lth.se
Copyright (C) 2005,2006 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/
selurl - Normalise and validate URIs for harvesting
Selurl selects and normalises URIs on the basis of both general practice (hostname lowercasing, portnumber substitution etc.) and Combine-specific handling (applying config_allow, config_exclude, config_serveralias and other relevant config settings).
The Config settings catered for currently are:
maxUrlLength - the maximum length of an unnormalised URL
allow - Perl regular expressions identifying allowed URLs
exclude - Perl regular expressions to exclude URLs from harvesting
serveralias - aliases of server names
sessionids - list of sessionid markers to be removed
A selurl object can hold a single URL and has methods to obtain its subparts as defined in URI.pm, plus some methods to normalise and validate it in Combine context.
Currently, the only schemes supported are http, https and ftp. Others may or may not work correctly. For one thing, we assume the scheme has an internet hostname/port.
clone() will only return a copy of the real URI object, not a new selurl.
URI URI-escapes the strings fed into it by new() once. Existing percent signs in the input are left untouched, which implies that:
(a) there is no risk of double-encoding; and
(b) if the original contained an inadvertent sequence that could be interpreted as an escape sequence, uri_unescape will not render the original input (e.g. url_with_%66_in_it goes whoop). If you know that the original has not yet been escaped and wish to safeguard potential percent signs, you'll have to escape them (and only them) once before you offer it to new().
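A minimal sketch of that safeguard (hypothetical example; it assumes the raw string has not been URI-escaped yet):

  # Escape only the percent signs in a raw, not-yet-escaped URL string,
  # so that a later single round of escaping cannot misread '%66' as an escape sequence.
  my $raw  = 'http://example.org/url_with_%66_in_it';
  (my $safe = $raw) =~ s/%/%25/g;   # '%' becomes '%25'; nothing else is touched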
A problem with URI is that its object is not a hash we can piggyback our data on, so I had to resort to AUTOLOAD to emulate inheritance. I find this ugly, but well, this *is* Perl, so what'd you expect?