Contents
I
Overview
1
Introduction
2
Open source distribution, installation
2.1
Installation
2.1.1
Installation from source for the impatient
2.1.2
Porting to unsupported operating systems - dependencies
2.1.3
Automated Debian/Ubuntu installation
2.1.4
Manual installation
2.1.5
Out-of-the-box installation test
2.2
Getting started
2.3
Online documentation
2.4
Use scenarios
2.4.1
General crawling without restrictions
2.4.2
Focused crawling – domain restrictions
2.4.3
Focused crawling – topic specific
2.4.4
Focused crawling in an Alvis system
2.4.5
Crawl one entire site and its outlinks
3
Configuration
3.1
Configuration files
3.1.1
Templates
3.1.2
Global configuration files
3.1.3
Job specific configuration files
3.1.4
Details and default values
4
Crawler internal operation
4.1
URL selection criteria
4.2
Document parsing and information extraction
4.3
URL filtering
4.4
Crawling strategy
4.5
Built-in topic filter – automated subject classification using string matching
4.5.1
Topic definition
4.5.2
Topic definition (term triplets) BNF grammar
4.5.3
Term triplet examples
4.5.4
Algorithm 1: plain matching
4.5.5
Algorithm 2: position weighted matching
4.6
Built-in topic filter – automated subject classification using SVM
4.7
Topic filter Plug-In API
4.8
Analysis
4.9
Duplicate detection
4.10
URL recycling
4.11
Database cleaning
4.12
Complete application – SearchEngine in a Box
5
Evaluation of automated subject classification
5.1
Approaches to automated classification
5.1.1
Description of the used string-matching algorithm
5.2
Evaluation methodology
5.2.1
Evaluation challenge
5.2.2
Evaluation measures used
5.2.3
Data collection
5.3
Results
5.3.1
The role of different thesauri terms
5.3.2
Enriching the term list using natural language processing
5.3.3
Importance of HTML structural elements and metadata
5.3.4
Challenges and recommendations for classification of Web pages
5.3.5
Comparing and combining two approaches
6
Performance and scalability
6.1
Speed
6.2
Space
6.3
Crawling strategy
7
System components
7.1
combineINIT
7.2
combineCtrl
7.3
combineUtil
7.4
combineExport
7.5
Internal executables and Library modules
7.5.1
Library
II
Gory details
8
Frequently asked questions
9
Configuration variables
9.1
Name/value configuration variables
9.1.1
analysePlugin
9.1.2
AutoRecycleLinks
9.1.3
baseConfigDir
9.1.4
classifyPlugIn
9.1.5
configDir
9.1.6
doAnalyse
9.1.7
doCheckRecord
9.1.8
doOAI
9.1.9
extractLinksFromText
9.1.10
HarvesterMaxMissions
9.1.11
HarvestRetries
9.1.12
httpProxy
9.1.13
LogHandle
9.1.14
Loglev
9.1.15
maxUrlLength
9.1.16
MySQLdatabase
9.1.17
MySQLfulltext
9.1.18
MySQLhandle
9.1.19
Operator-Email
9.1.20
Password
9.1.21
relTextPlugin
9.1.22
saveHTML
9.1.23
SdqRetries
9.1.24
SolrHost
9.1.25
SummaryLength
9.1.26
SVMmodel
9.1.27
UAtimeout
9.1.28
UserAgentFollowRedirects
9.1.29
UserAgentGetIfModifiedSince
9.1.30
useTidy
9.1.31
WaitIntervalExpirationGuaranteed
9.1.32
WaitIntervalHarvesterLockNotFound
9.1.33
WaitIntervalHarvesterLockNotModified
9.1.34
WaitIntervalHarvesterLockRobotRules
9.1.35
WaitIntervalHarvesterLockSuccess
9.1.36
WaitIntervalHarvesterLockUnavailable
9.1.37
WaitIntervalHost
9.1.38
WaitIntervalRrdLockDefault
9.1.39
WaitIntervalRrdLockNotFound
9.1.40
WaitIntervalRrdLockSuccess
9.1.41
WaitIntervalSchedulerGetJcf
9.1.42
ZebraHost
9.2
Complex configuration variables
9.2.1
allow
9.2.2
binext
9.2.3
converters
9.2.4
exclude
9.2.5
serveralias
9.2.6
sessionids
9.2.7
url
10
Module dependences
10.1
Programs
10.1.1
Check_record.pm.svn-base
10.1.2
CleanXML2CanDoc.pm.svn-base
10.1.3
Config.pm.svn-base
10.1.4
DataBase.pm.svn-base
10.1.5
FromHTML.pm.svn-base
10.1.6
FromImage.pm.svn-base
10.1.7
HTMLExtractor.pm.svn-base
10.1.8
LoadTermList.pm.svn-base
10.1.9
LogSQL.pm.svn-base
10.1.10
Matcher.pm.svn-base
10.1.11
MySQLhdb.pm.svn-base
10.1.12
PosCheck_record.pm.svn-base
10.1.13
PosMatcher.pm.svn-base
10.1.14
RobotRules.pm.svn-base
10.1.15
SD_SQL.pm.svn-base
10.1.16
Solr.pm.svn-base
10.1.17
UA.pm.svn-base
10.1.18
XWI.pm.svn-base
10.1.19
XWI2XML.pm.svn-base
10.1.20
Zebra.pm.svn-base
10.1.21
classifySVM.pm.svn-base
10.1.22
combine
10.1.23
combine.svn-base
10.1.24
combineCtrl
10.1.25
combineCtrl.svn-base
10.1.26
combineExport
10.1.27
combineExport.svn-base
10.1.28
combineINIT
10.1.29
combineINIT.svn-base
10.1.30
combineRank
10.1.31
combineRank.svn-base
10.1.32
combineReClassify
10.1.33
combineReClassify.svn-base
10.1.34
combineSVM
10.1.35
combineSVM.svn-base
10.1.36
combineUtil
10.1.37
combineUtil.svn-base
10.1.38
selurl.pm.svn-base
10.1.39
utilPlugIn.pm.svn-base
10.2
Library modules
10.2.1
Check_record.pm
10.2.2
CleanXML2CanDoc.pm
10.2.3
Config.pm
10.2.4
DataBase.pm
10.2.5
FromHTML.pm
10.2.6
FromImage.pm
10.2.7
HTMLExtractor.pm
10.2.8
LoadTermList.pm
10.2.9
LogSQL.pm
10.2.10
Matcher.pm
10.2.11
MySQLhdb.pm
10.2.12
PosCheck_record.pm
10.2.13
PosMatcher.pm
10.2.14
RobotRules.pm
10.2.15
SD_SQL.pm
10.2.16
Solr.pm
10.2.17
UA.pm
10.2.18
XWI.pm
10.2.19
XWI2XML.pm
10.2.20
Zebra.pm
10.2.21
classifySVM.pm
10.2.22
selurl.pm
10.2.23
utilPlugIn.pm
10.3
External modules
III
A
APPENDIX
A.1
Simple installation test
A.1.1
InstallationTest.pl
A.2
Example topic filter plug-in
A.2.1
classifyPlugInTemplate.pm
A.3
Default configuration files
A.3.1
Global
A.3.2
Job specific
A.4
SQL database
A.4.1
Create database
A.4.2
Creating MySQL tables
A.4.3
Data tables
A.4.4
Administrative tables
A.4.5
Create user dbuser with required privileges
A.5
Manual pages
A.5.1
combineExport
A.5.2
combineCtrl
A.5.3
combineRun
A.5.4
combineReClassify
A.5.5
combineSVM
A.5.6
combineRank
A.5.7
combineUtil
A.5.8
combine
A.5.9
Combine::PosMatcher
A.5.10
Combine::selurl
A.5.11
Combine::XWI
A.5.12
Combine::Matcher
A.5.13
Combine::FromTeX
A.5.14
Combine::SD_SQL
A.5.15
Combine::utilPlugIn
A.5.16
Combine::FromHTML
A.5.17
Combine::RobotRules
A.5.18
Combine::HTMLExtractor
A.5.19
Combine::LoadTermList
A.5.20
Combine::classifySVM