Creating a new profile
A profile defines which web pages and servers the crawler indexes,
and how. To create a new profile, select the New profile wizard and
follow the on-screen instructions.
The basic configuration variables for a profile are described below.
The more advanced variables are described later, on the Advanced
profile configuration page.
- Profile id
-
A unique identifier for the profile.
It should be a short, descriptive string and must not contain any
spaces. For example, the id could be: my_profile.
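The no-spaces rule can be checked before an id is typed into the
wizard. The following is a minimal sketch only: the function name
is_valid_profile_id and the 32-character limit standing in for "short"
are illustrative assumptions, not IntraSeek settings.

```python
def is_valid_profile_id(profile_id: str, max_length: int = 32) -> bool:
    """Return True if profile_id is non-empty, short, and free of whitespace.

    max_length is an assumed limit; the documentation only says "short".
    """
    if not profile_id or len(profile_id) > max_length:
        return False
    # Reject spaces, tabs, and any other whitespace character.
    return not any(ch.isspace() for ch in profile_id)
```

With this sketch, my_profile passes, while "my profile" is rejected
because of the space.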
- Profile name
-
The profile name is displayed on the selection tab of the search page
that is shown to the outside world. For example, this could be
My test search.
- Activated
-
Leave this set to yes for now.
- Storage directory
-
A path in your file system, ending with a "/". IntraSeek has
automatically created a dedicated directory for storing the
databases, but you can change this to any path in the file system.
- Working directory
-
A path in your file system, ending with a "/". This is where the data
gathered by the crawlers is stored. Because of the way the database
works, this directory should be placed on a fast disk, which can
speed up the process by several hundred percent.
- Startpages
-
The set of pages at which the crawler starts. It is usually
sufficient to give the URL of the main page of the site you want to
index, since an IntraSeek crawler follows every link it finds. Put
each URL on its own line. For example:
http://my.server.com/~sysadm/
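The one-URL-per-line convention can be illustrated with a short
Python sketch. The function name parse_start_pages is hypothetical,
and the rule that every entry must be an absolute http or https URL is
an assumption made for this example, not documented IntraSeek
behaviour.

```python
from urllib.parse import urlsplit


def parse_start_pages(text: str) -> list[str]:
    """Split a start-pages field into URLs, one per non-empty line."""
    urls = []
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines between entries
        parts = urlsplit(url)
        # Assumption: only absolute http/https URLs are valid start pages.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute HTTP URL: {url!r}")
        urls.append(url)
    return urls
```

A field holding two lines yields a list of two start URLs. An entry
without a scheme, such as my.server.com/, raises ValueError in this
sketch.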
- Accept pattern
-
Specifies which pages the crawler accepts. Keep the following
important points in mind:
- Always restrict the crawler to your own site. If you don't, it
will crawl out onto the World Wide Web without any warning.
- Because the accept and avoid patterns are actually regular
expressions, write ^http://www.foo.com/* rather than
www.foo.com/* if you want to be sure that
http://gazonk.www.foo.com/ is not indexed. The unanchored form also
matches any URL that merely contains www.foo.com/.
- Put each accept pattern on its own line. For example, a pattern
could be
my.server.com/~webmaster/*.
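The effect of the ^ anchor can be checked with Python's re module.
This is a sketch under one stated assumption: that a URL is accepted
when the pattern matches anywhere in it. IntraSeek's own matcher may
behave differently. The helper name accepted is hypothetical.

```python
import re

# The two patterns from the text, compiled as plain regular expressions.
unanchored = re.compile(r"www.foo.com/*")
anchored = re.compile(r"^http://www.foo.com/*")


def accepted(pattern: re.Pattern, url: str) -> bool:
    # Assumption: a URL is accepted if the pattern matches anywhere in it.
    return pattern.search(url) is not None


urls = ["http://www.foo.com/index.html", "http://gazonk.www.foo.com/"]

# Map each URL to (accepted by unanchored, accepted by anchored).
results = {
    url: (accepted(unanchored, url), accepted(anchored, url)) for url in urls
}
```

Under this assumption the unanchored pattern accepts both URLs, while
the anchored one accepts only http://www.foo.com/index.html.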
- Avoid pattern
-
Specifies which pages the crawler avoids. The field is pre-filled
with patterns for file types whose contents the crawler cannot index.
If any of these defaults do not suit your needs, remove them to have
the crawler index those file types.
For example, if you specify */~webmaster/non-public/
here, the crawler will avoid ~webmaster/non-public/ on all
servers. If you specify *my.server.com/~root/*,
/~root/ will not be indexed on the server
my.server.com.
Remember to check for URLs that pass arguments to CGI scripts and
the like. For instance, directory listings can sometimes send the
crawler into an infinite loop. If your site has any such pages, it is
recommended that *?* be added
here.
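The avoid examples above use * as a wildcard. The Python sketch below
shows how such patterns could be evaluated. It assumes that * stands
for any run of characters, that every other character (including ?)
is literal, and that a URL is avoided when a pattern matches anywhere
in it. These are assumptions made for illustration, and the helper
names wildcard_to_regex and is_avoided are hypothetical. This is not
IntraSeek's actual matcher.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Compile a pattern in which '*' is the only wildcard (an assumption)."""
    # Escape the literal pieces so that '?' and '.' are matched literally.
    literal_parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(literal_parts))


def is_avoided(url: str, avoid_patterns: list[str]) -> bool:
    """Return True if any avoid pattern matches somewhere in the URL."""
    return any(wildcard_to_regex(p).search(url) for p in avoid_patterns)


# The three example patterns from the text.
avoid = ["*/~webmaster/non-public/", "*my.server.com/~root/*", "*?*"]
```

Under these assumptions,
is_avoided("http://my.server.com/cgi-bin/list?dir=/a", avoid) is true
because of the *?* pattern, while http://my.server.com/index.html is
not avoided.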
While the crawler is running, check its log file regularly to make
sure it has not entered a loop or otherwise run amok.
Finally, on the last page of the New profile wizard, press
OK to save the new profile. Technical notes: all profiles
are saved in the text file ENGINE_HOME/profiles.txt. If no id
is specified, a new unique id is generated.