Indexing
Select the Crawlers tab at the top of the page to enter the
Crawler Control page. From this page you can start and stop crawlers
and view their current status. First on the page is the list of crawlers
ready to start; below it is the list of active crawlers.
- At this point your test profile should be in the list of ready
crawlers. Select Launch Now to start a crawler.
- Your profile should now appear in the list of Active
Crawlers with the status starting up. Click Reload in
your browser to update this information.
- After a while (click Reload every few seconds) the status of your
crawler will change to Running! and a few more options will appear.
If this does not happen within a
few seconds, something has probably gone wrong. The most likely cause
is that Pike failed to start. Check:
- That you have specified the correct Engine home
location path in the IntraSeek Module Configuration.
- Challenger's debug log for any hints about what went wrong.
Once you have found the error, click Zap! to return the crawler
to the Ready to launch list.
- If all went well, the crawler is now running. You can select
either View status to see a summary of the progress, or
View log to see a more verbose log.
- If you select Zap!, the crawler will be killed by
force and all data gathering aborted. The crawler will then be
Ready to launch again, and none of the gathered information will be
used. Any temporary files will also be deleted.
- If you select Halt!, the crawler will instead be
stopped gracefully. It will save all its information to the data bases,
and also save a freeze-file, ID.scheduler_freeze.is, so that the
data gathering can be continued later. The status box will say
Halting... (saving), and then Halted once everything has been saved.
Afterwards
When the crawler has finished running and indexed all the available
web pages, two data bases named n.ID.pages.yabu
and n.ID.index.yabu, and a flag-file ID.new.flag,
will have been created, where ID is the name of the profile.
When the search engine finds the flag-file, these will be renamed to
a.ID.index.yabu and a.ID.pages.yabu, the flag-file
will be deleted, and the new data bases will be used.
More information on these is to be found in the Storage of Data Bases
chapter of the technical documentation.
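IntraSeek itself is written in Pike, so the following is only an illustrative Python sketch of the hand-over step described above: when the flag-file exists, the fresh n.* data bases replace the active a.* ones and the flag-file is deleted. The function name activate_new_databases and the data_dir parameter are assumptions for the example, not part of the product.

```python
import os

def activate_new_databases(profile_id, data_dir="."):
    """Hypothetical sketch of the flag-file hand-over: rename the
    freshly crawled 'n.' data bases to the active 'a.' names."""
    flag = os.path.join(data_dir, f"{profile_id}.new.flag")
    if not os.path.exists(flag):
        return False  # no flag-file, so no new data bases to activate
    for kind in ("index", "pages"):
        new = os.path.join(data_dir, f"n.{profile_id}.{kind}.yabu")
        active = os.path.join(data_dir, f"a.{profile_id}.{kind}.yabu")
        os.replace(new, active)  # swap the new data base into place
    os.remove(flag)  # delete the flag-file once the swap is complete
    return True
```

Checking for the flag-file first means a half-finished crawl (which has not yet written ID.new.flag) is never picked up by mistake.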
You will notice a slight pause the first time you use the new data
base. The reason for this is that the IntraSeek module replaces the
old data base files with the new ones. This delay can last anywhere
from seconds to minutes depending on file system speed and data base
size.
Logs have been generated as well. Select the Logs tab in
the configuration interface to view them. The main crawler log records
how the different crawlers have been started and stopped. This log can
be cleared by selecting Delete the main log.
Standards
IntraSeek supports the Robots Exclusion Protocol, a method
that allows web site administrators to indicate to visiting robots
which parts of their site should not be visited. When a
robot visits a web site, say http://www.foobar.com/, it
first checks for http://www.foobar.com/robots.txt. This file
states whether the crawler may retrieve a given document or not.
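The check a crawler performs can be demonstrated with Python's standard urllib.robotparser module. The robots.txt body and the user-agent name "IntraSeek" below are examples, not values taken from the product:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# A real crawler would fetch the file first:
#   rp.set_url("http://www.foobar.com/robots.txt"); rp.read()
# Here we parse an example robots.txt body directly:
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Ordinary pages may be fetched; anything under /private/ may not.
print(rp.can_fetch("IntraSeek", "http://www.foobar.com/index.html"))  # True
print(rp.can_fetch("IntraSeek", "http://www.foobar.com/private/x"))   # False
```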
We do, however, recommend the use of <meta name=robots>
instead, as it is more flexible. Maintaining the "/robots.txt" file
requires root privileges, while any HTML author can
control the META method locally.
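A crawler honours the META method by scanning each page's head for a robots meta tag. As a sketch (not IntraSeek's actual implementation), Python's standard html.parser can extract the directive; the class name RobotsMetaParser and the sample page are assumptions for the example:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # attribute names arrive lowercased
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives.append(a.get("content", ""))

page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
p = RobotsMetaParser()
p.feed(page)
print(p.directives)  # ['noindex,nofollow']
```

A page marked noindex,nofollow asks the crawler neither to index the page nor to follow its links, which is exactly the per-page control the "/robots.txt" file cannot offer.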
Frames & indexing
Frames are a big problem for all search engines. Most site maintainers
optimize their web pages for the latest versions of Netscape and MS
Internet Explorer, ignoring the <noframes> options. Right now,
no search engine on the market can follow frames and reconstruct the
exact view (frame set). It is very difficult, if not impossible, to
keep track of all dynamic frames and frame sets, and if JavaScript
is used to generate links, it becomes even harder.
IntraSeek follows links found in frame sets. This is an acceptable
solution, although not an ideal one.