Indexing

Select the Crawlers tab at the top of the page to enter the Crawler Control page. From this page, you can start and stop crawlers and view their current status. The page first lists the crawlers that are ready to start; below that is the list of active crawlers.

  • At this point your test profile should be in the list of ready crawlers. Select Launch Now to start a crawler.

  • Now, your profile should appear in the list of Active Crawlers with the status starting up. Click Reload on your browser to update this information.

  • After a while (click Reload every few seconds) the status of your crawler will change to Running! and a few more options will appear.

    If this does not happen within a few seconds, something has probably gone wrong. The most likely cause is that Pike failed to start. Check:

    • That you have specified the correct Engine home location path in the IntraSeek Module Configuration.

    • Challenger's debug log for any hints about what went wrong.

    If you find the error, click Zap! to return the crawler to the Ready to launch list.

  • If all went well, the crawler is now running. You can select either View status to see a summary of the progress, or View log to see a more verbose log.

    • If you select Zap!, the crawler will be killed by force, and the entire data gathering aborted. The crawler will then be Ready to launch again, and none of the information gathered will be used. Any temporary files will also be deleted.

    • If you select Halt!, the crawler will instead be stopped. It will then save all its information to the data bases, and also save a freeze-file, ID.scheduler_freeze.is, so that the data gathering can be continued later. The status box will show Halting... (saving) while this is in progress, and Halted when everything has been saved.
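The status changes described in the list above can be summarized as a small transition table. This is only an illustrative sketch in Python, not part of IntraSeek. The status texts and button names are taken from the steps above, while the event names "started" and "saved" and the function next_status are hypothetical names for the changes that happen without user action.

```python
# (current status, event) -> next status, as described in the steps above.
# "started" and "saved" are hypothetical names for automatic changes.
TRANSITIONS = {
    ("Ready to launch", "Launch Now"): "starting up",
    ("starting up", "started"): "Running!",
    ("starting up", "Zap!"): "Ready to launch",
    ("Running!", "Zap!"): "Ready to launch",
    ("Running!", "Halt!"): "Halting... (saving)",
    ("Halting... (saving)", "saved"): "Halted",
}


def next_status(status, event):
    """Return the status that follows `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(
            f"{event!r} is not described for status {status!r}"
        ) from None


if __name__ == "__main__":
    status = "Ready to launch"
    for event in ("Launch Now", "started", "Halt!", "saved"):
        status = next_status(status, event)
        print(event, "->", status)
```

Anything the steps above do not describe, such as what can be done with a Halted crawler, is deliberately left out of the table.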

Afterwards
When the crawler has finished running, and indexed all the available web pages, two data bases with the names n.ID.pages.yabu and n.ID.index.yabu, and a flag-file ID.new.flag will have been created, where ID is the name of the profile.

When the search engine finds the flag-file, it renames these to a.ID.index.yabu and a.ID.pages.yabu, deletes the flag-file, and starts using the new data bases.

More information on these is to be found in the Storage of Data Bases chapter of the technical documentation.

You will notice a slight pause the first time you use the new data base. The reason for this is that the IntraSeek module replaces the old data base files with the new ones. This delay can last anywhere from seconds to minutes depending on file system speed and data base size.
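The hand-over from new to active data bases can be sketched roughly as follows. This is an illustration in Python, not IntraSeek's actual code. The file names follow the pattern given above, but the function name activate_new_databases and its directory argument are hypothetical, and the sketch assumes that each data base is a single plain file.

```python
import os
import tempfile


def activate_new_databases(directory, profile_id):
    """Swap in freshly crawled data bases if the flag-file is present.

    Returns True if a swap was performed, False otherwise.
    Assumption: each data base is one plain file in `directory`.
    """
    flag = os.path.join(directory, f"{profile_id}.new.flag")
    if not os.path.exists(flag):
        return False  # no finished crawl is waiting

    for kind in ("pages", "index"):
        new = os.path.join(directory, f"n.{profile_id}.{kind}.yabu")
        active = os.path.join(directory, f"a.{profile_id}.{kind}.yabu")
        # os.replace overwrites an existing active file, if there is one
        os.replace(new, active)

    os.remove(flag)  # the flag-file is deleted once the swap is done
    return True


if __name__ == "__main__":
    # Small demonstration with empty placeholder files
    with tempfile.TemporaryDirectory() as d:
        for name in ("n.test.pages.yabu", "n.test.index.yabu", "test.new.flag"):
            open(os.path.join(d, name), "w").close()
        print(activate_new_databases(d, "test"))
        print(sorted(os.listdir(d)))
```

Checking for the flag-file first means a half-written data base is never picked up, since the flag only appears once the crawler has finished.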

Logs have been generated as well. Select the Logs tab in the configuration interface to see these. The main crawler log tells you how different crawlers have been started and stopped. This log can be cleared by selecting Delete the main log.

Standards
IntraSeek supports the Robots Exclusion Protocol, a method that allows Web site administrators to tell visiting robots which parts of their site should not be visited. When a robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. This file states which documents on the site the robot may or may not retrieve.
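As an illustration of how a robot interprets such a file, the sketch below uses the robots.txt parser from the Python standard library. It is not IntraSeek's own implementation. The sample file contents, the user agent name ExampleCrawler, and the helper may_fetch are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical contents of http://www.foobar.com/robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if `user_agent` is allowed to retrieve `url`."""
    parser = RobotFileParser()
    # parse() takes the lines of the file, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.foobar.com/index.html",
        "http://www.foobar.com/private/report.html",
    ):
        print(url, may_fetch(SAMPLE_ROBOTS_TXT, "ExampleCrawler", url))
```

With the sample file, everything below /private/ is off limits to all robots, while the rest of the site may be retrieved.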

We do however recommend the use of <meta name=robots> instead, as this is more flexible. To maintain the "/robots.txt" file, it is necessary to have root privileges, while any HTML writer can control the META method locally.
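A page author could check which robots directives a page carries with a sketch like the one below, written in Python for illustration. The class RobotsMetaParser and the function robots_directives are hypothetical names, and the values noindex and nofollow in the demonstration are the commonly used ones; which values IntraSeek itself honors is not stated here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name=robots content=...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # The content attribute is a comma-separated list of directives
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name=robots content="noindex, nofollow"></head></html>'
    print(sorted(robots_directives(page)))
```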

Frames & indexing
Frames are a big problem for all search engines. Most site maintainers optimize their web pages for the latest versions of Netscape and MS Internet Explorer and ignore the <noframes> option. Right now, no search engine on the market can follow frames and reconstruct the exact view (frame set). It is very difficult, if not impossible, to keep track of all dynamic frames and frame sets, and if JavaScript is used to generate links, it becomes even harder.

IntraSeek follows links found in frame sets. This is an acceptable solution, although not a good one.
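What following the links in a frame set amounts to can be sketched as below. This is a Python illustration only, not IntraSeek's crawler. The class FrameLinkParser, the function frame_links, and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameLinkParser(HTMLParser):
    """Collect the documents referenced by <frame src=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the frame set's URL
                self.links.append(urljoin(self.base_url, src))


def frame_links(html, base_url):
    """Return absolute URLs of all frames in a frame set document."""
    parser = FrameLinkParser(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    frameset = (
        '<frameset cols="20%,80%">'
        '<frame src="menu.html"><frame src="/main/index.html">'
        "</frameset>"
    )
    for link in frame_links(frameset, "http://www.foobar.com/docs/start.html"):
        print(link)
```

Each frame is then indexed as a page of its own, which is why the original frame set view cannot be reconstructed from a search hit.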