Indexing

Select the Crawlers tab at the top of the page to enter the Crawler Control page. From this page you can start and stop crawlers and view their current status. First on the page is a list of crawlers ready to start; below it is the list of active crawlers.

  • At this point your test profile should be in the list of ready crawlers. Select Launch Now to start a crawler.

  • Now your profile should appear in the list of Active Crawlers with the status starting up. Click Reload in your browser to update this information.

  • After a while (click Reload a few times, every few seconds) the status of your crawler will change to Running! and a few more options will appear.

    If this does not happen within a few seconds, something has probably gone wrong. The most likely cause is that Pike failed to start. Check:

    • That you have specified the correct Engine home location path in the IntraSeek Module Configuration.

    • Challenger's debug log for any hints about what went wrong.

    If you find the error, click Zap! to return the crawler to the Ready to launch list.

  • If all went well, the crawler is now running. You can select either View status to see a summary of the progress, or View log to see a more verbose log.

    • If you select Zap!, the crawler will be killed by force, and the entire data gathering aborted. The crawler will then be Ready to launch again, and none of the information gathered will be used. Any temporary files will also be deleted.

    • If you select Halt!, the crawler will instead be stopped gracefully. It will save all its information to the databases, and also write a freeze-file, ID.scheduler_freeze.is, so that the data gathering can be resumed later. The status box will show Halting... (saving), and then Halted once everything has been saved.

Afterwards
When the crawler has finished running and indexed all the available web pages, two databases named n.ID.pages.yabu and n.ID.index.yabu, and a flag-file ID.new.flag, will have been created, where ID is the name of the profile.

When the search engine finds the flag-file, it renames these to a.ID.index.yabu and a.ID.pages.yabu, deletes the flag-file, and starts using the new databases.

More information on these files can be found in the Storage of Data Bases chapter of the technical documentation.

You will notice a slight pause the first time you use the new database. This is because the IntraSeek module replaces the old database files with the new ones; the delay can last anywhere from seconds to minutes, depending on file system speed and database size.

Logs have been generated as well. Select the Logs tab in the configuration interface to see these. The main crawler log tells you how different crawlers have been started and stopped. This log can be cleared by selecting Delete the main log.

Standards
IntraSeek supports the Robots Exclusion Protocol, a method that allows Web site administrators to indicate to visiting robots which parts of their site should not be visited. When a robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. This file states which documents the crawler may or may not retrieve.
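The robots.txt check can be demonstrated with Python's standard library (IntraSeek itself is written in Pike; this is purely an illustration of the protocol, and the robots.txt content is made up):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt as it might appear at http://www.foobar.com/robots.txt:
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Pages outside the disallowed area may be crawled, pages inside may not.
print(parser.can_fetch("IntraSeek", "http://www.foobar.com/index.html"))      # True
print(parser.can_fetch("IntraSeek", "http://www.foobar.com/private/a.html"))  # False
```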

We do, however, recommend the use of <meta name="robots"> instead, as this is more flexible. Maintaining the "/robots.txt" file requires root privileges, while any HTML author can control the META method locally.
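A crawler reads the META method from the page itself. A minimal sketch of extracting the robots directives with Python's standard HTML parser (an illustration only, not IntraSeek's own code; the class name is hypothetical):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip() for d in a.get("content", "").split(",")]

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
p = RobotsMetaParser()
p.feed(page)
print(p.directives)  # ['noindex', 'nofollow']
```

A crawler honouring these directives would skip indexing the page (noindex) and not follow its links (nofollow), with no need to touch the server's /robots.txt.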

Frames & indexing
Frames are a big problem for all search engines. Most site maintainers optimize their web pages for the latest versions of Netscape and MS Internet Explorer, ignoring the <noframes> option. Currently, no search engine on the market can follow frames and reconstruct the exact view (frame set). It is very difficult, if not impossible, to keep track of all dynamic frames and frame sets, and if JavaScript is used to generate links, it becomes even harder.

IntraSeek follows links found in frame sets. This is an acceptable solution, although not a good one.
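Following links in frame sets amounts to treating each frame's src attribute as an ordinary link. A hedged sketch with the Python standard library (again an illustration, not IntraSeek's implementation; the example page and class name are made up):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class FrameLinkParser(HTMLParser):
    """Collects the src attributes of <frame> and <iframe> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative frame sources against the page URL.
                self.links.append(urljoin(self.base_url, src))

page = ('<frameset cols="20%,80%">'
        '<frame src="menu.html"><frame src="main.html">'
        '</frameset>')
p = FrameLinkParser("http://www.foobar.com/")
p.feed(page)
print(p.links)  # ['http://www.foobar.com/menu.html', 'http://www.foobar.com/main.html']
```

The crawler then queues these URLs like any other discovered links, which is why the individual frame documents get indexed even though the combined frame-set view cannot be reconstructed.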