NAME
    AI::Categorize - Automatically categorize documents based on content

SYNOPSIS
      ### This is one of the categorizers available (see below for more)
      use AI::Categorize::NaiveBayes;
      my $c = AI::Categorize::NaiveBayes->new();
  
      ### Supply some training documents so it can learn how to categorize
      $c->stopwords('the','a','and','but','I');  # Ignore these words
      $c->add_document($name, \@categories, $content);
      # ... repeat for many documents, then:
      $c->crunch();
  
      $c->save_state('filename'); # Save machine for later use
  
      ### Later, perhaps in another program: categorize a new unknown document
      my $c = AI::Categorize::NaiveBayes->new();
      $c->restore_state('filename');
      my $results = $c->categorize($content);
      if ($results->in_category('sports')) { ... }
      my @cats = $results->categories;
      my @scores = $results->scores(@cats);

DESCRIPTION
    This module implements several algorithms for automatically guessing
    category information of documents based on the category information of
    existing documents. For example, one might categorize incoming email
    messages in order to place them into existing mailboxes, or one might
    categorize newspaper articles by general topic (business, sports, etc.).
    All of the categorizers learn their categorization rules from a body of
    existing pre-categorized documents.

    Disclaimer: the results of any of these algorithms are far from
    infallible (close to fallible?). Categorization of documents is often a
    difficult task even for humans well-trained in the particular domain of
    knowledge, and there are many things a human would consider that none of
    these algorithms consider. These are only statistical tests - at best
    they are neat tricks or helpful assistants, and at worst they are
    totally unreliable. If you plan to use this module for anything
    important, human supervision is essential.

    But this voodoo can be quite fun. =)

ALGORITHMS
    Currently two different algorithms are implemented in this bundle:

      AI::Categorize::NaiveBayes
      AI::Categorize::kNN

    Both are subclasses of `AI::Categorize'. Please see the
    documentation of these individual modules for more details on their guts
    and quirks. The common interface for all the algorithms is described
    here.

    All these classes are designed to be subclassable so you can modify
    their behavior to suit your needs.

AI::Categorize Methods
    * new()
        Creates a new categorizer object (hereafter referred to as `$c').
        The arguments to `new()' will depend on which subclass of
        `AI::Categorize' you happen to be using. See the subclasses'
        individual documentation for more info.

    * $c->stopwords()
    * $c->stopwords(@words)
        Gets (and optionally sets) the list of stopwords. Stopwords are
        words that should be ignored by the categorizer, and typically they
        are the most common non-informative words in the documents. The most
        common reason to use stopwords is to reduce processing time.

        The stopword list should be set before processing any documents.

    * $c->stopwords_hash()
        Returns the stopwords as the keys of a hash reference. The
        corresponding values are all 1. Can be useful for quick checking of
        whether a word is a stopword.
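
        A minimal sketch of that kind of lookup, using a plain hash built
        by hand rather than the module itself. The names `%is_stopword'
        and `@kept' are made up for illustration; the hash simply has the
        shape described above (words as keys, all values 1).

        ```perl
        use strict;
        use warnings;

        # Hand-built stand-in for the hash described above: each stopword
        # is a key and every value is 1.
        my @stopwords   = qw(the a and but I);
        my %is_stopword = map { $_ => 1 } @stopwords;

        # A hash lookup is a constant-time test, so filtering a word list
        # does not need to scan the stopword list for every word.
        my @kept = grep { !$is_stopword{$_} } qw(I saw the game and a replay);

        print "@kept\n";   # prints "saw game replay"
        ```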

    * $c->add_stopword($word)
        Adds a single entry to the stopword list.

    * $c->add_document($name, $categories, $content)
        Adds a new training document to the database. `$name' should be a
        unique string identifying this document. `$categories' may be either
        the name of a single category to which this document belongs, or a
        reference to an array containing the names of several categories.
        `$content' is the text content of the document.

        To ease syntax, in the future `$content' may be allowed to be given
        as a path to the document, which will be opened and parsed.

    * $c->crunch()
        After all documents have been added, call `crunch()' so that the
        categorizer can compute some statistics on the training data and get
        ready to categorize new documents.

    * $c->categorize($content)
        Processes the text in `$content' and returns an object blessed into
        the `AI::Categorize::Result' class (hereafter abbreviated as `$r').

        To ease memory requirements, in the future `$content' may be allowed
        to be passed as a filehandle.

    * $c->save_state($filename)
        At any time you may save the state of the categorizer to a file, so
        that you can reload it later using the `restore_state()' method.

    * $c->restore_state($filename)
        Reads in the categorizer data from `$filename', which should have
        previously been saved using the `save_state()' method.

    * $c->F1(\@assigned_categories, \@correct_categories)
        This method computes the F1 measure, which is helpful for evaluating
        how well the categorizer did when it assigned categories. The F1
        measure is defined to be 2 times the number of correctly assigned
        categories divided by the sum of the number of assigned categories
        and correct categories.

        In other words, if A is the set of categories that were assigned by
        the system, C is the set of categories that should have been
        assigned by the system, and I is the intersection of A and C, then,
        writing |X| for the number of categories in the set X,

                   2*|I|
            F1 = ---------
                 |A| + |C|

        (Other sources may define F1 as
        `2*recall*precision/(recall+precision)', which is equivalent to the
        above formula wherever it is defined, but which divides by zero if
        either A or C is empty or if the two sets have no categories in
        common.)

        A perfect job categorizing (all correct categories were assigned and
        no extras were assigned) will have an F1 score of 1. A terrible job
        categorizing (no overlap between correct & assigned categories) will
        have an F1 score of 0. Medium jobs will be somewhere in between.
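
        The formula can be sketched as a stand-alone function. This is an
        illustration only, not the module's own `F1()' code: the name
        `f1_score' is hypothetical, and returning undef when both lists
        are empty is an assumption, since the formula is undefined there.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: takes two array references of category
        # names (assigned, correct) and returns 2*|I| / (|A| + |C|).
        sub f1_score {
            my ($assigned, $correct) = @_;

            # Build sets so that duplicate names are counted only once.
            my %assigned = map { $_ => 1 } @$assigned;
            my %correct  = map { $_ => 1 } @$correct;

            # |I|: assigned categories that are also correct.
            my $overlap = grep { $correct{$_} } keys %assigned;

            # |A| + |C|
            my $total = keys(%assigned) + keys(%correct);

            # Assumption: report undef rather than divide by zero.
            return undef unless $total;
            return 2 * $overlap / $total;
        }

        # One of two assigned categories is correct: 2*1 / (2+1)
        printf "%.3f\n", f1_score(['sports', 'business'], ['sports']);  # 0.667
        ```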

    * $c->extract_words($text)
        Returns a reference to a hash whose keys are the words contained in
        `$text' and whose values are the number of times each word appears.
        Stopwords are omitted and words are put into canonical form
        (lower-cased, leading & trailing non-word characters stripped).

        Don't call this method directly, as it is used internally by the
        various categorization modules. However, you may be interested in
        subclassing one of the modules and overriding `extract_words()' to
        behave differently. For instance, you may want to "lemmatize" your
        words to remove affixes so that "abominable", "abominableness",
        "abominably", "abominate", "abomination", and "abominator" all share
        a single entry in the categorizer.
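
        The behavior described above, plus a hook for the kind of
        lemmatizing an override might add, can be sketched without the
        module. Everything here is an assumption made for illustration:
        `extract_words_sketch' and `crude_stem' are hypothetical names,
        splitting on whitespace is a guess at how words are found, and
        the suffix list is a toy, not a real lemmatizer.

        ```perl
        use strict;
        use warnings;

        # Toy stemmer: strips a few English suffixes, but only when at
        # least three word characters remain in front of the suffix.
        sub crude_stem {
            my ($word) = @_;
            $word =~ s/(?<=\w{3})(?:ableness|ably|able|ation|ator|ate)$//;
            return $word;
        }

        # Counts the words in $text. $stopwords is a hash reference with
        # stopwords as keys; $normalize is an optional code reference
        # applied to each word before it is counted.
        sub extract_words_sketch {
            my ($text, $stopwords, $normalize) = @_;
            my %counts;
            for my $word (split ' ', $text) {
                $word = lc $word;             # canonical form: lower case
                $word =~ s/^\W+|\W+$//g;      # strip non-word chars at both ends
                next if $word eq '' or $stopwords->{$word};
                $word = $normalize->($word) if $normalize;
                $counts{$word}++;
            }
            return \%counts;
        }

        # Stopwords are lower-cased here so they match the lower-cased words.
        my %stop = map { lc($_) => 1 } qw(the a and but I);
        my $text = 'The abomination, and the Abominator!';

        my $plain   = extract_words_sketch($text, \%stop);
        # $plain is { abomination => 1, abominator => 1 }

        my $stemmed = extract_words_sketch($text, \%stop, \&crude_stem);
        # $stemmed is { abomin => 2 }
        ```

        In a subclass, the same idea would live inside an overridden
        `extract_words()' rather than in a separate callback.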

AI::Categorize::Result Methods
    An `AI::Categorize::Result' object is returned by the `$c->categorize'
    method, described above.

    * $r->in_category($category)
        Returns true or false depending on whether the document was placed
        in the given category.

    * $r->categories()
        Returns an ordered list of the categories the document was placed
        in, with best matches first.

    * $r->scores(@categories)
        Returns a list of result scores for the given categories. Since the
        interface is still changing, not very much can officially be said
        about the scores, except that a good score is higher than a bad
        score. This may change to something like a probability scale, with
        all numbers between 0 and 1, and a threshold for membership
        somewhere in between.

        Please consider the scoring feature somewhat unstable for now.

CAVEATS
    Don't depend on the specific scores given by `$r->scores'. They may
    change in future releases.

    The entire categorizer is currently created in memory, which can get
    pretty demanding if you have a lot of data. If this turns out to be a
    problem, future versions may try to cache large chunks on disk. This
    would come with a speed penalty.

    Finally, I am not an expert in document categorization. I have thought
    about it some, and I have written these modules largely as a way to
    concretize my thinking and learn more about the processes. If you know
    of ways to improve accuracy, please let me know.

AUTHOR
    Ken Williams, ken@forum.swarthmore.edu

COPYRIGHT
    Copyright 2000-2001 Ken Williams. All rights reserved.

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

SEE ALSO
    perl(1), DBI(3).

    "A re-examination of text categorization methods" by Yiming Yang,
    http://www.cs.cmu.edu/~yiming/publications.html