======================================== README Mail::Classifier VERSION 0.11 ======================================== WHAT IS MAIL::CLASSIFIER? Mail::Classifier provides a class-based framework for writing filters to classify or sort e-mail. The class was written primarily to support creation of and experimentation with content-based, probabilistic mail classification approaches where a mail classifier is separately trained on several mail files before being applied to the classification of new e-mails. However, this class can be applied or adapted to any mail classification approach that differentiates between a "learning" phase and a "scoring" (prediction) phase. SUMMARY OF FEATURES * Supports single-message processing (training or scoring) or automatic scoring of all messages in a mailbox * Supports tagging messages (or all messages in a mailbox) with a customized header of classification results * Supports several standard mailbox formats (using Mail::Box) * Provides statistical cross-validation of classification accuracy * Allows users to define an external parsing function for messages * Supports saving/loading Mail::Classifier objects * Supports storing working data either in memory or as scratch files on disk with MLDBM::Sync * Includes Mail::Classifier::GrahamSpam, which implements Naive Bayesian filtering with several customizable parameters USAGE $mc = Mail::Classifier::GrahamSpam->new(); $mc->train( { 'spam.txt' => 'SPAM', 'notspam.txt' => 'NOTSPAM' } ); # Assuming $msg is a Mail::Message object... my ($bestcat, $prob, @other_scores) = $mc->score( $msg ); DETAILS Mail::Classifier is an abstract base class for mail classification. As such provides capabilities for defining working data tables (which may be stored in memory or on disk) that will persist across saves/restores. It also provides the message handling capabilities necessary to process mailboxes and conduct statistical validation. Classes inherit from Mail::Classifier to implement a particular classification algorithm or technique. Derived classes must implement methods for learning and scoring messages. Typically, derived classes will also define methods for parsing messages into tokens for use in the learning and scoring methods. Two derivied classes are included. The first, Mail::Classifier::Trivial is an example of how to extend the base class. The second class, Mail::Classifier::GrahamSpam, implements a Naive Bayesian Filtering based on the article "A Plan For Spam" by Paul Graham (http://www.paulgraham.com/spam.html), and is a fully-functional spam filter. See the RESULTS section, below. One of the key benefits of Mail::Classifier is built-in support for cross-validation. Cross-validation divides training data into "N" folds and iteratively scores each fold based on a model built on all remaining folds to maximize available data used in model evaluation. [See "An Introduction to the Bootstrap" by Efron and Tibshirani (1998), p. 239. for more details.] The result is an out-of-sample evaluation of the performance (i.e. accuracy) of the classification engine which can operate on smaller training sets without explicit hold-out samples for validation. Mail::Classifier is not an efficient approach to high-volume classification. (It's in Perl, not C.) However, it is ideal for rapid experimentation and testing of classification algorithms, and benefits from Perl Regexp capabilities for exploring alternative message tokenization routines. RESULTS It works pretty well. Version 0.10 of GrahamSpam achieved 95.9% spam identification accuracy, with only 0.46% false positives during a test run with no special tuning. This result used a four-fold cross-validation on a 19 MB mail archive (about 4800 messages) and a 19 MB spam archive (about 1800 messages. (Results will, of course, vary significantly based on the particular spam and non-spam training sets used.) The four-fold run (processing almost 27k messages) took only 14 minutes on a 1 GHz Duron processor. INSTALLATION To install this module type the following: perl Makefile.PL make make test make install DEPENDENCIES This module requires these other modules and libraries: MLDBM MLDBM::Sync File::Copy Mail::Box Mail::Address File::Temp File::Spec COPYRIGHT AND LICENCE Copyright (C) 2002 David Golden This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.