NAME DBIx::KwIndex - create and maintain keyword indices in DBI tables _________________________________________________________________ SYNOPSIS package MyKwIndex; use DBIx::KwIndex; sub document_sub { ... } package main; $kw = DBIx::KwIndex->new({dbh => $dbh, index_name => 'myindex'}) or die "can't create index"; $kw->add_document ([1,2,3,...]) or die $kw->{ERROR}; $kw->remove_document([1,2,3,...]) or die $kw->{ERROR}; $kw->update_document([1,2,3,...]) or die $kw->{ERROR}; $docs = $kw->search({ words=>'upset stomach' }); $docs = $kw->search({ words=>'upset stomach', boolean=>'AND' }); $docs = $kw->search({ words=>'upset stomach', start=>11, num=>10 }); $docs = $kw->search({ words=>'upset (bite|stomach)', re=>1 }); $kw->add_stop_word(['the','an','am','is','are']) or die $kw->{ERROR}; $words = $kw->common_word(85); $kw->remove_word(['gingko', 'bibola']) or die $kw->{ERROR}; $ndocs = $kw->document_count(); $nwords = $kw->word_count(); $kw->remove_index or die $kw->{ERROR}; $kw->empty_index or die $kw->{ERROR}; _________________________________________________________________ DESCRIPTION DBIx::KwIndex is a keyword indexer. It indexes documents and stores the index data in database tables. You can tell DBIx::KwIndex to index [lots] of documents and later on show you which ones contain a certain word. The typical application of DBIx::KwIndex is in a search engine. How to use this module: 1. Provide a database handle. use DBI; my $dbh = DBI->connect(...) or die $DBI::errstr; 2. Subclass DBIx::KwIndex and provide a `document_sub' method to retrieve documents referred by an integer id. The method should accept a list of document ids in an array reference and return the documents in an array reference. In this way, you can index any kind of documents that you want: text files, HTML files, BLOB columns, etc., as long as you provide the suitable document_sub() to retrieve the documents. The one thing to remember is that the documents must be referred by unique integer number. Below is a sample of a document_sub() that retrieves document from the 'content' field of a database table. package MyKwIndex; require DBIx::KwIndex; use base 'DBIx::KwIndex'; sub document_sub { my ($self, $ary_ref) = @_; my $dbh = $self->{dbh}; my $result = $dbh->selectall_arrayref( 'SELECT id,content FROM documents WHERE id IN ('. join(',',@$ary_ref). ')'); # if retrieval fails, you should return undef defined($result) or return undef; # now returns the content field in the order of the id's # requested. remember to return the documents exactly # in the order requested! my %tmp = map { $_->[0] => $_->[1] } @$result; return [ @tmp{ @$aref } ]; } 3. Create the indexer object. my $kw = MyKwIndex->new({ dbh => $dbh, index_name => 'article_index', # other options... }); dbh is the database handle. index_name is the name of the index, DBIx::KwIndex will create several tables which are all prefixed with the index_name. The default index_name is 'kwindex'. Other options include: max_word_length (default 32). 4. Index some documents. You can index one document at a time, e.g. $kw->add_document([1]) or die $kw->{ERROR}; $kw->add_document([2]) or die $kw->{ERROR}; or small batches of documents at a time: $kw->add_document([1..10]) or die $kw->{ERROR}; $kw->add_document([11..20]) or die $kw->{ERROR}; or large batches of documents at a time: $kw->add_document([1..300]) or die $kw->{ERROR}; $kw->add_document([301..600]) or die $kw->{ERROR}; Which one to choose is a matter of memory-speed trade-off. Larger batches will increase the speed of indexing, but with increased memory usage. Note: DBIx::KwIndex ignores single-character words, numbers, and words longer than 'max_word_length'. 5. If you want to search the index, use the search() method. $docs = $kw->search({ words => 'upset stomach' }); die "can't search" if !defined($docs); The search() method will return an ARRAY ref containing the document ids that matches the criteria. Other parameter include: num => maximum number of results to retrieve; start => starting position (1 = from the beginning); boolean => 'AND' or 'OR' (default is 'OR'); re => use regular expression, 1 or 0. Note: num and start uses the LIMIT clause (which is quite unique to MySQL). re uses the REGEXP clause. Do not use these options if your database server does not support them. Also note: Searching is entirely done from the index. No documents will be retrieved while searching. A simple 'relevancy' ranking is used. Search is case-insensitive and there is no phrase-search support yet. Some examples: # retrieve only the 11th-20th result. $docs = $kw->search({ words=>'upset stomach', start=>11, num=>10 }); die "can't search" if !defined($docs); # find documents which contains all the words. $docs = $kw->search({ words=>['upset stomach'], boolean=>'AND' }); die "can't search" if !defined($docs); 6. Now suppose some documents change, and you need to update the index to reflect that. Just use the methods below. # if you want to remove documents from index $kw->remove_document([90..100]) or die $kw->{ERROR}; # if you want to update the index $kw->update_document([90..100]) or die $kw->{ERROR}; _________________________________________________________________ SOME UTILITY METHODS If you want to exclude some words (usually very common words, or ``stop words'') from being indexed, do this before you index any document: $kw->add_stop_word(['the','an','am','is','are']) or die "can't add stop words"; Adding stop words is a good thing to do, as stop words are not very useful for your index. They occur in a large proportion of documents (they do not help searches differentiate documents) and they increase the size your index (slowing the searches). But which words are common in your collection? you can use the common_word method: $words = $kw->common_word(85); This will return an array reference containing all the words that occur in at least 85% of all documents (default is 80%). If you want to delete some words from the index: $kw->remove_word(['common','cold']); or die "can't remove words"; To get some statistics about your index: # the number of documents $ndocs = $kw->document_count(); # the number of words $nwords = $kw->word_count(); Last, if you got bored with the index and want to delete it: $kw->remove_index or die $kw->{ERROR}; This will delete the database tables. Or, if you just want to empty the index and start all over: $kw->empty_index or die $kw->{ERROR}; _________________________________________________________________ AUTHOR Steven Haryanto _________________________________________________________________ COPYRIGHT Copyright (c) 1995-1999 Steven Haryanto. All rights reserved. You may distribute under the terms of either the GNU General Public License or the Artistic License, as specified in the Perl README file. _________________________________________________________________ BUGS/CAVEATS/TODOS Test the module under other database server (besides MySQL). Use a more correct search sorting (the current one is kinda bogus :). Probably implement phrase-searching (but this will require a larger vectorlist). Probably, maybe, implement English/Indonesian stemming. Any safer, non database-specific way to test existence of tables other than $dbh->tables? _________________________________________________________________ NOTES At least two other Perl extensions exist for creating keyword indices and storing them in a database: DBIx::TextIndex and MyConText. As of this writing, only DBIx::TextIndex features phrase-searching and boolean NOT; and only DBIx::KwIndex offers feature to delete documents from index (but please see the updated version and documentation for details). I personally find DBIx::KwIndex more convenient when I need to index documents that change often, because one can add/remove some documents without rebuilding the entire index. Advices/comments/patches welcome. _________________________________________________________________ HISTORY 0001xx=first draft,satunet.com. 000320=words->scalar. 000412=0.01/documentation/cpan.