NAME
    Seq - An inverted text index.

ABSTRACT
    THIS IS ALPHA SOFTWARE

    Seq is a text indexer and search utility written in Perl and C. It has
    several special features not to be found in most available similar
    programs, namely arbitrarily complex sequence and alternation queries,
    and proximity searches with both exact counts and limits. There is no
    result ranking (yet).

    The index format draws some ideas from the Lucene search engine, with
    some simplifications and enhancements. The index segments are stored in
    a CDB disk hash (from D. J. Bernstein), but support for a standard SQL
    database backend is coming.

SYNOPSIS
    Index documents:

        # cat textcorpus.txt | tokenize | indexstream index_dir
        # cat textcorpus.txt | tokenize | stopstop | indexstream index_dir
        # optimize index_dir

    Search:

        # seqsearch index_dir
        # (type search terms)

PROGRAMMING API
        use Seq;

        # open for indexing
        $index = Seq->open_write( "indexname" );
        $index->index_document( "docname", $string );
        $index->close_index();

        # open for searching
        $index = Seq->open_read( "indexname" );

        # Find all docs containing a phrase
        $result = $index->search( "this phrase and no other phrase" );

        # result is a list:
        # [ [ docid1, match1, match2 ... matchN ],
        #   [ docid2, match1, ... ] ]
        #
        # ... where 'match' is the token location of each match within that doc.

        # List information about the result set
        $id = $index->query( "this phrase and no other phrase" );
        $result = $index->set_info( $id );

        # $result is a hashref
        # { ndocs => N, nmatches => M }

SEARCH SYNTAX
    Sequences of words are enclosed in angle brackets '<' and '>'.
    Alternations are enclosed in square brackets '[' and ']'. These may be
    nested within each other as long as it makes logical sense. "" is a
    simple valid phrase. Nested square brackets don't make sense, logically,
    so they aren't allowed. Also not allowed are adjacent angle bracket
    sequences. However, alternations may be adjacent, as in "". As long as
    these rules are followed, search terms may be arbitrarily complex.
    For example:

        "] with all our [hearts strength]>] on olympus [hercules apollo athena ] [was were] [praised ]>"

    Two operators are available to do proximity searches. '#wN' allows *at
    most* N skips between words (the number of words in between, plus 1).
    Thus "<The #w8 dog>" would match the text "The quick brown fox jumped
    over the lazy dog". If #w7 or less had been used it would not match, but
    if #w9 or greater had been used it would still match.

    There is also the '#tN' operator, which requires *exactly* N skips. Thus
    for the above example "<The #t8 dog>", and no other value of N, would
    match.

    These operators can be used after words or alternations, but nowhere
    else.

AUTHOR
    Ira Joseph Woodhead, ira at sweetpota dot to

SEE ALSO
    "Lucene", "Plucene", "CDB_File", "Inline"
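
EXAMPLES
    The counting rule behind the proximity operators in SEARCH SYNTAX can
    be sketched in a few lines of Perl. This is only an illustration and
    not how Seq evaluates queries internally. The helpers skips(),
    matches_w() and matches_t() are hypothetical and not part of the Seq
    API. The sketch assumes that tokens are split on whitespace, that a
    "skip" is the difference between two token positions, and that the
    second word must come after the first.

```perl
use strict;
use warnings;

# Hypothetical helpers, NOT part of Seq. They only model the counting rule.

# Number of skips from one token position to a later one
# (the number of words in between, plus 1).
sub skips {
    my ($from, $to) = @_;
    return $to - $from;
}

# '#wN' as a limit: the second word follows within N skips.
# Assumption: the second word must come after the first.
sub matches_w {
    my ($from, $to, $n) = @_;
    my $d = skips($from, $to);
    return ($d >= 1 && $d <= $n) ? 1 : 0;
}

# '#tN' as an exact count: the second word follows after exactly N skips.
sub matches_t {
    my ($from, $to, $n) = @_;
    return skips($from, $to) == $n ? 1 : 0;
}

# Assumption: plain whitespace tokenization, positions counted from 0.
my @tokens = split ' ', "The quick brown fox jumped over the lazy dog";
my ($the, $dog) = (0, $#tokens);    # "The" at 0, "dog" at 8

for my $n (7, 8, 9) {
    printf "#w%d: %-8s  #t%d: %s\n",
        $n, (matches_w($the, $dog, $n) ? "match" : "no match"),
        $n, (matches_t($the, $dog, $n) ? "match" : "no match");
}
```

    Under these assumptions only N = 8 satisfies the exact form, while the
    limit form matches for 8 and anything larger.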