Text::Ngram version 0.04
==========================

Basis for n-gram analysis

SYNOPSIS
      use Text::Ngram qw(ngram_counts add_to_counts);
      my $text   = "abcdefghijklmnop";
      my $hash_r = ngram_counts($text, 3); # Window size = 3
      # $hash_r => { abc => 1, bcd => 1, ... }

      add_to_counts($more_text, 3, $hash_r);

DESCRIPTION
    n-Gram analysis is a field in textual analysis which uses sliding
    window character sequences in order to aid topic analysis, language
    determination and so on. The n-gram spectrum of a document can be used
    to compare and filter documents in multiple languages, prepare word
    prediction networks, and perform spelling correction.

    The neat thing about n-grams, though, is that they're really easy to
    determine. For n=3, for instance, we compute the n-gram counts like so:

        the cat sat on the mat
        ---                     $counts{"the"}++;
         ---                    $counts{"he "}++;
          ---                   $counts{"e c"}++;
          ...

    This module provides an efficient XS-based implementation of n-gram
    spectrum analysis.

    There are two functions which can be imported:

      $href = ngram_counts($text[, $window]);

    This first function returns a hash reference with the n-gram histogram
    of the text for the given window size. If the window size is omitted,
    then 5-grams are used. This seems relatively standard.

      add_to_counts($more_text, $window, $href)

    This incrementally adds to the supplied hash; if $window is zero or
    undefined, then the window size is computed from the hash keys.

  Important note on text preparation
    Most of the published algorithms for textual n-gram analysis assume that
    the only characters you're interested in are alphabetic characters and
    spaces. So before the text is counted, the following preparation is
    made.

    *   All characters are lowercased (most papers use upper-casing, but
        that just feels so 1970s);

    *   punctuation and numerals are replaced by stop characters flanked by
        blanks;

    *   multiple spaces are compressed into a single space.

    After the counts are made, n-grams containing stop characters are
    dropped from the hash.

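    The preparation and counting steps above can be illustrated with a
    short pure-Perl sketch. This is not the module's XS implementation:
    the function name "sketch_ngram_counts", the choice of "\xff" as the
    stop character, and the restriction to ASCII letters are all
    assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Text::Ngram: a pure-Perl sketch of the
# preparation and counting steps described in this README.
my $STOP = "\xff";    # stop character picked for this sketch only

sub sketch_ngram_counts {
    my ($text, $window) = @_;
    $window = 5 unless defined $window;    # README default: 5-grams

    my $buf = lc $text;                    # lowercase all characters
    # Assumption: only ASCII letters count as alphabetic. Everything else
    # that is not whitespace becomes a stop character flanked by blanks.
    $buf =~ s/[^a-z\s]+/ $STOP /g;
    $buf =~ s/\s+/ /g;                     # compress runs of whitespace

    # Slide a window of $window characters over the prepared text.
    my %counts;
    for my $i (0 .. length($buf) - $window) {
        $counts{ substr($buf, $i, $window) }++;
    }

    # Drop every n-gram that contains the stop character.
    delete @counts{ grep { index($_, $STOP) >= 0 } keys %counts };

    return \%counts;
}
```

    For the text "the cat sat on the mat" and a window of 3, this sketch
    counts "the", "he " and "at " twice each and every other trigram once.
    The real module may differ in details such as the stop character and
    the handling of non-ASCII letters.
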
    If you prefer to do your own text preparation, use the internal routines
    "process_text" and "process_text_incrementally" instead of
    "ngram_counts" and "add_to_counts" respectively.

SEE ALSO
    Cavnar, W. B. (1993). N-gram-based text filtering for TREC-2. In D.
    Harman (Ed.), *Proceedings of TREC-2: Text Retrieval Conference 2*.
    Washington, DC: National Bureau of Standards.

    Shannon, C. E. (1951). Prediction and entropy of printed English. *The
    Bell System Technical Journal, 30*. 50-64.

    Ullmann, J. R. (1977). Binary n-gram technique for automatic correction
    of substitution, deletion, insert and reversal errors in words.
    *Computer Journal, 20*. 141-147.

AUTHOR
    Maintained by Jose Castro. Originally created by Simon Cozens.

COPYRIGHT AND LICENSE
    Copyright 2004 by Jose Castro

    Copyright 2003 by Simon Cozens

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.