NAME
    Text::SpeedyFx - tokenize/hash large amounts of strings efficiently

VERSION
    version 0.005

SYNOPSIS
        use Data::Dumper;
        use Text::SpeedyFx;

        my $sfx = Text::SpeedyFx->new;

        my $words_bag = $sfx->hash('To be or not to be?');
        print Dumper $words_bag;
        #$VAR1 = {
        #          '1422534433' => '1',
        #          '4120516737' => '2',
        #          '1439817409' => '2',
        #          '3087870273' => '1'
        #        };

        my $feature_vector = $sfx->hash_fv("thats the question", 8);
        print unpack('b*', $feature_vector);
        # 01001000

DESCRIPTION
    XS implementation of a very fast combined parser/hasher which works
    well on a variety of *bag-of-words* problems.

    The original implementation is in Java and was adapted for better
    Unicode compliance.

METHODS
  new([$seed])
    Initializes the parser/hasher, optionally using the specified $seed
    (default: 1).

  hash($string)
    Parses $string and returns a hash reference where the keys are the
    hashed tokens and the values are their respective counts.

    Note that this is the slowest form due to the (computational)
    complexity of the Perl hash structure itself: "hash_fv()" is 147%
    faster, while "hash_min()" is 175% faster.

  hash_fv($string, $n)
    Parses $string and returns a feature vector (a string of bits) with
    length $n. $n is supposed to be a multiple of 8, as the length of the
    resulting feature vector in bytes is "ceil($n / 8)". The feature
    vector format can be useful in a Bloom filter implementation, for
    instance.

  hash_min($string)
    Parses $string and returns the token hash with the lowest value.
    Useful in a MinHash implementation. See also the included minhash_cmp
    utility.

REFERENCES
    *   Extremely Fast Text Feature Extraction for Classification and
        Indexing by George Forman and Evan Kirshenbaum

    *   MinHash: detecting similar sets (in Russian)

    *   Bloom filter (in Russian)

AUTHOR
    Stanislaw Pusep

COPYRIGHT AND LICENSE
    This software is copyright (c) 2012 by Stanislaw Pusep.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.
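As a supplementary illustration of the feature vector format described under hash_fv(), the sketch below shows how such a bit string could be inspected with core Perl only. It does not call Text::SpeedyFx: $fv is a stand-in rebuilt from the "01001000" output shown in the SYNOPSIS, and the helpers fv_bytes, fv_popcount and fv_contains are hypothetical names, not part of the module. The Bloom-filter-style containment test is one possible use, assuming both vectors were produced with the same $n.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Stand-in for a hash_fv() result: pack('b*', ...) rebuilds the 8-bit
# vector whose unpack('b*', ...) form is "01001000" (see SYNOPSIS).
my $fv = pack('b*', '01001000');

# Byte length of an $n-bit vector, as stated in the hash_fv() description.
sub fv_bytes {
    my ($n) = @_;
    return ceil($n / 8);
}

# Number of set bits; the '%32' prefix makes unpack return a 32-bit
# checksum of the unpacked bits, which is the population count.
sub fv_popcount {
    my ($vector) = @_;
    return unpack('%32b*', $vector);
}

# Bloom-filter-style test (hypothetical helper): true if every bit set in
# $query is also set in $filter. Relies on Perl's classic string-wise '&',
# so the 'bitwise' feature must not be enabled in this scope.
sub fv_contains {
    my ($filter, $query) = @_;
    return ( ($filter & $query) eq $query ) ? 1 : 0;
}

# vec() uses the same bit order as unpack('b*', ...), so bit 1 is the
# second character of the printed bit string.
printf "%d byte(s), %d bit(s) set, bit 1 is %d\n",
    fv_bytes(8), fv_popcount($fv), vec($fv, 1, 1);
```

In real use, $fv would come from $sfx->hash_fv($string, $n), and a query vector built with the same $n could be passed to fv_contains to check whether all of its bits are present.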