NAME
    Lingua::JA::TFWebIDF - TF*WebIDF calculator

SYNOPSIS
      use Lingua::JA::TFWebIDF;
      use utf8;
      use feature qw/say/;
      use Data::Printer;

      my $tfidf = Lingua::JA::TFWebIDF->new(
          pos1_filter => [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能 サ変接続/],
          ng_word     => [qw/編集 本人 自身 自分 たち さん 年 月 日/],
      );

      # Term => TF (Term Frequency), e.g. counted by another analyzer.
      my %tf = (
          '自然言語処理' => 9,
          '自然言語'     => 6,
          '自然言語理解' => 4,
          '処理'         => 5,
          '解析'         => 4,
      );

      # $text below stands for the Japanese text to analyze.
      p $tfidf->tfidf(\%tf)->dump;
      p $tfidf->tfidf($text)->dump;
      p $tfidf->tf($text)->dump;

      # Each element of the list is a single-pair hash: word => weight.
      for my $result (@{ $tfidf->tfidf($text)->list(20) })
      {
          my ($word, $weight) = each %{$result};
          say "$word: $weight";
      }

DESCRIPTION
    Lingua::JA::TFWebIDF calculates TF*WebIDF weights.

    Compared with Lingua::JA::TFIDF, this module has the following
    advantages:

    *   It supports Tokyo Cabinet and many options.

    *   The tfidf method accepts \%tf. (This makes it easy to use other
        morphological analyzers.)

METHODS
  new( %config || \%config )
    Creates a new Lingua::JA::TFWebIDF instance.

    The following defaults are used for any key you do not set in %config.

      KEY                 DEFAULT VALUE
      -----------         ---------------
      pos1_filter         [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能/]
      pos2_filter         []
      pos3_filter         []
      ng_word             []
      term_length_min     2
      term_length_max     30
      concat_max          30
      tf_min              1
      df_min              0
      df_max              250_0000_0000
      fetch_unk_word_df   0
      db_auto             1
      guess_df            1
      idf_type            1
      api                 'YahooPremium'
      appid               undef
      driver              'TokyoCabinet'
      df_file             './df.tch'
      fetch_df            0
      expires_in          365
      documents           250_0000_0000
      Furl_HTTP           undef
      verbose             0

    pos(1|2|3)_filter => \@pos
        Filters on '品詞細分類' (the part-of-speech subcategories).

    concat_max => $num
        The maximum number of term concatenations.

        If 2 is specified, 2 consecutive nouns are concatenated.

        Specifying either a large value or 0 is recommended.

    fetch_df => 0 || 1
        If df_file has the (Web)DF of a word and it has not expired yet,
        that value is used. Otherwise:

        1: fetches the DF of the word if it exists in the MeCab dictionary.

        0: if guess_df is 1, guesses the DF. If guess_df is 0, no TF*WebIDF
        weight is given.

    fetch_unk_word_df => 0 || 1
        An 'unk word' (unknown word) is a word that does not exist in the
        MeCab dictionary.

        If df_file has the (Web)DF of the unk word and it has not expired
        yet, that value is used. Otherwise:

        1: fetches the DF of the unk word if fetch_df is 1.

        0: if guess_df is 1, guesses the DF. If guess_df is 0, no TF*WebIDF
        weight is given.

    db_auto => 0 || 1
        If 1 is specified, the WebDF (Document Frequency) database is opened
        and closed automatically.

        This option takes effect when the tfidf method is called.

    guess_df => 0 || 1
        If df_file has the (Web)DF of a word and it has not expired yet,
        that value is used. Otherwise:

        1: guesses the DF based on the string length of $word.

        0: no TF*WebIDF weight is given.

    idf_type, api, appid, driver, df_file, expires_in, documents, Furl_HTTP, verbose
        See Lingua::JA::WebIDF.

  tfidf( $text || \%tf )
    Calculates TF*WebIDF weights.

    If a scalar is given, MeCab splits it into morphemes.

    To use another morphological analyzer, pass a hash reference that maps
    terms to their TF (Term Frequency).

  tf($text)
    Calculates TF (Term Frequency) via MeCab.

  idf, df, db_open, db_close, purge
    See Lingua::JA::WebIDF.

AUTHOR
    pawa

SEE ALSO
    Lingua::JA::TermExtractor

    Lingua::JA::WebIDF

    Lingua::JA::WebIDF::Driver::TokyoTyrant

LICENSE
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.
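
EXAMPLE
    The tfidf method accepts a hash reference of terms and their TF, so the
    output of another morphological analyzer can be used. The following is
    a minimal sketch of building such a hash. The token list is
    hypothetical and merely stands in for the output of another analyzer;
    the commented-out calls at the end use only the methods shown in the
    SYNOPSIS and assume the module and its DF database are available.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical token list, standing in for the output of another analyzer.
my @tokens = qw/自然言語処理 解析 自然言語処理 処理 解析 自然言語処理/;

# Count occurrences of each term: term => TF.
my %tf;
$tf{$_}++ for @tokens;

# Print terms by descending TF (ties broken by term for stable output).
for my $term (sort { $tf{$b} <=> $tf{$a} || $a cmp $b } keys %tf) {
    print "$term: $tf{$term}\n";
}

# With Lingua::JA::TFWebIDF installed, \%tf can then be passed to tfidf:
#   my $tfidf = Lingua::JA::TFWebIDF->new;
#   my $list  = $tfidf->tfidf(\%tf)->list(20);
```

    Counting in a plain hash keeps this step independent of any particular
    analyzer; only the resulting \%tf is handed to the module.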