NAME
    "Text::Categorize::Textrank" - Computes textrank of text.

SYNOPSIS
      use Text::Categorize::Textrank;
      use Data::Dump qw(dump);
      my $textranker = Text::Categorize::Textrank->new;

DESCRIPTION
    "Text::Categorize::Textrank" uses the modules Text::StemTagPOS and
    Graph::Centrality::Pagerank to compute the textrank of the words in a
    given text.

    Encoding of all text should be in Perl's internal format; see the module
    Text::Iconv for converting text from various encodings.

CONSTRUCTOR
  "new"
    The method "new" creates an instance of the "Text::Categorize::Textrank"
    class with the following parameters:

    "isoLangCode"
         isoLangCode => 'en'

        "isoLangCode" is the ISO language code of the language that will be
        tagged and stemmed by the object. It must be 'en', which is also the
        default. In the future 'de' may be added.

    "endingSentenceTag"
         endingSentenceTag => 'PP'

        "endingSentenceTag" is the part-of-speech tag that should be used to
        indicate the end of a sentence. The default is 'PP'. The value of
        this tag must be a tag generated by the module Lingua::EN::Tagger.

    "listOfPOSTypesToKeep"
         listOfPOSTypesToKeep => [...]

        The Textrank method preprocesses the text so that only specific
        parts of speech are retained and used to build the graph
        representing the text. The module Lingua::EN::Tagger is used to tag
        the parts of speech of the text. The parts of speech to retain can
        be specified by word type, where the types are any combination of
        'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS',
        'PUNCTUATION', 'TEXTRANK_WORDS', or 'VERBS'. The default is
        "[qw(TEXTRANK_WORDS)]", which equates to "[qw(ADJECTIVES NOUNS)]".

    "listOfPOSTagsToKeep"
         listOfPOSTagsToKeep => [...]

        "listOfPOSTagsToKeep" provides finer control over the parts of
        speech to be retained when filtering previously tagged text. For a
        list of all the possible tags call "getListOfPartOfSpeechTags".
"stdevFactor" stdevFactor => 1 If "avgPageRank" is the average pagerank score and "stdevPageRank" is their standard deviation, then a words page rank must be greater than or equal to the "avgPageRank + stdevFactor * stdevPageRank" to be selected as keyword. The default is one. "computeStatsIgnoringRepeatedWords" computeStatsIgnoringRepeatedWords => 0 When computing the average and standard deviation of the pageranks of the words this can be done by treating the text as a sequence of words or a set of words; in the latter case repeating words are ignored. For example, suppose the only words in the text were "very" and "good" with pageranks of .3 and .7 respectively. If the original text is "Very very good.", then the average pagerank of the text is "(.3 + .3 + .7) / 3". If the pagerank is computed ignoring repeated words it is "(.3 + .7) / 2". "getTopKeyPhrases" getTopKeyPhrases => 10 Instead of using "stdevFactor" to chose the top key phrases to extract from the text, "getTopKeyPhrases" can be used to return the top scoring words or phrases. This is the default method, with the top 10 words or phrases being returned. Note, if the text has ten words or less, the entire text will be returned as one phrase. METHODS "getTextrankScores" getTextrankScores (...) The method "getTextrankScores" returns a hash-reference of all the stemmed words used in computing the textrank. The value of each hash is the textrank of the word. The sum of all the textranks is one. "listOfText" listOfText => [...] "listOfText" is the array reference of the strings of text that is to be categorized. "tokenEdgeSpan" tokenEdgeSpan => 3 When building the textrank graph, "tokenEdgeSpan" tokenEdgeSpan holds the number of successive words linked by an edge. For example, given the tokens "apple orange pear", the edges "[apple, orange]" and "[apple, pear]" will be created for the token "apple" if $tokenEdgeSpan is three. The default is two. Need to change this! 
"undirectedGraph" directedGraph => 0 Computing the textrank of the works involves building a graph from the words. If "directedGraph" evaluates to true, then the graph will be directed. The default is false, that is, the graph is undirected. "pageRankDampeningFactor" pageRankDampeningFactor => 0.85 When computing the pagerank of the textrank graph, the dampening factor specified by "pageRankDampeningFactor" will be used; it should range from zero to one. The default is 0.85. "useEdgeWeights" useEdgeWeights => 1 If "useEdgeWeights" is true, then when computing the pagerank of the textrank graph the edge weights will be included in the calculation. The default is true. "addEdgesSpanningSentences" addEdgesSpanningSentences => 1 If "addEdgesSpanningSentences" is true, then when build the textrank graph, links between the words at the end of sentence and the beginning of the next sentence will be made. The default is true. INSTALLATION To install the module run the following commands: perl Makefile.PL make make test make install If you are on a windows box you should use 'nmake' rather than 'make'. AUTHOR Jeff Kubina COPYRIGHT Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. The full text of the license can be found in the LICENSE file included with this module. SEE ALSO Lingua::Stem::Snowball, Lingua::EN::Tagger, and Lingua::DE::Tagger