NAME WordNet-SenseRelate version 0.03 OVERVIEW Selecting the correct sense of a word in a context is called word sense disambiguation (WSD). The correct sense is selected from a set of predefined senses for that word (i.e., from a dictionary). SYNOPSIS use WordNet::SenseRelate; use WordNet::QueryData; my $qd = WordNet::QueryData->new; my %options = (wordnet => $qd, measure => 'WordNet::Similarity::lesk' ); my $wsd = WordNet::SenseRelate->new (%options); my @words = qw/when in the course of human events/; my @res = $wsd->disambiguate (window => 2, tagged => 0, scheme => 'normal', context => [@words], ); print join (' ', @res), "\n"; CONTENTS When the distribution is unpacked, several subdirectories are created: /lib This directory contains the Perl modules that do the actual work of disambiguation. By default, these files are installed into /usr/local/lib/perl5/site_perl/PERL_VERSION (where PERL_VERSION is the version of Perl you are using). See the INSTALL file for more information. /utils This directory contains useful scripts. These scripts will be install when 'make install' is run. By default, these files are installed into your /usr/local/bin directory. See the INSTALL file for more information. The scripts in this directory are: wsd.pl This very useful script can be used to disambiguate a file of words. It is discussed in greater detail later in this document. semcor-reformat.pl This script will reformat a Semcor file so that it can be used as input to wsd.pl reformat-for-senseval.pl This script will reformat the output of wsd.pl so that it can be used as input to the Senseval scorer2 program. Each of these scripts has detailed documentation. Run perldoc on a file to see the detailed documentation; for example, 'perldoc wsd.pl' shows the documentation for wsd.pl. /samples This directory contains files that may (or may not) be useful. The files are primarily files that can be used as input to the scripts in the utils directory. There is a README file in the directory that describes the contents in more detail. /t This directory contains test scripts. These scripts are run when you execute 'make test'. DESCRIPTION Words can have multiple meanings or senses. For example, the word *glass* in WordNet [1] has seven senses as a noun and five senses as a verb. Glass can mean a clear solid, a container for drinking, the quantity a drinking container will hold, etc. WSD is the process of selecting the correct sense of a word when that word occurs in a specific context. For example, in the sentence, "the window is made of glass", the correct sense of glass is the first sense, a clear solid. WordNet::SenseRelate implements an extension of the algorithm described by Pedersen, Banerjee, and Patwardhan [2]. This implementation is similar to the original SenseRelate package. The original SenseRelate was intended for a "lexical sample" situation where the goal is to disambiguate only one word (specified by markup tags) in a given context. The goal of WordNet::SenseRelate is to disambiguate every word in a context or document. Algorithm for each word w in input disambiguate-single-word (w) disambiguate-single-word for each sense s_ti of target word t let socre_i = 0 for each word w_j in context window next if j = t for each sense s_jk of w_j temp-score_k = relatedness (s_ti, s_jk) best-score = max temp-score if best-score > threshold score_i = score_i + best-score return i s.t. score_i > score_j for all j in {s_t0, ..., s_tN} The Context Window The size of the context window can be specified by the user. A context window of size 3 means that the 3 words to the left and the 3 words to the right of the target word will be in the context window; however, the algorithm will expand the context window so that the 3 words on each side will be words known to WordNet. For example, if the word 'the', occurs in the context window to the left of the target word, then the window will be expanded by one word to the left. Note that the context window will only include words in the same sentence as the target word. If, for example, the target word is the first word in the sentence, then there will be no words to left of the target word in the context window. Part of Speech Coercion Certain measures of semantic similarity only work on noun-noun or verb-verb pairs; therefore, the usefulness of these measures for WSD is somewhat limited. As a way of coping with this problem, WordNet::SenseRelate provides an option to "coerce" words in the context window to be of the same part of speech as the target word. When POS coercion is in effect, if the target word is a noun, then WordNet::SenseRelate will attempt to convert non-nouns in the context window to noun forms of the same word. For example, if the target word is a noun and the verb *love* occurs in the window, the module might convert that word to the noun *love*. WordNet::SenseRelate first uses the validForms method from WordNet::QueryData to find any valid forms of the word being coerced that are of the desired part of speech. In the case of part of speech tagged text, the POS tags are discarded. If validForms did not return any forms of the desired part of speech, then the derived forms relation in WordNet is used to find possible forms of the word. If neither of these methods returned usable forms, then no further attempt is made to coerce the word to be the desired part of speech. Tracing/Debugging Several different levels of trace output are available. The trace level can be specified as a command-line option to wsd.pl or as a parameter to the WordNet::SenseRelate module. Trace Levels 1 Show the context window for each pass through the algorithm. 2 Display winning score for each pass. 4 Display the scores for all senses for each pass (overrides level 2). 8 Display traces from the semantic relatedness module. Different trace levels can be combined to achieve the desired behavior. For example, by specifying a trace level of 3, both level 1 and level 2 traces are generated (i.e., the context window will be shown along with the winning score for each pass). Using wsd.pl The wsd.pl script provides an easy method of performing disambiguation from the command line. The text to be disambiguated is read from a file provided by the user on the command line. Output The output of wsd.pl is simply the disambiguated words. The output will be in the form word#part_of_speech#sense_number. The part of speech will be one of 'n' for noun, 'v' for verb, 'a' for adjective, or 'r' for adverb. Words from other parts of speech are not disambiguated and are not found in WordNet. The sense number will be a WordNet sense number. WordNet sense numbers are assigned by frequency, so sense 1 of a word is more common than sense 2, etc. Sometimes when a word is disambiguated, a "different" but synonymous word will be found in the output. This is not a bug but is a consequence of how WordNet works. The word sense returned will always be the first word sense in a synset (synonym set) to which the original word belongs. Usage wsd.pl --context FILE --format FORMAT [--scheme SCHEME] [--type MEASURE] [--config FILE] [--compounds FILE] [--stoplist FILE] [--window INT] [--contextScore NUM] [--pairScore NUM] [--outfile FILE] [--trace INT] [--silent] | --help | --version The format option specifies one of the three different formats supported by wsd.pl. The three formats are: raw Raw text that is not part of speech tagged and needs undergo sentence boundary detection. Example: Red cars are faster than white cars. However, white cars are less expensive. parsed Parsed text is untagged text that has had all unwanted punctuation removed and has exactly one sentence per line. Example: Red cars are faster than white cars However white cars are less expensive tagged Tagged text is part of speech tagged text that has no unwanted punctuation and has exactly one sentence per line. Example: Red/JJ cars/NNS are/VBP faster/RBR than/IN white/JJ cars/NNS However/RB white/JJ cars/NNS are/VBP less/RBR expensive/JJ The different options and parameters for wsd.pl are discussed in detail in the documentation for wsd.pl. Run 'perldoc wsd.pl' to view the documentation. Usage Examples 1. wsd.pl --context input.txt --format raw 2. wsd.pl --trace 3 --context input.txt --format raw 3. wsd.pl --trace 3 --context input.txt --window 4 --format raw Using the Disambiguation Module The WordNet::SenseRelate Perl module can be used in other Perl programs to perform word sense disambiguation. Example use WordNet::SenseRelate; use WordNet::QueryData; my $qd = WordNet::QueryData->new; my $wsd = WordNet::SenseRelate->new (wordnet => $qd, measure => 'WordNet::Similarity::lesk'); my @words = qw/this is a test/; my @results = $wsd->disambiguate (context => [@words]); print join (' ', @results), "\n"; The context parameter to disambiguate() specifies a set of words to disambiguate. The function treats the context as one sentence. To disambiguate multiple sentences, make a call to disambiguate() for each sentence. The usage of the disambiguation module is discussed in detail in the documentation for the module. Run 'perldoc WordNet::SenseRelate' or 'man WordNet::SenseRelate' (after installing the module) to view the documentation. To view the documentation before installing the module, run 'perldoc lib/WordNet/SenseRelate.pm'. SEE ALSO WordNet::SenseRelate(3) wsd.pl(1) AUTHORS Jason Michelizzi Ted Pedersen COPYRIGHT AND LICENSE Copyright (C) 2004-2005 by Jason Michelizzi and Ted Pedersen This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. REFERENCES 1. Christiane Fellbaum. 1998. WordNet: an Electronic Lexical Database. MIT Press. 2. Ted Pedersen, Satanjeev Banerjee, and Siddharth Patwardhan. 2003. Maximizing Semantic Relatedness to Perform Word Sense Disambiguation.