WordNet::Similarity
=====================

version 0.05

Copyright (c) 2003

Siddharth Patwardhan, patw0006@d.umn.edu
Ted Pedersen, tpederse@d.umn.edu
University of Minnesota, Duluth

http://groups.yahoo.com/group/wn-similarity
http://search.cpan.org/dist/WordNet-Similarity

This package consists of Perl modules along with supporting Perl
programs that implement the semantic relatedness measures described by
Leacock and Chodorow (1998), Jiang and Conrath (1997), Resnik (1995),
Lin (1998), Hirst and St. Onge (1998) and the adapted gloss overlap
measure by Banerjee and Pedersen (2002). The Perl modules are designed
as object classes with methods that take two word senses as input and
return the semantic relatedness of those word senses.

A quantitative measure of the degree to which two word senses are
related has applications in many areas, such as word sense
disambiguation and information retrieval. For example, in order to
determine which sense of a given word is being used in a particular
context, the sense having the highest relatedness with its context word
senses is most likely to be the sense being used. Similarly, in
information retrieval, retrieving documents that contain highly related
concepts is likely to yield higher precision and recall values.

A command line interface to these modules is also present in the
package. The simple, user-friendly interface returns the relatedness
measure of two given words. A number of switches and options have been
provided to modify the output and enhance it with trace information and
other useful output. Details of the usage are provided in other sections
of this README. Supporting utilities for generating information content
files from various corpora are also available in the package. The
information content files are required by three of the measures for
computing the relatedness of concepts.

Discussion about the package and online help for the package are
available on the wn-similarity mailing list. To join the list, go to
http://groups.yahoo.com/group/wn-similarity

The following sections describe the organization of this software
package and how to use it. A few typical examples are given to
illustrate the usage of the modules and the supporting utilities.


SEMANTIC RELATEDNESS
====================

We observe that humans find it extremely easy to say if two words are
related and if one word is more related to a given word than another.
For example, if we come across two words -- 'car' and 'bicycle', we know
they are related as both are means of transport. Also, we easily observe
that 'bicycle' is more related to 'car' than 'fork' is. But is there
some way to assign a quantitative value to this relatedness? Some ideas
have been put forth by researchers to quantify the concept of
relatedness of words, with encouraging results. Six of these different
measures of relatedness have been implemented in this software package.
Apart from these, a simple edge counting approach and a random method
have also been provided.

These measures rely heavily on the vast store of knowledge available in
the online electronic dictionary -- WordNet. So, we use a Perl interface
for WordNet called WordNet::QueryData to make it easier for us to access
WordNet. The modules in this package REQUIRE that the WordNet::QueryData
module be installed on the system before these modules are installed.


CONTENTS OF THE PACKAGE
=======================

The package contains the semantic relatedness modules, some support Perl
utilities and some sample configuration files, data files and programs.

Modules
-------

All the modules that will be installed in the Perl system directory are
present in the '/lib' directory tree of the package.
These include the semantic relatedness modules -- jcn.pm, res.pm,
lin.pm, lch.pm, hso.pm, lesk.pm, vector.pm, edge.pm and random.pm --
present in the WordNet/Similarity subdirectory and the supporting
modules get_wn_info.pm and string_compare.pm. There also exists a
WordNet/Similarity.pm module that currently contains only Perl
documentation and version information. All these modules, once installed
in the Perl system directory, can be directly used by Perl programs.

Supporting Perl Utilities
-------------------------

The '/utils' subdirectory of the package contains supporting Perl
programs. 'similarity.pl' is a commandline interface to the relatedness
modules. A number of Perl programs that generate information content
files from various corpora are also provided. As part of the standard
install, these are also installed into the system directories, and can
be accessed from any working directory if the common system directories
(/usr/bin, /usr/local/bin, etc) are in your path.

Samples
-------

If you downloaded this package as a tar-gzipped file from the web, you
will find a '/samples' subdirectory in the package, which contains
sample configuration files for the modules, sample programs showing
usage of the modules and sample data files (information content and
relation files).


INSTALLATION OF THE MODULES
===========================

To build these modules and the default data files, set up the WNHOME
environment variable to contain the path to WordNet, and then type the
following:

    perl Makefile.PL
    make
    make test

To install the modules, type the following as root:

    make install

The installation assumes that WordNet::QueryData is installed in the
Perl system path and is accessible via the @INC list of paths. The
QueryData module determines the location of WordNet from the WNHOME
environment variable. So, make sure you have WNHOME set up to contain
the path of the directory where WordNet is installed
(e.g. /usr/local/WordNet-1.7.1).
If WNHOME is not set up, by default the 'perl Makefile.PL' step looks
for WordNet in /usr/local/WordNet-1.7.1 on a unix system or in
C:\Program Files\WordNet\1.7.1 on a Windows system. If it is not
possible to set up WNHOME on your system, use the --WNHOME option during
the 'perl Makefile.PL' step to specify the path of your WordNet
installation. For example:

    perl Makefile.PL --WNHOME /home/sid/wordnet1.7

The above steps will install the modules and the supporting default data
files in the Perl system path. It is very likely that you will require
root or supervisor privileges to install these modules in the Perl
system path. In order to install these in a user-specified path you
would need to provide this as an option during the 'perl Makefile.PL'
step. For example, in order to install the modules under
'/home/sid/lib' you would run the command

    perl Makefile.PL PREFIX=/home/sid/lib

In order to include and use modules installed in non-standard
directories (paths not present in the Perl @INC list of paths), you may
need to add a line like

    use lib '/home/sid/lib';

in your Perl program that uses the installed modules.

The above instructions should be sufficient for standard and slightly
non-standard installations. However, if you need to modify other
makefile options you should look at the ExtUtils::MakeMaker
documentation. Modifying other makefile options is not recommended
unless you really, absolutely and completely know what you're doing!

NOTE: The information-content based measures (res, lin, jcn) are invoked
using the default information content file generated during installation
of the modules. If, however, the version of WordNet being used on your
system has changed since that time, or for some reason the modules are
unable to locate the default information content files, then alternate
information content files can be specified only by using a configuration
file corresponding to each of the modules.
The format and creation of configuration files are discussed in a later
section. Utilities to generate information content files have been
provided in the package.


SYSTEM REQUIREMENTS
===================

The following should be installed on your system in order to use this
software.

1. Perl version 5.6: This package has been written in Perl, which is
freely available from www.perl.org. This package assumes that Perl is
installed in the directory /usr/local/bin. If so, the support programs
can directly be run at the command line as 'similarity.pl ...' or
'semCorFreq.pl ...', etc. However, if Perl is not installed at this
location, you would need to explicitly invoke them as
'perl similarity.pl ...' or 'perl freqCount.pl ...', etc.

2. WordNet: All the measures are based on WordNet. WordNet must be
installed on your system. WordNet is freely downloadable from
http://www.cogsci.princeton.edu/~wn/
WordNet version 1.7.1 was used during the development and testing of the
package; however, it should work with other versions of WordNet as well.
The WordNet::QueryData Perl module is used to access WordNet. This
module requires that an environment variable 'WNHOME', containing the
path to the WordNet files, be set up. For further details, please see
the WordNet::QueryData documentation.

3. WordNet::QueryData: This is the Perl interface to WordNet written by
Jason Rennie. QueryData should be accessible on the @INC path of Perl.
(It can be freely downloaded from
http://www.ai.mit.edu/~jrennie/WordNet/). QueryData 1.27 was used during
the development. We also observed that, due to some major changes in
QueryData from its previous versions, this software does not work with
the earlier versions of QueryData. If you have an earlier version of
QueryData (1.18 or earlier) you may need to upgrade QueryData.

4. Berkeley DB and the perl interface to Berkeley DB (BerkeleyDB
v0.19): If you are using the WordNet::Similarity::vector measure, then
you need to have Berkeley DB installed on your system. You also need the
perl interface (BerkeleyDB.pm) to be installed on the system. This
package was tested with version 0.20 of the BerkeleyDB perl module and
version 4.1 of the Berkeley DB database.

5. PDL: WordNet::Similarity::vector uses the PDL perl module. Make sure
PDL is installed before using the WordNet::Similarity::vector measure.


THE MODULES
===========

Using the relatedness modules
-----------------------------

The semantic relatedness modules in this distribution are built as
classes that expose the following methods:

    new()
    getRelatedness()
    getError()
    getTraceString()

- new()

The first step in using one of the semantic relatedness measures is to
create an object of the measure. This is done by calling the 'new'
method of that measure or module. For all the semantic relatedness
measures provided in this package, the 'new' method takes two
parameters --

    (a) a WordNet::QueryData object (REQUIRED)
    (b) the name of a configuration file for that module (Optional)

This method initializes an object of the requested measure, using the
configuration file data, or with default values if a configuration file
is not provided. A reference to this object is returned by the 'new'
method and must be saved by the calling program, if any of the other
methods of this module are to be called. It is possible to create
multiple objects of the same module (possibly initialized differently by
specifying different configuration files for each). The format of the
configuration files is discussed later in this section.

An 'undef' value returned by the 'new' method indicates that it was
unable to create an object. It is also possible that non-fatal errors
occur during the creation of the object. In such a case an object is
created by the 'new' method using default conditions.
However, a non-fatal error condition flag is set within the object,
which can be retrieved using the getError() method. It is advisable to
check for this error condition after the creation of every such object.

- getRelatedness()

The 'getRelatedness' method is called on the created object to determine
the semantic relatedness of two concepts (synsets in WordNet) as
computed by that measure. The input parameters are two WordNet synsets,
represented in the word#pos#sense format returned/used by
WordNet::QueryData. In this format each synset is represented by a word
from that synset, its part-of-speech and its sense number. For example,
if the second sense of 'teacher' as a noun occurs in a synset containing
synonyms for 'teacher', then this synset can be represented by the
string 'teacher#n#2'. The 'getRelatedness' method takes as input two
strings of this form and returns a floating point value, which is the
semantic relatedness of these (as computed by the measure).

- getError()

During a call to either the 'new' method or the 'getRelatedness' method
of a measure, if a fatal or non-fatal error occurs, the module sets an
error flag within the created object and sets an error string within
(the exception to this is when the module is unable to create an object
upon a call to the 'new' method, in which case it simply returns
'undef'). Both the error condition flag and the error string can be
retrieved using the 'getError' method on the created object. The method
is called without any parameters and it returns an array containing the
error flag as the first element and the error string as the second
element.

The error flag can take the values 0, 1 or 2. A value of 0 indicates
that there was no error or warning since the last call to 'getError'.
1 indicates that there was/were non-fatal error(s) (warnings) since the
last call to 'getError'. A value of 2 usually indicates that the errors
were serious enough to warrant the termination of the program.
However, how these errors are handled is completely up to the programmer
writing the Perl program. It is advisable that the error flag be checked
after every call to either 'new' or 'getRelatedness', but this is not a
necessary step and the error condition may also be tested at less
regular intervals.

- getTraceString()

If traces are enabled, a trace string generated during the last call to
the 'getRelatedness' method is stored within the object. This trace
string can be retrieved using the 'getTraceString' method. This method
is called with no parameters and returns a scalar containing the most
recently generated trace string. By default traces are not enabled.
Traces can be enabled by specifying this as an option in the
configuration file for the measure. Instructions for writing
configuration files for the measures follow later in this section.

Examples of typical usage
-------------------------

To create an object of the Resnik measure, we would have the following
lines of code in the Perl program.

    use WordNet::Similarity::res;
    $object = WordNet::Similarity::res->new($wn, '/home/sid/resnik.conf');

The reference to the initialized object is stored in the scalar variable
'$object'. '$wn' contains a WordNet::QueryData object that should have
been created earlier in the program. The second parameter to the 'new'
method is the path of the configuration file for the Resnik measure. If
the 'new' method is unable to create the object, '$object' would be
undefined. This, as well as any other error/warning, may be tested.

    die "Unable to create resnik object.\n" if(!defined $object);
    ($err, $errString) = $object->getError();
    die $errString."\n" if($err);

To create a Leacock-Chodorow measure object, using default values, i.e.
no configuration file, we would have the following:

    use WordNet::Similarity::lch;
    $measure = WordNet::Similarity::lch->new($wn);

To find the semantic relatedness of the first sense of the noun 'car'
and the second sense of the noun 'bus' using the Resnik measure, we
would write the following piece of code:

    $relatedness = $object->getRelatedness('car#n#1', 'bus#n#2');

To get traces for the above computation:

    print $object->getTraceString();

However, traces must be enabled using configuration files. By default
traces are turned off.

Configuration files
-------------------

The behaviour of the measures of semantic relatedness can be controlled
by using configuration files. These configuration files specify how
certain parameters are initialized within the object. A configuration
file may be specified as a parameter during the creation of an object
using the 'new' method. The configuration files follow a fixed file
format.

Every configuration file starts with the name of the module ON THE FIRST
LINE of the file. For example, a configuration file for the Resnik
module will have on the first line 'WordNet::Similarity::res'. This is
followed by the various parameters, each on a new line and having the
form 'name::value'. The 'value' of a parameter is optional (in case of
boolean parameters). In case 'value' is omitted, we would have just
'name::' on that line. Comments are supported in the configuration file.
Anything following a '#' is ignored in the configuration file.

Sample configuration files are present in the '/samples' subdirectory of
the package. Each of the modules has specific parameters that can be
set/reset using the configuration files. Please read the manpages or the
perldocs of the respective modules for details on the parameters
specific to each of the modules. For instance,
'man WordNet::Similarity::res' or 'perldoc WordNet::Similarity::res'
should display the documentation for the Resnik module.
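
The file format described above can be illustrated with a small
stand-alone parser. This is only a sketch of the format, not the code
the modules themselves use to read their configuration. The parameter
names 'trace' and 'infocontent' in the sample are placeholders chosen
for illustration (the perldoc of each module lists the real ones), and
treating a bare 'name::' as the value 1 is an assumption.

```perl
use strict;
use warnings;

# Parse configuration text in the 'name::value' format described above.
# Returns the module name and a hash reference of parameters.
sub parse_config {
    my ($text) = @_;
    my @lines = split /\n/, $text;

    # The module name must be on the first line of the file.
    my $module = shift @lines;
    die "Empty configuration\n" unless defined $module;
    $module =~ s/#.*//;              # drop a trailing comment
    $module =~ s/^\s+|\s+$//g;       # trim surrounding whitespace
    die "First line must name a module\n"
        unless $module =~ /^WordNet::Similarity::\w+$/;

    my %param;
    for my $line (@lines) {
        $line =~ s/#.*//;            # anything after '#' is ignored
        $line =~ s/^\s+|\s+$//g;
        next if $line eq '';
        $line =~ /^(\w+)::(.*)$/
            or die "Malformed parameter line: $line\n";
        # Assumption: a bare 'name::' switches a boolean parameter on.
        $param{$1} = ($2 eq '') ? 1 : $2;
    }
    return ($module, \%param);
}

# Hypothetical configuration text, for illustration only.
my $sample = join "\n",
    'WordNet::Similarity::res',
    '# parameter names below are placeholders',
    'trace::',
    'infocontent::/home/sid/infoBNC.dat   # value is a path',
    '';

my ($module, $param) = parse_config($sample);
print "$module\n";
print "$_ = $param->{$_}\n" for sort keys %$param;
```

A real configuration file would be read from disk and its path passed as
the second argument to the 'new' method of the measure.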

Information Content
-------------------

Three of the measures provided within the package require information
content values of concepts (WordNet synsets) for computing the semantic
relatedness of concepts. Resnik (1995) describes a method for computing
the information content of concepts from large corpora of text. In order
to compute information content of concepts, according to the method
described in the paper, we require the frequency of occurrence of every
concept in a large corpus of text. We provide these frequency counts to
the three measures (Resnik, Jiang-Conrath and Lin measures) in files
that we call information content files. These files contain a list of
WordNet synset offsets along with their part of speech and frequency
count. The files are also used to determine the topmost node of the noun
and verb 'is-a' hierarchies in WordNet. The information content file
that should be used by a module is specified in the configuration file
of that module. If no information content file is specified, then the
default information content file, generated at the time of the
installation of the WordNet::Similarity modules, is used. A description
of the format of these files follows.

The FIRST LINE of this file must contain the version of WordNet that the
file was created with. This should be present as a string of the form

    wnver::{version}

For example, if WordNet version 1.7.1 was used for creation of the
information content file, the following line would be present at the
start of the information content file.

    wnver::1.7.1

The rest of the file contains on each line a WordNet synset offset,
part-of-speech and a frequency count, in the form

    {offset}{part-of-speech} {frequency} [ROOT]

without any leading or trailing spaces. For example, one of the lines of
an information content file may be as follows.

    63723n 667

where '63723' is a 'noun' synset offset and 667 is its frequency count.
Suppose the noun synset with offset 1740 is the root node of one of the
noun taxonomies and has a frequency count of 17625.
Then this synset would appear in an information content file as follows:

    1740n 17625 ROOT

The ROOT tags are extremely significant in determining the top of the
hierarchies and must not be omitted. Typically, frequency counts for the
noun and verb hierarchies are present in each information content file.

A number of support programs to generate these files from various
corpora are present in the '/utils' directory of the package. A sample
information content file has been provided in the '/samples' directory
of the package.

NOTE: Using the "Resnik" counting it is possible to get fractional
values for the frequency counts.


SUPPORTING PERL UTILITIES
=========================

The '/utils' directory of the package contains a few support Perl
programs that use the WordNet::Similarity modules or generate data files
for them. As part of the standard installation these are installed into
the system directories (such as /usr/bin or /usr/local/bin) from where
they can be easily accessed.

similarity.pl
-------------

The similarity.pl program provides a commandline interface to the
relatedness modules.

- Usage

Usage: similarity.pl [{--type TYPE [--config CONFIGFILE] [--allsenses]
                       [--offsets] [--trace] [--wnpath PATH]
                       [--simpath SIMPATH]
                       {--file FILENAME | WORD1 WORD2}
                      |--help
                      |--version }]

Displays the semantic similarity between the base forms of WORD1 and
WORD2 using various similarity measures described in Budanitsky and
Hirst (2001). The parts of speech of WORD1 and/or WORD2 can be
restricted by appending the part of speech (n, v, a, r) to the word.
(For example, car#n will consider only the noun forms of the word 'car'
and walk#nv will consider the verb and noun forms of 'walk'.) Individual
senses can also be given as input, in the form of word#pos#sense strings
(for example, car#n#1 represents the first sense of the noun 'car').

Options:

--type
    Switch to select the type of similarity measure to be used while
    calculating the semantic relatedness. The following strings are
    defined.

    'WordNet::Similarity::lch'      The Leacock Chodorow measure.
    'WordNet::Similarity::jcn'      The Jiang Conrath measure.
    'WordNet::Similarity::res'      The Resnik measure.
    'WordNet::Similarity::lin'      The Lin measure.
    'WordNet::Similarity::hso'      The Hirst St. Onge measure.
    'WordNet::Similarity::lesk'     Adapted gloss overlap measure.
    'WordNet::Similarity::vector'   Gloss Vector measure.
    'WordNet::Similarity::edge'     Simple edge-counts (inverted).
    'WordNet::Similarity::random'   A random measure.

--config
    Module-specific configuration file CONFIGFILE. This file contains
    the configuration that is used by the WordNet::Similarity modules
    during initialization. The format of this file is specific to each
    module and is specified in the module man pages and in the
    documentation of the WordNet::Similarity package.

--allsenses
    Displays the relatedness between every sense pair of the two input
    words WORD1 and WORD2.

--offsets
    Displays all synsets (in the output, including traces) as synset
    offsets and part of speech, instead of the
    word#partOfSpeech#senseNumber format used by QueryData. With this
    option any WordNet synset is displayed as
    word#partOfSpeech#synsetOffset in the output.

--trace
    Switches on 'Trace' mode. Displays as output on STDOUT the various
    stages of the processing. This option overrides the trace option in
    the module configuration file (if specified).

--file
    Allows the user to specify an input file FILENAME containing pairs
    of words whose semantic similarity needs to be measured. The file is
    assumed to be a plain text file with pairs of words separated by
    newlines, and the words of each pair separated by a space.

--wnpath
    Option to specify the path of the WordNet data files as PATH.
    (Defaults to /usr/local/WordNet-1.7.1/dict on Unix systems and
    C:\WordNet\1.7.1\dict on Windows systems)

--simpath
    If the relatedness module to be used is locally installed, then
    SIMPATH can be used to indicate the location of the local install of
    the measure.

--help
    Displays this help screen.

--version
    Displays version information.

NOTE: The environment variables WNHOME and WNSEARCHDIR, if present, are
used to determine the location of the WordNet data files. Use '--wnpath'
to override this.

ANOTHER NOTE: For the information-content based measures, similarity.pl
without the '--config' option invokes the relatedness modules using the
default information content file generated during installation of the
modules. If, however, the version of WordNet being used has changed
since that time, or for some reason the modules are unable to locate the
default information content files, then alternate information content
files can be specified only via the configuration file. Utilities to
generate information content files have been provided in the package.

For the WordNet::Similarity::vector measure, it is mandatory to provide
the location of a word vector data file and this can only be done by
using the '--config' option. In short, the '--config' option is REQUIRED
for the WordNet::Similarity::vector measure.

Compound words may also be given as input to similarity.pl. They may be
specified using underscores for spaces (as in WordNet) or may be
enclosed within double quotes. For example:

    similarity.pl --type WordNet::Similarity::jcn school private_school
    similarity.pl --type WordNet::Similarity::lch "interest rate" bank

Here 'private school' and 'interest rate' are the compound words
intended in the two examples, respectively.

ANOTHER NOTE: The '--file' option, however, does not allow both methods
of entering compound words in the input file. The compound words in the
input file may be entered only using underscores for spaces (the double
quotes option is not available for input via the input file).

The part of speech of the input word(s) may be restricted to one or more
parts of speech by appending '#' followed by a combination of one or
more of 'n', 'v', 'a' or 'r' (for nouns, verbs, adjectives and adverbs)
to one or both words.
A particular sense of a particular word may also be specified as input
in the word#pos#sense format. Here 'pos' is exactly one of 'n', 'v', 'a'
or 'r'. For example:

    similarity.pl --type WordNet::Similarity::jcn school#n child#n
    similarity.pl --type WordNet::Similarity::lesk "interest rate#n" bank#nv
    similarity.pl --type WordNet::Similarity::hso telephone talk#v
    similarity.pl --type WordNet::Similarity::vector word#n#2 newspaper#v
    similarity.pl --type WordNet::Similarity::random chat#n#1 talk#v#2

- Interpreting the output

In the simplest case interpreting the output is rather straightforward.
This is the case when just the semantic relatedness of two words has
been requested. The output, in this case, consists of the two words and
the relatedness value. However, when the '--allsenses' option or the
'--trace' option is specified, the program needs to display WordNet
synsets in the output. In order to do this, we decided to adopt the
convention introduced by Jason Rennie in the WordNet::QueryData module
to represent the WordNet synsets. According to this convention a synset
is represented by

    (1) a representative word from that synset
    (2) its part of speech and
    (3) a number specifying the sense number of the word (in this
        synset)

For example, consider the synset (teacher, instructor) from the noun
data file of WordNet. Here the words 'teacher' as well as 'instructor'
are each in their first sense. Using the above convention this synset
may be represented by 'teacher#n#1' or by 'instructor#n#1'.

Besides this, if the '--offsets' commandline option is used, a small
variation of the above convention is used that displays the offset of
the synset (in the WordNet data file) instead of the sense number. The
above synset could then be represented by 'teacher#n#8562747' or
'instructor#n#8562747', since 8562747 is the offset of this synset in
the noun data file of WordNet 1.7.
The first convention was adopted as the default, since synset offsets
vary between different versions of WordNet, while sense numbers of words
would more or less remain constant.

- Typical usage examples

(1) Suppose you wanted to find the measure of relatedness between 'car'
and 'bicycle', using the Jiang-Conrath measure.

    similarity.pl --type WordNet::Similarity::jcn car bicycle

(2) Suppose you need to find the relatedness of the noun forms of 'comb'
and 'hair' using the Leacock-Chodorow measure and also your WordNet
database files happen to be located at /wordnet1.7/dict, then you would
have

    similarity.pl --type WordNet::Similarity::lch --wnpath /wordnet1.7/dict comb#n hair#n

If the --wnpath option is not given, the program looks for the path to
the data files in the WNHOME and the WNSEARCHDIR environment variables.
If these have also not been specified, then by default the program
assumes that the WordNet data files reside in the directory
/usr/local/wordnet1.7/dict on a unix machine and in C:\wn17\dict on a
windows machine.

(3) An example using a data file as input to the program (using the
Jiang-Conrath measure for this example)

    similarity.pl --type WordNet::Similarity::jcn --file testfile

(4) Displaying relatedness between all senses of the two words along
with traces.

    similarity.pl --type WordNet::Similarity::lch --allsenses --trace paper pencil

(5) Displaying the relatedness between the verb form of 'talk' and all
parts of speech of 'speaker', with traces using the adapted gloss
overlap measure.

    similarity.pl --type WordNet::Similarity::lesk --trace speaker talk#v

(6) Using a configuration file "/home/sid/lesk.conf" to specify the
configuration options to the WordNet::Similarity::lesk module.

    similarity.pl --type WordNet::Similarity::lesk --config /home/sid/lesk.conf duck fowl

(7) To display version information.

    similarity.pl --version

(8) To display detailed help.

    similarity.pl --help

infocontent.pl
--------------

Three of the measures provided within the package require information
content values of concepts (WordNet synsets) for computing the semantic
relatedness of concepts. We provide these measures with frequency counts
of WordNet synsets computed from large corpora of text, in files called
information content files. A number of programs have been provided in
the '/utils' subdirectory to generate information content files from
various corpora of text.

    BNCFreq.pl       -- from the BNC corpus.
    brownFreq.pl     -- from the Brown corpus.
    semCor17Freq.pl  -- from SemCor 1.7 (ignoring the sense tags).
    semTagFreq.pl    -- from SemCor 1.7 (using the sense tags).
    treebankFreq.pl  -- from the Treebank corpus.
    rawtextFreq.pl   -- from raw text.

All six have a similar interface, although there are slight differences
in the way the programs are called on the command-line due to the
differences in the organization and format of the various corpora. The
following sub-sections give the typical usage and examples of all these
programs. Please use the '--help' switch of each of the programs for the
exact usage and help.

- Usage

PROGRAM [{--compfile COMPFILE --outfile OUTFILE [--stopfile STOPFILE]
          [--wnpath WNPATH] [--resnik] [--smooth SCHEME] PATH
         | --help
         | --version }]

Here PROGRAM is one of the Perl programs listed above, each of which
generates an information content file from a large corpus of text. The
program computes the information content of concepts by counting the
frequency of their occurrence in a corpus. PATH specifies the files of
the corpus or the root of the directory tree containing the text of the
corpus. Each utility has a different way in which the input files may be
specified to it. Please run each utility with '--help' to see its
specific idiosyncrasies.

Options:

--compfile
    Used to specify the file COMPFILE containing the list of compounds
    in WordNet.

--outfile
    Specifies the output file OUTFILE.

--stopfile
    STOPFILE is a list of stop listed words that will not be considered
    in the frequency count.

--wnpath
    Option to specify WNPATH as the location of WordNet data files. If
    this option is not specified, the program tries to determine the
    path to the WordNet data files using the WNHOME environment
    variable.

--resnik
    Enables the counting of frequencies using the method described by
    Resnik [3]. This was the method of counting originally used. The
    default is a different scheme of counting described in our
    publication (Patwardhan, Banerjee and Pedersen [9]).

--smooth
    Specifies the smoothing to be used on the probabilities computed.
    SCHEME specifies the type of smoothing to perform. It is a string,
    which can only be 'ADD1' as of now. Other smoothing schemes will be
    added in future releases.

--help
    Displays this help screen.

--version
    Displays version information.

A sample COMPFILE containing the list of compounds in WordNet 1.7 is
present in the '/samples' subdirectory. A utility called compounds.pl
has been provided in the '/utils' subdirectory. This utility generates a
list of compounds present in your version of WordNet and can be used to
generate a file containing the list of compounds in WordNet as follows:

    compounds.pl > compounds.dat

In this case compounds.pl detects the location of the WordNet data files
using the WNHOME environment variable. If the WNHOME environment
variable has not been set up it tries default locations
(C:\Program Files\WordNet\1.7.1 on Windows and /usr/local/WordNet-1.7.1
on a unix system). Another way to specify the location of the WordNet
data files is by using the '--wnpath' option in compounds.pl, like so

    compounds.pl --wnpath /usr/local/wordnet1.6/dict > compounds.dat

- The utility-specific idiosyncrasies

(a) BNCFreq.pl -- This utility creates the information content file from
the word frequencies counted from the BNC. The data files in the BNC are
XML tagged files present in a 2-level directory structure.
      In a typical BNC install, the data files of the BNC reside in
      the /BNCWorld/Texts subpath of the BNC installation. This is the
      path that needs to be specified to BNCFreq.pl to count the
      frequency of words:

        BNCFreq.pl [OPTIONS] /home/sid/BNC/BNCWorld/Texts

  (b) brownFreq.pl -- The version of the Brown corpus that we used
      contained the data files in the BROWN1 and BROWN2 subdirectories
      of the Brown corpus installation. Both directories contain the
      same data, formatted a little differently. These files are
      provided to the brownFreq.pl utility as a list of files
      (commonly specified by wildcards, as follows):

        brownFreq.pl [OPTIONS] /home/sid/Brown/BROWN1/*.TXT

  (c) semCor17Freq.pl -- SemCor 1.7 was downloaded from Dr. Rada
      Mihalcea's website

        http://www.cs.unt.edu/~rada/software.html

      and is a sense tagged version of the Brown Corpus. The tagged
      data files are present in the /brown1/tagfiles subdirectory of
      the extracted package. These files are provided to the
      semCor17Freq.pl utility as a list of files (using wildcards):

        semCor17Freq.pl [OPTIONS] /home/sid/semcor17/brown1/tagfiles/*

  (d) semTagFreq.pl -- These information content files are computed
      from SemCor 1.7 (using the sense tags). The word frequencies for
      these have already been computed and are distributed as a part
      of the standard distribution of WordNet. Thus only the location
      of the WordNet data files needs to be specified for this utility
      (using the WNHOME environment variable or the '--wnpath'
      option).

  (e) treebankFreq.pl -- This utility computes the information content
      files from the Treebank corpus (only the Wall Street Journal
      articles). The Wall Street Journal articles are usually present
      in the /raw/wsj subdirectory of the Treebank installation. Only
      this path needs to be specified when using this utility:

        treebankFreq.pl [OPTIONS] /home/sid/treebank/raw/wsj

  (f) rawtextFreq.pl -- To compute the information content files from
      raw text, only the raw text file(s) need to be specified as the
      input.
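
  As a rough illustration of what these utilities compute, the sketch
  below turns a handful of invented frequency counts into information
  content values, IC(c) = -log p(c), with and without the add-1
  smoothing selected by '--smooth ADD1'. It is written in Python purely
  for illustration and is not the code used by the utilities: the
  concept labels and counts are made up, the function name is
  hypothetical, and the sketch ignores both the propagation of counts
  through the WordNet hierarchy and the format of the output file.

```python
import math


def information_content(counts, add1=False):
    """Map each concept to -log(p), where p is its relative frequency.

    counts: dict of concept -> raw frequency (hypothetical toy data).
    add1:   if True, add 1 to every count first (add-1 smoothing).
    Concepts with probability 0 map to None, since -log(0) is undefined.
    """
    if add1:
        counts = {concept: freq + 1 for concept, freq in counts.items()}
    total = sum(counts.values())
    ic = {}
    for concept, freq in counts.items():
        if freq == 0 or total == 0:
            ic[concept] = None
        else:
            ic[concept] = -math.log(freq / total)
    return ic


# Invented counts: 'fork' never occurs in this toy corpus.
toy_counts = {"car": 6, "bicycle": 2, "fork": 0}

raw = information_content(toy_counts)
smoothed = information_content(toy_counts, add1=True)

for concept in toy_counts:
    print(concept, raw[concept], smoothed[concept])
```

  Without smoothing, the unseen concept has no usable information
  content value; with add-1 smoothing every concept receives a finite
  value, at the cost of slightly distorting the observed frequencies.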

- Some typical examples

  (1) In order to generate the information content file from the BNC,
      we type the command:

        BNCFreq.pl --compfile ../samples/compounds.dat --outfile infoBNC.dat /home/sid/BNC/BNCWorld/Texts

      Here '/home/sid/BNC/BNCWorld/Texts' is the path containing the
      BNC. The output information content file infoBNC.dat is
      generated, and 'compounds.dat' is used for the list of compounds
      in WordNet.

  (2) Frequency counts generated from the Brown corpus, using a
      stop-list:

        brownFreq.pl --compfile compounds.dat --outfile infoBrown.dat --stopfile stop.txt /home/sid/Brown/*

      This uses the file 'stop.txt' containing stop words, that is,
      words that are ignored while counting the frequencies.

  (3) Frequency counts generated from a raw text file, using Resnik
      counting:

        rawtextFreq.pl --compfile compounds.dat --outfile infoRawText.dat --resnik /home/sid/Texts/WorldWar.txt

      WorldWar.txt is the raw text file. infoRawText.dat is the output
      information content file.

  (4) Using the Treebank corpus (WSJ articles) to generate an
      information content file with the option for Add-1 smoothing:

        treebankFreq.pl --compfile compounds.dat --outfile tbInfo.dat --smooth ADD1 /home/sid/treebank/raw/wsj

      'tbInfo.dat' is the output file. '/home/sid/treebank/raw/wsj' is
      the path to the Wall Street Journal articles of the Treebank
      corpus. The '--smooth ADD1' option requests Add-1 smoothing of
      the frequency counts: 1 is added to every frequency count so
      that no count is 0.


wordVectors.pl
--------------

The WordNet::Similarity::vector module requires a BerkeleyDB file
containing co-occurrence vectors for all the words in the WordNet
glosses. The utility wordVectors.pl has been provided to generate such
a database file. This utility generates co-occurrence vectors from the
WordNet glosses themselves. Utilities to generate these from other
corpora will be provided in future releases of this software.
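
To give a feel for what a co-occurrence vector is, the following
Python sketch counts, for each word in two invented glosses, the other
words that appear in the same gloss. It is only an illustration under
simplifying assumptions: two words are taken to co-occur when they
share a gloss, tokenization is by whitespace, the stop list is made
up, and a plain in-memory dictionary stands in for the BerkeleyDB
file. The glosses are not taken from WordNet, the function name is
hypothetical, and wordVectors.pl itself may differ in detail.

```python
from collections import Counter, defaultdict


def cooccurrence_vectors(glosses, stopwords=()):
    """Return a mapping word -> Counter of words sharing a gloss with it."""
    vectors = defaultdict(Counter)
    stop = set(stopwords)
    for gloss in glosses:
        # Lowercase, split on whitespace, and drop stop words.
        words = [w for w in gloss.lower().split() if w not in stop]
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:  # skip pairing a token with itself
                    vectors[word][other] += 1
    return vectors


# Invented glosses and stop list, for illustration only.
toy_glosses = [
    "a motor vehicle with four wheels",
    "a vehicle with two wheels and pedals",
]
toy_stoplist = ["a", "with", "and"]

vectors = cooccurrence_vectors(toy_glosses, toy_stoplist)
print(dict(vectors["vehicle"]))
```

Each key of the resulting dictionary plays the role of one word, and
its counter plays the role of that word's vector, with one dimension
per co-occurring word.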

Usage: wordVectors.pl [{ [--compfile COMPOUNDS] [--stopfile STOPLIST]
                         [--wnpath WNPATH] [--noexamples]
                         [--cutoff VALUE] DBFILE
                       | --help | --version }]

This program writes out word vectors computed from WordNet to a
BerkeleyDB database (Hash) specified by the filename DBFILE.

Options:

  --compfile    Option specifying the list of compounds present in
                WordNet, in the file COMPOUNDS. This list is used for
                compound detection.

  --stopfile    Option specifying a list of stop words that are not
                considered while counting.

  --wnpath      WNPATH specifies the path of the WordNet data files.
                Ordinarily, this path is determined from the $WNHOME
                environment variable, but this option overrides that
                behavior.

  --noexamples  Removes examples from the glosses before processing.

  --cutoff      Option used to restrict the dimensions of the word
                vectors with a tf/idf cutoff. VALUE is the cutoff.
                Only a tf/idf score above VALUE is acceptable.

  --help        Displays this help screen.

  --version     Displays version information.


COPYRIGHT AND LICENCE
=====================

This suite of programs is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.

Note: The text of the GNU General Public License is provided in the
file 'GPL' that you should have received with this distribution.

Copyright (C) 2003 Siddharth Patwardhan and Ted Pedersen

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.


ACKNOWLEDGEMENTS
================

We would like to thank the following for their support and
contribution towards the development of this package.

We thank Jason Rennie for his QueryData package, the WordNet guys at
Princeton for WordNet, Resnik, Hirst, St. Onge, Jiang, Conrath, Lin,
Leacock and Chodorow for their algorithms and work on the relatedness
measures. We also thank Bano (Satanjeev Banerjee) for his work on the
adapted gloss overlap module.


REFERENCES
==========

(1) Leacock C. and Chodorow M. 1998. Combining local context and
    WordNet similarity for word sense identification. In Fellbaum
    1998, pp. 265-283.

(2) Jiang J. and Conrath D. 1997. Semantic similarity based on corpus
    statistics and lexical taxonomy. In Proceedings of International
    Conference on Research in Computational Linguistics, Taiwan.

(3) Resnik P. 1995. Using information content to evaluate semantic
    similarity. In Proceedings of the 14th International Joint
    Conference on Artificial Intelligence, pages 448-453, Montreal.

(4) Lin D. 1998. An information-theoretic definition of similarity.
    In Proceedings of the 15th International Conference on Machine
    Learning, Madison, WI.

(5) Hirst G. and St-Onge D. 1998. Lexical Chains as representations
    of context for the detection and correction of malapropisms. In
    Fellbaum 1998, pp. 305-332.

(6) Budanitsky A. and Hirst G. 2001. Semantic distance in WordNet: An
    experimental, application-oriented evaluation of five measures.
    In Workshop on WordNet and Other Lexical Resources, Second meeting
    of the North American Chapter of the Association for Computational
    Linguistics. Pittsburgh, PA.

(7) Banerjee S. and Pedersen T. 2002. An Adapted Lesk Algorithm for
    Word Sense Disambiguation Using WordNet.
    In Proceedings of the Fourth International Conference on
    Computational Linguistics and Intelligent Text Processing
    (CICLING-02). Mexico City.

(8) Fellbaum C., editor. WordNet: An electronic lexical database. MIT
    Press, 1998.

(9) Patwardhan S., Banerjee S. and Pedersen T. 2002. Using Semantic
    Relatedness for Word Sense Disambiguation. In Proceedings of the
    Fourth International Conference on Intelligent Text Processing and
    Computational Linguistics, Mexico City.

(10) Schütze H. 1998. Automatic Word Sense Discrimination.
     Computational Linguistics, 24(1):97-123.


(README: Last Updated 06/09/2003 -- Sid.)