13.3.5 Generating Word-Usage Counts

The following awk program prints the number of occurrences of each word in its input. It illustrates the associative nature of awk arrays by using strings as subscripts. It also demonstrates the `for index in array' mechanism. Finally, it shows how awk is used in conjunction with other utility programs to do a useful task of some complexity with a minimum of effort. Some explanations follow the program listing:

     # Print list of word frequencies
     {
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }
     
     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }

This program has two rules. The first rule, because it has an empty pattern, is executed for every input line. It uses awk's field-accessing mechanism (see Fields) to pick out the individual words from the line, and the built-in variable NF (see Built-in Variables) to know how many fields are available. For each input word, it increments an element of the array freq to reflect that the word has been seen an additional time.
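
For example, assuming the two rules above have been saved in a file named, say, counts.awk (a name used here only for illustration), a short run might look like this:

     echo "to be or not to be" | awk -f counts.awk

which might print (in some order, since `for word in freq' visits the array elements in an arbitrary order):

     to      2
     be      2
     not     1
     or      1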

The second rule, because it has the pattern END, is not executed until the input has been exhausted. It prints out the contents of the freq table that has been built up inside the first action. This program has several problems that would prevent it from being useful by itself on real text files:

   * awk treats upper- and lowercase characters as distinct, so `The' and `the' are counted as two different words.

   * Words are split from the line only at whitespace, so punctuation characters count as part of a word; `word,' and `word' are likewise counted separately.

   * The output does not come out in any useful order; you are more likely to want to see the most frequent words first.

The way to solve these problems is to use some of awk's more advanced features. First, we use tolower to remove case distinctions. Next, we use gsub to remove punctuation characters. Finally, we use the system sort utility to process the output of the awk script. Here is the new version of the program:

     
     # wordfreq.awk --- print list of word frequencies
     
     {
         $0 = tolower($0)    # remove case distinctions
         # remove punctuation
         gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }
     
     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }
     

Assuming we have saved this program in a file named wordfreq.awk, and that the data is in file1, the following pipeline:

     awk -f wordfreq.awk file1 | sort -k 2nr

produces a table of the words appearing in file1 in order of decreasing frequency. The awk program suitably massages the data and produces a word frequency table, which is not ordered.

The awk script's output is then sorted by the sort utility and printed on the terminal. The options given to sort specify a sort key that begins with the second field of each input line (that is, skipping the word in the first field), that the sort keys should be treated as numeric quantities (otherwise `15' would come before `5'), and that the sorting should be done in descending (reverse) order.
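
As a quick illustration of why the numeric flag matters (this small test is not part of the word-frequency program itself), compare sorting the two lines `5' and `15' with and without it:

     printf '5\n15\n' | sort       # character comparison: prints 15, then 5
     printf '5\n15\n' | sort -n    # numeric comparison: prints 5, then 15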

The sort could even be done from within the program, by changing the END action to:

     
     END {
         sort = "sort -k 2nr"
         for (word in freq)
             printf "%s\t%d\n", word, freq[word] | sort
         close(sort)
     }
     

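With that change in place (again assuming the program is saved in wordfreq.awk), the sorted frequency table can be produced with a single command and no external pipeline:

     awk -f wordfreq.awk file1

The call to close() in the END rule closes awk's end of the pipe, so sort sees the end of its input, sorts the accumulated lines, and writes its results before awk finishes.
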
This way of sorting must be used on systems that do not have true pipes at the command-line (or batch-file) level. See the general operating system documentation for more information on how to use the sort program.