Previous: Multi-scanning, Up: Arrays



7.11 Sorting Array Values and Indices with gawk

The order in which an array is scanned with a `for (i in array)' loop is essentially arbitrary. In most awk implementations, sorting an array requires writing a sort function. While this can be educational for exploring different sorting algorithms, usually that's not the point of the program. gawk provides the built-in asort and asorti functions (see String Functions) for sorting arrays. For example:

     populate the array data
     n = asort(data)
     for (i = 1; i <= n; i++)
         do something with data[i]

After the call to asort, the array data is indexed from 1 to some number n, the total number of elements in data. (This count is asort's return value.) data[1] <= data[2] <= data[3], and so on. The comparison of array elements is done using gawk's usual comparison rules (see Typing and Comparison).

An important side effect of calling asort is that the array's original indices are irrevocably lost. As this isn't always desirable, asort accepts a second argument:

     populate the array source
     n = asort(source, dest)
     for (i = 1; i <= n; i++)
         do something with dest[i]

In this case, gawk copies the source array into the dest array and then sorts dest, destroying its indices. However, the source array is not affected.

Often, what's needed is to sort on the values of the indices instead of the values of the elements. To do that, starting with gawk 3.1.2, use the asorti function. The interface is identical to that of asort, except that the index values are used for sorting, and become the values of the result array:

     { source[$0] = some_func($0) }
     
     END {
         n = asorti(source, dest)
         for (i = 1; i <= n; i++) {
             do something with dest[i]             Work with sorted indices directly
             ...
             do something with source[dest[i]]     Access original array via sorted indices
         }
     }

If your version of gawk is 3.1.0 or 3.1.1, you don't have asorti. Instead, use a helper array to hold the sorted index values, and then access the original array's elements. It works in the following way:

     populate the array data
     # copy indices
     j = 1
     for (i in data) {
         ind[j] = i    # index value becomes element value
         j++
     }
     n = asort(ind)    # index values are now sorted
     for (i = 1; i <= n; i++) {
         do something with ind[i]           Work with sorted indices directly
         ...
         do something with data[ind[i]]     Access original array via sorted indices
     }

Sorting the array by replacing the indices provides maximal flexibility. To traverse the elements in decreasing order, use a loop that goes from n down to 1, either over the elements or over the indices.

Copying array indices and elements isn't expensive in terms of memory. Internally, gawk maintains reference counts to data. For example, when asort copies the first array to the second one, there is only one copy of the original array elements' data, even though both arrays use the values. Similarly, when copying the indices from data to ind, there is only one copy of the actual index strings.

We said previously that comparisons are done using gawk's “usual comparison rules.” Because IGNORECASE affects string comparisons, the value of IGNORECASE also affects sorting for both asort and asorti. Caveat Emptor.