String Functions - The GNU Awk User's Guide

8.1.3 String-Manipulation Functions

The functions in this section look at or change the text of one or more strings. Optional parameters are enclosed in square brackets ([ ]). Those functions that are specific to gawk are marked with a pound sign (`#'):

asort(source [, dest]) #

asort is a gawk-specific extension, returning the number of elements in the array source. The contents of source are sorted using gawk's normal rules for comparing values (in particular, IGNORECASE affects the sorting) and the indices of the sorted values of source are replaced with sequential integers starting with one. If the optional array dest is specified, then source is duplicated into dest. dest is then sorted, leaving the indices of source unchanged. For example, if the contents of a are as follows:

          a["last"] = "de"
          a["first"] = "sac"
          a["middle"] = "cul"

A call to asort:

          asort(a)

results in the following contents of a:

          a[1] = "cul"
          a[2] = "de"
          a[3] = "sac"

The asort function is described in more detail in Array Sorting. asort is a gawk extension; it is not available in compatibility mode (see Options).

asorti(source [, dest]) #

asorti is a gawk-specific extension, returning the number of elements in the array source. It works similarly to asort; however, the indices are sorted instead of the values. As array indices are always strings, the comparison performed is always a string comparison. (Here too, IGNORECASE affects the sorting.)

The asorti function is described in more detail in Array Sorting. It was added in gawk 3.1.2. asorti is a gawk extension; it is not available in compatibility mode (see Options).

index(in, find)

This searches the string in for the first occurrence of the string find, and returns the position in characters where that occurrence begins in the string in. Consider the following example:

          $ awk 'BEGIN { print index("peanut", "an") }'
          -| 3

If find is not found, index returns zero. (Remember that string indices in awk start at one.)
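Because zero is false in awk, the return value can serve directly as a “contains” test. A short sketch, with strings chosen only for illustration:

```shell
awk 'BEGIN {
    s = "peanut"
    print index(s, "nut")       # found: begins at character 4
    print index(s, "xyz")       # not found: 0
    if (index(s, "pea"))        # any nonzero position counts as true
        print "contains pea"
}'
# -| 4
# -| 0
# -| contains pea
```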

length([string])

This returns the number of characters in string. If string is a number, the length of the digit string representing that number is returned. For example, length("abcde") is 5. By contrast, length(15 * 35) works out to 3. In this example, 15 * 35 = 525, and 525 is then converted to the string "525", which has three characters.

If no argument is supplied, length returns the length of $0.

NOTE: In older versions of awk, the length function could be called without any parentheses. Doing so is marked as “deprecated” in the POSIX standard. This means that while a program can do this, it is a feature that can eventually be removed from a future version of the standard. Therefore, for programs to be maximally portable, always supply the parentheses.
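The three cases described above can be seen side by side in this small sketch (the input line is made up):

```shell
echo "hello world" | awk '{
    print length()              # no argument: length of $0
    print length($1)            # length of the first field
    print length(15 * 35)       # 525 is converted to the string "525"
}'
# -| 11
# -| 5
# -| 3
```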

match(string, regexp [, array])

The match function searches string for the longest, leftmost substring matched by the regular expression, regexp. It returns the character position, or index, at which that substring begins (one, if it starts at the beginning of string). If no match is found, it returns zero.

The regexp argument may be either a regexp constant (`/.../') or a string constant ("..."). In the latter case, the string is treated as a regexp to be matched. See Computed Regexps, for a discussion of the difference between the two forms, and the implications for writing your program correctly.

The order of the first two arguments is backwards from most other string functions that work with regular expressions, such as sub and gsub. It might help to remember that for match, the order is the same as for the `~' operator: `string ~ regexp'.

The match function sets the built-in variable RSTART to the index. It also sets the built-in variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to zero, and RLENGTH to -1.

For example:

          
          {
                 if ($1 == "FIND")
                   regex = $2
                 else {
                   where = match($0, regex)
                   if (where != 0)
                     print "Match of", regex, "found at",
                               where, "in", $0
                 }
          }

This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is `FIND', regex is changed to be the second word on that line. Therefore, if given:

          
          FIND ru+n
          My program runs
          but not very quickly
          FIND Melvin
          JF+KM
          This line is property of Reality Engineering Co.
          Melvin was here.

awk prints:

          Match of ru+n found at 12 in My program runs
          Match of Melvin found at 1 in Melvin was here.
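RSTART and RLENGTH are often combined with substr to extract the matched text itself. The following is only a sketch of that idiom, using one line from the sample input:

```shell
echo "My program runs" | awk '{
    if (match($0, /ru+n/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    if (!match($0, /xyz/))
        print RSTART, RLENGTH   # no match: 0 and -1
}'
# -| 12 3 run
# -| 0 -1
```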

If array is present, it is cleared, and then the 0th element of array is set to the entire portion of string matched by regexp. If regexp contains parentheses, the integer-indexed elements of array are set to contain the portion of string matching the corresponding parenthesized subexpression. For example:

          $ echo foooobazbarrrrr |
          > gawk '{ match($0, /(fo+).+(bar*)/, arr)
          >           print arr[1], arr[2] }'
          -| foooo barrrrr

In addition, beginning with gawk 3.1.2, multidimensional subscripts are available providing the start index and length of each matched subexpression:

          $ echo foooobazbarrrrr |
          > gawk '{ match($0, /(fo+).+(bar*)/, arr)
          >           print arr[1], arr[2]
          >           print arr[1, "start"], arr[1, "length"]
          >           print arr[2, "start"], arr[2, "length"]
          > }'
          -| foooo barrrrr
          -| 1 5
          -| 9 7

There may not be subscripts for the start and length of every parenthesized subexpression, since they may not all have matched text; thus they should be tested for with the in operator (see Reference to Elements).

The array argument to match is a gawk extension. In compatibility mode (see Options), using a third argument is a fatal error.

split(string, array [, fieldsep])

This function divides string into pieces separated by fieldsep and stores the pieces in array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If fieldsep is omitted, the value of FS is used. split returns the number of elements created.

The split function splits strings into pieces in a manner similar to the way input lines are split into fields. For example:

          split("cul-de-sac", a, "-")

splits the string `cul-de-sac' into three fields using `-' as the separator. It sets the contents of the array a as follows:

          a[1] = "cul"
          a[2] = "de"
          a[3] = "sac"

The value returned by this call to split is three.

As with input field-splitting, when the value of fieldsep is " ", leading and trailing whitespace is ignored, and the elements are separated by runs of whitespace. Also as with input field-splitting, if fieldsep is the null string, each individual character in the string is split into its own array element. (This is a gawk-specific extension.)

Note, however, that RS has no effect on the way split works. Even though `RS = ""' causes newline to also be an input field separator, this does not affect how split splits strings.

Modern implementations of awk, including gawk, allow the third argument to be a regexp constant (/abc/) as well as a string. (d.c.) The POSIX standard allows this as well. See Computed Regexps, for a discussion of the difference between using a string constant or a regexp constant, and the implications for writing your program correctly.

Before splitting the string, split deletes any previously existing elements in the array array.

If string is null, the array has no elements. (So this is a portable way to delete an entire array with one statement. See Delete.)

If string does not match fieldsep at all (but is not null), array has one element only. The value of that element is the original string.
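The special cases above can be checked with a short sketch. The strings are made up, and the loop simply counts whatever elements remain in a:

```shell
awk 'BEGIN {
    n = split("  cul   de sac  ", a, " ")   # " ": split on runs of whitespace
    print n, a[1], a[3]
    n = split("cul-de-sac", a, ",")         # separator absent: one element
    print n, a[1]
    n = split("", a)                        # null string: a is emptied
    count = 0
    for (k in a)
        count++
    print n, count
}'
# -| 3 cul sac
# -| 1 cul-de-sac
# -| 0 0
```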

sprintf(format, expression1, ...)

This returns (without printing) the string that printf would have printed out with the same arguments (see Printf). For example:

          pival = sprintf("pi = %.2f (approx.)", 22/7)

assigns the string "pi = 3.14 (approx.)" to the variable pival.
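Since the result is an ordinary string, it can be stored, measured, or concatenated before anything is printed. A small sketch, with a format and variable name chosen only for illustration:

```shell
awk 'BEGIN {
    label = sprintf("%-6s|%05d|%6.2f", "id", 42, 22/7)
    print label
    print length(label)         # the result is a string like any other
}'
# -| id    |00042|  3.14
# -| 19
```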

strtonum(str) #

Examines str and returns its numeric value. If str begins with a leading `0', strtonum assumes that str is an octal number. If str begins with a leading `0x' or `0X', strtonum assumes that str is a hexadecimal number. For example:

          $ echo 0x11 |
          > gawk '{ printf "%d\n", strtonum($1) }'
          -| 17

Using the strtonum function is not the same as adding zero to a string value; the automatic coercion of strings to numbers works only for decimal data, not for octal or hexadecimal.¹

strtonum is a gawk extension; it is not available in compatibility mode (see Options).

sub(regexp, replacement [, target])

The sub function alters the value of target. It searches this value, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Then the entire string is changed by replacing the matched text with replacement. The modified string becomes the new value of target.

This function is peculiar because target is not simply used to compute a value, and not just any expression will do—it must be a variable, field, or array element so that sub can store a modified value there. If this argument is omitted, then the default is to use and alter $0.² For example:

          str = "water, water, everywhere"
          sub(/at/, "ith", str)

sets str to "wither, water, everywhere", by replacing the leftmost longest occurrence of `at' with `ith'.

The sub function returns the number of substitutions made (either one or zero).

If the special character `&' appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:

          { sub(/candidate/, "& and his wife"); print }

changes the first occurrence of `candidate' to `candidate and his wife' on each input line. Here is another example:

          $ awk 'BEGIN {
          >         str = "daabaaa"
          >         sub(/a+/, "C&C", str)
          >         print str
          > }'
          -| dCaaCbaaa

This shows how `&' can represent a nonconstant string and also illustrates the “leftmost, longest” rule in regexp matching (see Leftmost Longest).

The effect of this special character (`&') can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write `\\&' in a string constant to include a literal `&' in the replacement. For example, the following shows how to replace the first `|' on each line with an `&':

          { sub(/\|/, "\\&"); print }

As mentioned, the third argument to sub must be a variable, field or array reference. Some versions of awk allow the third argument to be an expression that is not an lvalue. In such a case, sub still searches for the pattern and returns zero or one, but the result of the substitution (if any) is thrown away because there is no place to put it. Such versions of awk accept expressions such as the following:

          sub(/USA/, "United States", "the USA and Canada")

For historical compatibility, gawk accepts erroneous code, such as in the previous example. However, using any other nonchangeable object as the third parameter causes a fatal error and your program will not run.

Finally, if the regexp is not a regexp constant, it is converted into a string, and then the value of that string is treated as the regexp to match.
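The return value, the `&' character, and a regexp taken from a string can be seen together in this sketch. The input line and the variable name pat are made up for illustration:

```shell
echo "the USA and the USA" | awk '{
    n = sub(/USA/, "[&]")       # & stands for the matched text
    print n, $0
    n = sub(/Canada/, "x")      # no match: returns 0, $0 is unchanged
    print n, $0
    pat = "the"                 # a string value used as the regexp
    sub(pat, "The")
    print
}'
# -| 1 the [USA] and the USA
# -| 0 the [USA] and the USA
# -| The [USA] and the USA
```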

gsub(regexp, replacement [, target])

This is similar to the sub function, except gsub replaces all of the longest, leftmost, nonoverlapping matching substrings it can find. The `g' in gsub stands for “global,” which means replace everywhere. For example:

          { gsub(/Britain/, "United Kingdom"); print }

replaces all occurrences of the string `Britain' with `United Kingdom' for all input records.

The gsub function returns the number of substitutions made. If the variable to search and alter (target) is omitted, then the entire input record ($0) is used. As in sub, the characters `&' and `\' are special, and the third argument must be assignable.
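The return value makes gsub a convenient way to count matches. As a sketch (the strings are made up), replacing each match with `&' counts occurrences while leaving the target's text as it was:

```shell
echo "water, water, everywhere" | awk '{
    n = gsub(/er/, "ER")        # replace every match in $0
    print n, $0
    str = "banana"
    print gsub(/a/, "&", str), str   # count matches; text is unchanged
}'
# -| 4 watER, watER, evERywhERe
# -| 3 banana
```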

gensub(regexp, replacement, how [, target]) #

gensub is a general substitution function. Like sub and gsub, it searches the target string target for matches of the regular expression regexp. Unlike sub and gsub, the modified string is returned as the result of the function and the original target string is not changed. If how is a string beginning with `g' or `G', then it replaces all matches of regexp with replacement. Otherwise, how is treated as a number that indicates which match of regexp to replace. If no target is supplied, $0 is used.

gensub provides an additional feature that is not available in sub or gsub: the ability to specify components of a regexp in the replacement text. This is done by using parentheses in the regexp to mark the components and then specifying `\N' in the replacement text, where N is a digit from 1 to 9. For example:

          $ gawk '
          > BEGIN {
          >      a = "abc def"
          >      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
          >      print b
          > }'
          -| def abc

As with sub, you must type two backslashes in order to get one into the string. In the replacement text, the sequence `\0' represents the entire matched text, as does the character `&'.

The following example shows how you can use the third argument to control which match of the regexp should be changed:

          $ echo a b c a b c |
          > gawk '{ print gensub(/a/, "AA", 2) }'
          -| a b c AA b c

In this case, $0 is used as the default target string. gensub returns the new string as its result, which is passed directly to print for printing.

If the how argument is a string that does not begin with `g' or `G', or if it is a number that is less than or equal to zero, only one substitution is performed. If how is zero, gawk issues a warning message.

If regexp does not match target, gensub's return value is the original unchanged value of target.

gensub is a gawk extension; it is not available in compatibility mode (see Options).

substr(string, start [, length])

This returns a length-character-long substring of string, starting at character number start. The first character of a string is character number one.³ For example, substr("washington", 5, 3) returns "ing".

If length is not present, this function returns the whole suffix of string that begins at character number start. For example, substr("washington", 5) returns "ington". The whole suffix is also returned if length is greater than the number of characters remaining in the string, counting from character start.

If start is less than one, substr treats it as if it were one. (POSIX doesn't specify what to do in this case: Unix awk acts this way, and therefore gawk does too.) If start is greater than the number of characters in the string, substr returns the null string. Similarly, if length is present but less than or equal to zero, the null string is returned.
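The cases that do not depend on a start value below one can be sketched as follows; the brackets are there only to make null strings visible:

```shell
awk 'BEGIN {
    s = "washington"
    print "[" substr(s, 8) "]"          # suffix from character 8
    print "[" substr(s, 8, 100) "]"     # length past the end: same suffix
    print "[" substr(s, 11) "]"         # start past the end: null string
    print "[" substr(s, 5, 0) "]"       # zero length: null string
}'
# -| [ton]
# -| [ton]
# -| []
# -| []
```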

The string returned by substr cannot be assigned. Thus, it is a mistake to attempt to change a portion of a string, as shown in the following example:

          string = "abcdef"
          # try to get "abCDEf", won't work
          substr(string, 3, 3) = "CDE"

It is also a mistake to use substr as the third argument of sub or gsub:

          gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG

(Some commercial versions of awk do in fact let you use substr this way, but doing so is not portable.)

If you need to replace bits and pieces of a string, combine substr with string concatenation, in the following manner:

          string = "abcdef"
          ...
          string = substr(string, 1, 2) "CDE" substr(string, 6)
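If this is needed in several places, the idiom can be wrapped in a user-defined function. The function replace_at below is a hypothetical helper written for this sketch; it is not part of awk or gawk:

```shell
awk '
# replace_at is a hypothetical helper for this sketch, not a built-in.
function replace_at(s, start, len, repl) {
    # Keep the text before start, add repl, then keep the text after the span.
    return substr(s, 1, start - 1) repl substr(s, start + len)
}
BEGIN {
    print replace_at("abcdef", 3, 3, "CDE")
}'
# -| abCDEf
```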

tolower(string)

This returns a copy of string, with each uppercase character in the string replaced with its corresponding lowercase character. Nonalphabetic characters are left unchanged. For example, tolower("MiXeD cAsE 123") returns "mixed case 123".

toupper(string)

This returns a copy of string, with each lowercase character in the string replaced with its corresponding uppercase character. Nonalphabetic characters are left unchanged. For example, toupper("MiXeD cAsE 123") returns "MIXED CASE 123".
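One common use of these two functions is comparing strings without regard to case. A minimal sketch, with made-up input, that counts the lines equal to `yes' in any mix of upper and lower case:

```shell
printf 'Yes\nNO\nyes\n' |
awk 'tolower($0) == "yes" { count++ } END { print count }'
# -| 2
```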

Footnotes

[1] Unless you use the --non-decimal-data option, which isn't recommended. See Nondecimal Data, for more information.

[2] Note that this means that the record will first be regenerated using the value of OFS if any fields have been changed, and that the fields will be updated after the substitution, even if the operation is a “no-op” such as `sub(/^/, "")'.

[3] This is different from C and C++, in which the first character is number zero.