2.9 Where You Are Makes A Difference

Modern systems support the notion of locales: a way to tell the system about the local character set and language. The current locale setting can affect the way regexp matching works, often in surprising ways. In particular, in many locales a range expression such as `[A-Z]' follows the locale's collating order rather than ASCII order, so it can match lowercase letters as well, even when you have specified characters of only one particular case.

The following example uses the sub function, which does text replacement (see String Functions). Here, the intent is to remove trailing uppercase characters:

     $ echo something1234abc | gawk '{ sub("[A-Z]*$", ""); print }'
     -| something1234

This output is unexpected, because the `abc' at the end of `something1234abc' should not normally match `[A-Z]*'. The result is due to the locale setting (and thus you may not see it on your system). There are two fixes. The first is to use the POSIX character class `[[:upper:]]' instead of `[A-Z]'. The second is to change the locale setting in the environment before running gawk, by using the shell statements:

     LANG=C LC_ALL=C
     export LANG LC_ALL

The setting `C' forces gawk to behave in the traditional Unix manner, where case distinctions do matter. You may wish to put these statements into your shell startup file, e.g., $HOME/.profile.
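
As a rough sketch of both fixes applied to the earlier example, the following commands set the locale for a single invocation only, instead of exporting it for the whole session. The variable names (`AWK', `fix1', and so on) are illustrative and not part of gawk; the first line assumes gawk is on your search path and falls back to the system awk otherwise.

```shell
# Assumption: prefer gawk if installed, otherwise fall back to plain awk.
AWK=$(command -v gawk || command -v awk)

# Fix 1: a POSIX character class matches only uppercase letters,
# regardless of how the locale orders ranges.
fix1=$(echo something1234abc | "$AWK" '{ sub("[[:upper:]]*$", ""); print }')

# Fix 2: force the C locale for this one command only.
fix2=$(echo something1234abc | LC_ALL=C "$AWK" '{ sub("[A-Z]*$", ""); print }')

# Sanity check: trailing uppercase letters are still removed.
fix3=$(echo something1234ABC | LC_ALL=C "$AWK" '{ sub("[A-Z]*$", ""); print }')

echo "$fix1"
echo "$fix2"
echo "$fix3"
```

With either fix, the lowercase `abc' should be left in place, while the trailing `ABC' in the last command is removed. Setting the variable on the command line, as in `LC_ALL=C gawk ...', affects only that command and leaves the rest of your environment alone.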

Similar considerations apply to other ranges. For example, `["-/]' is perfectly valid in ASCII, but is not valid in many Unicode locales, such as `en_US.UTF-8'. (In general, such ranges should be avoided; either list the characters individually, or use a POSIX character class such as `[[:punct:]]'.)
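
The two alternatives to a range might look like the following sketch. The sample input and the variable names are made up for illustration, and the first line assumes gawk is on your search path (falling back to the system awk otherwise). Note that `[[:punct:]]' matches every punctuation character, which is a larger set than the range `["-/]' covers in ASCII, so it is not a drop-in replacement in every program.

```shell
# Assumption: prefer gawk if installed, otherwise fall back to plain awk.
AWK=$(command -v gawk || command -v awk)

# Alternative 1: list the wanted characters individually.
# A leading `-' inside the brackets is taken literally.
listed=$(echo 'a+b,c.d' | "$AWK" '{ gsub("[-+,./]", " "); print }')

# Alternative 2: use a POSIX character class instead of a range.
classed=$(echo 'a+b,c.d' | "$AWK" '{ gsub("[[:punct:]]", " "); print }')

echo "$listed"
echo "$classed"
```

Neither form depends on how the current locale orders characters, so both commands should behave the same way in `C' and in `en_US.UTF-8'.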

For the normal case of `RS = "\n"', the locale is largely irrelevant. For other single-byte record separators, using `LC_ALL=C' gives you much better performance when reading records. Otherwise, gawk has to make several function calls per input character to find the record terminator.
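
As a small illustration, the following sketch reads semicolon-separated records with the locale forced to `C' for that one command. The input data and variable names are hypothetical, and the first line assumes gawk is on your search path (falling back to the system awk otherwise). The example shows only how the setting is applied; any speed difference would be noticeable only on large inputs.

```shell
# Assumption: prefer gawk if installed, otherwise fall back to plain awk.
AWK=$(command -v gawk || command -v awk)

# Read records terminated by a single byte (`;') in the C locale.
records=$(printf 'alpha;beta;gamma' |
          LC_ALL=C "$AWK" 'BEGIN { RS = ";" } { print NR ": " $0 }')

echo "$records"
```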