The Hackerlab at regexps.com

Posix Regexps


The Posix.2 standard (ISO/IEC 9945-2:1993 (ANSI/IEEE Std 1003.2-1992), section 2.8) specifies the syntax and semantics of a "Regular Expression Notation". Appendix "B.5" defines a "C Binding for RE Matching" which includes the functions regcomp (to compile a Posix regexp), regexec (to search for a match), regerror (to translate a regexp error code into a string), and regfree (to release storage associated with a compiled regexp).

The Hackerlab C library provides Rx, an implementation of the Posix.2 regexp functions (with some extensions).

This chapter describes that interface. An appendix to this manual contains an introduction to Posix regexps. (See An Introduction to Posix Regexps.) If you are unfamiliar with regexps, reading that appendix before reading this chapter may be helpful.

This chapter begins with documentation for the standard Posix functions (and some closely related non-standard extensions). If you are looking for a programmer's reference manual, see Posix Regexp Functions.

The Posix standard for regexps is precise. On the other hand, it is often implemented incorrectly and almost never implemented completely. A discussion of the relation between the Posix standard and Rx can be found in Rx and Conformance to the Posix Standard.

Obtaining good performance from regexp matchers is sometimes complicated: they are easy to understand and use in many common situations, but they require careful attention in applications that use complicated regexps and in applications that use regexp matching heavily. Some general advice can be found in Rx and the Performance of Regexp Pattern Matching.

Finally, if you are performance tuning a regexp-intensive application, you'll need to understand the non-standard interfaces in Tuning Posix Regexp and XML Regular Expression Performance.


Posix Regexp Functions



#include <sys/types.h>
#include <hackerlab/rx-posix/regex.h>

The standard Posix regexp functions provided by Rx are:

     regcomp
     regexec
     regfree
     regerror

Two closely related but nonstandard functions are also provided:

     regncomp
     regnexec

Function regcomp

int regcomp (regex_t * preg, const char * pattern, int cflags);

Compile the 0-terminated regexp specification pattern.

The compiled pattern is stored in *preg, which has the field (required by Posix):

     size_t re_nsub;            The number of parenthesized
                                subexpressions in the compiled
                                pattern.

cflags is a combination of bits which affect compilation:


enum rx_cflags
{
  REG_EXTENDED = 1,
    /*  If REG_EXTENDED is set, then use extended regular expression
        syntax.  If not set, then use basic regular expression
        syntax. In extended syntax, none of the regexp operators are
        written with a backslash. */


  REG_ICASE = (REG_EXTENDED << 1),
  /*    If REG_ICASE is set, then ignore case when matching.  If not
        set, then case is significant. */

 
  REG_NOSUB = (REG_ICASE << 1),
  /*    Report only success/failure in `regexec'.
        Using this flag can improve performance for
        some regexps. */

  REG_NEWLINE = (REG_NOSUB << 1),
  /*    If REG_NEWLINE is set, then "." and complemented character
        sets do not match at newline characters in the string.  Also,
        "^" and "$" do match at newlines.
        
        If not set, then anchors do not match at newlines and the
        character sets contain newline.*/

  REG_DFA_ONLY = (REG_NEWLINE << 1),
  /*    If this bit is set, then restrict the pattern 
        language to patterns that compile to efficient 
        state machines.  In particular, `regexec' will
        not report positions for parenthesized subexpressions;
         "^", "$", backreferences ("\n"), and duplication
         ("{n,m}") are interpreted as normal characters.

        REG_DFA_ONLY is a non-standard flag. */
};

regcomp returns 0 on success and an error code on failure (see regerror).
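
As a rough illustration of the compile step and of the cflags bits described above, here is a minimal sketch. It assumes the system <regex.h> as a stand-in for <hackerlab/rx-posix/regex.h> (the standard functions and flags share the Posix interface), and the helper names count_subexpressions and matches_with_flags are hypothetical, not part of Rx.

```c
#include <assert.h>
#include <sys/types.h>
#include <regex.h>  /* assumption: system header; with Rx, <hackerlab/rx-posix/regex.h> */

/* Compile `pattern' and return its re_nsub, or -1 if regcomp fails. */
long
count_subexpressions (const char * pattern, int cflags)
{
  regex_t preg;
  long nsub;

  assert (pattern != 0);
  if (regcomp (&preg, pattern, cflags) != 0)
    return -1;
  nsub = (long) preg.re_nsub;
  regfree (&preg);
  return nsub;
}

/* Return 1 if `text' contains a match, 0 if it does not,
   and -1 if the pattern does not compile. */
int
matches_with_flags (const char * pattern, int cflags, const char * text)
{
  regex_t preg;
  int status;

  assert (pattern != 0 && text != 0);
  if (regcomp (&preg, pattern, cflags) != 0)
    return -1;
  /* nmatch is 0, so no match positions are requested. */
  status = regexec (&preg, text, 0, 0, 0);
  regfree (&preg);
  return status == 0;
}
```

For instance, matches_with_flags ("^b$", REG_NEWLINE, "a\nb\nc") should report a match because the anchors match at newlines, while the same call with cflags 0 should not.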



Function regncomp

int regncomp (regex_t * preg,
              const char * pattern,
              size_t len,
              int cflags);

Compile the len-byte regexp specification pattern.

The compiled pattern is stored in *preg, which has the field (required by Posix):

     size_t re_nsub;            The number of parenthesized
                                subexpressions in the compiled
                                pattern.

cflags is a combination of bits which affect compilation. See regcomp.

regncomp returns 0 on success and an error code on failure (see regerror).

Note: regncomp is not part of the Posix.2 interface for regexp matching. It is an Rx extension.



Function regexec

int regexec (const regex_t *preg,
             const char *string,
             size_t nmatch,
             regmatch_t pmatch[],
             int eflags);

Search for a match of compiled regexp preg in string. Return the positions of the match and the first nmatch-1 parenthesized subexpressions in pmatch.

Return 0 if a match is found, an error code otherwise. See regerror.

It is possible to asynchronously abort a call to regexec. See Escaping Long-Running Matches.

preg must have been filled in by regcomp or regncomp.

string must be 0-terminated. For strings that are not 0-terminated, see regnexec.

nmatch may be 0. It is declared size_t, an unsigned type, so callers must take care not to pass a negative value, which would be converted to a very large count. It is the number of elements in the array pointed to by pmatch.

pmatch may be 0 if nmatch is 0. The details of regmatch_t are:

struct rx_registers
{
  regoff_t rm_so;       /* Byte offset to substring start.  */
  regoff_t rm_eo;       /* Byte offset to substring end.  */

  int final_tag;        /* In pmatch[0] this field is set to
                         * the state label of the last DFA state 
                         * encountered during a match.
                         * 
                         * This field is implementation specific.
                         * Applications which intend to be portable
                         * between implementations of Posix should
                         * not use this field.
                         */
};

The state label of the final DFA state for most regexps is 1. If a pattern contains the cut operator [[:cut <n>:]], its DFAs will contain a final state with label n at that point in the regexp. This is useful for detecting which of several possible alternatives actually occurred in a match, as in this example:

     pattern: if[[:cut 1:]]\\|while[[:cut 2:]]

       pmatch[0].final_tag is 1 after matching "if" 
       pmatch[0].final_tag is 2 after matching "while"

eflags is a bit-wise or (|) of any of these values:

enum rx_eflags
{
  REG_NOTBOL = 1,
  /* If REG_NOTBOL is set, then the beginning-of-line operator `^'
   * doesn't match the beginning of the input string (presumably
   * because it's not the beginning of a line).  If not set, then the
   * beginning-of-line operator does match the beginning of the
   * string.
   * 
   * (Standardized in Posix.2)
   */

  REG_NOTEOL = (REG_NOTBOL << 1),
  /* REG_NOTEOL is similar to REG_NOTBOL, except that it applies to
   * the end-of-line operator `$' and the end of the input string.
   * 
   * (Standardized in Posix.2)
   */

  REG_NO_SUBEXP_REPORTING = (REG_NOTEOL << 1),
  /* REG_NO_SUBEXP_REPORTING causes `regexec' to fill in only
   * `pmatch[0]' and to ignore other elements of `pmatch'.  For some
   * patterns (those which do not contain back-references or anchors)
   * this can speed up matching considerably.
   * 
   * (non-standard)
   */

  REG_ALLOC_REGS = (REG_NO_SUBEXP_REPORTING << 1),
  /* REG_ALLOC_REGS is only used by `regnexec'.  It causes `regnexec' 
   * to allocate storage for `regmatch_t' values.
   * 
   * (non-standard)
   */
};
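
One common use of REG_NOTBOL is resuming a search partway through a string, where the new starting point is not the beginning of a line. The following is only a sketch under assumptions: it uses the system <regex.h> in place of the Rx header, and count_matches is a hypothetical helper, not an Rx function.

```c
#include <assert.h>
#include <sys/types.h>
#include <regex.h>  /* assumption: system header; with Rx, <hackerlab/rx-posix/regex.h> */

/* Count non-overlapping matches of `pattern' (basic syntax) in `text'.
   Returns -1 if the pattern does not compile. */
int
count_matches (const char * pattern, const char * text)
{
  regex_t preg;
  regmatch_t m[1];
  const char * p = text;
  int eflags = 0;
  int count = 0;

  assert (pattern != 0 && text != 0);
  if (regcomp (&preg, pattern, 0) != 0)
    return -1;

  while (regexec (&preg, p, 1, m, eflags) == 0)
    {
      ++count;
      if (m[0].rm_eo > 0)
        p += m[0].rm_eo;        /* resume just after the match */
      else if (*p != '\0')
        ++p;                    /* empty match: step over one byte */
      else
        break;                  /* empty match at the end of the text */

      /* `p' no longer points at the beginning of the line, so `^'
         must not match there on later calls. */
      eflags = REG_NOTBOL;
    }

  regfree (&preg);
  return count;
}
```

With this helper, the pattern ^a is counted once in "aaa": after the first match, REG_NOTBOL keeps ^ from matching at each resumed position.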

The match returned satisfies the left-most longest rule: of all possible matches of the overall regexp, one that begins earliest in the string is returned, and of those left-most matches, one of the longest is returned.

There may be more than one longest match because two matches of equal length may differ in how they fill in the array pmatch. For example:

     "aaaabbbb" can match \(a*\)\(a*b*\)
          with pmatch[1] == "aaaa"   [*]
           and pmatch[2] == "bbbb"
        or
          with pmatch[1] == "aaa"
           and pmatch[2] == "abbbb"
        or
          with pmatch[1] == "aa"
           and pmatch[2] == "aabbbb"
        or
          with pmatch[1] == "a"
           and pmatch[2] == "aaabbbb"
        or
          with pmatch[1] == ""
           and pmatch[2] == "aaaabbbb"

Of the possible values of pmatch, Rx implements the standard behavior of returning the match which recursively maximizes the lengths of the substrings matched by each subpattern, from left to right. In the preceding example, the correct answer is marked with [*].
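
The "aaaabbbb" example can be checked with a short program. This is a minimal sketch, assuming the system <regex.h> in place of the Rx header; the helper submatch is hypothetical. It relies on the Posix convention that offsets of subexpressions which did not participate in the match are set to -1.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>  /* assumption: system header; with Rx, <hackerlab/rx-posix/regex.h> */

/* Match basic-syntax `pattern' against `text' and return the length of
   the substring matched by subexpression `n' (0 means the whole match).
   If `out' is not 0, a copy truncated to `out_size' - 1 bytes is stored
   there.  Returns -1 if there is no match, if subexpression `n' did not
   participate, or if the pattern does not compile. */
long
submatch (const char * pattern, const char * text, size_t n,
          char * out, size_t out_size)
{
  enum { MAX_REGS = 10 };
  regex_t preg;
  regmatch_t pmatch[MAX_REGS];
  long len = -1;

  assert (pattern != 0 && text != 0 && n < MAX_REGS);
  if (regcomp (&preg, pattern, 0) != 0)
    return -1;

  if (regexec (&preg, text, MAX_REGS, pmatch, 0) == 0
      && pmatch[n].rm_so != -1)
    {
      len = (long) (pmatch[n].rm_eo - pmatch[n].rm_so);
      if (out != 0 && out_size > 0)
        {
          size_t copy = (size_t) len < out_size ? (size_t) len : out_size - 1;
          memcpy (out, text + pmatch[n].rm_so, copy);
          out[copy] = '\0';
        }
    }

  regfree (&preg);
  return len;
}
```

Calling submatch ("\\(a*\\)\\(a*b*\\)", "aaaabbbb", 1, buf, sizeof buf) should leave "aaaa" in buf, the answer marked [*] above.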



Function regnexec

int regnexec (const regex_t *preg,
              const char *string,
              regoff_t length,
              size_t nmatch,
              regmatch_t ** pmatch,
              int eflags);

Search for a match of compiled regexp preg in string. Return the positions of the match and the first nmatch-1 parenthesized subexpressions in *pmatch.

Return 0 if a match is found, an error code otherwise. See regerror.

preg must have been filled in by regcomp or regncomp.

string must be length bytes long.

See regexec for details about other parameters, but note that regnexec and regexec use different types for the parameter pmatch.

In regexec, pmatch is only used to pass a pointer. In regnexec, pmatch is used both to pass a pointer, and to return a pointer to the caller.

Callers are permitted to pass 0 for nmatch and pmatch. Callers are also permitted to pass the address of a pointer whose value is 0 for parameter pmatch. If they do so, and also set the bit REG_ALLOC_REGS in eflags, then pmatch will be a return parameter, returning a malloced array of preg->re_nsub elements containing the sub-expression positions of a successful match.

It is possible to asynchronously abort a call to regnexec. See Escaping Long-Running Matches.

Note: regnexec is not part of the Posix.2 interface for regexp matching. It is an Rx extension.



Function regfree

void regfree (regex_t *preg);

Release all storage allocated for the compiled regexp preg. This does not free preg itself.



Function regerror

size_t regerror (int errcode,
                 const regex_t *preg,
                 char *errbuf,
                 size_t errbuf_size);

Returns a message corresponding to an error code, errcode, returned from either regcomp or regexec. The size of the message is returned. At most errbuf_size - 1 characters of the message are copied to errbuf. Whatever is stored in errbuf is 0-terminated.

The POSIX error codes for regexp pattern matchers are:

     REG_NOMATCH     "no match"
     REG_BADPAT      "invalid regular expression"
     REG_ECOLLATE    "invalid collation character"
     REG_ECTYPE      "invalid character class name"
     REG_EESCAPE     "trailing backslash"
     REG_ESUBREG     "invalid back reference"
     REG_EBRACK      "unmatched [ or [^"
     REG_EPAREN      "unmatched (, \\(, ) or \\)"
     REG_EBRACE      "unmatched \\{"
     REG_BADBR       "invalid content of \\{\\}"
     REG_ERANGE      "invalid range end"
     REG_ESPACE      "memory exhausted"
     REG_BADRPT      "invalid preceding regular expression"

Rx also provides a non-standard error code that is used if regexec or regnexec is interrupted (see Escaping Long-Running Matches).

     REG_MATCH_INTERRUPTED   "match interrupted"
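
A typical way to report a failed compilation is to pass the error code straight to regerror. The sketch below assumes the system <regex.h> in place of the Rx header; compile_error and its fixed-size static buffer are hypothetical choices made for brevity (a static buffer is not thread-safe), and the message text itself varies between implementations.

```c
#include <assert.h>
#include <sys/types.h>
#include <regex.h>  /* assumption: system header; with Rx, <hackerlab/rx-posix/regex.h> */

/* Holds the most recent message; regerror truncates to fit and always
   0-terminates what it stores. */
static char compile_error_message[64];

/* Try to compile `pattern' with the given cflags.  Return 0 if it
   compiles, otherwise a pointer to a human-readable error message. */
const char *
compile_error (const char * pattern, int cflags)
{
  regex_t preg;
  int errcode;

  assert (pattern != 0);
  errcode = regcomp (&preg, pattern, cflags);
  if (errcode == 0)
    {
      regfree (&preg);
      return 0;
    }

  /* The return value (the size of the message) is ignored here; a
     caller that needs the full text could use it to size a buffer. */
  (void) regerror (errcode, &preg, compile_error_message,
                   sizeof compile_error_message);
  return compile_error_message;
}
```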




Rx and Conformance to the Posix Standard


Posix specifies the behavior of the C regexp functions quite precisely. Rx attempts to honor this specification to the greatest extent possible in a portable implementation. There are two areas of conformance that are worthy of note: the question of what is matched by a given regexp, and the question of how regexp matching interacts with Posix locales.

The question of "what is matched" is worthy of note because obtaining correct behavior has proven to be extremely difficult -- few implementations succeed. The implementation of Rx has been carefully developed and tested with correctness in mind.

The question of how regexps and locales interact is worthy of note because it is impossible to completely implement the behavior specified by Posix in a portable implementation (i.e., in an implementation that is not intimately familiar with the non-standard internals of a particular implementation of the standard C library).

What is Matched by a Posix Regexp

Posix requires that when regexec reports the position of a matching substring, it must report the first-occurring ("leftmost") match. Of the possible first-occurring matches, the longest match must be returned. This is called the left-most longest rule.

Posix requires that when regexec reports the position of a substring matched by a parenthesized subexpression, it must report the last substring matched by that subexpression. If one parenthesized expression (the "inner expression") is enclosed in another (the "outer expression") and the inner expression did not participate in the last match of the outer expression, then no substring match may be reported for the inner expression.

Finally, Posix requires that when regexec determines what each subpattern matched (regardless of whether the subpattern is surrounded by parentheses), if there is more than one possibility, regexec must choose the possibility that first maximizes the length of substrings matched by the outermost subpatterns, from left-to-right, and then recursively apply the same rule to inner subpatterns. However, this rule is subordinate to the left-most longest rule: if an earlier-occurring or longer overall match can be obtained by returning a non-maximal match for some subpattern, regexec must return that earlier or longer match.

This combination of constraints completely determines the return value of regexec and describes the behavior of Rx. Many other implementations do not conform to the standard in this regard -- in exceptional situations, compatibility issues may arise when switching between regexp implementations.
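
The left-most longest rule is easiest to see with alternation, where a matcher that simply takes the first alternative that succeeds would stop too early. The following is a small sketch, assuming the system <regex.h> in place of the Rx header; overall_match_length is a hypothetical helper, and a non-conforming matcher may give different results.

```c
#include <assert.h>
#include <sys/types.h>
#include <regex.h>  /* assumption: system header; with Rx, <hackerlab/rx-posix/regex.h> */

/* Return the length of the overall match of extended-syntax `pattern'
   in `text', or -1 if there is no match or the pattern does not
   compile. */
long
overall_match_length (const char * pattern, const char * text)
{
  regex_t preg;
  regmatch_t m[1];
  long len = -1;

  assert (pattern != 0 && text != 0);
  if (regcomp (&preg, pattern, REG_EXTENDED) != 0)
    return -1;
  if (regexec (&preg, text, 1, m, 0) == 0)
    len = (long) (m[0].rm_eo - m[0].rm_so);
  regfree (&preg);
  return len;
}
```

Under the rule, a|ab applied to "ab" must report the two-character match "ab" rather than stopping at "a", and b|ab applied to "ab" must report the match starting at offset 0.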

Rx and Posix Locales

Posix requires that a character set containing a multi-character collating element be treated specially. For example, if the character sequence ch is a collating element, then the regexp:

     [[.ch.]]

will match ch . On the other hand, if ch is not a collating element, the same expression is illegal. Similarly, an expression like:

     [a-z]

should match all collating elements which collate between "a" and "z" (inclusively), including multi-character collating elements that collate between "a" and "z".

Unfortunately, Posix provides no portable mechanism for determining which sequences of characters are multi-character collating elements, and which are not. Consequently, Rx operates as if multi-character collating elements did not exist.

Posix also defines a character set construct called an "equivalence class":

     [[=<X>=]]       where <X> is a collating element

An equivalence class stands for the set of all collating elements having the same primary collation weight as <X> . Unfortunately, Posix provides no portable mechanism for determining the primary collation weight of any collating element. Consequently, Rx implements the equivalence class construct by returning an error from regcomp whenever it is used.

Posix requires that in a character set, a range of characters such as:

     [a-z]

includes all characters that collate between a and z in the current locale. Some people argue that this behavior is confusing: that character ranges should be based on the encoding values of characters -- not on the rules of collation. Because of differences in collation, Posix advises that character ranges are a non-portable construct: portable programs should not use them at all!

Rx conforms to Posix by using collation order to interpret character ranges (with the exception that Rx always behaves as if there are no multi-character collating elements). Using the C locale when calling regcomp and regexec ensures that character ranges will be interpreted in a way consistent with the ASCII character set.
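
A program that wants ASCII-ordered ranges can select the C locale before compiling and matching. This is only a sketch under assumptions: it uses the system <regex.h> in place of the Rx header, the helper is_ascii_lowercase_letter is hypothetical, and setlocale changes process-wide state, which may not be acceptable in every application.

```c
#include <assert.h>
#include <locale.h>
#include <sys/types.h>
#include <regex.h>  /* assumption: system header; with Rx, <hackerlab/rx-posix/regex.h> */

/* Return 1 if `text' is exactly one character in the range [a-z] as
   interpreted in the C locale, 0 if it is not, and -1 on failure. */
int
is_ascii_lowercase_letter (const char * text)
{
  regex_t preg;
  int status;

  assert (text != 0);

  /* Select the C locale so that the range follows ASCII order. */
  if (setlocale (LC_ALL, "C") == 0)
    return -1;

  if (regcomp (&preg, "^[a-z]$", 0) != 0)
    return -1;
  status = regexec (&preg, text, 0, 0, 0);
  regfree (&preg);
  return status == 0;
}
```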


Rx and the Performance of Regexp Pattern Matching


The performance (speed and memory use) of any Posix regexp matcher (including Rx) is a complicated subject. Programmers who want to use regexps will benefit by understanding the issues, at least in broad outline, so that they can avoid pitfalls, so they can make the best possible use of a particular implementation (Rx, in this case), and so they know where to delve deeper when performance issues become particularly important.

Traditionally, many programmers use regexps as if they were always computationally inexpensive. This is naive. Some uses of regexps are inexpensive, others are intractable. Many fall somewhere in the middle. Which uses fall into which cases varies between implementations.

Posix Regexp Performance in General

This section describes the performance of Posix regexp matching in general -- it is not specific to Rx.

Posix regexp pattern matching, in its full generality, is a computationally expensive process. In some cases, involving only moderately sized regexps, it is intractably expensive, regardless of what implementation is being used. Thus, one should never write programs with the assumption that regexec will always return normally in a reasonable amount of time. (Rx includes non-standard functionality which can be used to interrupt a call to regexec after a time-out. See Escaping Long-Running Matches.)

On the other hand, for many very simple regexps, Posix regexp matching is very inexpensive, again, (nearly) regardless of implementation. For example, if a regexp is simply a string of literal characters, searching for a match is almost certain to be fast.

Implementations of Posix regexps often differ in the set of optimizations they provide. Simplistic implementations, containing few optimizations, perform well for small and simple regexps, but poorly in many other cases. Sophisticated implementations can perform very well on even large regexps if those regexps are true regular expressions or are nearly true regular expressions. (For an explanation of the distinction between regexps in general and "true regular expressions", see An Introduction to Posix Regexps.)

Implementations of Posix regexps often differ in correctness, and this has a bearing on performance. Several popular implementations sometimes give incorrect results. The bugs that cause those errors also improve the performance of the same matchers on some patterns for which they give correct results. Thus, programmers choosing an implementation are sometimes faced with the uncomfortable trade-off between the best performance bench-mark results, and the best correctness testing results. In such situations, an important question is the relevance of the tests: do the bench-mark tests accurately reflect regexp usage in the target application? What are the risks of using an incorrect matcher in the application? Consider whether the better performance of buggy matchers on some expressions is offset by their considerably worse performance on other expressions: it is not the case that the buggy implementations are always faster.

Posix Regexp Performance in Rx

This section describes the performance of Posix regexp matching in Rx.

Rx is designed to give excellent performance over the widest possible range of regexps (including many large, complicated regexps), but to never sacrifice correctness. While Rx is at least competitive with most implementations on most regexps (and is sometimes much faster), there are some regexps for which Rx is much slower than other implementations. Often, this difference can be attributed to the bugs in other implementations which speed up some cases while getting other cases wrong. This is something to keep in mind when comparing Rx to other implementations.

When a trade-off is necessary between memory use and speed, Rx is designed to allow programmers to choose how much memory to use and to provide programmers with the tools necessary to tune memory use for best possible performance. Rx can operate usefully (though comparatively slowly) with as little as 10-20K of dynamically allocated memory. As a rule of thumb, Rx's default of approximately 2MB is suitable even for applications that use regexps fairly heavily. (See Tuning Posix Regexp and XML Regular Expression Performance.)

Rx contains optimizations targeted for regexps which are true regular expressions. Rx converts true regular expressions to deterministic automata and can compare such expressions to a potentially matching string very quickly. This optimization makes it practical to use even quite large regular expressions. For more information, see the appendix Data Sheet for the Hackerlab Rx Posix Regexp Functions.

Sometimes regexps which are not true regular expressions can be matched as quickly as if they were true regular expressions. If a regexp is not a regular expression only because it begins with a left anchor (^) and/or ends with a right anchor ($), Rx can match the expression as quickly as a true regular expression. If, in addition, the regexp contains parenthesized subexpressions, Rx can use regular expression optimizations if either the REG_NOSUB flag is passed to regcomp, or the nmatch parameter to regexec is less than or equal to 1 (i.e., if regexec is not required to report the positions of substrings matched by parenthesized subexpressions).
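
When only a yes/no answer is needed, an anchored pattern with parenthesized subexpressions can be compiled with REG_NOSUB so that no subexpression positions are requested. The sketch below assumes the system <regex.h> in place of the Rx header, and is_match_nosub is a hypothetical helper; whether this actually speeds up matching depends on the implementation in use.

```c
#include <assert.h>
#include <sys/types.h>
#include <regex.h>  /* assumption: system header; with Rx, <hackerlab/rx-posix/regex.h> */

/* Return 1 if extended-syntax `pattern' matches somewhere in `text',
   0 if it does not, and -1 if the pattern does not compile.  No match
   positions are requested: REG_NOSUB is set and nmatch is 0. */
int
is_match_nosub (const char * pattern, const char * text)
{
  regex_t preg;
  int status;

  assert (pattern != 0 && text != 0);
  if (regcomp (&preg, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  status = regexec (&preg, text, 0, 0, 0);
  regfree (&preg);
  return status == 0;
}
```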

If a regexp is not a regular expression because it contains backreferences (\n) or counted iterations (RE{n,m}), Rx's DFA optimizations do not apply in their full generality. Such regexps run the greatest risk of being slow.

The Rx implementation of regcomp supports a non-standard flag, REG_DFA_ONLY, which can be used to disable all regexp constructs that are forbidden in true regular expressions. See regcomp.

When regexps are being used for lexical analysis, good performance can often be achieved by using true regular expressions in combination with the non-standard regexp operator [[:cut n:]]. See The cut Operator.


Escaping Long-Running Matches



#include <hackerlab/rx/escape.h>

Regexp searches can take a long time. Rx makes provisions for asynchronously aborting a long-running match. When regexec or regnexec is aborted, it returns REG_MATCH_INTERRUPTED. Callers of other match functions (such as rx_xml_is_match) can catch asynchronous interrupts using the jump buffer rx_escape_jmp_buf (documented below).

Asynchronous match interrupts are permitted whenever Rx calls the function pointed to by rx_poll (see below). If that pointer is 0, no interrupts will occur. If it points to a function, that function may cause an interrupt by calling longjmp to reach the point from which the interrupt resumes.

By convention, the global jump buffer rx_escape_jmp_buf is used. To cause an interrupt the next time rx_poll is called, set rx_poll to the function rx_escape, which performs a longjmp to rx_escape_jmp_buf.

Variable rx_poll

extern void (*rx_poll)(void);

A function pointer that is called by Rx (if not 0) whenever it is safe to interrupt an on-going search.



Variable rx_escape_jmp_buf

The conventionally used jump buffer for defining where to resume execution after an Rx match function is interrupted.

See rx_escape.



Function rx_escape

void rx_escape (void);

This function is conventionally used for interrupting a long-running Rx match function. To cause an interrupt of an on-going match from an asynchronously called function, such as a signal handler, assign rx_escape to rx_poll and return normally from the asynchronously called function. When rx_poll is next called, rx_escape will assign 0 to rx_poll and longjmp to rx_escape_jmp_buf. rx_escape is quite simple:

void
rx_escape (void)
{
  rx_poll = 0;
  longjmp (rx_escape_jmp_buf, 1);
}

Here is how rx_escape might be used in conjunction with a signal handler while using regexec:

     void
     match_fn (void)
     {
        int status;
        ...;
        status = regexec (...);
        rx_poll = 0;                 // prevent a race condition
        if (status == REG_MATCH_INTERRUPTED)
          {
             /* matching was cut short */
          }
     }

     void
     signal_handler (int signal)
     {
       rx_poll = rx_escape;          // interrupt an ongoing match.
     }

Here is how the same signal handler might be used in conjunction with a Unicode regular expression match function (such as rx_xml_is_match):

     void
     match_fn (void)
     {
        int status;
        ...;
        if (setjmp (rx_escape_jmp_buf))
          {
             /* matching was cut short */
             return;
          }
        status = rx_xml_is_match (...);
        rx_poll = 0;                 // prevent a race condition
        ...
     }
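
The control flow of the poll-and-escape protocol can be tried out without Rx. The sketch below is a simulation only: demo_poll, demo_escape_jmp_buf, demo_escape and demo_long_match are hypothetical stand-ins for rx_poll, rx_escape_jmp_buf, rx_escape and a long-running match function, and the "signal handler" is imitated by a step counter inside the loop.

```c
#include <assert.h>
#include <setjmp.h>

/* Stand-ins (not Rx) for rx_poll and rx_escape_jmp_buf. */
static void (*demo_poll) (void) = 0;
static jmp_buf demo_escape_jmp_buf;

/* Stand-in for rx_escape: disarm the hook, then jump out. */
static void
demo_escape (void)
{
  demo_poll = 0;
  longjmp (demo_escape_jmp_buf, 1);
}

/* Stand-in for a long-running match.  It calls the poll hook at each
   point where interruption is safe.  At step `interrupt_at' it arms
   the hook, which is what a signal handler would do asynchronously. */
static void
demo_long_match (long steps, long interrupt_at)
{
  long i;

  for (i = 0; i < steps; ++i)
    {
      if (i == interrupt_at)
        demo_poll = demo_escape;
      if (demo_poll)
        demo_poll ();
    }
}

/* Return 1 if the simulated match was interrupted, 0 if it ran to
   completion. */
int
run_interruptible (long steps, long interrupt_at)
{
  assert (steps >= 0);
  if (setjmp (demo_escape_jmp_buf))
    return 1;                   /* matching was cut short */
  demo_long_match (steps, interrupt_at);
  demo_poll = 0;                /* prevent a late interrupt */
  return 0;
}
```

Passing an interrupt_at value that is never reached (such as -1) lets the simulated match finish normally.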



libhackerlab: The Hackerlab C Library