regexps.com
The Posix.2 standard (ISO/IEC 9945-2:1993 (ANSI/IEEE Std 1003.2-1992), section 2.8) specifies the syntax and semantics of a "Regular Expression Notation". Appendix "B.5" defines a "C Binding for RE Matching" which includes the functions regcomp (to compile a Posix regexp), regexec (to search for a match), regerror (to translate a regexp error code into a string), and regfree (to release storage associated with a compiled regexp).
The Hackerlab C library provides Rx, an implementation of the Posix.2 regexp functions (with some extensions).
This chapter describes that interface. An appendix to this manual contains an introduction to Posix regexps. (See An Introduction to Posix Regexps.) If you are unfamiliar with regexps, reading that appendix before reading this chapter may be helpful.
This chapter begins with documentation for the standard Posix functions (and some closely related non-standard extensions). If you are looking for a programmer reference manual, see Posix Regexp Functions.
The Posix standard for regexps is precise. On the other hand, it is often implemented incorrectly and almost never implemented completely. A discussion of the relation between the Posix standard and Rx can be found in Rx and Conformance to the Posix Standard.
Obtaining good performance from regexp matchers is sometimes complicated: they are easy to understand and use in many common situations, but they require careful attention in applications that use complicated regexps and in applications that use regexp matching heavily. Some general advice can be found in Rx and the Performance of Regexp Pattern Matching.
Finally, if you are performance tuning a regexp-intensive application, you'll need to understand the non-standard interfaces in Tuning Posix Regexp and XML Regular Expression Performance.
    #include <sys/types.h>
    #include <hackerlab/rx-posix/regex.h>
The standard Posix regexp functions provided by Rx are:
regcomp regexec regfree regerror
Two closely related but nonstandard functions are also provided:
regncomp regnexec
int regcomp (regex_t * preg, const char * pattern, int cflags);
Compile the 0-terminated regexp specification pattern.
The compiled pattern is stored in *preg, which has the field (required by Posix):
    size_t re_nsub;
The number of parenthesized subexpressions in the compiled pattern.
cflags is a combination of bits which affect compilation:
    enum rx_cflags
    {
      REG_EXTENDED = 1,
        /* If REG_EXTENDED is set, then use extended regular expression
         * syntax.  If not set, then use basic regular expression
         * syntax.  In extended syntax, none of the regexp operators
         * are written with a backslash.
         */

      REG_ICASE = (REG_EXTENDED << 1),
        /* If REG_ICASE is set, then ignore case when matching.
         * If not set, then case is significant.
         */

      REG_NOSUB = (REG_ICASE << 1),
        /* Report only success/failure in `regexec'.
         * Using this flag can improve performance for some regexps.
         */

      REG_NEWLINE = (REG_NOSUB << 1),
        /* If REG_NEWLINE is set, then "." and complemented character
         * sets do not match at newline characters in the string.  Also,
         * "^" and "$" do match at newlines.
         *
         * If not set, then anchors do not match at newlines
         * and the character sets contain newline.
         */

      REG_DFA_ONLY = (REG_NEWLINE << 1),
        /* If this bit is set, then restrict the pattern language to
         * patterns that compile to efficient state machines.  In
         * particular, `regexec' will not report positions for
         * parenthesized subexpressions; "^", "$", backreferences
         * ("\n"), and duplication ("{n,m}") are interpreted as normal
         * characters.
         *
         * REG_DFA_ONLY is a non-standard flag.
         */
    };
regcomp returns 0 on success and an error code on failure (see regerror).
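The compile step described above can be sketched as follows. This is a minimal illustration that assumes the standard <regex.h> header in place of <hackerlab/rx-posix/regex.h> (only the Posix calls regcomp and regerror are used); the helper name compile_or_report is hypothetical.

```c
#include <sys/types.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not part of Rx): compile `pattern' using
 * extended syntax.  Returns 0 on success.  On failure, prints the
 * message produced by regerror to stderr and returns the error
 * code that regcomp reported.
 */
int
compile_or_report (regex_t * preg, const char * pattern)
{
  int status = regcomp (preg, pattern, REG_EXTENDED);

  if (status != 0)
    {
      char msg[256];

      regerror (status, preg, msg, sizeof msg);
      fprintf (stderr, "regcomp failed: %s\n", msg);
    }
  return status;
}
```

After a successful call, preg->re_nsub holds the number of parenthesized subexpressions, and the caller is responsible for calling regfree when the compiled pattern is no longer needed.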
int regncomp (regex_t * preg, const char * pattern, size_t len, int cflags);
Compile the len-byte regexp specification pattern.
The compiled pattern is stored in *preg, which has the field (required by Posix):
    size_t re_nsub;
The number of parenthesized subexpressions in the compiled pattern.
cflags is a combination of bits which affect compilation. See regcomp.
regncomp returns 0 on success and an error code on failure (see regerror).
Note: regncomp is not part of the Posix.2 interface for regexp matching. It is an Rx extension.
int regexec (const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);
Search for a match of compiled regexp preg in string.
Return the positions of the match and the first nmatch-1 parenthesized subexpressions in pmatch.
Return 0 if a match is found, an error code otherwise. See regerror.
It is possible to asynchronously abort a call to regexec. See Escaping Long-Running Matches.
preg must have been filled in by regcomp or regncomp.
string must be 0-terminated. For strings that are not 0-terminated, see regnexec.
nmatch may be 0 and must not be negative (Posix specifies that the parameter be declared signed). It is the number of elements in the array pointed to by pmatch.
pmatch may be 0 if nmatch is 0. The details of regmatch_t are:
    struct rx_registers
    {
      regoff_t rm_so;   /* Byte offset to substring start.  */
      regoff_t rm_eo;   /* Byte offset to substring end.  */

      int final_tag;    /* In pmatch[0] this field is set to
                         * the state label of the last DFA state
                         * encountered during a match.
                         *
                         * This field is implementation specific.
                         * Applications which intend to be portable
                         * between implementations of Posix should
                         * not use this field.
                         */
    };
The state label of the final DFA state for most regexps is 1. If a pattern contains the cut operator [[:cut <n>:]] its DFAs will contain a final state with label n at that point in the regexp. This is useful for detecting which of several possible alternatives actually occurred in a match, as in this example:
    pattern: if[[:cut 1:]]\\|while[[:cut 2:]]
    pmatch[0].final_tag is 1 after matching "if"
    pmatch[0].final_tag is 2 after matching "while"
eflags is a bit-wise or (|) of any of these values:
    enum rx_eflags
    {
      REG_NOTBOL = 1,
        /* If REG_NOTBOL is set, then the beginning-of-line operator `^'
         * doesn't match the beginning of the input string (presumably
         * because it's not the beginning of a line).  If not set, then the
         * beginning-of-line operator does match the beginning of the
         * string.
         *
         * (Standardized in Posix.2)
         */

      REG_NOTEOL = (REG_NOTBOL << 1),
        /* REG_NOTEOL is similar to REG_NOTBOL, except that it applies to
         * the end-of-line operator `$' and the end of the input string.
         *
         * (Standardized in Posix.2)
         */

      REG_NO_SUBEXP_REPORTING = (REG_NOTEOL << 1),
        /* REG_NO_SUBEXP_REPORTING causes `regexec' to fill in only
         * `pmatch[0]' and to ignore other elements of `pmatch'.  For some
         * patterns (those which do not contain back-references or anchors)
         * this can speed up matching considerably.
         *
         * (non-standard)
         */

      REG_ALLOC_REGS = (REG_NO_SUBEXP_REPORTING << 1),
        /* REG_ALLOC_REGS is only used by `regnexec'.  It causes `regnexec'
         * to allocate storage for `regmatch_t' values.
         *
         * (non-standard)
         */
    };
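One common use of REG_NOTBOL is resuming a search in the middle of a line, where `^' should no longer match. The sketch below counts successive matches that way. It assumes the standard <regex.h> header rather than the Rx header and uses only the two standardized flags; count_matches is a hypothetical helper name.

```c
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper (not part of Rx): count non-overlapping
 * matches of the basic regexp `pattern' in `line'.  Every search
 * after the first starts in the middle of the line, so it passes
 * REG_NOTBOL to keep `^' from matching there.
 * Returns -1 if the pattern does not compile.
 */
int
count_matches (const char * pattern, const char * line)
{
  regex_t re;
  regmatch_t m[1];
  const char * pos = line;
  int eflags = 0;
  int count = 0;

  if (regcomp (&re, pattern, 0) != 0)
    return -1;

  while (*pos != 0 && regexec (&re, pos, 1, m, eflags) == 0)
    {
      ++count;
      /* Always make progress, even if the match ended at offset 0. */
      pos += (m[0].rm_eo == 0) ? 1 : m[0].rm_eo;
      eflags = REG_NOTBOL;
    }

  regfree (&re);
  return count;
}
```

With this helper, the pattern ^ab is counted once in "ababab" because only the first search treats the start of its input as the beginning of a line.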
The match returned satisfies the left-most longest rule: of all possible matches of the overall regexp, one that begins earliest in the string (a left-most match) is returned, and of the left-most matches, one of the longest is returned.
There may be more than one longest match because two matches of equal length may differ in how they fill in the array pmatch.
For example:
    "aaaabbbb" can match \(a*\)\(a*b*\)

         with pmatch[1] == "aaaa"   [*]
          and pmatch[2] == "bbbb"

      or with pmatch[1] == "aaa"
          and pmatch[2] == "abbbb"

      or with pmatch[1] == "aa"
          and pmatch[2] == "aabbbb"

      or with pmatch[1] == "a"
          and pmatch[2] == "aaabbbb"

      or with pmatch[1] == ""
          and pmatch[2] == "aaaabbbb"
Of the possible values of pmatch, Rx implements the standard behavior of returning that match which recursively maximizes the lengths of the substrings matched by each subpattern, from left to right. In the preceding example, the correct answer is marked with [*].
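The example above can be reproduced with a short program fragment. This is a sketch that assumes the standard <regex.h> header in place of the Rx header; match_a_b is a hypothetical helper name.

```c
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper (not part of Rx): match the basic regexp
 * \(a*\)\(a*b*\) against `s'.  On success, pmatch[0] describes the
 * overall match and pmatch[1], pmatch[2] describe the two
 * parenthesized subexpressions.  Returns the regexec status, or the
 * regcomp error code if compilation fails.
 */
int
match_a_b (const char * s, regmatch_t pmatch[3])
{
  regex_t re;
  int status = regcomp (&re, "\\(a*\\)\\(a*b*\\)", 0);

  if (status != 0)
    return status;

  status = regexec (&re, s, 3, pmatch, 0);
  regfree (&re);
  return status;
}
```

For the string "aaaabbbb", an implementation that follows the rule described above should report offsets 0 to 4 for pmatch[1] and 4 to 8 for pmatch[2], which corresponds to the answer marked with [*].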
int regnexec (const regex_t *preg, const char *string, regoff_t length, size_t nmatch, regmatch_t ** pmatch, int eflags);
Search for a match of compiled regexp preg in string.
Return the positions of the match and the first nmatch-1 parenthesized subexpressions in *pmatch.
Return 0 if a match is found, an error code otherwise. See regerror.
preg must have been filled in by regcomp or regncomp.
string must be length bytes long.
See regexec for details about other parameters but note that regnexec and regexec use different types for the parameter pmatch.
In regexec, pmatch is only used to pass a pointer. In regnexec, pmatch is used both to pass a pointer, and to return a pointer to the caller.
Callers are permitted to pass 0 for nmatch and pmatch. Callers are also permitted to pass the address of a pointer whose value is 0 for parameter pmatch. If they do so, and also set the bit REG_ALLOC_REGS in eflags, then pmatch will be a return parameter, returning a malloced array of preg->re_nsub elements containing the sub-expression positions of a successful match.
It is possible to asynchronously abort a call to regnexec. See Escaping Long-Running Matches.
Note: regnexec is not part of the Posix.2 interface for regexp matching. It is an Rx extension.
void regfree (regex_t *preg);
Release all storage allocated for the compiled regexp preg. This does not free preg itself.
size_t regerror (int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);
Returns a message corresponding to an error code, errcode, returned from either regcomp or regexec. The size of the message is returned. At most, errbuf_size - 1 characters of the message are copied to errbuf. Whatever is stored in errbuf is 0-terminated.
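Because regerror returns the size of the message, a caller can ask for the size first and then allocate a buffer that is certain to be large enough. The sketch below assumes the standard <regex.h> header and the standard Posix behavior that errbuf is ignored when errbuf_size is 0; regerror_dup is a hypothetical helper name. One extra byte is allocated so the code does not depend on whether the returned size counts the terminating 0.

```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper (not part of Rx): return a malloc'ed copy of
 * the message for `errcode'.  The caller must free the result.
 * Returns 0 if memory cannot be allocated.
 */
char *
regerror_dup (int errcode, const regex_t * preg)
{
  /* First call: errbuf_size is 0, so nothing is stored and only
   * the size of the message is reported.
   */
  size_t needed = regerror (errcode, preg, 0, 0);
  char * msg = malloc (needed + 1);

  if (msg != 0)
    regerror (errcode, preg, msg, needed + 1);
  return msg;
}
```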
The POSIX error codes for regexp pattern matchers are:
    REG_NOMATCH     "no match"
    REG_BADPAT      "invalid regular expression"
    REG_ECOLLATE    "invalid collation character"
    REG_ECTYPE      "invalid character class name"
    REG_EESCAPE     "trailing backslash"
    REG_ESUBREG     "invalid back reference"
    REG_EBRACK      "unmatched [ or [^"
    REG_EPAREN      "unmatched (, \\(, ) or \\)"
    REG_EBRACE      "unmatched \\{"
    REG_BADBR       "invalid content of \\{\\}"
    REG_ERANGE      "invalid range end"
    REG_ESPACE      "memory exhausted"
    REG_BADRPT      "invalid preceding regular expression"
Rx also provides a non-standard error code that is used if regexec or regnexec is interrupted (see Escaping Long-Running Matches).
    REG_MATCH_INTERRUPTED   "match interrupted"
Posix specifies the behavior of the C regexp functions quite precisely. Rx attempts to honor this specification to the greatest extent possible in a portable implementation. There are two areas of conformance that are worthy of note: the question of what is matched by a given regexp, and the question of how regexp matching interacts with Posix locales.
The question of "what is matched" is worthy of note because obtaining correct behavior has proven to be extremely difficult -- few implementations succeed. The implementation of Rx has been carefully developed and tested with correctness in mind.
The question of how regexps and locales interact is worthy of note because it is impossible to completely implement the behavior specified by Posix in a portable implementation (i.e., in an implementation that is not intimately familiar with the non-standard internals of a particular implementation of the standard C library).
Posix requires that when regexec reports the position of a matching substring, it must report the first-occurring ("leftmost") match. Of the possible first-occurring matches, the longest match must be returned. This is called the left-most longest rule.
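The left-most longest rule can be observed with an alternation whose first branch is a prefix of its second. The sketch below assumes the standard <regex.h> header in place of the Rx header; find_span is a hypothetical helper name.

```c
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper (not part of Rx): search `s' for the
 * extended regexp `pattern' and store the start and end offsets of
 * the overall match in *so and *eo.  Returns 0 on a match and a
 * non-zero error code otherwise.
 */
int
find_span (const char * pattern, const char * s,
           regoff_t * so, regoff_t * eo)
{
  regex_t re;
  regmatch_t m[1];
  int status = regcomp (&re, pattern, REG_EXTENDED);

  if (status != 0)
    return status;

  status = regexec (&re, s, 1, m, 0);
  if (status == 0)
    {
      *so = m[0].rm_so;
      *eo = m[0].rm_eo;
    }
  regfree (&re);
  return status;
}
```

Searching "xab" for a|ab should report offsets 1 to 3 under this rule: both branches can match starting at offset 1, and the longer one is chosen. A matcher that simply takes the first alternative that succeeds would report 1 to 2 instead.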
Posix requires that when regexec
reports the position of a
substring matched by a parenthesized subexpression, it must report
the last substring matched by that subexpression. If one
parenthesized expression (the "inner expression") is enclosed
in another (the "outer expression") and the inner expression did
not participate in the last match of the outer expression, then
no substring match may be reported for the inner expression.
Finally, Posix requires that when regexec determines what each subpattern matched (regardless of whether the subpattern is surrounded by parentheses), if there is more than one possibility, regexec must choose the possibility that first maximizes the length of substrings matched by the outermost subpatterns, from left-to-right, and then recursively apply the same rule to inner subpatterns. However, this rule is subordinate to the left-most longest rule: if an earlier-occurring or longer overall match can be obtained by returning a non-maximal match for some subpattern, regexec must return that earlier or longer match.
This combination of constraints completely determines the return value of regexec and describes the behavior of Rx. Many other implementations do not conform to the standard in this regard -- in exceptional situations, compatibility issues may arise when switching between regexp implementations.
Posix requires that a character set containing a multi-character collating element be treated specially. For example, if the character sequence ch is a collating element, then the regexp:
    [[.ch.]]
will match ch. On the other hand, if ch is not a collating element, the same expression is illegal. Similarly, an expression like:
    [a-z]
should match all collating elements which collate between "a" and "z" (inclusively), including multi-character collating elements that collate between "a" and "z".
Unfortunately, Posix provides no portable mechanism for determining which sequences of characters are multi-character collating elements, and which are not. Consequently, Rx operates as if multi-character collating elements did not exist.
Posix also defines a character set construct called an "equivalence class":
[[=<X>=]] where <X> is a collating element
An equivalence class stands for the set of all collating elements having the same primary collation weight as <X>. Unfortunately, Posix provides no portable mechanism for determining the primary collation weight of any collating element. Consequently, Rx implements the equivalence class construct by returning an error from regcomp whenever it is used.
Posix requires that in a character set, a range of characters such as:
    [a-z]
includes all characters that collate between a and z in the current locale. Some people argue that this behavior is confusing: that character ranges should be based on the encoding values of characters -- not on the rules of collation. Because of differences in collation, Posix advises that character ranges are a non-portable construct: portable programs should not use them at all!
Rx conforms to Posix by using collation order to interpret character ranges (with the exception that Rx always behaves as if there are no multi-character collating elements). Using the C locale when calling regcomp and regexec ensures that character ranges will be interpreted in a way consistent with the ASCII character set.
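A program that relies on ASCII-style ranges can select the C locale before compiling and matching. The following is a minimal sketch that assumes the standard <locale.h> and <regex.h> headers; has_ascii_lower is a hypothetical helper name. Note that setlocale changes process-wide state, so a real program would normally make this choice once at start-up.

```c
#include <locale.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper (not part of Rx): report whether `s' contains
 * a character in the range [a-z] as interpreted in the C locale.
 * Returns 1 if so, 0 if not, and -1 on any error.
 */
int
has_ascii_lower (const char * s)
{
  regex_t re;
  int status;

  /* Process-wide effect: collation now follows the C locale. */
  if (setlocale (LC_ALL, "C") == 0)
    return -1;

  if (regcomp (&re, "[a-z]", REG_NOSUB) != 0)
    return -1;

  status = regexec (&re, s, 0, 0, 0);
  regfree (&re);

  if (status == 0)
    return 1;
  return (status == REG_NOMATCH) ? 0 : -1;
}
```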
The performance (speed and memory use) of any Posix regexp matcher (including Rx) is a complicated subject. Programmers who want to use regexps will benefit by understanding the issues, at least in broad outline, so that they can avoid pitfalls, so they can make the best possible use of a particular implementation (Rx, in this case), and so they know where to delve deeper when performance issues become particularly important.
Traditionally, many programmers use regexps as if they were always computationally inexpensive. This is naive. Some uses of regexps are inexpensive, others are intractable. Many fall somewhere in the middle. Which uses fall into which cases varies between implementations.
This section describes the performance of Posix regexp matching in general -- it is not specific to Rx.
Posix regexp pattern matching, in its full generality, is a computationally expensive process. In some cases, involving only moderately sized regexps, it is intractably expensive, regardless of what implementation is being used. Thus, one should never write programs with the assumption that regexec will always return normally in a reasonable amount of time. (Rx includes non-standard functionality which can be used to interrupt a call to regexec after a time-out. See Escaping Long-Running Matches.)
On the other hand, for many very simple regexps, Posix regexp matching is very inexpensive, again, (nearly) regardless of implementation. For example, if a regexp is simply a string of literal characters, searching for a match is almost certain to be fast.
Implementations of Posix regexps often differ in the set of optimizations they provide. Simplistic implementations, containing few optimizations, perform well for small and simple regexps, but poorly in many other cases. Sophisticated implementations can perform very well on even large regexps if those regexps are true regular expressions or are nearly true regular expressions. (For an explanation of the distinction between regexps in general and "true regular expressions", see An Introduction to Posix Regexps.)
Implementations of Posix regexps often differ in correctness, and this has a bearing on performance. Several popular implementations sometimes give incorrect results. The bugs that cause those errors also improve the performance of the same matchers on some patterns for which they give correct results. Thus, programmers choosing an implementation are sometimes faced with the uncomfortable trade-off between the best performance bench-mark results, and the best correctness testing results. In such situations, an important question is the relevance of the tests: do the bench-mark tests accurately reflect regexp usage in the target application? What are the risks of using an incorrect matcher in the application? Consider whether the better performance of buggy matchers on some expressions is offset by their considerably worse performance on other expressions: it is not the case that the buggy implementations are always faster.
This section describes the performance of Posix regexp matching in Rx.
Rx is designed to give excellent performance over the widest possible range of regexps (including many large, complicated regexps), but to never sacrifice correctness. While Rx is at least competitive with most implementations on most regexps (and is sometimes much faster), there are some regexps for which Rx is much slower than other implementations. Often, this difference can be attributed to the bugs in other implementations which speed up some cases while getting other cases wrong. This is something to keep in mind when comparing Rx to other implementations.
When a trade-off is necessary between memory use and speed, Rx is designed to allow programmers to choose how much memory to use and to provide programmers with the tools necessary to tune memory use for best possible performance. Rx can operate usefully (though comparatively slowly) with as little as 10-20K of dynamically allocated memory. As a rule of thumb, Rx's default of approximately 2MB is suitable even for applications that use regexps fairly heavily. (See Tuning Posix Regexp and XML Regular Expression Performance.)
Rx contains optimizations targeted for regexps which are true regular expressions. Rx converts true regular expressions to deterministic automata and can compare such expressions to a potentially matching string very quickly. This optimization makes it practical to use even quite large regular expressions. For more information, see the appendix Data Sheet for the Hackerlab Rx Posix Regexp Functions.
Sometimes regexps which are not true regular expressions can be matched as quickly as if they were true regular expressions. If a regexp is not a regular expression only because it begins with a left anchor (^) and/or ends with a right anchor ($), Rx can match the expression as quickly as a true regular expression.
If, in addition, the regexp contains parenthesized subexpressions, Rx can use regular expression optimizations if, either, the REG_NOSUB flag is passed to regcomp, or the nmatch parameter to regexec is less than or equal to 1 (i.e., if regexec is not required to report the positions of substrings matched by parenthesized subexpressions).
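A yes/no test that follows this advice might look like the sketch below: the pattern is compiled with REG_NOSUB and regexec is called with nmatch set to 0, so no subexpression positions are requested. It assumes the standard <regex.h> header in place of the Rx header; is_match is a hypothetical helper name.

```c
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper (not part of Rx): report whether the extended
 * regexp `pattern' matches somewhere in `s', without asking for any
 * match positions.  Returns 1 on a match, 0 on no match, and -1 on
 * any error.
 */
int
is_match (const char * pattern, const char * s)
{
  regex_t re;
  int status;

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  /* nmatch == 0 and pmatch == 0: only success/failure is reported. */
  status = regexec (&re, s, 0, 0, 0);
  regfree (&re);

  if (status == 0)
    return 1;
  return (status == REG_NOMATCH) ? 0 : -1;
}
```

For example, is_match ("^(ab|cd)+$", line) uses anchors and a parenthesized group but never needs the group's position, which is the situation described above.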
If a regexp is not a regular expression because it contains backreferences (\n) or counted iterations (RE{n,m}), Rx's DFA optimizations do not apply in their full generality. Such regexps run the greatest risk of being slow.
The Rx implementation of regcomp supports a non-standard flag, REG_DFA_ONLY, which can be used to disable all regexp constructs that are forbidden in true regular expressions. See regcomp.
When regexps are being used for lexical analysis, good performance can often be achieved by using true regular expressions in combination with the non-standard regexp operator [[:cut n:]]. See The cut Operator.
#include <hackerlab/rx/escape.h>
Regexp searches can take a long time. Rx makes provisions for asynchronously aborting a long-running match. When regexec or regnexec is aborted, it returns REG_MATCH_INTERRUPTED. Callers of other match functions (such as rx_xml_is_match) can catch asynchronous interrupts using the jump buffer rx_escape_jmp_buf (documented below).
Asynchronous match interrupts are permitted whenever Rx calls the function pointed to by rx_poll (see below). If that pointer is 0, no interrupts will occur. If it points to a function, that function may cause an interrupt by calling longjmp to reach the point from which the interrupt resumes.
By convention, the global jump buffer rx_escape_jmp_buf is used. To cause an interrupt the next time rx_poll is called, set rx_poll to the function rx_escape which performs a longjmp to rx_escape_jmp_buf.
extern void (*rx_poll)(void);
A function pointer that is called by Rx (if not 0) whenever it is safe to interrupt an on-going search.
rx_escape_jmp_buf
The conventionally used jump buffer for defining where to resume execution after an Rx match function is interrupted.
See rx_escape.
void rx_escape (void);
This function is conventionally used for interrupting a long-running Rx match function. To cause an interrupt of an on-going match from an asynchronously called function, such as a signal handler, assign rx_escape to rx_poll and return normally from the asynchronously called function. When rx_poll is next called, rx_escape will assign 0 to rx_poll and longjmp to rx_escape_jmp_buf. rx_escape is quite simple:
    void
    rx_escape (void)
    {
      rx_poll = 0;
      longjmp (rx_escape_jmp_buf, 1);
    }
Here is how rx_escape might be used in conjunction with a signal handler while using regexec:
    void
    match_fn (void)
    {
      int status;
      ...;
      status = regexec (...);
      rx_poll = 0;              // prevent a race condition
      if (status == REG_MATCH_INTERRUPTED)
        {
          /* matching was cut short */
        }
    }

    void
    signal_handler (int signal)
    {
      rx_poll = rx_escape;      // interrupt an ongoing match.
    }
Here is how the same signal handler might be used in conjunction with a Unicode regular expression match function (such as rx_xml_is_match):
    void
    match_fn (void)
    {
      int status;
      ...;
      if (setjmp (rx_escape_jmp_buf))
        {
          /* matching was cut short */
          return;
        }
      status = rx_xml_is_match (...);
      rx_poll = 0;              // prevent a race condition
      ...
    }
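The control flow of the polling mechanism can be imitated without Rx in a small self-contained sketch. Everything named demo_* below is a hypothetical stand-in (for rx_poll, rx_escape_jmp_buf, rx_escape and a long-running match function); the real declarations come from <hackerlab/rx/escape.h>, and the loop only simulates a matcher that polls at safe points.

```c
#include <setjmp.h>

/* Hypothetical stand-ins for rx_poll and rx_escape_jmp_buf. */
void (*demo_poll) (void) = 0;
jmp_buf demo_escape_jmp_buf;

/* Stand-in for rx_escape: clear the hook, then jump out. */
void
demo_escape (void)
{
  demo_poll = 0;
  longjmp (demo_escape_jmp_buf, 1);
}

/* Stand-in for a long-running matcher.  It calls the poll hook
 * (if not 0) once per step.  At step `interrupt_at' it installs
 * demo_escape, which is what a signal handler would do.
 */
long
demo_long_search (long steps, long interrupt_at)
{
  long i;

  for (i = 0; i < steps; ++i)
    {
      if (i == interrupt_at)
        demo_poll = demo_escape;
      if (demo_poll != 0)
        demo_poll ();
    }
  return i;
}

/* Returns 1 if the search was cut short, 0 if it ran to completion. */
int
demo_run (long steps, long interrupt_at)
{
  if (setjmp (demo_escape_jmp_buf))
    return 1;                   /* reached via demo_escape */

  demo_long_search (steps, interrupt_at);
  demo_poll = 0;                /* same precaution as in match_fn */
  return 0;
}
```

The sketch keeps the same shape as the examples above: the caller establishes the jump buffer before starting the search, and the escape function clears the hook before jumping so that it fires only once.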