The Hackerlab at regexps.com
Data Sheet for the Hackerlab Rx Posix Regexp Functions
Package:
Hackerlab Rx (a Posix regexp matcher)
Supplier:
regexps.com
Function:
Hackerlab Rx provides the Posix regexp functions (regcomp,
regexec, regfree, and regerror).
In addition, closely related non-standard functions provide
extended functionality.
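As a minimal sketch of how the four standard functions fit together, the
helper below compiles a pattern, searches one string, and reports a
compile error through `regerror'. It uses only the Posix interface and
assumes the declarations come from the system header <regex.h>; the
header and link options needed to obtain Rx's versions are not covered
by this data sheet. The helper name `first_match' is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: search `text` for the first match of the extended
 * regexp `pattern`.  Returns 1 and stores the match offsets in *start and
 * *end on success, 0 if nothing matches, and -1 if the pattern is invalid
 * (in which case the regerror message is copied into errbuf). */
int first_match(const char *pattern, const char *text,
                int *start, int *end, char *errbuf, size_t errlen)
{
    regex_t re;
    regmatch_t m[1];
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0) {
        regerror(rc, &re, errbuf, errlen);   /* error code -> message */
        return -1;
    }

    rc = regexec(&re, text, 1, m, 0);
    if (rc == 0) {
        *start = (int)m[0].rm_so;            /* offset of first matched byte */
        *end = (int)m[0].rm_eo;              /* offset just past the match   */
    }
    regfree(&re);                            /* release the compiled form */
    return rc == 0 ? 1 : 0;
}
```

A caller would typically print `errbuf' when the helper returns -1 and
otherwise use the two offsets to extract the matching text.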
Key Features:
Rx is fast and accurate.
Non-standard extensions include support for searching
strings not terminated by 0 (regnexec), support for
asynchronously interrupting long-running searches (rx_poll),
non-capturing parentheses in regexps,
the regexp operator `cut' for efficient lexical analysis,
and support for tuning memory usage by search functions (a
time/space trade-off).
Extensive unit tests and validation tests are included.
Postscript and HTML documentation is included.
Licensing:
Hackerlab Rx is part of the Hackerlab C Library, which is
distributed under the terms of the GNU General Public
License, Version 2, as published by the Free Software
Foundation.
Alternative licenses (such as licenses permitting
binary-only re-distribution) can be purchased from
regexps.com.
Prerequisites:
standard C compiler
Posix libc.a (standard C library)
GNU Make
Recommended and Disrecommended Applications:
Rx is for applications needing the standard Posix functions
such as `regcomp' and `regexec'. Versions of these
functions found in several implementations of the standard C
library are buggy, slow, or both. Rx
provides a correct, extended, high-performance alternative.
Rx is recommended for applications requiring good performance
over the widest possible range of regexps. Many other
implementations optimize some simple cases, but perform either
poorly or incorrectly in difficult cases.
Rx is recommended for applications which benefit by being
scalable from low-memory environments to large-memory
environments. Rx permits programmers to tune its memory use
over a wide range, trading space for time.
Rx is recommended for applications which accept as input
either regexps or strings to search but which must not be
subject to denial of service attacks. The reason is that
the standard function `regexec' is intractable for some
conceivable inputs: In any correct implementation, if
regexec is called with certain inputs, it will not return in
any reasonable amount of time. Rx supports returning from
`regexec' after a time-out has expired, to protect
applications from malicious regexp parameters.
Rx is recommended for applications which use regexps for
lexical analysis. Rx supports a non-standard regexp
operator for lexical analysis (the "cut" operator).
Rx is NOT recommended for applications running with severely
limited amounts of memory (see the sections "Contribution to
Executable Size" and "Run-time Allocation Requirements"
below).
Rx is NOT recommended for applications in which the amount
of regexp compilation dominates the amount of regexp search.
The Rx regexp compiler is not the fastest available. On the
other hand, this recommendation is subordinate to
correctness: if regexp accuracy is more important than
regexp speed, Rx is recommended even in cases where regexp
compilation dominates regexp search.
Finally, Rx is recommended for programs that can benefit by
using regexps, but that might otherwise be precluded from
using regexps by the performance limitations of older regexp
matchers. Rx can efficiently handle regexps which are
significantly larger and more complex than can be handled by
most other matchers.
Limitations:
Multi-character collating sequences are not supported.
Character equivalence class expressions are not supported.
The trivial function `regerror', which translates regexp
error codes to strings, has not been internationalized.
Test coverage is extensive, but not (yet) 100%.
Rx has large executable size and run-time allocation
requirements when compared to some implementations.
(However, substantially smaller implementations are
often buggy and are, in general, slower.)
Rx is single threaded. In multi-threaded applications,
programs must ensure that Rx is active in only one thread at a
time. (This is likely to be fixed in future releases.)
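One conventional way to meet that requirement is to route every regexp
call through a single process-wide lock. The sketch below assumes Posix
threads are available; the wrapper names `locked_regcomp',
`locked_regexec', and `locked_regfree' are hypothetical and are not part
of Rx.

```c
#include <pthread.h>
#include <regex.h>
#include <stddef.h>

/* One lock shared by all threads, so that only one thread at a time
 * is inside the regexp library. */
static pthread_mutex_t regexp_lock = PTHREAD_MUTEX_INITIALIZER;

int locked_regcomp(regex_t *re, const char *pattern, int cflags)
{
    pthread_mutex_lock(&regexp_lock);
    int rc = regcomp(re, pattern, cflags);
    pthread_mutex_unlock(&regexp_lock);
    return rc;
}

int locked_regexec(const regex_t *re, const char *text,
                   size_t nmatch, regmatch_t *pmatch, int eflags)
{
    pthread_mutex_lock(&regexp_lock);
    int rc = regexec(re, text, nmatch, pmatch, eflags);
    pthread_mutex_unlock(&regexp_lock);
    return rc;
}

void locked_regfree(regex_t *re)
{
    pthread_mutex_lock(&regexp_lock);
    regfree(re);
    pthread_mutex_unlock(&regexp_lock);
}
```

The cost of this design is that a long-running search in one thread
blocks regexp calls in every other thread until it returns.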
Contribution to Executable Size:
On a Pentium-based machine running gcc (egcs version
2.91.66), we compiled this simple (nonsense) program and
linked it against the Hackerlab C Library:
void regexec(void);
void regcomp(void);
void regerror(void);
int main ()
{ regexec(); regcomp(); regerror(); return 0; }
Both the library and the program were compiled with the "-g"
option.
Total executable size:
text data
104992 20112
The following list of object files from the Hackerlab
C library were linked into the executable:
alloc-limits.o bits-tree-rules.o bits.o
bitset-tree.o bitset.o char-class-locale.o
char-class.o char-cmp-locale.o coding.o cvt.o
dfa-cache.o dfa.o errnorx.o escape.o hashtree.o
match-regexp.o mem.o must-malloc.o nfa-cache.o nfa.o
panic-exit.o panic.o posix.o re8-parse.o str.o
super.o tree.o uni-bits.o
The contribution of those files to the executable size is:
text data
103387 19888
Sizes may differ slightly from the latest release
of Rx. Sizes will obviously vary with platform,
compiler, and compiler options.
External Linkage Dependencies:
When compiled under FreeBSD 3.0, the simple program used to
test executable sizes depends on the following symbols
defined in "libc":
_CurrentRuneLocale
_DefaultRuneLocale
___runetype
_exit
free
longjmp
malloc
realloc
setjmp
strcoll
write
The exact dependencies may, of course, differ from
system to system. The symbols `_CurrentRuneLocale',
`_DefaultRuneLocale', and `__runetype' are used in FreeBSD
to implement macros in the `ctype(3)' family.
Accuracy (Comparisons):
Rx is distributed with a test suite. The tests consist of
385 distinct regexps. Of those expressions, 23 are invalid
expressions, 362 are valid expressions. For valid
expressions, the tests include a string to compare to the
compiled expression, and the expected results from
`regexec'. Rx passes all of those tests.
A subset of those tests, consisting of 371 regexps, is based
on the Posix.2 standard. Those tests do not use any of Rx's
extensions to Posix. The tests were designed by hand to
systematically exercise all features of the Posix regexp
language.
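The shape of such a test (a regexp, a string, and the expected result
of `regexec') can be sketched as a small table-driven checker. The
cases and names below are illustrative assumptions only; they are not
taken from the Rx test suite, and the real suite's format is not shown
in this data sheet.

```c
#include <regex.h>
#include <stddef.h>

/* One test case: a pattern, a string to match, and the expected outcome. */
struct regexp_case {
    const char *pattern;
    const char *text;
    int valid;            /* 1 if regcomp is expected to succeed       */
    int matches;          /* 1 if regexec is expected to find a match  */
    int so, eo;           /* expected offsets of the overall match     */
    int sub_so, sub_eo;   /* expected offsets of subexpression 1 (-1 if unused) */
};

/* Illustrative cases only. */
static const struct regexp_case sample_cases[] = {
    { "a(b*)c", "xabbcx", 1, 1, 1, 5,  2,  4 },
    { "(x)|y",  "y",      1, 1, 0, 1, -1, -1 },  /* subexpression not used */
    { "^abc$",  "abcd",   1, 0, 0, 0,  0,  0 },  /* valid, but no match    */
    { "(ab",    "ab",     0, 0, 0, 0,  0,  0 },  /* unbalanced parenthesis */
};

/* Returns 1 if the implementation behaves as the case expects. */
int run_case(const struct regexp_case *c)
{
    regex_t re;
    regmatch_t m[2];
    int ok;

    if (regcomp(&re, c->pattern, REG_EXTENDED) != 0)
        return !c->valid;         /* a compile failure must be expected */
    if (!c->valid) {
        regfree(&re);
        return 0;                 /* invalid regexp compiled anyway */
    }
    if (regexec(&re, c->text, 2, m, 0) != 0)
        ok = !c->matches;
    else
        ok = c->matches
             && m[0].rm_so == c->so && m[0].rm_eo == c->eo
             && m[1].rm_so == c->sub_so && m[1].rm_eo == c->sub_eo;
    regfree(&re);
    return ok;
}

/* Runs the sample table and returns the number of failing cases. */
int run_sample_cases(void)
{
    int failures = 0;
    size_t n = sizeof sample_cases / sizeof sample_cases[0];

    for (size_t i = 0; i < n; i++)
        failures += !run_case(&sample_cases[i]);
    return failures;
}
```

Checking the offsets of both the overall match and the subexpression is
what lets a harness like this detect errors in match length and in
subexpression positions, not just errors in whether a match is found.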
We tested two alternative implementations using only the
Posix.2 tests. These were the libc implementation distributed
with FreeBSD 3.0, and GNU regex 0.12. (There are many
versions of GNU regex in distribution, all numbered "0.12".
We used one found in libc, the GNU C Library, version 2.1, as
distributed with a popular and recent version of Linux.)
The FreeBSD implementation failed 14 of 371 tests.
These failures were:
1 invalid regexp successfully compiled
6 valid regexps failed to compile (all 6
appear to be caused by a single bug)
2 calls to `regexec' failed to return the
longest possible match
5 calls to `regexec' returned incorrect
positions for matching subexpressions
GNU regex failed 22 of 371 tests. These failures were:
8 invalid expressions compiled successfully
(apparently due to 3 or 4 bugs)
10 valid expressions failed to match matching
strings or matched incorrect strings
(apparently due to 2 or 3 bugs)
4 calls to `regexec' returned incorrect
positions for matching subexpressions
Execution Speed:
The performance characteristics of Posix regexp matchers are
complex and difficult to summarize. Performance varies
wildly depending on the types of regexps being used, the
details of the strings being searched for matches, and the
pattern of calls made to regexp functions.
Because of the complexity of the subject, we are wary of
publishing benchmark comparisons of regexp matchers. There
is no industry-wide agreement on a realistic set of
benchmarks. There are not even any proposed realistic
benchmarks. We are perfectly capable of constructing
benchmarks that would purport to show "Rx always wins big",
"Rx always loses big", and benchmarks that would give mixed
results. None of those would legitimately inform anyone of
what to expect from Rx or any other regexp matcher.
Nevertheless, it is our belief, based on our internal
measurements, on our experience with Rx, and on our
understanding of the implementation issues involved, that Rx
is the best performing matcher available. Moreover, the
performance advantages of Rx are so great, in some cases, as
to extend the usefulness of regexp pattern matching well
beyond its traditional applications. Below, we have
provided qualitative and quantitative information to back up
this assertion.
*Correctness Before Speed*
In general, the implementation of Rx emphasizes correctness
first, and performance second. Subject to that constraint,
considerable effort has gone into achieving the best
possible performance over the widest range of expressions.
Emphasizing correctness has important implications for
performance. Some popular regexp matchers contain
some interesting bugs: the bugs cause those matchers to
give incorrect results for some patterns, but also speed
up some cases when the matchers give correct results.
That implies that when comparing implementations,
correctness and performance cannot be regarded as separate
issues: fixing the bugs in an implementation can drastically
alter its performance characteristics.
*Deterministic Finite Automata*
Rx is a "DFA based" implementation. That means that, whenever
possible, Rx tests for a matching pattern or sub-pattern using
a single pass scan of the target string rather than an
exhaustive, backtracking search. In some cases, correctness
demands an exhaustive backtracking search.
Rx excels when DFA optimizations apply. When compiled with
optimization, the DFA routines scan a target string at a cost
of approximately 12 instructions per character (provided the
DFA cache is sufficiently large). As a result, for an
interesting and useful subset of regexps in general, namely
true regular expressions, regexp comparison is not
significantly more expensive than `strcmp', and regexp
searching is not significantly more expensive than `strstr'
(excluding implementations of `strstr' which use sub-linear
search techniques). The reference manual describes the
conditions under which DFA optimizations apply.
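To illustrate what a single-pass DFA scan looks like, here is a
hand-built automaton for the true regular expression `^ab*c$'. This is
only a sketch of the general technique, not Rx's implementation: Rx
constructs its transition tables dynamically and caches them, whereas
the state numbering, table layout, and function names below are
invented for the example.

```c
/* States of a hand-built DFA for ^ab*c$. */
enum { S_START, S_BODY, S_ACCEPT, S_DEAD, N_STATES };

static unsigned char next_state[N_STATES][256];

/* Fill the transition table.  Every transition not listed explicitly
 * leads to the dead state, which can never be left. */
void dfa_init(void)
{
    for (int s = 0; s < N_STATES; s++)
        for (int ch = 0; ch < 256; ch++)
            next_state[s][ch] = S_DEAD;
    next_state[S_START]['a'] = S_BODY;    /* leading a       */
    next_state[S_BODY]['b']  = S_BODY;    /* any number of b */
    next_state[S_BODY]['c']  = S_ACCEPT;  /* final c         */
}

/* Scan `text` once, left to right: one table lookup per character and
 * no backtracking.  Returns 1 if the whole string matches. */
int dfa_match(const char *text)
{
    int state = S_START;

    for (const unsigned char *p = (const unsigned char *)text; *p; p++)
        state = next_state[state][*p];
    return state == S_ACCEPT;
}
```

The inner loop does a constant amount of work per input character
regardless of the pattern, which is why matching of this kind can be
compared in cost to a plain string comparison.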
When a pattern demands a backtracking search, but some of
its sub-patterns permit DFA optimizations, Rx uses DFAs for
those sub-patterns.
In a number of cases, Rx is able to apply inexpensive
heuristics to "short-cut" an exhaustive backtracking search
without sacrificing correctness.
*Timing Demonstrations*
Here are some examples that illustrate some of the
performance advantages of Rx. These are not benchmarks,
for reasons outlined above. These examples were specifically
designed to show off some of Rx's strengths.
To generate these results, we used a simple program
called `pseudo-grep'. `pseudo-grep' accepts a regexp
as a command line argument and compiles that regexp once.
It reads lines of input from its standard input using
`fgets'. It compares each line to the expression using
`regexec'. If a line matches, that line is printed
on standard output, with brackets surrounding the matching
text.
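A program along those lines might look like the sketch below. This is
a reconstruction from the description above, not the original
`pseudo-grep' source: the function names, the 4096-byte line buffer,
the stripping of the trailing newline, and the use of REG_EXTENDED are
all assumptions.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* If `line` matches, copy it into `buf` with brackets around the matching
 * text and return 1; otherwise leave `buf` alone and return 0. */
int bracket_match(const regex_t *re, const char *line,
                  char *buf, size_t buflen)
{
    regmatch_t m[1];

    if (regexec(re, line, 1, m, 0) != 0)
        return 0;
    snprintf(buf, buflen, "%.*s[%.*s]%s",
             (int)m[0].rm_so, line,                              /* before */
             (int)(m[0].rm_eo - m[0].rm_so), line + m[0].rm_so,  /* match  */
             line + m[0].rm_eo);                                 /* after  */
    return 1;
}

/* Compile `pattern` once, then test every line read from `in`.
 * Returns the number of matching lines, or -1 if the pattern is invalid. */
long pseudo_grep(const char *pattern, FILE *in, FILE *out)
{
    regex_t re;
    char line[4096];
    char marked[sizeof line + 2];     /* room for the two brackets */
    long hits = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* drop the trailing newline */
        if (bracket_match(&re, line, marked, sizeof marked)) {
            fprintf(out, "%s\n", marked);
            hits++;
        }
    }
    regfree(&re);
    return hits;
}
```

A `main' function would pass its first command line argument, `stdin',
and `stdout' to `pseudo_grep'. Because only the four standard functions
are used, the same source can be linked against different regexp
implementations for comparison.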
We compiled `pseudo-grep' three times: once using Rx,
once using libc under FreeBSD 3.0, and once using
`GNU regex 0.12' (as obtained from "ftp.gnu.org").
We ran the three programs on two examples.
In the first example, we searched `/usr/dict/words' (a
dictionary containing one word per line) for words that can
be spelled using hexadecimal digits, substituting `0' for
`O', `1' for `I' and `L', `5' for `S', and `6' for `G'. The
regexp used for this example was:
"^[a-fA-FoOiIlLsSgG]+$"
In the second example, we searched a large file of
C source code for C keywords. The regexp used for
this example was:
"(if|else|while|for|case|switch|default|char|\
int|long|float|double|struct|enum|union|goto\
|break|continue)"
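Both patterns rely on extended-syntax operators (`+' and `|'), so a
program using them would pass REG_EXTENDED to `regcomp'. The sketch
below is an assumption about how the strings might be used (the flags
used in the original runs are not stated), and the helper
`line_matches' is hypothetical. The keyword pattern is written with
adjacent string literals, which produce a single string.

```c
#include <regex.h>
#include <stddef.h>

static const char hex_words[] = "^[a-fA-FoOiIlLsSgG]+$";
static const char c_keywords[] =
    "(if|else|while|for|case|switch|default|char|"
    "int|long|float|double|struct|enum|union|goto"
    "|break|continue)";

/* Hypothetical helper: 1 if `line` matches `pattern`, 0 if it does not,
 * -1 if the pattern fails to compile. */
int line_matches(const char *pattern, const char *line)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Note that the keyword pattern is not anchored to word boundaries, so a
line containing `printf' matches because of the embedded `int'.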
             Hexadecimal words    C Keywords
------------------------------------------------
Rx           0.05/0.04/0.02       0.29/0.20/0.09
FreeBSD      0.17/0.16/0.02       2.06/2.06/0.00
GNU regex    0.18/0.12/0.06       2.18/2.13/0.04
------------------------------------------------
real/user/system time (seconds)
Run-time Allocation Requirements:
The amount of memory allocated by Rx at run-time is
dominated by the size of two caches: the "nfa cache" and
the "dfa cache". Roughly speaking, the "nfa cache"
limits the quantity of compiled regexps that are re-used
across calls to "regcomp"; the size of the "dfa cache"
limits the number of DFA transition tables dynamically
constructed by "regexec".
The sizes of these caches are independent and under
application control. They may be varied, trading space
for time.
By default, both cache sizes are set to an advisory
limit of 1MB. (This is an "advisory limit" because it
may be exceeded when necessary for correct operation.
Except in rare circumstances, actual cache sizes closely
approximate the advisory limit.)
The Rx test suite for Posix regexps tests successfully
with advisory cache limits as small as 10K for each cache,
and as large as 8MB for each cache, and exhibits the
expected space-for-time trade-off.
Because of the complex nature of regexp performance,
there is no simple, fixed relation between cache sizes
and run-time. The exact trade-off will depend on the
regexp usage patterns of your application and is best
determined by experimentation.
The following table illustrates the space/time trade-off in
action. The program `test-rx' uses 385 distinct regexps.
Of those expressions, 23 are invalid expressions, 362 are
valid expressions. For valid expressions, the tests include
a string to compare to the compiled expression, and the
expected results from `regexec'. It should be noted that
the test suite includes highly artificial tests which are
specifically designed to be very expensive to match. Thus,
the time-per-regexp exhibited by `test-rx' is not typical of
regexp usage in general.
To generate the data in the table, the test program was run
4 times, varying the size of the NFA and DFA caches. Within
each run, each test case was repeated 3 times. These
repetitions create opportunities for the caches to impact
performance dramatically.
Six numbers are reported for each run: the advisory limit on
the size of both the NFA and DFA caches; the actual amount of
memory used by each of the caches; and the real, user, and
system times consumed.
advisory           NFA/DFA cache size     real/user/system
limits             high water mark        times (seconds)
(both caches)      (bytes)
--------------------------------------------------------------
32K                 32863 /   56764       201.0 / 188.0 / 12.4
64K                 65632 /   65580        41.0 /  37.8 /  2.6
1MB (default)      237170 / 1048600         4.3 /   4.1 /  0.0
8MB                241085 / 3183432         4.2 /   4.0 /  0.1
INTERPRETATION
For this test program:
A DFA cache size of 32K is too small. In order to complete
the tests successfully, Rx transiently allocated 55.4K
to the DFA cache.
Effective time/space trade-offs are evident between 32K,
64K, and 1MB caches.
A cache size of 8MB is too large. Rx used more memory than
with 1MB caches, but there was no significant speed up.
Support:
To purchase an alternative license, request additional
features, or for any kind of support assistance, you can
contact us at "hackerlab@regexps.com" or via our web site
"www.regexps.com". We can also be reached at (412) 401-5204.
We are currently in the midst of relocating from Pennsylvania
to California, so at this time, we have no reliable postal
address.