The Hackerlab at regexps.com
Data Sheet for the Hackerlab Rx XML Regular Expression Matcher
Package:
Hackerlab Rx-XML (an XML/Unicode regular expression matcher)
Supplier:
regexps.com
Function:
Hackerlab Rx-XML is a regular expression pattern matcher for
Schema-capable validating XML processors. It is also a
general purpose Unicode regular expression matcher.
Key Features:
Rx-XML is fast and accurate.
Supports the regular expression language specified
in the W3C document "XML Schema Part 2"
(http://www.w3.org/TR/xmlschema-2).
Supports alternative regular expression syntaxes.
Clean and simple "classic C" interface.
Patterns may use UTF-8 or UTF-16.
Strings compared to compiled patterns may use UTF-8 or UTF-16.
Provides protection against encoding-based illegal data
attacks: Ill-formed encoding sequences (e.g. non-shortest form
UTF-8) are detected and rejected during regular expression
compilation and matching.
Configurable space/time trade-offs.
Ready for Unicode 3.1: Designed for a character set with 2^21
code points.
Validation tests are included.
Postscript and HTML documentation is included.
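To illustrate the ill-formed encoding check mentioned in the feature list above, here is a minimal sketch of a strict UTF-8 decoder that rejects non-shortest-form sequences. It is not taken from Rx-XML; the function name `utf8_decode_strict' and its interface are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Rx-XML: decode one UTF-8 sequence
   from s[0..len).  Returns the number of bytes consumed and stores the
   code point in *cp, or returns 0 if the sequence is ill-formed.  */
size_t
utf8_decode_strict (const unsigned char * s, size_t len, unsigned long * cp)
{
  /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
  static const unsigned long min_for_len[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
  size_t need;
  size_t i;
  unsigned long c;

  assert (s != NULL && cp != NULL);
  if (len == 0)
    return 0;

  if (s[0] < 0x80)                { need = 1; c = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0) { need = 2; c = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { need = 3; c = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { need = 4; c = s[0] & 0x07; }
  else
    return 0;                     /* stray continuation or invalid lead byte */

  if (len < need)
    return 0;                     /* truncated sequence */

  for (i = 1; i < need; ++i)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;                 /* not a continuation byte */
      c = (c << 6) | (unsigned long) (s[i] & 0x3F);
    }

  if (c < min_for_len[need])
    return 0;                     /* non-shortest form, e.g. C0 AF for "/" */
  if (c >= 0xD800 && c <= 0xDFFF)
    return 0;                     /* surrogate code points are not characters */
  if (c > 0x10FFFF)
    return 0;                     /* beyond the Unicode code space */

  *cp = c;
  return need;
}
```

With this sketch, the two-byte sequence C0 AF (an overlong encoding of "/") is rejected, while the single byte 2F is accepted as U+002F.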
Licensing:
Hackerlab Rx-XML is part of the Hackerlab C Library
which is distributed under the terms of the
GNU General Public License, Version 2, as published
by the Free Software Foundation.
Alternative licenses (such as licenses permitting
binary-only re-distribution) can be purchased from
regexps.com.
Prerequisites:
standard C compiler
Posix libc.a (standard C library)
GNU Make
Recommended and Disrecommended Applications:
Rx-XML is recommended for any Unicode application
in need of regular expression pattern matching.
Rx-XML is recommended for use in high-performance
Schema-capable XML processors (e.g., an XML-based
transaction server).
Limitations:
This is the first public (non-prototype) release. Rx-XML has
been carefully tested but is not yet in widespread use. Test
coverage is extensive, but not exhaustive. Thus, prudence
suggests that minor bugs are likely to remain.
Rx-XML protects applications against an important class of
regular-expression-based denial-of-service attacks: those
where a trusted source provides regular expressions, but an
untrusted source provides strings to compare to those
expressions. The current release does *not* protect
applications against another class of attacks: those where an
untrusted source provides the regular expressions. (This is
likely to be fixed in future releases.)
Rx-XML is single threaded. In multi-threaded applications,
programs must ensure that Rx-XML is active in only one thread
at a time. (This is likely to be fixed in future releases.)
The set of error codes returned from functions that compile
and match regular expressions is likely to change in
future releases.
Contribution to Executable Size:
On a Pentium based machine, running gcc (egcs version
2.91.66) we compiled this simple (nonsense) program and
linked it against the Hackerlab C Library:
int
main (int argc, char * argv[])
{
rx_xml_recomp ();
rx_xml_is_match ();
rx_xml_free_re ();
}
Both the library and the program were compiled with the "-g"
option.
Total executable size:
text data
91745 58740
The following list of object files from the Hackerlab
C library were linked into the executable:
alloc-limits.o bits-tree-rules.o bits.o bitset-lookup.o
bitset-tree.o bitset.o bitsets.o blocks.o char-class.o
charsets.o coding-inlines.o coding.o cvt.o dfa-cache.o
dfa-iso8859-1.o dfa-utf16.o dfa-utf8.o dfa.o escape.o
hashtree.o mem.o must-malloc.o nfa-cache.o nfa.o
panic-exit.o panic.o re.o str.o super.o tree.o uni-bits.o
unidata.o
The contribution of those files to the executable size is:
text data
90472 58548
Sizes may differ slightly from the latest release
of Rx-XML. Sizes will obviously vary with platform,
compiler, and compiler options.
External Linkage Dependencies:
When compiled under FreeBSD 3.0, the simple program used to
test executable sizes depends on the following symbols
defined in "libc":
_exit
free
longjmp
malloc
realloc
setjmp
Accuracy:
Rx-XML is distributed with a test suite that consists of 334
test cases that systematically exercise every feature of the
regular expression language. Of those, 319 test cases involve
legal expressions and a string that may or may not match.
Fifteen test cases involve illegal expressions, expected to
return an error.
During validation, expressions are compiled from both UTF-8
and UTF-16 source. Each time a legal expression is compiled,
it is compared to both a UTF-8 and a UTF-16 string. A total of
1276 match tests (319 expressions x 2 source encodings x 2
string encodings) are performed.
Rx-XML permits programs to make a space/time trade-off --
trading memory available to the matcher for matcher
throughput. The validation tests are run in small, default,
and large memory configurations.
The validation tests are also run in "stress test" mode. In
that mode, each legal expression is compiled only once, but
the list of match tests is repeated 500 times (a total
of 1276 x 500 = 638,000 matches).
(Rx-XML passes all of the tests described above.)
We repeat the caveat from the section "Limitations":
This is the first public (non-prototype) release.
Rx-XML has been carefully tested but is not yet in
widespread use. Test coverage is extensive, but not
exhaustive. Thus, prudence suggests that minor bugs
are likely to remain.
Execution Speed:
*Qualitative Analysis*
Rx-XML converts regular expressions to deterministic finite
automata (DFA) and compares strings to expressions in a single
pass. Subject to memory limitations, comparing a length `N'
string to an expression requires `O(N)' (very inexpensive)
steps.
When there is not enough memory to hold a complete
deterministic automaton, Rx-XML discards and rebuilds
DFA states as needed. In the worst case, comparing a string
of length `N' to an expression of length `K' requires
`O(K * N)' (somewhat expensive) steps.
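As an illustration of the single-pass, `O(N)' behaviour described above, here is a minimal sketch of a table-driven DFA matcher. It uses a hand-written transition table for the ASCII-only stand-in expression `[a-z]{4}-[0-9]{5}'. The table, the names, and the restriction to ASCII are assumptions made for this example; Rx-XML builds its automata from the expression itself and its internal data structures are not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Input classes for the ASCII stand-in expression [a-z]{4}-[0-9]{5}. */
enum { CLS_LOWER, CLS_DASH, CLS_DIGIT, CLS_OTHER, N_CLASSES };
enum { N_STATES = 11, ACCEPT = 10, DEAD = -1 };

/* next_state[s][c]: state reached from state s on an input of class c. */
static const signed char next_state[N_STATES][N_CLASSES] = {
  /*          lower dash digit other */
  /*  0 */ {     1,  -1,   -1,   -1 },
  /*  1 */ {     2,  -1,   -1,   -1 },
  /*  2 */ {     3,  -1,   -1,   -1 },
  /*  3 */ {     4,  -1,   -1,   -1 },
  /*  4 */ {    -1,   5,   -1,   -1 },
  /*  5 */ {    -1,  -1,    6,   -1 },
  /*  6 */ {    -1,  -1,    7,   -1 },
  /*  7 */ {    -1,  -1,    8,   -1 },
  /*  8 */ {    -1,  -1,    9,   -1 },
  /*  9 */ {    -1,  -1,   10,   -1 },
  /* 10 */ {    -1,  -1,   -1,   -1 }
};

/* Map a byte to its input class (assumes an ASCII execution charset). */
static int
classify (unsigned char c)
{
  if (c >= 'a' && c <= 'z') return CLS_LOWER;
  if (c == '-')             return CLS_DASH;
  if (c >= '0' && c <= '9') return CLS_DIGIT;
  return CLS_OTHER;
}

/* Return 1 if s[0..len) matches the whole expression, 0 otherwise.
   Each input byte costs one table lookup, so the scan is O(len).  */
int
dfa_is_match (const char * s, size_t len)
{
  int state = 0;
  size_t i;

  assert (s != NULL);
  for (i = 0; i < len; ++i)
    {
      state = next_state[state][classify ((unsigned char) s[i])];
      if (state == DEAD)
        return 0;               /* no continuation can match */
    }
  return state == ACCEPT;
}
```

In this sketch the whole table fits in memory, which corresponds to the best case described above. The worst case arises when states like these have to be rebuilt on demand instead of looked up.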
*Quantitative Analysis*
There are no agreed upon standards for measuring the
performance of Unicode regular expression pattern matchers.
We know of no other comparable regular expression matchers.
For these reasons, comparative performance measurements are
not available.
The primary target applications for Rx-XML are
high-performance, Schema-capable, validating XML processors.
Consider, for example, an XML-based transaction processor.
Our hypothetical processor is configured by reading a
set of schema definitions, including some which use `pattern'
constraints, at start-up time. It then reads a succession of
XML documents, each specifying a proposed transaction,
from untrusted sources. Among other validations, the
processor must check to see that each proposed transaction
is an XML document that conforms to the schema definitions.
Part of that validation process involves comparing data
to regular expressions specified for pattern schema.
What, then, is the impact of pattern matching on transaction
processing speed?
We made measurements designed to test the best and worst case
scenarios. In the best case, there is sufficient memory that
all of the relevant DFA states are cached in memory and
matching is performed with O(N) steps. In the worst case,
memory is too limited, almost no DFA states are successfully
cached, and matching requires O(K*N) steps.
Our measurements use the regular expression:
\p{Ll}{4}-\p{Nd}{5}
which matches a string of four lowercase letters (in any
script), followed by "-", followed by five decimal digits
(in any script).
To simulate a sequence of transaction requests, we constructed
5 matching strings and 1 non-matching string. To avoid
skewing test results by exercising only optimizations that
apply to the Basic Latin script, our test strings use
characters from six Unicode Blocks: Armenian, Basic Latin,
Greek, IPA Extensions, Latin-1 Supplement, and Devanagari.
In an inner loop, we successively test each of those strings
for a match. An outer loop repeats those tests either
50000 or 5000 times.
We performed two sets of measurements: one for strings encoded
using UTF-16, and one for strings encoded using UTF-8.
For each encoding form, we made two measurements: once where
Rx-XML is granted enough memory to cache the entire DFA for
the test expression, and once where its allocation is
drastically limited, forcing it to re-construct a DFA state at
almost every step (these are the ample memory and limited
memory scenarios). For the ample memory scenario, 50000
iterations of the outer loop were performed. For the
limited memory scenario, 5000 iterations were performed.
To compute the number of code points processed per second, we
divided the number of Unicode code points scanned by the
matcher by the sum of system and user CPU time taken for all
iterations. That number of code points is the number of
iterations multiplied by the sum of the lengths of the
matching strings and the length of the non-matching
string up to and including the first non-matching
character.
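The throughput figure is the number of code points scanned divided by the CPU time used. A minimal sketch of that arithmetic follows; the function names are hypothetical and no measured values from our tests are encoded in it.

```c
#include <assert.h>

/* Code points scanned in one pass of the inner loop: the full length of
   each matching string plus the scanned prefix of the non-matching one
   (up to and including its first non-matching character).  */
unsigned long
code_points_per_pass (const unsigned long * match_lengths, int n_matches,
                      unsigned long mismatch_prefix_length)
{
  unsigned long total = mismatch_prefix_length;
  int i;

  assert (match_lengths != 0 && n_matches >= 0);
  for (i = 0; i < n_matches; ++i)
    total += match_lengths[i];
  return total;
}

/* Throughput: total code points scanned divided by user + system time. */
double
code_points_per_second (unsigned long iterations, unsigned long per_pass,
                        double user_seconds, double system_seconds)
{
  double cpu_seconds = user_seconds + system_seconds;

  assert (cpu_seconds > 0.0);
  return ((double) iterations * (double) per_pass) / cpu_seconds;
}
```

For example, with purely illustrative inputs (five matching strings of 10 code points each, a non-matching prefix of 3 code points, 50000 iterations, and 1.0 second of combined CPU time), one pass scans 53 code points and the computed rate is 2,650,000 code points per second.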
Both the library and test program were compiled by gcc (egcs
version 2.91.66), *without* optimization. Tests were
conducted on a 400 MHz Pentium II running FreeBSD 3.0.
Profiling reports that 96% of the run-time of the
test program was spent in `rx_xml_is_match':
                 ample          limited
                 memory         memory

    UTF-16       3,604,194      7,298
    UTF-8        3,395,061      4,642

    code-points-per-second
    processed by rx_xml_is_match
For information about the amount of memory used by these
tests, see the next section, "Run-time Allocation Requirements".
To gain further perspective on matcher throughput, we ran the
test suite described in the section "Accuracy" in stress test
mode (compiling each expression only once), limiting the list
of test cases to 42 representative tests, iterating 500 times.
Each test case performs both UTF-8 and UTF-16 matches. Cache
sizes were sufficiently large for an "ample memory" test.
This test performed 84,000 separate matches totalling 730,000
code points at a rate of 2,033,426 code points per second. In
this test `rx_xml_is_match' accounted for 72% of the run time
(for comparison with the preceding table, (96 / 72) *
2,033,426 == 2,711,235).
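The adjustment above rescales a whole-program rate so that it can be compared with a run in which `rx_xml_is_match' accounted for a different share of the run time. A minimal sketch of that arithmetic, using a hypothetical helper name:

```c
#include <assert.h>

/* Rescale a rate measured when the matcher took share_measured percent
   of the run time to the rate expected if it took share_reference
   percent, so that two profiles can be compared.  */
double
rescale_rate (double measured_rate, double share_measured,
              double share_reference)
{
  assert (share_measured > 0.0);
  return measured_rate * (share_reference / share_measured);
}
```

Calling `rescale_rate (2033426.0, 72.0, 96.0)' gives approximately 2,711,235 code points per second, the figure quoted above.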
Our performance test has limitations: Performance is likely to
vary with the choice of regular expression and with the
strings being matched. While almost all conceivable
expressions should have substantially similar performance, it
is theoretically possible to construct expressions whose DFA
is too large to ever cache; such expressions will always have
"limited memory" performance. Performance (especially memory
allocation, discussed in the next section) is likely to vary
with the strings being matched; by using a variety of scripts
in our test cases, we tried to achieve realistic results --
but a much larger set of test cases would presumably be a more
realistic simulation.
Nevertheless, we conclude that with "ample memory", Rx-XML is
fast enough for quite heavy workloads (millions of code points
per second). In (worst case) "limited memory" configurations,
Rx-XML is suitable for very light workloads (thousands of code
points per second).
Exact performance will, obviously, vary with choice of
compiler, compiler options, and platform.
Run-time Allocation Requirements:
The amount of memory allocated by Rx at run-time is
dominated by the size of two caches: the "nfa cache" and
the "dfa cache". Roughly speaking, the "nfa cache"
limits the quantity of compiled regexps that are re-used
across calls to "regcomp"; the size of the "dfa cache"
limits the number of DFA transition tables dynamically
constructed by "regexec".
The sizes of these caches are independent and under
application control. They may be varied, trading space
for time.
By default, both cache sizes are set to an advisory
limit of 1MB. (This is an "advisory limit" because it
may be exceeded when necessary for correct operation.
Except in rare circumstances, actual cache sizes closely
approximate the advisory limit.)
We ran the test suite discussed in the section "Accuracy" with
advisory limits of 10K, 1MB, and 8MB (for each cache). The
relation between the advisory limits, the amount of memory
actually used, the combined user and system execution times,
and the number of DFA-state cache misses is displayed in the
following table:
    NFA/DFA cache     NFA/DFA cache          time         DFA misses
    advisory limit    actual usage           (seconds)
    (bytes)           (bytes)

    10K / 10K         10,514 / 12,744        0.61         3908
    1MB / 1MB         299,948 / 1,048,576    0.27         890
    8MB / 8MB         299,796 / 3,139,536    0.32         890
(Note that 10K vs. 1MB is an effective space/time trade-off.
8MB did not improve the cache-hit rate compared to 1MB,
and caused a slight performance loss -- presumably due to
virtual memory effects.)
The "ample memory" and "limited memory" tests described
in the section "Execution Speed" were conducted by varying
the advisory cache limits. The memory used in those
tests is displayed in the following table:
                        NFA/DFA cache     NFA/DFA cache
                        advisory limit    actual usage
                        (bytes)           (bytes)

    UTF-16 matching
      limited memory    1024 / 1024       10,513 / 3,604
      ample memory      1MB / 1MB         10,513 / 37,388

    UTF-8 matching
      limited memory    1024 / 1024       10,513 / 10,900
      ample memory      1MB / 1MB         10,513 / 75,724
(The section "Execution Speed" describes the dramatic
performance differences between the two cache limits.)
Support:
To purchase an alternative license, request additional
features, or for any kind of support assistance, you can
contact us at "hackerlab@regexps.com" or via our web site
"www.regexps.com". We can also be reached at (412) 401-5204.
We are currently in the midst of relocating from Pennsylvania
to California, so at this time, we have no reliable postal
address.