regexps.com
The W3C document XML Schema Part 2: Datatypes defines a regular expression language for use in XML Schema. (See http://www.w3.org/TR/xmlschema-2.)
This chapter presents C functions used for matching XML regular expressions. (See An Introduction to XML Regular Expressions.)
#include <hackerlab/rx-unicode/re.h>
The functions in this section compile regular expressions using the syntax defined for XML Schema. (See also Using Rx in XML Processors.)
enum rx_xml_recomp_errno rx_xml_recomp (rx_xml_rebuf * re, enum uni_encoding_scheme encoding, uni_string * source, size_t length);
Compile a regular expression using XML syntax.
re is an opaque output parameter which will be filled with the compiled expression.
encoding describes the encoding scheme of the source expression. It may be any of:
    uni_iso8859_1    8-bit characters (u+0..u+255 only)
    uni_utf8         UTF-8 encoding
    uni_utf16        UTF-16 in native byte order
Note that the encoding scheme of the source expression has no bearing on the encoding scheme of strings tested for a match.
source points to the source for the regular expression, a string encoded in the manner specified by encoding.
length is the size of the source expression, measured in code units.
Return 0 upon success, an error code otherwise. See <hackerlab/rx-unicode/re.h> for the list of error codes. The list of error codes is likely to change in future releases. Future releases will add a function for translating error codes to error messages.
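Because length is measured in code units rather than characters, the same expression can have a different length under each encoding. The following is a minimal sketch in plain C that is not part of Rx: the helper names utf8_code_units and utf16_code_units are hypothetical, and both assume zero-terminated buffers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of UTF-8 code units (bytes) in a
   NUL-terminated buffer. */
size_t
utf8_code_units (const char * s)
{
  return strlen (s);
}

/* Hypothetical helper: number of UTF-16 code units (16-bit words) in a
   zero-terminated buffer.  A code point above u+FFFF occupies two units. */
size_t
utf16_code_units (const uint16_t * s)
{
  size_t n = 0;

  while (s[n] != 0)
    ++n;
  return n;
}

/* The two-character expression u+00E9 followed by '+', in two encodings. */
const char expr_utf8[] = "\xC3\xA9+";              /* 3 code units */
const uint16_t expr_utf16[] = { 0x00E9, 0x002B, 0 };  /* 2 code units */
```

Under these assumptions the UTF-8 form would be passed with a length of 3 and the UTF-16 form with a length of 2, even though both spell the same expression.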
enum rx_xml_recomp_errno rx_xml_recomp_branch (rx_xml_rebuf * re, enum uni_encoding_scheme encoding, uni_string source, size_t length);
Compile a Unicode regular expression in XML syntax, adding an alternative branch to an already compiled expression.
Parameters and return values are as for rx_xml_recomp, except that re must contain an already compiled expression.
If, on entry, re contains a compiled form of expression RE1, and source points to expression RE2, the result is the same as if the expression:
    (RE1)|(RE2)
were compiled.
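The equivalence above can be pictured at the source level. The sketch below builds the combined source text for two expressions; it is only an illustration of what the result is equivalent to, not a description of how rx_xml_recomp_branch works internally. The helper name combine_branches is hypothetical, and the expressions are assumed to be NUL-terminated 8-bit or UTF-8 strings.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write "(re1)|(re2)" into out.
   Returns 0 on success, -1 if out is too small to hold the result. */
int
combine_branches (char * out, size_t out_size, const char * re1, const char * re2)
{
  int n = snprintf (out, out_size, "(%s)|(%s)", re1, re2);

  return (n >= 0 && (size_t) n < out_size) ? 0 : -1;
}
```

Compiling the combined text in one call should accept the same strings as compiling RE1 and then adding RE2 as a branch, but the branch function avoids having to keep or re-parse the source of RE1.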
The function in this section compiles Unicode regular expressions using any of a variety of syntaxes which can be selected by options provided to rx_xml_recomp_opts.
enum rx_xml_recomp_errno rx_xml_recomp_opts (rx_xml_rebuf * re, enum uni_encoding_scheme encoding, uni_string source, size_t length, t_ulong syntax_options, bits cset);
Compile a Unicode regular expression.
Parameters re, encoding, source, and length are as for rx_xml_recomp.
syntax_options indicates what regular expression syntax to use and how to interpret that syntax. It is a bit-wise or of any of:
    rx_xml_syntax_unicode_escapes
        Permit numeric character escapes ("\u+xxxx" and "\v+xxxxxx").
    rx_xml_syntax_consistent_metacharacters
        Require all special characters to be escaped in character class expressions. Ordinarily, some special characters do not need to be escaped in character class expressions (as per XML syntax).
    rx_xml_syntax_dot_dot_ranges
        Use ".." instead of "-" to indicate character ranges in character class expressions ("[a..z]").
    rx_xml_syntax_carrot_set_difference
        Use "^" instead of "-" to indicate character set subtraction ("[\p{L}^[a..z]]").
    rx_xml_syntax_add_branch
        `re' must contain a previously compiled expression. Compile this expression as an alternative branch.
    rx_xml_syntax_no_newlines
    rx_xml_syntax_no_cr
    rx_xml_syntax_no_linesep
        Do not include newline (carriage return, line separator) in the character set matched by `.' or (implicitly) in negated character set expressions.
    rx_xml_syntax_dot_star_prefix
        Compile the expression as if it were prefixed by: `(.*)'
Passing 0 or rx_xml_syntax_xml for syntax_options causes XML syntax to be used.
cset is the set of all valid code points. Common choices for cset are:
    xml_charset -- the set of code points permitted in XML documents (declared in <hackerlab/xml/charsets.h>)
    unidata_bitset_universal -- the set of all assigned code points (declared in <hackerlab/unicode/unicode.h>)
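Since syntax_options is a bit-wise or of option flags, options are combined with | and can be tested with &. The sketch below shows only that idiom. The demo_* names and their values are invented for illustration; the real rx_xml_syntax_* constants and their values come from <hackerlab/rx-unicode/re.h>.

```c
/* Hypothetical flags standing in for the rx_xml_syntax_* constants.
   The numeric values are invented for this sketch. */
enum demo_syntax_options
{
  demo_syntax_unicode_escapes = 1 << 0,
  demo_syntax_dot_dot_ranges = 1 << 1,
  demo_syntax_no_newlines = 1 << 2
};

/* Combine two options, as a caller would when building syntax_options. */
unsigned long
demo_escapes_and_ranges (void)
{
  return demo_syntax_unicode_escapes | demo_syntax_dot_dot_ranges;
}

/* Test whether a given option bit is set in a combined value. */
int
demo_has_option (unsigned long options, unsigned long flag)
{
  return (options & flag) != 0;
}
```

A value of 0 has no option bits set, which is consistent with the statement above that passing 0 selects plain XML syntax.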
void rx_xml_free_re (rx_xml_rebuf * re);
Release all storage associated with a previously compiled expression. This does not free the memory pointed to by re.
int rx_xml_is_match (enum rx_xml_rematch_errno * errn, rx_xml_rebuf * re, enum uni_encoding_scheme encoding, uni_string * string, size_t length);
Compare the compiled expression re to string. Return 1 if the entire string matches, 0 if it does not, and -1 if an error occurs.
errn is an output parameter that returns an error code if an error occurs (see <hackerlab/rx-unicode/re.h> for the list of error codes). The list of error codes is likely to change in future releases. Future releases will add a function for translating error codes to error messages.
re is a previously compiled regular expression.
encoding describes the encoding scheme of string. It may be any of:
    uni_iso8859_1    8-bit characters (u+0..u+255 only)
    uni_utf8         UTF-8 encoding
    uni_utf16        UTF-16 in native byte order
string is the string to test for a match, encoded according to encoding.
length is the length of that string in code units.
enum rx_xml_longest_status rx_xml_longest_match (enum rx_xml_rematch_errno * errn, size_t * match_len, rx_xml_rebuf * re, enum uni_encoding_scheme encoding, uni_string string, size_t length);
Look for the longest matching prefix of string. There are five possible return values:
    rx_xml_longest_error
        An error occurred (check *errn).
    rx_xml_longest_out_of_input_match
        A match was found. If the string were longer, a longer match might be found.
    rx_xml_longest_out_of_input_nomatch
        No match was found. If the string were longer, a match might be found.
    rx_xml_longest_found
        The longest possible match was found.
    rx_xml_longest_nomatch
        No prefix matches.
errn is an output parameter that returns an error code if an error occurs (see <hackerlab/rx-unicode/re.h> for the list of error codes). The list of error codes is likely to change in future releases. Future releases will add a function for translating error codes to error messages.
match_len returns the length (in code units) of the longest match found (if any).
re is a previously compiled regular expression.
encoding describes the encoding scheme of string. It may be any of:
    uni_iso8859_1    8-bit characters (u+0..u+255 only)
    uni_utf8         UTF-8 encoding
    uni_utf16        UTF-16 in native byte order
string is the string to test for a match, encoded according to encoding.
length is the length of that string in code units.
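The two out_of_input statuses matter mostly to callers that receive a string in pieces: they signal that the answer could change if more input arrived. The sketch below shows one way a caller might classify the five statuses. It uses a local stand-in enum with a demo_ prefix and invented numeric values, and both helper functions are hypothetical. That *match_len is meaningful exactly for the two statuses reporting a match is an inference from the descriptions above, not something the header is known to guarantee.

```c
/* Local stand-in for enum rx_xml_longest_status.  The names mirror the
   documented statuses; the numeric values are invented for this sketch. */
enum demo_longest_status
{
  demo_longest_error,
  demo_longest_out_of_input_match,
  demo_longest_out_of_input_nomatch,
  demo_longest_found,
  demo_longest_nomatch
};

/* Hypothetical helper: could reading more input change the result? */
int
demo_more_input_could_help (enum demo_longest_status status)
{
  switch (status)
    {
    case demo_longest_out_of_input_match:
    case demo_longest_out_of_input_nomatch:
      return 1;
    default:
      return 0;
    }
}

/* Hypothetical helper: did the call report a match, so that the
   reported match length is worth reading? */
int
demo_have_match_length (enum demo_longest_status status)
{
  return (status == demo_longest_found
          || status == demo_longest_out_of_input_match);
}
```

A streaming caller could, for example, keep appending input while demo_more_input_could_help returns 1 and stop as soon as a definite status is returned.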
enum rx_xml_prefix_status rx_xml_prefix_match (enum rx_xml_rematch_errno * errn, rx_xml_rebuf * re, enum uni_encoding_scheme encoding, uni_string string, size_t length);
Look for any matching prefix of string. Note that this function does not look for the longest matching prefix, and does not return the length of the prefix found -- it merely verifies the existence of some matching prefix.
There are four possible return values:
    rx_xml_prefix_error
        An error occurred (check *errn).
    rx_xml_prefix_out_of_input
        No match was found. If the string were longer, a match might be found.
    rx_xml_prefix_found
        A matching prefix was found.
    rx_xml_prefix_nomatch
        No prefix matches.
errn is an output parameter that returns an error code if an error occurs (see <hackerlab/rx-unicode/re.h> for the list of error codes). The list of error codes is likely to change in future releases. Future releases will add a function for translating error codes to error messages.
re is a previously compiled regular expression.
encoding describes the encoding scheme of string. It may be any of:
    uni_iso8859_1    8-bit characters (u+0..u+255 only)
    uni_utf8         UTF-8 encoding
    uni_utf16        UTF-16 in native byte order
string is the string to test for a match, encoded according to encoding.
length is the length of that string in code units.
XML Schema datatype definitions can use regular expressions in pattern facets to define the syntax of a value space. For information on this topic, see the W3C document XML Schema: Part 2, http://www.w3.org/TR/xmlschema-2.
The functions rx_xml_recomp, rx_xml_recomp_branch, and rx_xml_is_match were designed specifically for use with pattern facets.
rx_xml_recomp compiles the regular expression syntax of pattern facets. (See Comments on the XML Schema Regular Expression Syntax.)
If a pattern facet contains more than one regular expression, they are supposed to be combined as alternative branches (see section 5.2.4 of XML Schema: Part 2). The function rx_xml_recomp_branch provides an easy implementation of this functionality.
Finally, rx_xml_is_match compares an entire string to a regular expression -- the test required to validate a pattern constraint.