The Hackerlab at regexps.com

XML Regular Expressions


The W3C document XML Schema Part 2: Datatypes defines a regular expression language for use in XML Schema. (See http://www.w3.org/TR/xmlschema-2.)

This chapter presents C functions used for matching XML regular expressions. (See An Introduction to XML Regular Expressions.)


XML Regular Expression Functions



#include <hackerlab/rx-unicode/re.h>


Compiling XML Regular Expressions


The functions in this section compile regular expressions using the syntax defined for XML Schema. (See also Using Rx in XML Processors.)

Function rx_xml_recomp

enum rx_xml_recomp_errno
rx_xml_recomp (rx_xml_rebuf * re,
               enum uni_encoding_scheme encoding,
               uni_string * source,
               size_t length);

Compile a regular expression using XML syntax.

re is an opaque output parameter which will be filled with the compiled expression.

encoding describes the encoding scheme of the source expression. It may be any of:

     uni_iso8859_1,          8-bit characters (u+0..u+255 only)
     uni_utf8,               UTF-8 encoding
     uni_utf16,              UTF-16 in native byte order

Note that the encoding scheme of the source expression has no bearing on the encoding scheme of strings tested for a match.

source points to the source for the regular expression, a string encoded in the manner specified by encoding.

length is the size of the source expression, measured in code units.
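
Because length counts code units rather than characters, the same expression can have different lengths in different encodings. The following is a minimal sketch, not part of the library: the demo_ names are hypothetical helpers, and the sample text consists of the two code points U+00E9 and U+1D11E.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Code units in a NUL-terminated UTF-8 (or ISO 8859-1) string:
   one code unit per byte. */
static size_t
demo_utf8_code_units (const char * s)
{
  return strlen (s);
}

/* Code units in a zero-terminated UTF-16 string:
   one code unit per 16-bit value, so a surrogate pair counts as two. */
static size_t
demo_utf16_code_units (const uint16_t * s)
{
  size_t n = 0;

  while (s[n] != 0)
    ++n;
  return n;
}

/* U+00E9 followed by U+1D11E, in both encodings. */
static const char demo_utf8[] = "\xC3\xA9\xF0\x9D\x84\x9E";
static const uint16_t demo_utf16[] = { 0x00E9, 0xD834, 0xDD1E, 0 };
```

Here the UTF-8 form occupies six code units and the UTF-16 form three, although both encode the same two code points.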

Return 0 upon success, an error code otherwise. See <hackerlab/rx-unicode/re.h> for the list of error codes, which is likely to change in future releases. Future releases will add a function for translating error codes to error messages.



Function rx_xml_recomp_branch

enum rx_xml_recomp_errno
rx_xml_recomp_branch (rx_xml_rebuf * re,
                      enum uni_encoding_scheme encoding,
                      uni_string source,
                      size_t length);

Compile a Unicode regular expression in XML syntax, adding it as an alternative branch to an already compiled expression.

Parameters and return values are as for rx_xml_recomp, except that re must contain an already compiled expression.

If, on entry, re contains a compiled form of expression RE1, and source points to expression RE2, the result is the same as if the expression:

             (RE1)|(RE2)

were compiled.
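
The effect of adding a branch can be illustrated without the library. The sketch below uses the system's POSIX regcomp and regexec as a stand-in for Rx (so the patterns are in POSIX extended syntax, not XML syntax), and the demo_ function names are hypothetical. It tests a whole string against the alternation (RE1)|(RE2).

```c
#include <regex.h>
#include <stdio.h>

/* Return 1 if all of `text' matches the POSIX extended regexp `pattern',
   0 if it does not, and -1 if the pattern cannot be compiled. */
static int
demo_whole_match (const char * pattern, const char * text)
{
  char anchored[256];
  regex_t re;
  int n;
  int status;

  /* Anchor at both ends so that only a match of the entire string counts. */
  n = snprintf (anchored, sizeof anchored, "^(%s)$", pattern);
  if (n < 0 || (size_t) n >= sizeof anchored)
    return -1;
  if (regcomp (&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  status = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return status == 0;
}

/* Match `text' against the alternation (re1)|(re2). */
static int
demo_branch_match (const char * re1, const char * re2, const char * text)
{
  char combined[200];
  int n;

  n = snprintf (combined, sizeof combined, "(%s)|(%s)", re1, re2);
  if (n < 0 || (size_t) n >= sizeof combined)
    return -1;
  return demo_whole_match (combined, text);
}
```

With RE1 = [0-9]+ and RE2 = [a-f]+, demo_branch_match accepts "2024" and "cafe" but rejects "20ab", which matches neither branch in full.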




Compiling With Syntax Options


The function in this section compiles Unicode regular expressions in any of several syntaxes, selected by the options passed to rx_xml_recomp_opts.

Function rx_xml_recomp_opts

enum rx_xml_recomp_errno
rx_xml_recomp_opts (rx_xml_rebuf * re,
                    enum uni_encoding_scheme encoding,
                    uni_string source,
                    size_t length,
                    t_ulong syntax_options,
                    bits cset);

Compile a Unicode regular expression.

Parameters re, encoding, source, and length are as for rx_xml_recomp.

syntax_options indicates what regular expression syntax to use and how to interpret that syntax. It is a bitwise OR of any of:

     rx_xml_syntax_unicode_escapes
         Permit numeric character escapes ("\u+xxxx" and "\v+xxxxxx")

     rx_xml_syntax_consistent_metacharacters
         Require all special characters to be escaped in 
         character class expressions.  Ordinarily,
         some special characters do not need to be 
         escaped in character class expressions (as per
         XML syntax).

     rx_xml_syntax_dot_dot_ranges
         Use ".." instead of "-" to indicate character ranges
         in character class expressions ("[a..z]")

     rx_xml_syntax_carrot_set_difference
         Use "^" instead of "-" to indicate character set
         subtraction ("[\p{L}^[a..z]]")

     rx_xml_syntax_add_branch
         `re' must contain a previously compiled expression.
         Compile this expression as an alternative branch.

     rx_xml_syntax_no_newlines
     rx_xml_syntax_no_cr
     rx_xml_syntax_no_linesep
         Do not include newline (carriage return, line separator) 
         in the character set matched by `.' or (implicitly) in
         negated character set expressions.

     rx_xml_syntax_dot_star_prefix
         Compile the expression as if it were prefixed by:

             `(.*)'

Passing 0 or rx_xml_syntax_xml for syntax_options causes XML syntax to be used.

cset is the set of all valid code points.

Common choices for cset are:

     xml_charset     -- the set of code points permitted in
                        XML documents (declared in
                        <hackerlab/xml/charsets.h>)

     unidata_bitset_universal -- the set of all assigned
                        code points (declared in
                        <hackerlab/unicode/unicode.h>)
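
As an illustration of what such a set contains, the sketch below tests a code point against the Char production of XML 1.0. That xml_charset corresponds to this production is an assumption, and demo_is_xml_char is a hypothetical helper, not a library function.

```c
#include <stdint.h>

/* Return 1 if `cp' is matched by the XML 1.0 Char production:
   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
   and 0 otherwise. */
static int
demo_is_xml_char (uint32_t cp)
{
  return cp == 0x9 || cp == 0xA || cp == 0xD
         || (cp >= 0x20 && cp <= 0xD7FF)
         || (cp >= 0xE000 && cp <= 0xFFFD)
         || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

Under this definition, NUL, the surrogate range U+D800..U+DFFF, and U+FFFE and U+FFFF are outside the set, so an expression compiled against it could not match them.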




Freeing a Compiled XML Regular Expression


Function rx_xml_free_re

void rx_xml_free_re (rx_xml_rebuf * re);

Release all storage associated with a previously compiled expression. This does not free the memory pointed to by re.




Comparing a String To An XML Regular Expression


Function rx_xml_is_match

int rx_xml_is_match (enum rx_xml_rematch_errno * errn,
                     rx_xml_rebuf * re,
                     enum uni_encoding_scheme encoding,
                     uni_string * string,
                     size_t length);

Compare the compiled expression re to string. Return 1 if the entire string matches, 0 if it does not, and -1 if an error occurs.

errn is an output parameter that returns an error code if an error occurs (see <hackerlab/rx-unicode/re.h> for the list of error codes). The list of error codes is likely to change in future releases. Future releases will add a function for translating error codes to error messages.

re is a previously compiled regular expression.

encoding describes the encoding scheme of string. It may be any of:

     uni_iso8859_1,          8-bit characters (u+0..u+255 only)
     uni_utf8,               UTF-8 encoding
     uni_utf16,              UTF-16 in native byte order

string is the string to test for a match, encoded according to encoding.

length is the length of that string in code units.



Function rx_xml_longest_match

enum rx_xml_longest_status
rx_xml_longest_match (enum rx_xml_rematch_errno * errn,
                      size_t * match_len,
                      rx_xml_rebuf * re,
                      enum uni_encoding_scheme encoding,
                      uni_string string,
                      size_t length);

Look for the longest matching prefix of string. There are five possible return values:

    rx_xml_longest_error
             An error occurred (check *errn).

    rx_xml_longest_out_of_input_match
             A match was found.  If the string were longer,
             a longer match might be found.

    rx_xml_longest_out_of_input_nomatch
             No match was found.  If the string were longer,
             a match might be found.

    rx_xml_longest_found
             The longest possible match was found.

    rx_xml_longest_nomatch
             No prefix matches.

errn is an output parameter that returns an error code if an error occurs (see <hackerlab/rx-unicode/re.h> for the list of error codes). The list of error codes is likely to change in future releases. Future releases will add a function for translating error codes to error messages.

match_len returns the length (in code units) of the longest match found (if any).

re is a previously compiled regular expression.

encoding describes the encoding scheme of string. It may be any of:

     uni_iso8859_1,          8-bit characters (u+0..u+255 only)
     uni_utf8,               UTF-8 encoding
     uni_utf16,              UTF-16 in native byte order

string is the string to test for a match, encoded according to encoding.

length is the length of that string in code units.
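
The four non-error statuses can be illustrated with a toy matcher. The sketch below is not the library's implementation: it hard-codes the pattern [0-9]+ over 8-bit input, and the demo_ names are hypothetical stand-ins for the documented values. It shows one reading of the distinction, in which the out-of-input statuses are returned only when the matcher consumes the whole string and more input could still change the answer.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for the documented status values; the real
   enum rx_xml_longest_status is declared by the library. */
enum demo_longest_status
{
  demo_longest_out_of_input_match,
  demo_longest_out_of_input_nomatch,
  demo_longest_found,
  demo_longest_nomatch
};

/* Longest-prefix match for the fixed pattern [0-9]+ over 8-bit input.
   On a match, *match_len receives the prefix length in code units. */
static enum demo_longest_status
demo_longest_digits (size_t * match_len, const char * string, size_t length)
{
  size_t i = 0;

  *match_len = 0;
  while (i < length && isdigit ((unsigned char) string[i]))
    ++i;

  if (i < length)
    {
      /* Stopped on a non-digit: more input cannot change the answer. */
      if (i == 0)
        return demo_longest_nomatch;
      *match_len = i;
      return demo_longest_found;
    }

  /* Consumed all input: a following digit could start or extend a match. */
  if (i == 0)
    return demo_longest_out_of_input_nomatch;
  *match_len = i;
  return demo_longest_out_of_input_match;
}
```

For the input "123x" the sketch reports a definite longest match of length 3, while for "123" it reports a match of length 3 that further input might extend. A caller reading from a stream could use that difference to decide whether to wait for more data.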



Function rx_xml_prefix_match

enum rx_xml_prefix_status
rx_xml_prefix_match (enum rx_xml_rematch_errno * errn,
                     rx_xml_rebuf * re,
                     enum uni_encoding_scheme encoding,
                     uni_string string,
                     size_t length);

Look for any matching prefix of string. Note that this function does not look for the longest matching prefix, and does not return the length of the prefix found -- it merely verifies the existence of some matching prefix.

There are four possible return values:

    rx_xml_prefix_error
             An error occurred (check *errn).

    rx_xml_prefix_out_of_input
             No match was found.  If the string were longer,
             a match might be found.

    rx_xml_prefix_found
             A matching prefix was found.

    rx_xml_prefix_nomatch
             No prefix matches.

errn is an output parameter that returns an error code if an error occurs (see <hackerlab/rx-unicode/re.h> for the list of error codes). The list of error codes is likely to change in future releases. Future releases will add a function for translating error codes to error messages.

re is a previously compiled regular expression.

encoding describes the encoding scheme of string. It may be any of:

     uni_iso8859_1,          8-bit characters (u+0..u+255 only)
     uni_utf8,               UTF-8 encoding
     uni_utf16,              UTF-16 in native byte order

string is the string to test for a match, encoded according to encoding.

length is the length of that string in code units.




Using Rx in XML Processors


XML Schema datatype definitions can use regular expressions in pattern facets to define the syntax of a value space. For information on this topic, see the W3C document XML Schema: Part 2, http://www.w3.org/TR/xmlschema-2.

The functions rx_xml_recomp, rx_xml_recomp_branch, and rx_xml_is_match were designed specifically for implementing pattern facets.

rx_xml_recomp compiles the regular expression syntax of pattern facets. (See Comments on the XML Schema Regular Expression Syntax.)

If a pattern facet contains more than one regular expression, they are supposed to be combined as alternative branches (see section 5.2.4 of XML Schema: Part 2). The function rx_xml_recomp_branch provides an easy implementation of this functionality.

Finally, rx_xml_is_match compares an entire string to a regular expression -- the test required to validate a pattern constraint.

libhackerlab: The Hackerlab C Library