The Hackerlab at regexps.com

An Introduction to XML Regular Expressions

up: libhackerlab
next: Comments on the XML Schema Regular Expression Syntax
prev: An Introduction to Posix Regexps

This chapter informally describes "XML regular expressions" as supported by Rx.

A regular expression describes a textual pattern: a string of characters either matches (fits the pattern) or does not match (does not fit the pattern). Regular expressions have many applications associated with searching, editing, and parsing text. Typically, a program allows users to specify a regular expression, then it searches for text that matches that expression. In XML Schema, regular expressions are used to define "lexical spaces" (the set of valid literals for a datatype).

The syntax and semantics of XML regular expressions is defined in the W3C document XML Schema Part 2: Datatypes (http://www.w3.org/TR/xmlschema-2).


Literal and Special Characters

up: An Introduction to XML Regular Expressions
next: Single Character Escapes

In the simplest cases, a regular expression is just a literal string that must match exactly. For example, the pattern:

     regular expression

matches the string "regular expression" and no others.

Some characters have a special meaning when they occur in a regular expression. They aren't matched literally as in the previous example, but instead denote a more general pattern. For example, the character * is used to indicate that the preceding element of a regular expression may be repeated 0 , 1 , or more times. In the pattern:

     smooo*th

the * indicates that the preceding o can be repeated 0 or more times. So the pattern matches:

     smooth
     smoooth
     smooooth
     smoooooth
     ...

The same pattern does not match these examples:

     smoth      -- The pattern requires at least two o's
     smoo*th    -- The pattern doesn't match the *

Suppose you want to write a pattern that literally matches a special character like * ; in other words, you don't want to * to indicate a permissible repetition, but to match * literally. This is accomplished by quoting the special character with a backslash. The pattern:

     smoo\*th

matches the string:

     smoo*th

and no other strings.

These characters are special (their meaning is described in this chapter):

     . \ | ? * + { } ( ) [ ]

The remaining sections of this chapter introduce and explain the various special characters that can occur in regular expressions.


Single Character Escapes

up: An Introduction to XML Regular Expressions
next: Dot
prev: Literal and Special Characters

These three expressions each match a single, literal character:

        \t      -- matches tab (U+0008)
        \n      -- matches newline (U+000A)
        \r      -- matches carriage return (U+000D)


Dot

up: An Introduction to XML Regular Expressions
next: Multi-character Escapes
prev: Single Character Escapes

. ordinarily matches nearly any character.

     p.ck

matches

     pick
     pack
     puck
     pbck
     pcck
     p.ck

etc.

To be more specific, . matches any character which is valid in an XML document except for newline (U+000A) and carriage return (U+000D).


Multi-character Escapes

up: An Introduction to XML Regular Expressions
next: Category Escapes
prev: Dot

Literal characters match a specific character. Multi-character escapes match any of a set of characters. The multi-character escapes understood by Rx are:

        \s      -- spaces
        \i      -- first character of a "name"
        \c      -- subsequent character of a "name"
        \d      -- digits
        \w      -- word characters

\s matches a single space, tab, newline, or carriage return.

The set of characters valid as the first or subsequent characters in a name (matched by \i and \c respectively) are as defined in XML.

The set of characters matched by \d is as defined in XML and includes most Unicode characters with general category Nd .

The set of characters matched by \w includes all characters which are valid in an XML document except those with a general category in class P (punctuation), Z (separators), and C (other, such as control characters and unassigned code points).

Each of the multi-character escapes can be negated by writing it as a capital letter:

        \S      -- non-spaces
        \I      -- not the first character of a "name"
        \C      -- not a subsequent character of a "name"
        \D      -- non-digits
        \W      -- non-word characters


Category Escapes

up: An Introduction to XML Regular Expressions
next: Character Sets
prev: Multi-character Escapes

Category escapes are similar to multi-character escapes: they match any of a set of characters. The default set of category escapes refer to the character categories and block names of Unicode. (For information about categories and blocks, see a recent version of The Unicode Standard or visit http://www.unicode.org).

A category escape based on a Unicode category is be written:

        \p{<category name>}

for example:

        \p{L}           -- letters
        \p{Lu}          -- upper case letters
        \p{Nd}          -- digits

A category escape based on a Unicode block name can is written:

        \p{is<block name>}

for example:

        \p{isBasicLatin}
        \p{isCherokee}
        \p{isUnifiedCanadianAboriginalSyllabics}

Any category can be negated by use '\P' instead of '\p':

        \P{L}           -- non-Letters
        \P{isCherokee}  -- characters which are not in
                           the Cherokee block

In all cases, the set of characters is further restricted to include only characters that are valid in XML documents.


Character Sets

up: An Introduction to XML Regular Expressions
next: Parenthesized Expressions
prev: Category Escapes

[ begins a "character set". A character set matches any of a set of characters which are explicitly enumerated in the regular expression.

There are two basic forms a character set can take.

The first form is a plain character set:

     [<cset-spec>]      -- every character in <cset-spec> is in the set.

the second form is a negated character set:

     [^<cset-spec>]     -- every character *not* in <cset-spec>
                           is in the set.

A <cset-spec> is a more or less an explicit enumeration of a set of characters. It can be written as a string of individual characters:

     [ABC]              -- matches 'A', 'B', or 'C'

or as a range of characters:

     [0-9]              -- matches any decimal digit

These two forms can be mixed:

     [A-za-z0-9_$]      -- any Basic Latin letter (either case), 
                           any Basic Latin digit, `_' or `$'

Negation allows you to specify which characters are not in the set:

     [^0-9]             -- match any character which is not a 
                           Basic Latin digit

Special characters can be included within a character set, in the usual way, by quoting them with \ :

     [\-abc]            -- match '-', 'a', 'b', or 'c'

Single character escapes, multi-character escapes, and category escapes can all be used in a <cset-spec> :

    [\t\n\r]            -- match tab, newline, or return 

    [\s\d]              -- match any space character or digit

    [^\p{Nd}\p{Ll}]     -- match all but digits and lowercase 
                           letters

From the two basic forms of character set, two additional forms of character set can be formed, called "character set subtractions":

     [<cset-spec>-<character set>]
                        -- every character in <cset-spec> except
                           those in in <character set>

     [^<cset-spec>-<character set>]
                        -- every character not in <cset-spec> 
                           except those in <character set>

For example:

     [\p{L}-[\p{isBasicLatin}\p{isGreek}]]
                        -- match any letter except for letters
                           in the Basic Latin or Greek blocks

     [^\p{L}-[\p{isBasicLatin}\p{isGreek}]]
                        -- match any non-letter except for
                           non-letters in the Basic Latin or
                           Greek blocks


Parenthesized Expressions

up: An Introduction to XML Regular Expressions
next: Repeated Expressions
prev: Character Sets

A subexpression is a regular expression enclosed in ( and ) . A subexpression can be used anywhere a single character or character set can be used.

Subexpressions are useful for grouping regular expression constructs. For example, the repeat operator, * , usually applies to just the preceding character. Recall that:

     smooo*th

matches

     smooth
     smoooth
     ...

Using a subexpression, we can apply * to a longer string:

     banan(an)*a

matches

     banana
     bananana
     banananana
     ...


Repeated Expressions

up: An Introduction to XML Regular Expressions
next: Optional Expressions
prev: Parenthesized Expressions

As previously explained, * is the repeat operator. It applies to the preceding character, character set, or subexpression. It indicates that the preceding element can be matched 0 or more times:

     bana(na)*

matches

     bana
     banana
     bananana
     banananana
     ...

+ is similar to * except that + requires the preceding element to be matched at least once. So on the one hand:

     bana(na)*

matches

     bana

but

     bana(na)+

does not. Both match

     banana
     bananana
     banananana
     ...


Optional Expressions

up: An Introduction to XML Regular Expressions
next: Counted Iteration
prev: Repeated Expressions

? indicates that the preceding character, character set, or subexpression is optional. It is permitted to match, or to be skipped:

     CSNY?

matches both

     CSN

and

     CSNY


Counted Iteration

up: An Introduction to XML Regular Expressions
next: Alternative Expressions
prev: Optional Expressions

The interval operator, {m,n} (where m and n are decimal integers and 0 <= m <= n ) indicates that the preceding character, character set, or subexpression must match at least m times and may match as many as n times.

For example:

     c([ad]){1,4}r

matches

     car
     cdr
     caar
     cdar
     ...
     caaar
     cdaar
     ...
     cadddr
     cddddr

but it doesn't match:

     cdddddr

because that has five d's, and the pattern only permits four, and it doesn't match

     cr

because the pattern requires at least one a or d .

A pattern of the form:

        R{M,}

matches M or more iterations of R .

A pattern of the form:

        R{M}

matches exactly 'M' iterations of 'R'.


Alternative Expressions

up: An Introduction to XML Regular Expressions
next: A Summary of Regular Expression Syntax
prev: Counted Iteration

An alternative is written:

     regular expression-1|regular expression-2|regular expression-3|...

It matches anything matched by some regular expression-n . For example:

     Crosby, Stills, (and Nash|Nash, and Young)

matches

     Crosby, Stills, and Nash

and also

     Crosby, Stills, Nash, and Young


A Summary of Regular Expression Syntax

up: An Introduction to XML Regular Expressions
prev: Alternative Expressions

In summary, regular expressions can be:

   `abcd'               - a string literal

   `\*'                 - an escaped special character

   `\n'                 - a single character escape

   `\s'                 - a multi-character escape

   `\p{isCherokee}'     - a category escape

   `.'                  - the universal character set.

   `[a-z_?]'            - a character set

   `[^abz_?]'           - an negated character set

   `[\p{L}-[\p{isBasicLatin}\p{isGreek}]]' -
                          character set subtraction

   `(subexp)'           - an expression grouped to form a
                          parenthesized subexpression. 

The following special characters and sequences can be applied to a character, character set, or subexpression:

   `*'                  - repeat the preceding element 0 
                          or more times.

   `+'                  - repeat the preceding element 1 
                          or more times.

   `?'                  - match the preceding element 0 
                          or 1 time.

   `{m,n}'              - match the preceding element at 
                          least `m', and as many as `n' times,
                          where 0 <= m <= n

   `{m,}'               - match the preceding element at 
                          least `m' times, where 0 <= m 

   `{m}'                - match the preceding element exactly
                          `m' times, where 0 <= m 

Finally, a list of patterns built from the preceeding operators can be combined using | :

   `regular expression-1|regular expression-2|..' 
                        - match regular expression-1 or regex-2 or ...

A special character, like . or * can be made into a literal by prefixing it with \ .

A special sequence, like + or ? can be made into a literal character by dropping the \ .

libhackerlab: The Hackerlab C Library
The Hackerlab at regexps.com