The Hackerlab at regexps.com

Tuning Posix Regexp and XML Regular Expression Performance


Some programs use Posix regexps or XML Regular Expressions heavily. The performance of pattern matching (both speed and memory use) may be important in such programs.

The Hackerlab C Library pattern matching functions afford programs the opportunity to make space-for-time trade-offs: you can minimize the amount of memory used for pattern matching at the cost of speed, or maximize speed, at the cost of memory. This chapter provides the details.


Tuning the NFA Cache Size



#include <hackerlab/rx/nfa-cache.h>

When Rx compiles a regexp or regular expression, it builds a tree structure that describes the syntax of the expression. Later, some or all of the tree is converted to a graph representing a non-deterministic finite automaton (NFA).

Rx maintains a cache so that whenever two expressions have equivalent tree structure, they are likely to share a single NFA. This cache speeds up the processing of regexps by avoiding redundant NFA construction.

Note that the NFA cache can preserve NFA beyond the lifetime of a single compiled expression. If an expression is compiled, then matched, then freed, then recompiled, the recompiled expression will sometimes re-use the NFA cached from the first compile.

During matching, NFA are incrementally converted to deterministic automata (DFA). Another cache is kept of DFA fragments (see Tuning the DFA Cache Size). Here again, the NFA cache speeds up processing: when a single NFA is re-used, the DFA cache is made more effective.

This chapter presents functions which are used to monitor and tune the performance of the NFA cache. Be sure to also read The Impact of NFA and DFA Cache Sizes.

The NFA Cache Replacement Strategy

NFA cache entries are approximately sorted from most to least recently used. When cache space is exhausted, the least recently used entries are discarded.

The Advisory NFA Cache Limit

The size of the NFA cache is regulated by an advisory limit called the cache threshold. The threshold is a size (expressed in bytes) which represents an ideal limit on the amount of memory used by the NFA cache.

If an allocation within the NFA cache would cause the total amount of memory used by the cache to exceed the threshold, Rx attempts to discard sufficient cache entries to avoid exceeding the threshold. This is not always possible. When necessary for correct operation, Rx will exceed the cache threshold: usually by a small amount; rarely by a large amount. (That is why the threshold is called an advisory limit.)

The default threshold is 1MB.
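The interaction between least-recently-used replacement and an advisory limit can be sketched with a toy model. This is not Rx's implementation: the `toy_` names, the fixed-size table, and the `pinned` flag (standing in for entries that are still needed for correct operation) are all assumptions made for illustration.

```c
#include <stddef.h>

#define TOY_MAX_ENTRIES 16

/* One cached item in the toy model (hypothetical, not an Rx type). */
struct toy_entry
{
  int live;             /* slot currently holds an entry */
  int pinned;           /* still needed; must not be discarded */
  size_t size;          /* bytes charged to the cache */
  unsigned long stamp;  /* larger means more recently used */
};

struct toy_cache
{
  size_t threshold;     /* advisory limit, in bytes */
  size_t in_use;        /* bytes currently charged */
  unsigned long clock;  /* source of recency stamps */
  struct toy_entry slot[TOY_MAX_ENTRIES];
};

/* Discard the least recently used unpinned entry.
 * Return 1 if an entry was discarded, 0 if none could be. */
static int
toy_evict_one (struct toy_cache * c)
{
  int i;
  int victim = -1;

  for (i = 0; i < TOY_MAX_ENTRIES; ++i)
    if (c->slot[i].live && !c->slot[i].pinned
        && (victim < 0 || c->slot[i].stamp < c->slot[victim].stamp))
      victim = i;
  if (victim < 0)
    return 0;
  c->in_use -= c->slot[victim].size;
  c->slot[victim].live = 0;
  return 1;
}

/* Mark entry I as just used. */
static void
toy_touch (struct toy_cache * c, int i)
{
  c->slot[i].stamp = ++c->clock;
}

/* Add an entry of SIZE bytes, discarding old entries first if the
 * threshold would be exceeded.  If nothing more can be discarded, the
 * entry is added anyway: the limit is only advisory.
 * Return the slot used, or -1 if the table itself is full. */
static int
toy_add (struct toy_cache * c, size_t size, int pinned)
{
  int i;

  while (c->in_use + size > c->threshold && toy_evict_one (c))
    ;
  for (i = 0; i < TOY_MAX_ENTRIES; ++i)
    if (!c->slot[i].live)
      {
        c->slot[i].live = 1;
        c->slot[i].pinned = pinned;
        c->slot[i].size = size;
        c->in_use += size;
        toy_touch (c, i);
        return i;
      }
  return -1;
}
```

In this model, adding entries that can all be discarded keeps `in_use` at or below `threshold`, while adding pinned entries can push `in_use` past it, which mirrors the behavior described above.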

Function rx_set_nfa_cache_threshold

void rx_set_nfa_cache_threshold (size_t n);

Set the advisory NFA cache limit to n.



Function rx_nfa_cache_threshold

size_t rx_nfa_cache_threshold (void);

Return the current NFA cache limit.
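A minimal sketch of using the two threshold functions together: a helper that installs a new limit and hands back the old one so that it can be restored later. The helper name `swap_nfa_cache_threshold` and the build switch `USE_HACKERLAB` are hypothetical. When the switch is not defined, small stand-ins with the documented signatures are used so that the sketch compiles without the library; the stand-ins only remember a number and do not model Rx.

```c
#include <stddef.h>

#ifdef USE_HACKERLAB            /* hypothetical build switch */
#include <hackerlab/rx/nfa-cache.h>
#else
/* Stand-ins with the documented signatures (not the real Rx code). */
static size_t stub_threshold = 1024 * 1024;   /* documented default: 1MB */
static void rx_set_nfa_cache_threshold (size_t n) { stub_threshold = n; }
static size_t rx_nfa_cache_threshold (void) { return stub_threshold; }
#endif

/* Set the advisory NFA cache limit to N bytes and return the limit
 * that was in effect before the call. */
static size_t
swap_nfa_cache_threshold (size_t n)
{
  size_t old = rx_nfa_cache_threshold ();

  rx_set_nfa_cache_threshold (n);
  return old;
}
```

A caller might raise the limit around a burst of matching and then call `swap_nfa_cache_threshold (old)` to restore the earlier value.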



NFA Cache Statistics

These functions report statistics about the NFA cache.

Function rx_nfa_cache_in_use

size_t rx_nfa_cache_in_use (void);

Return the amount of memory currently in use by the NFA cache.



Function rx_nfa_cache_high_water_mark

size_t rx_nfa_cache_high_water_mark (void);

Return the largest amount of memory ever used at one time by the NFA cache.



Function rx_nfa_cache_statistics

void rx_nfa_cache_statistics (size_t * threshold,
                              size_t * ign,
                              size_t * in_use,
                              size_t * high_water_mark,
                              int * hits,
                              int * misses,
                              int * ign2);

Return statistics about the effectiveness of the NFA cache.

All parameters are used to return values. Any parameter may be 0.

threshold returns the NFA cache threshold.

ign is reserved for future use and should be ignored.

in_use returns the number of bytes currently used by the NFA cache.

high_water_mark returns the largest number of bytes ever used by the NFA cache.

hits returns the number of cache hits that have occurred within the NFA cache.

misses returns the number of cache misses that have occurred within the NFA cache.

ign2 is reserved for future use and should be ignored.
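The values returned through these parameters are easier to work with when gathered into one record. The sketch below assumes a hypothetical `struct cache_report` and two hypothetical helpers; none of them are part of Rx. Since any parameter may be 0, the record could plausibly be filled with a call such as `rx_nfa_cache_statistics (&r.threshold, 0, &r.in_use, &r.high_water_mark, &r.hits, &r.misses, 0)`.

```c
#include <stddef.h>

/* Hypothetical record for the values the statistics function returns. */
struct cache_report
{
  size_t threshold;
  size_t in_use;
  size_t high_water_mark;
  int hits;
  int misses;
};

/* Fraction of lookups that were hits, between 0 and 1.
 * Return -1 if no lookups have been recorded yet. */
static double
cache_hit_fraction (const struct cache_report * r)
{
  long total = (long) r->hits + (long) r->misses;

  if (total <= 0)
    return -1.0;
  return (double) r->hits / (double) total;
}

/* Return 1 if the cache has ever grown past its advisory threshold. */
static int
cache_exceeded_threshold (const struct cache_report * r)
{
  return r->high_water_mark > r->threshold;
}
```

A high-water mark above the threshold is not an error; it only shows that the advisory limit was exceeded at some point.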



Flushing the NFA Cache

Function rx_flush_nfa_cache

size_t rx_flush_nfa_cache (void);

Attempt to flush all entries from the NFA cache. If there exist compiled regexps (that have not been freed), it may not be possible to entirely empty the NFA cache.

Return the number of bytes still allocated to the NFA cache after the flush.
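Because the flush may be incomplete, callers may want to check the return value. The following is a small sketch; the helper name `flush_nfa_cache_fully` and the build switch `USE_HACKERLAB` are hypothetical, and the stand-in used when the switch is undefined simply pretends that some bytes cannot be released.

```c
#include <stddef.h>

#ifdef USE_HACKERLAB            /* hypothetical build switch */
#include <hackerlab/rx/nfa-cache.h>
#else
/* Stand-in with the documented signature (not the real Rx code). */
static size_t stub_unflushable = 512;
static size_t rx_flush_nfa_cache (void) { return stub_unflushable; }
#endif

/* Flush the NFA cache.  Return 1 if it is now empty, 0 otherwise.
 * If LEFTOVER is not 0, store the number of bytes still allocated. */
static int
flush_nfa_cache_fully (size_t * leftover)
{
  size_t remaining = rx_flush_nfa_cache ();

  if (leftover)
    *leftover = remaining;
  return remaining == 0;
}
```

A result of 0 suggests that compiled regexps are still alive somewhere in the program.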




Tuning the DFA Cache Size



#include <hackerlab/rx/dfa-cache.h>

When Rx compiles a regexp or regular expression, it builds a tree structure that describes the syntax of the expression. Later, some or all of the tree is converted to a graph representing a non-deterministic finite automaton (NFA). During matching, NFA are incrementally converted to deterministic finite automata (DFA).

Rx maintains a cache of DFA fragments. When part of a DFA is needed, the cache is used to avoid redundant construction. Because DFA can be quite large, DFA fragments are sometimes flushed from the cache to make room.

Note that the DFA cache can preserve DFA fragments beyond the lifetime of a single compiled expression. If an expression is compiled, then matched, then freed, then recompiled, the recompiled expression will sometimes re-use DFA fragments cached from the first compile.

This chapter presents functions which are used to monitor and tune the performance of the DFA cache. Be sure to also read The Impact of NFA and DFA Cache Sizes.

The DFA Cache Replacement Strategy

DFA cache entries are approximately sorted from most to least recently used. When cache space is exhausted, the least recently used entries are discarded.

The Advisory DFA Cache Limit

The size of the DFA cache is regulated by an advisory limit called the cache threshold. The threshold is a size (expressed in bytes) which represents an ideal limit on the amount of memory used by the DFA cache.

If an allocation within the DFA cache would cause the total amount of memory used by the cache to exceed the threshold, Rx attempts to discard sufficient cache entries to avoid exceeding the threshold. This is not always possible. When necessary for correct operation, Rx will exceed the cache threshold: usually by a small amount; sometimes by a large amount. (That is why the threshold is called an advisory limit.)

The default threshold is 1MB.

Function rx_set_dfa_cache_threshold

void rx_set_dfa_cache_threshold (size_t n);

Set the advisory DFA cache limit to n.



Function rx_dfa_cache_threshold

size_t rx_dfa_cache_threshold (void);

Return the current DFA cache limit.



DFA Cache Statistics

These functions report statistics about the DFA cache.

Function rx_dfa_cache_in_use

size_t rx_dfa_cache_in_use (void);

Return the amount of memory currently in use by the DFA cache.



Function rx_dfa_cache_high_water_mark

size_t rx_dfa_cache_high_water_mark (void);

Return the largest amount of memory ever used at one time by the DFA cache.



Function rx_dfa_cache_statistics

void rx_dfa_cache_statistics (size_t * threshold,
                              size_t * ign,
                              size_t * in_use,
                              size_t * high_water_mark,
                              int * hits,
                              int * misses,
                              int * total_hits,
                              int * total_misses);

Return statistics about the effectiveness of the DFA cache.

All parameters are used to return values. Any parameter may be 0.

threshold returns the DFA cache threshold.

ign is reserved for future use and should be ignored.

in_use returns the number of bytes currently used by the DFA cache.

high_water_mark returns the largest number of bytes ever used by the DFA cache.

hits returns an indication of the number of cache hits that have occurred within the DFA cache. (See below.)

misses returns an indication of the number of cache misses that have occurred within the DFA cache. (See below.)

Note: The values returned in hits and misses are scaled to give greater weight to recent cache activity, and reduced weight to older cache activity. It is the ratio of hits to misses, not their absolute values, that is interesting.

total_hits returns the exact number of cache hits that have occurred within the DFA cache over the lifetime of the process.

total_misses returns the exact number of cache misses that have occurred within the DFA cache over the lifetime of the process.
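The difference between the scaled counters and the exact totals can be illustrated with a toy scheme. How Rx actually scales its counters is not described here, so the halving rule below, the `DECAY_LIMIT` constant, and all names are assumptions chosen only to show why the ratio matters more than the absolute values.

```c
/* Hypothetical pair of counter sets: one scaled, one exact. */
struct decayed_counts
{
  long hits;          /* scaled: favors recent activity */
  long misses;        /* scaled: favors recent activity */
  long total_hits;    /* exact, never scaled */
  long total_misses;  /* exact, never scaled */
};

#define DECAY_LIMIT 1024

/* Record one cache lookup.  Whenever the scaled counters grow past
 * DECAY_LIMIT, both are halved, so older activity counts for less
 * while the ratio between them is roughly preserved. */
static void
record_lookup (struct decayed_counts * c, int was_hit)
{
  if (was_hit)
    {
      ++c->hits;
      ++c->total_hits;
    }
  else
    {
      ++c->misses;
      ++c->total_misses;
    }
  if (c->hits + c->misses > DECAY_LIMIT)
    {
      c->hits /= 2;
      c->misses /= 2;
    }
}
```

After a long run of misses followed by a shorter run of hits, the exact totals in this model still show more misses than hits, while the scaled counters show more hits than misses, reflecting the recent improvement.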



Flushing the DFA Cache

Function rx_flush_dfa_cache

size_t rx_flush_dfa_cache (void);

Attempt to flush all entries from the DFA cache. If there exist locked DFA states, it may not be possible to entirely empty the DFA cache. (It is not possible to create locked DFA states using only the portion of the interface to Rx that is currently documented.)

Return the number of bytes still allocated to the DFA cache after the flush.




The Impact of NFA and DFA Cache Sizes


This chapter discusses strategies for choosing sizes for the NFA and DFA caches (see Tuning the NFA Cache Size and Tuning the DFA Cache Size).

What Do Caches Affect?

Both caches predominantly affect the speed of regexp matching functions such as the Posix function regexec or the Unicode function rx_xml_is_match. Both NFA and DFA construction take place during matching, not during compiling.

What Determines Cache Effectiveness?

The variety of expressions matched and the frequency of their use determine the effectiveness of both caches. An expression that is frequently re-used will tend to remain in the caches. An expression that is infrequently used will tend to be flushed from the caches.

Thrashing can occur if no expression is re-used frequently enough to remain in the caches, even though some expressions are re-used many times. Future releases of Rx will contain improved caching strategies to reduce the likelihood of this kind of thrashing.

Thrashing can also occur if either or both of the caches is too small.

DFA cache entries are keyed by sets of NFA states. Thus, the effectiveness of the DFA cache is limited by the effectiveness of the NFA cache. If an NFA is flushed from the NFA cache, all cached DFA fragments for that NFA become useless (and are eventually flushed from the DFA cache).

NFA cache entries are keyed on syntax trees for patterns and sub-patterns. Thus, the effectiveness of the NFA cache is limited by the frequency with which patterns (and sub-patterns) having equivalent syntax trees are re-used.

If the same pattern is compiled twice, yielding distinct but equivalent syntax trees, both compilations will re-use the same NFA cache entries. Nevertheless, in some applications performance can be improved by avoiding recompilation (keeping a separate cache of compiled expressions) in order to avoid the cost of redundantly parsing expressions and building syntax trees.
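A separate cache of compiled expressions might look like the following sketch. It uses only the standard Posix calls regcomp and regfree as declared in the system <regex.h>; which header supplies them in a Hackerlab build is not shown here and is left as an assumption. The table size, the `pattern_` names, and the `cached_regcomp` helper are hypothetical.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define PATTERN_CACHE_SLOTS 8

/* Hypothetical application-level cache of compiled expressions. */
struct pattern_slot
{
  char * source;        /* pattern text, or 0 if the slot is empty */
  regex_t compiled;     /* valid only when source is not 0 */
  unsigned long stamp;  /* larger means more recently used; 0 if empty */
};

static struct pattern_slot pattern_cache[PATTERN_CACHE_SLOTS];
static unsigned long pattern_clock;

/* Return a compiled form of PATTERN, compiling it only if it is not
 * already cached.  Return 0 on error.  The result belongs to the cache
 * and stays valid until its slot is reused for another pattern. */
static regex_t *
cached_regcomp (const char * pattern)
{
  int i;
  int victim = 0;
  char * copy;

  for (i = 0; i < PATTERN_CACHE_SLOTS; ++i)
    {
      if (pattern_cache[i].source && !strcmp (pattern_cache[i].source, pattern))
        {
          pattern_cache[i].stamp = ++pattern_clock;
          return &pattern_cache[i].compiled;
        }
      if (pattern_cache[i].stamp < pattern_cache[victim].stamp)
        victim = i;
    }

  copy = malloc (strlen (pattern) + 1);
  if (!copy)
    return 0;
  strcpy (copy, pattern);

  /* Reuse the least recently used slot; empty slots sort first. */
  if (pattern_cache[victim].source)
    {
      regfree (&pattern_cache[victim].compiled);
      free (pattern_cache[victim].source);
      pattern_cache[victim].source = 0;
      pattern_cache[victim].stamp = 0;
    }
  if (regcomp (&pattern_cache[victim].compiled, pattern,
               REG_EXTENDED | REG_NOSUB))
    {
      free (copy);
      return 0;
    }
  pattern_cache[victim].source = copy;
  pattern_cache[victim].stamp = ++pattern_clock;
  return &pattern_cache[victim].compiled;
}
```

Asking for the same pattern text twice returns the same compiled expression, so the pattern is parsed only once. The sketch is not thread-safe and always compiles with the same flags; a real application would need to decide both points for itself.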

Choosing Cache Sizes

Choosing cache sizes can be tricky: two values (the cache sizes), whose effects are not independent, must be determined.

Complicating matters further, cache usage is heavily dependent on the particular expressions compiled and matches performed, and the order in which those compilations and matches occur. There is no mathematically simple relation between cache size and overall performance.

Caches may be too small, which leads to expensive cache misses, or too large, which leads to wasted memory. Both conditions can be detected by varying the cache sizes on successive test runs and observing the values returned by rx_nfa_cache_statistics and rx_dfa_cache_statistics. If lowering a cache size has little effect on the ratio of cache hits to cache misses, the cache was too large. If raising a cache size increases the hit/miss ratio, the cache was too small. Once again, note that improving the hit/miss ratio for the NFA cache may, as a side effect, improve the hit/miss ratio for the DFA cache.
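The comparison between two test runs can be written down as a small decision function. This is only a sketch: the `struct run` record, the `judge_cache_size` helper, and the tolerance parameter are hypothetical. It compares hit fractions (hits divided by hits plus misses), which rise and fall together with the hit/miss ratio but avoid dividing by zero when a run has no misses.

```c
#include <stddef.h>

enum size_verdict { SIZE_TOO_SMALL, SIZE_TOO_LARGE, SIZE_INCONCLUSIVE };

/* Hypothetical record of one test run at a given cache size. */
struct run
{
  size_t cache_size;   /* threshold used for the run, in bytes */
  long hits;
  long misses;
};

/* Fraction of lookups that were hits; 0 if there were no lookups. */
static double
hit_fraction (const struct run * r)
{
  long total = r->hits + r->misses;

  return total > 0 ? (double) r->hits / (double) total : 0.0;
}

/* Judge the cache size used in CURRENT by comparing it with OTHER,
 * a run of the same test at a different size.  Differences smaller
 * than TOLERANCE are treated as "little effect". */
static enum size_verdict
judge_cache_size (const struct run * current, const struct run * other,
                  double tolerance)
{
  double cur = hit_fraction (current);
  double oth = hit_fraction (other);

  if (other->cache_size > current->cache_size && oth > cur + tolerance)
    return SIZE_TOO_SMALL;      /* raising the size helped */
  if (other->cache_size < current->cache_size && oth >= cur - tolerance)
    return SIZE_TOO_LARGE;      /* lowering the size cost little */
  return SIZE_INCONCLUSIVE;
}
```

What counts as a significant change depends on the application; the tolerance is left to the caller for that reason.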

One possible strategy for choosing cache sizes is to simply accept the default (1MB for each cache). For many applications, the default will yield acceptable performance.

Another possible strategy is as follows:

Choose a Maximum

Decide on a total amount of memory that you can afford to dedicate to both caches. Having made that decision, you must then decide how to divide that memory between the NFA and DFA caches. Unfortunately, there is no simple answer which is guaranteed to be optimal: the best ratio of NFA cache size to DFA cache size depends on the regexp usage patterns of your particular application.

Determine the Ratio of NFA Cache Size to DFA Cache Size

You can simply divide memory equally between the two caches, or you can experiment. If you decide to experiment, make sure that you can run repeatable tests which use regexps in a characteristic way. If your application uses a fixed set of regexps in a regular way, designing tests will be easy. If your application allows users to choose the regexps and how they are used, designing tests will be a challenge.

Run your test programs repeatedly, varying the ratio of NFA cache size to DFA cache size. Compare the ratios of hits to misses in both caches and the overall throughput of your application. Presumably your goal is to optimize overall throughput; the hit/miss ratios will help you understand how the cache size ratio affects the behavior of Rx.

Optionally, Minimize the Cache Sizes

After choosing a ratio of NFA to DFA cache sizes, you have the opportunity to reduce the size of the caches. Run your test program, reducing the size of one cache at a time, until the hit/miss ratio for that cache worsens significantly. Then, choose the smallest cache size that does not adversely affect the hit/miss ratio. Done correctly, this should have no adverse effect on the throughput of your program.
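The arithmetic behind this strategy can be sketched in two small helpers. Both are hypothetical and independent of Rx: `split_cache_budget` divides a total budget between the two caches (the results would be passed to rx_set_nfa_cache_threshold and rx_set_dfa_cache_threshold), and `smallest_adequate_size` picks a reduced size from measurements the caller has already collected. The `struct trial` record and the tolerance parameter are assumptions.

```c
#include <stddef.h>

/* Divide TOTAL bytes between the caches: NFA_PERCENT percent to the
 * NFA cache and the remainder to the DFA cache.
 * Return 0 on success, -1 if NFA_PERCENT is greater than 100. */
static int
split_cache_budget (size_t total, unsigned nfa_percent,
                    size_t * nfa, size_t * dfa)
{
  if (nfa_percent > 100)
    return -1;
  /* Computed in two parts so that TOTAL * NFA_PERCENT cannot overflow. */
  *nfa = total / 100 * nfa_percent + total % 100 * nfa_percent / 100;
  *dfa = total - *nfa;
  return 0;
}

/* Hypothetical measurement: the hit fraction seen at one cache size. */
struct trial
{
  size_t cache_size;    /* bytes */
  double hit_fraction;  /* hits / (hits + misses) for that run */
};

/* TRIALS holds N measurements ordered from the largest cache size to
 * the smallest; trials[0] is the baseline.  Return the smallest size
 * reached before the hit fraction first drops more than TOLERANCE
 * below the baseline.  Return 0 if N is 0. */
static size_t
smallest_adequate_size (const struct trial * trials, size_t n,
                        double tolerance)
{
  size_t i;
  size_t best;

  if (n == 0)
    return 0;
  best = trials[0].cache_size;
  for (i = 1; i < n; ++i)
    {
      if (trials[i].hit_fraction < trials[0].hit_fraction - tolerance)
        break;
      best = trials[i].cache_size;
    }
  return best;
}
```

A test driver might loop over several values of `nfa_percent`, rerun the same workload for each split, and keep the split with the best throughput before applying `smallest_adequate_size` to each cache in turn.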

libhackerlab: The Hackerlab C Library