The Hackerlab at regexps.com
Data Sheet for the Hackerlab Unicode Database
Package:
Hackerlab Unicode Database (the Unicode Character Database in C)
Supplier:
regexps.com
Function:
The Hackerlab Unicode Database provides a C interface
to data taken from the Unicode Character Database
published by the Unicode Consortium.
Key Features:
Space and time efficient.
Accurate (based on comparisons to the Unicode 3.0 databases).
Clean and simple "classic C" interface.
Provides:
assigned test -- is a code point assigned?
general category
decimal digit value
bidirectional category
bidirectional mirroring property
canonical combining class
default case mappings
character decomposition mappings
Unicode blocks
general category bitsets
Easy to upgrade: most data is automatically extracted from
the standard file ``unidata.txt'' published by the
Unicode Consortium.
Ready for Unicode 3.1: Database data structures are designed
for a character set with 2^21 code points.
Validation tests are included.
PostScript and HTML documentation is included.
Variable depth trie structures for the database, selectable
at compile time. (A space/time trade-off.)
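As a rough illustration of that space/time trade-off (not the
library's actual tables), the sketch below stores a one-byte
property for 2^21 code points behind either a two-level or a
three-level index. All names, including the `DEMO_TRIE_DEPTH'
switch, are hypothetical, and the demo data marks only the ASCII
digits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the Hackerlab API. */
#ifndef DEMO_TRIE_DEPTH
#define DEMO_TRIE_DEPTH 2       /* 2 = fewer look-ups, 3 = smaller index */
#endif

#define DEMO_CODE_POINTS (1UL << 21)

/* Leaf pages of 256 one-byte property values.  Page 0 is the shared
   "all default" page; page 1 covers U+0000..U+00FF. */
static const uint8_t demo_pages[2][256] = {
  [1] = { ['0'] = 1, ['1'] = 1, ['2'] = 1, ['3'] = 1, ['4'] = 1,
          ['5'] = 1, ['6'] = 1, ['7'] = 1, ['8'] = 1, ['9'] = 1 },
};

/* Depth 2: one index entry per page, 2^21 / 256 = 8192 entries. */
static const uint8_t demo_index2[DEMO_CODE_POINTS >> 8] = { [0] = 1 };

/* Depth 3: a 128-entry top index selects a 64-entry middle block,
   which selects a leaf page.  Unused ranges share middle block 0. */
static const uint8_t demo_top3[DEMO_CODE_POINTS >> 14] = { [0] = 1 };
static const uint8_t demo_mid3[2][64] = { [1] = { [0] = 1 } };

int
demo_lookup_depth2 (uint32_t c)
{
  assert (c < DEMO_CODE_POINTS);
  return demo_pages[demo_index2[c >> 8]][c & 0xff];
}

int
demo_lookup_depth3 (uint32_t c)
{
  unsigned block, page;

  assert (c < DEMO_CODE_POINTS);
  block = demo_top3[c >> 14];
  page = demo_mid3[block][(c >> 8) & 0x3f];
  return demo_pages[page][c & 0xff];
}

/* The depth is fixed when the program is compiled. */
#if DEMO_TRIE_DEPTH == 2
#define demo_lookup(c) demo_lookup_depth2 (c)
#else
#define demo_lookup(c) demo_lookup_depth3 (c)
#endif
```

In this demo, depth 2 costs two array look-ups per query and an
8192-byte index; depth 3 costs three look-ups, but its two index
levels together occupy 256 bytes.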
Licensing:
The Hackerlab Unicode Database is part of the Hackerlab C
Library, which is distributed under the terms of the GNU
General Public License, Version 2, as published by the Free
Software Foundation.
Alternative licenses (such as licenses permitting binary-only
re-distribution) can be purchased from regexps.com.
Prerequisites:
standard C compiler
POSIX libc.a (standard C library)
GNU Make
Recommended and Disrecommended Applications:
Recommended for all applications needing access to Unicode
Character Database properties.
Limitations:
This is the first public release. The database has been
carefully tested but is not in widespread use.
Some important features, which are likely to be added in
future releases, are not present in this release. These
include:
locale-sensitive case mappings
pre-computed normalization forms
Contribution to Executable Size:
On a Pentium-based machine, using gcc (egcs version
2.91.66), we compiled this simple (nonsense) program and
linked it against the Hackerlab C Library:
    #include <hackerlab/unicode/unicode.h>

    int
    main (int argc, char * argv[])
    {
      t_unicode c;
      int value;

      c = argv[0][0];
      value = unidata_is_assigned_code_point (c);
      value += (int)unidata_general_category (c);
      value += (int)unidata_decimal_digit_value (c);
      value += (int)unidata_bidi_category (c);
      value += (int)unidata_is_mirrored (c);
      value += (int)unidata_canonical_combining_class (c);
      value += (int)unidata_to_upper (c);
      value += (int)unidata_to_lower (c);
      value += (int)unidata_to_title (c);
      value +=
        (int)unidata_character_decomposition_mapping (c)->type;
      value += (int)uni_blocks[0].start;
      value +=
        bits_population
          (uni_general_category_bitset
            (uni_general_category_Sm));
      return value;
    }
Both the library and the program were compiled with the "-g"
option.
The program forces linkage with all of the Hackerlab Unicode
Database.
Total executable size:
text data
43120 221064
The following list of object files from the Hackerlab
C library were linked into the executable:
alloc-limits.o bits.o bitset-lookup.o bitset-tree.o
bitset.o bitsets.o blocks.o case-db-inlines.o case-db.o
char-class.o combine-db.o cvt.o db-inlines.o db.o
decomp-db.o mem.o must-malloc.o panic-exit.o panic.o str.o
The contribution of those files to the executable size is:
text data
41541 218056
Those sizes represent maximums: they are for a program
that accesses all of the information in the database,
and that uses both the general category mapping and
precomputed bitsets for each general category.
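To show what a per-category bitset adds on top of a
code-point-to-category mapping, here is a minimal sketch of a
bitset with a membership test and a population count. It is not
the library's bitset implementation: the `demo_' names are
hypothetical, and the set is limited to U+0000..U+007F so that
it can be filled in by hand with the ASCII members of general
category Sm.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_UNIVERSE 128u      /* demo covers U+0000..U+007F only */
#define DEMO_WORD_BITS (sizeof (unsigned long) * CHAR_BIT)
#define DEMO_WORDS ((DEMO_UNIVERSE + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

typedef struct
{
  unsigned long words[DEMO_WORDS];
} demo_bitset;

void
demo_bitset_add (demo_bitset *set, unsigned c)
{
  assert (c < DEMO_UNIVERSE);
  set->words[c / DEMO_WORD_BITS] |= 1UL << (c % DEMO_WORD_BITS);
}

/* Membership is one word fetch, a shift, and a mask. */
int
demo_bitset_has (const demo_bitset *set, unsigned c)
{
  if (c >= DEMO_UNIVERSE)
    return 0;                   /* outside the demo's range */
  return (int)((set->words[c / DEMO_WORD_BITS] >> (c % DEMO_WORD_BITS)) & 1UL);
}

/* Count the members of the set. */
size_t
demo_bitset_population (const demo_bitset *set)
{
  size_t count = 0;
  size_t i;

  for (i = 0; i < DEMO_WORDS; ++i)
    {
      unsigned long w = set->words[i];

      while (w)
        {
          w &= w - 1;           /* clear the lowest set bit */
          ++count;
        }
    }
  return count;
}

/* The ASCII members of general category Sm (math symbols). */
demo_bitset
demo_ascii_sm (void)
{
  static const char members[] = "+<=>|~";
  demo_bitset set = { { 0 } };
  size_t i;

  for (i = 0; members[i]; ++i)
    demo_bitset_add (&set, (unsigned char) members[i]);
  return set;
}
```

A set such as this answers "is c in category Sm?" without
consulting the category mapping at all, at the cost of storing
one bit per code point per category.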
A realistic scenario is a program which does not
use the bidi properties, the decomposition mapping,
or the category bitsets. To measure size in that
scenario, we compiled this test program:
    #include <hackerlab/unicode/unicode.h>

    int
    main (int argc, char * argv[])
    {
      t_unicode c;
      int value;

      c = argv[0][0];
      value = unidata_is_assigned_code_point (c);
      value += (int)unidata_general_category (c);
      value += (int)unidata_decimal_digit_value (c);
      value += (int)unidata_canonical_combining_class (c);
      value += (int)unidata_to_upper (c);
      value += (int)unidata_to_lower (c);
      value += (int)unidata_to_title (c);
      value += (int)uni_blocks[0].start;
      return value;
    }
and obtained the total executable size:
text data
13864 104980
Sizes may differ slightly in the latest release of the
Hackerlab Unicode Database. Sizes will obviously vary with
platform, compiler, and compiler options.
External Linkage Dependencies:
When compiled under FreeBSD 3.0, the simple program used to
test executable sizes depends on the following symbols
defined in "libc":
write
malloc
realloc
free
_exit
Accuracy:
We test the accuracy of the Hackerlab Unicode database by
using it to print a text file containing a subset of the
fields present in "unidata.txt". We then extract those fields
from "unidata.txt" itself using a sed script and compare the
resulting files.
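The comparison step can be sketched in C. This is not the test
suite shipped with the library: it assumes that "unidata.txt"
uses the semicolon-separated record layout of the Unicode
Consortium's UnicodeData.txt, it replaces the sed script with a
small field extractor, and the `demo_' names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy field number `index' (0-based, semicolon-separated) of `line'
   into `out'.  Returns 0 on success, -1 if the field is missing or
   does not fit.  `line' must not end in a newline. */
int
demo_field (const char *line, int index, char *out, size_t out_size)
{
  const char *start = line;
  const char *end;
  size_t len;

  assert (line && out && out_size > 0);
  while (index-- > 0)
    {
      start = strchr (start, ';');
      if (!start)
        return -1;
      ++start;
    }
  end = strchr (start, ';');
  len = end ? (size_t)(end - start) : strlen (start);
  if (len >= out_size)
    return -1;
  memcpy (out, start, len);
  out[len] = '\0';
  return 0;
}

/* Reduce a full record to "code;category;combining-class", the
   shape a database dump could print for comparison. */
int
demo_reduce (const char *line, char *out, size_t out_size)
{
  char code[16], category[8], combining[8];
  int n;

  if (demo_field (line, 0, code, sizeof code)
      || demo_field (line, 2, category, sizeof category)
      || demo_field (line, 3, combining, sizeof combining))
    return -1;
  n = snprintf (out, out_size, "%s;%s;%s", code, category, combining);
  return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

A test driver would call `demo_reduce' on each record of the
data file, print the same three fields from the database for the
same code point, and report any line where the two strings
differ.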
There are no known bugs in the Hackerlab Unicode database.
Execution Speed:
All of the functions that access the Hackerlab Unicode
Database are very fast.
To give a quantitative sense of what "very fast" means, we have
measured the size of some functions in instructions, as
compiled by gcc (egcs version 2.91.66) for a Pentium
architecture machine with the compiler option `-g'. None of
the functions contain loops.
    function                          instruction count
    ---------------------------------------------------
    unidata_is_assigned_code_point    28
    unidata_general_category          29
    unidata_decimal_digit_value       41
    unidata_bidi_category             31
    unidata_is_mirrored               31
    unidata_to_upper                  56
    unidata_to_lower                  56
    unidata_to_title                  56
Most of those functions perform two array look-ups and a
number of bit-manipulation operations. The three
case-conversion functions perform five array look-ups
and a number of bit-manipulation operations.
`unidata_decimal_digit_value' performs a conditional test.
Instruction counts include function prologues and epilogues.
For programs compiled with GCC, those functions are also
available as inlinable functions.
`unidata_character_decomposition_mapping' is a macro which
performs five array look-ups and a number of bit-manipulation
operations.
These instruction counts may, obviously, vary with choice of
compiler, compiler options, and platform. Cited
implementation details may change in future releases.
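The access pattern described in this section (two array
look-ups, some bit manipulation, and one conditional test for
the digit value) can be sketched as follows. The packed record
layout, the `demo_' names, and the -1 result for non-digits are
all assumptions made for illustration; the library's real tables
and return conventions may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packed record, NOT the library's real layout:
     bits 0-4  general category number
     bit  5    mirrored flag
     bits 6-9  decimal digit value plus one (0 means "not a digit") */
#define DEMO_PACK(cat, mirrored, digit) \
  ((uint16_t)((cat) | ((mirrored) << 5) | (((digit) + 1) << 6)))

enum { DEMO_CAT_OTHER = 0, DEMO_CAT_ND = 1, DEMO_CAT_PS = 2 };

/* Page 0 is the shared default page; page 1 covers U+0000..U+00FF.
   Only two sample characters are filled in. */
static const uint16_t demo_records[2][256] = {
  [1] = { ['('] = DEMO_PACK (DEMO_CAT_PS, 1, -1),
          ['7'] = DEMO_PACK (DEMO_CAT_ND, 0, 7) },
};

/* One page number per block of 256 code points, for 2^21 code points. */
static const uint8_t demo_page_of[(1UL << 21) >> 8] = { [0] = 1 };

/* The two array look-ups shared by every query; no loops. */
static uint16_t
demo_record (uint32_t c)
{
  assert (c < (1UL << 21));
  return demo_records[demo_page_of[c >> 8]][c & 0xff];
}

int
demo_general_category (uint32_t c)
{
  return demo_record (c) & 0x1f;
}

int
demo_is_mirrored (uint32_t c)
{
  return (demo_record (c) >> 5) & 1;
}

int
demo_decimal_digit_value (uint32_t c)
{
  int field = (demo_record (c) >> 6) & 0xf;

  return field ? field - 1 : -1;        /* the one conditional test */
}
```

Because each query is a fixed sequence of shifts, masks, and
indexed loads, functions of this shape are natural candidates
for inlining.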
Run-time Allocation Requirements:
The Hackerlab Unicode database does not perform a significant
amount of memory allocation at run-time.
Support:
To purchase an alternative license, request additional
features, or obtain any kind of support, you can
contact us at "hackerlab@regexps.com" or via our web site
"www.regexps.com". We can also be reached at (412) 401-5204.
We are currently in the midst of relocating from Pennsylvania
to California, so at this time, we have no reliable postal
address.
Available support includes (but is not limited to):
- porting assistance
- customized extensions
- consultation concerning regexp-intensive applications
- bug fixing
In most cases, support is offered as a commercial
service.