regexps.com
This chapter describes the foundation of support for the Unicode character set in the Hackerlab C library.
This chapter is not a tutorial introduction to Unicode. We presume that readers are already somewhat familiar with Unicode. A very brief introduction can be found in An Absurdly Brief Introduction to Unicode.
enum uni_encoding_schemes;
Values of the enumerated type uni_encoding_schemes are used in interfaces throughout the Hackerlab C library to identify encoding schemes for strings or streams of Unicode characters. (See An Absurdly Brief Introduction to Unicode.)
enum uni_encoding_schemes
{
  uni_iso8859_1,
  uni_utf8,
  uni_utf16,
  uni_utf16be,
  uni_utf16le,
};
uni_iso8859_1 refers to a degenerate encoding scheme. Each character is stored in one byte. Only characters in the range U+0000 .. U+00FF can be represented.
uni_utf8 refers to the UTF-8 encoding scheme.
uni_utf16 refers to UTF-16 in the native byte order of the machine.
uni_utf16be refers to UTF-16, explicitly in big-endian order.
uni_utf16le refers to UTF-16, explicitly in little-endian order.
Some low-level functions in the Hackerlab C library work with any of these five encodings. Higher-level functions work only with uni_iso8859_1, uni_utf8, and uni_utf16.
Code units in a uni_utf8 string are of type t_uchar (unsigned 8-bit integer). Code units in a uni_utf16 string are of type t_unichar (unsigned 16-bit integer). Unicode code points are of type t_unicode. (See Machine-Specific Definitions.)
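As a rough illustration of how the encoding scheme determines the code unit size, the sketch below maps each scheme to the size of one code unit in bytes. The typedefs and the enum are local stand-ins that mirror the declarations described above so the fragment compiles on its own; the helper example_code_unit_size is hypothetical and is not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles alone. Real programs would
 * get these definitions from the Hackerlab headers instead. */
typedef uint8_t  t_uchar;    /* UTF-8 or ISO 8859-1 code unit */
typedef uint16_t t_unichar;  /* UTF-16 code unit */

enum uni_encoding_schemes
{
  uni_iso8859_1,
  uni_utf8,
  uni_utf16,
  uni_utf16be,
  uni_utf16le
};

/* Hypothetical helper: size in bytes of one code unit in ENCODING. */
size_t
example_code_unit_size (enum uni_encoding_schemes encoding)
{
  switch (encoding)
    {
    case uni_iso8859_1:
    case uni_utf8:
      return sizeof (t_uchar);
    case uni_utf16:
    case uni_utf16be:
    case uni_utf16le:
      return sizeof (t_unichar);
    }
  assert (0);  /* not reached for a valid encoding scheme */
  return 0;
}
```

All three UTF-16 variants share the same code unit type; they differ only in the byte order of each unit.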
The Hackerlab C Library is designed to operate correctly for programs which internally use any combination of the encodings iso8859-1, utf-8, and utf-16. (Future releases are likely to add support for utf-32.)
typedef struct uni__undefined_struct * uni_string;
The type uni_string is a pointer to a value of unknown size. It is used to represent the address of a Unicode string or an address within a Unicode string.
Any two uni_string pointers may be compared for equality. uni_string pointers within a single string may be compared using any relational operator (<, >, etc.).
uni_string pointers are created from UTF-8 pointers (t_uchar *) and from UTF-16 pointers (t_unichar *) by means of a cast:
  uni_string s = (uni_string)utf_8_string;
  uni_string t = (uni_string)utf_16_string;
By convention, all functions that operate on Unicode strings accept two parameters for each string, an encoding form and a string pointer, as in this function declaration:
void uni_fn (enum uni_encoding_scheme encoding, uni_string s);
By convention, the length of a Unicode string is always measured in code units, no matter what the size of those code units. Integer string indexes are also measured in code units.
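The following sketch shows what measuring length in code units means in practice. It assumes, for the sake of the example only, that strings are terminated by a zero code unit; the type definitions are local stand-ins for the library's, and example_length_in_code_units is a hypothetical helper, not a library function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring the declarations in the text. */
typedef uint8_t  t_uchar;
typedef uint16_t t_unichar;
typedef struct uni__undefined_struct * uni_string;

enum uni_encoding_schemes
{
  uni_iso8859_1, uni_utf8, uni_utf16, uni_utf16be, uni_utf16le
};

/* Hypothetical helper: count the code units that precede the zero
 * terminator (assumed here). The result is in code units, not bytes
 * and not code points. */
size_t
example_length_in_code_units (enum uni_encoding_schemes encoding,
                              uni_string s)
{
  size_t n = 0;

  assert (s != NULL);
  if (encoding == uni_iso8859_1 || encoding == uni_utf8)
    {
      const t_uchar * p = (const t_uchar *)s;   /* 8-bit code units */
      while (p[n] != 0)
        ++n;
    }
  else
    {
      /* A zero code unit is zero in either byte order, so the same
       * loop serves uni_utf16, uni_utf16be, and uni_utf16le. */
      const t_unichar * p = (const t_unichar *)s;
      while (p[n] != 0)
        ++n;
    }
  return n;
}
```

For example, U+00E9 occupies two code units in UTF-8 (0xC3 0xA9) but a single code unit in UTF-16 (0x00E9), so the same one-character string has length 2 in one encoding and length 1 in the other.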
These functions were not ready for the current release of the Hackerlab C Library. They will be included in future releases.
The functions and macros in this chapter present programs with an interface to various properties extracted from the Unicode Character Database as published by the Unicode Consortium.
For information about the version of the database used and the implications of using these functions on program size, see Data Sheet for the Hackerlab Unicode Database.
Function unidata_is_assigned_code_point
int unidata_is_assigned_code_point (t_unicode c);
Return 1 if c is an assigned code point, 0 otherwise.
A code point is assigned if it has an entry in unidata.txt or is part of a range of characters whose end-points are defined in unidata.txt.
Type enum uni_general_category
enum uni_general_category;
The General Category of a Unicode character is represented by an enumerated value of this type.
The primary category values are:
  uni_general_category_Lu    Letter, uppercase
  uni_general_category_Ll    Letter, lowercase
  uni_general_category_Lt    Letter, titlecase
  uni_general_category_Lm    Letter, modifier
  uni_general_category_Lo    Letter, other

  uni_general_category_Mn    Mark, nonspacing
  uni_general_category_Mc    Mark, spacing combining
  uni_general_category_Me    Mark, enclosing

  uni_general_category_Nd    Number, decimal digit
  uni_general_category_Nl    Number, letter
  uni_general_category_No    Number, other

  uni_general_category_Zs    Separator, space
  uni_general_category_Zl    Separator, line
  uni_general_category_Zp    Separator, paragraph

  uni_general_category_Cc    Other, control
  uni_general_category_Cf    Other, format
  uni_general_category_Cs    Other, surrogate
  uni_general_category_Co    Other, private use
  uni_general_category_Cn    Other, not assigned

  uni_general_category_Pc    Punctuation, connector
  uni_general_category_Pd    Punctuation, dash
  uni_general_category_Ps    Punctuation, open
  uni_general_category_Pe    Punctuation, close
  uni_general_category_Pi    Punctuation, initial quote
  uni_general_category_Pf    Punctuation, final quote
  uni_general_category_Po    Punctuation, other

  uni_general_category_Sm    Symbol, math
  uni_general_category_Sc    Symbol, currency
  uni_general_category_Sk    Symbol, modifier
  uni_general_category_So    Symbol, other
Seven additional synthetic categories are defined. These are:
  uni_general_category_L     Letter
  uni_general_category_M     Mark
  uni_general_category_N     Number
  uni_general_category_Z     Separator
  uni_general_category_C     Other
  uni_general_category_P     Punctuation
  uni_general_category_S     Symbol
No character is given a synthetic category as its general category. Rather, the synthetic categories are used in some interfaces to refer to all characters having a general category within one of the synthetic categories.
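The relationship between primary and synthetic categories can be read off the names: each synthetic category is named by the first letter of the two-letter abbreviations it covers. The sketch below expresses that rule on the abbreviations themselves, since the numeric values of the enumerators are not specified here. The helper example_in_synthetic_category is hypothetical and not part of the library.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: return 1 if the primary category named by the
 * two-letter abbreviation PRIMARY (for example "Lu" or "Nd") falls
 * within the synthetic category SYNTHETIC ('L', 'M', 'N', 'Z', 'C',
 * 'P', or 'S'), and 0 otherwise. */
int
example_in_synthetic_category (const char * primary, char synthetic)
{
  assert (primary != NULL && strlen (primary) == 2);
  return primary[0] == synthetic;
}
```

Under this rule, "Lu", "Ll", "Lt", "Lm", and "Lo" all fall within the synthetic category L, while "Nd" falls within N.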
Function unidata_general_category
enum uni_general_category unidata_general_category (t_unicode c);
Return the general category of c.
The category returned for unassigned code points is uni_general_category_Cn (Other, Not Assigned).
Function unidata_decimal_digit_value
int unidata_decimal_digit_value (t_unicode c);
If c is a decimal digit (regardless of script) return its digit value. Otherwise, return -1.
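A function with this contract makes script-independent number parsing straightforward. The sketch below parses a run of leading decimal digits from an array of code points. To stay self-contained it uses a small stand-in, example_decimal_digit_value, that recognizes only ASCII digits and Arabic-Indic digits (U+0660 .. U+0669); a real program would presumably call unidata_decimal_digit_value instead. The t_unicode typedef (assumed here to be 32 bits wide) and the helper example_parse_decimal are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t t_unicode;  /* stand-in; width assumed for this sketch */

/* Stand-in with the same contract as unidata_decimal_digit_value, but
 * covering only ASCII digits and Arabic-Indic digits. */
static int
example_decimal_digit_value (t_unicode c)
{
  if (c >= 0x0030 && c <= 0x0039)
    return (int)(c - 0x0030);
  if (c >= 0x0660 && c <= 0x0669)
    return (int)(c - 0x0660);
  return -1;
}

/* Hypothetical helper: parse the decimal digits at the start of
 * STR[0 .. LEN), store the result in *VALUE, and return the number of
 * code points consumed. Overflow is not checked in this sketch. */
size_t
example_parse_decimal (const t_unicode * str, size_t len,
                       unsigned long * value)
{
  size_t i;
  unsigned long v = 0;

  assert (str != NULL && value != NULL);
  for (i = 0; i < len; ++i)
    {
      int d = example_decimal_digit_value (str[i]);
      if (d < 0)
        break;                       /* first non-digit ends the number */
      v = v * 10 + (unsigned long)d;
    }
  *value = v;
  return i;
}
```

Because the digit value is looked up per code point, a string that mixes scripts, such as U+0034 followed by U+0662, parses as 42.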
Type enum uni_bidi_category
enum uni_bidi_category;
The Bidirectional Category of a Unicode character is represented by an enumerated value of this type.
The bidi category values are:
  uni_bidi_L      Left-to-Right
  uni_bidi_LRE    Left-to-Right Embedding
  uni_bidi_LRO    Left-to-Right Override
  uni_bidi_R      Right-to-Left
  uni_bidi_AL     Right-to-Left Arabic
  uni_bidi_RLE    Right-to-Left Embedding
  uni_bidi_RLO    Right-to-Left Override
  uni_bidi_PDF    Pop Directional Format
  uni_bidi_EN     European Number
  uni_bidi_ES     European Number Separator
  uni_bidi_ET     European Number Terminator
  uni_bidi_AN     Arabic Number
  uni_bidi_CS     Common Number Separator
  uni_bidi_NSM    Non-Spacing Mark
  uni_bidi_BN     Boundary Neutral
  uni_bidi_B      Paragraph Separator
  uni_bidi_S      Segment Separator
  uni_bidi_WS     Whitespace
  uni_bidi_ON     Other Neutrals
Function unidata_bidi_category
enum uni_bidi_category unidata_bidi_category (t_unicode c);
Return the bidirectional category of c.
The category returned for unassigned code points is uni_bidi_ON (other neutrals).
int unidata_is_mirrored (t_unicode c);
Return 1 if c is mirrored in bidirectional text, 0 otherwise.
Macro unidata_canonical_combining_class
#define unidata_canonical_combining_class(C)
Return the canonical combining class of a Unicode character. Combining classes are represented as unsigned 8-bit integers.
These functions use the case mappings in unidata.txt.
t_unicode unidata_to_upper (t_unicode c);
If c has a default uppercase mapping, return that mapping. Otherwise, return c.
t_unicode unidata_to_lower (t_unicode c);
If c has a default lowercase mapping, return that mapping. Otherwise, return c.
t_unicode unidata_to_title (t_unicode c);
If c has a default titlecase mapping, return that mapping. Otherwise, return c.
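Since each of these functions maps one code point to one code point, converting a whole string is a simple loop. The sketch below applies a mapping function to an array of code points in place. The mapping is passed as a function pointer so that unidata_to_upper, unidata_to_lower, or unidata_to_title could presumably be supplied in a real program; here an ASCII-only stand-in, example_to_upper_ascii, keeps the fragment self-contained. The t_unicode typedef (width assumed) and all example_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t t_unicode;  /* stand-in; width assumed for this sketch */

/* Any function with the shape of unidata_to_upper and friends. */
typedef t_unicode (*example_case_fn) (t_unicode c);

/* ASCII-only stand-in for unidata_to_upper. */
static t_unicode
example_to_upper_ascii (t_unicode c)
{
  if (c >= 0x0061 && c <= 0x007A)   /* 'a' .. 'z' */
    return c - 0x0020;
  return c;                         /* no mapping: return c unchanged */
}

/* Hypothetical helper: replace each of the LEN code points in STR
 * with the result of applying FN to it. */
void
example_map_case (t_unicode * str, size_t len, example_case_fn fn)
{
  size_t i;

  assert (str != NULL && fn != NULL);
  for (i = 0; i < len; ++i)
    str[i] = fn (str[i]);
}
```

A one-to-one mapping of this kind cannot express conversions that change the number of characters, such as uppercasing U+00DF (ß) to "SS"; such characters are simply returned unchanged.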
Type enum uni_decomposition_type
enum uni_decomposition_type;
The decomposition mapping of a character is described by values of this enumerated type:
  uni_decomposition_none
  uni_decomposition_canonical
  uni_decomposition_font
  uni_decomposition_noBreak
  uni_decomposition_initial
  uni_decomposition_medial
  uni_decomposition_final
  uni_decomposition_isolated
  uni_decomposition_circle
  uni_decomposition_super
  uni_decomposition_sub
  uni_decomposition_vertical
  uni_decomposition_wide
  uni_decomposition_narrow
  uni_decomposition_small
  uni_decomposition_square
  uni_decomposition_fraction
  uni_decomposition_compat
The value uni_decomposition_none indicates that a character has no decomposition mapping.
Type struct uni_decomposition_mapping
struct uni_decomposition_mapping;
A character's decomposition mapping is described by this structure. It has the fields:
  enum uni_decomposition_type type;
  t_unicode * decomposition;
type is the type of decomposition.
If type is not uni_decomposition_none, then decomposition is a 0-terminated array of code points which are the decomposition of the character.
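To illustrate how a 0-terminated decomposition array is walked, the sketch below counts the code points in a mapping. The enum and struct are local stand-ins that follow the descriptions above (the enumerator order is copied from the list and may not match the library's numeric values), t_unicode is assumed to be 32 bits wide, and example_decomposition_length is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t t_unicode;  /* stand-in; width assumed for this sketch */

/* Local stand-ins following the descriptions in the text. */
enum uni_decomposition_type
{
  uni_decomposition_none,
  uni_decomposition_canonical,
  uni_decomposition_font,
  uni_decomposition_noBreak,
  uni_decomposition_initial,
  uni_decomposition_medial,
  uni_decomposition_final,
  uni_decomposition_isolated,
  uni_decomposition_circle,
  uni_decomposition_super,
  uni_decomposition_sub,
  uni_decomposition_vertical,
  uni_decomposition_wide,
  uni_decomposition_narrow,
  uni_decomposition_small,
  uni_decomposition_square,
  uni_decomposition_fraction,
  uni_decomposition_compat
};

struct uni_decomposition_mapping
{
  enum uni_decomposition_type type;
  t_unicode * decomposition;
};

/* Hypothetical helper: number of code points in the decomposition,
 * or 0 if the character has no decomposition mapping. */
size_t
example_decomposition_length (const struct uni_decomposition_mapping * m)
{
  size_t n = 0;

  assert (m != NULL);
  if (m->type == uni_decomposition_none)
    return 0;                    /* decomposition is not examined */
  while (m->decomposition[n] != 0)
    ++n;
  return n;
}
```

For instance, U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) has the canonical decomposition U+0041 U+030A, so a mapping holding that array has length 2.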
Macro unidata_character_decomposition_mapping
#define unidata_character_decomposition_mapping(C)
Return the decomposition mapping of C. This macro returns a pointer to a struct uni_decomposition_mapping.
struct uni_block;
Structures of this type describe one of the standard blocks of Unicode characters ("Basic Latin", "Latin-1 Supplement", etc.).
struct uni_block
{
  t_uchar * name;     /* name of the block */
  t_unichar start;    /* first character in the block */
  t_unichar end;      /* last character in the block */
};
extern struct uni_block uni_blocks[];
uni_blocks lists the standard Unicode blocks by name. The array is sorted in code-point order, from least to greatest.
n_uni_blocks is the number of blocks in uni_blocks. The array is terminated by an entry whose name field is 0:
  uni_blocks[n_uni_blocks].name == 0
  extern const struct uni_block uni_blocks[];
  extern const int n_uni_blocks;
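Because the table is sorted and terminated by an entry with a null name, a lookup can stop as soon as it passes the code point in question. The sketch below searches a small stand-in table, example_blocks, that contains only the first three standard blocks; a real program would presumably pass uni_blocks instead. The typedefs, the struct definition (copied from the text), and example_find_block are assumptions made so the fragment compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring the declarations in the text. */
typedef uint8_t  t_uchar;
typedef uint16_t t_unichar;
typedef uint32_t t_unicode;  /* width assumed for this sketch */

struct uni_block
{
  t_uchar * name;     /* name of the block */
  t_unichar start;    /* first character in the block */
  t_unichar end;      /* last character in the block */
};

/* Tiny stand-in for uni_blocks: sorted by code point, ending with an
 * entry whose name is 0. */
static struct uni_block example_blocks[] =
{
  { (t_uchar *)"Basic Latin",        0x0000, 0x007F },
  { (t_uchar *)"Latin-1 Supplement", 0x0080, 0x00FF },
  { (t_uchar *)"Latin Extended-A",   0x0100, 0x017F },
  { 0, 0, 0 }
};

/* Hypothetical helper: return the block containing C, or NULL if no
 * block in BLOCKS contains it. */
const struct uni_block *
example_find_block (const struct uni_block * blocks, t_unicode c)
{
  const struct uni_block * b;

  assert (blocks != NULL);
  for (b = blocks; b->name != 0; ++b)
    {
      if (c < b->start)
        break;          /* sorted table: no later block can contain c */
      if (c <= b->end)
        return b;
    }
  return NULL;
}
```

A linear scan is used for clarity; since the table is sorted, a binary search over the first n_uni_blocks entries would work equally well for larger tables.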
bits uni_universal_bitset (void);
Return the set of all assigned code points which are not surrogate code points and are not private use code points. The set is represented as a shared bitset tree. (See Shared Bitset Trees.)
The shared bitset tree returned by this function uses the tree structure defined by uni_bits_tree_rule. (See Unicode Character Bitsets.)
Programs should not attempt to modify the set returned by this function.
Function uni_general_category_bitset
bits uni_general_category_bitset (enum uni_general_category c);
Return the set of all assigned code points having the indicated general category or synthetic general category. The set is represented as a shared bitset tree. (See Shared Bitset Trees.)
c indicates which category to return. It may be a Unicode general category or a synthetic general category. (See General Category.)
The shared bitset tree returned by this function uses the tree structure defined by uni_bits_tree_rule. (See Unicode Character Bitsets.)
Programs should not attempt to modify the set returned by this function.