The Hackerlab at regexps.com

Unicode

This chapter describes the foundation of support for the Unicode character set in the Hackerlab C library.

This chapter is not a tutorial introduction to Unicode. We presume that readers are already somewhat familiar with Unicode. A very brief introduction can be found in An Absurdly Brief Introduction to Unicode.


Naming Unicode Coding Forms

Type uni_encoding_schemes

enum uni_encoding_schemes;

Values of the enumerated type uni_encoding_schemes are used in interfaces throughout the Hackerlab C library to identify encoding schemes for strings or streams of Unicode characters. (See An Absurdly Brief Introduction to Unicode.)

     enum uni_encoding_schemes
      {
        uni_iso8859_1,
        uni_utf8,
        uni_utf16,
        uni_utf16be,
        uni_utf16le,
      };     

uni_iso8859_1 refers to a degenerate encoding scheme. Each character is stored in one byte. Only characters in the range U+0000 .. U+00FF can be represented.

uni_utf8 refers to the UTF-8 encoding scheme.

uni_utf16 refers to UTF-16 in the native byte order of the machine.

uni_utf16be refers to UTF-16, explicitly in big-endian order.

uni_utf16le refers to UTF-16, explicitly in little-endian order.

Some low-level functions in the Hackerlab C library work with any of these five encodings. Higher-level functions work only with uni_iso8859_1, uni_utf8, and uni_utf16.

Code units in a uni_utf8 string are of type t_uchar (unsigned 8-bit integer). Code units in a uni_utf16 string are of type t_unichar (unsigned 16-bit integer). Unicode code points are of type t_unicode. (See Machine-Specific Definitions.)
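To make the relationship between code points and UTF-16 code units concrete, here is a minimal sketch that converts one code point into one or two 16-bit code units. The typedefs are local stand-ins with assumed widths (the library supplies its own definitions), and example_utf16_encode is a hypothetical helper, not a Hackerlab function.

```c
/* Local stand-ins for the library's types; the widths are assumptions. */
typedef unsigned short t_unichar;   /* one UTF-16 code unit */
typedef unsigned long t_unicode;    /* one Unicode code point */

/* Hypothetical helper: store the UTF-16 code units for code point C
 * in OUT (which must have room for 2) and return how many were
 * written.  Returns 0 if C is a surrogate or lies beyond U+10FFFF.
 */
int
example_utf16_encode (t_unichar * out, t_unicode c)
{
  if (c >= 0xD800 && c <= 0xDFFF)
    return 0;                       /* surrogates are not characters */
  if (c < 0x10000)
    {
      out[0] = (t_unichar)c;        /* one code unit suffices */
      return 1;
    }
  if (c > 0x10FFFF)
    return 0;                       /* outside the Unicode code space */
  c -= 0x10000;
  out[0] = (t_unichar)(0xD800 + (c >> 10));     /* high surrogate */
  out[1] = (t_unichar)(0xDC00 + (c & 0x3FF));   /* low surrogate */
  return 2;
}
```

Under these assumptions, U+00E9 occupies a single code unit, while U+1D11E is stored as the pair 0xD834 0xDD1E.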




Representing Unicode Strings

The Hackerlab C Library is designed to operate correctly for programs which internally use any combination of the encodings iso8859-1, utf-8, and utf-16. (Future releases are likely to add support for utf-32.)

Type uni_string

typedef struct uni__undefined_struct * uni_string;

The type uni_string is a pointer to a value of unknown size. It is used to represent the address of a Unicode string or an address within a Unicode string.

Any two uni_string pointers may be compared for equality.

uni_string pointers within a single string may be compared using any relational operator (<, >, etc.).

uni_string pointers are created from UTF-8 pointers (t_uchar *) and from UTF-16 pointers (t_unichar *) by means of a cast:

     uni_string s = (uni_string)utf_8_string;
     uni_string t = (uni_string)utf_16_string;

By convention, all functions that operate on Unicode strings accept two parameters for each string: an encoding form and a string pointer, as in this function declaration:

     void uni_fn (enum uni_encoding_schemes encoding,
                  uni_string s);

By convention, the length of a Unicode string is always measured in code units, no matter what the size of those code units. Integer string indexes are also measured in code units.
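As an illustration of the conventions above, the following sketch takes an encoding and a uni_string, casts the pointer back to the matching code-unit type, and counts code units. The typedefs and the enum are local stand-ins for the library's definitions so that the sketch compiles on its own. example_length is a hypothetical helper, not a library function, and the terminating zero code unit is an assumption of the sketch.

```c
#include <stddef.h>

/* Local stand-ins for the library's definitions. */
typedef unsigned char t_uchar;        /* 8-bit code unit */
typedef unsigned short t_unichar;     /* 16-bit code unit (assumed width) */
typedef struct uni__undefined_struct * uni_string;

enum uni_encoding_schemes
{
  uni_iso8859_1,
  uni_utf8,
  uni_utf16,
  uni_utf16be,
  uni_utf16le
};

/* Hypothetical helper: length of S in code units, assuming S ends
 * with a zero code unit.  Returns (size_t)-1 for encodings that this
 * sketch does not handle.
 */
size_t
example_length (enum uni_encoding_schemes encoding, uni_string s)
{
  size_t n = 0;

  switch (encoding)
    {
    case uni_iso8859_1:
    case uni_utf8:
      {
        const t_uchar * p = (const t_uchar *)s;
        while (p[n])
          ++n;
        return n;
      }
    case uni_utf16:
      {
        const t_unichar * p = (const t_unichar *)s;
        while (p[n])
          ++n;
        return n;
      }
    default:
      return (size_t)-1;
    }
}
```

For the UTF-8 bytes C3 A9 61 (two characters), this helper reports 3, because lengths are counted in code units rather than characters. The same two characters stored as UTF-16 give a length of 2.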



Basic Unicode String Functions

These functions were not ready for the current release of the Hackerlab C Library. They will be included in future releases.


Unicode Character Properties

The functions and macros in this chapter present programs with an interface to various properties extracted from the Unicode Character Database as published by the Unicode Consortium.

For information about the version of the database used and the implications of using these functions on program size, see Data Sheet for the Hackerlab Unicode Database.


Assigned Code Points

Function unidata_is_assigned_code_point

int unidata_is_assigned_code_point (t_unicode c);

Return 1 if c is an assigned code point, 0 otherwise.

A code point is assigned if it has an entry in unidata.txt or is part of a range of characters whose end-points are defined in unidata.txt.
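The rule above can be pictured as a lookup in a sorted table in which an individually listed character is a range of length one. The sketch below is only an illustration of that idea, not the library's implementation: the table is a three-entry toy excerpt, the t_unicode typedef is a local stand-in, and example_is_assigned is a hypothetical name. The Hangul syllables serve as the range example because the database lists only their first and last code points (U+AC00 and U+D7A3).

```c
#include <stddef.h>

typedef unsigned long t_unicode;    /* local stand-in; width assumed */

struct example_range
{
  t_unicode first;
  t_unicode last;
};

/* Toy excerpt, sorted by code point.  Characters with their own
 * database entry have first == last.
 */
static const struct example_range example_ranges[] =
{
  { 0x0041, 0x0041 },   /* LATIN CAPITAL LETTER A */
  { 0x0061, 0x0061 },   /* LATIN SMALL LETTER A */
  { 0xAC00, 0xD7A3 }    /* Hangul syllables, first .. last */
};

/* Hypothetical helper: return 1 if C falls in one of the ranges of
 * the toy table, 0 otherwise (binary search).
 */
int
example_is_assigned (t_unicode c)
{
  size_t lo = 0;
  size_t hi = sizeof (example_ranges) / sizeof (example_ranges[0]);

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (c < example_ranges[mid].first)
        hi = mid;
      else if (c > example_ranges[mid].last)
        lo = mid + 1;
      else
        return 1;
    }
  return 0;
}
```

With this table, U+B000 is reported as assigned although it has no entry of its own, while U+D7A4, just past the end of the range, is not.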




General Category

Type enum uni_general_category

enum uni_general_category;

The General Category of a Unicode character is represented by an enumerated value of this type.

The primary category values are:

     uni_general_category_Lu         Letter, uppercase
     uni_general_category_Ll         Letter, lowercase
     uni_general_category_Lt         Letter, titlecase
     uni_general_category_Lm         Letter, modifier
     uni_general_category_Lo         Letter, other

     uni_general_category_Mn         Mark, nonspacing
     uni_general_category_Mc         Mark, spacing combining
     uni_general_category_Me         Mark, enclosing

     uni_general_category_Nd         Number, decimal digit
     uni_general_category_Nl         Number, letter
     uni_general_category_No         Number, other

     uni_general_category_Zs         Separator, space
     uni_general_category_Zl         Separator, line
     uni_general_category_Zp         Separator, paragraph

     uni_general_category_Cc         Other, control
     uni_general_category_Cf         Other, format
     uni_general_category_Cs         Other, surrogate
     uni_general_category_Co         Other, private use
     uni_general_category_Cn         Other, not assigned

     uni_general_category_Pc         Punctuation, connector
     uni_general_category_Pd         Punctuation, dash
     uni_general_category_Ps         Punctuation, open
     uni_general_category_Pe         Punctuation, close
     uni_general_category_Pi         Punctuation, initial quote
     uni_general_category_Pf         Punctuation, final quote
     uni_general_category_Po         Punctuation, other

     uni_general_category_Sm         Symbol, math
     uni_general_category_Sc         Symbol, currency
     uni_general_category_Sk         Symbol, modifier
     uni_general_category_So         Symbol, other

Seven additional synthetic categories are defined. These are:

     uni_general_category_L          Letter
     uni_general_category_M          Mark
     uni_general_category_N          Number
     uni_general_category_Z          Separator
     uni_general_category_C          Other
     uni_general_category_P          Punctuation
     uni_general_category_S          Symbol

No character is given a synthetic category as its general category. Rather, the synthetic categories are used in some interfaces to refer to all characters having a general category within one of the synthetic categories.
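One way to read the relationship between primary and synthetic categories is as a many-to-one mapping. The sketch below uses an abbreviated local stand-in for the enum (the real type has every value listed above, and its numeric values may differ). Both helpers are hypothetical and are not part of the library.

```c
/* Abbreviated local stand-in for the library's enum. */
enum uni_general_category
{
  uni_general_category_Lu,
  uni_general_category_Ll,
  uni_general_category_Nd,
  uni_general_category_Zs,
  uni_general_category_L,
  uni_general_category_N,
  uni_general_category_Z
};

/* Hypothetical helper: return the synthetic category that contains
 * the primary category CAT.  A synthetic category maps to itself.
 */
enum uni_general_category
example_synthetic_category (enum uni_general_category cat)
{
  switch (cat)
    {
    case uni_general_category_Lu:
    case uni_general_category_Ll:
      return uni_general_category_L;
    case uni_general_category_Nd:
      return uni_general_category_N;
    case uni_general_category_Zs:
      return uni_general_category_Z;
    default:
      return cat;
    }
}

/* Hypothetical helper: does a character whose general category is
 * CAT belong to WANTED, where WANTED may be primary or synthetic?
 */
int
example_category_matches (enum uni_general_category cat,
                          enum uni_general_category wanted)
{
  return cat == wanted || example_synthetic_category (cat) == wanted;
}
```

Here an uppercase letter (Lu) matches both Lu and the synthetic category L, but not Ll or N.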



Function unidata_general_category

enum uni_general_category unidata_general_category (t_unicode c);

Return the general category of c.

The category returned for unassigned code points is uni_general_category_Cn (Other, Not Assigned).




Unicode Decimal Digit Values

Function unidata_decimal_digit_value

int unidata_decimal_digit_value (t_unicode c);

If c is a decimal digit (regardless of script), return its digit value. Otherwise, return -1.
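A function with this contract makes it straightforward to read numbers written in any script. The sketch below shows the idea with a stand-in, example_decimal_digit_value, that knows only the ASCII, Arabic-Indic, and Devanagari digits. A program linked against the library would presumably call unidata_decimal_digit_value instead. The t_unicode typedef and example_parse_decimal are likewise assumptions made for illustration.

```c
typedef unsigned long t_unicode;    /* local stand-in; width assumed */

/* Stand-in with the same contract as unidata_decimal_digit_value,
 * but covering only three scripts.
 */
static int
example_decimal_digit_value (t_unicode c)
{
  if (c >= 0x0030 && c <= 0x0039)
    return (int)(c - 0x0030);       /* ASCII digits */
  if (c >= 0x0660 && c <= 0x0669)
    return (int)(c - 0x0660);       /* Arabic-Indic digits */
  if (c >= 0x0966 && c <= 0x096F)
    return (int)(c - 0x0966);       /* Devanagari digits */
  return -1;
}

/* Hypothetical helper: parse the leading decimal digits of the
 * 0-terminated code point array S.  Returns -1 if S does not start
 * with a digit.  Overflow and mixed scripts are not checked.
 */
long
example_parse_decimal (const t_unicode * s)
{
  long value = 0;
  int seen = 0;

  for (; *s; ++s)
    {
      int d = example_decimal_digit_value (*s);

      if (d < 0)
        break;
      value = value * 10 + d;
      seen = 1;
    }
  return seen ? value : -1;
}
```

Given the Arabic-Indic digits U+0664 U+0662, the parser yields 42.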




Unicode Bidirectional Properties

Type enum uni_bidi_category

enum uni_bidi_category;

The Bidirectional Category of a Unicode character is represented by an enumerated value of this type.

The bidi category values are:

     uni_bidi_L      Left-to-Right
     uni_bidi_LRE    Left-to-Right Embedding
     uni_bidi_LRO    Left-to-Right Override
     uni_bidi_R      Right-to-Left
     uni_bidi_AL     Right-to-Left Arabic
     uni_bidi_RLE    Right-to-Left Embedding
     uni_bidi_RLO    Right-to-Left Override
     uni_bidi_PDF    Pop Directional Format
     uni_bidi_EN     European Number
     uni_bidi_ES     European Number Separator
     uni_bidi_ET     European Number Terminator
     uni_bidi_AN     Arabic Number
     uni_bidi_CS     Common Number Separator
     uni_bidi_NSM    Non-Spacing Mark
     uni_bidi_BN     Boundary Neutral
     uni_bidi_B      Paragraph Separator
     uni_bidi_S      Segment Separator
     uni_bidi_WS     Whitespace
     uni_bidi_ON     Other Neutrals



Function unidata_bidi_category

enum uni_bidi_category unidata_bidi_category (t_unicode c);

Return the bidirectional category of c.

The category returned for unassigned code points is uni_bidi_ON (other neutrals).



Function unidata_is_mirrored

int unidata_is_mirrored (t_unicode c);

Return 1 if c is mirrored in bidirectional text, 0 otherwise.




Canonical Combining Class

Macro unidata_canonical_combining_class

#define unidata_canonical_combining_class(C)

Return the canonical combining class of a Unicode character.

Combining classes are represented as unsigned 8-bit integers.




Simple Unicode Case Conversions

These functions use the case mappings in unidata.txt.

Function unidata_to_upper

t_unicode unidata_to_upper (t_unicode c);

If c has a default uppercase mapping, return that mapping. Otherwise, return c.



Function unidata_to_lower

t_unicode unidata_to_lower (t_unicode c);

If c has a default lowercase mapping, return that mapping. Otherwise, return c.



Function unidata_to_title

t_unicode unidata_to_title (t_unicode c);

If c has a default titlecase mapping, return that mapping. Otherwise, return c.




Character Decomposition Mapping

Type enum uni_decomposition_type

enum uni_decomposition_type;

The decomposition mapping of a character is described by values of this enumerated type:

     uni_decomposition_none
     uni_decomposition_canonical
     uni_decomposition_font
     uni_decomposition_noBreak
     uni_decomposition_initial
     uni_decomposition_medial
     uni_decomposition_final
     uni_decomposition_isolated
     uni_decomposition_circle
     uni_decomposition_super
     uni_decomposition_sub
     uni_decomposition_vertical
     uni_decomposition_wide
     uni_decomposition_narrow
     uni_decomposition_small
     uni_decomposition_square
     uni_decomposition_fraction
     uni_decomposition_compat

The value uni_decomposition_none indicates that a character has no decomposition mapping.



Type struct uni_decomposition_mapping

struct uni_decomposition_mapping;

A character's decomposition mapping is described by this structure. It has the fields:

     enum uni_decomposition_type type;
     t_unicode * decomposition;

type is the type of decomposition.

If type is not uni_decomposition_none, then decomposition is a 0-terminated array of code points which are the decomposition of the character.



Macro unidata_character_decomposition_mapping

#define unidata_character_decomposition_mapping(C)

Return the decomposition mapping of C. This macro returns a pointer to a struct uni_decomposition_mapping.
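Code that receives such a pointer typically checks the type field first and then walks the 0-terminated array. The following sketch shows that pattern. The structure follows the field list given above, but the enum is abbreviated, the t_unicode typedef is a local stand-in, and example_decomposition_length is a hypothetical helper rather than a library function.

```c
#include <stddef.h>

typedef unsigned long t_unicode;    /* local stand-in; width assumed */

/* Abbreviated local stand-in; the real enum has more values. */
enum uni_decomposition_type
{
  uni_decomposition_none,
  uni_decomposition_canonical,
  uni_decomposition_compat
};

/* Mirrors the fields documented above. */
struct uni_decomposition_mapping
{
  enum uni_decomposition_type type;
  t_unicode * decomposition;
};

/* Hypothetical helper: number of code points in the decomposition
 * described by M, or 0 if the character has no decomposition.
 */
size_t
example_decomposition_length (const struct uni_decomposition_mapping * m)
{
  size_t n = 0;

  if (m->type == uni_decomposition_none)
    return 0;                 /* decomposition must not be examined */
  while (m->decomposition[n])
    ++n;
  return n;
}
```

For example, U+00E9 has the canonical decomposition U+0065 U+0301, so a mapping holding that array has length 2.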




Unicode Blocks

Type struct uni_block

struct uni_block;

Structures of this type describe one of the standard blocks of Unicode characters ("Basic Latin", "Latin-1 Supplement", etc.).


struct uni_block
{
  t_uchar * name;       /* name of the block */
  t_unichar start;      /* first character in the block */
  t_unichar end;        /* last character in the block */
};



Variable uni_blocks

extern const struct uni_block uni_blocks[];

This array describes the standard Unicode blocks. It is sorted in code-point order, from least to greatest.

n_uni_blocks is the number of blocks in uni_blocks. The last block is followed by a terminating entry whose name is 0:

     uni_blocks[n_uni_blocks].name == 0


extern const struct uni_block uni_blocks[];
extern const int n_uni_blocks;
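Because the table is sorted and ends with an entry whose name is 0, a block lookup can be written as a simple scan. The sketch below runs against a three-block toy table, example_blocks, rather than the library's uni_blocks. The structure is copied from the definition above, the typedefs are local stand-ins, and example_find_block is a hypothetical helper.

```c
/* Local stand-ins for the library's types; widths are assumptions. */
typedef unsigned char t_uchar;
typedef unsigned short t_unichar;

struct uni_block
{
  t_uchar * name;       /* name of the block */
  t_unichar start;      /* first character in the block */
  t_unichar end;        /* last character in the block */
};

/* Toy table in the same layout as uni_blocks: sorted by code point
 * and terminated by an entry whose name is 0.
 */
static struct uni_block example_blocks[] =
{
  { (t_uchar *)"Basic Latin",        0x0000, 0x007F },
  { (t_uchar *)"Latin-1 Supplement", 0x0080, 0x00FF },
  { (t_uchar *)"Latin Extended-A",   0x0100, 0x017F },
  { 0, 0, 0 }
};

/* Hypothetical helper: return the block in BLOCKS that contains C,
 * or 0 if no block does.
 */
const struct uni_block *
example_find_block (const struct uni_block * blocks, t_unichar c)
{
  for (; blocks->name; ++blocks)
    {
      if (c < blocks->start)
        break;                /* sorted: no later block can match */
      if (c <= blocks->end)
        return blocks;
    }
  return 0;
}
```

Looking up U+00E9 in the toy table yields the "Latin-1 Supplement" entry. A binary search over the n_uni_blocks entries would be a natural refinement for the full table.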





Unicode Category Bitsets

Function uni_universal_bitset

bits uni_universal_bitset (void);

Return the set of all assigned code points which are not surrogate code points and are not private use code points. The set is represented as a shared bitset tree. (See Shared Bitset Trees.)

The shared bitset tree returned by this function uses the tree structure defined by uni_bits_tree_rule. (See Unicode Character Bitsets.)

Programs should not attempt to modify the set returned by this function.



Function uni_general_category_bitset

bits uni_general_category_bitset (enum uni_general_category c);

Return the set of all assigned code points having the indicated general category or synthetic general category. The set is represented as a shared bitset tree. (See Shared Bitset Trees.)

c indicates which category to return. It may be a Unicode general category or a synthetic general category. (See General Category.)

The shared bitset tree returned by this function uses the tree structure defined by uni_bits_tree_rule. (See Unicode Character Bitsets.)

Programs should not attempt to modify the set returned by this function.



libhackerlab: The Hackerlab C Library