Ethiopic Character Classes

Why Ethiopic Character Classes?

Because they're not there! Why character classes at all? The whole point of RE languages is to make pattern matching simpler than writing your own matching algorithms. Standardized RE languages are also easier to read, write, maintain and share than the alternatives. Syntax for character classes exists to exploit, in simple notation, the shared properties of a group of characters.

So what properties do Ethiopic characters have that RE languages do not detect? Each Ethiopic letter carries two properties that should be matchable independently. Each letter is a syllable, a "CV" pattern, so a means to detect either the "C" part or the "V" part is highly desirable. There are 7 basic "V" forms shared by 37 "C" bases. This gives us 7 form classes of 37 members each and, conversely, 37 family classes of 7 members each. We need a simple way to specify these groups without typing out every member of the group in a bracketed list.
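
To make the two properties concrete, here is a minimal sketch in core Perl that derives a syllable's family and form from its code point. It is not how the package is implemented; it assumes the Unicode layout in which each basic family occupies a row of eight code points, with forms 1-7 in the first seven columns. The names sketch_form and sketch_family are hypothetical, and the eighth column and the labialized rows are not handled.

```perl
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':encoding(UTF-8)' );

# Assumption: basic families start at a code point that is a multiple of 8,
# and forms 1-7 sit in columns 0-6 of that row.
sub sketch_form   { my ($ch) = @_; return ord($ch) % 8 + 1 }
sub sketch_family { my ($ch) = @_; return chr( ord($ch) - ord($ch) % 8 ) }

# Report the "C" part (family) and "V" part (form) of each syllable.
for my $syllable ( split //, "ዓለም" ) {
    printf "%s: family %s, form %d\n",
        $syllable, sketch_family($syllable), sketch_form($syllable);
}
```

The character classes described below let a regular expression ask these same two questions without any arithmetic.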

But isn't this just a matter of convenience? Yes, but we like convenience; that's why we have character classes and REs to begin with. Perl 5.6 introduced Unicode-derived character classes of the form \p{IsDigit}. While these are great for working across scripts in a multilingual document or archive, they are less convenient when working with documents you know are English only. Really, would you use \p{IsDigit} when you could more easily use \d (or [0-9] or [:digit:])?

This package overloads the Perl regular expression mechanism to provide syllabic character class definitions with the convenience and ease of use of POSIX notation. See the examples/ directory and module documentation for details.

Character Classes From Regexp::Ethiopic

Syllabic Classes: Equivalence of Family (row) and Equivalence of Form (column)

  [#1#] [#2#] [#3#] [#4#] [#5#] [#6#] [#7#] [#8#] [#9#] [#10#] [#11#] [#12#]
[#ሀ#]          
[#ለ#]        
[#ሐ#]        
[#መ#]        
[#ሠ#]        
[#ረ#]        
[#ሰ#]        
[#ሸ#]        
[#ቀ#]
[#ቐ#]
[#በ#]        
[#ቨ#]        
[#ተ#]        
[#ቸ#]        
[#ኀ#]
[#ነ#]        
[#ኘ#]        
[#አ#]        
[#ከ#]
[#ኸ#]
[#ወ#]          
[#ዐ#]          
[#ዘ#]        
[#ዠ#]        
[#የ#]          
[#ደ#]        
[#ዸ#]        
[#ጀ#]        
[#ገ#]
[#ጘ#]          
[#ጠ#]        
[#ጨ#]        
[#ጰ#]        
[#ጸ#]        
[#ፀ#]          
[#ፈ#]        
[#ፐ#]        

 

Character Class Aliases

Alias    Equivalence of Form
[:ግዕዝ:] [:geez:]    [#1#]
[:ካዕብ:] [:kaib:]    [#2#]
[:ሣልስ:] [:salis:]    [#3#]
[:ራብዕ:] [:rabi:]    [#4#]
[:ኃምስ:] [:hamis:]    [#5#]
[:ሳድስ:] [:sadis:]    [#6#]
[:ሳብዕ:] [:sabi:]    [#7#]
[:ዘመደ፡ግዕዝ:] [:zemede:geez:]    [#8#]
[:ዘመደ፡ካዕብ:] [:zemede:kaib:]    [#9#]
[:ዘመደ፡ሣልስ:] [:zemede:salis:]    [#10#]
[:ዘመደ፡ራብዕ:] [:zemede:rabi:]    [#11#]
[:ዘመደ፡ኃምስ:] [:zemede:hamis:]    [#12#]
 
Other Class    Expansion
[:አኃዝ:] [:ahaz:]    [፩-፼]
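
As an illustration of what the numeral class stands for, the sketch below uses the expansion [፩-፼] directly in core Perl, without loading the package. The sample text is made up for the example.

```perl
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':encoding(UTF-8)' );

# The bracketed range is the expansion listed for [:አኃዝ:] / [:ahaz:].
my $ahaz = qr/[፩-፼]/;

# Hypothetical sample text containing one run of Ethiopic numerals.
my $text     = "በ፲፱፻፺፭ ዓ.ም.";
my @numerals = $text =~ /($ahaz+)/g;

print "found: @numerals\n";
```

With the package loaded, the alias form spares the reader from remembering where the numeral range begins and ends.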

 


Character Classes From Regexp::Ethiopic::Amharic

Equivalence in Phono-Orthography

Class    Perl RE Expansion
[=ሀ=] [=ሐ=] [=ሃ=] [=ሓ=] [=ኀ=] [=ኃ=] [=ኻ=]    [ሀሃሐሓኀኃኻ]
[=ሁ=] [=ሑ=] [=ኁ=] [=ኹ=]    [ሁሑኁኹ]
[=ሂ=] [=ሒ=] [=ኂ=] [=ኺ=]    [ሂሒኂኺ]
[=ሄ=] [=ሔ=] [=ኄ=] [=ኼ=]    [ሄሔኄኼ]
[=ህ=] [=ሕ=] [=ኅ=] [=ኽ=]    [ህሕኅኽ]
[=ሆ=] [=ሖ=] [=ኆ=] [=ኈ=] [=ኾ=] [=ዀ=]    [ሆሖኆኈኾዀ]
[=ሗ=] [=ኋ=] [=ዃ=]    [ሗኋዃ]
[=ሰ=] [=ሠ=]    [ሰሠ]
[=ሱ=] [=ሡ=]    [ሱሡ]
[=ሲ=] [=ሢ=]    [ሲሢ]
[=ሳ=] [=ሣ=]    [ሳሣ]
[=ሴ=] [=ሤ=]    [ሴሤ]
[=ስ=] [=ሥ=]    [ስሥ]
[=ሶ=] [=ሦ=]    [ሶሦ]
[=ሷ=] [=ሧ=]    [ሷሧ]
[=ቁ=] [=ቍ=]    [ቁቍ]
[=ቆ=] [=ቈ=]    [ቆቈ]
[=አ=] [=ኣ=] [=ዐ=] [=ዓ=]    [አኣዐዓ]
[=ኡ=] [=ዑ=]    [ኡዑ]
[=ኢ=] [=ዒ=]    [ኢዒ]
[=ኤ=] [=ዔ=]    [ኤዔ]
[=እ=] [=ዕ=]    [እዕ]
[=ኦ=] [=ዖ=]    [ኦዖ]
[=ኮ=] [=ኰ=]    [ኮኰ]
[=ጎ=] [=ጐ=]    [ጎጐ]
[=ጸ=] [=ፀ=]    [ጸፀ]
[=ጹ=] [=ፁ=]    [ጹፁ]
[=ጺ=] [=ፂ=]    [ጺፂ]
[=ጻ=] [=ፃ=]    [ጻፃ]
[=ጼ=] [=ፄ=]    [ጼፄ]
[=ጽ=] [=ፅ=]    [ጽፅ]
[=ጾ=] [=ፆ=]    [ጾፆ]
 

Equivalence of Families

Class    Perl RE Expansion
[=#ሀ#=] [=#ሐ#=] [=#ኀ#=] [=#ኸ#=]    [ሀ-ሆሐ-ሗኀ-ኆኈ-ኍኸ-ኾዀ-ዅ]
[=#ሰ#=] [=#ሠ#=]    [ሰ-ሷሠ-ሧ]
[=#አ#=] [=#ዐ#=]    [አ-ኧዐ-ዖ]
[=#ጸ#=] [=#ፀ#=]    [ጸ-ጿፀ-ፆ]

 


Character Classes From Regexp::Ethiopic::Tigrigna

Equivalence in Phono-Orthography

Class    Perl RE Expansion
[=ሀ=] [=ሃ=] [=ኀ=] [=ኃ=]    [ሀሃኀኃ]
[=ሁ=] [=ኁ=]    [ሁኁ]
[=ሂ=] [=ኂ=]    [ሂኂ]
[=ሄ=] [=ኄ=]    [ሄኄ]
[=ህ=] [=ኅ=]    [ህኅ]
[=ሆ=] [=ኆ=] [=ኈ=]    [ሆኆኈ]
[=ሐ=] [=ሓ=]    [ሐሓ]
[=ሰ=] [=ሠ=]    [ሰሠ]
[=ሱ=] [=ሡ=]    [ሱሡ]
[=ሲ=] [=ሢ=]    [ሲሢ]
[=ሳ=] [=ሣ=]    [ሳሣ]
[=ሴ=] [=ሤ=]    [ሴሤ]
[=ስ=] [=ሥ=]    [ስሥ]
[=ሶ=] [=ሦ=]    [ሶሦ]
[=ሷ=] [=ሧ=]    [ሷሧ]
[=ቁ=] [=ቍ=]    [ቁቍ]
[=ቆ=] [=ቈ=]    [ቆቈ]
[=አ=] [=ኣ=]    [አኣ]
[=ኮ=] [=ኰ=]    [ኮኰ]
[=ዐ=] [=ዓ=]    [ዐዓ]
[=ጎ=] [=ጐ=]    [ጎጐ]
[=ጸ=] [=ፀ=]    [ጸፀ]
[=ጹ=] [=ፁ=]    [ጹፁ]
[=ጺ=] [=ፂ=]    [ጺፂ]
[=ጻ=] [=ፃ=]    [ጻፃ]
[=ጼ=] [=ፄ=]    [ጼፄ]
[=ጽ=] [=ፅ=]    [ጽፅ]
[=ጾ=] [=ፆ=]    [ጾፆ]
 

Equivalence of Families

Class    Perl RE Expansion
[=#ሀ#=] [=#ኀ#=]    [ሀ-ሆኀ-ኆኈ-ኍ]
[=#ሰ#=] [=#ሠ#=]    [ሰ-ሷሠ-ሧ]
[=#ጸ#=] [=#ፀ#=]    [ጸ-ጿፀ-ፆ]

 


Character Classes From Regexp::Ethiopic::Geez

Equivalence in Phono-Orthography

Class    Perl RE Expansion
[=ሀ=] [=ሃ=]    [ሀሃ]
[=ሐ=] [=ሓ=]    [ሐሓ]
[=ኀ=] [=ኃ=]    [ኀኃ]
[=አ=] [=ኣ=]    [አኣ]
[=ኮ=] [=ኰ=]    [ኮኰ]
[=ዐ=] [=ዓ=]    [ዐዓ]
[=ጎ=] [=ጐ=]    [ጎጐ]
 

Equivalence of Families

None occurring.

 


Matching by Example

The name "ዓለምፀሐይ" is a compound word formed from "ዓለም" (world) and "ፀሐይ" (sun). It is a particularly useful word to test with because it has a great many possible renderings depending on the language and writing conventions. Not all renderings that are possible are also probable:

Probability Distribution of the 56 Renderings of ዓለምፀሐይ
Amharic Tigrigna-ER Tigrigna-ET Ge'ez
Valid and Probable ዓለምፀሐይ ዓለምጸሐይ ዓለምጸሃይ ዓለምፀሃይ አለምፀሐይ አለምጸሐይ አለምጸሃይ አለምፀሃይ ዓለምጸሓይ ዓለምፀሐይ ዓለምጸሐይ ዓለምፀሐይ
Intermediate Probability አለምጸሀይ አለምፀሀይ ዓለምጸሀይ ዓለምፀሀይ ዐለምፀሐይ ዐለምጸሐይ ዐለምጸሃይ ዐለምፀሃይ ዐለምጸሀይ ዐለምፀሀይ ዓለምጸሐይ ዐለምጸሓይ ዐለምጸሐይ ዓለምፀሐይ ዓለምጸሓይ ዓለምፀሓይ ዐለምፀሐይ
Valid but Improbable ዓለምጸኃይ ዓለምጸኀይ ዓለምፀኃይ ዓለምፀኀይ ዓለምጸሓይ ዓለምፀሓይ ዓለምጸኻይ ዓለምፀኻይ አለምጸኃይ አለምጸኀይ አለምፀኃይ አለምፀኀይ አለምጸሓይ አለምፀሓይ አለምጸኻይ አለምፀኻይ ዐለምጸኃይ ዐለምጸኀይ ዐለምፀኃይ ዐለምፀኀይ ዐለምጸሓይ ዐለምፀሓይ ዐለምጸኻይ ዐለምፀኻይ ኣለምፀሐይ ኣለምጸሐይ ኣለምጸሃይ ኣለምፀሃይ ኣለምጸሀይ ኣለምጸኃይ ኣለምጸኀይ ኣለምፀሀይ ኣለምፀኃይ ኣለምፀኀይ ኣለምጸሓይ ኣለምፀሓይ ኣለምጸኻይ ኣለምፀኻይ ዓለምፀሓይ ዐለምፀሐይ ዐለምፀሓይ ዐለምፀሐይ ዐለምጸሐይ ዐለምጸሓይ ዐለምፀሓይ ዓለምፀሓይ ዐለምፀሓይ
Impossible (Invalid Phonemes)   ዓለምጸሃይ ዓለምፀሃይ ዓለምጸሀይ ዓለምጸኃይ ዓለምጸኀይ ዓለምፀሀይ ዓለምፀኃይ ዓለምፀኀይ ዓለምጸኻይ ዓለምፀኻይ አለምፀሐይ አለምጸሐይ አለምጸሃይ አለምፀሃይ አለምጸሀይ አለምጸኃይ አለምጸኀይ አለምፀሀይ አለምፀኃይ አለምፀኀይ አለምጸሓይ አለምፀሓይ አለምጸኻይ አለምፀኻይ ዐለምጸሃይ ዐለምፀሃይ ዐለምጸሀይ ዐለምጸኃይ ዐለምጸኀይ ዐለምፀሀይ ዐለምፀኃይ ዐለምፀኀይ ዐለምጸኻይ ዐለምፀኻይ ኣለምፀሐይ ኣለምጸሐይ ኣለምጸሃይ ኣለምፀሃይ ኣለምጸሀይ ኣለምጸኃይ ኣለምጸኀይ ኣለምፀሀይ ኣለምፀኃይ ኣለምፀኀይ ኣለምጸሓይ ኣለምፀሓይ ኣለምጸኻይ ኣለምፀኻይ ዓለምጸሃይ ዓለምፀሃይ ዓለምጸሀይ ዓለምጸኃይ ዓለምጸኀይ ዓለምፀሀይ ዓለምፀኃይ ዓለምፀኀይ ዓለምጸኻይ ዓለምፀኻይ አለምፀሐይ አለምጸሐይ አለምጸሃይ አለምፀሃይ አለምጸሀይ አለምጸኃይ አለምጸኀይ አለምፀሀይ አለምፀኃይ አለምፀኀይ አለምጸሓይ አለምፀሓይ አለምጸኻይ አለምፀኻይ ዐለምጸሃይ ዐለምፀሃይ ዐለምጸሀይ ዐለምጸኃይ ዐለምጸኀይ ዐለምፀሀይ ዐለምፀኃይ ዐለምፀኀይ ዐለምጸኻይ ዐለምፀኻይ ኣለምፀሐይ ኣለምጸሐይ ኣለምጸሃይ ኣለምፀሃይ ኣለምጸሀይ ኣለምጸኃይ ኣለምጸኀይ ኣለምፀሀይ ኣለምፀኃይ ኣለምፀኀይ ኣለምጸሓይ ኣለምፀሓይ ኣለምጸኻይ ኣለምፀኻይ ዓለምጸሐይ ዓለምጸሃይ ዓለምፀሃይ ዓለምጸሀይ ዓለምጸኃይ ዓለምጸኀይ ዓለምፀሀይ ዓለምፀኃይ ዓለምፀኀይ ዓለምጸሓይ ዓለምጸኻይ ዓለምፀኻይ አለምፀሐይ አለምጸሐይ አለምጸሃይ አለምፀሃይ አለምጸሀይ አለምጸኃይ አለምጸኀይ አለምፀሀይ አለምፀኃይ አለምፀኀይ አለምጸሓይ አለምፀሓይ አለምጸኻይ አለምፀኻይ ዐለምጸሐይ ዐለምጸሃይ ዐለምፀሃይ ዐለምጸሀይ ዐለምጸኃይ ዐለምጸኀይ ዐለምፀሀይ ዐለምፀኃይ ዐለምፀኀይ ዐለምጸሓይ ዐለምጸኻይ ዐለምፀኻይ ኣለምፀሐይ ኣለምጸሐይ ኣለምጸሃይ ኣለምፀሃይ ኣለምጸሀይ ኣለምጸኃይ ኣለምጸኀይ ኣለምፀሀይ ኣለምፀኃይ ኣለምፀኀይ ኣለምጸሓይ ኣለምፀሓይ ኣለምጸኻይ ኣለምፀኻይ

The following examples show how the character classes may be applied to ዓለምፀሐይ. The majority of them also appear in the overload.pl script found in the examples directory.

Basic Property Matching

/([=አ=])ለም[=ጸ=][=ሃ=]ይ/
Will match ዓለምፀሐይ as ዓ is in the Amharic equivalence set of አ, ፀ is in the Amharic & Tigrigna equivalence set of ጸ, and ሐ is in the Amharic equivalence set of ሃ. This expression matches only with the Amharic equivalence classes.

/([=ዐ=])ለም[=ጸ=][=ሓ=]ይ/
Will match ዓለምፀሐይ as ዓ is in the Amharic & Tigrigna equivalence set of ዐ, ፀ is in the Amharic & Tigrigna equivalence set of ጸ, and ሐ is in the Amharic & Tigrigna equivalence set of ሓ. This expression matches with both the Amharic and Tigrigna equivalence classes.

/[#ለ#]ም/
Will match ዓለምፀሐይ as ለ is a member of the "ለ family" and precedes ም.

/[#4#]ለ/
Will match ዓለምፀሐይ as ዓ is of the "4th form" and precedes ለ.
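
For comparison, the first expression above can be written out by hand in core Perl using the expansions listed in the Amharic table for [=አ=], [=ጸ=] and [=ሃ=]. This is only a sketch of what the class notation saves you from typing; the list of candidate spellings is chosen for the example.

```perl
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':encoding(UTF-8)' );

# Hand-expanded form of /([=አ=])ለም[=ጸ=][=ሃ=]ይ/ for Amharic,
# copied from the "Perl RE Expansion" column of the Amharic table.
my $re = qr/([አኣዐዓ])ለም[ጸፀ][ሀሃሐሓኀኃኻ]ይ/;

# Two renderings of the name, plus one string that ends in a different syllable.
for my $candidate ( "ዓለምፀሐይ", "አለምጸሃይ", "ዓለምፀሐዩ" ) {
    if ( $candidate =~ $re ) {
        print "$candidate matches, first syllable captured: $1\n";
    }
    else {
        print "$candidate does not match\n";
    }
}
```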

Form Range Matching

/[#4-6#]ለ/
Will match ዓለምፀሐይ as ዓ is of the "4th through 6th form" and precedes ለ.

/[#4,6#]ለ/
Will match ዓለምፀሐይ as ዓ is of the "4th or 6th form" and precedes ለ.

/[#ዘ-ገ#]/
Will match ዓለምፀሐይ as ይ is a member of the "ዘ through ገ family range".

/[#መነበ#]/
Will match ዓለምፀሐይ as ም is a member of the "መ or ነ or በ family".

Form Range Operator Matching

/መ{%4-7}/
Will match ዓለምፀሐይ as ም is the "6th form" of the "መ family" and is within the specified form range of 4-7.

/ፀ{%1,3-5,7}/
Will match ዓለምፀሐይ as ፀ is the "1st form" of the "ፀ family" and is within the specified form range of 1 or 3-5 or 7.

/ጸ{%3}/
Fails to match ዓለምፀሐይ as ጸ does not appear in the 3rd form.

/[ሐጸ]{%4}/
Fails to match ዓለምፀሐይ as neither ሐ nor ጸ appears in the 4th form.

/[ሐጸ]{%3,5}/
Fails to match ዓለምፀሐይ as neither ሐ nor ጸ appears in the 3rd or 5th form.

/[ለ-መ]{%6}/
Will match ዓለምፀሐይ as ም is of the "ለ through መ families" and is of the "6th form".

/[ለ-መ]{%3,5}/
Fails to match ዓለምፀሐይ as no member of the "ለ through መ families" is present in the 3rd or 5th form.

/[ለ-መ]{%3,5-7}/
Will match ዓለምፀሐይ as ም is of the "ለ through መ families" and is of the "6th form", which is within the specified range of 3rd or 5th through 7th.

/[ጸለ-መ]{%4}/
Fails to match ዓለምፀሐይ as no member of the "ጸ or ለ through መ families" is present in the 4th form.

/[ጸለ-መ]{%3,5-7}/
Will match ዓለምፀሐይ as ም is of the "ጸ or ለ through መ families" and is of the "6th form", which is within the specified range of 3rd or 5th through 7th.
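
The meaning of the {%...} operator can be sketched in core Perl as "find a syllable whose family is in one set and whose form is in another". The helper sketch_match_forms below is hypothetical and is not part of the package; it assumes the Unicode layout of eight code points per basic family and only understands forms 1-7.

```perl
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':encoding(UTF-8)' );

# Hypothetical helper: return the first syllable of $text whose family is
# listed in @$families and whose form is listed in @$forms, or undef.
sub sketch_match_forms {
    my ( $text, $families, $forms ) = @_;
    my %wanted_family = map { $_ => 1 } @$families;
    my %wanted_form   = map { $_ => 1 } @$forms;
    for my $ch ( split //, $text ) {
        my $family = chr( ord($ch) - ord($ch) % 8 );    # first form of the row
        my $form   = ord($ch) % 8 + 1;                  # column within the row
        return $ch if $wanted_family{$family} && $wanted_form{$form};
    }
    return undef;
}

# Roughly /መ{%4-7}/ : a member of the መ family in forms 4 through 7.
print sketch_match_forms( "ዓለምፀሐይ", ["መ"], [ 4 .. 7 ] ) // "no match", "\n";

# Roughly /ጸ{%3}/ : the ጸ family does not occur in this spelling at all.
print sketch_match_forms( "ዓለምፀሐይ", ["ጸ"], [3] ) // "no match", "\n";
```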

Negation Matching

/[ሐጸ]{^%4}/
Will match ዓለምፀሐይ as ሐ is a member of the "ሐ or ጸ family" and is not of the 4th form.

/[^ሐጸ]{%4}/
Will match ዓለምፀሐይ as ዓ is not a member of the "ሐ or ጸ family" and is of the 4th form.

/[^ሐጸ]{^%4}/
Is equivalent to the previous expression. Only one negative is interpreted, because a double negative remains negative in Ethio-Semitic grammar.

Ethiopic Class Matching

if ( "ኩ" =~ /[#ከ#]/ )
Is a true expression because ኩ is a member of the ከ family.

 
if ( "ኩ" =~ /[#2#]/ )
Is a true expression because ኩ is of the 2nd form.

 
if ( "ኩ" =~ /[:ካዕብ:]/ )
Is a true expression because ኩ is in the class of ካዕብ.

 
if ( "ኩ" =~ /[:kaib:]/ )
Is a true expression because ኩ is in the class of kaib ("kaib" is a transcription of "ካዕብ").

Amharic Family Equivalence Matching

/[=#ጸ#=]/
Will match ዓለምፀሐይ as ፀ is a member of the "ጸ family equivalence" in Amharic and Tigrigna.

/[=#ሀ#=]/
Will match ዓለምፀሐይ as ሐ is a member of the "ሀ family equivalence" in Amharic. The expression does not match for the other languages.
 
/[=#ጸ#=]{%3-5}/
Fails to match ዓለምፀሐይ as ጸ equivalence family members do not appear in the 3rd through 5th forms.

 

 


Utility Functions

The Regexp::Ethiopic package (as well as its subclasses) exports a number of utility functions that operate on Ethiopic characters and strings. Specify ":utils" as an import option to bring the functions into the local namespace.

getForm( char )

A utility function to query the "form" of an Ethiopic syllable. It returns an integer between 1 and 12 corresponding to the [#\d+#] classes. The following example:

print getForm ( "አ" ), "\n";

will print the number 1.

setForm( char, form )

A utility function to set the form number of a syllable. The form number must be an integer between 1 and 12 corresponding to the [#\d+#] classes. Useful in substitutions as per:

s/(.)/setForm($1, 1)/eg;

where every matching character found will be converted into the first form.
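
The effect of that substitution can be imitated in core Perl with code point arithmetic. The function sketch_set_form below is a hypothetical stand-in, not the package's setForm; it assumes the Unicode layout of eight code points per basic family and accepts forms 1-7 only.

```perl
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':encoding(UTF-8)' );

# Hypothetical stand-in for setForm, limited to the seven basic forms.
sub sketch_set_form {
    my ( $ch, $form ) = @_;
    die "this sketch handles forms 1-7 only\n" unless $form >= 1 && $form <= 7;
    my $row_start = ord($ch) - ord($ch) % 8;    # code point of the 1st form
    return chr( $row_start + $form - 1 );
}

# Convert every syllable of the word to its first form.
( my $word = "ዓለም" ) =~ s/(.)/sketch_set_form($1, 1)/eg;
print "$word\n";
```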

subForm( char, form )

A utility function to set the form number of a syllable based on the form of another syllable. Useful in substitutions as per:

s/(\w+)([#ፀ#])/$1.subForm('ጸ', $2)/eg;

where every character in the class [#ፀ#] will be converted to the 'ጸ' family in the form of the matched character. That is, if 'ፂ' is matched it will be converted to 'ጺ'.
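
The same idea, sketched in core Perl: keep the form (column) of the matched syllable and move it into another family (row). The function sketch_sub_form is hypothetical, the bracketed range [ፀ-ፆ] is the seven basic forms of the ፀ family, and the sample word is chosen for the example.

```perl
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':encoding(UTF-8)' );

# Hypothetical stand-in for subForm: put $family into the form of $model.
# Assumes eight code points per basic family, forms 1-7 in columns 0-6.
sub sketch_sub_form {
    my ( $family, $model ) = @_;
    my $row_start = ord($family) - ord($family) % 8;
    return chr( $row_start + ord($model) % 8 );
}

# Replace each member of the ፀ family with the ጸ syllable of the same form.
( my $text = "ፂም" ) =~ s/([ፀ-ፆ])/sketch_sub_form('ጸ', $1)/eg;
print "$text\n";
```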

formatForms( format, string )

A utility function somewhat analogous to sprintf for a sequence of syllables. The first argument is the format, which gives the desired sequence of forms. The second argument is the string to format. For example, the format string "%1%2%3%4" indicates that the first character of the argument should be in the first form, %1, the second character in the second form, %2, the third character in the third form, %3, and the fourth character in the fourth form, %4. For the argument "አበገደ":

print formatForms ( "%1%2%3%4", "አበገደ" ), "\n";

the output would be "አቡጊዳ".
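
A rough core Perl equivalent of that call is sketched below. The function sketch_format_forms is hypothetical and handles only "%N" directives with N between 1 and 7, under the assumption of eight code points per basic family.

```perl
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':encoding(UTF-8)' );

# Hypothetical stand-in for formatForms: the Nth "%d" directive in $format
# gives the form for the Nth character of $string.
sub sketch_format_forms {
    my ( $format, $string ) = @_;
    my @forms = $format =~ /%(\d+)/g;    # "%1%2%3%4" -> (1, 2, 3, 4)
    my @chars = split //, $string;
    for my $i ( 0 .. $#chars ) {
        last if $i > $#forms;            # leave any extra characters untouched
        my $cp = ord $chars[$i];
        $chars[$i] = chr( $cp - $cp % 8 + $forms[$i] - 1 );
    }
    return join '', @chars;
}

print sketch_format_forms( "%1%2%3%4", "አበገደ" ), "\n";
```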

Alias Strings

Alias strings are also exported by the Regexp::Ethiopic package (as well as its subclasses) and are assigned the values 1-12 corresponding to their form. Specify ":forms" as an import option to bring the strings into the local namespace. The names and their assignments are given by:

($ግዕዝ, $ካዕብ, $ሣልስ, $ራብዕ, $ኃምስ, $ሳድስ, $ሳብዕ, $ዘመደ_ግዕዝ, $ዘመደ_ካዕብ, $ዘመደ_ሣልስ, $ዘመደ_ራብዕ, $ዘመደ_ኃምስ) = (1 .. 12);

Example use:

if ( getForm ( $x ) == $ካዕብ ) {
:
:
}

 


Notes on Functional Use

The overloading of Perl's regular expressions mechanism is the preferred usage for the Regexp::Ethiopic package. However, the overloading mechanism only applies to the constant part of the RE. The following would not be handled by the Regexp::Ethiopic package as expected:

use Regexp::Ethiopic 'overload';

my $x = "ከ";
    :
    :
if ( /[#$x#]/ ) {  # $x is a variable, /[#ከ#]/ is constant
         :
         :
}

The above expression is not identical to /[#ከ#]/ because 'ከ' is constant whereas $x is a variable. The package never gets to see the value of $x, so it cannot perform the RE expansion. The workaround is to use the package as per:

use Regexp::Ethiopic 'overload';

my $x = "ከ";
    :
    :
my $re = Regexp::Ethiopic::getRe ( "[#$x#]" );

if ( /$re/ ) {
       :
       :
}

This works as expected at the cost of one extra step. The overloading and functional modes of the Regexp::Ethiopic package may be used together without conflict.
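
To illustrate why the functional route works, the sketch below performs a comparable expansion at run time in core Perl: the pattern is an ordinary string, so the variable has already been interpolated by the time the expansion sees it. The function sketch_get_re is hypothetical and far simpler than the package's getRe; it only rewrites single-letter [#X#] family classes into a range over the seven basic forms.

```perl
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':encoding(UTF-8)' );

# Hypothetical run-time expansion of [#X#] into the range of forms 1-7 of X.
# Assumes eight code points per basic family in Unicode.
sub sketch_get_re {
    my ($pattern) = @_;
    $pattern =~ s{\[\#(\p{Ethiopic})\#\]}{
        my $row_start = ord($1) - ord($1) % 8;
        '[' . chr($row_start) . '-' . chr( $row_start + 6 ) . ']'
    }ge;
    return qr/$pattern/;
}

my $x  = "ከ";
my $re = sketch_get_re("[#$x#]");    # $x is interpolated before expansion

print "ኩ is in the family of $x\n" if "ኩ" =~ $re;
```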

 


Notes on Notation

The initial philosophy applied to syllabic character class development was to stick with existing POSIX definitions and notation ([=x=], [:x:], etc.) and simply apply them in the context of a syllabary. Shoe-horning syllabic classes into POSIX norms has proven at times to be both awkward and confusing. After a lengthy experimentation period a clean break was made from the POSIX class symbols, and class symbols are now used that appear to be intuitive and easy to type.

In large part, the complication in working with Ethiopic character classes has been the mismatch between the greater number of Ethiopic classes and the available (if only somewhat applicable) POSIX abstractions. There are four types of character equivalence that are of interest in Ethiopic regular expressions:

The syllable x is:

The choice of # has been made at this time for no other reason than that the symbol itself looks like the grid in which the syllables are invariably presented. The character between the #s is interpreted according to its context as either a letter or a numeral. This notation has been stable for some time and has proven to be a good mnemonic device.


References