NAME
    cyrillic - Library for fast and easy Cyrillic text manipulation

SYNOPSIS
    use cyrillic qw/866 win2dos convert locase upcase detect/;

    print convert( 866, 1251, $str );
    print convert( 'dos', 'win', \$str );
    print win2dos $str;

DESCRIPTION
    This module provides functions that convert Cyrillic strings from one
    charset to another, and to upper or lower case, without switching the
    locale. It also includes a detection routine for single-byte charsets.

    New code pages are easy to add: all that is needed is the appropriate
    string for the code page.

    Supported charsets: ibm866, koi8-r, cp855, windows-1251, MacWindows,
    iso_8859-5, unicode, utf8.

    If the first imported parameter is a code page number, the locale is
    switched to that code page.

FUNCTIONS
    * cset_factory - generator of charset conversion functions
    * case_factory - generator of case conversion functions
    * convert - converts between charsets
    * upcase - converts to upper case
    * locase - converts to lower case
    * upfirst - converts the first char to upper case
    * lofirst - converts the first char to lower case
    * detect - detects the codepage number
    * charset - returns the charset name for a codepage number

    Named converters may also be listed in the import list. For example:

        use cyrillic qw/dos2win win2koi mac2dos ibm2dos/;

    NOTE! Specializations (such as win2dos or utf2win) are faster to call
    than convert.

    NOTE! Only the convert function and its specializations work with
    Unicode and UTF-8 strings. All other functions work only with
    single-byte charsets.

    Names for use in named charset converters:

        dos  ibm866        866
        koi  koi8-r        20866
        ibm  cp855         855
        win  windows-1251  1251
        mac  ms-cyrillic   10007
        iso  iso-8859-5    28585
        uni  Unicode
        utf  UTF-8

    The following rules apply to the converting functions:

        VAR may be a SCALAR or a REF to a SCALAR.
        If VAR is a REF to a SCALAR, the referenced SCALAR is converted.
        If VAR is omitted, $_ is operated on.
        If the function is called in void context and VAR is not a REF,
        the result is placed in $_.
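    The calling rules above can be illustrated with a small standalone
    sketch. It is not the module's implementation: make_converter is a
    hypothetical helper, the conversion itself is delegated to the core
    Encode module (so charsets are given by Encode names such as 'cp866'
    rather than by codepage number), and an omitted VAR is treated as a
    REF to $_, which is one reading of the rules.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, not part of the cyrillic module.
# $src and $dst are Encode charset names such as 'cp866' or 'cp1251'.
sub make_converter {
    my ($src, $dst) = @_;
    return sub {
        my $void = !defined wantarray;   # true when called in void context
        my $var  = @_ ? $_[0] : \$_;     # omitted VAR: operate on $_ itself
        if (ref $var eq 'SCALAR') {      # REF to SCALAR: convert in place
            my $in = $$var;
            $$var = encode($dst, decode($src, $in));
            return $$var;
        }
        my $out = encode($dst, decode($src, $var));
        $_ = $out if $void;              # plain SCALAR, void context: result goes to $_
        return $out;
    };
}

my $dos2win = make_converter('cp866', 'cp1251');

my $str  = "\x8F\xE0\xA8\xA2\xA5\xE2";   # six Cyrillic letters as cp866 bytes
my $copy = $dos2win->($str);             # returns a converted copy, $str untouched
$dos2win->(\$str);                       # converts $str itself
```

    Building the closure once and reusing it mirrors the factory approach
    described for cset_factory: the charset lookup is done a single time
    instead of on every call.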

CONVERSION METHODS
    cset_factory SRC_CP, DST_CP
        Generates a function that converts from codepage SRC_CP to
        codepage DST_CP and returns a reference to it. Converting Unicode
        or UTF-8 data requires Unicode::String and Unicode::Map to be
        installed.

    case_factory CODEPAGE, [TO_UP], [ONLY_FIRST_LETTER]
        Generates a case conversion function for the single-byte CODEPAGE
        and returns a reference to it.

    convert SRC_CP, DST_CP, [VAR]
        Converts VAR from codepage SRC_CP to codepage DST_CP and returns
        the converted string. Internally calls cset_factory.

    upcase CODEPAGE, [VAR]
        Converts VAR to uppercase using the CODEPAGE table and returns the
        converted string. Internally calls case_factory.

    locase CODEPAGE, [VAR]
        Converts VAR to lowercase using the CODEPAGE table and returns the
        converted string. Internally calls case_factory.

    upfirst CODEPAGE, [VAR]
        Converts the first char of VAR to uppercase using the CODEPAGE
        table and returns the converted string. Internally calls
        case_factory.

    lofirst CODEPAGE, [VAR]
        Converts the first char of VAR to lowercase using the CODEPAGE
        table and returns the converted string. Internally calls
        case_factory.

MAINTENANCE METHODS
    charset CODEPAGE
        Returns the charset name for CODEPAGE.

    detect ARRAY
        Detects the single-byte codepage of the data in ARRAY and returns
        the codepage number. If the first element of ARRAY is a REF to an
        array of codepage numbers, detection is restricted to those
        codepages; otherwise all single-byte codepages are considered.
        If no codepage is detected, the undefined value is returned.

EXAMPLES
    use cyrillic qw/convert locase upcase detect dos2win win2dos/;

    $_ = "\x8F\xE0\xA8\xA2\xA5\xE2 \xF0\xA6\x88\xAA\x88!";
    printf "    dos: '%s'\n", $_;
    upcase 866;
    printf " upcase: '%s'\n", $_;
    dos2win;
    printf "dos2win: '%s'\n", $_;
    win2dos;
    printf "win2dos: '%s'\n", $_;
    locase 866;
    printf " locase: '%s'\n", $_;
    printf " detect: '%s'\n", detect $_;
    # detect between the 866 and 20866 codepages only
    printf " detect: '%s'\n", detect [866, 20866], $_;

    # CONVERTING TEST:
    use cyrillic qw/utf2dos mac2utf dos2mac win2dos utf2win/;

    $_ = "Хелло Ворльд!\n";
    print "UTF-8: $_";
    print "  DOS: ", utf2dos mac2utf dos2mac win2dos utf2win $_;

    # EQUIVALENT CALLS:
    dos2win( $str );        # called in void context -> result placed in $_
    $_ = dos2win( $str );

    dos2win( \$str );       # called with a REF to the string -> converted in place
    $str = dos2win( $str );

    dos2win();              # called with the parameter omitted -> $_ converted
    dos2win( \$_ );
    $_ = dos2win( $_ );

    my $convert = cset_factory 866, 1251;
    &$convert( $str );              # faster: calls the converter through its reference
    convert( 866, 1251, $str );     # slower: goes through convert

    # FOR EASY SWITCHING OF THE LOCALE CODEPAGE
    use cyrillic qw/866/;   # locale switched to Russian_Russia.866

    use locale;
    print $str =~ /(\w+)/;

    no locale;
    print $str =~ /(\w+)/;

FAQ
    * Q: Why does the module say: Can't create Unicode::Map for 'koi8-r'
         charset!
      A: Your Unicode::Map module cannot find the map file for the
         'koi8-r' charset. Copy the file koi8-r.map to
         site/lib/Unicode/Map and add the following three lines to the
         file site/lib/Unicode/Map/registry:

             name:  KOI8-R
             map:   $UnicodeMappings/koi8-r.map
             alias: csKOI8R

    * Q: Why does perl say: "Undefined subroutine koi2win called"?
      A: The function koi2win is a specialization of the function convert
         and is created only when its name is included in the import
         list.

AUTHOR
    Albert MICHEEV

COPYRIGHT
    Copyright (C) 2000, Albert MICHEEV

    This module is free software; you can redistribute it or modify it
    under the same terms as Perl itself.

AVAILABILITY
    The latest version of this library is likely to be available from:

        http://www.perl.com/CPAN

SEE ALSO
    Unicode::String, Unicode::Map.