NAME
Unicode::UTF8 - Encoding and decoding of UTF-8 encoding form
SYNOPSIS
use Unicode::UTF8 qw[decode_utf8 encode_utf8];
$string = decode_utf8($octets);
$octets = encode_utf8($string);
DESCRIPTION
This module provides functions to encode and decode UTF-8 encoding form
as specified by Unicode and ISO/IEC 10646:2011.
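As a minimal sketch of the round trip described above, the following decodes the three-octet UTF-8 sequence for EURO SIGN (U+20AC) into a one-character string and encodes it back. The sample data and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8 encode_utf8];

# Three octets: the UTF-8 encoding of U+20AC EURO SIGN.
my $octets = "\xE2\x82\xAC";

# Octets in, characters out.
my $string = decode_utf8($octets);
printf "decoded: %d character(s), U+%04X\n", length($string), ord($string);

# Characters in, octets out.
my $roundtrip = encode_utf8($string);
printf "encoded: %d octet(s), round trip %s\n",
    length($roundtrip), $roundtrip eq $octets ? 'ok' : 'failed';
```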
FUNCTIONS
decode_utf8
$string = decode_utf8($octets);
$string = decode_utf8($octets, $fallback);
Returns a decoded representation of $octets in UTF-8 encoding as a
character string.
Issues a warning using warnings category "utf8" if $octets contains
ill-formed UTF-8 sequences or encoded code points which can't be
interchanged.
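The warning can be observed with a local $SIG{__WARN__} handler. The sketch below is an illustration only: it assumes warnings are enabled in the calling scope and feeds decode_utf8 a Latin-1 octet string, which is not well-formed UTF-8. The sample input and variable names are assumptions.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

my @warnings;
my $string = do {
    # Collect warnings instead of letting them go to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    # "caf" followed by a Latin-1 e-acute: 0xE9 starts a three-octet
    # UTF-8 sequence that is never completed, so it is ill-formed.
    decode_utf8("caf\xE9");
};

printf "%d warning(s) captured\n", scalar @warnings;
print  "first warning: $warnings[0]" if @warnings;
printf "last character: U+%04X\n", ord(substr($string, -1));
```

With no $fallback given, the ill-formed octet is substituted in the returned string rather than dropped, so the result keeps its length in characters.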
$fallback is an optional "CODE" reference which allows customization of
error handling. The
default error-handling mechanism is to replace any ill-formed UTF-8
sequences or encoded code points which can't be interchanged with
REPLACEMENT CHARACTER (U+FFFD).
$string = $fallback->($octets, $is_usv);
$fallback is invoked with two arguments, $octets and $is_usv. $octets is
a sequence of one or more octets containing the maximal subpart of the
ill-formed subsequence or encoded code point which can't be
interchanged. $is_usv is a boolean indicating whether or not $octets
represents an encoded Unicode scalar value. $fallback must return a
character string consisting of zero or more Unicode scalar values.
Unicode scalar values consist of code points in the range U+0000..U+D7FF
and U+E000..U+10FFFF.
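A minimal sketch of a decode fallback follows. The callback name $hex_escape and its output format are hypothetical; the only requirement taken from the text above is that it receives ($octets, $is_usv) and returns a character string of zero or more Unicode scalar values.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# Hypothetical fallback: render the offending octets as visible hex
# instead of substituting U+FFFD. The return value is plain ASCII,
# which consists solely of Unicode scalar values.
my $hex_escape = sub {
    my ($octets, $is_usv) = @_;
    return sprintf('<%s>', uc(unpack('H*', $octets)));
};

# 0xFF can never appear in well-formed UTF-8, so it is handed to the
# fallback on its own as a one-octet maximal subpart.
my $string = decode_utf8("a\xFFb", $hex_escape);
print "$string\n";
```

Returning an empty string from the fallback would instead drop the ill-formed octets from the result.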
encode_utf8
$octets = encode_utf8($string);
$octets = encode_utf8($string, $fallback);
Returns an encoded representation of $string in UTF-8 encoding as an
octet string.
Issues a warning using warnings category "utf8" if $string contains code
points which can't be interchanged or represented in UTF-8 encoding
form.
$fallback is an optional "CODE" reference which allows customization of
error handling. The
default error-handling mechanism is to replace any code points which
can't be interchanged or represented in UTF-8 encoding form with
REPLACEMENT CHARACTER (U+FFFD).
$string = $fallback->($codepoint, $is_usv);
$fallback is invoked with two arguments, $codepoint and $is_usv.
$codepoint is an unsigned integer containing the code point which can't
be interchanged or represented in UTF-8 encoding form. $is_usv is a
boolean indicating whether or not $codepoint is a Unicode scalar value.
$fallback must return a character string consisting of zero or more
Unicode scalar values. Unicode scalar values consist of code points in
the range U+0000..U+D7FF and U+E000..U+10FFFF.
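The encode-side fallback can be sketched the same way. The callback name $describe, the @calls log and the "<U+XXXX>" output format are assumptions for illustration; the input contains a surrogate code point, which cannot be represented in UTF-8 encoding form.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[encode_utf8];

my @calls;

# Hypothetical fallback: substitute a readable ASCII tag for the code
# point and record what the callback was given.
my $describe = sub {
    my ($codepoint, $is_usv) = @_;
    push @calls, [ $codepoint, $is_usv ? 1 : 0 ];
    return sprintf('<U+%04X>', $codepoint);
};

# U+D800 is a surrogate, so it is not a Unicode scalar value.
my $string = 'a' . chr(0xD800) . 'b';
my $octets = encode_utf8($string, $describe);
print "$octets\n";
```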
EXPORTS
None by default. All functions can be exported using the ":all" tag or
individually.
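For illustration, a short sketch of the import styles: the ":all" tag brings in every exportable function, while a fully qualified call needs no import at all. The sample string is an assumption.

```perl
use strict;
use warnings;

# Import every exportable function with the ":all" tag.
use Unicode::UTF8 qw[:all];

# Imported name.
my $string = decode_utf8("\xC3\xA9");   # U+00E9, e-acute

# A fully qualified call works regardless of what was imported.
my $octets = Unicode::UTF8::encode_utf8($string);

printf "U+%04X encodes to %d octets\n", ord($string), length($octets);
```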
DIAGNOSTICS
Can't decode a wide character string
(F) Wide character in octets.
Can't decode ill-formed UTF-8 octet sequence <%s> in position %u
(W utf8) Encountered an ill-formed UTF-8 octet sequence. <%s>
contains a hexadecimal representation of the maximal subpart of the
ill-formed subsequence.
Can't interchange noncharacter code point U+%.4X at position %u
(W utf8, nonchar) Noncharacters are permanently reserved for internal
use and should never be interchanged. Noncharacters consist of the
values U+nFFFE and U+nFFFF (where n is from 0x0 to 0x10) and the
values U+FDD0..U+FDEF.
Can't represent surrogate code point U+%.4X at position %u in UTF-8
encoding form
(W utf8, surrogate) Surrogate code points are designated only for
surrogate code units in the UTF-16 character encoding form.
Surrogates consist of code points in the range U+D800 to U+DFFF.
Can't represent super code point \x{%X} at position %u in UTF-8 encoding
form
(W utf8, non_unicode) Encountered a code point greater than U+10FFFF,
which lies outside Unicode in Perl's extended codespace.
Can't decode ill-formed UTF-X octet sequence <%s> in position %u
(F) Encountered an ill-formed octet sequence in Perl's internal
representation of wide characters.
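The code point ranges named in the diagnostics above can be summarised in a small classifier. This is a hypothetical helper written for illustration, not part of Unicode::UTF8; its return values simply reuse the warning sub-category names.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by Unicode::UTF8): classify a code
# point using the ranges listed in the diagnostics.
sub classify_code_point {
    my ($cp) = @_;

    # Beyond U+10FFFF: Perl's extended codespace.
    return 'non_unicode' if $cp > 0x10FFFF;

    # U+D800..U+DFFF: reserved for UTF-16 surrogate code units.
    return 'surrogate' if $cp >= 0xD800 && $cp <= 0xDFFF;

    # U+FDD0..U+FDEF, plus the last two code points of every plane
    # (U+nFFFE and U+nFFFF).
    return 'nonchar' if ($cp >= 0xFDD0 && $cp <= 0xFDEF)
                     || ($cp & 0xFFFE) == 0xFFFE;

    return 'ok';
}

printf "U+%04X => %s\n", $_, classify_code_point($_)
    for 0x41, 0xD800, 0xFDD0, 0xFFFE, 0x10FFFF, 0x110000;
```

Everything the helper labels 'ok' or 'nonchar' is a Unicode scalar value; 'surrogate' and 'non_unicode' are not.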
Please note that the sub-categories of the utf8 warning, "nonchar",
"surrogate" and "non_unicode", are only available on Perl 5.14 or greater.
See perllexwarn for available categories and hierarchies.
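The difference between (F) and (W) diagnostics can be sketched as follows. This is an illustration under one assumption: that the module honours the caller's lexical warning settings, as code using Perl's standard warning categories normally does. The parent "utf8" category is used so the sketch does not depend on the Perl 5.14 sub-categories.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# (F) diagnostics are exceptions: trap them with eval.
my $decoded = eval { decode_utf8("\x{263A}") };   # wide character, not octets
my $fatal   = $@;

# (W utf8) diagnostics are warnings: the call still returns a string.
my $loud     = decode_utf8("\xFF");
my $reported = @warnings;

# The same call with the "utf8" category disabled in the enclosing block.
my $quiet = do {
    no warnings 'utf8';
    decode_utf8("\xFF");
};

printf "fatal trapped: %s\n", defined($decoded) ? 'no' : 'yes';
printf "warnings reported: %d, added by quiet block: %d\n",
    $reported, @warnings - $reported;
```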
COMPARISON
Here is a summary of features for comparison with Encode's UTF-8
implementation:
* simple API which makes use of Perl's standard warning categories
* recognizes all noncharacters regardless of perl version
* implements Unicode's recommended practice for using U+FFFD
* good diagnostics in warning messages
* detects and reports inconsistency in perl's internal encoding
(UTF-X)
* preserves taintedness of decoded $octets or encoded $string
* better performance ~ 600% - 1200% (JA: 600%, AR: 700%, SV: 900%, EN:
1200%, see benchmarks directory in git repository)
CONFORMANCE
It's the author's belief that this UTF-8 implementation is conformant
with the Unicode Standard Version 6.0. Any deviation from the Unicode
Standard is to be considered a bug.
SEE ALSO
Encode
SUPPORT
BUGS
Please report any bugs by email to "bug-unicode-utf8 at rt.cpan.org", or
through the rt.cpan.org web interface. You will be automatically
notified of any progress on the request by the system.
SOURCE CODE
This is open source software. The code repository is available for
public review and contribution under the terms of the license.
git clone http://github.com/chansen/p5-unicode-utf8
AUTHOR
Christian Hansen "chansen@cpan.org"
COPYRIGHT
Copyright 2011 by Christian Hansen.
This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.