Unicode
From PerlNet
What is Unicode
Unicode is a very large character set, with well over a hundred thousand characters covering the world's writing systems.
Unicode brings the wide range of characters from many languages (like English and Japanese) and fields (like mathematics or physics) into a single character set. A document using the Unicode character set can mix English, French, Arabic and Japanese text with mathematical symbols and other characters.
One of the key concepts in Unicode is the "code point". Basically, each character is assigned a single number, its code point. A glyph is what a character looks like (more formally, its typographic representation), and its exact shape depends on the font. So in Unicode, the code point 65 (written U+0041 in hex) is the character 'A', whose usual glyph is two sloping lines that meet at the top, with a horizontal bar joining them around halfway down.
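The mapping between characters and code points can be tried out with Perl's built-in ord() and chr() functions. This is only a small sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# ord() maps a character to its code point; chr() goes the other way.
my $code_point = ord('A');
my $character  = chr(65);

# Code points are conventionally written in hex with a "U+" prefix.
my $label = sprintf('U+%04X', $code_point);

print "$character is $label (decimal $code_point)\n";
```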
The first 128 code points in Unicode match the characters of the ASCII encoding standard, which most developers will be familiar with. These characters are referred to collectively as the Unicode Basic Latin code chart.
The next 128 code points in Unicode match the upper half of ISO-8859-1 (Latin-1, one of several encodings often called "Extended ASCII"), and are referred to collectively as the Unicode Latin-1 Supplement code chart.
To see the characters and their code points for the first 256 code points, see Basic Latin and Latin-1 Supplement. Note the URLs for these links: each contains the first code point (in hex) of its code chart, which suggests that the charts come in chunks of 0x80. That is not true in general. The next chart (0100, Latin Extended-A) does follow this pattern with 128 code points, but after that all bets are off: the size depends on the number of symbols required for the chart's topic area. For example, the Thai chart has fewer than a hundred characters, while the CJK Unified Ideographs chart is huge (an 83-page PDF plus numerous extensions).
So, to summarise, a Unicode document is made up of the Unicode code points needed to represent all the characters in the document.
Unicode and Perl
Perl supports Unicode strings, and provides functions to convert between encoded bytes and its internal character representation.
The important concept in dealing with Unicode in programs and data is the 'encoding'. Unicode code points are just numbers. If the code point is 65, that is the code point for the typographic symbol 'A'.
And as has been stated before, there are many thousands of code points. But, and it's a big but, most computers deal with bytes, each of which can only represent 256 unique values. How can we map something that can take so many different values onto something that can hold only 256? We use a Unicode encoding.
The most popular encoding is UTF-8. UTF-8 is basically a way to say "the next X bytes represent a code point", where X ranges from 1 to 4. (The original design allowed up to 6 bytes, but the current standard limits UTF-8 to 4.)
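As a rough illustration of how the byte count depends on the code point, the sketch below uses the ranges from the current definition of UTF-8 (RFC 3629), which stops at 4 bytes. The function name utf8_length is hypothetical, not part of any module.

```perl
use strict;
use warnings;

# Number of bytes UTF-8 needs for a code point, chosen by range.
sub utf8_length {
    my ($cp) = @_;
    return 1 if $cp <= 0x7F;        # ASCII
    return 2 if $cp <= 0x7FF;
    return 3 if $cp <= 0xFFFF;
    return 4 if $cp <= 0x10FFFF;    # highest Unicode code point
    die sprintf("U+%X is outside the Unicode range\n", $cp);
}

printf("U+%04X -> %d byte(s)\n", $_, utf8_length($_))
    for 0x41, 0xE9, 0x0E52, 0x1F600;
```

Note that only the first 128 code points fit in a single byte, which is why plain ASCII text is already valid UTF-8.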
There are other Unicode encodings (UTF-16, UTF-32) and many older non-Unicode encodings (EBCDIC, CP1251, ISO-8859-1), but if you are dealing with Unicode, UTF-8 is the one you probably want.
So when we read a 'Unicode file', what is most probably meant is 'this file contains Unicode code points encoded with the UTF-8 encoding'.
When we read a UTF-8 encoded file, we want to convert from the multibyte UTF-8 encoding of a code point to the code point itself. For example, code point 0x0E52 encodes to 3 bytes in UTF-8 (0xE0 0xB9 0x92). In reading this code point from the UTF-8 encoded file, we want to read all 3 bytes and decode them to the code point 0x0E52. Similarly, in writing a UTF-8 encoded file, if our data contains the code point 0x0E52, we need to write it out as the 3 bytes that result from the UTF-8 encoding of that code point. That's it: that's all there is to working with UTF-8 encoded files.
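The byte layout can be worked out by hand, which shows where the bytes for 0x0E52 come from. The following is a minimal sketch of the UTF-8 bit packing, not production code: the helper name utf8_bytes is made up, and it does not reject surrogates or out-of-range values.

```perl
use strict;
use warnings;

# Return the UTF-8 bytes (as a list of integers) for one code point.
# Continuation bytes carry 6 bits each and start with the bits 10.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp <= 0x7F;
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp <= 0x7FF;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp <= 0xFFFF;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

my $hex = join(' ', map { sprintf('%02X', $_) } utf8_bytes(0x0E52));
print "U+0E52 -> $hex\n";
```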
So how is that to/from process done in Perl? If you know the file you're reading from is UTF-8, you can do this:
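use Encode qw(encode decode);
# Read one line of raw bytes; a bare <> inside the decode() call would be
# evaluated in list context and pass extra lines as further arguments.
my $bytes = <>;
my $characterString = decode('UTF-8', $bytes);           # bytes -> characters
my $utf8Encoding = encode('UTF-8', $characterString);    # characters -> bytes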
Once you've decode()'d, you can deal with the character string in Perl as per normal, using substr(), uc() etc. length($characterString) tells you how many characters are in $characterString, not how many bytes it encodes to. The number of characters is most probably what you want when dealing with the data in 'character context'. Also, once in a character string, those code points can be dealt with correctly in regexes, like the Thai \d example below.
Once you've encode()'d, length($utf8Encoding) tells you how many bytes it encoded to. In 'byte context' the number of bytes is usually what matters, not the number of characters those bytes originally represented.
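The difference between the two contexts can be seen with a short sketch. The sample data is an assumption for illustration: the UTF-8 bytes of three Thai digits (U+0E51, U+0E52, U+0E53), written as byte escapes so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Nine raw bytes: three Thai digits, three bytes each in UTF-8.
my $bytes = "\xE0\xB9\x91\xE0\xB9\x92\xE0\xB9\x93";
my $chars = decode('UTF-8', $bytes);

my $byte_count = length($bytes);    # counts bytes
my $char_count = length($chars);    # counts characters

# substr() on the decoded string works in whole characters.
my $second = substr($chars, 1, 1);

printf("%d bytes, %d characters, second is U+%04X\n",
       $byte_count, $char_count, ord($second));
```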
Recently The Perl Foundation (TPF) gave a grant to improve Unicode support in Perl 5, and we can expect full support for Unicode in Perl 6 as well.
Use locale
The use locale pragma allows locale-specific features to be enabled in Perl. Regular expressions will use locale-specific sets for matching words (\w) and digits (\d). Sorting and string comparison operators such as lt or gt will use locale-specific rules, and numerical formatting code such as printf will use locale-specific formats for printing numbers.
For example, the concept of "a digit character" expressed by "\d" is given full weight in Unicode, so Unicode can "tell" the regex engine that the single character at code point 0E3F (฿) is not matched by \d, because it's actually the Thai Baht currency symbol (see the Thai Unicode chart). The code point 0E52 (๒) would match "\d", as it is the Thai Digit 2, according to the Thai Unicode chart.
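The Thai example can be checked with a few lines of Perl. This sketch builds the characters with chr(), so the strings are character strings and Unicode matching rules apply without use locale. The /a modifier used in the last test assumes Perl 5.14 or later; the variable names are illustrative.

```perl
use strict;
use warnings;

my $baht     = chr(0x0E3F);    # THAI CURRENCY SYMBOL BAHT
my $thai_two = chr(0x0E52);    # THAI DIGIT TWO

my $baht_is_digit = ($baht     =~ /\A\d\z/) ? 1 : 0;
my $two_is_digit  = ($thai_two =~ /\A\d\z/) ? 1 : 0;

# With /a, \d is restricted to the ASCII digits 0-9.
my $two_is_ascii_digit = ($thai_two =~ /\A\d\z/a) ? 1 : 0;

print "baht: $baht_is_digit, two: $two_is_digit, ",
      "two under /a: $two_is_ascii_digit\n";
```

If only 0-9 should be accepted, for example when validating numeric input, the /a form or an explicit [0-9] class is the safer choice.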
The \p{property} operator is a powerful matching tool for use in regular expressions. For example, to match a "lowercase character" in any script, use \p{Lowercase_Letter} (or \p{LowercaseLetter} or even \p{Lowercase Letter}; they're all the same, so pick a notation and stick with it). There are subtleties with the '.' (any character) operator within Unicode strings, especially when using combining characters (like accents added to letters), so read your documentation carefully, and test strenuously.
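A small sketch of \p{Lowercase_Letter} in use follows. The test string is an assumption made up for illustration, built from code points so that it does not depend on the source file's encoding: Latin a, capital and small e with acute, Greek capital and small omega, and the digit 7.

```perl
use strict;
use warnings;

my $text = join('', map { chr($_) } 0x61, 0xC9, 0xE9, 0x3A9, 0x3C9, 0x37);

# Collect every lowercase letter, whatever script it belongs to.
my @lower = $text =~ /(\p{Lowercase_Letter})/g;

# Report the matches as code points to avoid printing wide characters.
my @lower_cps = map { sprintf('U+%04X', ord($_)) } @lower;
print "lowercase letters: @lower_cps\n";
```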
See also
- The perluniintro manual page.
- The perlunicode manual page.
- The perllocale manual page.
- The perlunitut page at Perl Monks
- The Unicode Inc. home page

