
Unicode

From PerlNet


What is Unicode

Unicode is a very large character set, with well over a hundred thousand characters in it, covering many, many languages.

Unicode brings the wide range of characters from many languages (like English and Japanese) and fields (like mathematics or physics) into a single character set. A document using characters from the Unicode character set can include text from English, French, Arabic, Japanese, mathematical symbols, and other characters.

Unicode uses the concept of a grapheme when identifying 'underlying characters'. For example, the underlying character '0' has at least two glyphs (typographic representations): one a closed loop, and the other a closed loop with a forward slash through it. Readers and writers of that writing system understand the two glyphs to be 'the same character', so the two glyphs represent the same grapheme. Graphemes are sometimes referred to as 'platonic characters'.

Unicode gives each grapheme a unique "code point".

The first 128 code points in Unicode match the graphemes of the ASCII encoding standard, with which most developers will be familiar. These graphemes are referred to collectively as belonging to the Unicode Basic Latin code chart.

The next 128 code points in Unicode match the upper half of ISO-8859-1 (Latin-1), one of the encodings often loosely called 'Extended ASCII', and are referred to collectively as belonging to the Unicode Latin-1 Supplement code chart.

To see the graphemes and sample glyphs for the first 256 Unicode code points, see Basic Latin and Latin-1 Supplement.

So, to summarise and simplify, a Unicode document is made up of all the Unicode code points necessary to represent all the characters in the document.

Unicode and Perl

Perl supports Unicode strings, and provides methods to encode or decode Unicode to or from a byte representation.

The important concept in dealing with Unicode in programs and data is the 'encoding'. Unicode code points are just numbers. The code point U+0041 is the code point for the typographic symbol 'A'.
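Because code points are just numbers, Perl's built-in ord() and chr() can convert between a character and its number. The following is a minimal sketch; the variable names are only illustrative.

```perl
use strict;
use warnings;

# ord() maps a character to its code point; chr() goes the other way.
my $code_point = ord('A');                          # 65, i.e. 0x41
my $label      = sprintf('U+%04X', $code_point);    # conventional U+XXXX notation
my $char       = chr(0x65);                         # the character at code point 0x65

print "$label\n";   # U+0041
print "$char\n";    # e
```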

As has been stated before, there are thousands and thousands of code points. Most computers however deal with bytes. A single byte can only represent 256 unique values. How can we map something that can contain thousands of different values into something that can contain only 256 values?

We use an encoding.

The most popular Unicode encoding is UTF-8. UTF-8 is basically a way to say "the next X bytes represent a Unicode code point", where X ranges from 1 to 4.

When we hear someone say 'just read in the Unicode file', what is most probably meant is 'read the file that contains Unicode code points encoded with the UTF-8 encoding'. But it is possible that the Unicode code points are encoded using another encoding, e.g. UTF-16 or UTF-32. You should gather the encodings used by the various inputs and outputs as part of your technical design.

When we read a UTF-8 encoded file, we want to convert from the multibyte UTF-8 encoding of a code point to the code point itself. Take code point U+0E52, which encodes to 3 bytes in UTF-8 (0xE0 0xB9 0x92). In reading this code point from the UTF-8 encoded file, we want to read all 3 bytes and decode them to the code point U+0E52. Similarly, in writing a UTF-8 encoded file, if our data contains the code point U+0E52, we need to write it out as the 3 bytes that result from its UTF-8 encoding. That's it: that's all there is to working with UTF-8 encoded files.
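A quick way to check what a given code point really encodes to is to run it through the core Encode module and look at the resulting bytes. This is a small sketch for U+0E52; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $character = "\x{0E52}";                     # one code point: THAI DIGIT TWO
my $bytes     = encode('UTF-8', $character);    # the UTF-8 byte string
my $hex       = uc unpack('H*', $bytes);        # hex dump of those bytes
my $roundtrip = decode('UTF-8', $bytes);        # back to a character string

print "$hex\n";                 # E0B992
print length($bytes), "\n";     # 3
print $roundtrip eq $character ? "round trip ok\n" : "round trip failed\n";
```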

So how is that round trip done in Perl? If you know that the file you're reading from has encoded its Unicode code points in UTF-8, you can use the Encode module:

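use Encode qw(encode decode);

# slurp the input as raw bytes; decode() expects a single string, not the list that <> returns in list context
my $bytes = do { local $/; <> };

my $characterString = decode('UTF-8', $bytes); # decode the UTF-8 data, placing the Unicode code points in the var

my $utf8Encoding = encode('UTF-8', $characterString); # encode the Unicode code points in the var into UTF-8 data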
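As an alternative to calling decode() and encode() by hand, Perl can do the conversion at the file handle by means of an :encoding(UTF-8) I/O layer. The sketch below writes and re-reads a temporary file; the temporary file and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: characters go in, UTF-8 bytes land on disk.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "\x{0E52}\n";
close($out) or die "Cannot close $path: $!";

# Reading: UTF-8 bytes come off disk, characters come out.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $line = <$in>;
close($in);
chomp $line;

printf "read %d character, code point U+%04X\n", length($line), ord($line);
```

With the layer in place, everything read from the handle is already a character string, so no separate decode() step is needed.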

Another way to think of this is to invoke the Perl concept of 'context'.

$characterString is the data in the context of 'characters'. length($characterString) tells you how many characters are in $characterString, not how many bytes it encodes to. The number of characters is most probably what you want when dealing with the data in 'character context'. Also, once the data is in a character string, its code points are handled correctly in regexes, so \d will match a Thai digit, for example. Once you've decode()'d, you can work with the character string in Perl as normal, using substr(), uc() etc.

$utf8Encoding is the data in the context of 'bytes'. length($utf8Encoding) tells you how many bytes it encoded to. When dealing with 'byte context' the number of bytes is most interesting, not the number of characters those bytes originally represented.
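The difference between the two contexts shows up as soon as length() is called on each form of the same data. This is a minimal sketch using three Thai digits as sample data (the sample string is an assumption for illustration).

```perl
use strict;
use warnings;
use Encode qw(encode);

my $characterString = "\x{0E51}\x{0E52}\x{0E53}";            # Thai digits one, two, three
my $utf8Encoding    = encode('UTF-8', $characterString);     # the same data as bytes

my $characters = length $characterString;   # counts code points
my $bytes      = length $utf8Encoding;      # counts bytes

print "$characters characters, $bytes bytes\n";   # 3 characters, 9 bytes
```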

The Perl Foundation (TPF) has given a grant to improve Unicode support in Perl 5, and we can expect full support for Unicode in Perl 6 as well.

There are a number of encodings of Unicode data to byte strings. Which one you use depends on what your priorities are.

  • UTF-8 - prioritises using the minimum number of bytes for the ASCII character set. The first 128 code points occupy 1 byte each; code points beyond them occupy 2 to 4 bytes.
  • UTF-32 - prioritises a fixed-width encoding. Every code point expands to 4 bytes. Computationally simple, but size-inefficient.
  • UTF-16 - compromise encoding. Most code points encode to 2 bytes, some expand to 4 bytes.
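The trade-offs in the list above can be seen by encoding a few sample code points each way and comparing byte counts. This is a sketch; the three sample characters are arbitrary choices, and the big-endian variants (UTF-16BE, UTF-32BE) are used so that no byte order mark is added to the count.

```perl
use strict;
use warnings;
use Encode qw(encode);

my @encodings = ('UTF-8', 'UTF-16BE', 'UTF-32BE');
my %sizes;

# 'A' (ASCII), THAI DIGIT TWO, and an emoji from outside the 16-bit range
for my $char ("A", "\x{0E52}", "\x{1F600}") {
    my $label = sprintf('U+%04X', ord($char));
    for my $encoding (@encodings) {
        my $size = length(encode($encoding, $char));
        $sizes{$label}{$encoding} = $size;
        print "$label in $encoding: $size bytes\n";
    }
}
```

UTF-8 gives 1, 3 and 4 bytes for the three samples, UTF-16 gives 2, 2 and 4, and UTF-32 gives 4 every time.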

<It may be interesting to extend this article with a discussion of what to use when storing Unicode strings in a database: straight code points or a UTF-x encoding, how to indicate to users of the DB what the encoding is, etc. Or is that out of scope?>

Use locale

The use locale pragma allows locale-specific features to be enabled in Perl. Regular expressions will use locale-specific sets for matching words (\w) and digits (\d). Sorting and string comparison operators such as lt or gt will use locale-specific rules, and numerical formatting code such as printf will use locale-specific formats for printing numbers.

For example, the concept of "a digit character" expressed by "\d" is given full weight in Unicode, so Unicode can "tell" the regex engine that the single character at code point U+0E3F (฿) is not matched by \d, because it is actually the Thai Baht currency symbol (see the Thai Unicode chart). The code point U+0E52 (๒) would match "\d", as it is the Thai Digit Two, according to the Thai Unicode chart.
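The behaviour of \d on these two Thai code points can be checked directly. The sketch below uses illustrative variable names; the last line assumes Perl 5.14 or later, where the /a modifier restricts \d to the ASCII digits 0-9.

```perl
use strict;
use warnings;

my $thai_two = "\x{0E52}";   # THAI DIGIT TWO
my $baht     = "\x{0E3F}";   # THAI CURRENCY SYMBOL BAHT

my $two_is_digit  = $thai_two =~ /\A\d\z/  ? 1 : 0;   # Unicode rules: a digit
my $baht_is_digit = $baht     =~ /\A\d\z/  ? 1 : 0;   # a currency symbol, not a digit
my $ascii_only    = $thai_two =~ /\A\d\z/a ? 1 : 0;   # /a: only 0-9 count as digits

print "Thai two matches \\d: $two_is_digit\n";      # 1
print "Baht matches \\d: $baht_is_digit\n";         # 0
print "Thai two matches \\d under /a: $ascii_only\n";   # 0
```

If only 0-9 should be accepted, for example when validating numeric input, /a or an explicit [0-9] class is the safer choice.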

The \p{property} construct is a powerful matching tool for use in regular expressions. For example, to match a lowercase letter in any script, use \p{Lowercase_Letter} (or \p{LowercaseLetter} or even \p{Lowercase Letter}; they're all the same, so pick a notation and stick with it). There are subtleties with the '.' (any character) operator within Unicode strings, especially with combining characters (such as accents added to letters): '.' matches a single code point, not a whole accented letter, so read your documentation carefully, and test strenuously.
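Both points can be tried out in a few lines. This sketch counts lowercase letters across several scripts, then compares '.' with \X, which matches a whole grapheme cluster, on a letter followed by a combining accent. The sample strings and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# a, Greek small alpha, Cyrillic small de, A, Greek capital Alpha, 1
my $text        = "a\x{03B1}\x{0434}A\x{0391}1";
my @lower       = $text =~ /\p{Lowercase_Letter}/g;
my $lower_count = scalar @lower;

# "e" followed by COMBINING ACUTE ACCENT: two code points, one visible letter
my $accented      = "e\x{0301}";
my @code_points   = $accented =~ /./g;    # '.' matches one code point at a time
my @clusters      = $accented =~ /\X/g;   # \X matches the letter plus its accent
my $dot_count     = scalar @code_points;
my $cluster_count = scalar @clusters;

print "lowercase letters: $lower_count\n";   # 3
print "'.' matches: $dot_count\n";           # 2
print "\\X matches: $cluster_count\n";       # 1
```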

