(Presented herein, a few notes about Unicode in Perl, written up for
XML::Writer
support and recorded for posterity.) Perl has fairly good
Unicode support, but the default behaviour is a little
problematic (and also has changed between releases). Being clear about
how you want to deal with Unicode avoids many problems.
Firstly, literal non-ASCII characters in Perl source should probably be avoided in favour of Unicode escapes.
my $currency = "£";
This can break if the code ever moves to an environment with a different default encoding, whereas:
my $currency = "\x{A3}"; # U+00A3 POUND SIGN
is totally safe. As with HTML and XML numeric references, it’s unambiguously a Unicode character, rather than a sequence of bytes. (Is there a module to automate this conversion? Java has native2ascii.) This is particularly a problem with distributed development – if different developers use different native encodings, things will start to fail when code is combined.
Output is a case where Perl’s DWIMmery gets it into trouble.
print "\$\n";       # Dollar sign
print "\x{A3}\n";   # Pound sign
print "\x{20AC}\n"; # Euro sign
This turns out as 24 0a, a3 0a, e2 82 ac 0a: the dollar sign fits into seven bits, so there's no problem; the pound sign fits into eight bits, so it gets written as a single octet; the Euro sign doesn't fit into eight bits, so it gets written as UTF-8. That creates an output stream that mixes encodings, which is not only wrong but also breaks recovery heuristics on the reading side. Choose an encoding, declare it explicitly, and stick to it. Just set the encoding before printing any output:
binmode(STDOUT, ':encoding(iso-8859-1)');
or
binmode(STDOUT, ':encoding(utf-8)');
(or ':encoding(windows-1252)'
if you’re more of a pragmatist).
Input is much the same:
$ { echo $; echo £; echo €; } | perl -ne 'chomp; print length($_),"\n";'
1
2
3
Here Perl is counting bytes rather than characters; in a UTF-8 environment that’s not ideal. For the 8-bit case, the same check of the character’s code point was run under an ISO 8859-15 locale (where the Euro sign is a single octet):
$ { echo $; echo £; echo €; } | perl -ne 'chomp; print sprintf("U+%04X", ord($_)),"\n";'
U+0024
U+00A3
U+00A4
It’s subtle, but failure to declare the encoding has turned the Euro sign into U+00A4, or ‘¤’ – the generic “currency sign”. It’s the same problem that turns Windows-1252 “smart” quotes into the control characters U+0093/U+0094.
Again, explicit declaration of encoding makes it all okay:
$ { echo $; echo £; echo €; } | perl -e 'binmode(STDIN, ":encoding(utf-8)"); while (<>) { chomp; print length($_),"\n";}'
1
1
1
and
$ { echo $; echo £; echo €; } | perl -e 'binmode(STDIN, ":encoding(iso-8859-15)"); while (<>) { chomp; print sprintf("U+%04X", ord($_)),"\n";}'
U+0024
U+00A3
U+20AC
If you’re in an all-UTF-8 environment, giving the command-line switch ‘-CDA’
to perl
will make UTF-8 the default for all input, output and command-line arguments;
setting PERL_UNICODE
to ‘DA’ does the same. ‘-C’ on its own tries
to do the right thing according to your locale; however, this only works for
UTF-8 – there’s no special handling for other encodings.
As soon as you use anything outside ASCII, you need to think about encoding; be wary of default behaviour. The currency symbols are great test characters, too: there’s something reassuringly commercial about them, in case anyone tries to spin Unicode as something that only comes into play when you start blogging about maths.
FileCache::Handle
Seems like FileCache doesn’t work with XML::Writer; I took the opportunity to learn some more about PerlIO and wrote FileCache::Handle. It’s a module that provides an unlimited number of writeable filehandles, opening and closing the underlying files as necessary to avoid OS-imposed limits on numbers of open files. Rather than doing clever things with symbols, it uses instances of IO::Handle.