Michael Kaplan
writes about the
interaction between UCS-2 and UTF-16
for Windows NT.
Java has much the same issue as NT, being based around a 16-bit
character type that many treated as UCS-2, but which is now strictly
defined as UTF-16. Both use 16-bit code units, but UTF-16 adds pairs of
“surrogate” code units, which allow it to encode characters beyond the
16-bit limit. This means that a pair of char values may be used to
represent a single underlying code point.
Java 5.0
includes APIs to support working with surrogates,
and the best advice is, wherever possible, to work with strings rather
than individual characters.
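A small Java 5 sketch of what this means in practice (the class name and
printed comments are mine, not anything from Kaplan's post or the aggregator):

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+10480 (OSMANYA LETTER ALEF) needs a surrogate pair in a Java String.
            String s = new StringBuilder().appendCodePoint(0x10480).toString();

            System.out.println(s.length());                      // 2: two UTF-16 code units
            System.out.println(s.codePointCount(0, s.length())); // 1: one code point
            System.out.println(Character.isHighSurrogate(s.charAt(0))); // true

            // Java 5.0's code-point-aware APIs let you iterate safely:
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.printf("U+%04X%n", cp);
                i += Character.charCount(cp);
            }
        }
    }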
His post also uses a code point outside the
BMP
that highlights a bug in my home-grown aggregator:
U+10480, or 𐒀. At some point it was decomposed into a pair of numeric
references to the surrogates (i.e., &#xD801;&#xDC80;), which Firefox
correctly rejected.
(Looks like the problem is in
the Java XML Writer;
writeEsc
unconditionally escapes any code unit outside US-ASCII,
including surrogates.
To handle the
UTF-16 encoding correctly,
it should probably hold on to a high surrogate until the following code unit
arrives, then emit a single numeric reference for the combined code point.
Patch submitted.)
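Roughly the idea, as a Java sketch under my own assumptions (this is not the
actual writeEsc signature or the patch as submitted, and real code would also
have to escape ‘&’, ‘<’ and ‘>’):

    public class AsciiEscaper {
        // Sketch: escape non-ASCII, turning a surrogate pair into a single
        // reference for the combined code point rather than two bogus references.
        static String escapeNonAscii(String text) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (c < 0x80) {
                    out.append(c);               // US-ASCII passes through untouched
                } else if (Character.isHighSurrogate(c)
                           && i + 1 < text.length()
                           && Character.isLowSurrogate(text.charAt(i + 1))) {
                    int cp = Character.toCodePoint(c, text.charAt(i + 1));
                    out.append("&#x").append(Integer.toHexString(cp)).append(';');
                    i++;                         // the low surrogate has been consumed
                } else {
                    out.append("&#x").append(Integer.toHexString(c)).append(';');
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // U+10480 comes out as &#x10480;, not as two surrogate references.
            String osmanya = new StringBuilder().appendCodePoint(0x10480).toString();
            System.out.println(escapeNonAscii(osmanya));
        }
    }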
While adding US-ASCII encoding to XML::Writer, another edge case turned up:
what should a serialiser do when asked to render ‘<Überprüfung/>’ in a
US-ASCII document? Numeric references aren’t permissible here, because
character references can’t appear in element or attribute names. I’ve tried
to detect this case and fail cleanly when it occurs. It’s unlikely to arise
if you’re generating the XML yourself, but it’s another reason to prefer
UTF-8 or UTF-16 as a target encoding when processing arbitrary XML.
(For comparison: echo '<expérience/>' | xmllint --encode us-ascii -.)
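In Java terms the check boils down to asking the target encoder whether it
can represent the name at all, and refusing to continue if it can’t. This is
illustrative only: XML::Writer is Perl, and the method name here is invented.

    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    public class NameCheck {
        // Names can't carry character references, so an unencodable name is fatal.
        static void checkName(String name, CharsetEncoder encoder) {
            if (!encoder.canEncode(name)) {
                throw new IllegalArgumentException("Element name '" + name
                    + "' cannot be represented in " + encoder.charset().name());
            }
        }

        public static void main(String[] args) {
            // Throws for US-ASCII; with UTF-8 or UTF-16 the problem never arises.
            checkName("Überprüfung", Charset.forName("US-ASCII").newEncoder());
        }
    }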