writes about the
interaction between UCS-2 and UTF-16
for Windows NT.
Java has much the same issue as NT, being based around a 16-bit
character type that many treated as UCS-2, but which is now strictly
defined as UTF-16. Both are 16-bit character encodings, but UTF-16 includes “surrogate” code units, which allow it to encode characters beyond the 16-bit limit. This means that a pair of surrogates may be used to represent a single underlying code point.
Java includes APIs to support working with surrogates, and the good advice is, wherever possible, to work with strings rather than individual characters.
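A quick sketch of how this looks from Java, using the surrogate-aware methods on java.lang.Character and String (the class name here is mine, for illustration):

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+10480 lies outside the 16-bit range, so Java stores it
        // as a surrogate pair of two char code units.
        String s = new String(Character.toChars(0x10480));

        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.printf("%04X %04X%n",
                (int) s.charAt(0), (int) s.charAt(1));       // D801 DC80
    }
}
```

Iterating by char would see two “characters” here; iterating by code point sees one, which is why string-level APIs are the safer default.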
His post also uses a code point outside the Basic Multilingual Plane that highlights a bug in my home-grown aggregator: U+10480, or 𐒀. At some point it was decomposed into a pair of numeric references to surrogates (i.e., &#xD801;&#xDC80;), which was correctly rejected by Firefox.
(Looks like the problem is in
the Java XML Writer;
writeEsc unconditionally escapes any code unit outside US-ASCII, producing one numeric reference per code unit. To deal with surrogate pairs, it should probably cache high surrogates until the following code unit is passed, then produce a single numeric reference for the combined code point. Patch submitted.)
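A minimal sketch of that kind of fix, assuming a writeEsc-style escaping loop (the names here are illustrative, not the actual writer internals). String.codePointAt already joins a high surrogate with its low surrogate, so the escaper only needs to step forward by Character.charCount:

```java
public class AsciiEscapeSketch {
    // Escape everything outside US-ASCII as a numeric character
    // reference, pairing surrogates so that one reference covers
    // the whole code point. (Markup escaping of &, <, > omitted.)
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i); // joins a surrogate pair
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
            i += Character.charCount(cp); // 2 for supplementary code points
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The surrogate pair D801 DC80 becomes one reference:
        System.out.println(escapeNonAscii("\uD801\uDC80")); // &#x10480;
    }
}
```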
While adding US-ASCII encoding to XML::Writer, another edge case turned up:
what should a serialiser do when asked to render ‘<Überprüfung/>’ in a
US-ASCII document? Numeric references are not permissible here, since character references can’t appear inside element names. I’ve tried to detect this case and fail cleanly. This
is unlikely if you’re generating XML yourself, but it’s another reason to
prefer UTF-8 or UTF-16 as a target when processing arbitrary XML.
(For comparison, try: echo '<expérience/>' | xmllint --encode us-ascii -.)
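Since references can’t rescue an element name, about all a serialiser can do is detect the situation and refuse. A hedged sketch of such a check, in Java for consistency with the examples above (checkAsciiName is a made-up helper, not XML::Writer’s API):

```java
public class NameCheckSketch {
    // Character references are not allowed inside element names, so
    // a US-ASCII serialiser can only reject names containing
    // non-ASCII characters. (Illustrative check, not XML::Writer's.)
    static void checkAsciiName(String name) {
        for (int i = 0; i < name.length(); i++) {
            if (name.charAt(i) > 0x7F) {
                throw new IllegalArgumentException(
                    "element name not representable in US-ASCII: " + name);
            }
        }
    }

    public static void main(String[] args) {
        checkAsciiName("experience");   // fine, passes silently
        try {
            checkAsciiName("Überprüfung");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```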