Michael Kaplan
writes about the
interaction between UCS-2 and UTF-16
for Windows NT.
Java has much the same issue as NT, being based around a 16-bit
character type that many treated as UCS-2, but which is now strictly
defined as UTF-16. Both use 16-bit code units, but UTF-16 adds pairs of
“surrogate” code units, which allow it to encode characters beyond the
16-bit limit. This means that a pair of char values may be used to
represent a single underlying code point.
Java 5.0
includes APIs to support working with surrogates,
and the best advice is, wherever possible, to work with strings rather
than individual characters.
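A small Java 5 sketch of what this means in practice (the class name and
printed comments are mine, not anything from Kaplan's post or the aggregator):

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+10480 (OSMANYA LETTER ALEF) needs a surrogate pair in a Java String.
            String s = new StringBuilder().appendCodePoint(0x10480).toString();

            System.out.println(s.length());                      // 2: two UTF-16 code units
            System.out.println(s.codePointCount(0, s.length())); // 1: one code point
            System.out.println(Character.isHighSurrogate(s.charAt(0))); // true

            // Java 5.0's code-point-aware APIs let you iterate safely:
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.printf("U+%04X%n", cp);
                i += Character.charCount(cp);
            }
        }
    }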
His post also uses a code point outside the
BMP
that highlights a bug in my home-grown aggregator:
U+10480, or 𐒀. At some point it was decomposed into a pair of numeric
references to the surrogates (i.e., &#xD801;&#xDC80;), which Firefox
correctly rejected.
(Looks like the problem is in
the Java XML Writer;
writeEsc
unconditionally escapes any code unit outside US-ASCII,
including surrogates.
To handle the
UTF-16 encoding correctly,
it should probably hold on to a high surrogate until the following code unit
arrives, then emit a single numeric reference for the combined code point.
Patch submitted.)
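Roughly the idea, as a Java sketch under my own assumptions (this is not the
actual writeEsc signature or the patch as submitted, and real code would also
have to escape ‘&’, ‘<’ and ‘>’):

    public class AsciiEscaper {
        // Sketch: escape non-ASCII, turning a surrogate pair into a single
        // reference for the combined code point rather than two bogus references.
        static String escapeNonAscii(String text) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (c < 0x80) {
                    out.append(c);               // US-ASCII passes through untouched
                } else if (Character.isHighSurrogate(c)
                           && i + 1 < text.length()
                           && Character.isLowSurrogate(text.charAt(i + 1))) {
                    int cp = Character.toCodePoint(c, text.charAt(i + 1));
                    out.append("&#x").append(Integer.toHexString(cp)).append(';');
                    i++;                         // the low surrogate has been consumed
                } else {
                    out.append("&#x").append(Integer.toHexString(c)).append(';');
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // U+10480 comes out as &#x10480;, not as two surrogate references.
            String osmanya = new StringBuilder().appendCodePoint(0x10480).toString();
            System.out.println(escapeNonAscii(osmanya));
        }
    }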
While adding US-ASCII encoding to XML::Writer, another edge case turned up:
what should a serialiser do when asked to render ‘<Überprüfung/>’ in a
US-ASCII document? Numeric references aren’t permissible here, because
character references can’t appear in element or attribute names. I’ve tried
to detect this case and fail cleanly when it occurs. It’s unlikely to arise
if you’re generating the XML yourself, but it’s another reason to prefer
UTF-8 or UTF-16 as a target encoding when processing arbitrary XML.
(For comparison: echo '<expérience/>' | xmllint --encode us-ascii -.)
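In Java terms the check boils down to asking the target encoder whether it
can represent the name at all, and refusing to continue if it can’t. This is
illustrative only: XML::Writer is Perl, and the method name here is invented.

    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    public class NameCheck {
        // Names can't carry character references, so an unencodable name is fatal.
        static void checkName(String name, CharsetEncoder encoder) {
            if (!encoder.canEncode(name)) {
                throw new IllegalArgumentException("Element name '" + name
                    + "' cannot be represented in " + encoder.charset().name());
            }
        }

        public static void main(String[] args) {
            // Throws for US-ASCII; with UTF-8 or UTF-16 the problem never arises.
            checkName("Überprüfung", Charset.forName("US-ASCII").newEncoder());
        }
    }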