Currently, it uses Expat as its underlying XML parser, which, in particular, doesn’t handle GB2312. My solution? Move XML character encoding detection and decoding outside the parser, then re-encode the resulting Unicode string as UTF-8 and parse that, with the encoding declaration removed. This also means that the report page is Unicode-clean, which is a bonus.
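For the curious, here is roughly what that approach looks like. This is a minimal sketch, not the validator's actual code: the helper names (`detect_encoding`, `to_utf8_without_declaration`, `parse_text`) are made up for illustration, and the detection is deliberately simplified. It only looks at a byte order mark and the XML declaration, and it assumes an ASCII-compatible encoding when there is no BOM, so UTF-32 and BOM-less UTF-16 are not handled.

```python
import codecs
import re
import xml.parsers.expat

# Byte order marks checked before looking at the XML declaration.
# (UTF-32 is deliberately left out of this sketch.)
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Finds the encoding pseudo-attribute in the raw bytes of the declaration.
_DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

# Removes the encoding pseudo-attribute from the decoded declaration.
_DECL_STRIP = re.compile(
    r"""^(<\?xml[^>]*?)\s+encoding\s*=\s*(?:"[^"]*"|'[^']*')"""
)


def detect_encoding(data: bytes):
    """Return (encoding name, number of BOM bytes to skip)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    match = _DECL_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii"), 0
    return "utf-8", 0  # the XML default when nothing else is declared


def to_utf8_without_declaration(data: bytes) -> bytes:
    """Decode outside the parser, drop the declared encoding, re-encode."""
    encoding, skip = detect_encoding(data)
    text = data[skip:].decode(encoding)  # Python's codecs do the real work
    text = _DECL_STRIP.sub(r"\1", text, count=1)
    return text.encode("utf-8")


def parse_text(data: bytes) -> str:
    """Parse with Expat, which now only ever sees UTF-8."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate("UTF-8")
    parser.CharacterDataHandler = chunks.append
    parser.Parse(to_utf8_without_declaration(data), True)
    return "".join(chunks)
```

Feeding it something like `'<?xml version="1.0" encoding="GB2312"?><title>中文</title>'.encode("gb2312")` hands Expat plain UTF-8 with a declaration of just `<?xml version="1.0"?>`, so the parser never needs to know GB2312 existed.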
This brings up a question, though: given that the purpose of the validator is to sign off on a feed’s quality, what should it do with an “obscure” encoding? XML requires exactly two encodings to be supported: UTF-8 and UTF-16. Anything else is optional, though most XML parsers will deal with many more. It’s a judgement call, and I decided to give a warning for anything outside a quickly-sketched list of commonly-used encodings. The official Chinese encoding is GB18030, a Unicode-aware successor to GB2312, so it’s on the list too (no, a list of arbitrary identifiers is never going to be politically neutral).
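The check itself can be tiny. The sketch below is an illustration only: apart from UTF-8, UTF-16, GB2312 and GB18030, the names in `COMMON_NAMES` are my guesses at what such a list might contain, not the validator's real list, and `check_encoding` is a hypothetical helper. Names are normalised through Python's `codecs.lookup` so that aliases such as `latin-1` and `ISO-8859-1` compare equal.

```python
import codecs

# Hypothetical "commonly used" list; only UTF-8, UTF-16, GB2312 and
# GB18030 come from the discussion above, the rest are assumptions.
COMMON_NAMES = (
    "utf-8", "utf-16", "us-ascii", "iso-8859-1", "windows-1252",
    "gb2312", "gb18030", "big5", "shift_jis", "euc-jp",
)

# Normalise every entry to Python's canonical codec name once, up front.
COMMON = {codecs.lookup(name).name for name in COMMON_NAMES}


def check_encoding(declared: str):
    """Return None if the encoding is on the list, else a warning string."""
    try:
        canonical = codecs.lookup(declared).name
    except LookupError:
        return "Unknown encoding: %s" % declared
    if canonical in COMMON:
        return None
    return "Obscure encoding: %s (not every XML parser will support it)" % declared
```

So `check_encoding("GB18030")` passes quietly, while something like `check_encoding("cp037")` gets the warning.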
Of course, this is leaving the scope of specifications and entering the realm of profiles. Syndic8’s statistics give some idea of what people are using, but it’s no substitute for practical knowledge (notably, there’s little input from real zh_CN or zh_TW developers). Still, some good discussion. (And, in a slight moment of geek pride, Norman Walsh described a warning I added as “a bit pedantic.” Yes, that Norm Walsh.)
Take my word for this as a linguist and an accessibility obsessif: This stuff is more detailed and pedantic than trainspotting, and almost as addictive to susceptible personalities. Just keep in mind that dinner-party guests are never really as interested in this topic as we are.
I’m sure some of them must be interested...