I’ve added TrackBack to this site. Individual article pages include TrackBack links, so we’ll see how that goes. The TrackBack service accepts POSTed pings and, if accessed with GET (e.g., from a browser), serves up RDF metadata about the posting (presented with XSL). It’s very RESTful, and more than a little semantic. Of course, there’s nothing in there yet, so I may go through and populate it with existing links.
For TrackBack, you need some way to identify the service URL for a page; essentially, making a statement associating a posting’s permalink with another URL. The W3C’s preferred statement-making language is RDF, but embedding RDF directly in HTML (even in XHTML) fails W3C validation. To work around this, most pages simply comment out the RDF, with clients picking it up through regular expressions. The RDFCore Working Group’s recommended solution is a little cleaner: use an HTML ‘link’ element with a relationship of ‘meta’ and the (as-yet unregistered) ‘application/rdf+xml’ media type. I’m using both, just on the off chance that there are tools out there that don’t yet support the latter.
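Since I’m publishing both forms, a client has two discovery routes: follow the ‘link’ element to external RDF metadata, or scrape the ping URL out of the commented-out RDF. A rough sketch of what that might look like (the function names and regexes are my own, and rather cruder than a real parser should be — in particular, the ‘link’ pattern assumes the attributes appear in rel/type/href order):

```python
import re

# Route 1: RDFCore style -- a <link rel="meta"> pointing at external RDF.
LINK_RE = re.compile(
    r'<link[^>]*rel=["\']meta["\'][^>]*'
    r'type=["\']application/rdf\+xml["\'][^>]*'
    r'href=["\']([^"\']+)["\']',
    re.IGNORECASE)

# Route 2: RDF commented out in the page body, scraped with a regex.
PING_RE = re.compile(r'trackback:ping="([^"]+)"')

def rdf_metadata_url(html):
    """URL of external RDF metadata, from a <link rel="meta"> element."""
    m = LINK_RE.search(html)
    return m.group(1) if m else None

def embedded_ping_url(html):
    """TrackBack ping URL scraped from commented-out RDF, if any."""
    m = PING_RE.search(html)
    return m.group(1) if m else None
```

A real client would fetch and parse the RDF that the ‘link’ element points at rather than trusting regexes, but the comment-scraping route is, in practice, exactly this crude.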
One look at the TrackBacks on the demonstration page tells you that character encoding isn’t being handled properly (mostly ISO-8859-1 being treated as UTF-8). The problem (see footer) is that the character encoding is unspecified when submitting forms: all that’s sent is bytes, at least by IE and Mozilla. (Mozilla once sent a ‘charset’ parameter but removed it because it broke too many badly-written CGI scripts; they really should fix this, at least when ‘multipart/form-data’ is used.) By default, browsers tend to use the encoding of the page the form was on, although they do seem to honour the ‘accept-charset’ attribute. Conclusion? As far as I can see: on the client, specify UTF-8 in the ‘accept-charset’ attribute of forms and supply a ‘charset’ parameter when constructing requests in code. Send everything as UTF-8. On the server, honour declared charsets (including the _charset_ hack?) and presume untagged data is UTF-8 but, if it’s invalid, fall back to ISO-8859-1. A relevant Python snippet:
    import codecs

    utf8Dec = codecs.getdecoder('utf-8')
    iso8859Dec = codecs.getdecoder('iso-8859-1')

    def defaultDec(x):
        # Presume UTF-8; on invalid input, fall back to ISO-8859-1
        # (which accepts any byte sequence).
        try:
            return utf8Dec(x)
        except UnicodeError:
            return iso8859Dec(x)

    try:
        if charset:
            dec = codecs.getdecoder(charset)
        else:
            dec = defaultDec
        field = dec(field)[0]  # getdecoder returns (unicode, length)
    except (UnicodeError, LookupError):  # LookupError: bogus charset name
        pass  # fail gracefully
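The _charset_ hack mentioned above — a hidden form field named ‘_charset_’ that IE fills in with the encoding it actually used — folds into the same fallback logic. A sketch of how a server might put it all together (the function name and the dict-of-bytes interface are my own invention, not any standard API):

```python
def decode_fields(fields):
    """Decode raw form fields (a dict of name -> bytes) to Unicode.

    Honour a declared '_charset_' field if present; otherwise presume
    UTF-8, falling back to ISO-8859-1 on invalid input.
    """
    charset = fields.get('_charset_')
    if charset:
        # The declaration itself should be plain ASCII.
        charset = charset.decode('ascii', 'replace')
    decoded = {}
    for name, value in fields.items():
        if charset:
            try:
                decoded[name] = value.decode(charset)
                continue
            except (UnicodeError, LookupError):
                pass  # bad declaration: fall through to the default
        try:
            decoded[name] = value.decode('utf-8')
        except UnicodeError:
            decoded[name] = value.decode('iso-8859-1')
    return decoded
```

The ISO-8859-1 fallback can never fail, since every byte is a valid ISO-8859-1 character — which is exactly what makes it a safe (if occasionally wrong) last resort.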
Like most interoperability problems, these aren’t really very interesting: each one is, unarguably, someone else’s fault. Of course, when it all works properly, people don’t notice. And that’s something to strive for.