kafsemo.org

Strict Atom parsing

2011-08-01

I originally wrote Clay Pigeon as a quick exercise in turning feeds into data. Getting what I need from Atom is thirty-six lines of Python, using libxml2 to parse the feed and evaluate XPath expressions. Throw in a few tests, then move on to the storage and front end.

I wanted to see how far that design decision would take me: a simple implementation limited to well-formed Atom feeds. I was shocked (shocked!) to find non-well-formed feeds, and tweaked the logging to act with a little less surprise.
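As a rough illustration of the strict approach (not the Clay Pigeon code itself), here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree in place of libxml2, which supports only a subset of XPath, and the function name entry_dates is made up for the example. A feed that isn't well-formed simply raises an error rather than being guessed at.

```python
import xml.etree.ElementTree as ET

# Namespace prefix map for the path expressions below.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def entry_dates(xml_text):
    """Return (title, updated) pairs for each entry in an Atom feed.

    Hypothetical helper: raises ET.ParseError if xml_text is not
    well-formed XML, with no attempt at recovery.
    """
    root = ET.fromstring(xml_text)
    return [
        (
            entry.findtext("atom:title", default="", namespaces=ATOM),
            entry.findtext("atom:updated", default="", namespaces=ATOM),
        )
        for entry in root.findall("atom:entry", ATOM)
    ]
```

The caller decides what to do with a ParseError; logging it and moving on to the next feed is probably enough.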

But then, there are still sites out there without Atom feeds. So an extra thirty-one lines of code and a quick import of the Universal Feed Parser and it’s up to speed again, this time with RSS support but sadly no Hot RSS. Take a look if you want to visualise your posting schedule.

There’s a nice bit of duck typing in feedparser.py: as the argument url_file_stream_or_string to open_resource suggests, you can pass in pretty much anything as a source. Anything with a read method is used directly; otherwise it’s tried as a URL, then as a file, and finally as raw data. If it was opened from a URL, the stream object will have a url property. I’ve got my content in a string. So:

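  import StringIO
  import feedparser

  stream = StringIO.StringIO(s)  # s: the feed content, already in a string
  stream.url = base              # base: the URL the feed came from

  d = feedparser.parse(stream)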

This means I can manage my own file handling but also get relative URLs resolved against the correct base. Compare with Java’s InputSource, which pulls the same trick with a static type system.
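The fallback order described above could be sketched along these lines. This is a simplified illustration for Python 3, not feedparser's actual code: the name open_source is hypothetical, and the URL check is deliberately naive.

```python
import io
import urllib.request


def open_source(source):
    """Return a readable stream for a stream, URL, filename or raw data.

    Hypothetical helper mirroring the order: stream, URL, file, data.
    """
    # Anything with a read method is used as-is.
    if hasattr(source, "read"):
        return source
    if isinstance(source, str):
        # Next, try it as a URL (naive scheme check, an assumption here)...
        if source.startswith(("http://", "https://")):
            return urllib.request.urlopen(source)
        # ...then as a file on disk...
        try:
            return open(source, "rb")
        except (OSError, ValueError):
            pass
        source = source.encode("utf-8")
    # ...and finally treat it as the data itself.
    return io.BytesIO(source)
```

Because the first check only asks for a read method, any object that quacks like a file gets through untouched, extra attributes such as url included.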

(Music: They Might Be Giants, “Can’t Keep Johnny Down”)