Hacks

download-files.pl

Another by-product from my RSS aggregator. The feeds I subscribe to currently consist of around 3 megabytes of files. Downloading all this, every hour, would be unbearably slow (and, multiplied by the number of subscribers, a major bandwidth hog for the sites). This is a standalone file downloader that attempts to do everything right: it uses threads to download in parallel, uses If-Modified-Since and If-None-Match to avoid downloading unchanged files, asks for and deals with Content-Encoding: gzip to save bandwidth, and honours Expires (as well as user preferences) to avoid excessively frequent checks.
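
The script itself is Perl; purely as a sketch of the approach, here is roughly what a single conditional, compressed fetch looks like in Python (the function and variable names are illustrative, not taken from the script):

import gzip
import urllib.request
from urllib.error import HTTPError

def fetch(url, etag=None, last_modified=None):
    # Send the validators from the previous response and ask for gzip;
    # a 304 means there is nothing new to download.
    request = urllib.request.Request(url)
    request.add_header('Accept-Encoding', 'gzip')
    if etag:
        request.add_header('If-None-Match', etag)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()
            if response.headers.get('Content-Encoding') == 'gzip':
                body = gzip.decompress(body)
            return (body,
                    response.headers.get('ETag', etag),
                    response.headers.get('Last-Modified', last_modified))
    except HTTPError as error:
        if error.code == 304:
            return None, etag, last_modified
        raise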

Usage

Run as ‘download-files.pl --output output-filename [--auth auth-filename]’. It expects input of the form:

filename
Location: url
X-Refresh-Minutes: 360

filename2
Location: url2

...

That is, stanzas consisting of a filename and any number of HTTP headers (which must include a ‘Location’), terminated with a blank line. The pseudo-header X-Refresh-Minutes, if present, indicates when a URL should be checked again. It defaults to 60, but is overridden by a server-supplied expiry time. (Observation: very few sites are setting realistic expiry times on their RSS feeds.) Output is a file of the same format, which should be passed as input on the next run. The names of updated local files will be printed to standard output, and bafflingly verbose diagnostics will be sent to standard error.
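
If you want to generate or consume this format from another tool, it is easy to parse; a rough Python sketch (the helper name is mine, not part of the script):

def read_stanzas(lines):
    # Yield (filename, headers) pairs from the blank-line-separated format above.
    filename, headers = None, {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:                      # a blank line ends a stanza
            if filename:
                yield filename, headers
            filename, headers = None, {}
        elif filename is None:            # the first line of a stanza is the filename
            filename = line
        else:                             # remaining lines are 'Name: value' headers
            name, _, value = line.partition(':')
            headers[name.strip()] = value.strip()
    if filename:                          # the final stanza may lack a trailing blank line
        yield filename, headers

# Example:
# for filename, headers in read_stanzas(open('feeds.txt')):
#     print(filename, headers['Location'], headers.get('X-Refresh-Minutes', '60'))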

If you give an authentication file as well, the passwords in it will be used when a challenge is received from a site. (For example, for LiveJournal “friends-only” posts.) Lines should be of the form site,username,password:

www.livejournal.com,bradfitz,171k3g04t5

will allow URLs of the form http://www.livejournal.com/users/USER/data/rss?auth=digest to work (assuming that you are bradfitz, and that’s your password, and USER considers you a friend).
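
The downloader itself is Perl; as a sketch of the mechanism in Python, the same file could be read and the credentials registered so that a challenge from a listed site can be answered (the function names are mine, and whether Basic or Digest is needed depends on the site):

import urllib.request

def load_auth(path):
    # One 'site,username,password' triple per line, as described above.
    entries = []
    with open(path) as auth_file:
        for line in auth_file:
            line = line.strip()
            if line:
                site, username, password = line.split(',', 2)
                entries.append((site, username, password))
    return entries

def opener_with_auth(entries):
    # Register each site's credentials so a 401 challenge can be answered.
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    for site, username, password in entries:
        password_manager.add_password(None, 'http://' + site + '/', username, password)
    return urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(password_manager),
        urllib.request.HTTPBasicAuthHandler(password_manager))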

I’ve seen a few threading problems (including core dumps) when running under Linux 2.4. Upgrading to 2.6, with its threading model changes, seems to have fixed them.

de-xhtml.xsl

As Hixie says, sending XHTML as ‘text/html’ is a Bad Idea. Unless you serve it as ‘application/xhtml+xml’, user agents will treat it as tag soup, resulting in some surprises if you ever try to treat it as real XHTML. However, processing XML is easy, while processing proper HTML is far more complicated. Hence this: a simple XSL transformation to turn XHTML into plain HTML, so you can keep an all-XML CMS, but switch to HTML for publishing.

Usage

With libxslt installed:

xsltproc de-xhtml.xsl input.xhtml >output.html

Or, more helpfully, at the end of a whole pipeline of scripts and transformations:

generate-raw-xml.pl <data | transform-xml-to-xhtml.py | xsltproc de-xhtml.xsl - >output.html

(xsltproc accepts HTML input with the --html argument, which may help with pipelines, although it leaves the elements in the default namespace.)
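
If the surrounding pipeline is Python rather than shell, the same stylesheet can be applied with lxml instead of xsltproc (a minimal sketch, assuming lxml is installed and de-xhtml.xsl is in the current directory):

from lxml import etree

# Compile the stylesheet once, then apply it to each input document.
transform = etree.XSLT(etree.parse('de-xhtml.xsl'))
result = transform(etree.parse('input.xhtml'))

with open('output.html', 'w') as output:
    output.write(str(result))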

TrackBack Toys

In adding TrackBack to my site, I needed a couple of simple client programs to test against: these are they. There’s TrackBack discovery, in Python, and a simple Perl script to register Tracks Back.

Discovery includes searching any RDF with a relation of ‘meta’, as recommended by the RDFCore Working Group. (It will follow ‘Link’ headers in the HTTP response, too. Very useful. Edged out of the HTTP spec; it would be nice to get it back in for the next revision.) It uses Sean Palmer’s rdfxml.py, and therefore inherits its GNU GPL licence. Yes, it also uses regular expressions to spot anything even vaguely like RDF anywhere on the page.
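
The core of discovery is finding embedded RDF whose dc:identifier matches the page’s permalink and reading its trackback:ping attribute. A stripped-down sketch of the regular-expression shortcut in Python (not the script itself, which also follows the ‘meta’ relations and ‘Link’ headers described above):

import re
import urllib.request

RDF_BLOCK = re.compile(r'<rdf:RDF.*?</rdf:RDF>', re.DOTALL)
PING_URL = re.compile(r'trackback:ping="([^"]+)"')
IDENTIFIER = re.compile(r'dc:identifier="([^"]+)"')

def discover(permalink):
    # Crude autodiscovery: find embedded RDF whose dc:identifier matches the
    # permalink and return its trackback:ping URL, or None.
    with urllib.request.urlopen(permalink) as response:
        page = response.read().decode('utf-8', 'replace')
    for block in RDF_BLOCK.findall(page):
        identifier = IDENTIFIER.search(block)
        if identifier and identifier.group(1) == permalink:
            ping = PING_URL.search(block)
            if ping:
                return ping.group(1)
    return None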

Posting TrackBacks is very, very simple indeed: a single POST request. The only complication is encoding, so this script uses UTF-8, the de facto standard for internationalised form submission, and declares as much with a ‘charset’ attribute.
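
The Perl script isn’t reproduced here, but the whole exchange is small enough to sketch in Python (the field names are the standard TrackBack ones; the function name is mine):

import urllib.parse
import urllib.request

def ping(trackback_url, url, title='', excerpt='', blog_name=''):
    # A TrackBack ping is a single POST of form-encoded, UTF-8 fields;
    # the XML response contains <error>0</error> on success.
    fields = {'url': url, 'title': title, 'excerpt': excerpt, 'blog_name': blog_name}
    body = urllib.parse.urlencode({k: v for k, v in fields.items() if v}).encode('utf-8')
    request = urllib.request.Request(trackback_url, data=body)
    request.add_header('Content-Type',
                       'application/x-www-form-urlencoded; charset=utf-8')
    with urllib.request.urlopen(request) as response:
        return response.read()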

CPAN Popularity

This script summarises vsftpd logs from a CPAN mirror. The result is a YAML file that can be processed directly, or after aggregation with statistics from other mirrors, using a script like this simple chart generator to show the most popular modules.
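
The script itself isn’t shown here, but the shape of the job is roughly this, assuming vsftpd’s default transfer-log lines (‘OK DOWNLOAD: …’) and CPAN’s authors/id directory layout; both of those, and all the names below, are assumptions rather than details of the real script:

import re
import sys
from collections import Counter

# Assumes successful transfers are logged like
#   ... OK DOWNLOAD: Client "1.2.3.4", "/pub/CPAN/authors/id/A/AB/ABC/Foo-Bar-1.23.tar.gz", ...
DOWNLOAD = re.compile(r'OK DOWNLOAD: Client "[^"]*", "([^"]*)"')
DISTRIBUTION = re.compile(r'/authors/id/./../[^/]+/([^/]+)-[0-9][^/]*$')

def summarise(lines):
    counts = Counter()
    for line in lines:
        match = DOWNLOAD.search(line)
        if match:
            distribution = DISTRIBUTION.search(match.group(1))
            if distribution:
                counts[distribution.group(1)] += 1
    return counts

if __name__ == '__main__':
    # Emit a simple YAML mapping of distribution name to download count.
    for name, count in sorted(summarise(sys.stdin).items()):
        print('%s: %d' % (name, count))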

Unpolluted

If you’re serving RSS feeds to a large audience, there are a number of steps you can take to reduce bandwidth and server load: compression, supporting conditional requests, and so on. This program tests a feed URL and reports back on the most important cases.
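
The most important cases amount to a couple of requests, sketched here in Python (an outline of the idea, not the program itself):

import urllib.request
from urllib.error import HTTPError

def check(url):
    # First request: does the server offer validators and gzip compression?
    request = urllib.request.Request(url)
    request.add_header('Accept-Encoding', 'gzip')
    with urllib.request.urlopen(request) as response:
        etag = response.headers.get('ETag')
        last_modified = response.headers.get('Last-Modified')
        compressed = response.headers.get('Content-Encoding') == 'gzip'

    # Second request: does a conditional GET with those validators return 304?
    request = urllib.request.Request(url)
    if etag:
        request.add_header('If-None-Match', etag)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    not_modified = False
    try:
        urllib.request.urlopen(request).close()
    except HTTPError as error:
        not_modified = (error.code == 304)

    return {'ETag': etag is not None,
            'Last-Modified': last_modified is not None,
            'gzip': compressed,
            '304 on conditional GET': not_modified}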

Examples

The Wired and MSDN feeds are interesting – they supply an ETag, but sometimes ignore it during a conditional request (the results are from the cases that failed). It looks like the web servers are clustered, with different versions of the file (or just different ETags due to different inodes on different filing systems, for example). Of course, this would mean that your cache hit rate goes down in proportion to the number of machines in the cluster, but I’m just guessing here.

Usage

The implementation is Python, with Perl for presentation formatting.

./unpolluted.py <url>

or

./unpolluted.py <url> | ./present-html.pl >unpolluted-report.xhtml

for HTML formatting.

Notes

Some of the Best Practices here are still moot. Yes, the name was a random word with “poll” in it. It would be great to make this available as an online service.

Joseph Walton, 12th September 2004