Another by-product from my RSS aggregator. The feeds I subscribe to
currently consist of around 3 megabytes of files. Downloading all this,
every hour, would be unbearably slow (and, multiplied by the number of
subscribers, a major bandwidth hog for the sites). This is a standalone file
downloader that attempts to do everything right: it uses threads to download
several files at once, sends If-None-Match to avoid
downloading unchanged files, asks for and deals with
Content-Encoding: gzip to save bandwidth, and honours
Expires (as well
as user preferences) to avoid excessively frequent checks.
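To make the conditional-request and compression handling concrete, here is a minimal Python sketch (not the Perl script itself). The function names are my own for illustration, and it assumes the ETag from the previous run has been kept somewhere.

```python
import gzip
import urllib.error
import urllib.request


def build_request(url, etag=None):
    """Build a GET that asks for gzip and, if we hold an ETag, a conditional reply."""
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        headers["If-None-Match"] = etag
    return urllib.request.Request(url, headers=headers)


def decode_body(body, content_encoding):
    """Undo Content-Encoding: gzip if the server applied it."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch(url, etag=None):
    """Return (body, etag); body is None when the server answers 304."""
    try:
        with urllib.request.urlopen(build_request(url, etag)) as resp:
            body = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
            return body, resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the local copy
            return None, etag
        raise
```

A 304 costs only a few hundred bytes of headers, which is the whole point when the unchanged file would otherwise be downloaded again every hour.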
Run as ‘download-files.pl --output output-filename [--auth auth-filename]’.
It expects input of the form:
filename
Location: url
X-Refresh-Minutes: 360

filename2
Location: url2
...
That is, stanzas consisting of a filename, any number of HTTP headers
(that must include a ‘Location’), and terminated with a blank line.
X-Refresh-Minutes, if present, indicates
when a URL should be checked again. It defaults to 60, but is overridden
by a server-supplied expiry time. (Observation: very few sites are
setting realistic expiry times on their RSS feeds.)
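The scheduling rule can be sketched in Python roughly as follows. This is an illustration rather than the script's own logic: the function name is made up, `now` must be timezone-aware, and ignoring an Expires value that is unparseable or already in the past is my assumption.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DEFAULT_REFRESH_MINUTES = 60


def next_check(now, headers):
    """Return the earliest time at which the URL should be fetched again."""
    expires = headers.get("Expires")
    if expires:
        try:
            when = parsedate_to_datetime(expires)
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            if when > now:
                return when  # a server-supplied expiry wins
        except (TypeError, ValueError):
            pass  # unparseable Expires: fall back to the stanza's preference
    minutes = int(headers.get("X-Refresh-Minutes", DEFAULT_REFRESH_MINUTES))
    return now + timedelta(minutes=minutes)
```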
Output is a file in the same format, which should be passed as input on the next
run. The names of updated local files will be printed to standard output,
and bafflingly verbose diagnostics will be sent to standard error.
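Because output and input share one format, a reader and a writer for it are all another tool needs in order to join in. A small Python sketch, assuming stanzas are separated by blank lines exactly as described; the function names are hypothetical.

```python
import re


def parse_stanzas(text):
    """Parse 'filename, headers, blank line' stanzas into (filename, headers) pairs."""
    stanzas = []
    for block in re.split(r"\n[ \t]*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        filename, headers = lines[0], {}
        for line in lines[1:]:
            # Split on the first colon only, so URLs keep theirs.
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        if "Location" not in headers:
            raise ValueError(f"stanza for {filename!r} has no Location header")
        stanzas.append((filename, headers))
    return stanzas


def format_stanzas(stanzas):
    """Write stanzas back out in the same format, each ending with a blank line."""
    out = []
    for filename, headers in stanzas:
        out.append(filename)
        out.extend(f"{name}: {value}" for name, value in headers.items())
        out.append("")
    return "\n".join(out) + "\n"
```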
If you give an authentication file as well, the passwords in it will be used when a challenge is received from a site. (For example, for LiveJournal “friends-only” posts.) Lines should be of the form site,username,password:
will allow URLs of the form
to work (assuming that you are bradfitz, and that’s your
password, and USER considers you a friend).
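Reading the authentication file is about as simple as it sounds. A Python sketch, with a hypothetical function name; splitting on only the first two commas (so that a password may itself contain commas) is my assumption, not something the script is known to do.

```python
def parse_auth(text):
    """Map site -> (username, password) from 'site,username,password' lines."""
    credentials = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two commas only, so a password may contain commas.
        site, username, password = line.split(",", 2)
        credentials[site] = (username, password)
    return credentials
```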
I’ve seen a few threading problems (including core dumps) when running under Linux 2.4. Upgrading to 2.6, with the threading model changes, seems to have fixed those problems.
As Hixie says, sending XHTML as ‘text/html’ is a Bad Idea. Unless you serve it as ‘application/xhtml+xml’, user agents will treat it as tag soup, resulting in some surprises if you ever try to treat it as real XHTML. However, processing XML is easy, while processing proper HTML is far more complicated. Hence this: a simple XSL transformation to turn XHTML into plain HTML, so you can keep an all-XML CMS, but switch to HTML for publishing.
With libxslt installed:
xsltproc de-xhtml.xsl input.xhtml >output.html
Or, more helpfully, at the end of a whole pipeline of scripts and transformations:
generate-raw-xml.pl <data | transform-xml-to-xhtml.py | xsltproc de-xhtml.xsl - >output.html
(xsltproc accepts HTML input with the ‘--html’
argument, which may help with pipelines, although it leaves the elements
in the default namespace.)
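For anyone without libxslt to hand, the effect of the transformation can be approximated in Python. This is a rough analogue using the standard library's ElementTree, not the stylesheet itself: it only moves elements out of the XHTML namespace and serialises with HTML rules, and it will choke on documents that rely on DTD-defined entities such as &nbsp;.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def de_xhtml(source):
    """Return an HTML serialisation of an XHTML document string."""
    root = ET.fromstring(source)
    for element in root.iter():
        # Move each XHTML element out of its namespace: {ns}p -> p
        if isinstance(element.tag, str) and element.tag.startswith(XHTML_NS):
            element.tag = element.tag[len(XHTML_NS):]
    # method="html" writes void elements such as <br> without a closing tag
    return ET.tostring(root, encoding="unicode", method="html")
```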
In adding TrackBack to my site, I needed a couple of simple client programs to test against: these are they. There’s TrackBack discovery, in Python, and a simple Perl script to register Tracks Back.
Discovery includes searching any RDF with a relation of ‘meta’, as recommended by the RDFCore Working Group. (It will follow ‘Link’ headers in the HTTP response, too. Very useful. Edged out of the HTTP spec; it would be nice to get it back in for the next revision.) It uses Sean Palmer’s rdfxml.py, and therefore inherits its GNU GPL licence. Yes, it also uses regular expressions to spot anything even vaguely like RDF anywhere on the page.
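The crude regular-expression half of discovery looks something like this in Python. It is a sketch only: it skips the real RDF parsing, the ‘meta’ relation and the ‘Link’ header, and just pulls ‘trackback:ping’ attributes out of anything shaped like an rdf:RDF block. The function name is hypothetical.

```python
import html
import re

RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL | re.IGNORECASE)
PING_ATTR = re.compile(r"""trackback:ping\s*=\s*(["'])(.*?)\1""", re.DOTALL)


def find_ping_urls(page):
    """Return every trackback:ping URL found in RDF embedded in the page."""
    urls = []
    for block in RDF_BLOCK.findall(page):
        for match in PING_ATTR.finditer(block):
            # Attribute values may contain entities such as &amp;
            urls.append(html.unescape(match.group(2)))
    return urls
```

Since the RDF is usually hidden inside an HTML comment, matching on the raw text rather than on a parsed DOM is what lets this find it at all.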
Posting TrackBacks is very, very simple indeed: a single POST request. The only complication is encoding, so this script uses UTF-8, the de facto standard for internationalised form submission, and declares so with a ‘charset’ attribute.
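In Python rather than Perl, the request could be built along these lines. The field names (url, title, excerpt, blog_name) are those of the TrackBack specification; the function name is made up and this is a sketch, not a port of the script. It builds the request without sending it.

```python
import urllib.parse
import urllib.request


def build_ping(trackback_url, url, title=None, excerpt=None, blog_name=None):
    """Build (but do not send) a TrackBack ping as a UTF-8 form POST."""
    fields = {"url": url, "title": title, "excerpt": excerpt, "blog_name": blog_name}
    body = urllib.parse.urlencode(
        {name: value for name, value in fields.items() if value is not None},
        encoding="utf-8",  # percent-encode non-ASCII characters as UTF-8 bytes
    ).encode("ascii")
    return urllib.request.Request(
        trackback_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"},
    )
```

Passing the result to urllib.request.urlopen would send the ping; the server's reply is a small XML document reporting success or an error.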
This script summarises vsftpd logs from a CPAN mirror. The result is a YAML file that can be processed directly, or after aggregation with statistics from other mirrors, using a script like this simple chart generator to show the most popular modules.
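The counting step might look like this in Python. Several things here are assumptions rather than facts about the script: that the logs are in vsftpd's own format (lines containing ‘OK DOWNLOAD: Client "...", "path"’) rather than xferlog format, that distributions live under authors/id/, and the function name. The YAML output is left out, since its layout isn't described here.

```python
import re
from collections import Counter

# Assumed vsftpd native log line: ... OK DOWNLOAD: Client "addr", "/path", N bytes, ...
DOWNLOAD = re.compile(r'OK DOWNLOAD: Client "[^"]*", "([^"]+)"')
# Assumed CPAN layout: .../authors/id/A/AU/AUTHOR/Dist-Name-1.23.tar.gz
DIST = re.compile(r"/authors/id/.*/([^/]+?)-v?\d[\w.]*\.(?:tar\.gz|tgz|zip)$")


def count_distributions(lines):
    """Count successful downloads per distribution name, ignoring versions."""
    counts = Counter()
    for line in lines:
        hit = DOWNLOAD.search(line)
        dist = DIST.search(hit.group(1)) if hit else None
        if dist:
            counts[dist.group(1)] += 1
    return counts
```

Counters from several mirrors can simply be added together before charting, and most_common() gives the popularity ordering.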
If you’re serving RSS feeds to a large audience, there are a number of steps you can take to reduce bandwidth and server load: compression, supporting conditional requests, and so on. This program tests a feed URL and reports back on the most important cases.
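The reporting logic boils down to a handful of comparisons on the responses to three requests. A Python sketch, with the network side left out; the function name, the report keys and the exact set of checks (including the one for expiry headers) are my assumptions, not a listing of what the program tests.

```python
def assess_feed(first, conditional_status, gzip_encoding):
    """Summarise how bandwidth-friendly a feed is.

    first: response headers from a plain GET, as a dict
    conditional_status: status code of a repeat GET carrying If-None-Match /
        If-Modified-Since built from the validators in `first`
    gzip_encoding: Content-Encoding returned to a GET sent with
        Accept-Encoding: gzip (None if the header was absent)
    """
    validators = [name for name in ("ETag", "Last-Modified") if first.get(name)]
    return {
        "supplies validator": bool(validators),
        "honours conditional GET": bool(validators) and conditional_status == 304,
        "compresses": (gzip_encoding or "").lower() == "gzip",
        "sets expiry": bool(first.get("Expires") or first.get("Cache-Control")),
    }
```

A server that hands out an ETag but answers the conditional request with a full 200 shows up as "supplies validator" true and "honours conditional GET" false.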
The Wired and MSDN feeds are interesting – they supply an ETag, but sometimes ignore it during a conditional request (the results are from the cases that failed). It looks like the web servers are clusters, with different versions of the file (or just different ETags due to different inodes on different filing systems, for example). Of course, this would mean that your cache hits go down proportionally to the number of machines in your cluster, but I’m just guessing here.
Implementation is Python, with Perl for presentation formatting.
./unpolluted.py <url> | ./present-html.pl >unpolluted-report.xhtml
for HTML formatting.
Some of the Best Practices here are still moot. Yes, the name was a random word with “poll” in it. It would be great to make this available as an online service.