Unpolluted – Tests for RSS Scalability


There seems to be some debate over whether RSS scales to large numbers of clients. It’s common to hear people suggesting alternatives to its polled HTTP model: multicast, decentralised P2P networks, and of course “push.” (These are not necessarily good ideas: as Joshua Allen says, “pull is adaptive to sparse client activity.” Maintaining subscriber state is expensive and complicated.) Before resorting to these extremes it’s worth making sure that current use of HTTP is as efficient as possible.

Unpolluted is a simple audit for a pollable resource (like my RSS feed): it checks compression, cacheability and conditional requests. If you’re expecting a large audience, these are all ways to help well-written clients to make good use of your bandwidth. As Syndic8 shows, compression isn’t widely supported, but the benefits are obvious: bandwidth reduced by an average of 70% (and no server-side CPU cost, if you compress ahead-of-time). Conditional requests mean that you only need to send the whole feed when it has changed.
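To make the three checks concrete, here is a rough sketch of the kind of test involved. It is not Unpolluted's actual code: the function names and the exact rules (what counts as "cacheable", for instance) are assumptions made for illustration. It works on headers you have already fetched, so it makes no network calls itself.

```python
def conditional_headers(first_response_headers):
    """Build the headers a polite client would send on its next poll.

    Hypothetical helper: echoes back whatever validators the server gave us.
    """
    h = {k.lower(): v for k, v in first_response_headers.items()}
    out = {"Accept-Encoding": "gzip"}
    if "etag" in h:
        out["If-None-Match"] = h["etag"]
    if "last-modified" in h:
        out["If-Modified-Since"] = h["last-modified"]
    return out


def audit(first_response_headers, revalidation_status):
    """Summarise compression, cacheability and conditional-request support.

    first_response_headers: headers from a plain GET sent with
        Accept-Encoding: gzip.
    revalidation_status: status code of a second GET sent with
        conditional_headers(...); 304 means the server honoured it.
    """
    h = {k.lower(): v for k, v in first_response_headers.items()}
    return {
        # Did the server compress the body when offered gzip?
        "compression": "gzip" in h.get("content-encoding", "").lower(),
        # Assumed rule: an explicit freshness lifetime counts as cacheable.
        "cacheable": "expires" in h
        or "max-age" in h.get("cache-control", "").lower(),
        # Is there anything to revalidate against?
        "validators": "etag" in h or "last-modified" in h,
        # Did an unchanged feed come back as 304 Not Modified?
        "conditional": revalidation_status == 304,
    }
```

You would fetch the feed once, pass the response headers to conditional_headers, fetch it again with the result, and hand both outcomes to audit. A feed that passes all four is doing about as much as it can to help its clients.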

MSDN recently had a problem with bandwidth costs for their RSS, prompting Scoble to conclude that “RSS is broken.” From this, it looks like their conditional request handling isn’t working properly. Although they’re sending an ETag, passing it back often causes them to send the whole feed anyway. It seems that they’re using a cluster of servers, each of which has a different value for the ETag, so if you ever hit a different server it will act as if the feed has changed. They don’t have this problem with If-Modified-Since, and it looks like they could dramatically reduce the load on their servers by dropping this unreliable ETag from all responses. So let’s not confuse HTTP implementation bugs with architectural flaws. (Wired News has the same problem. In fact, their cluster is actually serving slightly different versions of the file, with different RSS build dates.)
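Dropping the ETag is one fix. Another, sketched below, is to derive the ETag from the feed's bytes, so that every server in a cluster produces the same value for the same content. This is an illustration only, not a description of how MSDN or anyone else runs their servers: the function names are made up, and the If-None-Match handling is simplified (it ignores comma-separated lists, "*" and weak validators).

```python
import hashlib


def content_etag(body: bytes) -> str:
    """Return a quoted ETag that depends only on the bytes being served."""
    return '"' + hashlib.sha1(body).hexdigest() + '"'


def respond(body: bytes, if_none_match=None):
    """Return (status, headers, payload) for a GET of the feed.

    Simplified: compares a single If-None-Match value for exact equality.
    """
    etag = content_etag(body)
    if if_none_match is not None and if_none_match.strip() == etag:
        # The client already has this version: send headers only.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

This only helps if the servers really are serving identical bytes. If each one stamps its own build date into the file, as Wired News appears to, the hashes will differ and you are back where you started.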

(Update 2004-09-13: Apparently the problem was with the aggregated blog feed. This is a feed that includes all blog entries, and so updates extremely frequently. The fix? Leaving out the full text. For something with this kind of volume, RSS works perfectly well as a notification mechanism. If you’re interested (and a sparse set of subscribers will be), then click to read the full article.)

Fixing all of this doesn’t “solve” the problem. RSS still means lots of clients, connecting often. (I guess that’s why we have hardware vendors.) But just a little careful thought, and some examination of existing infrastructure, is a good idea before declaring that the sky is falling.

(Music: Bloodhound Gang, “Along Comes Mary”)