One of the first things AppEngine takes away is access to the file system. In its place are several other ways to store data. For an application that relies on loading a zip file full of RDF/XML, this presents an opportunity to redesign.
The Blobstore service is for “storage of large, immutable blobs”. Sounds close enough to me. Write access is through POSTing a multipart upload form to special framework-created URLs.
My implementation stores the blob ID. I created a simple InputStream wrapper for the blobstore and treated it essentially like a local file. Servlets retrieve, unzip and parse the data.
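The read path doesn't depend on the Blobstore API itself: whatever InputStream the wrapper hands back, the servlet just feeds it to a ZipInputStream and walks the entries. A minimal sketch (the class and method names are illustrative, not the actual implementation):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipReader {
    // Reads every entry from a zip stream into a name -> bytes map.
    // The InputStream can come from anywhere: a Blobstore-backed wrapper,
    // a classpath resource, or an HTTP response body.
    public static Map<String, byte[]> readEntries(InputStream in) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                entries.put(entry.getName(), buf.toByteArray());
            }
        }
        return entries;
    }
}
```

Each entry's bytes would then go to the RDF/XML parser; keeping the zip handling behind a plain InputStream is what makes the storage backend swappable later.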
This makes uploading new data a push process but a fairly low-overhead one.
In the development SDK this worked well. In production, however:
com.google.apphosting.api.ApiProxy$FeatureNotEnabledException: The Blobstore API will be enabled for this application once billing has been enabled in the admin console.
Okay. I want to solve this without getting my credit card out.
Make things as simple as possible. All we need is an input stream pointing at a zip file. Take a snapshot of the data, include it in the WAR file along with the code and use getResource at runtime.
Plus points: it works. The code runs on production. We still have a push process, albeit a more complicated one. Minus points: any change to the data requires a whole new upload of the application. For anything that changes infrequently this is probably acceptable.
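The snapshot approach reduces to a couple of lines, assuming the zip is packaged into the WAR as a classpath resource; the name /data.zip below is an assumption, not the actual file name:

```java
import java.io.IOException;
import java.io.InputStream;

public class SnapshotLoader {
    // Opens the data snapshot bundled into the WAR (e.g. under
    // WEB-INF/classes) so it travels with every deployment.
    // "/data.zip" is a hypothetical resource name.
    public static InputStream openSnapshot() throws IOException {
        InputStream in = SnapshotLoader.class.getResourceAsStream("/data.zip");
        if (in == null) {
            throw new IOException("data.zip not found on the classpath");
        }
        return in;
    }
}
```

From there the stream is unzipped and parsed exactly as before; the only thing that changed is where the bytes come from.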
The Datastore service is intended for persistent storage of large collections of objects with named, typed properties. The Blobstore is intended for large objects - many megabytes. Let’s think realistically about how large our data file is. Right now, it’s less than 400K in total and the individual RDF/XML files top out at 15K compressed. The Datastore has a smaller-scale blob concept that can hold up to a megabyte. Our data’s not too far behind that but, if we explode the zip and store each file as a separate entity, we’ve got plenty of headroom.
This implementation polls a remote URL for a zip file and, unless it gets a ‘not modified’, takes every file in the zip and stores it as an entity with ‘name’ and ‘data’ fields. Old entities with the same name are updated and any entity not in the new zip file is deleted. When we need our data, each entity is taken, a base URI constructed (from the remote URL and the entry name) and the inflated data is parsed.
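The two mechanical pieces here — the conditional fetch and the base-URI construction — can be sketched with the standard library. The remote URL and helper names are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

public class ZipPoller {
    // Builds the base URI for one zip entry by resolving the entry name
    // against the URL the zip was fetched from.
    public static String baseUriFor(String remoteZipUrl, String entryName) {
        return URI.create(remoteZipUrl).resolve(entryName).toString();
    }

    // Conditional GET: send If-Modified-Since and treat a 304 as
    // "keep what's stored". Returns null when nothing has changed.
    public static InputStream fetchIfModified(String remoteZipUrl, long lastFetchMillis)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(remoteZipUrl).openConnection();
        conn.setIfModifiedSince(lastFetchMillis);
        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return null; // unchanged: caller keeps its existing entities
        }
        return conn.getInputStream();
    }
}
```

Resolving against the zip's own URL means an entry named `people.rdf` inside `.../data/all.zip` gets the base URI `.../data/people.rdf`, which keeps relative references in the RDF/XML stable across the move.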
This moves to a pull model, which picks up any changes to the central data; less maintenance means more reliability. If a download fails, it just keeps the most recent version.
The “proper” solution is probably a direct implementation of Sesame’s ‘SAIL’ abstraction on top of the Datastore. Entities would store RDF triples directly, with some indexing on top to keep the whole thing from crawling. This is currently read-only RDF, so concurrent modification isn’t an issue. Modification presents a whole other set of consistency problems.
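One common way to index triples in a key-ordered store like the Datastore is to write each statement once per permutation, so every lookup pattern becomes a key-prefix scan. A sketch of just the key scheme — the separator and permutation names are illustrative, and this is the indexing idea only, not Sesame's actual SAIL interface:

```java
import java.util.Arrays;
import java.util.List;

public class TripleKeys {
    // Emits one index key per permutation (SPO, POS, OSP) for a triple.
    // A query like "all statements with predicate P and object O" then
    // becomes a prefix scan over the "pos|" keyspace.
    public static List<String> indexKeys(String s, String p, String o) {
        return Arrays.asList(
                "spo|" + s + "|" + p + "|" + o,
                "pos|" + p + "|" + o + "|" + s,
                "osp|" + o + "|" + s + "|" + p);
    }
}
```

The trade-off is write amplification (three entities per triple), which matters little for a read-only store loaded in batch.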