kafsemo.org

Sorting RDF for readable output

2014-06-19

As with rows in SQL, the statements (triples) in an RDF graph have no inherent ordering. However, when serialising RDF in a particular notation, the order in which the statements are written can change the output. Choosing an ordering that takes advantage of this can dramatically improve the readability of the final document.

Consider this example from Wikipedia’s RDF article. Here’s a description of the article on Tony Benn in Turtle:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .

<http://en.wikipedia.org/wiki/Tony_Benn>
    dc:publisher "Wikipedia" ;
    dc:title "Tony Benn" ;
    foaf:primaryTopic [
        a foaf:Person ;
        foaf:name "Tony Benn"
    ] .

This defines five statements, shown here as N-Triples:

_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:genid1 <http://xmlns.com/foaf/0.1/name> "Tony Benn" .
<http://en.wikipedia.org/wiki/Tony_Benn> <http://purl.org/dc/elements/1.1/publisher> "Wikipedia" .
<http://en.wikipedia.org/wiki/Tony_Benn> <http://purl.org/dc/elements/1.1/title> "Tony Benn" .
<http://en.wikipedia.org/wiki/Tony_Benn> <http://xmlns.com/foaf/0.1/primaryTopic> _:genid1 .

Semantically, the Turtle and the N-Triples are identical. The ordering of the statements has no effect on the meaning. We can also write those same statements, in that same order, as RDF/XML:

<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:nodeID="genid1">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="genid1">
    <foaf:name>Tony Benn</foaf:name>
  </rdf:Description>
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
  </rdf:Description>
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <foaf:primaryTopic rdf:nodeID="genid1"/>
  </rdf:Description>
</rdf:RDF>

Despite containing the same information as the original Turtle, this is clearly less human-readable. The structure is no longer apparent and there’s more repetition. We’re not taking advantage of any of the syntactic sugar that RDF/XML provides.

Consider the <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>. With the correct namespaces defined, we can use a typed node element to write this as <foaf:Person/>. Multiple statements with the same subject can also be grouped under a single element. With Sesame, if we write that document through BufferedGroupingRDFHandler, it will sort the statements to take advantage of that syntax:

<rdf:RDF
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:foaf="http://xmlns.com/foaf/0.1/"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
        <dc:publisher>Wikipedia</dc:publisher>
        <dc:title>Tony Benn</dc:title>
</rdf:Description>
<foaf:Person rdf:nodeID="node18qnobpi2x1">
        <foaf:name>Tony Benn</foaf:name>
</foaf:Person>
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
        <foaf:primaryTopic rdf:nodeID="node18qnobpi2x1"/>
</rdf:Description>

</rdf:RDF>

That’s better. We’re now taking advantage of a shorthand for the type, and we’re grouping statements about the same resource. We’re down from fifteen to ten lines of statements in the XML.
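To make the grouping step concrete, here is a minimal sketch in plain Java. It is not Sesame's implementation, and it does not try to reproduce the exact output above. The Triple class and the groupBySubject method are names invented for this example, terms are plain strings, and prefixed names are used only for brevity. It gathers statements by subject, keeping subjects in first-seen order and statements in their original order within each subject.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupBySubject {
    /** Hypothetical stand-in for an RDF statement; not Sesame's Statement type. */
    static final class Triple {
        final String subject, predicate, object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object + " .";
        }
    }

    /** Stable grouping: subjects in first-seen order, input order kept within a subject. */
    static List<Triple> groupBySubject(List<Triple> input) {
        // LinkedHashMap remembers the order in which subjects were first inserted.
        Map<String, List<Triple>> bySubject = new LinkedHashMap<>();
        for (Triple t : input) {
            bySubject.computeIfAbsent(t.subject, k -> new ArrayList<>()).add(t);
        }
        List<Triple> grouped = new ArrayList<>();
        for (List<Triple> group : bySubject.values()) {
            grouped.addAll(group);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String article = "<http://en.wikipedia.org/wiki/Tony_Benn>";
        // The five statements from the example, deliberately interleaved.
        List<Triple> scattered = Arrays.asList(
                new Triple(article, "dc:publisher", "\"Wikipedia\""),
                new Triple("_:genid1", "rdf:type", "foaf:Person"),
                new Triple(article, "dc:title", "\"Tony Benn\""),
                new Triple("_:genid1", "foaf:name", "\"Tony Benn\""),
                new Triple(article, "foaf:primaryTopic", "_:genid1"));
        for (Triple t : groupBySubject(scattered)) {
            System.out.println(t);
        }
    }
}
```

With statements in this order, a writer only has to notice that the subject has not changed since the previous statement to keep adding properties to the element it already has open.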

However, we’re still splitting the topic into two parts: the statement that this is the topic of the article, and the definition of the topic. We’re not taking advantage of striping, which lets us chain the first use of a resource as the object of a statement with the statements that use it as a subject.

Let’s sort topologically. That is, place all the statements about something at the point it’s first used. Essentially, where possible, we want a depth-first traversal of a tree.

<rdf:RDF
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:foaf="http://xmlns.com/foaf/0.1/"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
        <dc:publisher>Wikipedia</dc:publisher>
        <dc:title>Tony Benn</dc:title>
        <foaf:primaryTopic>
                <foaf:Person rdf:nodeID="node18qnos7a4x1">
                        <foaf:name>Tony Benn</foaf:name>
                </foaf:Person>
        </foaf:primaryTopic>
</rdf:Description>

</rdf:RDF>

Neat. In fact, it’s starting to look like the original Turtle document: this is the order you’d choose if you were authoring a document like this by hand.
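For a rough idea of how such an ordering can be computed, here is a small sketch in plain Java. It is not the code of RDFTripleTopologicalSorter.java and it does not use Sesame's types. The Triple class and the sort and visit methods are hypothetical names for this example, and terms are N-Triples-style strings (with prefixed names for brevity), so a quoted literal can never be mistaken for a subject. It starts from subjects that are never used as an object and walks depth-first, emitting a resource's own statements immediately after the statement that first refers to it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TopologicalTripleSort {
    /** Hypothetical stand-in for an RDF statement; not Sesame's Statement type. */
    static final class Triple {
        final String subject, predicate, object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object + " .";
        }
    }

    static List<Triple> sort(List<Triple> input) {
        Map<String, List<Triple>> bySubject = new LinkedHashMap<>();
        Set<String> usedAsObject = new HashSet<>();
        for (Triple t : input) {
            bySubject.computeIfAbsent(t.subject, k -> new ArrayList<>()).add(t);
            usedAsObject.add(t.object);
        }
        List<Triple> out = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        // Roots first: subjects that no statement refers to as an object.
        for (String subject : bySubject.keySet()) {
            if (!usedAsObject.contains(subject)) {
                visit(subject, bySubject, visited, out);
            }
        }
        // Anything left is only reachable through a cycle; emit it in first-seen order.
        for (String subject : bySubject.keySet()) {
            visit(subject, bySubject, visited, out);
        }
        return out;
    }

    private static void visit(String subject, Map<String, List<Triple>> bySubject,
                              Set<String> visited, List<Triple> out) {
        if (!visited.add(subject)) {
            return; // already emitted, or currently being emitted (a cycle)
        }
        for (Triple t : bySubject.get(subject)) {
            out.add(t);
            // If the object has statements of its own, place them right here.
            if (bySubject.containsKey(t.object)) {
                visit(t.object, bySubject, visited, out);
            }
        }
    }

    public static void main(String[] args) {
        String article = "<http://en.wikipedia.org/wiki/Tony_Benn>";
        // Same order as the N-Triples listing: blank node first, article second.
        List<Triple> input = Arrays.asList(
                new Triple("_:genid1", "rdf:type", "foaf:Person"),
                new Triple("_:genid1", "foaf:name", "\"Tony Benn\""),
                new Triple(article, "dc:publisher", "\"Wikipedia\""),
                new Triple(article, "dc:title", "\"Tony Benn\""),
                new Triple(article, "foaf:primaryTopic", "_:genid1"));
        for (Triple t : sort(input)) {
            System.out.println(t);
        }
    }
}
```

For the five statements in the example, this puts the three statements about the article first and the two about the blank node directly after foaf:primaryTopic, which is the order a writer needs in order to nest the person inside the property element. A resource that is referred to more than once is expanded only at its first use, and the recursion is naive, so a very deep chain of resources would want an explicit stack instead.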

Why?

I’m converting between two notations, and as part of that conversion, I create statements in an arbitrary order. Sorting them before rendering means I can separate generation and presentation and still get nice output.

Here’s RDFTripleTopologicalSorter.java and here’s an example of using it.

(Music: Future of the Left, “Donny of the Decks”)