
An RDF version of the VO Registry

IVOA Note (V1.0, 2007 September 20)

Interest/Working Group
Not applicable
Author
Norman Gray, Euro-VOTech project and University of Leicester

Abstract

We describe the initial implementation of an RDF version of the IVOA Resource Registry, serving the registry data via a SPARQL query endpoint, including the creation of the ontology analogues of an important subset of the relevant XML Schemas, and the mechanics of the conversion process. The result is an experimental service, and this is an interim document.

Status of this document

This is an IVOA Note.

This document is an IVOA Note expressing suggestions from and opinions of the authors. The first release of this document was 2007 September 20.

It is intended to share best practices, possible approaches, or other perspectives on interoperability with the Virtual Observatory. It should not be referenced or otherwise interpreted as a standard specification.

A list of current IVOA Recommendations and other technical documents can be found at http://www.ivoa.net/Documents/.

Acknowledgments

Thanks are due to Kevin Benson for the dump of the MSSL registry, and to Ray Plante for the script to convert the version 0.10 resource entries to version 1.0.

1 Introduction

The VO has developed and deployed a network of metadata repositories known, collectively, as the Registry. These contain information about data archives, services, organisations and other objects, allowing data owners to create and manage data they are responsible for, and supporting the replication of the data between cooperating registry servers. There are now production servers deployed in the UK, the USA and Europe (and plans to create a 'Registry of Registries' to help find them), managing between 12000 and 13000 structured records and supporting queries from a variety of user-facing and server applications.

The Registry Working Group has produced standards for Resource Metadata (RM) [std:rm], the registry update protocol [std:regint], the VOResource schema [std:voresource], and others. At the time of writing, most of the registries serve records conforming to version 0.10 of the metadata standard, but should be fully converted to version 1.0 records by the end of 2007. The registries are currently queried either using SQL or XQuery [std:xquery].

This Note describes an experimental version of the registry as an RDF triple store (see section 1.1 RDF technologies for an introduction to RDF and related technologies), queriable through a SPARQL [std:sparql] endpoint. The goals of this experiment are:

  1. to investigate how straightforwardly the RM schemas, and the registry data itself, can be converted to RDF, both in terms of the conversion itself and the technologies required to store and distribute the resulting RDF;
  2. to discover what benefits this offers to registry clients, through the expressiveness of the SPARQL query language;
  3. to explore the costs and benefits of the inferencing possibilities of RDF Schemas and OWL, with the expectation that these will allow both richer queries and better targeted results than are possible with the current registry query facilities;
  4. to explore the links with other semantic technologies being developed by the VO, such as the vocabularies work and the Ontology of Astronomical Object Types [std:ivoa-astro-onto];
  5. in the longer term to explore the links with other RDF-based technologies and projects, such as a VO engagement with the Linking Open Data movement.

This Note describes preliminary results of the experiment, and will be extended in future versions. The RM schema and the registry data have been reasonably straightforwardly converted to RDF, and are at present available behind a SPARQL endpoint at http://thor.roe.ac.uk/quaestor (this is an experimental service, and should not be relied on in the long term). Performance seems acceptable, but has not been examined in detail. For examples of use, see 4 SPARQL queries.

1.1 RDF technologies

The Resource Description Framework (RDF [std:rdf]) is a family of technologies standardised by the W3C from 1999 (see http://www.w3.org/RDF for tutorials and further references), building on a large volume of previous work in computing science and library science. RDF is an abstract data model; it has a small number of alternate notations; and it has a lightweight schema language (RDF Schemas or RDFS [std:rdfs]) for articulating subclass and subproperty relations. Associated with it are a range of ontology languages (various levels of OWL) and associated formalisms building on it.

The RDF abstract data model represents all knowledge as a set of triples: resources have properties whose values are either resources or literals. All resources are named by URIs, most typically http: URIs, but also including mailto: URIs and other schemes. Properties are also named by URIs. RDF introduces the notion of a class, or type, for a resource, which is associated with a resource with the standard predicate rdf:type.
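As a minimal sketch of this data model (not part of the registry work itself), a set of triples can be held as plain tuples. The `http://example.org/` URIs and the `objects` helper are assumptions made purely for illustration; only the `rdf:type` URI is the standard one.

```python
# Illustrative only: the example.org URIs below are invented for this sketch.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"

# Each triple is (subject, predicate, object). Literals are tagged tuples so
# that they cannot be confused with URIs naming resources.
triples = {
    (EX + "siegfried", RDF_TYPE, EX + "Male"),
    (EX + "siegfried", EX + "geneticMother", EX + "sieglinde"),
    (EX + "siegfried", EX + "name", ("literal", "Siegfried")),
}


def objects(subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


print(objects(EX + "siegfried", RDF_TYPE))
```

The class of a resource is just another triple, using the predicate `rdf:type`, so no separate mechanism is needed to record it.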

RDFS adds to this the properties rdfs:subClassOf and rdfs:subPropertyOf, making it possible to express a hierarchy of classes and properties, such that if B is a subclass of A then any object of type B is necessarily also of type A (with an analogous relationship for properties).

The Web Ontology Language (OWL) [std:owl] takes this further, adding mechanisms for defining classes (for example as the union of two other classes), declaring relationships between classes (for example that they are equivalent or disjoint), and defining properties with various logical features. A symmetric property p, for example, is one such as 'has sibling': if resource A has a property p with value B, then B can be deduced to have a property p with value A. OWL contains three levels of language, OWL Full, OWL DL and OWL Lite, with different implementation costs.
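The symmetric-property deduction can be sketched in a few lines. This is a toy illustration under the assumption that triples are plain tuples and that the set of symmetric properties is supplied by hand; the name `symmetric_closure` is hypothetical, and a real OWL reasoner does a great deal more.

```python
def symmetric_closure(triples, symmetric_properties):
    """Add (o, p, s) for every (s, p, o) whose property p is symmetric."""
    inferred = set(triples)
    for (s, p, o) in triples:
        if p in symmetric_properties:
            inferred.add((o, p, s))
    return inferred


facts = {("A", "hasSibling", "B"), ("A", "hasName", "Alice")}
closed = symmetric_closure(facts, {"hasSibling"})
print(("B", "hasSibling", "A") in closed)  # True
```

Properties not declared symmetric, such as `hasName` here, are left untouched.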

An ontology is, in the now-standard description ultimately attributable to [gruber93], a formal specification of a shared conceptualisation, that is, a set of classes and properties which articulate a model of the world (see also [baader04]). It can range from an elaborate set of definitions and restrictions, to a lightweight model which is barely more than a set of subclass relationships. For example, one might define the classes Person, Male and Female, declare the latter two as subclasses of the first, and declare that a Person will have precisely one geneticFather and one geneticMother property, which have Male and Female as their respective ranges.

RDF is useful by itself as a lowest-common-denominator data aggregation format: everything can be translated into RDF, at the cost of spectacularly increased (though generally hidable) verbosity; vocabularies and data sources can be mixed freely; and SPARQL allows the result to be queried flexibly. In order to use the extra structure declared in an RDFS or OWL ontology one must employ a reasoner, which can consume an ontology and a set of asserted facts (for example that http://siegfried is a Male and has geneticMother http://sieglinde) and either implicitly or explicitly add the implied facts (in this case that http://siegfried is also a Person, and that http://sieglinde must be a Female and thus also a Person). A reasoner which can make the deductions required for RDFS is a lightweight and generally fast thing; at the other end of the scale it is possible to create an ontology using OWL Full expressing relationships which a reasoner cannot be guaranteed to discover in polynomial time.
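To make the reasoner's job concrete, the following is a naive forward-chaining sketch covering only three RDFS rules (subclass, domain and range). It assumes that geneticMother is declared with domain Person and range Female; the short names stand in for full URIs, and `rdfs_closure` is a hypothetical helper, not the reasoner used for the registry.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def rdfs_closure(triples):
    """Apply the subclass, domain and range rules until nothing new appears."""
    facts = set(triples)
    while True:
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                # An instance of a class is also an instance of its superclass.
                if p == TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, TYPE, o2))
                # The subject of a property belongs to the property's domain.
                if p2 == DOMAIN and s2 == p:
                    new.add((s, TYPE, o2))
                # The object of a property belongs to the property's range.
                if p2 == RANGE and s2 == p:
                    new.add((o, TYPE, o2))
        if new <= facts:
            return facts
        facts |= new


# Assumed ontology for the example in the text.
ontology = {
    ("Male", SUBCLASS, "Person"),
    ("Female", SUBCLASS, "Person"),
    ("geneticMother", DOMAIN, "Person"),
    ("geneticMother", RANGE, "Female"),
}
asserted = {
    ("siegfried", TYPE, "Male"),
    ("siegfried", "geneticMother", "sieglinde"),
}

inferred = rdfs_closure(ontology | asserted) - ontology - asserted
print(sorted(inferred))
```

The loop is quadratic in the number of facts on each pass, which is acceptable for a demonstration but not how a production reasoner would be written.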

In this current RM work, the ontological work was done using only RDFS, with only a light garnishing of OWL annotations.

2 Conversion of registry metadata to RDF

The conversion of the registry metadata to RDF required two parallel strands, namely the conversion of the current resource schemas from XML Schemas to the analogous RDFS and OWL ontologies, and the conversion of the registry records themselves to RDF triples; these are described in sections 2.1 Converting the registry resource schemas and 2.2 Converting the registry records below. The resulting triples are served using a specialised 'triple-store', briefly described in 3 The triple-store.

2.1 Converting the registry resource schemas

The IVOA maintains a number of XML Schemas (see http://www.ivoa.net/xml/ for the collection) at various levels of standardisation. As of September 2007, the production registries are using 0.x versions of the schemas for their records, though there is an ongoing campaign to move the registries to version 1.0+ schemas before the end of 2007.

We have worked, on this occasion, with the XSchemas ConeSearch, RegistryInterface, SIA, STC (v1.30), TabularDB (v0.3), VODataService, VORegistry, and VOResource (all are v1.0 except where noted). Although the registry records use a broader range of schemas than this, this set covers 11894 out of the 12561 registry records we converted.

The VOResource XSchema is the core schema for the registry, and the one with the richest structure; the corresponding ontology was therefore developed by hand, using Protégé [protege07]. Most of this development was rather mechanical, but part of the motivation with this conversion was to explore the extent to which OWL idioms could express the RM concepts more precisely, more expressively, or more intelligibly. To this end we made a few non-obvious adjustments in the conversion, which include the following:

Although the VOResource ontology uses a number of OWL features, none of them are essential, and it would be straightforward to define an alternative pure RDFS variant of the ontology with relatively little loss of functionality (at the time of writing, the MyGrid OWL Validator classifies the ontology as OWL Full, but this is accidental and we aim to change this in future).

The other XSchemas in the list were all converted directly to RDFS by an XSLT transformation of the .xsd file, assisted by a few additional XSD appinfo annotations inserted into the file.
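The transformation used in this work is written in XSLT and is not reproduced here. Purely to illustrate the kind of mapping involved, the sketch below reads a toy schema and emits one class per named complex type and one subclass assertion per extension. Both the mapping rule and the `schema_to_rdfs` helper are assumptions for illustration, not a description of the actual stylesheet.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# A toy schema, invented for this sketch.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="Resource"/>
  <xs:complexType name="Service">
    <xs:complexContent>
      <xs:extension base="Resource"/>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>"""


def schema_to_rdfs(xsd_text):
    """Map named complex types to classes and extensions to subclass links."""
    root = ET.fromstring(xsd_text)
    triples = []
    for ctype in root.iter(XS + "complexType"):
        name = ctype.get("name")
        if name is None:
            continue  # anonymous types are skipped in this sketch
        triples.append((name, "rdf:type", "rdfs:Class"))
        for ext in ctype.iter(XS + "extension"):
            base = ext.get("base").split(":")[-1]  # drop any namespace prefix
            triples.append((name, "rdfs:subClassOf", base))
    return triples


for triple in schema_to_rdfs(XSD):
    print(triple)
```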

2.2 Converting the registry records

The resource registry records were converted from an XML dump of the contents of the MSSL registry, comprising records conforming to version 0.10 of the VOResource schema. Since the conversion from XML to RDF presumed VOResource 1.0 records, the contents of the dump were pre-converted to VOResource 1.0 entries using Ray Plante's conversion script.

As with the ontologies described in 2.1 Converting the registry resource schemas, the conversion from the VOResource schema was done using a hand-written XSLT transform, and the conversions for the other supported schemas done by XSLT scripts generated from the XSchema files.

Initial testing suggests that the transformation is accurate, but RDF does not have a notion of 'validity' which is as straightforward as that for XML, so more testing and debugging is certainly necessary.

Of the 12561 registry records in the supplied dump, 11894 converted without obvious error in this way; the remainder used XSchemas not supported so far, or included constructions not supported in the conversion scripts. These records turned into 656104 triples.

3 The triple-store

The RDF resulting from the transformation in section 2.2 Converting the registry records is stored in, and made available by, a triple-store.

Triple-stores typically do very little reasoning, if they do any at all. Those that do will usually perform only RDFS reasoning; that is, those deductions following from subclass, subproperty, domain and range assertions. In the example of 1.1 RDF technologies, this would include the deductions that http://sieglinde is a Female and a Person, but such a store would be unable to make use of any information about the (OWL) symmetric-property nature of hasSibling (in the case of this particular example, too much reasoning about family relationships might be injurious!).

RDF cannot usefully be persisted using an RDB in a naïve way. It seems obvious to store RDF triples in a single three-column table, but this appears in fact to be infeasible. As I understand it, the pattern of accesses of such a table, when servicing a SPARQL query, implies a large number of self-joins on the table, which is a pattern sufficiently unlike typical RDB accesses that it confounds RDBMS query planners and becomes impossibly inefficient. Triple-stores therefore have two options: they can use an RDBMS as a persistence mechanism, as long as they store the triples in a sufficiently clever way; or they can create a storage engine from scratch, optimised for storing RDF.
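The shape of the problem can be seen in a small sketch using SQLite. The table layout, the shortened property names and the data are all invented for illustration. The point is only that each additional triple pattern in a query becomes one more self-join on the same table; a toy table like this one naturally shows nothing about performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("r1", "capability", "c1"),
        ("c1", "imageServiceType", "Atlas"),
        ("r2", "capability", "c2"),
        ("c2", "imageServiceType", "Pointed"),
    ],
)

# Graph pattern:  ?r capability ?c .  ?c imageServiceType "Atlas"
# Two triple patterns need two aliases of the same table, joined on ?c.
rows = conn.execute(
    """
    SELECT t1.s
    FROM triples AS t1
    JOIN triples AS t2 ON t2.s = t1.o
    WHERE t1.p = 'capability'
      AND t2.p = 'imageServiceType'
      AND t2.o = 'Atlas'
    """
).fetchall()
print(rows)  # [('r1',)]
```

A query with n triple patterns would need n aliases of the table, which is the access pattern described above.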

We evaluated a number of triple-stores:

The following triple-stores were not evaluated, but should be.

4 SPARQL queries

For example:

prefix vor: <http://www.ivoa.net/xml/VOResource/v1.0#>
prefix sia: <http://www.ivoa.net/xml/SIA/v1.0#>

select ?r
where {
 ?r vor:capability [ sia:imageServiceType [ a sia:ImageServiceTypeAtlas ]].
 ?r vor:content [ vor:contentLevel [ a vor:ResearchContentLevel ] ].
}

Using the (experimental, temporary) SPARQL endpoint at http://thor.roe.ac.uk/quaestor, and using curl to post to the service, we may query the RDF Registry as follows:

% curl --data-binary @all-research-atlases.rq \
    --header content-type:application/sparql-query \
    http://thor.roe.ac.uk/quaestor/kb/rm

The RDF registry metadata has been uploaded to a Quaestor knowledgebase named 'rm'; see the documentation at http://thor.roe.ac.uk/quaestor for discussion. The service returns an XML response conforming to the SPARQL standard [std:sparql]. Note that the Content-Type of the POSTed query is given as application/sparql-query. If we add an Accept header via the curl option --header accept:text/tab-separated-values, we retrieve a simple list of hits.
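The same request can be assembled from a script. The sketch below uses only the Python standard library and takes the URL and header values from the curl example above; the function names `build_request` and `run_query` are hypothetical. Nothing is sent unless `run_query` is called, and since the endpoint is experimental it may not be reachable.

```python
import urllib.request

ENDPOINT = "http://thor.roe.ac.uk/quaestor/kb/rm"

QUERY = """prefix vor: <http://www.ivoa.net/xml/VOResource/v1.0#>
prefix sia: <http://www.ivoa.net/xml/SIA/v1.0#>

select ?r
where {
 ?r vor:capability [ sia:imageServiceType [ a sia:ImageServiceTypeAtlas ]].
 ?r vor:content [ vor:contentLevel [ a vor:ResearchContentLevel ] ].
}
"""


def build_request(query, accept=None):
    """Build (but do not send) a POST carrying the SPARQL query as its body."""
    headers = {"Content-Type": "application/sparql-query"}
    if accept is not None:
        headers["Accept"] = accept
    # Supplying a body makes urllib issue a POST rather than a GET.
    return urllib.request.Request(
        ENDPOINT, data=query.encode("utf-8"), headers=headers
    )


def run_query(query, accept=None):
    """Send the query and return the response body as text."""
    with urllib.request.urlopen(build_request(query, accept)) as response:
        return response.read().decode("utf-8")


request = build_request(QUERY, accept="text/tab-separated-values")
print(request.get_method(), request.full_url)
```

Calling `run_query(QUERY)` would correspond to the curl invocation shown above, and passing `accept="text/tab-separated-values"` to the variant with the extra Accept header.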

Appendices

Bibliography

[baader04] Franz Baader, Ian Horrocks, and Ulrike Sattler.
Description logics. In Steffen Staab and Rudi Studer, editors, Handbook on Ontologies, International Handbooks on Information Systems, chapter 1, pages 3-28. Springer Verlag, 2004.
[gruber93] T Gruber.
A translation approach to portable ontology specification. Knowledge Acquisition, 5 no. 2 pp. 199-220, 1993.
[protege07] The Protégé ontology editor and knowledge acquisition system.
[Online, cited 14 sep 2007].
[std:ivoa-astro-onto] L. Cambrésy, S. Derriere, P. Padovani, A. Preite Martinez, and A. Richard.
Ontology of astronomical object types. IVOA Working Draft, 2007. [Online].
[std:owl] World Wide Web Consortium.
The web ontology language. [Online].
[std:rdf] World Wide Web Consortium.
Resource Description Framework. [Online, cited February 2005].
[std:rdfs] Dan Brickley and R V Guha.
RDF vocabulary description language 1.0: RDF Schema. W3C Recommendation, feb 2004. [Online].
[std:regint] Kevin Benson, Elizabeth Auden, Matthew Graham, Gretchen Greene, Martin Hill, Tony Linde, Dave Morris, Wil O'Mullane, Ray Plante, Guy Rixon, and Kona Andrews.
IVOA registry interfaces. IVOA Working Draft, 2006. [Online].
[std:rm] Robert Hanisch, IVOA Resource Registry Working Group, and NVO Metadata Working Group.
Resource metadata for the virtual observatory. IVOA Recommendation, mar 2007. [Online].
[std:sparql] Eric Prud'hommeaux and Andy Seaborne.
SPARQL query language for RDF. W3C Candidate Recommendation, apr 2006. [Online].
[std:voresource] Raymond Plante, Kevin Benson, Matthew Graham, Gretchen Greene, Paul Harrison, Gerard Lemson, Tony Linde, Guy Rixon, Aurelien Stebe, and IVOA Resource Registry Working Group.
VOResource: an XML encoding schema for resource metadata. IVOA Proposed Recommendation, 2007. [Online].
[std:xquery] Scott Boag, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, and Jérôme Siméon.
XQuery 1.0: An XML query language. W3C Recommendation, jan 2007. [Online].

$Revision: 1.2 $ $Date: 2007/09/15 01:44:20 $