We describe the initial implementation of an RDF version of the IVOA Resource Registry, serving the registry data via a SPARQL query endpoint, including the creation of the ontology analogues of an important subset of the relevant XML Schemas, and the mechanics of the conversion process. The result is an experimental service, and this is an interim document.
This is an IVOA Note.
This document is an IVOA Note expressing suggestions from and opinions of the authors. The first release of this document was 2007 September 20.
It is intended to share best practices, possible approaches, or other perspectives on interoperability with the Virtual Observatory. It should not be referenced or otherwise interpreted as a standard specification.
A list of current IVOA Recommendations and other technical
documents can be found at
http://www.ivoa.net/Documents/.
Thanks are due to Kevin Benson for the dump of the MSSL registry, and to Ray Plante for the script to convert the version 0.10 resource entries to version 1.0.
The VO has developed and deployed a network of metadata repositories known, collectively, as the Registry. These contain information about data archives, services, organisations and other objects, allowing data owners to create and manage the data they are responsible for, and supporting the replication of the data between cooperating registry servers. There are now production servers deployed in the UK, the USA and Europe (and plans to create a `Registry of Registries' to help find them), managing 12,000 to 13,000 structured records and supporting queries from a variety of user-facing and server applications.
The Registry Working Group has produced standards for Resource Metadata (RM) [std:rm], the registry update protocol [std:regint], the VOResource schema [std:voresource], and others. At the time of writing, most of the registries serve records conforming to version 0.10 of the metadata standard, but should be fully converted to version 1.0 records by the end of 2007. The registries are currently queried either using SQL or XQuery [std:xquery].
This Note describes an experimental version of the registry as an RDF triple store (see section 1.1 RDF technologies for an introduction to RDF and related technologies), queriable through a SPARQL [std:sparql] endpoint. The goals of this experiment are:
This Note describes preliminary results of the experiment, and will be extended in future versions. The RM schema and the registry data have been converted to RDF in a reasonably straightforward way, and are at present available behind a SPARQL endpoint at http://thor.roe.ac.uk/quaestor (this is an experimental service, and should not be relied on in the long term). Performance seems acceptable, but has not been examined in detail. For examples of use, see section 4 SPARQL queries.
The Resource Description Framework (RDF [std:rdf]) is a family of technologies standardised
by the W3C from 1999 (see http://www.w3.org/RDF for tutorials and further
references), building on a large volume of previous work in computing
science and library science. RDF is an abstract data model; it has a
small number of alternate notations; and it has a lightweight schema
language (RDF Schemas or RDFS [std:rdfs]) for
articulating subclass and subproperty relations. Associated with it
are a range of ontology languages (various levels of OWL) and
associated formalisms building on it.
The RDF abstract data model represents all knowledge as a set of
triples: resources have properties whose
values are either resources or literals. All resources are named by
URIs, most typically http: URIs, but also including
mailto: URIs and other schemes. Properties are also
named by URIs. RDF introduces the notion of a class, or
type, for a resource, which is associated with a resource with the
standard predicate rdf:type.
RDFS adds to this the properties rdfs:subClassOf and
rdfs:subPropertyOf, making it possible to express a
hierarchy of classes and properties, such that if B is a
subclass of A then any object of type B is
necessarily also of type A (with an analogous relationship
for properties).
The Web Ontology Language (OWL) [std:owl] takes this further, adding mechanisms for defining classes (for example as the union of two other classes), declaring relationships between classes (for example that they are equivalent or disjoint), and defining properties with various logical features (a symmetric property p such as 'has sibling', for example, is one where, if resource A has a property p with value B, then B can be deduced to have a property p with value A). OWL contains three levels of language, OWL Full, OWL DL and OWL Lite, with different implementation costs.
An ontology is, in the now-standard description ultimately
attributable to [gruber93], `a formal specification of a shared
conceptualisation', that is, a set of classes and properties which
articulate a model of the world (see also
[baader04]). It can range from an elaborate
set of definitions and restrictions, to a lightweight model which is
barely more than a set of subclass relationships. For example, one
might define the classes Person, Male and
Female, declare the latter two to be subclasses of the
former, and declare that a Person will have precisely one
geneticFather and one geneticMother property, which have
Male and Female as their respective ranges.
RDF is useful by itself, as a lowest-common-denominator data
aggregation format: everything can be translated into RDF, at the cost
of spectacularly increased (though generally hidable) verbosity;
vocabularies and data sources can be mixed freely, and SPARQL allows
the result to be queried flexibly. In order to use the extra
structure declared in an RDFS or OWL ontology one must employ a
reasoner, which can consume an ontology and a set of asserted
facts (for example that http://siegfried is a
Male and has geneticMother
http://sieglinde) and either implicitly or explicitly add
the implied facts (in this case that http://siegfried is
also a Person, and that http://sieglinde must be
a Female and thus also a Person). A reasoner which
can make the deductions required for RDFS is a lightweight and
generally fast thing; at the other end of the scale it is possible to
create an ontology using OWL Full expressing relationships which a
reasoner cannot be guaranteed to discover in polynomial time.
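The kind of deduction an RDFS reasoner makes in the Siegfried example can be sketched in a few lines of Python. This is only an illustration, not the reasoner used in this work: the `ex:` names are hypothetical shorthand, and we assume that geneticMother and geneticFather are declared with Person as domain and Female and Male as their respective ranges. Only the subclass, domain and range rules are applied, repeatedly, until no new triples appear.

```python
# Minimal forward-chaining sketch of three RDFS rules (subclass, domain, range).
# Triples are plain (subject, predicate, object) tuples; the ex: names are
# hypothetical shorthand for illustration only.

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"

ontology = {
    ("ex:Male", SUBCLASS, "ex:Person"),
    ("ex:Female", SUBCLASS, "ex:Person"),
    ("ex:geneticFather", DOMAIN, "ex:Person"),
    ("ex:geneticFather", RANGE, "ex:Male"),
    ("ex:geneticMother", DOMAIN, "ex:Person"),
    ("ex:geneticMother", RANGE, "ex:Female"),
}

facts = {
    ("http://siegfried", TYPE, "ex:Male"),
    ("http://siegfried", "ex:geneticMother", "http://sieglinde"),
}


def rdfs_closure(triples):
    """Apply the subclass, domain and range rules until nothing new appears."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                # x type C, C subClassOf D  =>  x type D
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                # x p y, p domain C  =>  x type C
                if p2 == DOMAIN and p == s2:
                    new.add((s, TYPE, o2))
                # x p y, p range C  =>  y type C
                if p2 == RANGE and p == s2:
                    new.add((o, TYPE, o2))
        if new <= closure:
            return closure
        closure |= new


inferred = rdfs_closure(ontology | facts)
for triple in sorted(inferred - ontology - facts):
    print(triple)
```

The nested loop is quadratic in the number of triples, which is acceptable for a toy example; a real reasoner would index the triples by predicate.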
In this current RM work, the ontological work was done using only RDFS, with only a light garnishing of OWL annotations.
The conversion of the registry metadata to RDF required two parallel strands, namely the conversion of the current resource schemas from XML Schemas to the analogous RDFS and OWL ontologies, and the conversion of the registry records themselves to RDF triples; these are described in sections 2.1 Converting the registry resource schemas and 2.2 Converting the registry records below. The resulting triples are served using a specialised `triple-store', briefly described in 3 The triple-store.
The IVOA maintains a number of XML Schemas (see http://www.ivoa.net/xml/ for the collection) at various levels of standardisation. As of September 2007, the production registries are using 0.x versions of the schemas for their records, though there is an ongoing campaign to move the registries to version 1.0+ schemas before the end of 2007.
We have worked, on this occasion, with the XSchemas ConeSearch, RegistryInterface, SIA, STC (v1.30), TabularDB (v0.3), VODataService, VORegistry, and VOResource (all are v1.0 except where noted). Although the registry records use a broader range of schemas than this, this set covers 11894 out of the 12561 registry records we converted.
The VOResource XSchema is the core schema for the registry, and the one with the richest structure; the corresponding ontology was therefore developed by hand, using Protégé [protege07]. Most of this development was rather mechanical, but part of the motivation for this conversion was to explore the extent to which OWL idioms could express the RM concepts more precisely, more expressively, or more intelligibly. To this end we made a few non-obvious adjustments in the conversion, which include the following:
- The maxOccurs and minOccurs XSchema attributes have generally been ignored. This is because (i) so-called `cardinality' constraints are (perhaps unexpectedly) difficult to reason about; (ii) the RDF version of the resource is envisaged as a mirror of the XML version, so that we might reasonably assume XML schema validity; and (iii) while XSchemas in general are heavily concerned with notions of syntactical validity -- forbidding this construction or that -- RDF schemas exist to provide the information required to reason with, and are not much concerned with validity.
- [...] logo property has a domain of Actor, the concepts Contact and Resource are valid domains for that property as well (in the XSchema, only Creator has a logo).
- [...] Type has a flat enumeration of permissible values. In the ontology these were arranged into a shallow hierarchy, with ArchiveType, for example, a subClassOf ScientificDataType, which is in turn a subClassOf ResourceType. At the same time, the concepts at the same level in this hierarchy were declared as being disjoint, in OWL terms, in the sense that no resource can be declared to be of both ArchiveType and SimulationType (we make no claim that this hierarchy or these disjointness assertions are strongly defensible -- the aim is to find to what extent extra structure such as this can add value to the registry).
- [...] uri property and broken-out authorityID and resourceKey properties.
- The Relationship type contains relationshipType and relatedResource elements. This is replaced in the ontology by a relatedResource property, which takes a ContentDescription as domain and Actor as range. It has subproperties mirrorOf, servedBy, serviceFor and derivedFrom, with the same domain and range. Thus we have replaced the stand-off link of the XSchema with a direct link in the ontology. We have verified that it is indeed feasible to convert XML instances to use the appropriate property, and we expect that the direct relationship will be more readily intelligible to the eventual users, of whichever type.
- The validationLevel element contains one of an enumerated set of integers. In the ontology, the validationLevel property has a domain of Resource or Capability and a range of Organization, and the level of validation is expressed by whichever of the successive subproperties validationLevel0 to validationLevel4 is used. As well as being more direct, this directly encapsulates the information that a resource with a validationLevel1 property necessarily also has a validationLevel0 property with the same value. Again, it remains to be seen by experience whether this does in fact make queries easier for users.
- [...] Curation turned into concept CurationDescription. These were generally for stylistic and idiomatic reasons.
Although the VOResource ontology uses a number of OWL features, none of them are essential, and it would be straightforward to define an alternative pure RDFS variant of the ontology with relatively little loss of functionality (at the time of writing, the MyGrid OWL Validator classifies the ontology as OWL Full, but this is accidental and we aim to change this in future).
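The effect of the validationLevel subproperties described in the list above can be illustrated with a short Python sketch. It assumes, as the text implies but does not spell out, that each validationLevelN is declared a subproperty of the level below it, with validationLevel0 a subproperty of validationLevel. The `vor:` prefix follows the usage elsewhere in this Note; the resource and organisation identifiers are hypothetical.

```python
# Sketch of rdfs:subPropertyOf inference for the validationLevel chain.
# Assumption: validationLevel4 < validationLevel3 < ... < validationLevel0
# < validationLevel, where "<" means "is a subproperty of".

SUBPROP = "rdfs:subPropertyOf"

schema = {
    (f"vor:validationLevel{n}", SUBPROP, f"vor:validationLevel{n - 1}")
    for n in range(1, 5)
} | {("vor:validationLevel0", SUBPROP, "vor:validationLevel")}


def superproperties(prop, schema):
    """Return prop followed by every property it is transitively a subproperty of."""
    found = [prop]
    for p in found:  # the list grows while we walk it, giving a simple worklist
        for s, _, o in schema:
            if s == p and o not in found:
                found.append(o)
    return found


def expand(triple, schema):
    """Return the asserted triple plus the triples implied by subPropertyOf."""
    s, p, o = triple
    return {(s, q, o) for q in superproperties(p, schema)}


# Hypothetical identifiers, for illustration only.
asserted = ("ivo://example/resource", "vor:validationLevel2", "ivo://example/validator")
implied = expand(asserted, schema)
for triple in sorted(implied):
    print(triple)
```

A query for resources validated to at least level 1 then only needs to ask for the validationLevel1 property, since resources asserted at levels 2 to 4 acquire it by inference.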
The other XSchemas in the list were all converted directly to RDFS
by an XSLT transformation of the .xsd file, assisted by
a few additional XSD appinfo annotations inserted into
the file.
The resource registry records were converted from an XML dump of the contents of the MSSL registry, comprising records conforming to version 0.10 of the VOResource schema. Since the conversion from XML to RDF presumed VOResource 1.0 records, the contents of the dump were pre-converted to VOResource 1.0 entries using Ray Plante's conversion script.
As with the ontologies described in 2.1 Converting the registry resource schemas, the conversion from the VOResource schema was done using a hand-written XSLT transform, and the conversions for the other supported schemas done by XSLT scripts generated from the XSchema files.
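The actual conversion was done in XSLT, which we do not reproduce here. Purely to illustrate the replacement of the stand-off relationship link by a direct property (section 2.1), the following Python sketch performs the same mapping on a simplified, hypothetical XML fragment. The mapping table from relationshipType values to property names, the sample identifiers and the subject node name are all assumptions made for this example.

```python
# Illustrative sketch only: the real conversion used XSLT.
# Maps a stand-off <relationship> element to a direct triple.
import xml.etree.ElementTree as ET

# Assumed mapping from relationshipType values to ontology properties.
PROPERTY_FOR = {
    "mirror-of": "vor:mirrorOf",
    "served-by": "vor:servedBy",
    "service-for": "vor:serviceFor",
    "derived-from": "vor:derivedFrom",
}

# Simplified, hypothetical fragment of a resource record.
SAMPLE = """
<content>
  <relationship>
    <relationshipType>mirror-of</relationshipType>
    <relatedResource ivo-id="ivo://example.org/original">Original archive</relatedResource>
  </relationship>
</content>
"""


def relationship_triples(content_xml, subject):
    """Return (subject, property, related-resource) triples for each relationship."""
    root = ET.fromstring(content_xml)
    triples = []
    for rel in root.iter("relationship"):
        rel_type = (rel.findtext("relationshipType") or "").strip()
        # Unrecognised types fall back to the generic superproperty.
        prop = PROPERTY_FOR.get(rel_type, "vor:relatedResource")
        for related in rel.findall("relatedResource"):
            target = related.get("ivo-id")
            if target:
                triples.append((subject, prop, target))
    return triples


# "_:content1" stands for the blank node of the ContentDescription.
triples = relationship_triples(SAMPLE, "_:content1")
print(triples)
```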
Initial testing suggests that the transformation is accurate, but RDF does not have a notion of `validity' which is as straightforward as that for XML, so more testing and debugging is certainly necessary.
Of the 12561 registry records in the supplied dump, 11894 converted without obvious error in this way; the remainder used XSchemas not supported so far, or included constructions not supported in the conversion scripts. These records turned into 656104 triples.
The RDF resulting from the transformation in section 2.2 Converting the registry records is stored in, and made available by, a triple-store.
Triple-stores typically do very little reasoning, if they do any at
all. Typically, they will do only RDFS reasoning; that is, those
deductions following from subclass, subproperty, domain and range
assertions. In the example of 1.1 RDF technologies, this
would include the deductions that http://sieglinde is a
Female and a Person, but it would be unable to make use
of any information about the (OWL) symmetricProperty nature
of hasSibling (in the case of this particular example,
too much reasoning about family relationships might be injurious!).
RDF cannot usefully be persisted using an RDB in a naïve way. It seems obvious to store RDF triples in a single three-column table, but this appears to be infeasible in practice. As we understand it, the pattern of accesses of such a table, when servicing a SPARQL query, implies a large number of self-joins on the table, which is a pattern sufficiently unlike typical RDB accesses that it confounds RDBMS query planners and becomes impossibly inefficient. Triple-stores therefore have two options: they can use an RDBMS as a persistence mechanism, as long as they store the triples in a sufficiently clever way; or they can create a storage engine from scratch, optimised for storing RDF.
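The self-join pattern can be made concrete with a small sketch using Python's built-in sqlite3 module. It stores a handful of invented triples in the naïve three-column table and answers the graph pattern `?r vor:capability [ sia:imageServiceType [ a sia:ImageServiceTypeAtlas ] ]`, taken from the example query later in this Note. The data, the blank-node labels and the class name `sia:ImageServiceTypePointed` are hypothetical; the point is only that three triple patterns already require two self-joins.

```python
# Sketch: a naive three-column triple table and the self-joins that a
# three-pattern graph query turns into. All data here is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:atlasService", "vor:capability", "_:cap1"),
        ("_:cap1", "sia:imageServiceType", "_:type1"),
        ("_:type1", "rdf:type", "sia:ImageServiceTypeAtlas"),
        ("ex:otherService", "vor:capability", "_:cap2"),
        ("_:cap2", "sia:imageServiceType", "_:type2"),
        ("_:type2", "rdf:type", "sia:ImageServiceTypePointed"),
    ],
)

# Each triple pattern is one more alias of the same table, joined on the
# shared variable (here, the object of one pattern is the subject of the next).
SQL = """
SELECT t1.s
FROM triples AS t1
JOIN triples AS t2 ON t2.s = t1.o
JOIN triples AS t3 ON t3.s = t2.o
WHERE t1.p = 'vor:capability'
  AND t2.p = 'sia:imageServiceType'
  AND t3.p = 'rdf:type'
  AND t3.o = 'sia:ImageServiceTypeAtlas'
"""

rows = conn.execute(SQL).fetchall()
print(rows)
```

With only six rows this is instantaneous; the concern described above is how such chains of self-joins behave on a table of several hundred thousand triples.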
We evaluated a number of triple-stores:
The following triple-stores were not evaluated, but should be.
For example:
prefix vor: <http://www.ivoa.net/xml/VOResource/v1.0#>
prefix sia: <http://www.ivoa.net/xml/SIA/v1.0#>
select ?r
where {
    ?r vor:capability [ sia:imageServiceType [ a sia:ImageServiceTypeAtlas ] ].
    ?r vor:content [ vor:contentLevel [ a vor:ResearchContentLevel ] ].
}
Using the (experimental, temporary) SPARQL endpoint at http://thor.roe.ac.uk/quaestor, and using
curl to post to the service, we may query the
RDF Registry as follows:
% curl --data-binary @all-research-atlases.rq \
    --header content-type:application/sparql-query \
    http://thor.roe.ac.uk/quaestor/kb/rm
(the RDF registry metadata has been uploaded to a Quaestor
knowledgebase named `rm'; see the documentation at http://thor.roe.ac.uk/quaestor for discussion).
The XML response conforms to the SPARQL standard [std:sparql]. Note that the Content-Type of the
POSTed query is given as application/sparql-query. If we
add an Accept header via the curl option
--header accept:text/tab-separated-values, we retrieve a simple list of
hits.
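For clients that prefer not to shell out to curl, the same POST can be built with Python's standard urllib.request module. This is a sketch under the same assumptions as the curl example: the endpoint URL is the experimental one quoted above and may not be available, so the line that actually sends the request is left commented out. The function name `build_request` is our own.

```python
# Sketch: build the same SPARQL POST as the curl command above.
import urllib.request

# Experimental endpoint quoted in this Note; it may no longer be live.
ENDPOINT = "http://thor.roe.ac.uk/quaestor/kb/rm"

QUERY = """\
prefix vor: <http://www.ivoa.net/xml/VOResource/v1.0#>
prefix sia: <http://www.ivoa.net/xml/SIA/v1.0#>
select ?r
where {
    ?r vor:capability [ sia:imageServiceType [ a sia:ImageServiceTypeAtlas ] ].
    ?r vor:content [ vor:contentLevel [ a vor:ResearchContentLevel ] ].
}
"""


def build_request(query, endpoint=ENDPOINT, accept="text/tab-separated-values"):
    """Return a POST request carrying the query as application/sparql-query."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": accept,
        },
        method="POST",
    )


request = build_request(QUERY)

# Sending the request needs network access and a running endpoint:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Omitting the `accept` override in favour of whatever the service returns by default would correspond to the first curl invocation, without the Accept header.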
$Revision: 1.2 $ $Date: 2007/09/15 01:44:20 $