International Virtual Observatory Alliance
This is a Note. The first release of this document was 5 March 2004.
This is an IVOA Note expressing suggestions from and opinions of the authors. It is intended to share best practices, possible approaches, or other perspectives on interoperability with the Virtual Observatory. It should not be referenced or otherwise interpreted as a standard specification. A list of current IVOA Recommendations and other technical documents can be found at http://www.ivoa.net/Documents/.
This document has been developed with support from the National Science Foundation's Information Technology Research Program under Cooperative Agreement AST0122449 with The Johns Hopkins University.
Descriptions of the resources were encoded using the
Resource Metadata XML schemas [VOResource 2003] developed by the
IVOA
Registry Working Group. Some of the registries stored these
descriptions internally in this format as well; this allowed them to
use an off-the-shelf OAI tool to expose the descriptions via an OAI-PMH
interface. Thus, the main challenge for the publishing registries was
creating the XML documents; this was enabled at all of the registries
via a web-based form that a data provider can fill out to describe a
resource. The main challenge for the searchable registry developers
was creating a harvester that could retrieve the OAI resource
records, extract the VOResource metadata from the OAI envelope, and
parse the metadata for loading into the database. In principle, the
searchable registry would also emit VOResource-compliant metadata in
response to search queries; however, in lieu of a standard to do this,
the search interface was optimized for our sole client, the
NVO Data
Inventory Service [NVO-DIS].
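As an illustration of the harvesting step described above, the following is a minimal sketch in Python of pulling resource descriptions out of an OAI envelope. The sample response, the `extract_payloads` helper, and the `http://example.org/VOResource` namespace are assumptions made for this example; only the OAI-PMH namespace URI is taken from the protocol itself. It is not the code used in any of the prototype registries.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
# Placeholder namespace for illustration only; not the real VOResource URI.
VOR_NS = "http://example.org/VOResource"

# Hypothetical, abbreviated ListRecords response with a single record.
SAMPLE_RESPONSE = f"""<OAI-PMH xmlns="{OAI_NS}">
  <ListRecords>
    <record>
      <header><identifier>ivo://example.org/catalog</identifier></header>
      <metadata>
        <Resource xmlns="{VOR_NS}">
          <Title>Example Catalog</Title>
        </Resource>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_payloads(oai_xml):
    """Return the first child element of each OAI <metadata> wrapper."""
    root = ET.fromstring(oai_xml)
    payloads = []
    for metadata in root.iter(f"{{{OAI_NS}}}metadata"):
        children = list(metadata)
        if children:  # skip an empty envelope defensively
            payloads.append(children[0])
    return payloads


payloads = extract_payloads(SAMPLE_RESPONSE)
titles = [p.findtext(f"{{{VOR_NS}}}Title") for p in payloads]
print(titles)
```

A real harvester would also need to fetch the response over HTTP and follow OAI resumption tokens; both are omitted here.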
1.1. A Review of Successes
Although this document focuses on the difficulties of using the
VOResource schemas, it is worth noting that we feel that overall our
prototyping was a successful validation of the IVOA registry
framework. In particular, we demonstrated:
The primary object in the model is the Resource, represented
by the generic <Resource> element.
The contents of this element constitute the core metadata that describes
all types of resources. This element is extended to describe
specific types of resources; specific resource elements currently
defined include <Organisation>, <Project>,
<Registry>, <Authority>, and
<Service>. Each resource
extension element can add additional metadata that is specific to that
resource. Given that <Resource>
is so generic, it is more common to describe a resource using one of
the extension elements.
In specific terms of XML Schema, the
<Resource> element is extended
in a two step process. First, one extends the
ResourceType XML type using the XML Schema
type extension mechanism. Next, one defines the extension element
(e.g. <Service>) as having this
extended type and declares it as a member of the substitution
group, "Resource." This use of the substitution group means that the
extension element can be used anywhere the
<Resource> element is used.
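The two-step pattern can be sketched with a heavily abbreviated schema fragment, embedded here in a short Python script that lists the members of a substitution group. The fragment is hypothetical: the child elements, the absence of a target namespace, and the `substitutes_for` helper are assumptions for illustration and do not reproduce the actual VOResource schemas.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical, heavily abbreviated schema fragment; the real VOResource
# schemas define many more children and live in their own namespaces.
SCHEMA_SKETCH = f"""<xs:schema xmlns:xs="{XSD_NS}">
  <xs:complexType name="ResourceType">
    <xs:sequence>
      <xs:element name="Title" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="Resource" type="ResourceType"/>

  <!-- Step 1: extend the base type -->
  <xs:complexType name="ServiceType">
    <xs:complexContent>
      <xs:extension base="ResourceType">
        <xs:sequence>
          <xs:element name="Interface" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- Step 2: declare the element as substitutable for Resource -->
  <xs:element name="Service" type="ServiceType"
              substitutionGroup="Resource"/>
</xs:schema>"""


def substitutes_for(schema_xml, head):
    """List global elements declared in the substitution group `head`."""
    root = ET.fromstring(schema_xml)
    return [
        el.get("name")
        for el in root.findall(f"{{{XSD_NS}}}element")
        if el.get("substitutionGroup") == head
    ]


members = substitutes_for(SCHEMA_SKETCH, "Resource")
print(members)
```

Note that the script only inspects the schema text; it does not validate instance documents, which requires a schema-aware parser.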
1.3. Purpose of this Document
The remainder of this document focuses on what, in our experience,
made using the VOResource schemas difficult. The purpose is to
provide input to a discussion of a possible future revision. This
document avoids making too many conclusions about what should
be done to improve the schemas; rather, it attempts to assemble the
major issues to be considered. We recognize that some of the
complexity inherent in the schemas may be necessary to adequately
address the needs of the various IVOA projects. Thus, a prudent
revision need not address all of the challenges described here.
2. Challenges
In this section, we examine some of the challenges and difficulties
that we, as on-the-ground implementors and data providers, have
encountered while creating resource descriptions using the VOResource
schema and filling our registries.
2.1. The Learning Curve
For much of our deployment period, the only practical documentation
that existed on VOResource helpful for creating compliant descriptions
was an
example file [adil-v0.9.xml] posted to
the Registry Working Group Twiki. While it contained examples of
several types of resource descriptions, it was not targeted
specifically to use of descriptions with registries. General XML
tools, such as XMLSpy and
xs3p, provided
"reference manual"-type descriptions of the schemas; these are
somewhat helpful because the schemas themselves were fully documented,
and these tools incorporated this documentation into their
human-readable format. However, even for those who know where to
consult these documents, it's unclear how helpful they were to people
in the absence of a concept of the big picture.
What was clearly needed was a general tutorial document that stepped through a few annotated examples, targeted specifically to the registry applications. Such a document was in the works; however, in the time leading up to deployment, priority was given to tracking down various bugs in the schemas and their publishing. An important companion to that document is a metadata dictionary [Plante 2003] (which Plante developed as an adaptation of the xs3p-generated reference manual); using the documentation in the schema, this provides users with a dictionary for looking up the specific meaning and syntax of any VOResource element. This dictionary was published late in the deployment stage, so it is unclear if it, on its own, was useful to developers.
In the absence of a tutorial document, several of us looked to the RM document for guidance. This led to considerable confusion because the metadata names and structure defined in the schemas do not exactly match what is in the RM. Although the RM does state that a particular encoding may diverge in name and structure from what is defined in the RM, some feel that the divergence is greater than necessary and thus a source of unnecessary confusion.
2.2. Hierarchy vs. Flatness
One of the reasons for VOResource's deviation from the RM document is
to take advantage of XML's ability to organize information
hierarchically. The purpose of employing a hierarchical organization
is to aggregate information into logical objects. For example, all
the information about the resource's Coverage is contained within the
<Coverage> element. This
can be useful for an application that wishes to handle, say, Coverage
objects independently from a resource description.
A hierarchical structure does have some disadvantages, however. One is that it can be a bit tedious to programmatically access a piece of information deep within the hierarchy using an in-memory tree (either DOM or customized "binding" objects): this may require several function calls to traverse the levels of the hierarchy. Such calls can be particularly verbose if one or more levels of the hierarchy are optional; one needs to confirm that the layer exists before extracting its contents.
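The verbosity of level-by-level access can be illustrated with a small Python sketch. The record below is hypothetical: the `Curation` wrapper and the exact nesting are assumptions for the example, and namespaces are left out for brevity. The first function checks each optional layer in turn; the second shows that a path expression, where the tool offers one, shortens the same lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the nesting shown here is illustrative only.
RECORD = """<Resource>
  <Title>Example Catalog</Title>
  <Curation>
    <Publisher>
      <Title>Example Data Center</Title>
    </Publisher>
  </Curation>
</Resource>"""


def publisher_title_stepwise(resource):
    """Walk one level at a time, checking each optional layer."""
    curation = resource.find("Curation")
    if curation is None:
        return None
    publisher = curation.find("Publisher")
    if publisher is None:
        return None
    title = publisher.find("Title")
    return title.text if title is not None else None


def publisher_title_path(resource):
    """Same lookup using a single path expression."""
    return resource.findtext("Curation/Publisher/Title")


resource = ET.fromstring(RECORD)
empty = ET.fromstring("<Resource/>")
print(publisher_title_stepwise(resource))
print(publisher_title_path(resource))
print(publisher_title_path(empty))
```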
We also found SAX parsing is a bit more complicated than it might be
because certain elements
(e.g. <Title> and
<Name>) are reused in multiple
locations within the hierarchy. In these cases, the basic concept the
element represents is the same but modifies different things (e.g. the
title of a publisher versus the title of a resource). This requires
that the parser must not only look for an element with a particular
name, it must keep track of where in the hierarchy the element is
found in order to grab the right one.
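One common way to track position during SAX parsing is to keep a stack of open element names. The following Python sketch, using the standard `xml.sax` module, is an assumption about how such a parser might look rather than the code used in our registries; the record and the `TitleHandler` class are hypothetical, and namespaces are omitted.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect <Title> text keyed by the path at which it occurs."""

    def __init__(self):
        super().__init__()
        self.path = []      # stack of currently open element names
        self.titles = {}    # e.g. "Resource/Title" -> accumulated text

    def startElement(self, name, attrs):
        self.path.append(name)

    def characters(self, content):
        # characters() may be called several times for one text node
        if self.path and self.path[-1] == "Title":
            key = "/".join(self.path)
            self.titles[key] = self.titles.get(key, "") + content

    def endElement(self, name):
        self.path.pop()


# Hypothetical record in which <Title> appears at two levels.
RECORD = b"""<Resource>
  <Title>Example Catalog</Title>
  <Curation>
    <Publisher>
      <Title>Example Data Center</Title>
    </Publisher>
  </Curation>
</Resource>"""

handler = TitleHandler()
xml.sax.parseString(RECORD, handler)
print(handler.titles)
```

Without the path stack, the handler would have no way to tell the resource title from the publisher title.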
The biggest difficulty the hierarchical model imposes is that it
does not map easily to a relational database, which was used to
implement the searchable registry. If the RDBMS tables are
to be normalized, one would typically store each layer (or object) from
the XML document into a separate table. Repeatable elements (e.g.
<Format>, <Subject>) would
also need to be further segregated into separate tables.
Consequently, simple queries will invariably require joins across
multiple tables. A flatter model would make the mapping to an RDBMS
easier.
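To make the join cost concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, column names, and sample values are hypothetical and much smaller than a real registry database; they only illustrate how a nested object and a repeatable element each end up in their own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resource (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    -- one table per nested object
    CREATE TABLE publisher (
        resource_id INTEGER REFERENCES resource(id),
        title       TEXT NOT NULL
    );
    -- one table per repeatable element
    CREATE TABLE subject (
        resource_id INTEGER REFERENCES resource(id),
        subject     TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO resource VALUES (1, 'Example Catalog')")
conn.execute("INSERT INTO publisher VALUES (1, 'Example Data Center')")
conn.executemany(
    "INSERT INTO subject VALUES (?, ?)",
    [(1, "galaxies"), (1, "quasars")],
)

# Even "title and publisher of resources about quasars" needs two joins.
rows = conn.execute("""
    SELECT r.title, p.title, s.subject
    FROM resource AS r
    JOIN publisher AS p ON p.resource_id = r.id
    JOIN subject   AS s ON s.resource_id = r.id
    WHERE s.subject = 'quasars'
""").fetchall()
print(rows)
```

In a flatter model, the publisher title could be a column of the resource table, leaving only the repeatable elements to be joined.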
It's worth pointing out that the question of hierarchy versus flatness
primarily is an issue that affects how easily XML tools work with our
schema. However, it also can affect a user's overall understanding of
the model, particularly if one is familiar with the relatively
straight-forward concepts in the RM. For example, the name of a
publisher is not the value of the
Publisher element, but rather
Publisher/Title. The hierarchy, thus,
can obscure the actual location of the most commonly used values.
This is also an issue of general complexity.
2.3. Size and Complexity
The hierarchical nature of the schemas can be one aspect of the
schema's complexity. The number of elements defined in the schema is
another. Overall complexity can help determine how easy it is to work
with the schema. In this section, we enumerate some ways the overall
size and complexity of the schema can increase the effort necessary to
use the schemas.
First, most of the elements are optional, and many of them are not used by our current applications. This creates a challenge both for providers (those who must create the resource descriptions) and for application developers (those who make use of them). For providers, it becomes unclear which of the elements are most important to supply for a description to be useful. For developers, more elements often require more code to support. For example, with a searchable registry, more elements means more data that must be stored and mapped into and out of a database. Because this mapping is non-trivial (as mentioned in the previous section), more hand coding is necessary.
It can be argued that the larger the schema, the harder it is to
comprehend and support. It may, therefore, make sense to remove
or otherwise consolidate elements that in practice are rarely used.
There are several elements that might be considered in that category.
For example, there are several elements of the type
ResourceReferenceType (e.g.
Facility and
Instrument). This type has four child
nodes, Identifier, Title, Description,
and ReferenceURL, of which only
Title is required. This type is
intended for referring to other resources that may or may not
be described in an external, registered resource description. For
Facility and
Instrument, providers have not
registered them separately (because there is no driving need at the
moment). As a result, the Identifier
value is not set; neither, typically, are
Description and
ReferenceURL. If we concluded that
these rarely used children are not needed, we could simplify the
Facility and
Instrument elements greatly by defining
their type to be xsd:string to simply
hold the title. Obviously, with this example, what would be
sacrificed would be flexibility, particularly to future uses
(e.g. when a facility is commonly registered independently).
2.4. Referencing Multiple Namespaces and Schema Files
A typical VOResource instance draws on multiple schemas, usually one
extension schema that defines a specific type of resource
(e.g. VOCommunity which defines the
Organisation resource) and the core
VOResource schema. In our prototype registry deployment, a set of
descriptions from a single provider would typically have to draw on
all six of the standard resource schemas defined.
The complexity introduced by the use of multiple namespaces made namespace use error-prone. When a namespace was mishandled, it was often difficult to determine the cause as the error message one usually saw was something like "schema not found" or "unexpected element." From the perspective of a data provider, who arguably should not have to be an XML expert to use the schemas, these messages are too obscure to easily track down the problem.
The difficulties we experienced with namespaces resulted primarily from three common errors:
There are five locations where the namespace URI is typically cited:
targetNamespace attribute within the schema (.xsd) file.
xmlns attribute at the top of the XML document indicating the default namespace for the document.
xmlns attribute elsewhere in the XML document used to switch to a new default namespace.
xmlns:prefix attribute used to define namespace prefixes.
schemaLocation attribute to indicate where to find the schema document associated with the namespace.
VOResource uses a standard-but-optional mechanism for locating schemas in which the namespace URI is a URL that resolves to the actual Schema file itself. Applications can search this URL as a default location, which avoids having to "hard-code" the schema locations in the XML documents.
However, not all XML tools support this mechanism, and the level of support varied greatly among the tools in use.
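The following Python sketch shows several of these namespace locations in one small instance document, along with the kind of silent failure that results when a namespace is mishandled. The namespace URIs under `example.org` and the `VOCommunity.xsd` location are placeholders chosen for this example, not the real VOResource values.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs for illustration; not the real VOResource namespaces.
CORE_NS = "http://example.org/VOResource"
EXT_NS = "http://example.org/VOCommunity"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Default namespace, a prefixed namespace, and a schemaLocation hint.
DOC = f"""<Organisation xmlns="{EXT_NS}"
    xmlns:vr="{CORE_NS}"
    xmlns:xsi="{XSI_NS}"
    xsi:schemaLocation="{EXT_NS} VOCommunity.xsd">
  <vr:Title>Example Data Center</vr:Title>
</Organisation>"""

root = ET.fromstring(DOC)

# A lookup that ignores the namespace silently finds nothing.
unqualified = root.find("Title")

# The lookup succeeds only with the exact namespace URI.
qualified = root.find(f"{{{CORE_NS}}}Title")

# schemaLocation is a namespaced attribute holding URI/location pairs.
schema_location = root.get(f"{{{XSI_NS}}}schemaLocation").split()

print(root.tag)
print(unqualified)
print(qualified.text)
print(schema_location)
```

A single mistyped character in any one of the URIs produces the same "not found" result as the unqualified lookup, which is why such errors were hard to trace.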
The prominent role that namespaces play in the resource description documents ultimately made the XML mark-up more complex and mysterious from the perspective of a resource provider than we would prefer. In particular:
xmlns and
schemaLocation attributes
visually detract from the information being encoded. They make the
document "noisy".
Using the binding tools is not straightforward, primarily for two reasons. Most important is the fact that substitution groups, which VOResource uses to support polymorphism, were not fully supported by these tools. (At the time of this writing, we believe that substitution groups are now supported by JAXB and MS-XSD. Castor can conditionally support them under certain usage patterns.) Second, it is not clear how to use these tools to create classes for an extension schema which draws on the core VOResource schema; that is, while it may be possible, this is not well documented. For example, through trial and error, we found that MS-XSD could support all the extension schemas if we generated the classes from all of the schemas at once in one call to the XSD command.
4. Addressing the Difficulties
As stated in the Introduction, the purpose of
this document is to serve as input to a discussion of a revision of
VOResource; that is, we hope that we can benefit from these lessons
learned. To that end, we see three possible approaches that the IVOA
RWG can take to address the concerns raised in this document:
The major motivation for this option is that by late 2004, we expect to be fairly entrenched with the VOResource schemas. We expect there to be several working registries in production mode, serving several thousands of records. The prospect of retooling these registries for a new core schema (and retraining their maintainers) may make a revision prohibitively expensive.
Given the IVOA goals for 2004, such a revision process would have to happen rapidly in the early spring, and be tested in real prototypes by spring and ready for IVOA review at the May 2004 Interoperability meeting.
Appendix A: Changes from previous versions
Not applicable.
http://www.ivoa.net/internal/IVOA/IVOARegWp03/adil-v0.9.xml
http://www.ivoa.net/Documents/PR/ResMetadata/ResMetadata.html
http://www.ivoa.net/twiki/bin/view/IVOA/IvoaResReg
http://heasarc.gsfc.nasa.gov/vo/data-inventory.html
http://bill.cacr.caltech.edu/usvo-pubs/files/VORegistries.pdf
http://nvo.ncsa.uiuc.edu/VO/schemas/vomdoc-v0.9/Note-RMSchemas.html#appA
http://www.ivoa.net/Documents/PR/Identifiers/Identifiers.html
http://www.openarchives.org/OAI/openarchivesprotocol.html
http://www.ivoa.net/twiki/bin/view/IVOA/IVOARegWp03#VOResource_a_Resource_Metadata_S