International Virtual Observatory Alliance
Simple Spectral Access Protocol
Version 1.04
IVOA Recommendation Feb 01, 2008
This version:
http://www.ivoa.net/Documents/REC/DAL/SSA-20080201.html
Latest version:
http://www.ivoa.net/Documents/latest/SSA.html
Previous version(s):
Version 1.03, December 2007
Version 1.02, September 2007
Version 1.01, June 2007
Version 1.00, May 2007
Version 0.97, November 2006
Version 0.96, September 2006
Version 0.95, May 2006
Version 0.91, October 2005
Version 0.90, May 2005
Editors:
D. Tody, M. Dolensky
Authors:
D. Tody, M. Dolensky, J. McDowell, F. Bonnarel, T. Budavari, I. Busko, A. Micol, P. Osuna, J. Salgado, P. Skoda, R. Thompson, F. Valdes, and the Data Access Layer Working Group.
The Simple Spectral Access (SSA) Protocol (SSAP) defines a uniform interface to remotely discover and access one dimensional spectra. SSA is a member of an integrated family of data access interfaces altogether comprising the Data Access Layer (DAL) of the IVOA. SSA is based on a more general data model capable of describing most tabular spectrophotometric data, including time series and spectral energy distributions (SEDs) as well as 1-D spectra; however the scope of the SSA interface as specified in this document is limited to simple 1-D spectra, including simple aggregations of 1-D spectra.
The form of the SSA interface is similar to that of the older Simple Image Access (SIA) interface for accessing 2-D image data, and the cone search interface for accessing astronomical catalogs. Clients first query the global resource registry to find services of interest. Clients then issue a data discovery query to selected services to determine what relevant data is available from each service; the candidate datasets available are described uniformly in a VOTable format document which is returned in response to the query. Finally, the client may retrieve selected datasets for analysis.
Spectrum datasets returned by an SSA spectrum service may be either precomputed, archival datasets, or they may be virtual data which is computed on the fly to respond to a client request. Spectrum datasets may conform to a standard data model defined by SSA, or may be native spectra with custom project-defined content. Spectra may be returned in any of a number of standard data formats. Spectral data is generally stored externally to the VO in a format specific to each spectral data collection; currently there is no standard way to represent astronomical spectra, and virtually every project does it differently. Hence spectra may be actively mediated to the standard SSA-defined data model at access time by the service, so that client analysis programs do not have to be familiar with the idiosyncratic details of each data collection to be accessed. Services are self describing, and provide a service metadata query operation which may be called to determine the capabilities of a specific service instance. Metadata returned by a service metadata query may be cached in the registry to facilitate registry-based service discovery.
Since SSA is part of a family of interfaces, much of the SSA interface described herein is common with the other DAL interfaces and not specific to SSA. In particular, the HTTP-based basic service profile, the main query parameters, and most of the dataset metadata returned in the query response, are generic and apply equally well to any type of data, and are (or will be, as interfaces are updated) shared by all the DAL interfaces.
This document has been produced by the IVOA Data Access Layer Working Group.
It has been reviewed by IVOA Members and other interested parties, and has been endorsed by the IVOA Executive Committee as an IVOA Recommendation. It is a stable document and may be used as reference material or cited as a normative reference from another document. IVOA's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability inside the Astronomical Community.
Comments on this document can be posted to the mailing list dal@ivoa.net, uploaded to the collaborative web page IvoaDAL, or sent to the authors directly.
A getCapabilities operation returning service metadata will be added which will eventually obsolete the current FORMAT=METADATA mechanism. As an addition to the interface, this change is expected to be backwards compatible with existing services. The getCapabilities operation is expected to be compatible with the VO Support Interface (VOSI) specification that the IVOA Grid & Web Services working group is currently defining (Rixon et al. 2007). Additional changes are expected when other Grid and query language technology is integrated into the DAL interfaces including SSA.
A list of current IVOA Recommendations and other technical documents can be found at http://www.ivoa.net/Documents/.
This document has been developed with support from the 5th and 6th Framework Programmes of the European Community for research, technological development and demonstration activities, contracts HPRI-CT-2001-50030, VOTech-011892, and via a grant from the National Science Foundation's Information Technology Research program to develop the U.S. National Virtual Observatory.
Many of the ideas in this document originated from others involved in developing Virtual Observatory concepts and standards. In particular, the idea of using association in the query response to group similar datasets grew out of an idea originally proposed by Roy Williams. Arnold Rots originated the idea of ranking query results via a score heuristic, and helped put the coordinate systems used in SSA on a firm theoretical foundation via the development of STC. Francois Bonnarel, Mireille Louys, Alberto Micol, and others contributed to the representation of astronomical metadata and in particular the Characterization data model. Laszlo Dobos contributed early implementations of the access protocol using the spectral archive at JHU.
Many thanks to all who contributed to the DAL survey among spectral data providers and consumers (Dolensky/Tody 2004): Ivo Busko, Mike Fitzpatrick, Satoshi Honda, Stephen Kent, Tom McGlynn, Pedro Osuna Alcalaya, Benoît Pirenne, Raymond Plante, Phillipe Prugniel, Enrique Solano, Alex Szalay, Francisco Valdes and Andreas Wicenec.
Parts of this protocol were adapted from the OpenGIS (Open Geospatial Consortium, Inc.) Web-Mapping Service (WMS) specification. In particular, the basic service elements and certain details of the use of the HTTP protocol to formulate requests and responses are patterned after the OpenGIS WMS service. Parts of the text of this specification were adapted directly from the WMS service specification.
Contents
1 Introduction
1.1 Architecture
1.2 Basic Usage
1.3 Basic Service Elements
1.3.1 Request Format
1.3.2 Parameters
1.3.3 Parameter Values
1.3.4 Error Response
1.4 Requirements for Compliance
1.4.1 Levels of Compliance
2 Concepts and Terminology
2.1 Dataset and Data Collection
2.2 Data Model
2.3 Data Representation
2.4 Virtual Data
2.5 Data Derivation
2.5.1 Data Source
2.5.2 Creation Type
2.6 Service Type
2.7 Services, Interfaces, and Protocols
2.8 Dataset Identifiers
2.9 Provenance
2.10 Data Association
2.11 UTYPEs and UCDs
3 SSA Operations
3.1 Introduction
3.2 Methods & Protocols
3.3 Future Extensions
3.3.1 GetCapabilities
3.3.2 StageData
3.3.3 GetAvailability
4 QueryData Operation (required)
4.1 Input Parameters
4.1.1 Mandatory Query Parameters
4.1.2 Recommended and Optional Query Parameters
4.1.3 Service-Defined Parameters
4.2 Query Response
4.2.1 Query Response Metadata
4.2.2 Types of Metadata
4.2.3 Query Metadata
4.2.4 Association Metadata
4.2.5 Access Metadata
4.2.6 Additional Service-Defined Metadata
4.2.7 Metadata Extension Mechanism
5 GetData (reserved)
6 Metadata Query
6.1 Metadata Request
6.2 Metadata Response
7 Data Retrieval
7.1 Access Reference URL
7.2 Data Format
7.3 Data Compression
7.4 Error Response
8 Basic Service Elements
8.1 Introduction
8.2 Version numbering and negotiation
8.2.1 Version number form and value
8.2.2 Version number changes
8.2.3 Appearance in requests and in service metadata
8.2.4 Version number negotiation
8.3 General HTTP request rules
8.3.1 Introduction
8.3.2 Reserved characters in HTTP GET URLs
8.3.3 HTTP GET
8.3.4 HTTP POST
8.4 General HTTP response rules
8.5 Numeric and boolean values
8.6 Output formats
8.7 Request parameter rules
8.7.1 Parameter ordering and case
8.7.2 Range-list parameters
8.7.3 Missing or null-valued parameters
8.8 Common request parameters
8.8.1 VERSION
8.8.2 REQUEST
8.8.3 Extended capabilities and operations
8.9 Service result
8.10 Error Response and Other Unsuccessful Results
8.10.1 Service Error
8.10.2 Overflow
8.10.3 Other Errors
Appendix A: Theoretical Spectral Access Use Case
Appendix B: Standard QueryData Query Response
Appendix C: Standard Metadata Query Response
Appendix D: SSA Data Model Summary
References
The Simple Spectral Access protocol (SSAP, SSA) defines a uniform interface to remotely discover and access simple 1-D spectra. SSA is based on a more general data model capable of describing tabular spectrophotometric data including time series and SEDs as well as 1-D spectra. Basic usage is similar to the Simple Image Access (SIA) protocol (Tody/Plante 2004) and the simple cone search (SCS) protocol for simple access to astronomical catalogs. Unlike these earlier interfaces, spectral data access via SSA may involve active transformation of data as stored externally into a standard data format and data model defined by SSA, in order to deal with the problem of heterogeneous spectral data formats as stored externally. SSA also defines much more complete metadata to describe the available datasets.
All the IVOA data access interfaces share the same basic interface, differing mainly in the type of data being accessed. A query is used for data discovery, and to negotiate with the service the details of the static or virtual (dynamically created) datasets to be retrieved. Subsequent data access requests can then be made to retrieve individual datasets of interest. SSA differs from some other data access interfaces in that a service may mediate not only dataset metadata, but the actual dataset itself, to allow a client to do detailed analysis on a spectrum without having to understand how it is represented externally. Direct access to native project data is also provided.
All of the second generation DAL interfaces share the same basic service profile, although services may define additional operations specific to the service. A single service may support multiple operations or methods to perform various functions. The current DAL interfaces use an HTTP GET-based interface to submit parameterized requests, with responses being returned as structured documents, e.g., FITS or VOTable. The service operations currently defined for SSA are the following:
Further operations are planned but not currently defined (3.3).
A spectrum conforming to the SSA (Spectrum) data model may be returned serialized in any of a number of different data formats, including VOTable, FITS binary table, and native XML. Comma or tab separated value (CSV) format may also be provided by implementations, but is not currently specified.
Although SSA is a complex interface, the most common usage can be quite simple. A query can be entered in a Web browser, viewing the results as XML in the browser and downloading selected spectra by a copy-paste operation on the given access reference URL. A simple query might search for 1-D spectra by position on the sky – the classic “cone search” type of query. More complex queries are little more complicated, merely adding additional query constraint parameters, e.g., to constrain the waveband or spectral resolution, or to find spectra by redshift.
In a simple case of a positional query the SSA query URL is very similar to that for SIA or SCS. For example,
Example:
The query response is a VOTable describing each candidate dataset as defined later in this document.
Dataset retrieval is then a simple matter of examining the query response, selecting the dataset or datasets to be retrieved, if any, and retrieving them by reading the document pointed to by the access reference (a URL) for the dataset. Interpretation of the returned spectrum dataset is the responsibility of the client application.
For a fully compliant SSA service, the data returned by the service will be in one of the SSA defined standard data formats, conformant to the SSA-defined Spectrum data model. Due to the need to mediate external data or support features such as format conversion or data subsetting, the service may compute the output dataset on demand; however, this is transparent to the client.
The basic form of a SSA service (or any other second generation DAL service) is specified in detail in section 8. In the current section we merely summarize the basic elements of a standard data service.
In general a service may implement multiple operations, such as queryData; altogether these define the interface to the service. Interfaces may change with time hence are versioned. It is possible for a given service instance to simultaneously expose multiple interfaces or versions of interfaces.
The SSA interface described in this document is based on a distributed computing platform (DCP) comprising Internet hosts that support the Hypertext Transfer Protocol (HTTP). Thus, the online representation of each operation supported by a service is composed as an HTTP Uniform Resource Locator (URL).
A request URL is formed by concatenating a baseURL with zero or more operation-defined request parameters. The baseURL defines the network address to which request messages are to be sent for a particular operation of a particular service instance on a particular server. Service operations generally share the same baseURL but this is not required.
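The concatenation of a baseURL with request parameters can be sketched as follows. This is a minimal illustration, not part of the specification: the endpoint `http://example.org/ssa` is hypothetical, the helper name `build_request_url` is invented for the example, and the POS and SIZE values are illustrative only (their value syntax is defined with the query parameters in section 4.1).

```python
from urllib.parse import urlencode


def build_request_url(base_url: str, params: dict[str, str]) -> str:
    """Append request parameters to a baseURL as an HTTP GET query string."""
    if not params:
        return base_url
    # A baseURL may already end in "?" or "&", or carry a query part of its own.
    if base_url.endswith(("?", "&")):
        sep = ""
    elif "?" in base_url:
        sep = "&"
    else:
        sep = "?"
    # urlencode percent-encodes characters such as "," in parameter values.
    return base_url + sep + urlencode(params)


# Hypothetical service endpoint and illustrative parameter values.
url = build_request_url(
    "http://example.org/ssa",
    {"REQUEST": "queryData", "POS": "180.0,-30.5", "SIZE": "0.05"},
)
print(url)
```

The rules for reserved characters in HTTP GET URLs are given in section 8.3.2; the default percent-encoding applied here is simply the conservative choice of the Python standard library.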
Parameters may appear in any order. If the same parameter appears multiple times in a request the operation is undefined (if alternate values for a parameter are desired the range-list syntax may be used instead). Parameter names are case-insensitive. Parameter values are case-sensitive unless defined otherwise in the description of an individual parameter.
All operations define the following standard parameters:
REQUEST The request or operation name, e.g., “queryData” (mandatory).
VERSION The version number of the interface (optional).
The values of both the REQUEST and VERSION parameters are case-insensitive. Although SSA V1.0 defines only a single queryData operation, use of REQUEST is mandatory to provide upwards compatibility with future versions.
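A service-side reading of these parameter rules might look like the sketch below. The function name `parse_request` is hypothetical, and rejecting a repeated parameter is a choice made for this example only; the specification merely states that the result of a repeated parameter is undefined.

```python
from urllib.parse import parse_qsl


def parse_request(query_string: str) -> dict[str, str]:
    """Return request parameters keyed by upper-cased parameter name."""
    params: dict[str, str] = {}
    for name, value in parse_qsl(query_string, keep_blank_values=True):
        key = name.upper()  # parameter names are case-insensitive
        if key in params:
            # Assumption of this sketch: treat a repeated parameter as an error.
            raise ValueError(f"parameter {key} given more than once")
        params[key] = value  # values stay case-sensitive by default
    # REQUEST and VERSION values are case-insensitive, so normalise them.
    for key in ("REQUEST", "VERSION"):
        if key in params:
            params[key] = params[key].lower()
    if "REQUEST" not in params:
        raise ValueError("REQUEST is mandatory")
    return params


print(parse_request("request=QueryData&Pos=180.0,-30.5&BAND=1E-7/3E-6"))
```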
A given service instance may support multiple versions of the SSA interface, which includes both the input parameters and the query response with all of its complex metadata, and by default the service assumes the highest standard version which is implemented (access to any experimental versions supported by a service requires explicit specification of the version by the client). Explicit specification of the interface version assumed by the client is necessary to ensure against a runtime version mismatch, e.g., if the client caches the service endpoint but a newer version of the service is subsequently deployed. If desired the client can omit the VERSION parameter to disable runtime version checking, and default to the highest version standard interface implemented by the service.
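The default-version behaviour described above can be illustrated with a short sketch. The names `select_version`, `standard`, and `experimental` are hypothetical, versions are compared by exact match, and raising `ValueError` on a mismatch is an assumption of this example; the actual negotiation rules are those of section 8.2.4.

```python
from typing import Optional, Sequence


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '1.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def select_version(
    requested: Optional[str],
    standard: Sequence[str],
    experimental: Sequence[str] = (),
) -> str:
    """Pick the interface version used to serve one request."""
    if requested is None:
        # No VERSION given: default to the highest standard version.
        return max(standard, key=parse_version)
    wanted = parse_version(requested)
    # Experimental versions are reachable only by explicit request.
    for candidate in (*standard, *experimental):
        if parse_version(candidate) == wanted:
            return candidate
    # Assumption of this sketch: an unsupported version is reported as an error.
    raise ValueError(f"version {requested} is not supported")


print(select_version(None, ["1.0", "1.1"], ["2.0"]))
print(select_version("2.0", ["1.0", "1.1"], ["2.0"]))
```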
All other request parameters are defined separately for each operation.
Integer numbers are represented as defined in the specification of integers in XML Schema Datatypes. Real numbers are represented as specified for double precision numbers in XML Schema Datatypes. Sexagesimal formatting is not permitted, either for parameter input or in output metadata, other than in ISO 8601 formatted time strings (sexagesimal format is fine for a user interface but inappropriate for a lower level machine interface, where it only complicates things).
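Because sexagesimal values are not accepted by the interface, a client that takes sexagesimal input from a user has to convert it before building a request. The sketch below shows one way to do so; the function name and the colon-separated input format are assumptions of this example, not something defined by SSA.

```python
def sexagesimal_to_degrees(text: str, hours: bool = False) -> float:
    """Convert 'DD:MM:SS.s' (or 'HH:MM:SS.s' when hours=True) to decimal degrees."""
    text = text.strip()
    # Take the sign from the string so that values such as '-00:30:00' work.
    sign = -1.0 if text.startswith("-") else 1.0
    fields = [float(f) for f in text.lstrip("+-").split(":")]
    if len(fields) > 3:
        raise ValueError(f"too many fields in {text!r}")
    fields += [0.0] * (3 - len(fields))  # allow 'DD' or 'DD:MM'
    whole, minutes, seconds = fields
    value = whole + minutes / 60.0 + seconds / 3600.0
    if hours:
        value *= 15.0  # one hour of right ascension is 15 degrees
    return sign * value


# User-interface input converted to the decimal form used in requests.
print(sexagesimal_to_degrees("12:00:00", hours=True), sexagesimal_to_degrees("-30:30:00"))
```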
SSA defines a special range-list format for specifying numerical ranges or lists of ranges as parameter values. For example, “1E-7/3E-6;source” could specify a spectral bandpass defined in the rest frame of the source. The syntax supports both open and closed ranges. Ranges or range lists are permitted only when explicitly indicated in the definition of an individual parameter. For a full description of range list syntax refer to section 8.7.2.
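As a rough illustration of the range-list idea, the sketch below parses numeric values such as the bandpass example given above. It assumes that list elements are separated by commas, that "/" separates the two ends of a range, that an omitted end denotes an open range, and that a qualifier follows a ";". These assumptions are inferred from the example and should be checked against section 8.7.2, which is authoritative. The names `Range` and `parse_range_list` are hypothetical.

```python
from typing import NamedTuple, Optional


class Range(NamedTuple):
    lower: Optional[float]  # None means the range is open below
    upper: Optional[float]  # None means the range is open above


def parse_range_list(text: str) -> tuple[list[Range], Optional[str]]:
    """Split a numeric range-list into ranges and an optional qualifier."""
    body, _, qualifier = text.partition(";")
    ranges = []
    for item in body.split(","):
        if "/" in item:
            low, _, high = item.partition("/")
            ranges.append(Range(float(low) if low else None,
                                float(high) if high else None))
        else:
            value = float(item)  # a single value is a degenerate range
            ranges.append(Range(value, value))
    return ranges, (qualifier or None)


print(parse_range_list("1E-7/3E-6;source"))
```

Only numeric elements are handled here; non-numeric range-lists (for example, time strings) would need a different element parser.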
In the case of an error, service operations should return a VOTable containing an INFO element with name QUERY_STATUS and the value set to “ERROR”. More fundamental service or protocol errors may however result in an HTTP level error, hence a client program should be prepared to handle either response. A null query, that is a queryData which does not find any data, is not considered an error. More information on error responses is given in section 8.10.
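A client-side check for the QUERY_STATUS INFO element could be sketched as follows. The sample document is hypothetical and much smaller than a real query response, and the function name `query_status` is invented for this example. HTTP-level errors would have to be handled separately by whatever HTTP client is used.

```python
import xml.etree.ElementTree as ET


def query_status(votable_xml: str) -> tuple[str, str]:
    """Return (status value, message text) from the QUERY_STATUS INFO element."""
    root = ET.fromstring(votable_xml)
    for elem in root.iter():
        # Strip any XML namespace so the check works with or without one.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "INFO" and elem.get("name") == "QUERY_STATUS":
            return elem.get("value", ""), (elem.text or "").strip()
    raise ValueError("no QUERY_STATUS INFO element found")


# Hypothetical, heavily abbreviated error response.
sample = """<VOTABLE version="1.1">
  <RESOURCE type="results">
    <INFO name="QUERY_STATUS" value="ERROR">POS parameter missing</INFO>
  </RESOURCE>
</VOTABLE>"""

status, message = query_status(sample)
if status == "ERROR":
    print("service reported an error:", message)
```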
The keywords “must”, “required”, “should”, and “may” as used in this document are to be interpreted as described in the W3C specifications (IETF RFC 2119). Mandatory interface elements are indicated as must, recommended interface elements as should, and optional interface elements as may or simply “may” without the bold face font.
Sometimes the extent to which a given interface element is required depends upon the mode of operation of the service. For example, a service which performs spectral extraction must implement the APERTURE query parameter, but it is not used for other types of SSA services, and for these need not be implemented.
In order to be minimally compliant a service must implement all elements of the SSA protocol identified as “must” in this document. In brief, the minimal service implementation includes the following:
1. The SSA query method must implement the HTTP GET interface, returning the query response encoded as a VOTable document. At least the POS, SIZE, TIME, BAND, and FORMAT query parameters must be supported by the service (regardless of whether these are defined for the data being accessed). The query response must include all metadata fields identified as mandatory in the protocol.
2. The direct URL-based getData method must be provided capable of returning data in at least one of the SSA-compliant data formats (VOTable is suggested if only one format is supported).
3. The “FORMAT=METADATA” metadata query feature must be provided to return service metadata encoded as defined herein.
If a service cannot return data which is SSA (i.e., Spectrum DM) compliant, it is still useful to implement a service which provides a SSA-compliant query method but which returns native or external data. Such a service is said to be query compliant if the query operation is at least minimally compliant. The ability to return native project data is always desirable, as this provides the maximum transfer of information from the project; however, the ability to return SSA (Spectrum DM) compliant data is essential for transparent multiwavelength data analysis, and hence is the primary requirement. Legacy data providers are encouraged to provide data both in their proprietary legacy data format and in the Spectrum DM format, leaving the choice of which is more useful for analysis up to the client application and the user.
A service is said to be fully compliant if, in addition to the functionality required to be minimally compliant, the service implements all the “should” elements of the interface defined herein.
A top of the line service will be fully compliant plus will implement some of the optional (“may” provide) elements of the interface. For example the service may support additional query parameters or may return additional metadata; the service may provide access to native data as well as SSA-compliant data, or may be capable of returning data in any supported standard data format requested by the client.
The term dataset as used in this specification normally refers to a primary dataset such as an individual spectrum, image, table, and so forth, i.e., an individual data object usually including associated metadata. A complex dataset is some logical association or aggregation of primary datasets, often of different types, possibly with additional high level metadata describing the association. In common usage, dataset can refer to either of these. A data collection is a collection of primary or complex datasets, such as a survey data release (e.g., "SDSS DR6") or an instrumental data collection from an individual observatory instrument.
SSA consists of both an access protocol and interface, and an underlying data model describing the data to be accessed. The term data model as used here refers to a logical model for the data detailing the decomposition of a complex dataset into simpler elements, including specifying the meaning of each element, the relationships between elements, the metadata used to describe the data elements and the overall dataset, and the concepts upon which the data model is based. In this document we refer to the underlying data model interchangeably as the SSA data model or the Spectrum or spectral data model. The data model used in SSA is described in (McDowell, Tody, et al. 2007).
Explicitly defining the data model assumed by a data object is important for a variety of reasons. Doing so helps greatly to document the structure and meaning of the data. Data analysis software has to understand data at a fundamental level in order to function correctly.
Data model mediation - the process of transforming data from some externally-defined data model to a prescribed data model (the SSA data model in our case) - makes it possible for a client application to deal uniformly with external data without having to understand the idiosyncratic details of each external data collection. SSA does data model mediation on the fly, at data access time, in the service used to publish a data collection to the VO. A data publishing service is written for a specific data collection by the creators or curators of the data who understand the data well, and may thereafter be accessed by any number of independently written client applications; hence mediation to a standard model is best performed by the service.
If more detailed knowledge of a specific data collection is required than is possible using a standard model, direct pass-through of the native project data is also possible. This is an important capability as it ensures that nothing has been lost in the translation, and it provides for direct communication between the client application (or user) and the data provider. Nonetheless, for general automated multiwavelength data analysis, if we provide only access to native project data, this puts the burden of interpretation of individual project datasets completely on the data consumer (e.g., the client application), and we feel that the data provider has a better understanding of their data, and is generally better equipped to make this translation. Hence data should always be provided in a form compliant to the SSA/Spectrum data model if possible, with pass-through of native project data provided as well where possible.
A data model defines the logical content of data, but says nothing about how the data is represented externally. The same data object may be represented externally in many different ways, e.g., as a FITS file or VOTable, as a direct XML serialization, in a RDBMS, and so forth. So long as the data model does not change, and the data representation is expressive enough, data may be transformed from one representation to another without loss of information. If transformation between different data models is required, some loss of information may occur. This can happen, for example, during mediation of external data to a known data model by a SSA service.
In the most general case SSA uses a container-component approach to represent datasets. In this case a general container such as VOTable or FITS is used to represent a Spectrum object. A similar approach is used for the SSA query response, which is returned as a VOTable. The container is used to aggregate component data models which are associated in some fashion to model more complex objects such as a spectrum. The advantage of this approach is flexibility, in that there is no fixed structure for the overall dataset, and extensibility, as it is easy to add custom components to describe the details of a specific data collection while conforming to the standard core model.
Application programs typically manipulate a data object by directly accessing the elements of the data model via some language-specific API. UTYPE tags are used to provide a uniform means to identify the elements of a data model in any language or environment. For example, given the component data model “DataID”, the UTYPE “DataID.Title” identifies the data model field containing the title string for the dataset; “DataID.Collection” identifies the parent data collection, and so forth.
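One way to picture a UTYPE is as a path into the data model. The sketch below resolves a dotted UTYPE against a nested dictionary; the dictionary layout, the sample values, and the function name `get_by_utype` are assumptions made purely for illustration and do not reflect any particular API or serialization.

```python
from typing import Any


def get_by_utype(model: dict[str, Any], utype: str) -> Any:
    """Resolve a dotted UTYPE such as 'DataID.Title' against a nested dict."""
    node: Any = model
    for part in utype.split("."):
        node = node[part]  # raises KeyError if the element is absent
    return node


# Hypothetical in-memory representation of part of a Spectrum object.
spectrum = {
    "DataID": {
        "Title": "Example spectrum",
        "Collection": "Example survey DR1",
    }
}

print(get_by_utype(spectrum, "DataID.Title"))
print(get_by_utype(spectrum, "DataID.Collection"))
```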
A virtual dataset is one which can be described, but which may not physically exist until it is accessed, at which time it is created on the fly by the service. A typical example is a cutout (subset) of an image or spectrum. Where general distributed multiwavelength data analysis is concerned, most data access in the VO is necessarily to virtual data. Physical datasets can also be accessed, but this is a far less powerful technique as physical datasets are often too large to transmit efficiently over the network, particularly when only a small portion of the data is needed, and capabilities such as mediation to a standard model or transformations of various kinds are not possible.
When a query is made to a SSA service which can return virtual data, the service computes the parameters of any virtual datasets it can generate to satisfy the query. What can be generated depends upon what the client has requested, the input data available to the service, and the capabilities of the service. The metadata returned in the query response will describe the virtual dataset and its relationship to any parent dataset or datasets. The access reference is in effect a token to be passed back to the service to generate the virtual dataset. The client can either access the virtual data (in which case it is realized by the service, and returned), or further refine the query to more finely specify the data to be returned by the service.
Data can come from a variety of sources, and may go through various types of processing, including by the data access service itself, before being delivered to a client analysis application. It is important for analysis to understand the origin of the data and what processing it has undergone. To address this issue we introduce two new concepts, data source and creation type.
The data source specifies where the data originally came from, that is, the data collection to which the service provides access. The following values are currently defined:
survey
A survey dataset, which typically covers some region of observational parameter space in a uniform fashion, with as complete as possible coverage in the region of parameter space observed.
pointed
A pointed observation of a particular astronomical object or field. Typically these are instrumental observations taken as part of some PI observing program. The data quality and characteristics may be variable, but the observations of a particular object or field may be more extensive than for a survey.
custom
Data which has been custom processed, e.g., as part of a specific research project.
theory
Theory data, or any data generated from a theoretical model, for example a synthetic spectrum.
artificial
Artificial or simulated data. This is similar to theory data but need not be based on a physical model, and is often used for testing purposes.
The creation type describes the process used to produce the dataset as returned by the service, from the data source. Typically this describes only the processing performed by the data service, but it could describe some additional earlier processing as well, e.g., if data is partially precomputed. The creation type is especially important for virtual data and for data which is derived from the parent data set by some complex form of processing. The following values are currently defined:
archival
The entire archival or project dataset is returned. Transformations such as metadata or data model mediation or format conversions may take place, but the content of the dataset is not substantially modified (e.g., all the data is returned and the sample values are not modified).
cutout
The dataset is subsetted in some region of parameter space to produce a subset dataset. Sample values are not modified, e.g., cutouts could be recombined to reconstitute the original dataset.
filtered
The data is filtered in some fashion to exclude or alter portions of the dataset, e.g., passing only data in selected regions along a measurement axis, or processing the data in a way which recomputes the sample values, e.g., due to interpolation or flux transformation. Filtering is often combined with other forms of processing, e.g., projection.
mosaic
Data from multiple non- or partially-overlapping datasets are combined to produce a new dataset.
projection
Data is geometrically warped or dimensionally reduced by projecting through a multidimensional dataset.
spectralExtraction
Extraction of a spectrum from another dataset, e.g., extraction of a spectrum from a spectral data cube through a simulated aperture.
catalogExtraction
Extraction of a catalog of some form from another dataset, e.g., extraction of a source catalog from an image, or extraction of a line list catalog from a spectrum (not valid for a SSA service).
The full creation type may involve more than one of these operations, for example, both projection and filtered, or both spectral extraction and filtered.
This list is by no means complete in general astronomical data processing terms, but is intended to express only the types of operations which might take place during VO data access, where subsetting, filtering, projection, spectral extraction, etc., are all defined operations. Other values may be added in the future. The creation type is not intended to describe the processing done to produce the data collection itself, which the service is used to access.
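A client might model the data source and creation type vocabularies as enumerations, as in the sketch below. The class and function names are hypothetical. Because other values may be added in the future, a real client would probably want to tolerate unknown values rather than reject them as this example does; how several creation types are combined in a single metadata field is not covered here.

```python
from enum import Enum
from typing import Iterable


class DataSource(Enum):
    SURVEY = "survey"
    POINTED = "pointed"
    CUSTOM = "custom"
    THEORY = "theory"
    ARTIFICIAL = "artificial"


class CreationType(Enum):
    ARCHIVAL = "archival"
    CUTOUT = "cutout"
    FILTERED = "filtered"
    MOSAIC = "mosaic"
    PROJECTION = "projection"
    SPECTRAL_EXTRACTION = "spectralExtraction"
    CATALOG_EXTRACTION = "catalogExtraction"


def parse_creation_types(values: Iterable[str]) -> set[CreationType]:
    """Map creation type strings to enum members (ValueError if unknown)."""
    return {CreationType(value) for value in values}


# A full creation type may involve more than one operation.
print(parse_creation_types(["projection", "filtered"]))
print(DataSource("theory"))
```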
Not all SSA services are of the same type: services are further classified by their subtype, indicating how they generate the spectra returned by the service. The subtype of a SSA service is similar to the dataset creation type as described in section 2.5.2; usually the creation type and the SSA service subtype are the same, but this is not always the case. A simple service which returns only entire archival spectra is an “archival” SSA service. A service which can return subregions of larger spectra is a “cutout” service. A SSA service which can combine multiple input spectra is a “mosaic” service (a mosaic service can also do cutouts if presented with a sufficiently small spectral bandpass). A SSA service which dynamically generates spectra from more fundamental data, e.g., a spectral data cube or event list, is a “spectralExtraction” service.
A service operates at a defined service endpoint (e.g., an Internet URL, often called a baseURL), and implements one or more predefined client-server interfaces. The service interface consists of one or more service operations, also known as requests, or methods. Each operation accepts as input zero or more request parameters. The details of how a client talks to a service interface over a given transport protocol (e.g., HTTP) defines the protocol used to interact with the service.
A dataset identifier is an identifying name for a dataset that is globally unique within the VO and is compliant with the URI syntax rules (IETF RFC 2396). It consists of an IVOA Identifier (Plante et al. 2005), followed by a pound sign ("#"), and a local identifier. The IVOA Identifier defines a namespace (for example a data collection) which may contain any number of individual datasets, each with its own unique local identifier. The local identifier consists of one or more legal URI characters, and is a name given by the creator or publisher of the dataset which identifies an individual dataset within the namespace defined by the IVOA Identifier.
In ABNF (IETF RFC 2234) format, the dataset identifier is defined as:
dataset-id = ivoa-id "#" uric
where ivoa-id is a legal IVOA identifier in URI format (uri-form in [Identifiers]) and uric is the set of legal URI characters (uric in (IETF RFC 2396)).
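A minimal sketch of splitting a dataset identifier into its two parts is given below. It checks only the overall structure (IVOA identifier, pound sign, non-empty local identifier); it does not validate the IVOA identifier against [Identifiers] or check the local identifier against the uric character set. The function name and the example identifier are hypothetical, and splitting at the first "#" assumes that the IVOA identifier itself contains no pound sign.

```python
def split_dataset_id(dataset_id):
    """Split 'ivoa-id#local-id' into (ivoa_id, local_id); structural check only."""
    # Assumption: the IVOA identifier contains no '#', so the first '#' is the separator.
    ivoa_id, sep, local_id = dataset_id.partition("#")
    if not sep or not ivoa_id or not local_id:
        raise ValueError(f"not of the form ivoa-id '#' local-id: {dataset_id!r}")
    return ivoa_id, local_id

# Hypothetical identifier, for illustration only.
namespace, local = split_dataset_id("ivo://example.org/collection#spec-0001")
```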
To provide consistency with the IVOA Identifier standard, the rules for comparing dataset identifiers are the same as for IVOA identifiers: two dataset identifiers shall be considered as referring to the same dataset "if a case-insensitive, character-by-character comparison indicates that they are identical." That is, "apart from a transformation to handle case-insensitive comparisons, no other normalizing transformations shall be necessary" to test for equivalence [Identifiers].
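The comparison rule can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the identifiers are plain ASCII strings, as legal URI characters are. Note that no other normalization is applied: percent-escapes, for example, are compared literally.

```python
def same_dataset(id_a, id_b):
    """Case-insensitive, character-by-character comparison of two dataset identifiers."""
    # Lower-casing is the only transformation; no escaping or path normalization is done.
    return id_a.lower() == id_b.lower()

# Hypothetical identifiers differing only in case refer to the same dataset.
match = same_dataset("ivo://Example.org/coll#Spec1", "ivo://example.org/COLL#spec1")
```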
As we shall see in section 4.2.5.5, we define several types of dataset identifiers, including CreatorDID, PublisherDID, and DatasetID. The CreatorDID is the dataset identifier (if any) assigned by the creator of the dataset, for example a survey project or observatory. This does not change, even if the dataset is published in multiple locations. CreatorDIDs can be assigned at dataset creation time, before the data has been published to the VO, but will be globally unique so long as the creating entity uses a registered IVOA Identifier for the namespace. The PublisherDID is the dataset identifier assigned by a publisher; this DID is unique within the publisher's name space, but has no meaning otherwise. A special case of a PublisherDID is a DatasetID, which is a globally unique dataset identifier assigned by a publisher to attempt to index data from many sources, for example an ADS dataset identifier.
When data is published to the VO it should always be possible for the publisher to assign a unique PublisherDID. A CreatorDID may or may not be assigned by the dataset creator (legacy data at least is unlikely to have one). We recommend assigning a CreatorDID, as this can easily be done in an automated fashion at dataset creation time, as one might assign a serial number, and provides a globally unique way to identify any dataset. In general a global data indexing service will only index selected datasets, e.g., those referenced in journal articles, so while a DatasetID can be useful for things such as linking datasets to journal articles, many datasets may not have registered DatasetIDs, and in principle there can be multiple publishing authorities registering DatasetIDs.
The combination of a data source with a creation type provides us with a primitive capability for describing the provenance of a dataset, i.e., where it came from, and how it was produced. This is important because SSA and other DAL services can generate virtual data products where complex processing may be performed at access time.
To be able to describe the provenance of a virtual data product we need one additional concept, the dataset identifier of the parent dataset, as assigned by the entity which created the dataset (typically a survey project, observatory, modeling program, etc.). Dataset identifiers are discussed in more detail in section 2.8.
Given a virtual data product we can then say how the data product was derived from the parent dataset or datasets (the creation type), identify the parent dataset (the creator-assigned dataset ID), and the origin and type of data from which the virtual data product was derived (data source, collection, and so forth). In the more complex cases such as a mosaic a virtual data product may have multiple parent datasets.
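The pieces of this simple provenance model can be pictured as a small record. The sketch below is only an illustration: the class name, field names, and example values are hypothetical and are not taken from the SSA data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Provenance:
    """Illustrative record of where a (possibly virtual) dataset came from."""
    creation_type: str        # how the product was derived from its parent(s)
    data_source: str          # origin and type of the underlying data
    collection: str           # data collection the parent(s) belong to
    parent_creator_dids: List[str] = field(default_factory=list)  # creator-assigned dataset IDs

    def has_multiple_parents(self):
        # True for the more complex cases, such as a mosaic built from several spectra.
        return len(self.parent_creator_dids) > 1

# Hypothetical values, for illustration only.
prov = Provenance("mosaic", "survey", "ExampleSurvey",
                  ["ivo://example.org/survey#spec-0001",
                   "ivo://example.org/survey#spec-0002"])
```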
If a process which produces data products is complex enough, with many inputs, ultimately the result is a new data collection, but in most runtime data access scenarios the simple provenance model presented here should be enough to identify a virtual data product or other dataset and how it was produced.
There are many cases where it is desirable to be able to associate multiple datasets, for example to model a multi-spectral observation such as an Echelle, or to group datasets that represent the same data made available in several different data formats. Spectra of the member galaxies in a cluster might be a completely different type of association. In the case of images, a multi-band observation could be viewed as an association of several independent images, each in a single spectral band and with some shared observational metadata.
The approach taken in SSA to address this problem of complex data is to keep the basic data objects as simple as possible but use association to describe more complex entities. Hence, an Echelle observation could be viewed as a collection of independently accessible 1-D spectra which are logically associated. The spectra would include the individual Echelle orders and possibly an overall combined high resolution spectrum. Some extension metadata might also be supplied to give additional information describing the overall association. The individual spectra would be usefully accessible without requiring that a client application understand the complex instrument (an Echelle spectrograph) which produced the data; however, the more complex view would optionally be accessible as well.
Associations are described in the SSA query response since this has the ability to relate multiple datasets. How this is done will be described further in the specification of the SSA query response, but the main technique is to define a new query response field Association.ID for which all members of an association share the same value. An association key may also be provided for each member of the association to uniquely identify their role within the association (e.g., the Echelle order in our example above). Finally, an association Type field or param tells what type of association this is. The ID may be used to link to extension metadata providing further information describing the specific extension.
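On the client side, recovering associations from a query response amounts to grouping rows that share an association ID. The sketch below assumes the response has already been parsed into a list of dictionaries keyed by field name; that representation, the field name Association.Key, and the sample rows are assumptions for illustration, since the actual response is a VOTable whose fields are defined in the query response specification.

```python
from collections import defaultdict

def group_by_association(rows):
    """Group query response rows by their shared Association.ID value."""
    groups = defaultdict(list)
    for row in rows:
        assoc_id = row.get("Association.ID")
        if assoc_id:  # rows that belong to no association are skipped
            # Keep the member's key (e.g., the Echelle order) alongside the row.
            groups[assoc_id].append((row.get("Association.Key"), row))
    return dict(groups)

# Hypothetical rows: two Echelle orders in one association, plus an unassociated spectrum.
rows = [
    {"Association.ID": "echelle-001", "Association.Key": "order-1", "title": "Order 1"},
    {"Association.ID": "echelle-001", "Association.Key": "order-2", "title": "Order 2"},
    {"title": "Standalone spectrum"},
]
groups = group_by_association(rows)
```

A client that does not understand associations can ignore these fields and treat every row as an independent spectrum.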
A UTYPE is a fixed string which uniquely identifies a field of a data mo