Hi Pat -
This is very useful; you have identified some issues that need more careful thought.
> ** building a search engine (SE)
>
> To elaborate further, a useful SE on SSA and SIA would also need to find the
> following things for each record:
>
> 1. unique identifier that could be used sometime later to get the
> AccessReference (ie to get the data or let a user get the data):
>
> - publisher ID is tied to the specific service, so one would need to keep
> the tuple of <resourceID, pubID> where resourceID lets you find the same
> service in the registry and pubID lets you find the record within that
> service.... Correct?
>
> 2. a globally unique "dataset ID" could be used, but the SE would still
> need to know which service(s) can deliver the record and data... plus
> specific implementations of a SE might need specific things from the
> record not supplied by everyone that can deliver the dataset (eg. I need
> spatial support, time bounds, and energy bounds to build my search engine -
> someone else might need more or less)....
There are two related aspects to this problem depending upon whether we are building a generic index or are building lists of data objects targeting some specific type of analysis:
Indexing static data --
By SE I think we mean a global indexing service, which indexes "atlas" datasets belonging to some collection (i.e., static files or records in some archive). The SE would restrict its queries to "atlas" or "pointed" services (or whatever we decide to call these in the future). This is distinct from services which compute virtual data, where what you see depends upon what you ask for.
For this case I think what you suggest is probably the way to go. The SE needs to record the resourceID of the service, and the publisher dataset ID (pubID or whatever we decide to call it) of the specific dataset as assigned by the service.
CreatorID cannot be used for this purpose because 1) we can't guarantee that there is one (not all data collections assign CreatorIDs), and 2) in this case we want to index specific dataset instances from specific services. However, if there is a CreatorID, it can be used for data discovery or to query the SE to find indexed replicas.
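As a rough sketch of the bookkeeping this implies (Python; the class, field, and method names are purely illustrative and not taken from any spec), the SE could key each entry on the <resourceID, pubID> pair and carry the CreatorID only as an optional attribute for finding replicas:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class IndexKey:
    resource_id: str  # locates the service in the registry
    pub_id: str       # locates the dataset within that service


@dataclass
class IndexEntry:
    key: IndexKey
    creator_id: Optional[str] = None  # optional: not every collection assigns one


class DatasetIndex:
    """Hypothetical in-memory index keyed on (resourceID, pubID)."""

    def __init__(self) -> None:
        self._entries: Dict[IndexKey, IndexEntry] = {}

    def add(self, resource_id: str, pub_id: str,
            creator_id: Optional[str] = None) -> IndexKey:
        key = IndexKey(resource_id, pub_id)
        self._entries[key] = IndexEntry(key, creator_id)
        return key

    def replicas(self, creator_id: Optional[str]) -> List[IndexKey]:
        # Entries without a CreatorID can never be matched as replicas.
        if creator_id is None:
            return []
        return [e.key for e in self._entries.values()
                if e.creator_id == creator_id]
```

The point of the design is that the primary key never depends on the CreatorID, so datasets without one are still indexed.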
Indexing virtual data --
A similar issue came up recently in connection with persistent virtual directories, where a data discovery client application builds a list of data products targeting some specific type of analysis, and comes back sometime later to access them. This is a different case, as here we want to deal with virtual data - we are building a filtered-down list of data objects to be used for a specific analysis, and we may have many such lists. In this case the IDs will not work in general, since multiple virtual data products (e.g., cutouts) may be generated from the same atlas dataset, and a virtual data product may derive from multiple atlas datasets.
One way to address this, and to solve the persistence problem, would be for the service to represent a virtual data collection, assigning persistent CreatorIDs to the virtual data it can generate (such an ID would probably point to a persistent database record which tells the service how to generate the virtual data product). However, this is probably too complex, at least for the moment. The access reference generated by a service already tells the service how to generate a virtual data product. Perhaps to deal with persistence we just need to be more rigorous about specifying the time to live for an access reference (the old SIA spec already includes this, but I don't think current services have bothered to implement it).
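To illustrate the time-to-live idea from the client side (a minimal Python sketch; the field names and the notion that the TTL arrives as a number of seconds are assumptions for illustration, not what the SIA spec defines), a client holding a list of access references could check each one before reuse and re-query the service when it has expired:

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedAccessRef:
    url: str
    issued_at: float    # epoch seconds when the service returned the URL
    ttl_seconds: float  # hypothetical time to live advertised by the service

    def is_valid(self, now: Optional[float] = None) -> bool:
        # The reference is usable strictly before issued_at + ttl_seconds.
        now = time.time() if now is None else now
        return now < self.issued_at + self.ttl_seconds
```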
> To support an SE, "mtime" needs to be a query parameter of the form
> mtime=MIN,MAX with support for mtime=MIN, (for >=) and it has to be part
> of each record on output. Personally I would like to see these as REQUIRED.
Yes, this looks reasonable, and is consistent with the current spec.
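For concreteness, here is a small Python sketch of how a service might interpret the two forms mentioned above, mtime=MIN,MAX and mtime=MIN, (open-ended). The ISO 8601 timestamp format and the function names are assumptions on my part, not something the spec text above settles:

```python
from datetime import datetime
from typing import Optional, Tuple

Bounds = Tuple[datetime, Optional[datetime]]


def parse_mtime(value: str) -> Bounds:
    """Parse 'MIN,MAX' or 'MIN,' into (min, max); max=None means no upper bound."""
    parts = value.split(",")
    if len(parts) != 2:
        raise ValueError("expected 'MIN,MAX' or 'MIN,'")
    lo, hi = parts[0].strip(), parts[1].strip()
    if not lo:
        raise ValueError("MIN is required")
    return datetime.fromisoformat(lo), (datetime.fromisoformat(hi) if hi else None)


def in_range(mtime: datetime, bounds: Bounds) -> bool:
    """True if mtime >= MIN and, when MAX is given, mtime <= MAX."""
    lo, hi = bounds
    return mtime >= lo and (hi is None or mtime <= hi)
```

An SE doing incremental harvesting would then issue mtime=LAST_HARVEST, and keep the largest mtime seen in the returned records as the starting point for the next pass.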
> ** using/getting AccessReference
>
> In addition, if I build an SE that stores <resourceID,pubID> then I
> will also like to have a fast way to convert them into AccessReference
> (URLs). I'm assuming the AccessReference one gets from the query is
> currently valid but not guaranteed to be valid indefinitely (publishers
> may want/need to change data delivery, which I don't think should mandate
> changing the modification time). Specifically, it would be nice to be
> able to pass a list of pubID values to a service and get one response,
> rather than have to issue separate queries and get one response (VOTable)
> per pubID with one record each. With http get, the length of the list
> would be limited, of course.
>
> Logically, an SE will need pubID as a REQUIRED query and output
> parameter. List support is an optimisation.
We already thought of this, which is why SSA permits a query by ID. Markus's suggestion of changing the query-by-ID parameter to permit a list of IDs looks reasonable. This approach does not scale well but is simple, and probably adequate for the moment.
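Since an HTTP GET limits how long the list can be, a client would have to split a long list of IDs across several requests. A rough Python sketch follows; the parameter name "ID", the comma separator, and the 2000-character URL limit are all assumptions for illustration, not agreed syntax:

```python
from typing import Iterable, List
from urllib.parse import urlencode


def batch_id_queries(base_url: str, pub_ids: Iterable[str],
                     max_url_len: int = 2000, param: str = "ID") -> List[str]:
    """Pack IDs into as few GET URLs as possible without exceeding max_url_len."""

    def make_url(ids: List[str]) -> str:
        # urlencode percent-escapes the commas and any unsafe characters.
        return base_url + "?" + urlencode({param: ",".join(ids)})

    urls: List[str] = []
    batch: List[str] = []
    for pid in pub_ids:
        candidate = batch + [pid]
        if batch and len(make_url(candidate)) > max_url_len:
            # Adding this ID would overflow: flush the batch, start a new one.
            urls.append(make_url(batch))
            batch = [pid]
        else:
            batch = candidate
    if batch:
        urls.append(make_url(batch))
    return urls
```

A single ID that is too long on its own is still sent in a URL by itself, so the limit is best-effort in that corner case.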
> I really hope this can get into SSA 1.0 and hence SIA 1.1,
I don't see any problem. The main issue has to do with the precise semantics of the IDs, and what we decide to call them.