Fw: A suggested revision for UCDs

From: Robert Hanisch <hanisch-at-stsci.edu>
Date: Wed, 22 Oct 2003 13:15:37 -0400


Here is a discussion that Tom and I had off-list, but I think a number of points of more general interest are raised. Warning -- it is quite long!

Bob

Hi Bob,

Thanks for the review and comments. I'm particularly interested in the areas that were unclear. It seemed to me that I needed to actually put the ideas out where I could get some detailed reactions. A fair number of typo issues were addressed in the version I uploaded to the Twiki and announced to the UCD, DM and DAL groups. Haven't heard of any reaction.

I've responded to your comments below (there is a lot of detail, but I thought it useful to think these things through).

Tom

Robert Hanisch wrote:
> Hi Tom. I read through your revised UCD document this evening. Phew.
> There is much in it I like, much I don't, and much I don't follow. Perhaps
> the two latter categories mix together.
>
> I guess my biggest problem is that the roles of concept, attribute, and
> modifier are partly defined by syntax (where they appear in the string) and
> partly by having to know what names (em, pos, flux) have been allocated to
> which category. This seems very arbitrary (and very confusing) to me.
> Although I have never written a parser in my life, it looks to me like a
> parser for this would be a zillion if statements. Maybe this is fewer if
> statements than for other approaches, but it still looks very complex.
>

I agree that this is a major issue, although my biggest concern with it is a little
different. I'm giving a long answer to help me organize my thoughts.

The writer of a table presumably has access to the documentation for UCDs so it shouldn't be a big problem dealing with the three types -- especially
once there are examples. The problem is more in using UCDs when reading tables.

In practice I'm not sure this would be a big deal for 'real' tools. E.g., something like
VOPlot is going to need to know about the value and meas.error attributes internally so
that it can plot values and error bars for a given quantity. I.e., it's just
going to look for pairs of columns within the same group of the form:

     SomeString;value and SomeString;meas.error

A spectral processing tool is going to look for pairs like phot.flux*;value and
phys.wavelength;value. Specific tools internalize this kind of knowledge -- or
even better read it in as a data model. These tools don't really know about how
UCDs are organized. The organization is intended to make it easy for them to
search for the appropriate strings, but they just take advantage of that.
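
A minimal Python sketch of this kind of lookup, assuming the columns passed in all belong to one group. The function name find_value_error_pairs and the sample column names are hypothetical; only the ;value and ;meas.error suffixes come from the discussion above.

```python
def find_value_error_pairs(ucds):
    """Pair each '<base>;value' UCD with a matching '<base>;meas.error' UCD.

    `ucds` maps column names to UCD strings for the columns of one group.
    Returns a list of (value_column, error_column) tuples.
    """
    values = {}
    errors = {}
    for column, ucd in ucds.items():
        if ucd.endswith(";meas.error"):
            errors[ucd[: -len(";meas.error")]] = column
        elif ucd.endswith(";value"):
            values[ucd[: -len(";value")]] = column
    # Keep only the bases that have both a value column and an error column.
    return [(values[base], errors[base]) for base in values if base in errors]


# Hypothetical columns of a single group.
columns = {
    "flux": "phot.flux;em.optical;value",
    "flux_err": "phot.flux;em.optical;meas.error",
    "ra": "pos.eq.ra;value",
}
print(find_value_error_pairs(columns))
```

The tool never needs to know which words are concepts, modifiers or attributes; it only relies on the fixed suffixes.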

Generic tools for manipulating UCDs and for validating them are where the problem
really begins to show up. Currently there are only 6 trees that are not basic concepts
(em, frame and intent for modifiers and filter, stat and meas for attributes).
I think the single word attributes are important enough that they will not cause a problem. So a complete algorithm to determine what word belongs in what vocabulary is currently pretty easy... Pseudocode is just:

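     // word is one ';'-separated component of the UCD, e.g. "em.optical"
     // firstAtom is everything before the first "." (the whole word if there is none)
     firstAtom = substring(word, 0, index(word, "."))
     switch (firstAtom) {
         'em', 'frame', 'intent': return thisIsAModifier
         'stat', 'filter', 'meas': return thisIsAnAttribute
         'value', 'local', 'instance', 'multiplet', 'vector': return thisIsAnAttribute
     }
     return thisIsAConcept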
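
For reference, a runnable Python sketch of the same classification. The word lists are the ones named in this message; the function name classify_word and the returned labels are hypothetical, and anything not in the lists is assumed to be a concept.

```python
MODIFIER_TREES = {"em", "frame", "intent"}
ATTRIBUTE_TREES = {"stat", "filter", "meas"}
SINGLE_WORD_ATTRIBUTES = {"value", "local", "instance", "multiplet", "vector"}


def classify_word(word):
    """Classify one ';'-separated UCD word as 'modifier', 'attribute' or 'concept'."""
    first_atom = word.split(".", 1)[0]  # the whole word when there is no '.'
    if first_atom in MODIFIER_TREES:
        return "modifier"
    if first_atom in ATTRIBUTE_TREES or first_atom in SINGLE_WORD_ATTRIBUTES:
        return "attribute"
    return "concept"


ucd = "phot.flux;em.optical;intent.calculated;value"
print([(word, classify_word(word)) for word in ucd.split(";")])
```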

Alternatively we're talking about validating UCDs against an IVOA schema to define
the valid words and the match against this could give the type.

There are other simple ways to deal with this: Begin all modifiers with m. and attributes
with a. Or I've suggested in the draft that all modifiers could be in the frame tree
-- the idea is that the role of modifiers is to limit the context to which the concept applies.
I don't think the attribute trees join as easily but if it's important enough we could
pick a name for all of the attribute trees.

The biggest problem is non-standard namespaces. How do we handle a new UCD tree?
In some sense the issue is moot. Non-standard words shouldn't be used outside
of some developer's local context. They can be responsible for handling them. However
I suspect that non-standard words will escape into the wild. The validate-against-the-schema
approach still works, but it's impossible for writers of tables to know
how to use these UCDs.

There are some other ideas that might help address this issue: Your suggestion
of another separator character is nice. I thought about it but decided that it was too radical a change. Maybe separate the attributes and the modifiers within themselves by commas, but separate the groups from each other by '-'s. E.g., a complex UCD might be:

     phot.flux-em.optical,intent.calculated-meas.error,stat.max

I'd still like to keep the vocabularies separate, but now it's trivial to parse the UCD.
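
To illustrate how little parsing that form would need, here is a Python sketch. It assumes the three groups (concept, modifiers, attributes) are always present and that an empty group is written as an empty string; the message does not say how a missing group would be written, and the name parse_ucd is hypothetical.

```python
def parse_ucd(ucd):
    """Split 'concept-mod1,mod2-attr1,attr2' into its three vocabularies."""
    groups = ucd.split("-")
    if len(groups) != 3:
        raise ValueError(f"expected concept-modifiers-attributes, got {ucd!r}")
    concept, modifiers, attributes = groups
    return {
        "concept": concept,
        # An empty group yields an empty list (assumption, see above).
        "modifiers": modifiers.split(",") if modifiers else [],
        "attributes": attributes.split(",") if attributes else [],
    }


print(parse_ucd("phot.flux-em.optical,intent.calculated-meas.error,stat.max"))
```

The role of each word follows from its position alone, so no list of reserved tree names is needed.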

For the moment I tried to minimize the change from the original proposal. Note that this
is all much harder in the original proposal. There is no way to tell what anything after
the first word is. In that proposal the first word is a property, but all subsequent
words can be either properties or concepts. Nor is there any lexical definition of what
a property is (i.e., any word can be a property).

> The document has a lot of signs of a rush job -- is it Uniform or Unified?
> (Unified, I think.)

I always thought it was Uniform so that wasn't a typo but an error on my part...
Sigh...

> Is flux a 0-level concept? Or is it phot.flux?

That I think is fixed in the published version (it's always phot.flux).

> On p.3 you say that units are not part of UCDs, but on p.16 you create a UCD,
> phys.degrees;value

I wasn't quite sure what the UCD should be there. Maybe phys.angle.separation;value?

> , that is all about units. On p.12, I really like the
> typo(?) in 'pudding' (pubbing).

Alas that is also fixed. [That kind of error must reflect some curious things about the mind. I clearly picked the mirror image letter even though the typing motion for it is nothing like 'd']
>
> I'm not sure how others have reacted -- have not gone to the UCD list yet to
> see. But I was particularly confused by the following things.
>
> o p.4, you say that
>
> phot.flux;em.optical;intent.calculated;value
>
> is equivalent to
>
> phot.flux;intent.calculated;em.flux;value
>
> But there must be a mistake here. Shouldn't 'flux' in the second line be
> 'optical'? And isn't the first form illegal if alphabetical order is
> required?

The typos in the UCD were fixed and I hope that would help clarify what I was trying to say. The two UCDs should have been

    phot.flux;em.optical;intent.calculated;value

and

    phot.flux;intent.calculated;em.optical;value

The statement I was trying to make was that there is no natural reason to prefer one of these to the other, so we had to choose an arbitrary rule to try to ensure uniqueness of UCDs. Thus indeed the second is illegal.
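
A check for such an ordering rule could look like the following Python sketch. It assumes the arbitrary rule is plain alphabetical order of the modifier words, as the question above implies, and it hard-codes the three modifier trees named earlier in this message; modifiers_in_order is a hypothetical name.

```python
MODIFIER_TREES = {"em", "frame", "intent"}


def modifiers_in_order(ucd):
    """Return True if the modifier words of a ';'-separated UCD are sorted."""
    words = ucd.split(";")
    modifiers = [w for w in words if w.split(".", 1)[0] in MODIFIER_TREES]
    return modifiers == sorted(modifiers)


print(modifiers_in_order("phot.flux;em.optical;intent.calculated;value"))  # True
print(modifiers_in_order("phot.flux;intent.calculated;em.optical;value"))  # False
```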

>
> I find the goal of brevity at conflict with the goal of clarity. What does
> 'em' mean to a human reader? Why 'src' and not 'source'? Why 'value' and
> not 'scalar' (parallel structure to 'vector')? Why default on 'value' in an
> otherwise well-defined ontology?

I can't really argue with most of these. The tension between various goals is why I tried to list them all together. I would be happy to change to longer
words.

The default for value was just meant to be a convenience for writers of tables.
If it confuses things I'm happy to drop it.

I like value rather than scalar because a value can be a vector quantity. E.g.,
if we have a cell that contains an array of fluxes its UCD might be

    phot.spectrum;value

That's because the concept of spectrum is inherently non-scalar. A field that had
a UCD of

    phot.spectrum;vector

would imply that each cell contained an array of spectra (i.e., that the cell was
presumably a 2-d array). However this is no big deal.

>
> I think if a clear distinction is to be made between attributes and
> modifiers, it must be encoded explicitly (i.e., not just based on a list of
> magic words). I do not like the semicolons as delimiters; this is not what
> they mean in English grammar. (The semicolon in the last sentence was used
> properly. The second clause is not necessarily a direct modifier of the
> first, but rather is related in some intimate way.)

This is fine by me -- I gave an example above using different separators. I think
the grammar is just as simple.

>
> I don't understand how to use the concept 'concept' in a practical sense.
>

Well I tried to give two examples: If you have a VOTable in an editor how do you
find the fields that don't have a defined concept? If a user simply omits the
UCD field it's kind of painful to find them. However one can just do a string
search for "concept" if the user has entered ucd='concept;value' to explicitly
mark that the underlying UCD is unknown.
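
The same search can be done programmatically. The Python sketch below uses the standard xml.etree.ElementTree module on a simplified, namespace-free fragment (not a complete VOTable); the sample field names and the function name fields_without_concept are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment for illustration only; a real VOTable has more structure.
SAMPLE = """<TABLE>
  <FIELD name="ra" ucd="pos.eq.ra;value"/>
  <FIELD name="flag" ucd="concept;value"/>
  <FIELD name="note"/>
</TABLE>"""


def fields_without_concept(xml_text):
    """List FIELD names whose UCD is missing or explicitly the generic 'concept'."""
    root = ET.fromstring(xml_text)
    unknown = []
    for field in root.iter("FIELD"):
        ucd = field.get("ucd")
        if ucd is None or ucd.split(";", 1)[0] == "concept":
            unknown.append(field.get("name"))
    return unknown


print(fields_without_concept(SAMPLE))
```

A plain text search only finds the explicitly marked field; catching the omitted ucd attribute requires parsing, which is the inconvenience described above.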

The real reason is given in the last example in section 5. When correlating two
tables that describe different kinds of quantities, e.g., sources and observations,
I need to be able to describe what the output table is. There are two objects
in every row so it's a multiplet (in my scheme), but what kind of multiplet? I can't
call it a source, and I can't call it an observation, so I need to go up to a more
generic word, i.e., concept. Basically it just provides the root for the entire concept
hierarchy. If we really wanted to be regular, we could start all of the base
concepts with this word...

> Your definition of 'pos' does not include solar or planetary coordinate
> systems, though later you give an example that does.

I don't know what the current hierarchy under pos is... What I'd guess is that it would contain something like:

     pos.body.lat and pos.body.lon

and then the frame modifier would be used to specify which body. [Or maybe I left an inconsistency in from the previous version]

>
> 'intent' is defined as the 'human context' of the concept. Huh? How are
> 'calculated', 'predicted', and 'simulated' anymore human concepts than
> 'observed' or 'measured'?

Observed and measured would be fine additions here except that they are likely to be considered the default. I.e., a time.exposure;value is assumed to be the measured time, so I don't need to put that in. [Note that meas is short for measurement]. The explanation probably needs to be better, but I think we need some kind of modifier that distinguishes between 'real' values and predicted, scheduled, calculated, ... values. This
doesn't come up so much in VizieR tables, but many of the tables that I deal with are riddled with situations where I may have an allocated exposure time,
a predicted exposure time and an actual exposure time. So something is assumed to be actual/measured/observed unless an intent is specified.
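
A reader of such tables could apply that default as in this Python sketch. The default label "measured" and the function name effective_intent are assumptions for illustration; the draft does not fix a word for the implied case.

```python
def effective_intent(ucd, default="measured"):
    """Return the intent modifier of a ';'-separated UCD, or the implied default."""
    for word in ucd.split(";"):
        if word.startswith("intent."):
            return word[len("intent."):]
    # No intent modifier: the quantity is taken to be actual/measured/observed.
    return default


print(effective_intent("time.exposure;value"))
print(effective_intent("time.exposure;intent.predicted;value"))
```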
>
> In 4.4 you insist that full words should be used ('electron' instead of
> 'el'), but at the same time assert that 'phys', 'temp', 'em', etc., are all
> ok.

I don't have a horse in this race... I tried to match the usage of the previous
paper, but I'd be happy to go either way.
>
> Example 2 (p.14) does not convey to me anything semantically different if I
> disregard your comments. How am I supposed to understand something about
> guide stars and plate centers from the structure of the UCDs alone? I take
> issue with your assertion that "both software and humans should have no
> trouble distinguishing the very different semantics of the two tables."
>

Well... I'd hope that by looking at the table UCD, you would immediately note that one table returned source information and the other returned observation information. That's no small matter. The structure immediately shows which concept is subordinate to the other. The actual semantics of the relationship were not described. You could do that if you want that level
of detail. I'm not sure what the right UCDs are.

E.g., the source table might have included (hope the indentation survives the mail):

      obs.instance
           meta.id;value
           pos;meas.center
                pos.eq.ra;value
                pos.eq.dec;value

I guess if we really want to include the concept of a guide star in the UCD hierarchy, they probably belong in the base concept or maybe in frame somehow, but I think this is
too detailed. If we went ahead with it... The guide star might be

      src;frame.usage.guiding;instance
           meta.id;value
           pos.instance

Note that in the first case it's the position that got the extra information, because the observation is just a standard observation (as far as we know). In the second case we're suggesting that this is a special kind of source.

But I don't think I want to put that in the relatively simple examples. What I was trying to show was how the need for main columns has disappeared and that we could get source or observation information from either table with equivalent ease.

> I don't like 'arith' as a concept. 'math' would be ok. If we need it at
> all.

Well I did try to discourage it... I have no problem with math.
>
> I don't like 'soft' as a concept. Is it so bad to just say 'software'? All
> this stuff will be encoded in XML, which is notoriously verbose. If we
> choose unclear abbreviations we will obscure whatever semantic meaning is to
> be found.

Fine with me...

>
> OK, a lot of these criticisms are not really directed to you, but to the
> predecessor document. I understood your presentation in Strasbourg (I
> thought) but do not follow the document sufficiently well that I would ever
> be comfortable promoting it forward. I did not like Roy and Sebastien's
> premise that concept and property could morph, one into the other, depending
> on context. I do like your attempt to structure things more rigidly. It
> seems to me not rigid enough. And when I ran into phys.degrees I felt like
> the whole thing was falling down around me. The concept is an angular
> distance, which of course can be expressed in degrees, radians, arcsec, etc.

Agreed... (see above)
>
> It might be worth our time to look at the AIPS++ measures definitions. If I
> were to construct a quick hierarchy, what we are trying to do here is
> distinguish various sorts of measurements, metadata about those
> measurements, and metadata about the people/organizations associated with
> those measurements. So our fundamental concept is a measurement, of which
> there are various sorts:
>
> measurement
>     photometric
>     spectroscopic (which is just photometric per wavelength in an ordered sort
>         of way)
>     astrometric ('pos')
>     temporal
>     instrumental
>
> Ancillary information about measurements comes in the form of metadata:
>
> metadata
>     identifiers
>     people
>     organizations
>
> And we may have some special classes:
>
> software
> source (to collect measurements of an object in space-time)
>
> Measurements are taken in bandpasses, and in certain coordinate frames, and
> from either the real universe or from computer simulations. A bandpass is a
> 'frame' restricting coverage in the em-spectrum. A coordinate frame
> describes a restriction on the spatial coverage. The idea of 'intent' has
> nothing to do with anything; it is simply a mode of collecting measurements.
>
> All right, enough of my rantings for this evening. I applaud your attempt to
> add rationality to Roy and Sebastien's work, but feel we still have some way
> to go.
>

Thanks... I don't disagree with what you are saying and I hope that we can at least reopen the discussion.

Tom

Received on 2003-10-22Z17:21:20