VOTable alternative?

Guy Rixon gtr at ast.cam.ac.uk
Tue Jan 27 02:22:34 PST 2004


On Mon, 26 Jan 2004, martin hill wrote:

> So I reiterate that we are not looking for a *replacement* to a VOTable, but a
> better way of describing the specific things we expect to do science with.  The
> result of this kind of 'better way' discussion is that we are looking
> particularly at the things that VOTable does badly.

Sorry, but I think many people in this group _are_ trying to displace VOTable.
The motivation seems to be "let's create something that does better what we
want to do, then we (the creators of software) won't have to support VOTable."
At least, that's the tone of many of the posts.

> Where (perhaps?) we diverge is that I don't think it should be used as the
> normal data exchange mechanism.  Because it doesn't describe the data
> sufficiently well _in a standard way_ that allows us and future astronomers to
> build tools out of 'industry standard' tools.  Other comments below...

Yes.  I think that VOTable _should_ be the normal medium of exchange, _for
generic tables_.  I.e., for the most common cases, we exchange generic tables.
Catalogues, typically, which are the main data type for the VO.

Yes, we can use fancy XML structures for special cases that aren't tables.
Yes, we can constrain _some aspects_ of these with schemata (although there
are always cases you can't catch like getting the RA and dec the wrong way
round).  But these are exceptions.

I hold it to be more important to make VOTable a workable standard than to
generate new structures for special cases.  If we gain the special structures
at the expense of being able to handle large, generic tables, then the VO
fails.

> Quoting Guy Rixon <gtr at ast.cam.ac.uk>:
>
> > On Mon, 26 Jan 2004, Martin Hill wrote:
> >
> > > Guy Rixon wrote:
> > > > On Mon, 26 Jan 2004, Martin Hill wrote:
> > > >
> > > >
> > > >>To answer your points directly:
> > > >>
> > > >>1a) We've talked about not being able to use XML tools directly on
> > VOTable.  You
> > > >>are quite right, we can build large/serial XPath queries to extract our
> > > >>information, but this involves 'programming' and makes a simple task
> > more
> > > >>difficult for all the wrong reasons.  The fact that VOTable does not lend
> > itself
> > > >>to doing *any* XML-common task nicely using standard techniques should
> > raise
> > > >>warning bells that something is not right about it, not that we should
> > slavishly
> > > >>adopt each and every XML technology that appears.
> > > >
> > > > Suppose that we have V2 or some similar format which is "richly-described
> > XML"
> > > > using W3C XML schema.  Can you describe any useful operations that can be
> > done
> > > > on this format _just from the schema_ using tools that have no
> > specialization
> > > > to astronomy?
> > >
> > > Validation :-) ie knowing you have the right dataset (spectra vs
> > catalogue).
> >
> > How can a schema ever tell me that my catalogue is valid or not?  It can
> > tell
> > me that my catalogue is not a spectrum...but I already know that.  If we want
> > to distinguish between general uses of a syntax (generic catalogues) and
> > specific uses (e.g. spectrum) then we could put in a field stating the
> > higher-level standard to which a data set conforms.  We could add this to
> > VOTable.  Or better: the higher-level standard, e.g. spectrum-in-VOTable,
> > could require that a PARAM element with a certain value unique to the format
> > be present.
>
> We can in fact say a lot of things about whether a catalogue is valid or not.
> We can ensure that positions are within sensible ranges, that shapes are
> associated with galaxies, etc.

So every time I write a structure about a galaxy I have to put in morphology
information?  No, I know that's not what you meant...but it could be what the
schemata end up saying.  Option (a) all valid Galaxy elements include a
Morphology element, so catalogues without morphology data are filled with
spurious elements; (b) Morphology is optional, so a schema parser can't check
that Morphology is present when it's needed.  So...step 2: separate schemata
for each representation of a Galaxy; Galaxy-with-morphology,
Galaxy-with-photometry, Galaxy-with-morphology-only-IR-photometry-
couple-of-spectra-on-the-side-no-fries-please.  There are just too many things
to say about a galaxy to support this; we'd need thousands of schemata.

Ultimately, you need a more-generic structure.  Granted, that structure need
not be a table; but you always have to do some application-specific parsing
and checking.  Therefore, W3C XML schema is _not_ a silver bullet that gets us
out of coding parsers and validators.
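
To illustrate the point, here is a minimal Python sketch of the kind of
application-specific check that is still needed after schema validation.  The
element names (Catalogue, Galaxy, Morphology, RA, Dec) and the
needs_morphology flag are hypothetical; they are not taken from any agreed
data model.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document: Morphology is optional, as in option (b).
DOC = """
<Catalogue>
  <Galaxy><RA>187.7</RA><Dec>12.4</Dec><Morphology>E0</Morphology></Galaxy>
  <Galaxy><RA>41.3</RA><Dec>10.7</Dec></Galaxy>
</Catalogue>
"""

def check_catalogue(xml_text, needs_morphology):
    """Return problems that a schema with optional Morphology cannot report."""
    problems = []
    root = ET.fromstring(xml_text)
    for i, gal in enumerate(root.findall("Galaxy")):
        ra = float(gal.findtext("RA"))
        dec = float(gal.findtext("Dec"))
        # A range check is about as far as generic validation can go.
        if not (0.0 <= ra < 360.0 and -90.0 <= dec <= 90.0):
            problems.append(f"row {i}: position out of range")
        # Whether Morphology is required depends on the application.
        if needs_morphology and gal.find("Morphology") is None:
            problems.append(f"row {i}: morphology missing")
    return problems

print(check_catalogue(DOC, needs_morphology=False))
print(check_catalogue(DOC, needs_morphology=True))
```

Note that the second row would still pass the range check if its RA and dec
were swapped, which is the kind of error no validator of this sort can catch.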

> Also *you* may know that a particular file is a spectrum or a catalogue.  But
> we're trying to build semi-intelligent systems, including workflows, that can
> take certain 'types' of data and feed them into other steps.  They need to know
> what can go into what - not just to check the process as it runs, but also at
> design time.  We don't want to let people accidentally connect the wrong job step
> output into some other input without it barfing.  And we want our tools to be
> able to look around and go 'there's a service that can take what I have, to do
> what I want'.
>
> So why create an enumerated type in VOTable?  Why make up yet another
> VO-specific way of describing things when there is already an industry standard
> way of doing it?  We are/should be about reusing existing standards rather than
> making new.

Because the IT-industry standard seems not to work for efficient, binary
structures, or for tables in ordinary RDBMS, which are the two things we have
most of.
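
As a rough sketch of the PARAM-based marker suggested earlier in this thread,
a workflow tool could read the declared higher-level standard with a generic
XML library.  The PARAM name "conforms_to" and the value
"spectrum-in-VOTable" are assumptions made for illustration; no such
convention has been agreed.

```python
import xml.etree.ElementTree as ET

# Small VOTable-like document; the PARAM name and value are hypothetical.
VOTABLE = """
<VOTABLE version="1.0">
  <RESOURCE>
    <PARAM name="conforms_to" datatype="char" arraysize="*"
           value="spectrum-in-VOTable"/>
    <TABLE>
      <FIELD name="wavelength" datatype="double" unit="nm"/>
      <FIELD name="flux" datatype="double"/>
      <DATA><TABLEDATA>
        <TR><TD>500.0</TD><TD>1.2</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""

def declared_standard(votable_text, param_name="conforms_to"):
    """Return the value of the marker PARAM, or None for a generic table."""
    root = ET.fromstring(votable_text)
    for param in root.iter("PARAM"):
        if param.get("name") == param_name:
            return param.get("value")
    return None

print(declared_standard(VOTABLE))
```

The same marker could travel alongside a binary stream, because it lives in
the metadata rather than in the structure of the data.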

>
>
> > I disbelieve that a W3C XML schema can turn a generic parser into something
> > that does astronomy.  At best, it can allow a generic DOM-based parser to
> > validate an instance of a document.  But DOM is exactly what we need to avoid,
> > because it fails to handle large documents. I doubt that we can produce a
> > single SAX-based parser that does everything just on the basis of a schema.
>
> Well of course - that's an obvious statement!  And we should *not* be using XML
> for large documents.

Good, we agree on something.  If we don't put large datasets into XML, how do
we describe them?

We have cases - e.g. the output from an ADQL server - where any given job can
produce either a small amount of data, which we can put in your proposed
format, or in VOTable/TABLEDATA, or in V2, or it can produce a mass of data
which we can't.  In a workflow, we need to route this result from originating
service A to consuming service B via a file or a stream.  B is easier to write
if the metadata for the dataset - the part doing the job of the FIELDs in
VOTable - is the same in both cases.  If B depends on a W3C XML schema and a
validating XML-parser to check its input then either it can't handle the
binary case or it has to forgo the validation. If it depends on a W3C schema
to understand the data, then maybe it can't use the binary format at all.

We need a format for metadata that is common to rich XML and binary formats.
That's why I like the inclusion of VOTable FIELDS in Roy's V2 proposal.
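
A minimal Python sketch of what "common metadata" could mean in practice: one
list of column descriptions drives the decoding of both a TABLEDATA-style XML
fragment and a packed binary stream.  The FIELDS list, the datatype-to-struct
mapping and the big-endian record layout are assumptions for illustration, not
a statement of what VOTable or V2 specifies.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical column descriptions playing the role of VOTable FIELDs.
FIELDS = [
    {"name": "ra", "datatype": "double"},
    {"name": "dec", "datatype": "double"},
    {"name": "nobs", "datatype": "int"},
]

# Assumed mapping from datatype names to struct codes and Python casts.
STRUCT_CODES = {"double": "d", "float": "f", "int": "i", "long": "q"}
CASTS = {"double": float, "float": float, "int": int, "long": int}

def rows_from_tabledata(xml_text, fields):
    """Decode TR/TD cells using the shared column descriptions."""
    root = ET.fromstring(xml_text)
    for tr in root.iter("TR"):
        cells = [td.text for td in tr.findall("TD")]
        yield {f["name"]: CASTS[f["datatype"]](c) for f, c in zip(fields, cells)}

def rows_from_binary(payload, fields):
    """Decode fixed-length big-endian records using the same descriptions."""
    codes = "".join(STRUCT_CODES[f["datatype"]] for f in fields)
    record = struct.Struct(">" + codes)
    names = [f["name"] for f in fields]
    for values in record.iter_unpack(payload):
        yield dict(zip(names, values))

xml_rows = list(rows_from_tabledata(
    "<TABLEDATA><TR><TD>10.5</TD><TD>-3.25</TD><TD>4</TD></TR></TABLEDATA>",
    FIELDS))
bin_rows = list(rows_from_binary(struct.pack(">ddi", 10.5, -3.25, 4), FIELDS))
print(xml_rows == bin_rows)
```

Service B only needs the column descriptions to consume either form; it does
not need to know which serialization service A happened to choose.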

> > > Also, let us say we have an editor/wizard that helps us write registry
> > > information.  Even standard XML editors can now not only validate what you
> > > write, but also provide drop down lists of the various options available
> > within
> > > the element you are in.  I'm sure there will be more such tools as time
> > goes on
> > > - it's this unknown ahead that I would like us to be reasonably prepared
> > for.
> >
> > But a tool to configure the registry isn't a data-exchange format.  Registry
> > information is defined by the registry.  It's one format for one purpose.
> > Tabular data in general aren't like that.  You can't produce a schema that
> > defines exactly what you can put in each cell of a table for _any_
> > astronomical purpose, can you?
>
> Well you're thinking in terms of tabular data because your data happens to be in
> tables just now.  But actually even stellar catalogue data is *not* naturally
> tabular.
>
> And yes, I see no reason why we can't define what we're going to put in every
> element of a document for any astronomical purpose.  It doesn't even have to be
> agreed by the whole VO community - I am sure the solar people will have things
> they want to put in that the rest of the community doesn't care about.  That's
> rather the point - they don't need to squeeze things into VOTable and use
> special tools if we give them the building blocks.
>
> > > >>3) The thing is it's not a few days code we save.  If we start off now in
> > the
> > > >>community with an uncommon-XML-standard way of presenting our data, it
> > means all
> > > >>future users of the VO are going to have to write their own libraries to
> > cope
> > > >>with it, rather than any standard XML-handling libraries that come along.
> >  The
> > > >>example you've given is XPath - someone is going to have to write their
> > own code
> > > >>(in FORTRAN, C, Java, Perl, etc etc) to extract the FIELD info and then
> > the
> > > >>correct column.  We've seen how XPath has come along only recently - we
> > can
> > > >>expect all manner of other inventions to appear over the next few years.
> > > >
> > > > Not _all_ future users, surely?  Just some community representatives for
> > each
> > > > new feature in each supported language.
> > >
> > > As the complexity of the IT world grows, we don't want to add extra layers
> > on
> > > top of what is already a significant skillset (XML).  Nor a layer of
> > unnecessary
> > > tools with all the extra maintenance, learning time and debugging required.
> >  If
> > > a new astronomer wants to do things with VOTable, s/he will have to go away
> > and
> > > find out what tools are available and where they've been put, then learn
> > those
> > > on top of the normal XML skills.  And of course, in many cases the
> > astronomer
> > > won't know about all the things that exist, and will end up writing some of
> > them
> > > again (or will anyway!).  And if the learning is significantly painful,
> > will
> > > ignore our tools altogether.
> >
> > The average astronomer wouldn't recognize XML if they were treading in it.
> > Likewise, the A.A. has no training or experience in XML tools from the IT
> > industry.  We have to make things easy by providing good UIs that are suited
> > to astronomy.  This _always_ means an extra interface layer on top of the
> > engineering infrastructure.  VOTable at least is based on astronomical norms
> > and concepts; its UI layer is likely to be easier than that for a tool that
> > has nothing to do with astronomy.
>
> Very true - but then A.A. is still working in FORTRAN and C.  This is changing -
> and I'm not expecting any astronomer to make the leap of faith to using our UIs
> blindly without looking at what they're getting.  Particularly as the early days
> are likely to be flaky.
>
> But we should be able to build tools for astronomers using ordinary XML
> libraries and techniques.  And there will always be toolbuilding astronomers -
> like those who built Aladin.  We need to encourage them, not make the VO
> toolkits a specialised area suitable only for VO Gurus.
>
>
> > > >>3+) We need to argue over the schemas for a very important reason -
> > because we
> > > >>are arguing over how we share our information. VOTable does *not* do
> > this!  It's
> > > >>a cop out that lets us pass information around, but without having to
> > agree
> > > >>what's in it.  This is sometimes presented as a good thing, but actually
> > it's
> > > >>not - it just means we have deferred the problem, and we can continue to
> > defer
> > > >>it while we pat ourselves on the back for having produced something.
> > For
> > > >>example, how do we use it to transport spectra?  Aha, we need to discuss
> > this
> > > >>and agree it.  How do we use it to describe datacenter metadata?  Aha, we
> > need
> > > >>to discuss this and agree it.  So in fact we still have the original
> > problem,
> > > >>and have solved nothing.  Indeed, we've made it worse, because now we
> > have no
> > > >>way of checking our agreements.  Agreeing and publishing a schema means
> > everyone
> > > >>everywhere has something to develop against, and something to validate
> > against
> > > >>both as they publish data and receive it.  You can be sure you're not
> > getting a
> > > >>spectra when you expect a catalogue, etc.
> > > >
> > > >
> > > > Yes, but... with a generic format, we can represent data now, we just
> > can't
> > > > parse them as crisply as we'd like.  With the schema-based formats, we
> > can
> > > > only record quantities for which we've already agreed the data model. We
> > > > can't even use the ones not yet covered in the data model. It could be
> > _years_
> > > > before the data-modelling effort covers all the quantities.  What do we
> > do in
> > > > the meantime?
> > > >
> > > > We can use VOTable now and add utype attributes to it as fragments of
> > data
> > > > model become available.  This seems to me to be a necessary feature.
> > >
> > > It's not just the parsing - it's the validating and knowing how to
> > interpret a
> > > particular information set (spectra vs catalogue). Using VOTable, we're
> > not
> > > really representing the data - just the structure that astronomers are used
> > to
> > > seeing it in.
> >
> > YES!!! That's the whole point!  By enabling astronomers to use a generic
> > table-structure we let them say things that the data modellers haven't
> > thought
> > of.
> >
> > You can imagine the scenario.  Dr. Clever goes to her local VO contact and
> > says "how do I record data about this new idea I've just invented in the
> > VO?"
> > Mr. VO looks at it and says "Cool.  Give us six months to draw up the data
> > model for the new bits and then you can publish your results.  Don't try to
> > publish it this week using that VOTable rubbish (like your competitors will)
> > coz it's not semantically pure and my favourite IT-industry tool won't be
> > able
> > to read it."  We'd get lynched!
>
> Of course we would.  And rightly so!  But saying that we need to *always*
> describe data vaguely because there will be occasions where we want to be
> flexible is not a Good Solution.  Saying that we should *never* be able to use
> standard tools because there will be (a few) occasions where it might not be the
> best solution is also not a Good Solution.  Saying that people are *always*
> going to have to use VO-built tools when there may be more sophisticated
> standard ones is also not a Good Solution.

BUT: using VOTable doesn't stop you from using XML tools.  Using XML tools
exclusively stops you writing down data (a) until the data models are workable
and (b) in the cases where the data sets need to be binary files or streams.
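
As a sketch of that first point, a standard XML library can pull a named
column out of a VOTable in a few lines: look up the FIELD by name, then use
its position in a path expression.  The sample document, the column names and
the column() helper are made up for illustration, and the sample omits any XML
namespace.

```python
import xml.etree.ElementTree as ET

# Made-up VOTable fragment with two columns.
VOTABLE = """
<VOTABLE version="1.0">
  <RESOURCE>
    <TABLE>
      <FIELD name="RA" datatype="double" unit="deg"/>
      <FIELD name="Vmag" datatype="float" unit="mag"/>
      <DATA><TABLEDATA>
        <TR><TD>10.68</TD><TD>12.3</TD></TR>
        <TR><TD>83.82</TD><TD>14.1</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""

def column(votable_text, field_name):
    """Return the text of every cell in the column named field_name."""
    table = ET.fromstring(votable_text).find(".//TABLE")
    names = [f.get("name") for f in table.findall("FIELD")]
    position = names.index(field_name) + 1  # path positions count from 1
    # Positional predicate: the Nth TD child of each TR.
    return [td.text for td in table.findall(f".//TR/TD[{position}]")]

print(column(VOTABLE, "Vmag"))
```

This is the two-step lookup that the thread calls 'programming'; it is a small
amount of code, but it is not a single schema-driven path expression.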

> There's no reason why someone publishing extra data can't publish it under an
> extended schema.  Or why we can't build in such an extension mechanism into our
> schemas.
>
> > > So we need a different way of agreeing data models. We don't have to have
> > an
> > > absolute agreement - we just need a version 0.1.  Indeed we don't need an
> > agreed
> > > data model before we have semi-agreed XML snippets (for those not in the
> > > dm at ivoa.net mailing list, I've been whittering on there too on this).  In
> > the
> > > same way as we are agreeing ADQL, or VOResource, etc.
> >
> > Sounds deadly.  Is my structure using VOPosition v35.213.345 and
> > VOPhotometry
> > v4000000018 and VOMorph vvSNAPSHOT going to interoperate with your
> > VOPosition
> > v48.12 VOSpectropolarimetry v2004-12-18 and VOGalShapes v "latest"?
> > Oink-flap.
>
> This is a problem we are going to have to face anyway from the data modelling
> group.  We already face it now in ADQL. It gets worse because these snippets are
> likely to be common across schemas such as ADQL and Registry entries as well.
>
> I'd be interested in any ideas on this! :-)

There is one data-modelling group.  That's a lot more manageable than a
free-for-all.  Besides, we don't have to use the data-model references in
generic structures until they are stable and tested; they add value to generic
structures, they aren't essential for the structure to exist.

> > Besides, data models change to different versions because the previous
> > versions are wrong or ambiguous.
>
> Er yes.
>
> > If we have loads of v0.1 schemata around,
> > even if there's only one per subject area, then the chance of
> > misinterpretation is high.  That defeats all the validation by the schemata
> > because a schema can't show when wrong numbers have been put in.  Not in the
> > general case anyway.  The most deadly mistake for the VO is data that look
> > right numerically but mean the wrong thing.  E.g. semi-major-axis length
> > used when full-major-axis length is needed.
>
> I don't get this.  Why should misinterpretation occur?  How is an element
> <Semi-major-axis> going to be confused with <full-major-axis>?  That's the sort
> of thing that occurs in VOTables when a column has been miscounted...  Similarly
> VOTables are exactly the place where silly mistakes can be made (forgot to
> delete a <FIELD> println when deleting the <TD> println?) and go undetected
> until some poor astronomer has gone through debugging why their results are
> screwy, because the magnitude values for one filter have been miswritten into
> another.

If you call elements semi-major-axis and full-major-axis, then no, you won't
confuse them.  But you have to take this level of care in description with all
the elements, so mistakes are possible.

>
> >
> > > I would be extremely
> > > reluctant to try and shoehorn a query language into VOTable just because it
> > is
> > > there.  Similarly an image.  Similarly a registry resource.  So why are we
> > > trying to do it with everything else?
> >
> > We're not.
> >
> > > VOTable only works when we as developers *know* at both output & input what
> > we
> > > are dealing with.
> >
> > That's incomplete.  It _also_ works when the data, decorated with contextual
> > metadata of the kind encoded in a VOTable FIELD, can be shown to a human who
> > makes sensible decisions.  VOTable does that.
> >
> > > And even that only works *right now* because we're only
> > > dealing with stellar catalogues and occasionally spectra. As soon as we
> > extend
> > > our dataset types the problems will be worse for not agreeing (at least a
> > bit),
> > > not better.
> > >
> > > Right now we are only really passing around stellar catalogues and spectra
> > using
> > > VOTable (is this right?) So it's early enough to swap in early schemas for
> > those
> > > as a required-output-option from datacenters (we can still preserve VOTable
> > also
> > > as a required-output-option).
> >
> > Well it's stellar catalogues and galaxy catalogues and quasar catalogues and
> > observation logs and SEDs and (IIRC) colour-term tables. Plus everything
> > else
> > that's in Vizier, which outputs in VOTable.  Vizier more or less defines the
> > UCD1 set, so I guess that there are ~1,500 distinct quantities that can
> > appear
> > in a VOTable, just from one data source.  Any idea how soon we can get that
> > lot to come out in a schema-controlled structure?
>
> I really don't see a problem with this. I take it each column has an associated
> UCD?  I would expect there to be a lot less element types than UCDs.  A
> <Brightness> element would be used for all the PHOT_MAGs.
>
> >
> > Think of what you could do with a good spreadsheet engine that has
> > astronomical formulae available to it. The spreadsheet doesn't have to
> > understand the relationships in the data; the understanding is supplied by
> > the
> > user, either interactively or via simple scripts crafted to suit particular
> > data sets. Generic, tabular data-structures are designed to feed that kind
> > of
> > tool.
> >
> > Hierarchical structures based on schema solve the separate problem of
> > machines
> > understanding the ontology of the data.  By all means go and solve that
> > problem; but it's wrong to say that because we solve that problem we mustn't
> > support generic tables.
>
> Agreed - See comment at top!
>
> >
> > Guy Rixon 				        gtr at ast.cam.ac.uk
> > Institute of Astronomy   	                Tel: +44-1223-337542
> > Madingley Road, Cambridge, UK, CB3 0HA		Fax: +44-1223-337523
> >
>
>
> --
> Martin Hill
> 07901 55 24 66
> www.mchill.net
>

Guy Rixon 				        gtr at ast.cam.ac.uk
Institute of Astronomy   	                Tel: +44-1223-337542
Madingley Road, Cambridge, UK, CB3 0HA		Fax: +44-1223-337523


