head 1.1; access; symbols; locks http:1.1; strict; comment @# @; expand @b@; 1.1 date 2004.05.24.20.59.39; author ClivePage; state Exp; branches; next ; desc @none @ 1.1 log @Units - Clive Pages talk @ text @ࡱ> hT( n/ 0DArialNew (0(B 0 DTimes New Roman(0(B 0 G0.  @@n?" dd@@  @@@@`` `+.R !#$%&'()*+, 0AA@@3ʚ;ʚ;g4KdKd@@B 0ppp@@ <4ddddlpC 0 80___PPT10 pp?  %wOSimple ADQL Enhancements 9Clive Page, AstroGrid Project University of Leicester, UK9 SELECT INTO newtable "   SQL designed to specify the final result  which is by default sent to standard output as a table in text form. Astronomers typically want to achieve a result incrementally, checking output of each step. Need to be able to send results to a temporary table AstroGrid is working on MySpace  user s storage area within DBMS or on external file. JHU is working on MyDB  similar concept in DBMS. Exact syntax for <newtable> may have to be left as implementation-dependent until more experience is gained.655b  J  .  R  EXPLAIN query  Most DBMS support the EXPLAIN command, although not part of standard SQL. Very useful to inform user whether a given query will be Evaluated by sequential scan  very slow for large table Or use an index  fast even for a large table Results of EXPLAIN are entirely system-dependent Perhaps this is acceptable at present, we may find a way to present results in a more uniform way later.Lg1ig1i  TOP nrows "  4Already specified in ADQL. Supported by most DBMS: 4 TOP nrows - Semantics. .    LSyntax well defined, but semantics less so. When used with ORDER BY <column> allows you to get the largest/smallest <n> rows. But: with Sybase-ASE the use of SET ROWCOUNT with ORDER BY results in selecting the first <n> rows in the table and then sorting them  not usually what is required.:`.& Great Circle Distance Function |When performing a cross-match it is very often desirable to have the distance between the matching sources as a column in the output Trigonometry to do this is a little complicated Simplest expression has very poor accuracy with small angles, better to use  haversine formula. Propose an additional function to simplify queries: GCDIST(ra1, dec1, ra2, dec2) Returns distance in units to be agreed, e.g. degrees, with arguments also in degrees. d0a4t0a4t .     Physical Units in Queries    Where units are needed Consider simple query like this: SELECT * from t1, t2 WHERE REGION( whatever ) AND properMotion > 100 ; What does the  100 mean? Vizier collection has 231 tables containing proper motions:<!ZV'TVHB     ^  ,Units  Vizier example Vizier collection has 231 tables containing proper motions: These have 19 different units for it. IAU recommends using SI units where possible (with a few extensions for astronomy) SI unit of angular velocity is radians/second None of the 231 tables uses radians/second. d<-<)   Importance of supporting units The cone-search works only because the REGION keyword has the units of its parameters specified (degrees). Almost all other selections involving numerical quantities will be meaningless (or at the very least error-prone) without units information.$a   JPossible solution (1) Give astronomer information on units in each columnKKJ  OK for simple queries on one table Will not work in general for multi-table queries, as different tables will use different units. Would have to rely astronomers checking all necessary metadata before submitting a query.     xPossible solution (2) Use  standard units in all interfaces==< Data access layer of DBMS has to do the conversion to/from units used in each column. Problems: Will be hard to get agreement on standard units for hundreds of quantities Different branches of astronomy have different conventions, e.g. flux in mJy, Jy. SI units are approved by IAU, but rarely used SI units with large scaling factors make browsing harder, errors much more likely. <` `  h       =Possible solution (3) Allow users to specify units in queries>>= DExample SELECT & WHERE properMotion > 100 [mas/yr] Notation: Suggest CDS notation, based on earlier work at GSFC which is compact, flexible, unambiguous, easy to read. http://vizier.u-strasbg.fr/doc/catstd.htx Would be nice to avoid the brackets, e.g.  100mas/yr but this will be hard to parse unambiguously. Problem: how does the VO system know what units are attributed to each column? Answer: metadata queries.+ k+dR+ k*d R  p     z "  d 0(Requirements on system for unit-handling( Need to find the units of each column: a metadata query. Need to parse the units strings in a query and apply the appropriate conversion functions. A few complications: Do magnitudes have units, or as the logarithm of a ratio are they unit-less? Functions in queries need special treatment, e.g. COS returns a unit-less value, ACOS always returns radians. Some conversions cannot be exact, e.g. magnitudes to flux densities in Janskys. (Suggest: leave until later.)N=o=o.     Metadata Queries \VO opinion seems divided on whether metadata queries should be satisfied by the DBMS (via DAL?) or by the Registry. We need to invent an interface which can be implemented either way. One can regard the registry as a kind of cache of metadata. Absolutely minimal function: What is the units string of column C of table T of database D at site S? &LL\ PMetadata query  desirable functionality( Ask which databases are available at site S Ask which tables are in database D Ask which columns are in table T Ask for properties of a table: length, width, indexes available, astronomical metadata (epoch, equinox, ...), curational metadata (version, author, ...) Ask for properties of a column: name, data-type, units, nullability, UCD, short description, data range, ... Provided these functions are supported, it does not matter whether the information comes from the DBMS or from the Registry. ZH   Y    Cross-matching tVery important functionality  by combining datasets we often get new scientific results. In DBMS terms it needs a spatial join  the join criterion is the overlap of the error-regions. Error-region always small patch of sky, never just a point because of errors of measurement, extended objects, proper motions, etc. Shapes of error-regions vary: often elliptical, sometimes circular, occasionally more complex. Size depends on confidence level of the match  often expressed as say x% confidence, or y-sigma (the latter assumes some error distribution, e.g. Gaussian).<;Z>: Other cross-match complications ~Difference in epochs of catalogues means some objects will have moved  apply proper motions? Need epoch metadata. Different users will want different confidence levels and hence sizes of error regions Large region may produce too many false positives. May need to adjust confidence level in the light of experience. ``Ws`Ws? Current ADQL Syntax J & WHERE XMATCH(x, y, !z) > 3 AND & Some problems: XMATCH is not a quite a function: if a confidence level or N-sigma value is needed it should be one of the arguments. Better to express matching probability in X% confidence than as N-sigma, as former makes no assumptions about the functional form of the error distribution. No syntax for LEFT OUTER JOIN (return unmatched sources as well as matched ones).2Ae1t.   Specifying a join in SQL JTwo methods in the standards: SELECT * FROM t1, t2 WHERE t1.x = t2.y & SELECT * FROM t1 [LEFT OUTER] JOIN t2 ON t1.x = t2.y WHERE & The latter form is more verbose but allows a number of different types of join, the join criterion is explicit. Propose that we use latter, e.g. ON XMATCH(& )xO%N.O% N . %  Some cross-match implementations  BCross-match of two catalogues of ~N sources is an O(N) operation unless some form of indexing/sorting is used, which can reduce it to O(N log N) or better. Known algorithms Join using spatial index such as R-tree (Oracle, Sybase, MySQL, Postgres& ) or Grid-file (DB2) Join using pixel code and B-tree (more complex and slower, but feasible with just about any DBMS) Sort/sweep algorithm of Dave Abel and colleagues. Note: all these use bounding boxes drawn around the error ellipses (or whatever). Refinement stage weeds out the false matches. LZZZ5y  \5 y9    + Algorithm limitations With most algorithms, if the size of the error region changes it is necessary to generate new indices - slow. But: the hard part is done by the DBMS in reducing an O(N) problem to one of O(N log N) or better. Proposed solution: Always carry out the cross-match using largest error region that is scientifically justifiable (99.9% or 3) The user can then refine the crude selection using the relatively small table of results, rejecting sources too far apart for the actual error-regions and confidence level. In this case: can omit confidence level (or N-) value in the XMATCH function  apply only in refine stage. 8ZZ>  b +}  = +Selecting sources which have no counterpart+ Syntax is: XMATCH(x, !z) Standard RDBMS can do this using LEFT OUTER JOIN of x and z INNER JOIN of x and z Take difference between results of the last two. Or is there a simpler way? I think that it is at least as important to have a defined syntax for LEFT OUTER JOIN as knowing which sources have no counterpart is often scientifically important. Propose plus symbol, e.g. XMATCH(x+, y)N?ZbZZ?bH  g   -Cross-match can also find clusters of objects- Scientific examples: find clusters of stars, galaxies, objects affected by gravitational lensing. Method: cross-match catalogue with itself but with a much larger maximum offset than the error-regions. Problem: needs index generated using bounding boxes much larger than error-regions used for finding counterparts. May need additional function like XMATCH but with additional parameters e.g. for maximum offset.Z.Y  ?   ` ` ̙33` 333MMM` ff3333f` f` f` 3>?" dd@@ v?" dd@@  " @@ ` n?" dd@@   @@@@``PR    @@ ` ` p>> @@%(    6f " )<  T Click to edit Master title style! !$  0i "v  RClick to edit Master text styles Second level Third level Fourth level Fifth level!     S  0n "`w  _ 2003 May 24 0     08t "`   v*Clive Page 0    0y "`   b*0H  0޽h ? 80___PPT10.?@@ edi200304$  $(  r  S (,>  r  S  `    H  0޽h ? 3380___PPT10.?@@<[}$  P$(  r  S  )<   r  S v  H  0޽h ? 80___PPT10.?Ȝ$  0L$(  Lr L S 0: )<   r L S Jv  H L 0޽h ? 80___PPT10.?wO  *"@@-P(  Pr P S Q )<   r P S (RT   x Y  -P #"*>=>=>"+5 P <T\?L Y  2SET ROWCOUNT n; SELECT &     @@` P <\? L  l Sybase-ASE     @@` P <X۫?L Y  .SELECT & WHERE& LIMIT n"  @@`  P <$c? L  n PostgreSQL     @@`  P <\?LCY  6SELECT & WHERE& ROWNUM <= n"   @@`  P <?CL  hOracle  @@`  P <?LYC .SELECT & WHERE& LIMIT n"  @@`  P <,?LC iMySQL  @@` P <?LY SELECT TOP n & *  @@` P <?L l SQL Server     @@`4 P < ?LY PSELECT & WHERE & FETCH FIRST n ROWS ONLY")(  @@` P <?L eDB2  @@``B P 0o ?YZB P s *1 ?YZB P s *1 ?YZB P s *1 ?CYCZB P s *1 ? Y ZB P s *1 ? Y `B P 0o ? Y `B P 0o ? ZB P s *1 ?LL `B P 0o ?YY H P 0޽h ? 80___PPT10.?p$  PT$(  Tr T S x )<   r T S v  H T 0޽h ? 80___PPT10.?,$  `X$(  Xr X S  )<   r X S v  H X 0޽h ? 80___PPT10.?[$  p\$(  \r \ S P   r \ S v  H \ 0޽h ? 80___PPT10.?p$  `$(  `r ` S  )<   r ` S v  H ` 0޽h ? 80___PPT10.?yA- ,,?QdA,(  d , GG Qd #"^RGG *d <D? G g1  @@` )d < ?G  kUs/yr  @@` (d <? S G h20  @@` 'd < ?GS   js/yr  @@` &d <#? GS  h17  @@` %d <h-?G  S  js/ha  @@` $d <p/? G  h11  @@` #d <??G   is/a  @@` "d <$I?  G  h11  @@` !d <Q?G   kms/yr  @@`  d <[? W G  g2  @@` d <Xd?GW    jms/a  @@` d <n? GW  i103  @@` d <lw?G  W  zmas/yr"   @@` d <t? G  h10  @@` d <?G  ymas/a"   @@` d <? G g2  @@` d <蜝?G  ycs/yr"   @@` d <? [G h26  @@` d <?G[  } arcsec/yr  "   @@` d <? G[ g6  @@` d <?G [ |arcsec/a  "   @@` d <˝? G g3  @@` d <\՝?G  m10ms/yr  @@` d < ޝ? G g4  @@` d < ?G  n10mas/yr    @@` d <? _G g2  @@` d <?G_  m10mas/a  @@` d <? G_ g2  @@`  d <?G _ n10-6s/yr    @@`  d <|? G g6  @@`  d <l ?G  n10-5s/yr    @@`  d <)? "G g1  @@`  d <3?G"  s 10-4arcsec/yr   @@` d <;? cG" g1  @@` d <E?Gc " s 10-2arcsec/yr   @@` d <N? Gc g3  @@` d <X?G c n0.01s/yr    @@``B -d 0o ?GGZB .d s *1 ?GcGcZB /d s *1 ?G"G"ZB 0d s *1 ?GGZB 1d s *1 ?GGZB 2d s *1 ?G_G_ZB 3d s *1 ?GGZB 4d s *1 ?GGZB 5d s *1 ?GGZB 6d s *1 ?G[G[ZB 7d s *1 ?GGZB 8d s *1 ?GGZB 9d s *1 ?G G ZB :d s *1 ?GW GW ZB ;d s *1 ?G G ZB d s *1 ?GS GS ZB ?d s *1 ?GG`B Ad 0o ?GG`B Bd 0o ?GGZB Cd s *1 ?  `B Dd 0o ?GGH d 0޽h ? 80___PPT10.?%10  0(  x  c $4 )<   x  c $)v  H  0޽h ? 80___PPT10.?y$  h$(  hr h S T )<   r h S (v  H h 0޽h ? 80___PPT10.?jv$  l$(  lr l S T   r l S v  H l 0޽h ? 80___PPT10. ?C{0  p0(  px p c $ )<   x p c $䤞v  H p 0޽h ? 80___PPT10. ?C{}  t$(  tr t S  )<   r t S v  H t 0޽h ? ___PPT10i.$?p +D=' = @@B +$  @@$(  r  S  )<   r  S 0v  H  0޽h ? 80___PPT10.=?`}$  x$(  xr x S 8 )<   r x S  v  H x 0޽h ? 80___PPT10.Possible solution (3) Allow users to specify units in queries)Requirements on system for unit-handlingMetadata Queries)Metadata query desirable functionality  Fonts UsedDesign Template Slide Titles @@Times New Ro_ Page Page 2 5 2003 May 24 ."System-@@Times New Roman-. 2  Clive Page.-@@Times New Roman-. -2 Physical Units in Queries     .-n@@Times New Roman-. 2  Clive Page,   .-@@Times New Roman-. !2 2AstroGrid Projectc     .-@@Times New Roman-. 02 NUniversity of Leicester, UK        .-\՜.+,0x    On-screen ShowUniversity of Leicesterk  ArialTimes New Roman edi200304Physical Units in QueriesWhere units are neededSlide 3Units Vizier exampleImportance of supporting unitsKPossible solution (1) Give astronomer information on units in each column=Possible solution (2) Use standard units in all interfaces>Possible solution (3) Allow users to specify units in queries)Requirements on system for unit-handlingMetadata Queries)Metadata query desirable functionality  Fonts UsedDesign Template Slide Titles ueries)Metadata query desirable functionality Slide 18Cross-matching Other cross-match complicationsCurrent ADQL SyntaxSpecifying a join in SQL!Some cross-match implementationsAlgorithm limitations,Selecting sources which have no counterpart.Cross-match can also find clusters of objects  Fonts UsedDesign Template Slide Titles_i Page Page  !"#$%&'()*+,-./0123456789:;<=>?@@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghikjRoot EntrydO)P/ACurrent User!/SummaryInformation(hPowerPoint Document(DocumentSummaryInformation8w 0B 0DTimes New Roman0B 0G0.  @@n?" dd@@  @@@@`` `+c .R $* 0AA@@3ʚ;ʚ;g4KdKd B 0ppp@@ <4ddddLpC 0 80___PPT10 pp?  %] Physical Units in Queries    Where units are needed Consider simple query like this: SELECT * from t1, t2 WHERE REGION( whatever ) AND properMotion > 100 ; What does the  100 mean? Vizier collection has 231 tables containing proper motions:<!ZV'TVHB     ^  ,Units  Vizier example Vizier collection has 231 tables containing proper motions: These have 19 different units for it. IAU recommends using SI units where possible (with a few extensions for astronomy) SI unit of angular velocity is radians/second None of the 231 tables uses radians/second. d<-<)   Importance of supporting units The cone-search works only because the REGION keyword has the units of its parameters specified (degrees). Almost all other selections involving numerical quantities will be meaningless (or at the very least error-prone) without units information.$a   JPossible solution (1) Give astronomer information on units in each columnKKJ  OK for simple queries on one table Will not work in general for multi-table queries, as different tables will use different units. Would have to rely astronomers checking all necessary metadata before submitting a query.     xPossible solution (2) Use  standard units in all interfaces==< Data access layer of DBMS has to do the conversion to/from units used in each column. Problems: Will be hard to get agreement on standard units for hundreds of quantities Different branches of astronomy have different conventions, e.g. flux in mJy, Jy. SI units are approved by IAU, but rarely used SI units with large scaling factors make browsing harder, errors much more likely. <` `  h       =Possible solution (3) Allow users to specify units in queries>>= DExample SELECT & WHERE properMotion > 100 [mas/yr] Notation: Suggest CDS notation, based on earlier work at GSFC which is compact, flexible, unambiguous, easy to read. http://vizier.u-strasbg.fr/doc/catstd.htx Would be nice to avoid the brackets, e.g.  100mas/yr but this will be hard to parse unambiguously. Problem: how does the VO system know what units are attributed to each column? Answer: metadata queries.+ k+dR+ k*d R  p     z "  d 0(Requirements on system for unit-handling( Need to find the units of each column: a metadata query. Need to parse the units strings in a query and apply the appropriate conversion functions. A few complications: Do magnitudes have units, or as the logarithm of a ratio are they unit-less? Functions in queries need special treatment, e.g. COS returns a unit-less value, ACOS always returns radians. Some conversions cannot be exact, e.g. magnitudes to flux densities in Janskys. (Suggest: leave until later.)N=o=o.     Metadata Queries \VO opinion seems divided on whether metadata queries should be satisfied by the DBMS (via DAL?) or by the Registry. We need to invent an interface which can be implemented either way. One can regard the registry as a kind of cache of metadata. Absolutely minimal function: What is the units string of column C of table T of database D at site S? &LL\ PMetadata query  desirable functionality( Ask which databases are available at site S Ask which tables are in database D Ask which columns are in table T Ask for properties of a table: length, width, indexes available, astronomical metadata (epoch, equinox, ...), curational metadata (version, author, ...) Ask for properties of a column: name, data-type, units, nullability, UCD, short description, data range, ... Provided these functions are supported, it does not matter whether the information comes from the DBMS or from the Registry. ZH   Y    rei  Oh+'08 hp    Simple ADQL Enhancementsi Page A edi200304L  Page3010gMicrosoft PowerPointnts@@_i@@`G??@@0!(A=Gg  -  Y--$XX--'@@Times New Roman-. 2 5 2003 May 24 ."System-@@Times New Roman-. 2  Clive Page.-@@Times New Roman-. -2 Physical Units in Queries     .-n՜.+,0x    On-screen ShowUniversity of Leicesterk  ArialTimes New Roman edi200@