TAP and large resultsets

Kona Andrews kea at roe.ac.uk
Tue Jan 9 07:43:42 PST 2007


Greetings colleagues,

Happy New Year and I hope you have had a restful and productive holiday.
Mine was productive of two colds and a mild case of flu, which just goes
to show what a bad idea it is ever to stop working ;-)

Prior to our telecon next week, I wanted to raise a point about the 
TAP protocol, in particular about having a paged interface for large 
queries (so the user can bring back the query results in small ordered 
chunks), as we briefly discussed last time.

First, some background.

In AstroGrid, we have a deployment-oriented remit whereby part of our
goal is to get our software deployed in "third-party" institutions (i.e.
by people outside our own team/locations), and ideally in *all* UK 
institutions.  Two things that deployers emphasise as critical to
them are:

1. Components should have a low installation/maintenance cost in 
   human time (and I acknowledge we still have much work to do here!)

2. Components should have a low resource requirement

(In other words, "we'll deploy it as long as we don't have to do very
much and it doesn't require any additional hardware; we have no time
and no money."  Etc.  Fair enough.)

In the case of the AstroGrid DataSet Access (DSA) component, the 
architecture was very carefully designed to be fully streaming (partly
to reduce resource requirement, and partly to ensure an architecture that
scaled to the very large queries envisaged as a normal event in the VO).
In other words, in the course of processing a query, the query results 
never need to be cached in memory or on disk.  This means, for
example, that a DSA running in a Tomcat with (e.g.) 64MB of memory 
and no additional "scratch disk" resources can successfully return
multi-*gigabyte* query results files to VoSpace, if requested.  

This fully-streamed approach has additional benefits, in that the 
component is not vulnerable to the filling-up of disk caches and there
is no disk-maintenance load (flushing old files, managing quotas etc).
However, the streamed approach has implications for offering results
paging as a part of TAP - namely that, since the results are not cached
anywhere, each time a page is requested in a TAP query, the full query
must be (re-)run and only the relevant subset of results returned to the 
user.

While inefficient, this is obviously not impossible to implement, and 
we can certainly implement paging as part of our TAP support.  However, 
I am strongly opposed to making the paged interface *compulsory*.  

Our observation with "real deployers" of AG software has been that, if 
an AstroGrid component starts to hammer too heavily/obviously on an 
institution's resources, then the institution responds by wanting to 
disable it (perhaps I should have added a point 3 above: "Give us 
any trouble and you're outta here...").  For example, some AstroGrid
deployers have specifically disabled the conesearch interface on their 
DSAs until conesearch efficiency improvements are in place [mea culpa].

If paged support in TAP is *optional*, then we can provide a mechanism to
selectively disable it.  Then, if an institution finds that paged 
querying is clogging up the database because of the repetition of 
intensive queries, they can switch the *paging function* of TAP
off (or limit/throttle it in some way), but still support e.g. simpler 
unpaged queries.  However, if paging is compulsory in TAP, then they may 
just switch the whole TAP interface off - or maybe the whole component - 
to the greater detriment of the users who then can't run queries at all.  
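
As an illustration of what "optional" could mean for a deployer, here is
a rough Python sketch of a per-service switch.  Everything in it is an
assumption made for the example (TapConfig, paging_enabled, max_page,
PagingUnavailable, handle_request); nothing here is taken from the TAP
drafts or from the DSA configuration.

```python
from dataclasses import dataclass
from typing import Optional


class PagingUnavailable(Exception):
    """Raised when a paged request is refused by the service configuration."""


@dataclass
class TapConfig:
    # Hypothetical deployer settings, not part of any TAP specification.
    paging_enabled: bool = True
    max_page: Optional[int] = None  # highest page index allowed; None = no limit


def handle_request(config, query, page=None):
    """Decide how to serve a request; returns a (mode, query, page) tuple."""
    if page is None:
        # Unpaged queries are always served, whatever the paging settings.
        return ("unpaged", query, None)
    if not config.paging_enabled:
        raise PagingUnavailable("paging is switched off on this service")
    if config.max_page is not None and page > config.max_page:
        raise PagingUnavailable("page index exceeds the configured limit")
    return ("paged", query, page)
```

With paging_enabled set to False the service still answers unpaged
queries and refuses only the paged ones, which is the behaviour argued
for above; max_page is one crude way of throttling rather than disabling.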

I realise that it may seem that I'm driving the interface protocol spec 
based on a particular implementation (our streamed DSA component).
However, I do honestly believe that a streamed architecture for querying 
is the only sensible choice for scalability (handling arbitrarily large 
results and arbitrarily large numbers of simultaneous queries); anything 
based on disk caching is always going to hit the limits of the available 
disk cache at some point - sooner rather than later if deployers are
stingy with resources.   

All the best,
Kona



More information about the voql-teg mailing list