TAP and large resultsets
Kona Andrews
kea at roe.ac.uk
Tue Jan 9 07:43:42 PST 2007
Greetings colleagues,
Happy New Year and I hope you have had a restful and productive holiday.
Mine was productive of two colds and a mild case of flu, which just goes
to show what a bad idea it is ever to stop working ;-)
Prior to our telecon next week, I wanted to raise a point about the
TAP protocol, in particular about having a paged interface for large
queries (so the user can bring back the query results in small ordered
chunks), as we briefly discussed last time.
First, some background.
In AstroGrid, we have a deployment-oriented remit whereby part of our
goal is to get our software deployed in "third-party" institutions (i.e.
by people outside our own team/locations), and ideally in *all* UK
institutions. Two things that deployers emphasise as critical to
them are:
1. Components should have a low installation/maintenance cost in
human time (and I acknowledge we still have much work to do here!)
2. Components should have a low resource requirement
(In other words, "we'll deploy it as long as we don't have to do very
much and it doesn't require any additional hardware; we have no time
and no money." Etc. Fair enough.)
In the case of the AstroGrid DataSet Access (DSA) component, the
architecture was very carefully designed to be fully streaming (partly
to reduce resource requirement, and partly to ensure an architecture that
scaled to the very large queries envisaged as a normal event in the VO).
In other words, in the course of processing a query, the query results
never need to be cached in memory or on disk. This means, for
example, that a DSA running in a Tomcat with (e.g.) 64 MB of memory
and no additional "scratch disk" resources can successfully return
multi-*gigabyte* query results files to VOSpace, if requested.
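As a minimal sketch of the streaming idea (not the actual DSA code, which is not shown in this message), the following Python reads rows from a database cursor in small batches and writes each batch straight to an output sink. The function name stream_query, the table src, and the use of SQLite and CSV are all assumptions made for illustration; the in-memory sink in the demo stands in for a network or VOSpace stream.

```python
import csv
import io
import sqlite3


def stream_query(conn, sql, sink, batch_size=100):
    """Write query results to `sink` as CSV, one small batch at a time.

    Hypothetical helper: at most `batch_size` rows are held in memory
    at once, and nothing is written to a scratch disk.
    """
    cur = conn.execute(sql)
    writer = csv.writer(sink)
    # Header row taken from the cursor's column metadata.
    writer.writerow([col[0] for col in cur.description])
    total = 0
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)  # forward the batch, then drop it
        total += len(rows)
    return total


def demo_stream():
    # Illustrative in-memory table; names and values are made up.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (id INTEGER, ra REAL)")
    conn.executemany(
        "INSERT INTO src VALUES (?, ?)",
        [(i, i * 0.5) for i in range(1000)],
    )
    sink = io.StringIO()  # stand-in for a remote output stream
    n = stream_query(conn, "SELECT id, ra FROM src ORDER BY id", sink)
    return n, sink.getvalue()
```

The memory needed by stream_query depends on batch_size rather than on the size of the result set, which is the property described above.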
This fully-streamed approach has additional benefits, in that the
component is not vulnerable to the filling-up of disk caches and there
is no disk-maintenance load (flushing old files, managing quotas etc).
However, the streamed approach has implications for offering results
paging as a part of TAP - namely that, since the results are not cached
anywhere, each time a page is requested in a TAP query, the full query
must be (re-)run and only the relevant subset of results returned to the
user.
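The cost of paging without a cache can be sketched as follows. This is an illustration under stated assumptions, not the TAP specification or the DSA implementation: CountingSource is a hypothetical stand-in for the database that counts how often the query is executed and how many rows are pulled through the stream, and fetch_page is a hypothetical page handler that re-runs the query and discards the rows before the requested page.

```python
import itertools


class CountingSource:
    """Hypothetical stand-in for a database that records its workload."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self.executions = 0      # number of times the query was (re-)run
        self.rows_produced = 0   # rows pulled through the stream in total

    def run_query(self):
        self.executions += 1
        for i in range(self.n_rows):
            self.rows_produced += 1
            yield i


def fetch_page(source, page, page_size):
    """Return one page by re-running the query and skipping earlier rows."""
    start = page * page_size
    stream = source.run_query()
    # islice discards the first `start` rows and stops after the page.
    return list(itertools.islice(stream, start, start + page_size))


def demo_paging():
    source = CountingSource(100)
    pages = [fetch_page(source, p, 10) for p in range(10)]
    return pages, source.executions, source.rows_produced
```

In this sketch, reading 100 rows as ten pages of ten runs the query ten times and pulls 550 rows through the stream, against one run and 100 rows for an unpaged request. A real database may do more work than this per page, since an ordered query often has to be evaluated in full before the first row is available.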
While inefficient, this is obviously not impossible to implement, and
we can certainly implement paging as part of our TAP support. However,
I am strongly opposed to making the paged interface *compulsory*.
Our observation with "real deployers" of AG software has been that, if
an AstroGrid component starts to hammer too heavily/obviously on an
institution's resources, then the institution responds by wanting to
disable it (perhaps I should have added a point 3 above: "Give us
any trouble and you're outta here..."). For example, some AstroGrid
deployers have specifically disabled the conesearch interface on their
DSAs until conesearch efficiency improvements are in place [mea culpa].
If paged support in TAP is *optional*, then we can provide a mechanism to
selectively disable it. Then, if an institution finds that paged
querying is clogging up the database because of the repetition of
intensive queries, they can switch the *paging function* of TAP
off (or limit/throttle it in some way), but still support e.g. simpler
unpaged queries. However, if paging is compulsory in TAP, then they may
just switch the whole TAP interface off - or maybe the whole component -
to the greater detriment of the users who then can't run queries at all.
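One way the optional-paging switch could look is sketched below. Everything here is hypothetical: TapServiceConfig, its paging_enabled flag, the PagingDisabled error and handle_request are names invented for illustration and are not part of TAP or of any AstroGrid component.

```python
import itertools
from dataclasses import dataclass


class PagingDisabled(Exception):
    """Raised when a paged request reaches a service with paging off."""


@dataclass
class TapServiceConfig:
    # Hypothetical deployer-controlled switch; defaults to paging on.
    paging_enabled: bool = True


def handle_request(config, run_query, page=None, page_size=100):
    """Serve an unpaged request always, a paged one only if allowed.

    `run_query` is any zero-argument callable returning a fresh
    iterator over the result rows.
    """
    if page is None:
        # Unpaged: stream the whole result, regardless of the switch.
        return run_query()
    if not config.paging_enabled:
        raise PagingDisabled("paged queries are switched off on this service")
    start = page * page_size
    return itertools.islice(run_query(), start, start + page_size)
```

With paging_enabled set to False, a request without a page number is still answered, while a request for a specific page is refused rather than triggering a repeated run of the query.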
I realise that it may seem that I'm driving the interface protocol spec
based on a particular implementation (our streamed DSA component).
However, I do honestly believe that a streamed architecture for querying
is the only sensible choice for scalability (handling arbitrarily large
results and arbitrarily large numbers of simultaneous queries); anything
based on disk caching is always going to hit the limits of the available
disk cache at some point - sooner rather than later if deployers are
stingy with resources.
All the best,
Kona