# implementing REGION

Patrick Dowler patrick.dowler at nrc-cnrc.gc.ca
Thu Nov 23 12:40:25 PST 2006

```Here is a brief description of how we implement geometry querying (REGION). We
are using a clustered DB2 database for this (8 nodes, 16 partitions at the
moment).

Spatial querying is implemented using a feature of DB2 called Multi
Dimensional Clustering (MDC), which is essentially a multi-dimensional
clustered index. A simple example:

table positions (
-- the real columns:
ra double precision not null,
dec double precision not null,
-- in cartesian coordinates:
x double precision not null,
y double precision not null,
z double precision not null,

-- generated columns: v+1 in [0,2]
-- so iv =[0,2]*5 in [0,10] for ~1000 bins
ix generated always as (smallint( ( x + 1 ) * 5 )),
iy generated always as (smallint( ( y + 1 ) * 5 )),
iz generated always as (smallint( ( z + 1 ) * 5 ))
)
organize by (ix, iy, iz)

Then we have to transform something like REGION(a, b, r, ICRS) via:

* transform a,b to tx,ty,tz (using same code/conventions that populated the
table), and then put this into the SQL for fast, indexed bounding cube
search:

WHERE x between tx-r and tx+r
AND y between ty-r and ty+r
AND z between tz-r and tz+r

DB2 knows that the binning columns are generated from x,y,z so it can use the
binning columns to optimise the search. For points we can make refine the SQL
to be a sphere intersecting the unit sphere (equivalent to a circle on the
sky) by sticking some extra conditons into the SQL, such as:

AND ((tx-x)*(tx-x) + (ty-y)*(ty-y)+ (tz-z)*(tz-z)) < r*r

or we can do some post-filtering in software to throw away some edge cases
that get past the first cube conditions. In practice this is what we do for
observations because instead of ra,dec we have a polygon, instead of x,y,z we
have a bounding cube, and instead of the simple distance computation we have
to do full geometric intersection, which I didn't want to put in the DB.

So, our approach is a combination of replacing REGION(...) with some other
simple SQL conditions and some post-filtering of the results to remove
(literally) edge cases.

As a point of interest, this approach is taking advantage of the parallel
query processing in our clustered DB2 system and the fact that using MDC
(scan bins instead of whole table) makes the query performance dependent more
on bandwidth (disk/memory/cpu throughout) rather than latency (disk seek
time), which just doesn't scale all that well.... a high performance but
totally brute force approach with minimal software complexity :-)

We also use MDC to "index" spectral and temporal intervals and the related
intersection kind of conditions.

--

Patrick Dowler
Tel/Tél: (250) 363-6914                  | fax/télécopieur: (250) 363-0045