Why not use Amazon S3?
mjg at cacr.caltech.edu
Mon Nov 20 16:44:32 PST 2006
I've been working with Tamas Budavari at JHU on building an Amazon S3
interface on top of a db, the idea being to introduce this into CasJobs.
A VOSpace interface will be added later but can reuse much of the
infrastructure and code for the S3 interface. So why aren't we using S3
in the VO instead of VOSpace? Having been through S3, we now have a much
better idea of what its problems are:
S3 uses private/public keys to sign messages (note: this is not
WS-Security, and since the signature is an HMAC the "private" key is
really a shared secret). When you register with S3, it generates a
private-public key pair for you which you can then download. Any SOAP
method call then requires three additional elements: AWSAccessKeyId -
your public key (used as an identifier for you), TimeStamp - the current
UTC time, and Signature - an HMAC-SHA1 digest of the concatenation
"AmazonS3" + operation name + TimeStamp, computed with your private key.
Since S3 also has your private key, it can construct the Signature
element it expects for a given message, compare it with what it receives
from you, and authenticate your request (which must also travel over SSL
to the server).
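The signing scheme described above can be sketched in a few lines of
Python. The string-to-sign ("AmazonS3" + operation + timestamp) follows
the description in this post; the base64 encoding of the digest and the
key values are assumptions for illustration, not copied from the S3
documentation:

```python
import base64
import hashlib
import hmac

def sign_s3_soap_request(secret_key: str, operation: str, timestamp: str) -> str:
    """Compute the Signature element for an S3 SOAP call:
    HMAC-SHA1 over "AmazonS3" + operation name + timestamp,
    keyed with the user's private (secret) key."""
    string_to_sign = "AmazonS3" + operation + timestamp
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    # Assumed: digest is base64-encoded for transport in the XML element.
    return base64.b64encode(digest).decode("ascii")

# Hypothetical key and timestamp, purely for illustration.
signature = sign_s3_soap_request("my-secret-key", "GetObject",
                                 "2006-11-20T16:44:32.000Z")
```

Note that because the same secret key appears on both sides of the
computation, the server can always forge any signature the client can
produce - which is the root of the three problems below.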
There are three problems with this approach. Firstly, there are two
copies of your private key - one on your machine and one on the S3
server - so if the S3 server is compromised, so is your key. Secondly,
it does not scale to multiple S3 servers: each server holds its own
private key for each user, so to communicate with many servers, e.g. for
federation purposes, the client has to manage a separate private key for
every server it talks to. Thirdly, delegation is not secure: since the
server already has your private key, it could initiate a data transfer
as you without you knowing about it.
S3 has one level of containers (called buckets) into which data objects
can be grouped. Buckets cannot contain further buckets, however, and
bucket names must be unique: the bucket namespace is global, so there
can only ever be one bucket called "MyData" across all of S3. Metadata
cannot be associated with buckets, only with individual data objects.
Listing in S3 supports a pseudo-hierarchical view of data: if you
construct the identifiers for your data objects hierarchically, e.g.
using '/' to delimit levels, then you can view a given level when
listing by specifying the delimiter and the identifier prefix of the
level you want, e.g. my delimiter is '/' and I want to list all objects
whose identifiers begin '/mydata/galaxies/images/'. The problem is that
none of the data transfer operations recognises this hierarchy (only the
listing operation does), so if I try to retrieve
'/mydata/galaxies/images' as an implied collection - which I could list
- I get an error because the container does not actually exist: S3 only
allows one level of real hierarchy, the bucket.
- Transfer protocols:
Data transfer is either inline in the SOAP message or as a DIME
attachment. No other protocols are supported.
There is no notion of multiple S3 servers or how to federate them to
present a single consistent space.
- Extensibility and future functionality:
- There is no clear roadmap for how and when S3 is going to evolve.
Future functionality is promised but with no milestones or timescales.
Nor can the existing methods be altered at the WSDL level, since that
would break all the existing S3 clients - whose availability is one of
the attractions of looking at S3 in the first place.
In summary, although S3 is simple and has user tools, it does not scale
to multiple instances and has some strange organisational features which
rely on interpretations not documented in the contract. I would,
however, advocate S3 for single-instance use on large data collections.