Why not use Amazon S3?

Matthew Graham mjg at cacr.caltech.edu
Mon Nov 20 16:44:32 PST 2006


Hi,

I've been working with Tamas Budavari at JHU on building an Amazon S3 
interface on top of a db, the idea being to introduce this into CasJobs. 
A VOSpace interface will be added later but can reuse much of the 
infrastructure and code for the S3 interface. So why aren't we using S3 
in the VO instead of VOSpace? Well, having been through S3, we now have 
a much better idea of what its problems are:

- Security:
S3 uses private/public keys to sign messages (note: this is not 
WS-Security). When you register with S3, it generates a private-public 
key pair for you which you can then download. Any SOAP method call then 
requires three additional elements: AWSAccessKeyId - your public key 
(used as an identifier for you), TimeStamp - the current UTC time, and 
Signature - a HMAC-SHA1 digest of the concatenation of "AmazonS3" + 
operation name + TimeStamp using your private key. Since S3 also has 
your private key, it can construct the Signature element it expects for 
a given message, compare it to what it receives from you and authenticate 
your request (which must also travel over SSL to the server).
There are three problems with this approach. Firstly, there are two 
copies of your private key - one on your machine and one on the S3 
server. If the S3 server gets compromised then you are screwed. Secondly, 
it does not scale to multiple S3 servers: each server will have its own 
private key for each user, so to communicate with many servers, e.g. for 
federation purposes, the client will have to manage a different private 
key for each one. Thirdly, delegation is not secure: as the server 
already has your private key, it could initiate a data transfer as you 
without you knowing about it.
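
To make the signing scheme above concrete, here is a minimal sketch in 
Python. It only illustrates the HMAC-SHA1 construction described in this 
section. The function names, the base64 encoding of the digest, the 
timestamp format and the sample key are assumptions for illustration, 
not taken from the S3 documentation.

```python
import base64
import hashlib
import hmac


def sign_request(secret_key: str, operation: str, timestamp: str) -> str:
    """Sign "AmazonS3" + operation + timestamp with HMAC-SHA1.

    The digest is base64-encoded here so it can travel as text in a
    SOAP element (assumed encoding).
    """
    message = "AmazonS3" + operation + timestamp
    digest = hmac.new(
        secret_key.encode("utf-8"),
        message.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def verify_request(stored_key: str, operation: str, timestamp: str,
                   signature: str) -> bool:
    """Server side: recompute the signature from the stored copy of the
    key and compare it with the one received."""
    expected = sign_request(stored_key, operation, timestamp)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    key = "hypothetical-secret-key"        # made-up key for the demo
    timestamp = "2006-11-20T16:44:32Z"     # assumed UTC timestamp format

    # Client signs a request; server verifies it with its own copy.
    signature = sign_request(key, "ListBucket", timestamp)
    print(verify_request(key, "ListBucket", timestamp, signature))

    # The delegation problem: the server holds the same key, so it can
    # produce a valid signature for an operation the user never asked for.
    forged = sign_request(key, "DeleteObject", timestamp)
    print(verify_request(key, "DeleteObject", timestamp, forged))
```

Both calls in the demo print True. The second one is the point: 
nothing in the signature distinguishes a request signed by the user 
from one signed by whoever else holds the key.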

- Containers:
S3 has one level of containers (called buckets) into which data objects 
can be grouped. Buckets cannot contain further buckets, however, and 
bucket names must be unique since the bucket namespace is global, so 
there can only be one bucket called "MyData" across all users. Metadata 
also cannot be associated with buckets, only with individual data 
objects.

- Pseudo-hierarchy:
Listing in S3 supports a pseudo-hierarchical view of data: if you 
construct the identifier for data objects in a hierarchical fashion, 
e.g. using '/' to delimit levels, then you can view the levels when you 
list by specifying what the delimiter is and what the identifier for the 
level you want to view is, e.g. my delimiter is '/' and I want to list 
all objects that I have labelled with identifiers beginning 
'/mydata/galaxies/images/'. The problem is that none of the data 
transfer operations recognise this hierarchy (only the listing operation 
does), so if I try to retrieve '/mydata/galaxies/images' as an implied 
collection because I could list it, I am going to get an error since the 
container does not actually exist - S3 only allows one level of 
hierarchy, with buckets.
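
The mismatch between listing and retrieval described above can be 
sketched with a toy in-memory model in Python. This is not the S3 API: 
the bucket is a plain dict of flat keys, and list_keys and get_object 
are hypothetical helpers that only mimic the behaviour described here.

```python
def list_keys(bucket, prefix="", delimiter=None):
    """List keys under a prefix, rolling deeper levels up into
    common prefixes when a delimiter is given.

    Returns (objects, common_prefixes).
    """
    objects = []
    common_prefixes = set()
    for key in sorted(bucket):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Deeper level: report only the next "directory" name.
            level = rest.split(delimiter, 1)[0]
            common_prefixes.add(prefix + level + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


def get_object(bucket, key):
    """Retrieve by exact key only; the hierarchy is not recognised."""
    if key not in bucket:
        raise KeyError("no such object: " + key)
    return bucket[key]


if __name__ == "__main__":
    # One bucket holding flat keys that merely look hierarchical.
    bucket = {
        "/mydata/galaxies/images/m31.fits": b"...",
        "/mydata/galaxies/images/m51.fits": b"...",
        "/mydata/galaxies/catalogue.csv": b"...",
    }

    # Listing shows 'images/' as if it were a container.
    print(list_keys(bucket, "/mydata/galaxies/", "/"))

    # Retrieval of that apparent container fails.
    try:
        get_object(bucket, "/mydata/galaxies/images")
    except KeyError as err:
        print("error:", err)
```

The listing reports '/mydata/galaxies/images/' as a level, but that 
level exists only as a shared prefix of the key strings, so there is 
nothing to retrieve under that name.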

- Transfer protocols:
Data transfer is either inline in the SOAP message or as a DIME 
attachment. No other protocols are supported.

- Federation:
There is no notion of multiple S3 servers or how to federate them to 
present a single consistent space.

- Extensibility and future functionality:
There is no clear roadmap for how and when S3 is going to evolve. 
Future functionality is promised but with no milestones or timescales. 
The existing methods also cannot be altered at the WSDL level since 
that would break all the existing S3 clients, and the availability of 
those clients is one of the attractions of looking at S3 in the first 
place.

In summary, although S3 is simple and has user tools, it is not scalable 
to multiple instances and has some strange organisational features which 
rely on interpretations not documented in the contract. I would, 
however, advocate S3 for single-instance use on large data collections.

    Cheers,

    Matthew



