Why not use Amazon S3?
mjg at cacr.caltech.edu
Mon Nov 20 19:50:20 PST 2006
I agree that there is much in favour of S3 and that we could build other
interfaces such as VOSpace on top of it, but S3 does not meet the use
cases we have for VOSpace. Once we start extending an existing interface
such as S3 for our own needs, we have to ask why not just define our own
from the start.
Tamas Budavari wrote:
> Indeed, why not use S3, which BTW stands for Simple Storage Service!
> That is a nice summary, Matthew, and let me just add that most of these
> listed weaknesses of S3 may be considered its strengths, as well, and the
> rest (not REST!) is implementation details ;-) I recommend you read
> Matthew's message again, while chanting the following mantra:
> SIMPLE, WORKS, NOW :-)
> No, I am serious, read it again! People pay for Amazon S3... To me S3 is
> an interface definition that can be implemented to leverage the industry
> standard client tools (and all the documentation, user forums, etc out
> there) and there is nothing to stop us from extending its functionalities,
> having multiple instances of S3 servers, or even implementing other data
> access interfaces, such as VOSpace 1.x on top of the same data storage.
> Cheers, T.
> On Mon, 20 Nov 2006, Matthew Graham wrote:
>> I've been working with Tamas Budavari at JHU on building an Amazon S3
>> interface on top of a db, the idea being to introduce this into CasJobs.
>> A VOSpace interface will be added later but can reuse much of the
>> infrastructure and code for the S3 interface. So why aren't we using S3
>> in the VO instead of VOSpace? Well, having been through S3, we now
>> have a much better idea of what its problems are:
>> - Security:
>> S3 uses private/public keys to sign messages (note: this is not
>> WS-Security). When you register with S3, it generates a private-public
>> key pair for you which you can then download. Any SOAP method call then
>> requires three additional elements: AWSAccessKeyId - your public key
>> (used as an identifier for you), TimeStamp - the current UTC time, and
>> Signature - a HMAC-SHA1 digest of the concatenation of "AmazonS3" +
>> operation name + TimeStamp using your private key. Since S3 also has
>> your private key, it can construct the Signature element it expects
>> for a given message, compare it with what it receives from you, and
>> authenticate your request (which must also travel over SSL to the server).
>> There are three problems with this approach. Firstly, there are two
>> copies of your private key - one on your machine and one on the S3
>> server - so if the S3 server is compromised, so is your key. Secondly,
>> it does not scale to multiple S3 servers: each server will hold its
>> own private key for each user, so to communicate with many servers,
>> e.g. for federation purposes, the client will have to manage multiple
>> private keys. Thirdly, delegation is not secure: since the server
>> already has your private key, it could initiate a data transfer as you
>> without your knowledge.
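The signing scheme described above can be sketched in Python. This is a minimal illustration of the string-to-sign given in the message ("AmazonS3" + operation name + timestamp, HMAC-SHA1 under the secret key); the function name and the placeholder access key are illustrative, not Amazon's actual API:

```python
import base64
import hashlib
import hmac
from datetime import datetime, timezone

def sign_s3_soap_call(operation, secret_key, timestamp=None):
    """Build the three extra SOAP elements S3 expects, as described above.

    The string to sign is "AmazonS3" + operation name + timestamp,
    digested with HMAC-SHA1 under the account's private (secret) key.
    Names here are placeholders for illustration, not Amazon's API.
    """
    if timestamp is None:
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    string_to_sign = "AmazonS3" + operation + timestamp
    digest = hmac.new(secret_key, string_to_sign.encode("ascii"),
                      hashlib.sha1).digest()
    return {
        "AWSAccessKeyId": "AKIA-EXAMPLE",  # your public key, i.e. your identifier
        "Timestamp": timestamp,
        "Signature": base64.b64encode(digest).decode("ascii"),
    }
```

Note that the same secret_key bytes must sit on both ends for the server to recompute the digest, which is exactly the two-copies problem described above.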
>> - Containers:
>> S3 has one level of containers (called buckets) into which data objects
>> can be grouped. Buckets cannot contain further buckets, and bucket
>> names must be globally unique since the bucket namespace is shared by
>> all users, so there can only ever be one bucket called "MyData".
>> Metadata cannot be attached to buckets, only to individual data objects.
>> - Pseudo-hierarchy:
>> Listing in S3 supports a pseudo-hierarchical view of data: if you
>> construct the identifier for data objects in a hierarchical fashion,
>> e.g. using '/' to delimit levels, then you can view the levels when you
>> list by specifying what the delimiter is and what the identifier for the
>> level you want to view is, e.g. my delimiter is '/' and I want to list
>> all objects that I have labelled with identifiers beginning
>> '/mydata/galaxies/images/'. The problem is that only the listing
>> operation recognises this hierarchy - none of the data transfer
>> operations do - so if I try to retrieve '/mydata/galaxies/images' as
>> an implied collection, since I could list it, I get an error because
>> that container does not actually exist: S3 only allows one level of
>> hierarchy, with buckets.
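The delimiter-based listing described above can be sketched as a toy re-implementation (not Amazon's code; the key names are illustrative). A flat set of object keys is split, relative to a prefix, into direct "files" and common prefixes that stand in for "subdirectories":

```python
def list_level(keys, prefix="", delimiter="/"):
    """Group flat object keys into S3's pseudo-hierarchical listing view:
    objects directly under `prefix`, plus the common prefixes that look
    like subdirectories. A sketch of the behaviour described above."""
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter becomes an implied level.
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(common_prefixes)

keys = [
    "mydata/galaxies/images/m31.fits",
    "mydata/galaxies/images/m51.fits",
    "mydata/galaxies/catalog.vot",
    "mydata/readme.txt",
]
```

Listing at "mydata/galaxies/" here yields the object "mydata/galaxies/catalog.vot" plus the implied level "mydata/galaxies/images/" - yet a GET of "mydata/galaxies/images" itself would fail, since no such object exists; the hierarchy lives only in the listing.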
>> - Transfer protocols:
>> Data transfer is either inline in the SOAP message or as a DIME
>> attachment. No other protocols are supported.
>> - Federation:
>> There is no notion of multiple S3 servers or how to federate them to
>> present a single consistent space.
>> - Extensibility and future functionality:
>> There is no clear roadmap for how and when S3 is going to evolve.
>> Future functionality is promised but with no milestones or timescales.
>> Nor can the existing methods be altered at the WSDL level, since that
>> would break all the existing S3 clients - compatibility with which is
>> one of the attractions of looking at S3 in the first place.
>> In summary, although S3 is simple and has user tools, it is not scalable
>> to multiple instances and has some strange organisational features which
>> rely on interpretations not documented in the contract. I would,
>> however, advocate S3 for single-instance use on large data collections.