Data Storage on Amazon S3

The standard mechanism to store, retrieve, and share large quantities of data in AWS is the Amazon S3 object store. We recommend using it to keep all data that must outlive your workload clusters.

Features of Amazon S3

Amazon S3 is suitable for storing long-term data that can outlive workload clusters. Among its other features, data stored in Amazon S3 can be shared with external applications, which makes it an effective means to collect and publish data across applications.

Common Use Cases

Common use cases for Amazon S3 with workload clusters include collecting data for analysis even while no workload clusters are active, publishing data from Hadoop applications to Amazon S3 so that it remains accessible after the cluster has been shut down, and copying data between HDFS and Amazon S3 when needed.
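As an illustrative sketch of the last use case, the following Java program copies a directory tree from HDFS into an S3 bucket through Hadoop's FileSystem API. The cluster address, bucket name, and paths are hypothetical, and the sketch assumes the S3A connector is already configured with credentials; at scale, the same copy is usually performed with a tool such as Apache DistCp.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class CopyHdfsToS3 {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical source and destination; substitute your own cluster and bucket.
            Path src = new Path("hdfs://namenode:8020/results/run-0001");
            Path dst = new Path("s3a://my-bucket/published/run-0001");

            FileSystem hdfs = FileSystem.get(src.toUri(), conf);
            FileSystem s3 = FileSystem.get(dst.toUri(), conf);

            // Copy the directory tree to S3; 'false' leaves the HDFS source in place.
            FileUtil.copy(hdfs, src, s3, dst, false, conf);
        }
    }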

Core Concepts

Before working with Amazon S3, you should be familiar with its core concepts: buckets, the objects stored in them, and the keys that identify those objects.
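For illustration, a single object is addressed through Hadoop's S3A connector by its bucket name and object key, both hypothetical in this small sketch:

    import org.apache.hadoop.fs.Path;

    public class S3UriExample {
        public static void main(String[] args) {
            // Bucket: my-bucket (the top-level container)
            // Key:    data/2021/part-0000.csv (the object's name within the bucket)
            Path object = new Path("s3a://my-bucket/data/2021/part-0000.csv");
            System.out.println(object.toUri());
        }
    }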

Learn More

For general information about Amazon S3, refer to the Amazon S3 documentation. For guidance on securing your data, see Best Practices for S3 Security.

Limitations of Amazon S3

Even though Hadoop's S3A client can make an S3 bucket appear to be a Hadoop-compatible filesystem, it is still an object store with some limitations. The key things to be aware of, each described below, are that directory operations may be slow and nonatomic, that data is not written until the output stream's close() operation, and that the object store may display eventual consistency.

For these reasons, while Amazon S3 can be used as the source and store for persistent data, it cannot be used as a direct replacement for a cluster-wide filesystem such as HDFS. This is important to remember, because the fact that S3 is accessed through the same APIs can be misleading.
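The sketch below shows why the identical API can be misleading: it lists a "directory" in a hypothetical bucket through the ordinary Hadoop FileSystem interface, using exactly the calls that would be used against HDFS, even though the semantics described in the following subsections differ.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListS3aDirectory {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical bucket; assumes S3A credentials are already configured.
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);

            // The same listing call works against HDFS and S3A alike,
            // but behind the scenes S3A is querying an object store.
            for (FileStatus status : fs.listStatus(new Path("s3a://my-bucket/datasets/"))) {
                System.out.println(status.getPath() + "  " + status.getLen());
            }
        }
    }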

Directory Operations May Be Slow and Nonatomic

Directory rename and delete operations may be performed by the client as a series of individual operations on the underlying objects.

Specifically, delete(path, recursive=true) may be implemented as "list the objects, and delete them singly or in batches", and rename(source, dest) may be implemented as "copy all the objects, and then delete them".

This may have the following consequences: the time taken is proportional to the number of objects involved and, for renames, to the amount of data to be copied; a failure partway through can leave the destination in an inconsistent, partially-changed state; and because other clients can observe the intermediate state, neither operation is atomic.

Because of these behaviors, committing work by renaming directories is neither efficient nor reliable.
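As a simplified illustration of this behavior (not the actual S3A implementation), a recursive delete against an object store can amount to a client-side loop like the following, run here against a hypothetical bucket. Its running time grows with the number of objects, and a failure partway through leaves some objects deleted and others not.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class NaiveRecursiveDelete {
        // List every object under the path, then delete each one individually.
        static void deleteRecursively(FileSystem fs, Path dir) throws Exception {
            RemoteIterator<LocatedFileStatus> objects = fs.listFiles(dir, true);
            while (objects.hasNext()) {
                fs.delete(objects.next().getPath(), false);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical bucket and path, for illustration only.
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
            deleteRecursively(fs, new Path("s3a://my-bucket/tmp/job-output/"));
        }
    }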

Data is Not Written Until the OutputStream's close() Operation

Data written to an object store is often buffered to a local file or stored in memory until one of the following happens: the output stream's close() operation is invoked, or (where supported and enabled) enough data has been buffered to upload a part of a multipart upload.

Calls to OutputStream.flush() are usually a no-op, or are limited to flushing data to any local buffer file. Until close() completes, the data is not guaranteed to be visible in the object store, and it may be lost entirely if the client fails before close() is called.
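The sketch below (hypothetical bucket and path) shows what this means in practice: nothing written to the stream is guaranteed to be in the bucket until close() returns.

    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteToS3a {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);

            FSDataOutputStream out = fs.create(new Path("s3a://my-bucket/reports/summary.txt"));
            out.write("report contents".getBytes(StandardCharsets.UTF_8));
            out.flush();   // typically a no-op, or a flush to a local buffer file only
            // At this point the object is not guaranteed to exist in the bucket.
            out.close();   // the upload completes here; only now is the data durable in S3
        }
    }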

An Object Store May Display Eventual Consistency

Object stores, such as Amazon S3, are often eventually consistent. Objects are replicated across servers for availability, but changes to a replica take time to propagate to the other replicas; the store is inconsistent during this process.

The situations when this may be visible include listing a directory immediately after objects have been added or deleted (new objects may not yet appear, and recently deleted ones may still be listed), and reading an object immediately after it has been overwritten (stale data may be returned).

The most common problem is that directory listings are not immediately updated. Using HDFS as the store for intermediate work prevents this from causing problems in chained queries.
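As an illustration (hypothetical bucket and paths), a direct probe of a newly written object will generally find it, while a listing of its parent "directory" on an eventually consistent store may still omit it for a short time:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StaleListingExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);

            Path dir = new Path("s3a://my-bucket/incoming/");
            Path newObject = new Path(dir, "part-0042");

            fs.create(newObject).close();   // write an empty object

            // A direct check of the object generally succeeds...
            System.out.println("exists: " + fs.exists(newObject));

            // ...but the listing of the parent "directory" may not yet include it.
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.println("listed: " + status.getPath());
            }
        }
    }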