Data Storage on Amazon S3
The standard mechanism to store, retrieve, and share large quantities of data in AWS is the Amazon S3 object store. We recommend using it to keep all data that must outlive your workload clusters.
Features of Amazon S3
Amazon S3 is suitable for storing long-term data that can outlive workload clusters. Other features of Amazon S3 include:
- Object store model for storing, listing, and retrieving data.
- Support for objects up to 5 TB, with many petabytes of data allowed in a single bucket.
- Readable and writeable from Apache Hadoop, Apache Spark, Apache Hive, and related applications.
- Readable and writeable from other applications.
- Available in all AWS regions where Hortonworks Data Cloud is offered.
Because data stored in Amazon S3 can be shared with external applications, it is an effective means to collect and publish data across applications.
Common Use Cases
Common use cases for Amazon S3 with workload clusters include:
- Publishing generated data for use within an organization.
- Collecting data for later analysis inside a workload cluster.
- Storing data which is needed to outlast a workload cluster.
- Copying data to a different region for use in workload clusters in that region.
- Storing data in Amazon S3 as a first step toward backing it up with Amazon Glacier.
Data can be collected for analysis even while no workload clusters are active. Similarly, Hadoop applications can publish data to Amazon S3 for access after the cluster is terminated. When needed, you can copy data between HDFS and Amazon S3, as the sketch below shows.
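The copy can be performed with the standard Hadoop filesystem API. Below is a minimal sketch, assuming the hadoop-common and hadoop-aws client libraries are on the classpath and S3A credentials are already configured; the bucket name and paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsToS3Copy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Path src = new Path("hdfs:///user/hadoop/results/part-00000"); // placeholder path
    Path dest = new Path("s3a://my-bucket/published/part-00000");  // placeholder bucket

    // Each path resolves to its own filesystem client: HDFS and S3A respectively.
    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem destFs = dest.getFileSystem(conf);

    // Copy the file; 'false' keeps the source, 'true' overwrites the destination.
    FileUtil.copy(srcFs, src, destFs, dest, false, true, conf);
  }
}
```

For bulk copies of large directory trees, the same task is typically run with Hadoop's DistCp tool rather than file by file.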
Before working with Amazon S3, you should get familiar with the following core concepts:
- Data is stored in S3 in buckets.
- Buckets are stored in specific AWS regions.
- Access to a bucket can be restricted to particular users or IAM roles.
- Data stored in an S3 bucket is billed based on the amount of data and on how long it is stored. In addition, you are billed when you transfer data between regions.
- Data downloaded from an S3 bucket to a destination outside the region in which the bucket is hosted (for example, an HDCloud workload cluster in a different region, or anywhere else on the internet) is billed per MB.
- Data transfers between an S3 bucket and an HDCloud workload cluster running in the same region are free of download charges (except in the special case of Requester Pays buckets, where data is served at the requester's expense).
- In a bucket, data is stored as "objects", known colloquially as "blobs".
- The Hadoop client to S3, called "S3A", makes the contents of a bucket appear like a filesystem, with directories, files in the directories, and operations on directories and files.
- As a result, applications that can work with data stored in HDFS can also work with data stored in S3 (see the listing sketch after this list).
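As a brief illustration of this filesystem view, the following sketch lists a directory in a bucket through the standard Hadoop FileSystem API. It assumes hadoop-aws is on the classpath and credentials are configured; the bucket name and prefix are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListS3Directory {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Bind a FileSystem client to the bucket; S3A presents it as a filesystem root.
    FileSystem s3 = FileSystem.get(new URI("s3a://my-bucket/"), conf);

    // Objects under the "datasets/" prefix appear as files within a directory.
    for (FileStatus status : s3.listStatus(new Path("/datasets"))) {
      System.out.printf("%s\t%d bytes%n", status.getPath(), status.getLen());
    }
  }
}
```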
For general information about Amazon S3, refer to the Amazon S3 documentation.
Best Practices for S3 Security
- Have separate accounts for different users and, ideally, for different long-running applications in a cluster.
- Your AWS account administrator should issue temporary AWS credentials to users and applications.
- When creating clusters on EC2, use IAM role-based authentication; this is automatically supported by S3A.
- When creating clusters outside of EC2, save the secrets in credential files on HDFS, configured to be accessible to the application but not to other users (see the configuration sketch after this list).
- Your AWS account administrator should restrict the access rights of users to only those buckets containing data to which they should have read and/or write access.
- Audit the access rights of buckets regularly. For example, from time to time, verify that you cannot read buckets without credentials.
- Have a plan for dealing with lost secrets: a contact number for developers and users to call, a plan for emergency credential rotation, and a checklist of other actions (killing all VMs and spot-instance reservations for that account, checking audit logs, and so on).
- If audit logs are required, use S3 bucket logging. Note that the log format includes the User-Agent header of HTTP requests, as well as the user. The header can be customized through the option "fs.s3a.user.agent.prefix"; if different applications use different values, the audit logs can identify the specific application behind each request. The configuration sketch after this list sets this option.
- Developers should install git-secrets to avoid accidentally committing any secrets into git repositories.
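Two of the practices above can be expressed directly in Hadoop configuration: pointing S3A at a credential file instead of embedding secrets, and tagging requests for audit logs. The following sketch assumes a JCEKS credential file has already been created (for example, with the hadoop credential command) at the placeholder path shown; hadoop.security.credential.provider.path is the standard Hadoop property for locating such files, and the application name is a placeholder.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SecureS3Config {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Resolve S3A secrets from a credential file on HDFS rather than
    // embedding them in configuration files or source code.
    conf.set("hadoop.security.credential.provider.path",
        "jceks://hdfs/user/hadoop/s3.jceks"); // placeholder location

    // Tag requests so S3 bucket logs can attribute them to this application.
    conf.set("fs.s3a.user.agent.prefix", "nightly-etl"); // placeholder name

    FileSystem s3 = FileSystem.get(new URI("s3a://my-bucket/"), conf);
    System.out.println("Connected to " + s3.getUri());
  }
}
```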
Limitations of Amazon S3
Even though Hadoop's S3A client can make an S3 bucket appear to be a Hadoop-compatible filesystem, it is still an object store, and has some limitations. The key things to be aware of are:
- Operations on directories are potentially slow.
- Not all file operations are supported. In particular, some file operations needed by Apache HBase are not available — so HBase cannot be run on top of Amazon S3.
- Neither the per-file/per-directory permissions supported by HDFS nor its more sophisticated ACL mechanism is supported.
- Bandwidth between your workload clusters and Amazon S3 is limited and can vary significantly depending on network and VM load.
For these reasons, while Amazon S3 can be used as the source and store for persistent data, it cannot be used as a direct replacement for a cluster-wide filesystem such as HDFS. This is important to understand, because the fact that S3 is accessed through the same APIs can be misleading.
Directory Operations May Be Slow and Nonatomic
Directory rename and delete may be performed as a series of operations on the client. In particular:
- delete(path, recursive=true) may be implemented as "list the objects, then delete them singly or in batches".
- rename(source, dest) may be implemented as "copy each object to the destination, then delete the originals".
This may have the following consequences:
- The time to delete a directory depends on the number of files in the directory.
- Directory deletion may fail part way through, leaving a partially deleted directory.
- Rename operations may fail part way through, leaving the status of the filesystem "undefined".
- The time to rename files and directories increases with the amount of data to rename.
- Recursive directory listing can be very slow. This can slow down some parts of job submission and execution.
Because of these behaviors, committing work by renaming directories is neither efficient nor reliable.
Data Is Not Written Until the OutputStream's close() Operation Completes
Data written to an object store is often buffered to a local file or stored in memory until one of the following conditions is met: the output stream's close() operation is invoked, or (where supported and enabled) there is enough data to create a partition in a multipart upload. Calls to OutputStream.flush() are usually a no-op, or are at most limited to flushing any local buffer file. This has the following consequences (illustrated in the sketch after this list):
- Data is not visible in the object store until the entire output stream has been written.
- If the operation of writing the data does not complete, no data is saved to the object store. This includes transient network failures as well as failures of the process itself.
- There may not be an entry in the object store for the file (even a zero-byte one) until the write is complete; hence there is no indication that a file is being written.
- The time to close a file usually depends on the file size and the bandwidth.
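The following sketch illustrates this visibility behavior through the Hadoop filesystem API: nothing is observable in the store until close() returns. The bucket and path are placeholders, and the exact behavior before close() may vary with the S3A version and buffering options.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3WriteVisibility {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem s3 = FileSystem.get(new URI("s3a://my-bucket/"), conf);
    Path path = new Path("/reports/summary.txt"); // placeholder path

    FSDataOutputStream out = s3.create(path, true);
    out.writeBytes("report body\n");
    out.flush(); // at most flushes a local buffer; nothing is published to S3 yet

    System.out.println("Visible before close? " + s3.exists(path)); // typically false

    out.close(); // the upload happens here; close() time grows with file size

    System.out.println("Visible after close? " + s3.exists(path)); // true
  }
}
```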
An Object Store May Display Eventual Consistency
Object stores, such as Amazon S3, are often eventually consistent. Objects are replicated across servers for availability, but changes to a replica take time to propagate to the other replicas; the store is inconsistent during this process.
The situations when this may be visible include:
- When listing a directory: newly created files may not yet be visible, and deleted ones may still be listed.
- After updating an object: Opening and reading the object may still return the previous data.
- After deleting an object: opening the object may still succeed, returning the old data.
- While reading an object: the read may fail or return stale data if the object is updated or deleted while the read is in progress.
The most common problem is that directory listings are not immediately updated. Using HDFS as the store for intermediate work prevents this from causing problems in chained queries.
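When an application must consume output directly from S3, one defensive option is to wait for the expected object to become visible before proceeding. Below is a minimal sketch of such a bounded retry; the path, attempt count, and delay are illustrative placeholders, not tuned recommendations.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WaitForObject {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem s3 = FileSystem.get(new URI("s3a://my-bucket/"), conf);
    Path expected = new Path("/output/_SUCCESS"); // placeholder marker file

    // Poll a few times to ride out a lagging listing, but give up after a
    // bounded number of attempts rather than spinning forever.
    boolean found = false;
    for (int attempt = 0; attempt < 5 && !found; attempt++) {
      found = s3.exists(expected);
      if (!found) {
        Thread.sleep(2000); // back off before checking again
      }
    }
    System.out.println(found ? "Object visible" : "Object still not visible");
  }
}
```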