Authenticating with Amazon S3
In HDCloud for AWS, Amazon S3 authentication is set up automatically at cluster creation time: by default, the "Instance Role" option creates a new AWS role that grants role-based access to Amazon S3. This allows you to access S3 buckets that are part of the AWS account in which HDCloud is running. Note that this option does not give you access to buckets outside your AWS account.
Furthermore, no credentials are required for reading public S3 buckets; Hadoop will attempt to read these using any configured credentials, and it can be configured to explicitly request anonymous access, in which case no credentials need be supplied.
With the exception of the two cases described above, to interact with Amazon Web Services, applications need the access key (which is effectively an ID) and the secret key (which is effectively a password). These can belong to an individual or be issued by an organization with group accounts.
Temporary security credentials, also known as "session credentials", can be issued. These consist of a secret key with a limited lifespan, along with a session token, another secret which must be known and used alongside the access key. The secret key is never passed to AWS services directly. Instead it is used to sign the URL and headers of the HTTP request.
By default, the S3A filesystem client uses the following authentication chain:
- If login details were provided in the filesystem URI, a warning is printed and the username and password are extracted and used as the AWS access key and secret key respectively.
- The fs.s3a.access.key and fs.s3a.secret.key properties are looked for in the Hadoop XML configuration.
- The AWS environment variables are then looked for.
- Finally, an attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs.
For Apache Hadoop applications to be able to interact with Amazon S3, they must know the access key and the secret key. This can be achieved in three different ways: through configuration properties, environment variables, or instance metadata.
The following authentication configuration allows you to access all the buckets to which a single account has access. If you are trying to configure multiple buckets belonging to different accounts, refer to Supporting Different Authentication Credentials in Different Buckets.
Using Instance Metadata to Authenticate with AWS Services
Virtual machines deployed in EC2 environments can query the EC2 Instance Metadata Service for an AWS key, secret, and a session token. This includes HDCloud Virtual Machines deployed within AWS.
If all three credentials are provided, they are used for session authentication. If only the key and secret are present, then the authentication will be done with the non-session based secrets.
This querying of the EC2 "IAM" service for credentials is attempted automatically, after the other attempts at authenticating (URL, configuration and environment variables) fail.
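If you want a client to rely solely on instance metadata, one minimal sketch, assuming the provider-list property described later in this guide, is to restrict fs.s3a.aws.credentials.provider to the AWS SDK's EC2 metadata provider:

<!-- Sketch: authenticate only through the EC2 Instance Metadata Service -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
</property>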
Using Configuration Properties to Authenticate
When attempting to use s3a URLs outside HDCloud, credentials need to be provided by some other means. This can be done by explicitly declaring the credentials in a configuration file, such as core-site.xml:
<property>
  <name>fs.s3a.access.key</name>
  <value>ACCESS-KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>SECRET-KEY</value>
</property>
This is the de-facto standard authentication mechanism in Hadoop applications which are not running in Amazon EC2 VMs.
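The same two properties can also be supplied for a single command with the generic -D option, as in this sketch (the bucket name mybucket is a placeholder). Be aware that secrets passed on the command line may be visible to other users on the same host, so this approach is best reserved for testing:

hadoop fs \
  -D fs.s3a.access.key=ACCESS-KEY \
  -D fs.s3a.secret.key=SECRET-KEY \
  -ls s3a://mybucket/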
If using AWS session credentials for authentication, the secret key must be that of the session, and the fs.s3a.session.token option must be set to the session token:
<property>
  <name>fs.s3a.session.token</name>
  <value>Short-lived-session-token</value>
</property>
Using Environment Variables to Authenticate
The AWS CLI supports authentication through environment variables. These same environment variables are used by Hadoop if no configuration properties are set:
- AWS_ACCESS_KEY_ID: Access key
- AWS_SECRET_ACCESS_KEY: Secret key
- AWS_SESSION_TOKEN: Session token (if using session authentication)
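As a sketch, assuming a POSIX shell and a placeholder bucket name, the variables can be exported before running a Hadoop command:

# Standard AWS environment variables used for S3A authentication
export AWS_ACCESS_KEY_ID=ACCESS-KEY
export AWS_SECRET_ACCESS_KEY=SECRET-KEY
# Only needed when using temporary session credentials:
export AWS_SESSION_TOKEN=SESSION-TOKEN
hadoop fs -ls s3a://mybucket/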
Supporting Different Authentication Credentials in Different Buckets
S3A supports per-bucket configuration; this can be used to declare different authentication credentials and mechanisms for different buckets.
For example, a bucket s3a://nightly/ used for nightly data can be configured to use a session key:
<property>
  <name>fs.s3a.bucket.nightly.access.key</name>
  <value>AKAACCESSKEY-2</value>
</property>
<property>
  <name>fs.s3a.bucket.nightly.secret.key</name>
  <value>SESSIONSECRETKEY</value>
</property>
<property>
  <name>fs.s3a.bucket.nightly.session.token</name>
  <value>Short-lived-session-token</value>
</property>
This technique is useful for working with external sources of data, or when copying data between buckets belonging to different accounts.
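As another sketch, a bucket belonging to a different account could be given its own long-lived credentials; the bucket name external-data and the key values below are placeholders:

<!-- Sketch: separate credentials for a hypothetical bucket named "external-data" -->
<property>
  <name>fs.s3a.bucket.external-data.access.key</name>
  <value>AKAACCESSKEY-3</value>
</property>
<property>
  <name>fs.s3a.bucket.external-data.secret.key</name>
  <value>SECRETKEY-3</value>
</property>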
For more details on per-bucket configuration, refer to Configuring S3A.
Embedding Credentials in the URL
Embedding credentials in the URL is dangerous and deprecated. Use per-bucket configuration options instead.
Hadoop supports embedding credentials within the S3 URL:
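The format, shown here with placeholder keys and a placeholder bucket name, is:

s3a://ACCESS-KEY:SECRET-KEY@mybucket/path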
In general, we strongly discourage using this mechanism, as it invariably results in the secret credentials being logged in many places in the cluster. However, embedding credentials in the URL is sometimes useful when troubleshooting authentication problems: consult the S3 troubleshooting documentation for details.
Due to the security risk it represents, future versions of Hadoop may remove this feature entirely.
Before S3A supported per-bucket credentials, this was the sole mechanism for supporting different credentials for different buckets. Now that buckets can be individually configured, this mechanism should no longer be needed.
Defining Authentication Providers
S3A can be configured to obtain client authentication providers from classes which integrate with the AWS SDK by implementing the com.amazonaws.auth.AWSCredentialsProvider interface. This is done by listing the implementation classes, in order of preference, in the configuration option fs.s3a.aws.credentials.provider.
AWS credential providers are distinct from Hadoop credential providers. Hadoop credential providers allow passwords and other secrets to be stored and transferred more securely than in XML configuration files. AWS credential providers are classes which can be used by the Amazon AWS SDK to obtain an AWS login from a different source in the system, including environment variables, JVM properties, and configuration files.
There are a number of AWS credential provider classes specified in the hadoop-aws JAR:
- org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider: Standard credential support through configuration properties
- org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider: Session credentials
- org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider: Anonymous login
There are also many AWS credential provider classes specified in the Amazon JARs. In particular, there are two which are commonly used:
- com.amazonaws.auth.EnvironmentVariableCredentialsProvider: AWS Environment Variables
- com.amazonaws.auth.InstanceProfileCredentialsProvider: EC2 Metadata Credentials
The order in which credential providers are listed in the configuration option fs.s3a.aws.credentials.provider defines the order in which they are evaluated. The standard authentication mechanism for Hadoop S3A is essentially the following list of providers:
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>
    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
    com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
    com.amazonaws.auth.InstanceProfileCredentialsProvider
  </value>
</property>
- org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider does not support in-URL authentication.
- Retrieving credentials with the InstanceProfileCredentialsProvider is a slower operation than looking up configuration options or environment variables, so it is best listed after all other authentication providers, excluding the AnonymousAWSCredentialsProvider, which must come last.
Using Temporary Session Credentials
Temporary Security Credentials can be obtained from the AWS Security Token Service. These credentials consist of an access key, a secret key, and a session token.
To authenticate with these credentials:
- Declare org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider.
- Set the session token in the property fs.s3a.session.token, and set the access and secret key properties to those of this temporary session.
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION-ACCESS-KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION-SECRET-KEY</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>SECRET-SESSION-TOKEN</value>
</property>
The lifetime of session credentials is determined when the credentials are issued; once they expire the application will no longer be able to authenticate to AWS.
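As a sketch of how such credentials might be obtained, assuming the AWS CLI is installed and configured, the Security Token Service can issue a short-lived access key, secret key, and session token; the duration below is only an example:

# Request temporary session credentials valid for one hour
aws sts get-session-token --duration-seconds 3600

The returned values can then be placed in the fs.s3a.access.key, fs.s3a.secret.key, and fs.s3a.session.token properties shown above.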
Using Anonymous Login
The org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows anonymous access to a publicly accessible Amazon S3 bucket without any credentials.
It can be useful for accessing public data sets without requiring AWS credentials.
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
Once this is done, there's no need to supply any credentials in the Hadoop configuration or via environment variables.
This option can be used to verify that an object store does not permit unauthenticated access; that is, if an attempt to list a bucket is made using the anonymous credentials, it should fail — unless explicitly opened up for broader access.
hadoop fs -ls \
  -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
  s3a://landsat-pds/
S3A may be configured to always access specific buckets anonymously, such as this public landsat-pds bucket:
<property>
  <name>fs.s3a.bucket.landsat-pds.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
Allowing anonymous access to an Amazon S3 bucket compromises security and therefore is unsuitable for most use cases.
If a list of credential providers is given in fs.s3a.aws.credentials.provider, then the anonymous credential provider must come last. If not, credential providers listed after it will be ignored.
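As a sketch, a provider list that first tries configured access and secret keys and then falls back to anonymous access would therefore place the anonymous provider last:

<!-- Sketch: configured credentials first, anonymous access as the final fallback -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>
    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
    org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
  </value>
</property>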
Keeping Your Amazon S3 Credentials Secret
The Hadoop credential provider framework allows secure credential providers to keep secrets outside Hadoop configuration files, storing them in encrypted files in local or Hadoop filesystems, and including them in requests.
The S3A configuration options with sensitive data (fs.s3a.access.key, fs.s3a.secret.key, and fs.s3a.session.token) can have their data saved to a binary file, with the values being read in when the S3A filesystem URL is used for data access. The reference to this credential provider is all that is passed as a direct configuration option.
For additional reading on the Hadoop credential provider API, refer to Credential Provider API.
Create a Credential File
You can create a credential file on any Hadoop filesystem. When you create one on HDFS or a UNIX filesystem, the permissions are automatically set to keep the file private to the reader — though as directory permissions are not touched, you should verify that the directory containing the file is readable only by the current user. For example:
hadoop credential create fs.s3a.access.key -value 123 \
  -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
hadoop credential create fs.s3a.secret.key -value 456 \
  -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
After creating the credential file, you can list it to see what entries are kept inside it. For example:
hadoop credential list -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
Listing aliases for CredentialProvider: jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
fs.s3a.secret.key
fs.s3a.access.key
At this point, the credentials are ready for use.
Configure the Hadoop Security Credential Provider Path Property
The URL to the provider must be set in the configuration property hadoop.security.credential.provider.path, either on the command line or in XML configuration files:
<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks</value>
  <description>Path to interrogate for protected credentials.</description>
</property>
Because this property only supplies the path to the secrets file, the configuration option itself is no longer a sensitive item.
The path to the provider can also be set on the command line. For example:
hadoop distcp \
  -D hadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
  hdfs://nn1.example.com:9001/user/backup/007020615 s3a://glacier1/

hadoop fs \
  -D hadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
  -ls s3a://glacier1/
Because the provider path is not itself a sensitive secret, there is no risk from placing its declaration on the command line.
Once the provider is set in the Hadoop configuration, hadoop commands work exactly as if the secrets were in an XML file. For example:
hadoop distcp hdfs://nn1.example.com:9001/user/backup/007020615 s3a://glacier1/
hadoop fs -ls s3a://glacier1/
Customizing S3A Secrets Held in Credential Files
Although most properties are automatically propagated from their fs.s3a.bucket.-prefixed custom entry to the corresponding base fs.s3a. option, supporting secrets kept in Hadoop credential files is slightly more complex. This is because the property values are kept in these files and cannot be dynamically patched. Instead, callers need to create different configuration files for each bucket, setting the base secrets (such as fs.s3a.access.key and fs.s3a.secret.key), then declare the path to the appropriate credential file in a bucket-specific version of the credential provider path property.
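As a sketch, assuming the nightly bucket from the earlier example needs its own secrets, a separate credential file (the nightly.jceks path is a placeholder) could hold them, with its path then declared in the bucket-specific provider path option described above:

# Create a credential file holding the secrets for the nightly bucket only
hadoop credential create fs.s3a.access.key -value NIGHTLY-ACCESS-KEY \
  -provider jceks://hdfs@nn1.example.com:9001/user/backup/nightly.jceks
hadoop credential create fs.s3a.secret.key -value NIGHTLY-SECRET-KEY \
  -provider jceks://hdfs@nn1.example.com:9001/user/backup/nightly.jceks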
Refer to Troubleshooting Amazon S3 for information on troubleshooting authentication failures.
To make sure that you are following best Amazon S3 security practices, refer to Security Best Practices and Checklist.