Authenticating with Amazon S3

In HDCloud for AWS, Amazon S3 authentication is set up automatically at cluster creation time: by default, the "Instance Role" option creates a new AWS role that grants role-based access to Amazon S3. This allows you to access S3 buckets that belong to the AWS account in which HDCloud is running; it does not give you access to buckets belonging to other AWS accounts.

Furthermore, no credentials are required for reading public S3 buckets. By default, Hadoop will attempt to read such buckets using whatever credentials are configured; it can instead be configured to explicitly request anonymous access, in which case no credentials need to be supplied.

Apart from the two cases described above, applications need the access key (which is effectively an ID) and the secret key (which is effectively a password) to interact with Amazon Web Services. These can belong to an individual or be issued by an organization with group accounts.

Temporary security credentials, also known as "session credentials", can be issued. These consist of a secret key with a limited lifespan, along with a session token, another secret which must be known and used alongside the access key. The secret key is never passed to AWS services directly. Instead it is used to sign the URL and headers of the HTTP request.

Configuring Authentication

By default, the S3A filesystem client uses the following authentication chain (a quick way to exercise the chain from the command line is shown after the list):

  1. If login details were provided in the filesystem URI, a warning is printed and then the username and password are extracted for the AWS key and secret respectively.

  2. The fs.s3a.access.key and fs.s3a.secret.key properties are looked for in the Hadoop XML configuration.

  3. The AWS environment variables are then looked for.

  4. An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs.
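
Any S3A access exercises this chain. As a quick sanity check, a plain listing command succeeds only if one of the four sources above supplies valid credentials; otherwise it fails with an authentication error (the bucket name below is a placeholder):

hadoop fs -ls s3a://my-bucket/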

For Apache Hadoop applications to be able to interact with Amazon S3, they must know the access key and the secret key. This can be achieved in three different ways: through configuration properties, environment variables, or instance metadata.

Note

The following authentication configuration allows you to access all the buckets to which a single account has access. If you are trying to configure multiple buckets belonging to different accounts, refer to Supporting Different Authentication Credentials in Different Buckets.

Using Instance Metadata to Authenticate with AWS Services

Virtual machines deployed in EC2 environments can query the EC2 Instance Metadata Service for an AWS key, secret, and a session token. This includes HDCloud Virtual Machines deployed within AWS.

If all three credentials are provided, they are used for session authentication. If only the key and secret are present, then the authentication will be done with the non-session-based secrets.

This querying of the EC2 Instance Metadata Service for IAM credentials is attempted automatically, after the other attempts to authenticate (URL, configuration properties, and environment variables) have failed.
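
Outside Hadoop, you can confirm that the VM actually has an instance profile with credentials by querying the metadata service directly from the EC2 instance. This uses the standard EC2 metadata endpoint rather than anything S3A-specific, and ROLE-NAME is a placeholder for whatever role the first command returns:

curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/ROLE-NAME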

Using Configuration Properties to Authenticate

When using s3a URLs outside HDCloud, credentials must be provided by some other means.

This can be done by explicitly declaring the credentials in a configuration file such as core-site.xml:

<property>
  <name>fs.s3a.access.key</name>
  <value>ACCESS-KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>SECRET-KEY</value>
</property>

This is the de facto standard authentication mechanism for Hadoop applications that are not running in Amazon EC2 VMs.
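
Once these properties are in core-site.xml, ordinary Hadoop commands pick them up whenever an s3a:// URL is used. For example, assuming a bucket the account can write to (the bucket and local path below are placeholders):

hadoop fs -ls s3a://my-bucket/
hadoop fs -copyFromLocal /tmp/dataset.csv s3a://my-bucket/datasets/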

If you are using AWS session credentials for authentication, the access and secret keys must be those of the session, and the fs.s3a.session.token option must be set to the session token:

<property>
  <name>fs.s3a.session.token</name>
  <value>Short-lived-session-token</value>
</property>

Using Environment Variables to Authenticate

AWS CLI supports authentication through environment variables. These same environment variables will be used by Hadoop if no configuration properties are set.

Environment Variable        Description
AWS_ACCESS_KEY_ID           Access key
AWS_SECRET_ACCESS_KEY       Secret key
AWS_SESSION_TOKEN           Session token (if using session authentication)
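
For example, the variables can be exported in the shell from which the Hadoop command is launched. The values and bucket name below are placeholders, and AWS_SESSION_TOKEN is only needed when session credentials are in use:

export AWS_ACCESS_KEY_ID=ACCESS-KEY
export AWS_SECRET_ACCESS_KEY=SECRET-KEY
export AWS_SESSION_TOKEN=SESSION-TOKEN
hadoop fs -ls s3a://my-bucket/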

Supporting Different Authentication Credentials in Different Buckets

S3A supports per-bucket configuration; this can be used to declare different authentication credentials and mechanisms for different buckets.

For example, a bucket s3a://nightly/ used for nightly data can be configured to use session credentials:

<property>
  <name>fs.s3a.bucket.nightly.access.key</name>
  <value>AKAACCESSKEY-2</value>
</property>

<property>
  <name>fs.s3a.bucket.nightly.secret.key</name>
  <value>SESSIONSECRETKEY</value>
</property>

<property>
  <name>fs.s3a.bucket.nightly.session.token</name>
  <value>Short-lived-session-token</value>
</property>

This technique is useful for working with external sources of data, or when copying data between buckets belonging to different accounts.
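
For example, data could be copied from an external bucket into the nightly bucket configured above in a single DistCp run; the source bucket and paths here are purely illustrative:

hadoop distcp s3a://external-datasets/logs/2017-05-01 s3a://nightly/logs/2017-05-01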

For more details on per-bucket configuration, refer to Configuring S3A.

Embedding Credentials in the URL

Important

Embedding credentials in the URL is dangerous and deprecated. Use per-bucket configuration options instead.

Hadoop supports embedding credentials within the S3 URL:

 s3a://key:secret@bucket-name/

In general, we strongly discourage using this mechanism, as it invariably results in the secret credentials being logged in many places in the cluster. However, embedding credentials in the URL is sometimes useful when troubleshooting authentication problems: consult the S3 troubleshooting documentation for details.

Due to the security risk it represents, future versions of Hadoop may remove this feature entirely.

Before S3A supported per-bucket credentials, this was the sole mechanism for supporting different credentials for different buckets. Now that buckets can be individually configured, this mechanism should no longer be needed.

Defining Authentication Providers

S3A can be configured to obtain client authentication providers from classes which integrate with the AWS SDK by implementing the com.amazonaws.auth.AWSCredentialsProvider interface. This is done by listing the implementation classes in the configuration option fs.s3a.aws.credentials.provider.

Note

AWS credential providers are distinct from Hadoop credential providers. Hadoop credential providers allow passwords and other secrets to be stored and transferred more securely than in XML configuration files. AWS credential providers are classes which can be used by the Amazon AWS SDK to obtain an AWS login from a different source in the system, including environment variables, JVM properties, and configuration files.

There are a number of AWS credential provider classes specified in the hadoop-aws JAR:

Classname                                                    Description
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider        Standard credential support through configuration properties
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider     Session authentication
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider     Anonymous login

There are also many AWS credential provider classes specified in the Amazon JARs. In particular, there are two which are commonly used:

Classname                                                    Description
com.amazonaws.auth.EnvironmentVariableCredentialsProvider    AWS Environment Variables
com.amazonaws.auth.InstanceProfileCredentialsProvider        EC2 Metadata Credentials

The order in which credential providers are listed in the configuration option fs.s3a.aws.credentials.provider defines the order in which they are evaluated. The standard S3A authentication mechanism is essentially the following list of providers:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>
  org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
  com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
  com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
</property>

Using Temporary Session Credentials

Temporary Security Credentials can be obtained from the AWS Security Token Service. These credentials consist of an access key, a secret key, and a session token.

To authenticate with these credentials:

  1. Declare org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider.
  2. Set the session key in the property fs.s3a.session.token, and set the access and secret key properties to those of this temporary session:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION-ACCESS-KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION-SECRET-KEY</value>
</property>

<property>
  <name>fs.s3a.session.token</name>
  <value>SECRET-SESSION-TOKEN</value>
</property>

The lifetime of session credentials is determined when the credentials are issued; once they expire the application will no longer be able to authenticate to AWS.
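
As an illustration, session credentials can be requested with the AWS CLI, assuming it is installed and configured with long-lived credentials; the AccessKeyId, SecretAccessKey, and SessionToken fields in the response correspond to the three properties above:

aws sts get-session-token --duration-seconds 3600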

Using Anonymous Login

Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows anonymous access to a publicly accessible Amazon S3 bucket without any credentials. It can be useful for accessing public data sets without requiring AWS credentials.

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>

Once this is done, there's no need to supply any credentials in the Hadoop configuration or via environment variables.

This option can be used to verify that an object store does not permit unauthenticated access; that is, if an attempt to list a bucket is made using the anonymous credentials, it should fail — unless explicitly opened up for broader access.

hadoop fs -ls \
  -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
  s3a://landsat-pds/

S3A may be configured to always access specific buckets anonymously. For example, the following configures anonymous access to the public landsat-pds bucket:

<property>
  <name>fs.s3a.bucket.landsat-pds.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>

Note that:

  1. Allowing anonymous access to an Amazon S3 bucket compromises security and therefore is unsuitable for most use cases.

  2. If a list of credential providers is given in fs.s3a.aws.credentials.provider, then the anonymous credential provider must come last. If not, credential providers listed after it will be ignored (see the example after this list).
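
For example, the following provider list falls back to anonymous access only after the configured properties and environment variables have been tried. This is a sketch; whether such a fallback is appropriate depends on your deployment:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>
  org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
  com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
  org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>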

Keeping Your Amazon S3 Credentials Secret

The Hadoop credential provider framework allows secure credential providers to keep secrets outside Hadoop configuration files, storing them in encrypted files in local or Hadoop filesystems, and including them in requests.

The S3A configuration options with sensitive data (fs.s3a.secret.key, fs.s3a.access.key, and fs.s3a.session.token) can have their data saved to a binary file, with the values being read in when the S3A filesystem URL is used for data access. The reference to this credential provider is all that is passed as a direct configuration option.

For additional reading on the Hadoop credential provider API, refer to Credential Provider API.

Create a Credential File

You can create a credential file on any Hadoop filesystem. When you create one on HDFS or a UNIX filesystem, the permissions are automatically set to keep the file private to the reader — though as directory permissions are not touched, you should verify that the directory containing the file is readable only by the current user. For example:

hadoop credential create fs.s3a.access.key -value 123 \
    -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks

hadoop credential create fs.s3a.secret.key -value 456 \
    -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks

After creating the credential file, you can list it to see what entries are kept inside it. For example:

hadoop credential list -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks

Listing aliases for CredentialProvider: jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
fs.s3a.secret.key
fs.s3a.access.key
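
If you are also using session credentials, the fs.s3a.session.token secret can be stored in the same credential file in exactly the same way (SESSION-TOKEN is a placeholder; the token would then appear as a third alias in the listing above):

hadoop credential create fs.s3a.session.token -value SESSION-TOKEN \
    -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks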

At this point, the credentials are ready for use.

Configure the Hadoop Security Credential Provider Path Property

The URL to the provider must be set in the configuration property hadoop.security.credential.provider.path, either on the command line or in XML configuration files:

<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks</value>
  <description>Path to interrogate for protected credentials.</description>
</property>

Because this property only supplies the path to the secrets file, the configuration option itself is no longer a sensitive item.

The path to the provider can also be set on the command line. For example:

hadoop distcp \
  -D hadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
  hdfs://nn1.example.com:9001/user/backup/007020615 s3a://glacier1/

hadoop fs \
  -D hadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
  -ls s3a://glacier1/

Because the provider path is not itself a sensitive secret, there is no risk from placing its declaration on the command line.

Once the provider is set in the Hadoop configuration, hadoop commands work exactly as if the secrets were in an XML file. For example:

hadoop distcp hdfs://nn1.example.com:9001/user/backup/007020615 s3a://glacier1/
hadoop fs -ls s3a://glacier1/

Customizing S3A Secrets Held in Credential Files

Although most properties are automatically propagated from their fs.s3a.bucket.-prefixed custom entry to the corresponding base fs.s3a. option, supporting secrets kept in Hadoop credential files is slightly more complex. This is because the property values are kept in those files and cannot be dynamically patched.

Instead, callers need to create a different configuration file for each bucket, setting the base secrets (fs.s3a.access.key, etc.), and then declare the path to the appropriate credential file in a bucket-specific version of the property fs.s3a.security.credential.provider.path.
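
For example, the nightly bucket from the earlier per-bucket example could be pointed at its own credential file; the keystore path below is a placeholder:

<property>
  <name>fs.s3a.bucket.nightly.security.credential.provider.path</name>
  <value>jceks://hdfs@nn1.example.com:9001/user/backup/s3-nightly.jceks</value>
</property>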

More

Refer to Troubleshooting Amazon S3 for information on troubleshooting authentication failures.

To make sure that you are following Amazon S3 security best practices, refer to Security Best Practices and Checklist.