Using the S3A FileSystem Client

Hortonworks Data Cloud supports the Apache Hadoop S3A client. S3A is a filesystem client connector used by Hadoop to read and write data in Amazon S3 or a compatible service. The S3A filesystem client uses Amazon's libraries to interact with Amazon S3 and uses the URI prefix s3a://.

S3A is backward compatible with its predecessor, S3N (recognized by the s3n:// prefix in its URLs), which shipped with earlier versions of Hadoop. Replacing the s3n:// prefix with s3a:// in URLs is sufficient to use the S3A connector in place of S3N.

S3A is implemented in hadoop-aws.jar. This library and its dependencies are automatically placed on the classpath.
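
If you want to confirm that the connector and its AWS SDK dependencies are present on a node, one way (not part of the original deployment steps; the grep pattern is illustrative) is to inspect the expanded Hadoop classpath:

# List the expanded classpath and filter for the S3A connector and AWS SDK JARs
hadoop classpath --glob | tr ':' '\n' | grep -E 'hadoop-aws|aws-java-sdk'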

Important

The Amazon JARs have proven very brittle: the version of the Amazon libraries must match the versions against which the Hadoop binaries were built.

Hadoop FileSystem Shell Commands

Many of the standard Hadoop FileSystem shell commands that interact with HDFS can also be used to interact with S3A.

Accessing S3A

By default, the Hadoop FileSystem shell assumes invocation against the cluster's default filesystem, which is defined in the configuration property fs.defaultFS in core-site.xml. For HDP clusters on AWS, the default filesystem is the deployed HDFS instance.
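
For example, a cluster whose default filesystem is HDFS typically carries an entry along these lines in core-site.xml (the host name and port shown here are placeholders):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>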

To access S3A instead of HDFS:

  1. SSH to any node in the cluster.

  2. When running commands, provide a fully qualified URI with the s3a scheme and the bucket in the authority. For example, the following command lists all files in a directory called "dir1", which resides in a bucket called "bucket1":

hadoop fs -ls s3a://bucket1/dir1

The Hadoop FileSystem shell uses the configured AWS credentials to access the S3 bucket. For further discussion of credential configuration and for additional examples of the Hadoop FileSystem shell invocation, refer to Amazon S3 Security.
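
As a rough illustration only (credential configuration is covered in Amazon S3 Security, and placing secrets on the command line is generally discouraged), the access and secret keys can also be supplied per invocation through generic -D options:

hadoop fs -D fs.s3a.access.key=<your-access-key> -D fs.s3a.secret.key=<your-secret-key> -ls s3a://bucket1/dir1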

Command Structure

The Hadoop FileSystem shell commands use the following syntax:

hadoop fs -<operation> s3a://<bucket>/<path>

where <operation> is the shell operation to perform, <bucket> is the name of the S3 bucket, and <path> is the path to the target object or directory within the bucket.

Command Examples

You can use the Hadoop FileSystem shell to list directories, create files, delete files, and so on.

You can create directories, and create or copy files into them. For example:

# Create a directory
hadoop fs -mkdir s3a://bucket1/datasets/

# Upload a file from the cluster filesystem
hadoop fs -put /datasets/example.orc s3a://bucket1/datasets/

# Touch a file
hadoop fs -touchz s3a://bucket1/datasets/touched

You can download and view objects. For example:

# Copy a directory to the local filesystem
hadoop fs -copyToLocal s3a://bucket1/datasets/

# Copy a file from the object store to the local filesystem
hadoop fs -get s3a://bucket1/hello.txt /examples

# Print the object
hadoop fs -cat s3a://bucket1/hello.txt

# Print the object, unzipping it if necessary
hadoop fs -text s3a://bucket1/hello.txt

# Download log files into a local file
hadoop fs -getmerge s3a://bucket1/logs\* log.txt

Commands That May Be Slower

Some commands tend to be significantly slower with Amazon S3 than when invoked against HDFS or other filesystems. This includes renaming files, listing files, find, mv, cp, and rm.

Renaming Files

Unlike in a normal filesystem, renaming files and directories in an object store usually takes time proportional to the size of the objects being manipulated. Because many of the filesystem shell operations use renaming as their final stage, skipping that stage can avoid long delays.

In particular, we recommend that when using the put and copyFromLocal commands, you set the -d option for a direct upload. For example:

# Upload a file from the cluster filesystem
hadoop fs -put -d /datasets/example.orc s3a://bucket1/datasets/

# Upload a file from the local filesystem
hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket1/datasets/

# Create a file from stdin
echo "hello" | hadoop fs -put -d -f - s3a://bucket1/datasets/hello.txt

Listing Files

Commands which list many files tend to be significantly slower with Amazon S3 than when invoked against HDFS or other filesystems. For example:

hadoop fs -count s3a://bucket1/
hadoop fs -du s3a://bucket1/

Other slow commands include find, mv, cp and rm.

Find

The find command can be very slow on a large store with many directories under the path supplied.

# Enumerate all files in the bucket
hadoop fs -find s3a://bucket1/ -print

# List *.txt in the bucket.
# Remember to escape the wildcard to stop the bash shell trying to expand it
hadoop fs -find s3a://bucket1/datasets/ -name \*.txt -print

Rename

The time to rename a file depends on its size. The time to rename a directory depends on the number and size of all files beneath that directory. If the operation is interrupted, the object store will be in an undefined state.

hadoop fs -mv s3a://bucket1/datasets s3a://bucket1/historical

Copy

The copy operation reads each file and then writes it back to the object store; the time to complete depends on the amount of data to copy, and on the bandwidth in both directions between the local computer and the object store.

hadoop fs -cp s3a://bucket1/datasets s3a://bucket1/historical 

Note

The further the computer is from the object store, the longer the copy process takes.

Refer to the Amazon S3 Performance section for further discussion of S3A filesystem semantics and its impact on performance.

Unsupported Subcommands

S3A does not implement the same feature set as HDFS. The following FileSystem shell subcommands are not supported with an S3A URI:

Learn More

This section only covers how selected Hadoop FileSystem shell commands behave when invoked against data in Amazon S3. Refer to the Apache documentation for more information on the Hadoop FileSystem shell commands.

Deleting Objects

The rm command deletes objects and directories full of objects. If the object store is eventually consistent, fs ls commands and other accessors may briefly return the details of the now-deleted objects; this is an artifact of object stores which cannot be avoided.
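
For example (the bucket and path are illustrative):

# Delete a directory tree of objects
hadoop fs -rm -r s3a://bucket1/datasets/

# An immediate listing may still show entries for the deleted objects
hadoop fs -ls s3a://bucket1/datasets/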

If the filesystem client is configured to copy files to a trash directory, the trash directory is in the bucket. The rm operation then takes time proportional to the size of the data. Furthermore, the deleted files continue to incur storage costs.

To make sure that your deleted files are no longer incurring costs, you can do two things:

  1. Bypass the trash mechanism by passing the -skipTrash option to the rm command, so that objects are deleted immediately.

  2. Periodically purge any data that has already been moved to the trash directory by using the expunge command.
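
For example (the bucket and path are illustrative; because expunge acts on the default filesystem, the bucket is supplied here through a generic -D option):

# Delete objects immediately, bypassing the trash directory
hadoop fs -rm -r -skipTrash s3a://bucket1/datasets/

# Purge data previously moved to the trash directory in the bucket
hadoop fs -D fs.defaultFS=s3a://bucket1/ -expunge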

Overwriting Objects

Amazon S3 is eventually consistent, which means that an operation which overwrites existing objects may not be immediately visible to all clients/queries. As a result, later operations which query the same object's status or contents may get the previous object; this can sometimes surface within the same client, while reading a single object.

Avoid sequences of commands which overwrite objects and then immediately work on the updated data; there is a risk that the previous data will be used instead.
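
As an illustration of the pattern to avoid (the object name is hypothetical):

# Overwrite an existing object from stdin
echo "version 2" | hadoop fs -put -f - s3a://bucket1/datasets/status.txt

# Reading it back immediately may still return the previous contents
hadoop fs -cat s3a://bucket1/datasets/status.txt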

Timestamps

Timestamps of objects and directories in Amazon S3 do not follow the behavior of files and directories in HDFS: an object's timestamp is set when the object is written to the store (at the end of the upload, not the start), its modification time is the same as its creation time, and directories do not have modification times of their own.
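
To see the timestamp that S3A reports for an object, you can use the stat subcommand (the object name is illustrative); the %y format prints the reported modification time:

hadoop fs -stat "%y" s3a://bucket1/datasets/example.orc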

For details on how these characteristics may affect the distcp -update operation, refer to Copying Data Between a Cluster and Amazon S3 documentation.

Security Model and Operations

The security and permissions model of Amazon S3 is very different from that of a UNIX-style filesystem: on Amazon S3, operations which query or manipulate permissions are generally unsupported. Operations to which this applies include: chgrp, chmod, chown, getfacl, and setfacl. The related attribute commands getfattr and setfattr are also unavailable. In addition, operations which try to preserve permissions (for example fs -put -p) do not preserve permissions.

Although these operations are unsupported, filesystem commands which list permission and user/group details usually simulate these details. As a consequence, when interacting with read-only object stores, the permissions found in "list" and "stat" commands may indicate that the user has write access when in fact they do not.

Amazon S3 has a permissions model of its own, which can be manipulated through store-specific tooling. Be aware that some of the permissions which can be set, such as write-only paths or various permissions on the root path, may be incompatible with the S3A client: it expects full read and write access to the entire bucket when trying to write data, and may fail if it does not have these permissions.
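
As an illustration (assuming a bucket you can write to; the paths are hypothetical), a permission change issued through the shell is typically ignored by S3A rather than applied, and a subsequent listing still shows the simulated permissions:

# Attempt to restrict permissions on an object
hadoop fs -chmod 700 s3a://bucket1/datasets/example.orc

# The listing still reports the simulated (unchanged) permissions
hadoop fs -ls s3a://bucket1/datasets/example.orc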

Simulated Permissions

As an example of how permissions are simulated, here is a listing of Amazon's public, read-only bucket of Landsat images:

$ hadoop fs -ls s3a://landsat-pds/
Found 10 items
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/L8
-rw-rw-rw-   1 mapred      23764 2015-01-28 18:13 s3a://landsat-pds/index.html
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/landsat-pds_stats
-rw-rw-rw-   1 mapred        105 2016-08-19 18:12 s3a://landsat-pds/robots.txt
-rw-rw-rw-   1 mapred         38 2016-09-26 12:16 s3a://landsat-pds/run_info.json
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/runs
-rw-rw-rw-   1 mapred   27458808 2016-09-26 12:16 s3a://landsat-pds/scene_list.gz
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/tarq
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/tarq_corrupt
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/test

As you can see, every entry is reported with full read and write permissions and an apparent owner, even though the bucket is read-only.

When an attempt is made to delete one of the files, the operation fails — despite the permissions shown by the ls command:

$ hadoop fs -rm s3a://landsat-pds/scene_list.gz
rm: s3a://landsat-pds/scene_list.gz: delete on s3a://landsat-pds/scene_list.gz:
  com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3;
  Status Code: 403; Error Code: AccessDenied; Request ID: 1EF98D5957BCAB3D),
  S3 Extended Request ID: wi3veOXFuFqWBUCJgV3Z+NQVj9gWgZVdXlPU4KBbYMsw/gA+hyhRXcaQ+PogOsDgHh31HlTCebQ=

This demonstrates that the listed permissions cannot be taken as evidence of write access; only object manipulation can determine this.

User-Agent Customization

By default, S3A includes the current Hadoop version in the User-Agent string passed through the AWS SDK to the Amazon S3 service. You may also include optional additional information to identify your application by setting the configuration property fs.s3a.user.agent.prefix in core-site.xml or on the command line, as described in the property definition below:

<property>
  <name>fs.s3a.user.agent.prefix</name>
  <value></value>
  <description>
    Sets a custom value that will be prepended to the User-Agent header sent in
    HTTP requests to the S3 back-end by S3AFileSystem.  The User-Agent header
    always includes the Hadoop version number followed by a string generated by
    the AWS SDK.  An example is "User-Agent: Hadoop 2.8.0, aws-sdk-java/1.10.6".
    If this optional property is set, then its value is prepended to create a
    customized User-Agent.  For example, if this configuration property was set
    to "MyApp", then an example of the resulting User-Agent would be
    "User-Agent: MyApp, Hadoop 2.8.0, aws-sdk-java/1.10.6".
  </description>
</property>

The presence of "Hadoop" in the User-Agent identifies that the source of the call is a Hadoop application, running a specific version of Hadoop. Setting a custom prefix is optional, but it may assist AWS support with identifying traffic originating from a specific application.
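
For example, to set the prefix for a single shell invocation rather than in core-site.xml (the application name and bucket are illustrative):

hadoop fs -D fs.s3a.user.agent.prefix="MyApp" -ls s3a://bucket1/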