Using the S3A FileSystem Client
Hortonworks Data Cloud supports the Apache Hadoop S3A client. S3A is a filesystem client connector used by Hadoop to read and write data from Amazon S3 or a compatible service. The S3A filesystem uses Amazon's libraries to interact with Amazon S3, and uses the URI prefix s3a:// in URLs.
S3A is backward compatible with its predecessor S3N (recognized by the s3n:// prefix in URLs), which shipped with earlier versions of Hadoop. Replacing the s3n:// prefix in URLs with s3a:// is sufficient to use the S3A connector in place of S3N.
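For example, a dataset previously referenced through S3N can be accessed through S3A by changing only the scheme; the bucket and path below are illustrative:

# Old S3N-style URI
hadoop fs -ls s3n://bucket1/datasets/
# Equivalent S3A URI; only the scheme changes
hadoop fs -ls s3a://bucket1/datasets/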
S3A is implemented in hadoop-aws.jar. This library and its dependencies are automatically placed on the classpath.
The Amazon JARs have proven very brittle: the version of the Amazon libraries must match the versions against which the Hadoop binaries were built.
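One way to check which hadoop-aws and AWS SDK JARs are actually on the classpath is to expand the Hadoop classpath and filter for the AWS artifacts; this is only a diagnostic sketch, and the exact JAR names depend on your Hadoop build:

# Show the AWS-related JARs on the expanded Hadoop classpath
hadoop classpath --glob | tr ':' '\n' | grep -i aws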
Hadoop FileSystem Shell Commands
Many of the standard Hadoop FileSystem shell commands that interact with HDFS can also be used to interact with S3A.
By default, the Hadoop FileSystem shell assumes invocation against the cluster's default filesystem, which is defined by the fs.defaultFS configuration property in core-site.xml. For HDP clusters on AWS, the default filesystem is the deployed HDFS instance.
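If you are unsure which filesystem is the default on a given cluster, you can query the property directly; the value shown in the comment is only an example:

# Print the configured default filesystem (for example, hdfs://<namenode>:8020)
hdfs getconf -confKey fs.defaultFS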
To access S3A instead of HDFS:
- SSH to any node in the cluster.
- When running commands, provide a fully qualified URI with the s3a scheme and the bucket in the authority. For example, the following command lists all files in a directory called "dir1", which resides in a bucket called "bucket1":
hadoop fs -ls s3a://bucket1/dir1
The Hadoop FileSystem shell uses the configured AWS credentials to access the S3 bucket. For further discussion of credential configuration and for additional examples of the Hadoop FileSystem shell invocation, refer to Amazon S3 Security.
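If credentials have not been configured in core-site.xml, one option is to pass them for a single invocation with the generic -D option, as in the sketch below; the key values are placeholders, and keys supplied on the command line are visible in shell history and process listings:

# Supply AWS credentials for one command only (placeholder values)
hadoop fs -D fs.s3a.access.key=YOUR_ACCESS_KEY -D fs.s3a.secret.key=YOUR_SECRET_KEY -ls s3a://bucket1/dir1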
The Hadoop FileSystem shell commands use the following syntax:
hadoop fs -<operation> s3a://<bucket>/dir1
- hadoop fs indicates that we want to perform an operation using the Hadoop FileSystem shell
- <operation> indicates a particular action to be performed against a directory or a file
- s3a:// is the prefix needed to access Amazon S3
- <bucket> indicates a particular Amazon S3 bucket
You can use the Hadoop FileSystem shell to list directories, create files, delete files, and so on.
You can create directories, and create or copy files into them. For example:
# Create a directory
hadoop fs -mkdir s3a://bucket1/datasets/

# Upload a file from the cluster filesystem
hadoop fs -put /datasets/example.orc s3a://bucket1/datasets/

# Touch a file
hadoop fs -touchz s3a://bucket1/datasets/touched
You can download and view objects. For example:
# Copy a directory to the local filesystem
hadoop fs -copyToLocal s3a://bucket1/datasets/

# Copy a file from the object store to the local filesystem
hadoop fs -get s3a://bucket1/hello.txt /examples

# Print the object
hadoop fs -cat s3a://bucket1/hello.txt

# Print the object, unzipping it if necessary
hadoop fs -text s3a://bucket1/hello.txt

# Download log files into a local file
hadoop fs -getmerge s3a://bucket1/logs\* log.txt
Commands That May Be Slower
Some commands tend to be significantly slower with Amazon S3 than when invoked against HDFS or other filesystems. This includes renaming files, listing files, and the find, mv, cp, and rm operations discussed below.
Unlike in a normal filesystem, renaming files and directories in an object store usually takes time proportional to the size of the objects being manipulated. As many of the filesystem shell operations use renaming as the final stage in operations, skipping that stage can avoid long delays.
In particular, we recommend that when using the put and copyFromLocal commands, you set the -d option for a direct upload. For example:
# Upload a file from the cluster filesystem
hadoop fs -put -d /datasets/example.orc s3a://bucket1/datasets/

# Upload a file from the local filesystem
hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket1/datasets/

# Create a file from stdin
echo "hello" | hadoop fs -put -d -f - s3a://bucket1/datasets/hello.txt
Commands which list many files tend to be significantly slower with Amazon S3 than when invoked against HDFS or other filesystems. For example:
hadoop fs -count s3a://bucket1/
hadoop fs -du s3a://bucket1/
Other slow commands include find, which can be very slow on a large store with many directories under the path. For example:
# Enumerate all files in the bucket
hadoop fs -find s3a://bucket1/ -print

# List *.txt in the bucket.
# Remember to escape the wildcard to stop the bash shell trying to expand it
hadoop fs -find s3a://bucket1/datasets/ -name \*.txt -print
The time to rename a file depends on its size. The time to rename a directory depends on the number and size of all files beneath that directory. If the operation is interrupted, the object store will be in an undefined state.
hadoop fs -mv s3a://bucket1/datasets s3a://bucket1/historical
The copy operation reads each file and then writes it back to the object store; the time to complete depends on the amount of data to copy, and on the bandwidth in both directions between the local computer and the object store.
hadoop fs -cp s3a://bucket1/datasets s3a://bucket1/historical
The further the computer is from the object store, the longer the copy process takes.
Refer to the Amazon S3 Performance section for further discussion of S3A filesystem semantics and its impact on performance.
S3A does not implement the same feature set as HDFS. FileSystem shell subcommands that depend on HDFS-specific features, such as append, truncate, snapshot, and replication operations, are not supported with an S3A URI; commands that query or manipulate permissions are covered under Security Model and Operations below.
This section only covers how selected Hadoop FileSystem shell commands behave when invoked against data in Amazon S3. Refer to the Apache documentation for more information on the Hadoop FileSystem shell commands.
Deleting Files
The rm command deletes objects and directories full of objects.
If the object store is eventually consistent, the fs -ls command and other accessors may briefly return the details of the now-deleted objects; this is an artifact of object stores which cannot be avoided.
If the filesystem client is configured to copy files to a trash directory, the trash directory is in the bucket. The rm operation then takes time proportional to the size of the data. Furthermore, the deleted files continue to incur storage costs.
To make sure that your deleted files are no longer incurring costs, you can do two things:
- Use the -skipTrash option when removing files:

hadoop fs -rm -skipTrash s3a://bucket1/dataset

- Use the expunge command to purge any data that has been previously moved to the trash directory:

hadoop fs -expunge -D fs.defaultFS=s3a://bucket1/
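If you want to confirm what is currently held in the trash before purging it, you can list the trash location inside the bucket; the path below assumes the default trash layout under the user's home directory and may differ on your cluster:

# List objects that rm has moved to the trash directory within the bucket
hadoop fs -ls s3a://bucket1/user/$USER/.Trash/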
Overwriting Objects
Amazon S3 is eventually consistent, which means that an operation which overwrites existing objects may not be immediately visible to all clients/queries. As a result, later operations which query the same object's status or contents may get the previous object; this can sometimes surface within the same client, while reading a single object.
Avoid having a sequence of commands which overwrite objects and then immediately work on the updated data; there is a risk that the previous data will be used instead.
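As a sketch of the pattern to avoid, the following overwrites an object created earlier in this section and immediately reads it back; on an eventually consistent store the read may still return the previous contents:

# Overwrite an existing object
echo "updated" | hadoop fs -put -f - s3a://bucket1/hello.txt
# An immediate read may still return the old data
hadoop fs -cat s3a://bucket1/hello.txt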
Timestamps of objects and directories in Amazon S3 do not follow the behavior of files and directories in HDFS:
- The creation time of an object is the time when the object was created in the object store. This is at the end of the write process, not at the beginning.
- If an object is overwritten, the modification time is updated.
- Directories may or may not have valid timestamps.
- The atime access time feature is not supported by any of the object stores found in the Apache Hadoop codebase.
For details on how these characteristics may affect the distcp -update operation, refer to the Copying Data Between a Cluster and Amazon S3 documentation.
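For reference, an incremental copy from the cluster's HDFS to S3A looks like the following; because of the timestamp behavior described above, -update may not skip files as reliably as it does between two HDFS clusters (the source path is illustrative):

# Incrementally copy a directory from HDFS to the bucket
hadoop distcp -update hdfs:///datasets s3a://bucket1/datasets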
Security Model and Operations
The security and permissions model of Amazon S3 is very different from that of a UNIX-style filesystem: on Amazon S3, operations which query or manipulate permissions are generally unsupported. Operations to which this applies include chgrp, chmod, chown, setfacl, and setfattr. The related attribute query commands getfacl and getfattr are also unavailable. In addition, operations which try to preserve permissions (for example fs -put -p) do not preserve permissions.
Although these operations are unsupported, filesystem commands which list permission and user/group details usually simulate these details. As a consequence, when interacting with read-only object stores, the permissions found in "list" and "stat" commands may indicate that the user has write access when, in fact, they do not.
Amazon S3 has a permissions model of its own. This model can be manipulated through store-specific tooling. Be aware that some of the permissions which can be set, such as write-only paths or various permissions on the root path, may be incompatible with the S3A client. The client expects full read and write access to the entire bucket when trying to write data, and may fail if it does not have these permissions.
As an example of how permissions are simulated, here is a listing of Amazon's public, read-only bucket of Landsat images:
$ hadoop fs -ls s3a://landsat-pds/
Found 10 items
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/L8
-rw-rw-rw-   1 mapred      23764 2015-01-28 18:13 s3a://landsat-pds/index.html
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/landsat-pds_stats
-rw-rw-rw-   1 mapred        105 2016-08-19 18:12 s3a://landsat-pds/robots.txt
-rw-rw-rw-   1 mapred         38 2016-09-26 12:16 s3a://landsat-pds/run_info.json
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/runs
-rw-rw-rw-   1 mapred   27458808 2016-09-26 12:16 s3a://landsat-pds/scene_list.gz
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/tarq
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/tarq_corrupt
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/test
As you can see:
- All files are listed as having full read/write permissions.
- All directories appear to have full rwx permissions.
- The replication count of all files is "1".
- The owner of all files and directories is declared to be the current user (mapred).
- The timestamp of all directories is actually that of the time the -ls operation was executed. This is because these directories are not actual objects in the store; they are simulated directories based on the existence of objects under their paths.
When an attempt is made to delete one of the files, the operation fails despite the permissions shown by the ls command:
$ hadoop fs -rm s3a://landsat-pds/scene_list.gz
rm: s3a://landsat-pds/scene_list.gz: delete on s3a://landsat-pds/scene_list.gz: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 1EF98D5957BCAB3D), S3 Extended Request ID: wi3veOXFuFqWBUCJgV3Z+NQVj9gWgZVdXlPU4KBbYMsw/gA+hyhRXcaQ+PogOsDgHh31HlTCebQ=
This demonstrates that the listed permissions cannot be taken as evidence of write access; only object manipulation can determine this.
By default, S3A includes the current Hadoop version in the User-Agent string passed through the AWS SDK to the Amazon S3 service. You may also include optional additional information to identify your application by setting the fs.s3a.user.agent.prefix configuration property in core-site.xml or on the command line, as documented here:
<property>
  <name>fs.s3a.user.agent.prefix</name>
  <value></value>
  <description>
    Sets a custom value that will be prepended to the User-Agent header sent in
    HTTP requests to the S3 back-end by S3AFileSystem. The User-Agent header
    always includes the Hadoop version number followed by a string generated by
    the AWS SDK. An example is "User-Agent: Hadoop 2.8.0, aws-sdk-java/1.10.6".
    If this optional property is set, then its value is prepended to create a
    customized User-Agent. For example, if this configuration property was set
    to "MyApp", then an example of the resulting User-Agent would be
    "User-Agent: MyApp, Hadoop 2.8.0, aws-sdk-java/1.10.6".
  </description>
</property>
The presence of "Hadoop" in the User-Agent identifies that the source of the call is a Hadoop application, running a specific version of Hadoop. Setting a custom prefix is optional, but it may assist AWS support with identifying traffic originating from a specific application.