Copying Data Between a Cluster and Amazon S3

DistCp is a utility for copying large data sets between distributed filesystems. You can use DistCp to copy data between your cluster’s HDFS and Amazon S3.

This section describes how to copy data from S3 to HDFS, from HDFS to Amazon S3, and between Amazon S3 buckets. It also includes tips for copying large data volumes and describes limitations of using DistCp with Amazon S3.

Invoking DistCp

To access the DistCp utility, SSH to any node in your cluster. By default, DistCp is invoked against the cluster's default file system, which is defined in the fs.defaultFS configuration property in core-site.xml. For HDP clusters on AWS, the default filesystem is the deployed HDFS instance, so both of the following examples are valid:

hadoop distcp hdfs://source-folder s3a://destination-bucket
hadoop distcp /source-folder s3a://destination-bucket
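
You can also address HDFS explicitly by including the NameNode host and port in the URI. The host and port below are placeholders; substitute the address of your own NameNode:

hadoop distcp hdfs://your-namenode-host:8020/source-folder s3a://destination-bucket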

Copying Data from HDFS to Amazon S3

To transfer data from HDFS to an Amazon S3 bucket, use the following syntax:

hadoop distcp hdfs://source-folder s3a://destination-bucket

Updating Existing Data

If you want to transfer only the files that do not already exist in the target folder, add the -update option to improve the copy speed:

hadoop distcp -update hdfs://source-folder s3a://destination-bucket

Refer to the Apache documentation for a detailed explanation of how the handling of source paths varies depending on whether or not you add the -update option; the sketch below illustrates the difference.
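
As a rough sketch (the folder and bucket names are placeholders, and the target directory is assumed to already exist), the two forms behave as follows:

# Without -update, the source folder itself is recreated under the target:
hadoop distcp /source-folder s3a://destination-bucket/data
# result: s3a://destination-bucket/data/source-folder/...

# With -update, only the contents of the source folder are copied into the target:
hadoop distcp -update /source-folder s3a://destination-bucket/data
# result: s3a://destination-bucket/data/...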

Note

When copying between Amazon S3 and HDFS, the "update" check only compares file size; it does not use checksums to detect other changes in the data.

Copying Data from Amazon S3 to HDFS

To copy data from Amazon S3 to HDFS, list the path of the Amazon S3 data first. For example:

hadoop distcp s3a://hwdev-examples-ireland/datasets /tmp/datasets2

This downloads all files. You can add the -update option to download only the data that has changed:

hadoop distcp -update s3a://hwdev-examples-ireland/datasets /tmp/datasets2

For example:

[admin@ip-10-0-1-132 ~]$ hadoop distcp s3a://dominika-test/driver-data/ /tmp/test/
16/10/26 18:12:30 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3a://dominika-test/driver-data], targetPath=/tmp/test, targetPathExists=false, filtersFile='null'}
16/10/26 18:12:31 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-90.ec2.internal:8188/ws/v1/timeline/
16/10/26 18:12:31 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-90.ec2.internal/10.0.1.90:8050
16/10/26 18:12:31 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-90.ec2.internal/10.0.1.90:10200
16/10/26 18:12:34 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 4; dirCnt = 1
16/10/26 18:12:34 INFO tools.SimpleCopyListing: Build file listing completed.
16/10/26 18:12:34 INFO tools.DistCp: Number of paths in the copy list: 4
16/10/26 18:12:34 INFO tools.DistCp: Number of paths in the copy list: 4
16/10/26 18:12:34 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-90.ec2.internal:8188/ws/v1/timeline/
16/10/26 18:12:34 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-90.ec2.internal/10.0.1.90:8050
16/10/26 18:12:34 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-90.ec2.internal/10.0.1.90:10200
16/10/26 18:12:35 INFO mapreduce.JobSubmitter: number of splits:3
16/10/26 18:12:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1477496995418_0005
16/10/26 18:12:35 INFO impl.YarnClientImpl: Submitted application application_1477496995418_0005
16/10/26 18:12:35 INFO mapreduce.Job: The url to track the job: http://ip-10-0-1-90.ec2.internal:8088/proxy/application_1477496995418_0005/
16/10/26 18:12:35 INFO tools.DistCp: DistCp job-id: job_1477496995418_0005
16/10/26 18:12:35 INFO mapreduce.Job: Running job: job_1477496995418_0005
16/10/26 18:12:41 INFO mapreduce.Job: Job job_1477496995418_0005 running in uber mode : false
16/10/26 18:12:41 INFO mapreduce.Job:  map 0% reduce 0%
16/10/26 18:12:48 INFO mapreduce.Job:  map 33% reduce 0%
16/10/26 18:12:51 INFO mapreduce.Job:  map 67% reduce 0%
16/10/26 18:12:52 INFO mapreduce.Job:  map 100% reduce 0%
16/10/26 18:12:52 INFO mapreduce.Job: Job job_1477496995418_0005 completed successfully
16/10/26 18:12:52 INFO mapreduce.Job: Counters: 38
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=435951
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1544
        HDFS: Number of bytes written=2300349
        HDFS: Number of read operations=38
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=13
        S3A: Number of bytes read=2300325
        S3A: Number of bytes written=0
        S3A: Number of read operations=10
        S3A: Number of large read operations=0
        S3A: Number of write operations=0
    Job Counters 
        Launched map tasks=3
        Other local map tasks=3
        Total time spent by all maps in occupied slots (ms)=19764
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=19764
        Total vcore-milliseconds taken by all map tasks=19764
        Total megabyte-milliseconds taken by all map tasks=20238336
    Map-Reduce Framework
        Map input records=4
        Map output records=0
        Input split bytes=348
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=212
        CPU time spent (ms)=7390
        Physical memory (bytes) snapshot=732037120
        Virtual memory (bytes) snapshot=6369247232
        Total committed heap usage (bytes)=1048576000
    File Input Format Counters 
        Bytes Read=1196
    File Output Format Counters 
        Bytes Written=24
    org.apache.hadoop.tools.mapred.CopyMapper$Counter
        BYTESCOPIED=2300325
        BYTESEXPECTED=2300325
        COPY=4

Copying Data Between Amazon S3 Buckets

To copy data from one Amazon S3 bucket to another, use the following syntax:

hadoop distcp s3a://hwdev-example-ireland/datasets s3a://hwdev-example-us/datasets

You can copy data between two Amazon S3 buckets simply by listing the different bucket URLs as the source and destination paths. The buckets do not have to be in the same region: to copy data between Amazon S3 buckets hosted in different regions, simply name the bucket in the remote region.

Irrespective of source and destination bucket locations, when copying data between Amazon S3 buckets, all data passes through the Hadoop cluster: once to read, once to write. This means that the time to perform the copy depends on the size of the Hadoop cluster, and the bandwidth between it and the S3 buckets. Furthermore, even when running within Amazon's own infrastructure, you are billed for your accesses to remote Amazon S3 buckets.

Specifying Bucket-Specific Options

If a bucket uses different authentication or endpoint settings, you can configure them with bucket-specific options. For example, copying to a remote bucket that requires Amazon's V4 authentication API means declaring that bucket's S3 endpoint explicitly:

hadoop distcp -D fs.s3a.bucket.hwdev-example-frankfurt.endpoint=s3.eu-central-1.amazonaws.com \
  s3a://hwdev-example-us/datasets/set1 s3a://hwdev-example-frankfurt/datasets/

Similarly, different credentials may be used when copying between buckets belonging to different accounts. When doing so, bear in mind that secrets passed on the command line can be visible to other users on the system, so this approach is potentially insecure:

hadoop distcp \
  -D fs.s3a.bucket.hwdev-example-frankfurt.endpoint=s3.eu-central-1.amazonaws.com \
  -D fs.s3a.bucket.hwdev-example-frankfurt.access.key=AKAACCESSKEY-2 \
  -D fs.s3a.bucket.hwdev-example-frankfurt.secret.key=SECRETKEY \
  s3a://hwdev-example-us/datasets/set1 s3a://hwdev-example-frankfurt/datasets/

Using short-lived session keys can reduce this exposure, while storing the secrets in Hadoop JCEKS credential files is significantly more secure.
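
For example, the secrets could be stored in a JCEKS credential file and referenced at run time instead of being passed on the command line. The file path and key names below are illustrative only, and whether per-bucket key names are resolved from credential files depends on your Hadoop version:

# Store the access and secret keys in a credential file in HDFS (you are prompted for the values)
hadoop credential create fs.s3a.bucket.hwdev-example-frankfurt.access.key \
  -provider jceks://hdfs/user/admin/s3.jceks
hadoop credential create fs.s3a.bucket.hwdev-example-frankfurt.secret.key \
  -provider jceks://hdfs/user/admin/s3.jceks

# Point the S3A connector at the credential file instead of passing secrets on the command line
hadoop distcp \
  -D hadoop.security.credential.provider.path=jceks://hdfs/user/admin/s3.jceks \
  -D fs.s3a.bucket.hwdev-example-frankfurt.endpoint=s3.eu-central-1.amazonaws.com \
  s3a://hwdev-example-us/datasets/set1 s3a://hwdev-example-frankfurt/datasets/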

Copying Data Within an Amazon S3 Bucket

Copy operations within a single object store still take place in the Hadoop cluster, even when the object store implements a more efficient copy operation internally. That is, an operation such as

hadoop distcp s3a://bucket/datasets/set1 s3a://bucket/datasets/set2

copies each byte down to the Hadoop worker nodes and back to the bucket. In addition to being slow, it means that charges may be incurred.

Limitations

Consider the following limitations:

Learn More

For complete documentation of DistCp, refer to the Apache documentation.

Improving Performance When Copying Large Data Volumes

This section includes tips for improving performance when copying large volumes of data between Amazon S3 and HDFS.

The bandwidth between the Hadoop cluster and Amazon S3 is the upper limit on how fast data can be copied into S3. The further the Hadoop cluster is from the Amazon S3 endpoint, or the narrower the network connection, the longer the operation will take. Even a Hadoop cluster deployed within Amazon's own infrastructure may encounter network delays from throttled VM network connections.

Network bandwidth limits notwithstanding, there are some options which can be used to tune the performance of an upload.

Working with Local S3 Buckets

A foundational step to getting good performance is working with buckets "close" to the Hadoop cluster, where "close" is measured in network terms.

Maximum performance in HDCloud for AWS is achieved by working with S3 buckets in the same AWS region as the HDCloud cluster. For example, if your cluster is in the North Virginia ("US East") region, you will achieve the best performance if your S3 bucket is in the same region.

Each time you create a cluster in a new region, create a bucket in that same region. Likewise, make sure that buckets used for backup and cluster storage are in the same region as the clusters.

In addition to improving performance, working with local buckets ensures that no cross-region data transfer charges are incurred for reading from the bucket.

When working with S3 remotely (that is, if your cluster is not on AWS), use the closest S3 region possible; this reduces the latency of all queries and reads and makes the best use of the bandwidth between the Hadoop cluster and the object store.
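
For example, a cluster hosted outside AWS that is closest to the Frankfurt region might point the S3A connector at that region's endpoint. The endpoint shown is the standard one for eu-central-1; adjust it, and the placeholder paths, for your own setup:

hadoop distcp -D fs.s3a.endpoint=s3.eu-central-1.amazonaws.com \
  /source-folder s3a://destination-bucket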

Accelerating File Listing

When data is copied between buckets, listing all of the files to copy can take a long time. In such cases, you can increase -numListstatusThreads from the default of 1 to a value such as 15, so that multiple threads are used for listing the contents of the source folder.
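
For example, the following hypothetical bucket-to-bucket copy uses 15 listing threads:

hadoop distcp -numListstatusThreads 15 s3a://source-bucket/datasets s3a://destination-bucket/datasets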

Using the S3A Fast Uploader

Warning

These tuning recommendations are experimental and may change in the future.

If you are planning to copy large amounts of data, make sure to use the fast uploader, passing -D fs.s3a.fast.upload=true while invoking DistCp. For example:

hadoop distcp -D fs.s3a.fast.upload=true s3a://dominika-test/driver-data /tmp/test2

The fs.s3a.fast.upload option significantly accelerates data upload by writing the data in blocks, possibly in parallel. Instead of passing the option on each command line, you can enable it for all jobs by setting the property in core-site.xml:

<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>

The details of this feature are covered in S3 Performance.

Controlling the Number of Mappers and Their Bandwidth

If you want to control the number of mappers launched for DistCp, you can add the -m option and set it to the desired number of mappers. If you are using DistCp from a Hadoop cluster running in Amazon's infrastructure, increasing the number of mappers may speed up the operation.
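
For example, assuming the cluster has enough capacity to run 40 simultaneous map tasks, the following (with placeholder paths) would run the copy with up to 40 mappers:

hadoop distcp -m 40 hdfs://source-folder s3a://destination-bucket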

Similarly, if you are copying to S3 from a remote site, it is possible that the bandwidth from the Hadoop cluster to Amazon S3 is the bottleneck. In that situation, because the bandwidth is shared across all mappers, adding more mappers will not accelerate the upload; it will merely slow all of the mappers down.

The -bandwidth option sets the approximate maximum bandwidth for each mapper, in megabytes per second. This is a floating point number, so a value such as -bandwidth 0.5 allocates 0.5 MB/s to each mapper.
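
For example, the following hypothetical invocation limits the copy to 20 mappers at roughly 10 MB/s each, for an aggregate of about 200 MB/s:

hadoop distcp -m 20 -bandwidth 10 hdfs://source-folder s3a://destination-bucket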