Using Apache Hive with Amazon S3

This section describes how to use Apache Hive with data on Amazon S3.

Using Apache Hive with HDCloud

Before you start working with Apache Hive in HDCloud, review the Accessing a Cluster and Using Apache Hive documentation.

Accessing Amazon S3 Data in Hive

Exposing Data in Amazon S3 as Hive Tables

Datasets stored in Amazon S3 can easily be made available in Hive as managed or external tables. The main difference between the two table types is that data referenced by an external table is not deleted when the table is dropped.

The following statement creates a managed table called “inventory”:

CREATE TABLE `inventory`(
  `inv_item_sk` int,
  `inv_warehouse_sk` int,
  `inv_quantity_on_hand` int)
PARTITIONED BY (
  `inv_date_sk` int) STORED AS ORC 
LOCATION
  's3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory';

If you drop the “inventory” table, the contents of s3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory will be deleted.

The following statement creates an external table called “inventory”:

CREATE EXTERNAL TABLE `inventory`(
  `inv_item_sk` int,
  `inv_warehouse_sk` int,
  `inv_quantity_on_hand` int)
PARTITIONED BY (
  `inv_date_sk` int) STORED AS ORC 
LOCATION
  's3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory';

In this case, dropping the “inventory” table does not delete the contents of s3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory. It drops only the table definition from the Metastore.

Therefore, external tables are optimal when the data is already present in Amazon S3: once populated, the data can be accessed many times, and it is not deleted even if the table is accidentally dropped.
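
The difference can be illustrated with a simple DROP statement against the tables defined above (the bucket name is the same placeholder used in the earlier examples):

-- For the managed table, this also deletes the ORC files under
-- s3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory.
-- For the external table, only the table definition is removed from
-- the Metastore and the data in S3 is left in place.
DROP TABLE inventory;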

When working with data on Amazon S3, the steps for populating partition-related information are the same as when working with data in HDFS.

Creating a table definition does not by itself populate partition-related information in the Metastore. When a dataset in Amazon S3 is already partitioned, you must run the MSCK REPAIR TABLE command to load the partition-related information into the Metastore.

For example, consider the following statement:

CREATE EXTERNAL TABLE `inventory`(
  `inv_item_sk` int,
  `inv_warehouse_sk` int,
  `inv_quantity_on_hand` int)
PARTITIONED BY (
  `inv_date_sk` int) STORED AS ORC 
LOCATION
  's3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory';

This statement creates a table definition in the Metastore, but does not populate the partition-related information. To populate the partition-related information, you need to run MSCK REPAIR TABLE inventory.

You can use the hive.metastore.fshandler.threads parameter (default is 15) to increase the number of threads used for scanning the partitions in the MSCK phase.
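
For example, the following Beeline session registers the partitions of the external “inventory” table and then lists them to verify the result. The thread-count override is illustrative; in deployments with a remote Hive Metastore, hive.metastore.fshandler.threads is typically set in hive-site.xml on the metastore side instead:

-- Optionally raise the number of metastore filesystem handler threads
-- used to scan S3 for partition directories (example value).
SET hive.metastore.fshandler.threads=30;

-- Scan the table location on S3 and add any missing partitions
-- to the Metastore.
MSCK REPAIR TABLE inventory;

-- Verify that the partitions are now registered.
SHOW PARTITIONS inventory;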

Analyzing Tables

When working with data on Amazon S3, the steps for analyzing tables are the same as when working with data in HDFS.

Table statistics can be gathered automatically by setting hive.stats.autogather=true, or by running the ANALYZE TABLE <table_name> COMPUTE STATISTICS command. For example:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS;

However, column statistics can only be gathered by running the ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR COLUMNS command. For example:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS FOR COLUMNS;

For more information and examples, refer to the Apache documentation.

Improving Performance for Hive Jobs

This section includes performance tips related to Hive.

General Performance Tips

Tune these parameters to improve Hive performance:

Parameter: hive.warehouse.subdir.inherit.perms
Recommended setting: As S3 has no concept of file permissions, set hive.warehouse.subdir.inherit.perms=false to reduce the number of file permission checks.

Parameter: hive.metastore.pre.event.listeners
Recommended setting: As S3 has no concept of directory permissions, set hive.metastore.pre.event.listeners to an empty value to reduce the number of directory permission checks in S3.

You can set these parameters in hive-site.xml.

Reading ORC Format in Hive

Tune these parameters to improve the performance of jobs that read the ORC format:

Parameter: hive.orc.compute.splits.num.threads
Recommended setting: If you use the ORC format and want to improve split computation time, increase hive.orc.compute.splits.num.threads (default is 10). This parameter controls the number of parallel threads involved in computing splits. Note that split computation for Parquet is still single-threaded, so it can take longer with Parquet and S3.

Parameter: hive.orc.splits.include.file.footer
Recommended setting: If you use the ORC format with the ETL file split strategy, set hive.orc.splits.include.file.footer=true to piggyback the file footer information in the split payload.

You can set these parameters using the --hiveconf option in the Hive CLI or the SET command in Beeline.
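
For example, the ORC-related settings above can be applied per session in Beeline. The values shown are illustrative starting points rather than recommendations for every workload:

-- Increase the number of threads used to compute ORC splits
-- (default is 10; 20 is only an example value).
SET hive.orc.compute.splits.num.threads=20;

-- Include the ORC file footer in the split payload when the
-- ETL split strategy is in use.
SET hive.orc.splits.include.file.footer=true;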

Accelerating ETL Jobs

Tune these parameters to improve the performance of ETL jobs:

Parameter: hive.stats.fetch.partition.stats
Recommended setting: Query launches can be slightly slower if no statistics are available or if hive.stats.fetch.partition.stats=false. In such cases, Hive ends up looking at file sizes for every file it tries to access. Tuning hive.metastore.fshandler.threads helps reduce the overall time taken for the metastore operation.

Parameter: fs.trash.interval
Recommended setting: Dropping a table can be slow in S3 because the action involves moving files to the trash (a copy plus delete). Set fs.trash.interval=0 to skip the trash altogether.

You can set these parameters using the --hiveconf option in the Hive CLI or the SET command in Beeline.
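
For example, a per-session sketch in Beeline, assuming your environment allows these properties to be overridden at the session level:

-- Use partition-level statistics from the Metastore instead of
-- probing file sizes on S3 during query planning.
SET hive.stats.fetch.partition.stats=true;

-- Delete files directly on drop instead of copying them to a
-- trash location first.
SET fs.trash.interval=0;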

Accelerating Inserts in Hive

When inserting data, Hive moves the data from a temporary folder to the final location. This move is actually a copy-plus-delete action, which is expensive in Amazon S3; the more data that is written to S3, the more expensive the operation becomes. To accelerate the process, you can tune hive.mv.files.thread (default is 15), depending on the size of your dataset. You can set it in hive-site.xml.
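
For example, a minimal per-session sketch, assuming your environment allows this property to be overridden with SET; otherwise, add it to hive-site.xml. The value shown is only illustrative:

-- Increase the number of threads used to move files from the
-- temporary location to the final S3 location (default is 15).
SET hive.mv.files.thread=30;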

For more information on how to improve performance when working with Amazon S3, refer to Improving Amazon S3 Performance.