Introduction

Welcome to the Hortonworks Data Cloud Technical Preview.

Hortonworks Data Cloud (HDCloud) for Amazon Web Services (AWS) is a service that allows you to quickly launch ephemeral clusters for workloads that analyze and process data. Powered by the Hortonworks Data Platform, Hortonworks Data Cloud is an easy-to-use solution for handling big data use cases in the cloud: Interactive Analytics (Apache Hive LLAP), Data Science (Apache Spark and Apache Zeppelin), and ETL (Apache Hive).

Use Cases

Ephemeral on-demand clusters: Spin up a Hadoop cluster within minutes and start running workloads immediately. Instead of wading through endless configuration options, choose from a set of prescriptive cluster configurations. Add nodes on demand, and when you are done with your analysis, return the resources to the cloud.

Spark and Hive workloads: Spin up Spark or Hive clusters, depending on your specific data science (Spark and Zeppelin) or data analytics (Hive and Hive LLAP) tasks.

Integrate with Amazon S3: Collect and publish data across applications to Amazon S3, and then use this data for analysis. Because the data lives in S3 rather than in a cluster, you can collect and store it even while no Hadoop clusters are active, and it persists after you terminate a cluster.

Automation: Automatically create clusters, run specific jobs, and then terminate the clusters (see the scripted sketch after this list).

Security for enterprise use cases: Centrally apply and enforce authentication, authorization, and audit capabilities against your workload clusters.
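
As an illustration of the automation use case above, the create/run/terminate pattern can be scripted against the cloud controller. The sketch below is purely illustrative: the endpoint paths, payload fields, credentials, and cluster names are hypothetical placeholders, not the controller's documented API.

    import requests

    # Purely illustrative sketch of the create/run/terminate pattern.
    # The endpoints, payload, and credentials below are hypothetical
    # placeholders; consult the HDCloud documentation for the actual API.
    CONTROLLER = "https://cloud-controller.example.com"
    AUTH = ("admin", "replace-with-your-password")

    # 1. Create a cluster from a prescriptive configuration.
    resp = requests.post(
        f"{CONTROLLER}/api/clusters",
        json={"name": "etl-nightly", "clusterType": "EDW - ETL"},
        auth=AUTH,
    )
    resp.raise_for_status()

    # 2. Submit the job here (for example, a Hive script).

    # 3. Terminate the cluster so idle instances stop accruing charges.
    requests.delete(f"{CONTROLLER}/api/clusters/etl-nightly", auth=AUTH).raise_for_status()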

Architecture

The following graphic illustrates the high-level architecture of Hortonworks Data Cloud:

Primary Components

The two primary components of Hortonworks Data Cloud are the cloud controller and one or more clusters being managed by that controller. The cloud controller and the cluster nodes run on EC2 instances.

The cloud controller is a web application that communicates with AWS services to create AWS resources on your behalf. Once the AWS resources are in place, the cloud controller uses Apache Ambari to deploy and configure the cluster on those instances, based on your choice of HDP version and cluster configuration. Once the cluster is deployed, you can use the cloud controller to manage it.
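
Because the cluster is managed through Ambari, you can verify a deployment against Ambari's REST API once it is up. A minimal sketch, assuming Ambari's default port (8080); the host, credentials, and cluster name are hypothetical placeholders:

    import requests

    # Query Ambari's REST API for the services deployed to a cluster.
    # Host, credentials, and cluster name are hypothetical placeholders.
    AMBARI = "http://ambari-host.example.com:8080"
    AUTH = ("admin", "admin")
    # Ambari requires this header on modifying requests; harmless on GETs.
    HEADERS = {"X-Requested-By": "ambari"}

    resp = requests.get(
        f"{AMBARI}/api/v1/clusters/my-cluster/services",
        auth=AUTH,
        headers=HEADERS,
    )
    resp.raise_for_status()
    for item in resp.json()["items"]:
        print(item["ServiceInfo"]["service_name"])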

A cluster, used for running workloads, includes three node types: master, worker, and compute.

For the purposes of instance scaling and management, cluster instances are deployed into three auto scaling groups: one for the master node, one for the worker nodes, and one for the compute nodes. For more information on auto scaling groups, see the AWS documentation.
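
For example, you can inspect these groups with the AWS SDK. A minimal sketch using Python and boto3; the region and group name are hypothetical placeholders (HDCloud assigns its own group names, and scaling should normally be done through the cloud controller rather than by resizing groups directly):

    import boto3

    # Inspect the auto scaling groups behind a cluster. The region and
    # group name are hypothetical placeholders; HDCloud assigns its own.
    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    resp = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=["my-cluster-compute"]
    )
    for group in resp["AutoScalingGroups"]:
        print(
            group["AutoScalingGroupName"],
            "desired:", group["DesiredCapacity"],
            "min:", group["MinSize"],
            "max:", group["MaxSize"],
        )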

AWS Services

The following AWS services are used by Hortonworks Data Cloud:

Network and Security

In addition to the Amazon EC2 instances created for the cloud controller and cluster nodes, Hortonworks Data Cloud deploys additional network and security AWS resources on your behalf.

Amazon RDS

When creating a cluster, you have the option to have a Hive Metastore database created with the cluster or to use an external Hive Metastore backed by Amazon RDS. Using an external Amazon RDS database for the Hive Metastore allows you to preserve the metastore metadata and reuse it across clusters. For more information, see the Managing Metastores documentation.
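
As a sketch of what provisioning such a database with boto3 might look like (the identifier, instance class, engine, and credentials are hypothetical; check the Managing Metastores documentation for the supported engine and settings):

    import boto3

    # Sketch only: identifier, instance class, engine, and credentials
    # are hypothetical placeholders, not prescribed HDCloud settings.
    rds = boto3.client("rds", region_name="us-east-1")

    rds.create_db_instance(
        DBInstanceIdentifier="hive-metastore-db",
        DBInstanceClass="db.t2.medium",
        Engine="postgres",
        AllocatedStorage=20,  # GiB
        MasterUsername="hive",
        MasterUserPassword="replace-with-a-strong-password",
    )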

You also have the option to use an external Amazon RDS database to store cloud controller configuration information for upgrade and recovery purposes. For more information, see the Amazon RDS Instance documentation.

Amazon S3

Hortonworks Data Cloud provides seamless access to Amazon S3 buckets, in which you can store data for an extended period of time. You can copy data sets to HDFS for analysis and then copy them back to S3 when done. For more information, see the Data Storage on Amazon S3 documentation.
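
A typical copy between S3 and HDFS uses Hadoop's DistCp with the s3a connector. A minimal sketch, run from a cluster node; the bucket and path names are hypothetical, and it assumes the s3a connector is configured with credentials for the bucket:

    import subprocess

    # Copy a data set from S3 into HDFS for analysis, then publish the
    # results back to S3. Bucket and path names are hypothetical.
    subprocess.run(
        ["hadoop", "distcp", "s3a://my-bucket/input", "hdfs:///tmp/input"],
        check=True,
    )

    # ... run the analysis against hdfs:///tmp/input ...

    subprocess.run(
        ["hadoop", "distcp", "hdfs:///tmp/output", "s3a://my-bucket/results"],
        check=True,
    )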

Get Started

This section describes how to get Hortonworks Data Cloud running in your AWS environment.

To get started:

  1. Meet the prerequisites.
  2. Review available AWS regions and select the region in which you would like to launch the cloud controller.
  3. Review available cluster configurations and select your desired configuration.
  4. Launch a cloud controller instance that you will use to provision a cluster.
  5. Log in to the cloud controller UI and create a cluster.

Note

The Hortonworks Data Cloud software runs in your AWS environment. You are responsible for AWS charges incurred while running Hortonworks Data Cloud and the clusters it manages. To learn more about AWS pricing, see the service-specific pricing pages or the AWS Simple Monthly Calculator.

Prerequisites

To use Hortonworks Data Cloud, you need the following:

  1. AWS account: If you already have an AWS account, log in to the AWS Management Console. Alternatively, you can create a new AWS account.
  2. A key pair in a selected region: The Amazon EC2 instances that you create for Hortonworks Data Cloud will be accessible by the key pair that you provide during installation. Refer to the AWS documentation for instructions on how to create a key pair in a selected region.
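
As a scripted alternative to creating the key pair in the AWS console, the sketch below uses Python and boto3; the region and key name are hypothetical placeholders:

    import os
    import boto3

    # Create a key pair in the region where the cloud controller will run.
    # The region and key name are hypothetical placeholders.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.create_key_pair(KeyName="my-hdcloud-key")

    # AWS returns the private key material only once; save and protect it.
    with open("my-hdcloud-key.pem", "w") as f:
        f.write(resp["KeyMaterial"])
    os.chmod("my-hdcloud-key.pem", 0o400)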

AWS Regions

Not all AWS services are supported in all regions (for details, see the AWS Region Table). Therefore, Hortonworks Data Cloud can be launched only in the following regions:

Region Name              Region
US East (N. Virginia)    us-east-1
US West (Oregon)         us-west-2
EU Central (Frankfurt)   eu-central-1
EU West (Dublin)         eu-west-1
Asia Pacific (Tokyo)     ap-northeast-1

Cluster Workload Configurations

You can create different types of Apache Hive and Apache Spark clusters. After you have launched the cloud controller and it's time to create a cluster, you will be prompted to choose the HDP Version and the Cluster Type.

HDP Version: HDP 2.6 Cloud

Cluster Type      Services                       Description
Data Science      Spark 1.6, Zeppelin 0.7.0      Includes Spark 1.6 with Zeppelin.
Data Science      Spark 2.1, Zeppelin 0.7.0      Includes Spark 2.1 with Zeppelin.
EDW - Analytics   Hive 2 LLAP, Zeppelin 0.7.0    Includes Hive 2 LLAP.
EDW - ETL         Hive 1.2.1, Spark 1.6          Includes Hive and Spark 1.6.
EDW - ETL         Hive 1.2.1, Spark 2.1          Includes Hive and Spark 2.1.
BI                Druid 0.9.2                    Includes a Technical Preview of Druid.

HDP Version: HDP 2.5 Cloud

Cluster Type      Services                       Description
Data Science      Spark 1.6, Zeppelin 0.6.0      Includes Spark 1.6 and Zeppelin.
EDW - ETL         Hive 1.2.1, Spark 1.6          Includes Hive and Spark 1.6.
EDW - ETL         Hive 1.2.1, Spark 2.0          Includes a Technical Preview of Spark 2.0.
EDW - Analytics   Hive 2 LLAP, Zeppelin 0.6.0    Includes a Technical Preview of Hive 2 LLAP.

For a full list of services included in each of the configurations, refer to Cluster Services.

Choosing Your Configuration

When creating a cluster, you can choose a more stable cluster configuration for a predictable experience, or you can try the latest capabilities by choosing a more experimental configuration. The following configuration classification applies:

  • Stable configurations are the best choice if you want to avoid issues when launching and using clusters.
  • Technical Preview component configurations let you try a Technical Preview version of a component within a stable HDP release.
  • Technical Preview HDP configurations are the most cutting-edge option, including Technical Preview components in a Technical Preview HDP release.