Creating a Data Lake

Before creating a data lake, you must meet the prerequisites. Once you have, proceed with the following steps to create a data lake:

  1. From the cloud controller menu, select DATA LAKE SERVICES.

  2. Click +CREATE. The data lake creation form is displayed.

    By default, only basic options are shown. You can expand SHOW ADVANCED OPTIONS to view additional options.

  3. Provide the following GENERAL CONFIGURATION parameters:

    Parameter Description
    Data Lake Name Enter a name for your Data Lake. The name:
    • Must start with a letter.
    • Must include 5-20 characters.
    • Can include only lowercase letters, numbers, and hyphens (-).
    Usage Select Dev/Test.
    Size Select Small. One EC2 instance will be created to run your Data Lake Services.

    You can expand SHOW ADVANCED OPTIONS to view additional options:

    Parameter Description
    Instance Type Choose the instance type for the data lake node.
    Tags You can add custom tags that will be displayed on the CloudFormation stack and on EC2 instances. Refer to Tagging Resources for more information.
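The naming rules above can be expressed as a simple regular expression check. This is only a sketch of the rules as documented; the exact validation the cloud controller applies may differ.

```python
import re

# Assumed pattern derived from the documented rules: starts with a letter,
# 5-20 characters total, lowercase letters, digits, and hyphens only.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{4,19}$")

def is_valid_data_lake_name(name: str) -> bool:
    return bool(NAME_PATTERN.match(name))

print(is_valid_data_lake_name("my-lake-01"))  # True
print(is_valid_data_lake_name("1badname"))    # False: starts with a digit
print(is_valid_data_lake_name("abc"))         # False: shorter than 5 characters
```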
  4. Provide the following NETWORK parameters:

    Parameter Description
    SSH Key Name Choose an existing EC2 key pair to enable SSH access to the EC2 instance via that key.
    Remote Access Allow connections to the inbound ports for the data lake instance from this address range. Must be a valid CIDR IP. For example:
    • 10.0.0.0/24 will allow access from 10.0.0.0 through 10.0.0.255.
    • 0.0.0.0/0 will allow access from any address.
    Protected Gateway Access This option is checked by default. This option provides password-protected access to the Ambari and Ranger web UIs. If you uncheck it, you will not be able to log in to these UIs. See Protected Gateway for more information.

    You can expand SHOW ADVANCED OPTIONS to view additional options:

    Parameter Description
    Use existing VPC and subnet Specify whether to use an existing VPC and subnet to deploy the data lake inside it. See Existing VPC for more information.
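The CIDR ranges used for Remote Access can be checked with Python's standard ipaddress module. This sketch simply expands a CIDR block into the address range it covers, matching the examples above:

```python
import ipaddress

def describe_cidr(cidr: str) -> str:
    """Expand a CIDR block into the address range it allows."""
    net = ipaddress.ip_network(cidr)
    return (f"{net.network_address} through {net.broadcast_address} "
            f"({net.num_addresses} addresses)")

print(describe_cidr("10.0.0.0/24"))  # 10.0.0.0 through 10.0.0.255 (256 addresses)
print(describe_cidr("0.0.0.0/0"))    # 0.0.0.0 through 255.255.255.255 (4294967296 addresses)
```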
  5. Provide the following SECURITY parameters:

    Parameter Description
    Authentication Source Select a previously registered authentication source. For more information, refer to Managing Authentication Sources.
    Data Lake Administrator Enter the credentials that you would like to use for administering the data lake. These become the default administrator credentials for Ambari and Ranger.
  6. Provide the following CLOUD STORAGE parameters:

    Parameter Description
    Amazon S3 Path Enter the name of an existing S3 bucket (for example, my-bucket) or a path to a specific folder in this bucket (for example, my-bucket/data/data1). The bucket must exist prior to being registered with the data lake.
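An S3 path in this form is the bucket name optionally followed by a folder path. A minimal sketch of how such a value splits into its bucket and folder parts (the bucket name used here is hypothetical):

```python
def split_s3_path(path: str) -> tuple:
    """Split an S3 path like 'my-bucket/data/data1' into (bucket, folder)."""
    bucket, _, folder = path.partition("/")
    return bucket, folder

print(split_s3_path("my-bucket"))             # ('my-bucket', '')
print(split_s3_path("my-bucket/data/data1"))  # ('my-bucket', 'data/data1')
```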
  7. Provide the following SHARED SERVICES parameters:

    Parameter Description
    JDBC Connection Select the database type (PostgreSQL) and enter the JDBC connection string (HOST:PORT).
    Master Credentials Enter the master username and password for the RDS instance.

    This information will be used to create two databases on the RDS instance: one for Hive and one for Ranger.

    Alternatively, you can create these two databases yourself and specify the connection information and credentials for each. To access this option, expand SHOW ADVANCED OPTIONS and provide the following information for your Hive Metastore and Ranger Database.
    Refer to Hive Metastore and Ranger Database for more information.
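For reference, a PostgreSQL JDBC URL takes the form jdbc:postgresql://HOST:PORT[/DB_NAME]; the form above asks only for the HOST:PORT portion, and the service presumably assembles the full URL. A sketch of that assembly (the hostname shown is hypothetical):

```python
def jdbc_url(host: str, port: int, db_name: str = "") -> str:
    # PostgreSQL JDBC URLs take the form jdbc:postgresql://HOST:PORT[/DB_NAME]
    base = f"jdbc:postgresql://{host}:{port}"
    return f"{base}/{db_name}" if db_name else base

# Hypothetical RDS endpoint for illustration only
print(jdbc_url("mydb.example.us-east-1.rds.amazonaws.com", 5432))
print(jdbc_url("mydb.example.us-east-1.rds.amazonaws.com", 5432, "ranger"))
```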

  8. Click CREATE DATA LAKE.

  9. Once the data lake is created, you can find its corresponding entry on the DATA LAKE SERVICES page, which is available from the navigation menu.

Once you’ve created a data lake, you can associate it with one or more ephemeral clusters. This option, called DATA LAKE SERVICES, is available when you create a cluster.

Existing VPC

You can optionally install into a different VPC (and subnet) than the one in which the cloud controller instance is running. By default, the data lake instances are installed into the same VPC as the cloud controller instance, but in a new subnet.

Hive Metastore

You have an option to either use a previously registered external Hive metastore or to have an external Hive metastore database created. In both cases, your external Hive metastore will be running on an Amazon RDS instance.

Parameter Description
Register new Hive metastore... Enter connection information for an existing database on an existing Amazon RDS instance. This Hive metastore will be automatically registered and used with the data lake. See Managing Shared Metastores for more information.
List of registered Hive Metastores If you have previously registered a Hive metastore for HDP 2.6, you can select it from the list. This option is only available if you have previously registered at least one Hive metastore for HDP 2.6.

Ranger Database

Enter the following information and a Ranger database will be created on an existing Amazon RDS instance:

Parameter Description
JDBC Connection Select the database type (PostgreSQL) and enter the JDBC connection string (HOST:PORT/DB_NAME).
Authentication Enter the JDBC connection username and password.