Creating a Data Lake

Before creating a data lake, you must meet the prerequisites.

  1. From the cloud controller menu, select DATA LAKE SERVICES.

  2. Click +CREATE.

  3. Provide the following GENERAL CONFIGURATION parameters:

    Parameter        Description
    Data Lake Name   Enter a name for your data lake.
    Usage            Select Dev/Test.
    Size             Select Small (one EC2 instance will be created).

    You can expand SHOW ADVANCED OPTIONS to view additional options:

    Parameter       Description
    Instance Type   Choose the instance type for the data lake node.
    Tags            You can optionally add custom tags that will be displayed on the CloudFormation stack and on EC2 instances. Refer to Tagging Resources for more information.

  4. Provide the following NETWORK parameters:

    Parameter                  Description
    SSH Key Name               Name of an existing EC2 key pair to enable SSH access to the EC2 instances.
    Remote Access              Allow connections to the inbound ports for the data lake instance from this address range. Must be a valid CIDR IP. For example:
                               • 10.0.0.0/24 will allow access from 10.0.0.0 through 10.0.0.255.
                               • 0.0.0.0/0 will allow access from all addresses.
                               Refer to Security Groups for more information on the inbound ports that are used with data lake instances.
    Protected Gateway Access   This option is checked by default. It provides password-protected access to the Ambari and Ranger web UIs. See Protected Gateway for more information.

    You can expand SHOW ADVANCED OPTIONS to view additional options:

    Parameter                     Description
    Use existing VPC and subnet   Specify whether to use an existing VPC and subnet to deploy the data lake inside it. See Existing VPC for more information.

  5. Provide the following SECURITY parameters:

    Parameter                 Description
    Authentication Source     Select a previously registered authentication source. For more information, refer to Managing Authentication Sources.
    Data Lake Administrator   Enter the credentials that you would like to use for administering the data lake. This provides a default admin login for Ambari and Ranger in the data lake.

  6. Provide the following CLOUD STORAGE parameters:

    Parameter        Description
    Amazon S3 Path   Enter the name of an existing S3 bucket (for example, my-bucket) or a path to a specific folder in this bucket (for example, my-bucket/data/data1). The bucket must exist prior to being registered with the data lake.

  7. Provide the following SHARED SERVICES parameters:

    Parameter            Description
    JDBC Connection      Select the database type (PostgreSQL) and enter the JDBC connection string (HOST:PORT).
    Master Credentials   Enter the master username and password for the RDS instance.

    This information will be used to create two databases on the RDS instance: one for Hive and one for Ranger.

    Alternatively, you can create these two databases yourself and specify the connection information and credentials for each database. To access this option, expand SHOW ADVANCED OPTIONS and provide the information described under Hive Metastore and Ranger Database.

  8. Click CREATE DATA LAKE.

  9. Once the data lake is created, you can find its corresponding entry on the DATA LAKE SERVICES page, which is available from the navigation menu.
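The Remote Access parameter in step 4 expects a valid CIDR range. As a rough local pre-check before you submit the form, the following Python sketch uses the standard `ipaddress` module to show which addresses a given range covers. The helper name `describe_cidr` is made up for this illustration and is not part of the product, and the strict check on host bits is an assumption about what counts as valid.

```python
import ipaddress
from typing import Tuple


def describe_cidr(cidr: str) -> Tuple[str, str, int]:
    """Return (first address, last address, address count) for a CIDR range."""
    # strict=True raises ValueError for values with host bits set,
    # such as 10.0.0.5/24 (an assumption about what the form accepts).
    network = ipaddress.ip_network(cidr, strict=True)
    return str(network[0]), str(network[-1]), network.num_addresses


print(describe_cidr("10.0.0.0/24"))  # ('10.0.0.0', '10.0.0.255', 256)
print(describe_cidr("0.0.0.0/0"))    # covers every IPv4 address
```

A malformed value such as `10.0.0.0/33` raises `ValueError`, which makes it easy to catch typing mistakes early.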

Once you’ve created a data lake, you can associate it with one or more ephemeral clusters. This DATA LAKE SERVICES option is available when you create a cluster.
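The Amazon S3 Path parameter in step 6 accepts either a bucket name alone or a bucket name followed by a folder path. The following Python sketch illustrates how the two forms differ by splitting a value into its bucket and folder parts. The function name `split_s3_path` and the bucket name `my-bucket` are hypothetical and only serve the example.

```python
from typing import Tuple


def split_s3_path(path: str) -> Tuple[str, str]:
    """Split 'bucket' or 'bucket/folder/...' into (bucket, folder path)."""
    # The first path segment is the bucket; anything after it is the folder.
    bucket, _, folder = path.strip("/").partition("/")
    if not bucket:
        raise ValueError("an S3 bucket name is required")
    return bucket, folder


print(split_s3_path("my-bucket"))             # ('my-bucket', '')
print(split_s3_path("my-bucket/data/data1"))  # ('my-bucket', 'data/data1')
```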

Existing VPC

You can optionally choose to install into a different VPC (and subnet) than the VPC in which the cloud controller instance is running. By default, the data lake instances are installed into the same VPC as the cloud controller instance, but in a new subnet.

Hive Metastore

You can either use a previously registered external Hive metastore or have an external Hive metastore database created. In both cases, your external Hive metastore will be running on an Amazon RDS instance.

Parameter                            Description
Register new Hive metastore...       Enter connection information for an existing database on an existing Amazon RDS instance. This Hive metastore will be automatically registered and used with the data lake. See Managing Shared Metastores for more information.
List of registered Hive Metastores   If you have previously registered a Hive metastore for HDP 2.6, you can select it from the list. This option is only available if you have previously registered at least one Hive metastore for HDP 2.6.

Ranger Database

Enter the following information and a Ranger database will be created on an existing Amazon RDS instance:

Parameter         Description
JDBC Connection   Select the database type (PostgreSQL) and enter the JDBC connection string (HOST:PORT/DB_NAME).
Authentication    Enter the JDBC connection username and password.
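
The JDBC Connection field above expects a value of the form HOST:PORT/DB_NAME, whereas the SHARED SERVICES step of the procedure takes only HOST:PORT. The following Python sketch shows one way to check a value against the longer form before entering it. The names `JdbcTarget` and `parse_connection` and the host `mydb.example.com` are hypothetical; 5432 is the default PostgreSQL port.

```python
from typing import NamedTuple


class JdbcTarget(NamedTuple):
    host: str
    port: int
    db_name: str


def parse_connection(value: str) -> JdbcTarget:
    """Parse 'HOST:PORT/DB_NAME' into host, port, and database name."""
    host_port, slash, db_name = value.partition("/")
    host, colon, port = host_port.rpartition(":")
    # Every part must be present and the port must be numeric.
    if not (slash and colon and host and db_name and port.isdigit()):
        raise ValueError(f"expected HOST:PORT/DB_NAME, got {value!r}")
    return JdbcTarget(host, int(port), db_name)


target = parse_connection("mydb.example.com:5432/ranger")
print(target.host, target.port, target.db_name)
```

A value without the database name, such as `mydb.example.com:5432`, raises `ValueError` in this sketch.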