Overview

Shared Data Lake Services provide a way for you to centrally apply and enforce authentication, authorization, and audit policies across multiple ephemeral workload clusters. After you attach a workload cluster to the Data Lake Services, that cluster's workloads run in the shared security context.

While workloads are temporary, the security policies are long-running and shared by all workloads. As your workloads come and go, the instance of Data Lake Services lives on, providing consistent security policy definitions that remain available to current and future ephemeral workloads.

Once you’ve created an instance of Data Lake Services - which for simplicity is referred to in the cloud controller web UI as a "Data Lake" or "DLS" - you have the option to attach it to one or more ephemeral clusters. This allows you to apply the same authentication, authorization, and audit policies across multiple workload clusters.

Authentication Source: User source for authentication and definition of groups for authorization.
Data Lake Services: Runs Ranger, which is used for configuring authorization policies and for capturing audits.
Attached Clusters: The clusters attached to the data lake. This is where you run workloads via JDBC and Zeppelin.
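
To illustrate running workloads via JDBC against an attached cluster, the following is a minimal sketch that connects to a Knox-protected HiveServer2 endpoint from Python using the third-party JayDeBeApi package. The hostname, port, httpPath topology, truststore, credentials, and driver jar location shown are placeholders rather than values defined by this guide.

  # Minimal JDBC sketch against an attached cluster's Hive endpoint, routed
  # through the Knox protected gateway (HTTP transport mode over SSL).
  # All connection details below are placeholders.
  import jaydebeapi

  url = (
      "jdbc:hive2://gateway.example.com:8443/;"
      "ssl=true;sslTrustStore=/path/to/gateway.jks;trustStorePassword=changeit;"
      "transportMode=http;httpPath=gateway/default/hive"
  )

  conn = jaydebeapi.connect(
      "org.apache.hive.jdbc.HiveDriver",        # Hive JDBC driver class
      url,
      ["ldap_user", "ldap_password"],           # credentials checked against the LDAP/AD source
      "/path/to/hive-jdbc-standalone.jar",      # driver jar available on the client machine
  )

  cursor = conn.cursor()
  cursor.execute("SHOW TABLES")
  print(cursor.fetchall())
  cursor.close()
  conn.close()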

Architecture

The components of the Shared Data Lake Services include:

Schema (Apache Hive): Provides the Hive schema (tables, views, and so on). If you have two or more workloads accessing the same Hive data, you need to share schema across those workloads.
Policy (Apache Ranger): Defines security policies around the Hive schema. If you have two or more users accessing the same data, you need security policies to be consistently available and enforced.
Audit (Apache Ranger): Audits user access and captures data access activity for the workloads.
Directory (LDAP/AD): Provides an authentication source for users and a definition of groups for authorization.
Protected Gateway (Apache Knox): Provides a single workload endpoint that can be protected with SSL and enabled for authentication to access resources.
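
As an example of how the Policy and Audit components are used, the following is a minimal sketch that defines a Hive authorization policy through the Ranger Admin public REST API (the requests package is assumed to be installed). The Ranger Admin URL, credentials, Ranger service name, database, table, and group names are placeholders, not values defined by this guide.

  # Minimal sketch: create a Hive authorization policy in Ranger via its public
  # REST API. Endpoint, credentials, and resource names are placeholders.
  import requests

  RANGER_ADMIN = "https://ranger.example.com:6182"   # Ranger Admin endpoint (placeholder)
  AUTH = ("admin", "admin_password")                 # Ranger admin credentials (placeholder)

  policy = {
      "service": "datalake_hive",                    # Hive service name registered in Ranger
      "name": "analysts-select-sales",
      "resources": {
          "database": {"values": ["sales"]},
          "table":    {"values": ["transactions"]},
          "column":   {"values": ["*"]},
      },
      "policyItems": [
          {
              "groups": ["analysts"],                # group resolved from the LDAP/AD directory
              "accesses": [{"type": "select", "isAllowed": True}],
          }
      ],
  }

  resp = requests.post(
      f"{RANGER_ADMIN}/service/public/v2/api/policy",
      json=policy,
      auth=AUTH,
      verify="/path/to/ca-bundle.pem",               # CA bundle for the Ranger Admin certificate
  )
  resp.raise_for_status()
  print("Created policy id:", resp.json().get("id"))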

Overview of Steps

Follow these steps to set up a data lake:

  1. Meet the prerequisites.
  2. Create a data lake.
  3. Once you’ve created a data lake, attach it to one or more ephemeral clusters. This option is available in the DATA LAKE SERVICES section when you create a cluster.

Prerequisites

In order to set up a data lake, you must first set up the following resources:

Amazon S3 bucket: You must have an existing Amazon S3 bucket. This bucket will be used to store Ranger audit logs and will serve as the default Hive warehouse location.

LDAP/AD instance: You must have an existing LDAP/AD instance, registered as an authentication source in the cloud controller web UI. For instructions on how to register an existing LDAP/AD, refer to Registering an Authentication Source.

Amazon RDS instance: You must have an existing Amazon RDS instance. You have two options:

  • Create an Amazon RDS instance (PostgreSQL). When creating a data lake, you provide your endpoint and master credentials, and the cloud controller automatically creates databases for Hive and Ranger.
  • Create an Amazon RDS instance (PostgreSQL) with two databases on it, one for Hive and one for Ranger (see the provisioning sketch after this list). You register these databases when creating a data lake.

For instructions on how to create an Amazon RDS instance, refer to Creating an Amazon RDS Instance.
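
If you prefer to script these prerequisites instead of creating them in the AWS console, the following is a minimal provisioning sketch using boto3 and psycopg2. It covers the S3 bucket and the second RDS option (an instance with the Hive and Ranger databases pre-created). All names, the region, the instance class, and the credentials are placeholders, and networking details such as the VPC, subnet group, and security groups are omitted.

  # Minimal provisioning sketch for the S3 and RDS prerequisites.
  # All identifiers, credentials, and sizing values are placeholders.
  import boto3
  import psycopg2

  REGION = "us-west-2"

  # 1. S3 bucket for Ranger audit logs and the default Hive warehouse location.
  s3 = boto3.client("s3", region_name=REGION)
  s3.create_bucket(
      Bucket="my-datalake-bucket",
      CreateBucketConfiguration={"LocationConstraint": REGION},
  )

  # 2. PostgreSQL RDS instance (second option: databases created up front).
  rds = boto3.client("rds", region_name=REGION)
  rds.create_db_instance(
      DBInstanceIdentifier="datalake-metastore",
      DBInstanceClass="db.m5.large",
      Engine="postgres",
      MasterUsername="dladmin",
      MasterUserPassword="ChangeMe123",
      AllocatedStorage=100,
      PubliclyAccessible=False,
  )

  # Wait until the instance is available, then read its endpoint.
  rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="datalake-metastore")
  endpoint = rds.describe_db_instances(DBInstanceIdentifier="datalake-metastore")[
      "DBInstances"
  ][0]["Endpoint"]["Address"]

  # 3. Create one database for Hive and one for Ranger on that instance.
  conn = psycopg2.connect(
      host=endpoint, port=5432, user="dladmin", password="ChangeMe123", dbname="postgres"
  )
  conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
  cur = conn.cursor()
  for db in ("hive", "ranger"):
      cur.execute(f"CREATE DATABASE {db}")
  cur.close()
  conn.close()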

After you've met these prerequisites, you can create a data lake.