Getting Started with Data Lake Services

Shared Data Lake Services provide a way for you to centrally apply (and enforce) authentication, authorization, and audit policies across your workload clusters.

These policies can be applied globally across multiple ephemeral clusters. When you "attach" a workload cluster to the Services, the workloads on that cluster run in the shared security context.

As your workloads come and go, the Data Lake Services instance lives on, providing consistent, always-available security policy definitions that are shared with other ephemeral workloads.

Once you’ve created an instance of Data Lake Services - referred to in the cloud controller web UI simply as a Data Lake - you have the option to attach it to one or more ephemeral clusters. This allows you to apply the same data access, security, and authentication policies across multiple clusters.

Architecture

The components of the “Shared Data Lake Services” include:

  • Schema Metastore (Apache Hive): Provides the Hive schema (tables, views, and so on). If two or more workloads access the same Hive data, they need to share the schema across those workloads.
  • Security Policies (Apache Ranger): Defines security policies around the Hive schema. If two or more users access the same data, the security policies must be consistently available and enforced (see the sketch after this list).
  • Audit Logging (Apache Ranger): Audits user access and captures data access activity for the workloads.
  • User and Group Directory (LDAP/AD): Provides an authentication source for users and group definitions for authorization.
  • Protected Gateway (Apache Knox): Provides a single workload endpoint that can be protected with SSL and enabled with authentication for access to resources.
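
As an illustration of the shared security layer, the following minimal Python sketch lists the Hive policies held by the shared Ranger Admin over Ranger's public REST API. The hostname, credentials, and Ranger service name used here are assumptions made for the example; your data lake exposes its own values.

    # Minimal sketch: list the Hive policies held by the shared Ranger Admin.
    # The endpoint, credentials, and service name are assumptions for illustration.
    import requests

    RANGER_ADMIN = "https://ranger.datalake.example.com:6182"   # assumed Ranger Admin URL
    RANGER_AUTH = ("admin", "changeit")                         # assumed admin credentials
    HIVE_SERVICE = "dl_hive"                                    # assumed Ranger service name

    def list_hive_policies():
        """Return all Ranger policies defined for the shared Hive service."""
        resp = requests.get(
            f"{RANGER_ADMIN}/service/public/v2/api/service/{HIVE_SERVICE}/policy",
            auth=RANGER_AUTH,
            verify=False,   # in practice, point this at your CA bundle instead
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        for policy in list_hive_policies():
            print(policy["name"], "->",
                  [item.get("users", []) for item in policy.get("policyItems", [])])

Because every attached workload cluster enforces these shared policies, a change made once in the shared Ranger Admin applies to all ephemeral clusters that use the data lake.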

Overview of Tasks

  1. Meet the prerequisites.
  2. Create a data lake.
  3. Once you’ve created a data lake, associate it with one or more ephemeral clusters. This option is available in the DATA LAKE SERVICES section when you create a cluster.

You can also manage your existing data lakes.

Prerequisites

In order to set up a data lake, you must first set up the following:

  • Amazon S3 bucket: An existing Amazon S3 bucket that will be used to store the Ranger audit logs and serve as the default Hive warehouse location (see the provisioning sketch after this list).
  • LDAP/AD instance: An existing LDAP/AD instance that you register as an authentication source.
  • Amazon RDS: You have two options:
      • Create an Amazon RDS instance (PostgreSQL) and provide its endpoint and master credentials when creating a data lake. The cloud controller will automatically create the databases for Hive and Ranger.
      • Create an Amazon RDS instance (PostgreSQL) with two databases in it, one for Hive and one for Ranger. You will register them when creating a data lake.
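
If you prefer to script the AWS prerequisites rather than create them by hand, the following minimal boto3 sketch provisions the S3 bucket and the RDS PostgreSQL instance. The bucket name, instance identifier, sizes, and region are assumptions for the example; the data lake itself is still created through the cloud controller.

    # Minimal sketch: provision the S3 bucket and RDS PostgreSQL instance
    # that a data lake requires. Names, sizes, and region are assumptions.
    import boto3

    REGION = "us-west-2"                   # assumed region
    BUCKET = "my-datalake-bucket"          # assumed bucket name
    DB_ID = "my-datalake-postgres"         # assumed RDS instance identifier

    s3 = boto3.client("s3", region_name=REGION)
    rds = boto3.client("rds", region_name=REGION)

    # S3 bucket for Ranger audit logs and the default Hive warehouse location.
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

    # PostgreSQL RDS instance; the cloud controller creates the Hive and
    # Ranger databases in it when you supply the endpoint and master credentials.
    rds.create_db_instance(
        DBInstanceIdentifier=DB_ID,
        DBInstanceClass="db.t3.medium",
        Engine="postgres",
        AllocatedStorage=50,
        MasterUsername="dlmaster",
        MasterUserPassword="replace-with-a-strong-password",
    )

    # Wait until the instance is available, then note its endpoint for the
    # data lake creation form.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=DB_ID)
    endpoint = rds.describe_db_instances(DBInstanceIdentifier=DB_ID)[
        "DBInstances"][0]["Endpoint"]["Address"]
    print("RDS endpoint:", endpoint)

This sketch follows the first RDS option above: you supply only the endpoint and master credentials when creating the data lake, and the cloud controller creates the Hive and Ranger databases for you.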

After you've met these prerequisites, you can create a data lake.