Getting Started with Data Lake Services
Shared Data Lake Services provide a way to centrally apply (and enforce) authentication, authorization, and audit capabilities across your workload clusters.
These policies can be applied globally across multiple ephemeral clusters: when you "attach" a workload cluster to the services, that cluster's workloads run in the shared security context.
As your workloads come and go, the Data Lake Services instance lives on, providing consistent, readily available security policy definitions that are shared with other ephemeral workloads.
Once you’ve created an instance of Data Lake Services - referred to in the cloud controller web UI simply as a Data Lake - you can attach it to one or more ephemeral clusters. This allows you to apply data access, security, and authentication policies consistently across multiple clusters.
The components of the “Shared Data Lake Services” include:
|Component|Technology|Description|
|---|---|---|
|Schema Metastore|Apache Hive|Provides the Hive schema (tables, views, and so on). If two or more workloads access the same Hive data, they need to share the schema.|
|Security Policies|Apache Ranger|Defines security policies around the Hive schema. If two or more users access the same data, the security policies must be consistently available and enforced.|
|Audit Logging|Apache Ranger|Audits user access and captures data access activity for the workloads.|
|User and Group Directory|LDAP/AD|Provides an authentication source for users and group definitions for authorization.|
|Protected Gateway|Apache Knox|Provides a single workload endpoint that can be protected with SSL and enabled with authentication for access to resources.|
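To make the Ranger rows above concrete, a Hive security policy in Ranger's JSON policy format might look roughly like the following sketch; the service, database, and group names are hypothetical, and real policies carry additional fields:

```json
{
  "service": "datalake_hive",
  "name": "sales_db_select",
  "resources": {
    "database": { "values": ["sales"] },
    "table":    { "values": ["*"] },
    "column":   { "values": ["*"] }
  },
  "policyItems": [
    {
      "groups":   ["analysts"],
      "accesses": [ { "type": "select", "isAllowed": true } ]
    }
  ]
}
```

Because the policy lives in the shared Ranger instance rather than in any one cluster, every attached workload sees the same grant: members of the `analysts` group may run `SELECT` against any table in the `sales` database.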
Overview of Tasks
- Meet the prerequisites.
- Create a data lake.
- Once you’ve created a data lake, you can associate it with one or more ephemeral clusters. This option is available in the DATA LAKE SERVICES section when you create a cluster.
You can also manage your existing data lakes.
To set up a data lake, you must first set up the following:
|Prerequisite|Description|
|---|---|
|Amazon S3 bucket|An existing Amazon S3 bucket is used to store Ranger audit logs and serves as the default Hive warehouse location.|
|LDAP/AD instance|You must register an existing LDAP/AD instance as an authentication source.|
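Registering an LDAP/AD instance as an authentication source typically means supplying connection and search settings along these lines; all host names, DNs, and property names below are illustrative placeholders, not the product's exact configuration keys:

```
ldap.url                = ldaps://ldap.example.com:636
ldap.bind.dn            = cn=admin,dc=example,dc=com
ldap.user.search.base   = ou=people,dc=example,dc=com
ldap.group.search.base  = ou=groups,dc=example,dc=com
ldap.user.object.class  = person
ldap.group.object.class = groupOfNames
```

The user search base feeds authentication, while the group search base supplies the group memberships that Ranger policies use for authorization.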
You have two options:
After you've met these prerequisites, you can create a data lake.
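As a sketch of the S3 prerequisite, the bucket can be created with the AWS CLI, and the default Hive warehouse then resolves to an `s3a://` path under it. The bucket name, region, and the `/apps/hive/warehouse` layout below are illustrative assumptions:

```shell
# Hypothetical bucket name for the data lake (must be globally unique)
BUCKET="my-datalake-audit-bucket"

# Create the bucket (requires AWS credentials; shown for illustration):
#   aws s3api create-bucket --bucket "$BUCKET" --region us-west-2 \
#     --create-bucket-configuration LocationConstraint=us-west-2

# The default Hive warehouse location then lives under an s3a:// URL:
HIVE_WAREHOUSE="s3a://${BUCKET}/apps/hive/warehouse"
echo "$HIVE_WAREHOUSE"
```

Ranger audit logs and Hive warehouse data share the bucket, so both survive the teardown of any individual ephemeral cluster.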