Shared Data Lake Services provide a way to centrally apply and enforce authentication, authorization, and audit policies across multiple ephemeral workload clusters. After you "attach" a workload cluster to the Data Lake Services, workloads on that cluster run in its security context.
While workloads are temporary, the security policies are long-running and shared across all workloads. As your workloads come and go, the Data Lake Services instance lives on, providing consistent security policy definitions that remain available for current and future ephemeral workloads.
Once you’ve created an instance of Data Lake Services - referred to in the cloud controller web UI simply as a "Data Lake" or "DLS" - you can attach it to one or more ephemeral clusters, applying the same authentication, authorization, and audit policies across all of them.
|Component|Description|
|---|---|
|Authentication Source|User source for authentication and definition of groups for authorization.|
|Data Lake Services|Runs Ranger, which is used for configuring authorization policies and for audit capture.|
|Attached Clusters|The clusters attached to the data lake. This is where you run workloads via JDBC and Zeppelin.|
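Running a workload on an attached cluster over JDBC typically means connecting to HiveServer2 through the cluster's protected gateway with a tool such as Beeline. The sketch below is illustrative only; the hostname, port, topology path, and credentials are placeholders, not values the cloud controller provides:

```shell
# Connect to HiveServer2 on an attached cluster through the SSL-protected
# gateway. Hostname, port, topology path, and credentials are placeholders.
beeline -u "jdbc:hive2://gateway.example.com:8443/;ssl=true;transportMode=http;httpPath=gateway/mytopology/hive" \
  -n myuser -p mypassword \
  -e "SHOW TABLES;"
```

The gateway authenticates the connection against the shared authentication source, so the same credentials work across every attached cluster.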
The components of the Shared Data Lake Services include:
|Component|Technology|Description|
|---|---|---|
|Schema|Apache Hive|Provides the Hive schema (tables, views, and so on). If you have two or more workloads accessing the same Hive data, you need to share schema across those workloads.|
|Policy|Apache Ranger|Defines security policies around Hive schema. If you have two or more users accessing the same data, you need security policies to be consistently available and enforced.|
|Audit|Apache Ranger|Audits user access and captures data access activity for the workloads.|
|Directory|LDAP/AD|Provides an authentication source for users and a definition of groups for authorization.|
|Protected Gateway|Apache Knox|Supports a single workload endpoint that can be protected with SSL and enabled for authentication to access resources.|
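For reference, Apache Knox drives the protected gateway's behavior through a topology descriptor. The fragment below is an illustrative sketch only - the cloud controller generates the real descriptor for you - and the realm class parameters, hostnames, DNs, and service URL are all placeholder assumptions:

```xml
<!-- Illustrative Knox topology sketch: LDAP authentication in front of Hive.
     All hostnames, DNs, and URLs are placeholders, not controller-generated values. -->
<topology>
  <gateway>
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param>
        <name>main.ldapRealm</name>
        <value>org.apache.knox.gateway.shirorealm.KnoxLdapRealm</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldap://ldap.example.com:389</value>
      </param>
      <param>
        <name>main.ldapRealm.userDnTemplate</name>
        <value>uid={0},ou=people,dc=example,dc=com</value>
      </param>
    </provider>
  </gateway>
  <service>
    <role>HIVE</role>
    <url>http://hive.example.com:10001/cliservice</url>
  </service>
</topology>
```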
Overview of Steps
Follow these steps to set up a data lake:
- Meet the prerequisites.
- Create a data lake.
- Once you’ve created a data lake, associate it with one or more ephemeral clusters. This option is available in the DATA LAKE SERVICES section when you create a cluster.
To set up a data lake, you must first set up the following resources:
|Prerequisite|Description|
|---|---|
|Amazon S3 bucket|You must have an existing Amazon S3 bucket. This bucket is used to store Ranger audit logs and serves as the default Hive warehouse location.|
|LDAP/AD instance|You must have an existing LDAP/AD instance, registered as an authentication source in the cloud controller web UI. For instructions on how to register an existing LDAP/AD, refer to Registering an Authentication Source.|
You must have an existing Amazon RDS instance. You have two options:
For instructions on how to create an Amazon RDS instance, refer to Creating an Amazon RDS Instance.
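The prerequisites above can be spot-checked from a workstation with the AWS CLI and OpenLDAP client tools installed. This is a sketch only; the bucket name, LDAP host, bind DN, and RDS identifier below are placeholders you would replace with your own values:

```shell
# Verify the S3 bucket exists and is reachable (exits 0 on success).
aws s3api head-bucket --bucket my-datalake-bucket

# Verify the LDAP/AD instance answers a bind and a simple user search.
ldapsearch -H ldap://ldap.example.com:389 \
  -D "cn=admin,dc=example,dc=com" -W \
  -b "dc=example,dc=com" "(objectClass=person)" cn

# Verify the RDS instance is provisioned and available.
aws rds describe-db-instances \
  --db-instance-identifier my-datalake-db \
  --query "DBInstances[0].DBInstanceStatus"
```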
After you've met these prerequisites, you can create a data lake.