Node Auto Repair

The cloud controller monitors clusters, ensuring that when host-level failures occur, they are quickly resolved by deleting and replacing failed nodes.

For each cluster, the cloud controller checks for the Ambari Agent heartbeat on all cluster nodes. If the Ambari Agent heartbeat is lost on a node (typically because of a host-level failure), a failure is reported for that node.

Once a failure is reported, it is either repaired automatically (if auto repair is enabled), or you are given the option to repair or delete the failed nodes manually (if auto repair is disabled).

Repair Flow

Repair Flow with Auto Repair Enabled

With node auto repair enabled, when a host fails:

  1. A notification about the node failure(s) is displayed in the UI.
  2. The recovery flow is triggered. The cluster status changes to 'REPAIR'.
  3. The first step of recovery is a downscale: the failed nodes are removed.
  4. The second step of recovery is an upscale: new nodes of the same type are added in place of the failed nodes.
  5. The recovery flow is completed. The cluster status changes to 'RUNNING'.

During the auto repair process, these steps are also written as messages to the EVENT HISTORY log (with the earliest message at the bottom).
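
If you prefer to follow an auto repair from a terminal instead of the UI, one option is to poll the cluster description and watch the status move from 'REPAIR' back to 'RUNNING'. The sketch below assumes that hdc describe-cluster accepts a --cluster-name option and that its output contains a status field; check the CLI help for the exact flags and output format of your version.

  # Poll the cluster roughly every 30 seconds and print its status while the
  # repair runs (press Ctrl+C to stop). The --cluster-name flag and the presence
  # of a status field in the output are assumptions for this example.
  while true; do
    hdc describe-cluster --cluster-name my-cluster | grep -i status
    sleep 30
  done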

Repair Flow with Auto Repair Disabled

With node auto repair disabled, when a worker or compute node fails:

  1. A notification about the node failures is displayed in the UI.
  2. You have the option to repair or delete the failed nodes.

For compute nodes running on spot instances, the repair option is not available, so when the host fails:

  1. A notification about the node failures is displayed in the UI.
  2. You have the option to delete the failed nodes.

Node Repair Options

The following table describes repair options available for different node types:

Node type                      Auto repair (AUTO REPAIR ON)   Manual repair (AUTO REPAIR OFF)
Master                         Not available                  Not available
Worker                         Available                      Available (Repair/Delete)
Compute (on-demand instance)   Available                      Available (Repair/Delete)
Compute (spot instance)        Not available                  Available (Delete)

As illustrated in the table, both auto repair and manual repair are available for worker nodes and for compute nodes running on on-demand EC2 instances. The auto repair option is not available for compute nodes that run on spot instances; with manual repair (i.e. when auto repair is turned off), you can delete failed spot instance nodes, but you cannot repair them.

Enabling Node Auto Repair

When creating a cluster, you can enable or disable auto repair for worker and compute nodes by using the ON/OFF option available under HARDWARE & STORAGE > Auto Repair:

Checking If Node Auto Repair Is Enabled

For each running cluster, you can see on the cluster detail screen whether auto repair is enabled for worker and/or compute nodes. When it is enabled, the text "auto repair is active" is displayed next to the name of the node group. For example, in the following screenshot, auto repair is enabled for worker nodes and disabled for compute nodes:

Performing Manual Node Repair

With auto repair enabled, the cloud controller automatically performs repair tasks for you. However, if you disable node auto repair, it is your responsibility to check for failures and, when they occur, repair or delete the failed nodes.

Manually Checking for Failures

When host-level failures are detected on worker or compute nodes, the following message is displayed on the cluster tile: “The cluster has unhealthy nodes”.

In addition, a message similar to “Manual recovery is needed for the following node...” is written to the EVENT HISTORY, and the status of the node changes from green to red.

On the cluster details page, the WORKER and COMPUTE sections show how many nodes of each type have failed. Furthermore, in the HARDWARE section, you can see which specific nodes failed: the status of the failed nodes changes from green to red.

For example, the following screenshot shows one failed worker node:

Manually Repair or Delete Failed Nodes

In case of node failures, you can manually repair or delete the failed node(s) as follows:

To repair failed node(s):

  1. Click on the repair icon.
  2. You will be asked: “Are you sure you want to repair the failed [worker/compute] node(s)?”.
  3. Select Proceed.
  4. The recovery flow is triggered. The cluster status changes to 'REPAIR'.
  5. Failed instances are removed from the cluster and new instances are added in their place.
  6. The recovery flow is completed. The cluster status changes to 'RUNNING'.

During the manual repair process, these steps are also written as messages to the EVENT HISTORY log (with the earliest message at the bottom).

To delete failed node(s):

  1. Delete the node(s) by clicking on the delete icon.
  2. You will be asked: “Are you sure you want to remove the failed [worker/compute] node(s)?”.
  3. Select Proceed.
  4. The recovery flow is triggered. The cluster status changes to 'REPAIR'.
  5. Failed instances are removed from the cluster.
  6. The recovery flow is completed. The cluster status changes to 'RUNNING'.

During the node deletion process, these steps are also written as messages to the EVENT HISTORY log (with the earliest message at the bottom).

Remember that you can add additional nodes later by resizing the cluster.

Node Auto Repair via CLI

Enabling Node Auto Repair

When creating a cluster via CLI, you can enable node auto repair for worker or compute nodes by setting the RecoveryMode option to AUTO (which is the default setting).

To disable node auto repair, set RecoveryMode to MANUAL. This will require you to manually check for and repair failures.
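
For illustration, the sketch below shows the relevant fragment of a create-cluster JSON input (as comments) together with the command that would submit it. The field names surrounding RecoveryMode, the instance values, and the --cli-input-json option are assumptions for this example; generate the actual input skeleton with your version of the hdc CLI and change only the RecoveryMode values.

  # cluster-template.json -- hypothetical excerpt; only the RecoveryMode values
  # matter here, the other fields come from your generated skeleton:
  #
  #   "Worker":  { "InstanceType": "m4.xlarge", "InstanceCount": 3, "RecoveryMode": "AUTO" },
  #   "Compute": { "InstanceType": "m4.xlarge", "InstanceCount": 2, "RecoveryMode": "MANUAL" }

  # Create the cluster from the JSON input (the --cli-input-json option is an
  # assumption; check the CLI help for the exact flag).
  hdc create-cluster --cli-input-json cluster-template.json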

Manually Checking and Repairing Failures

To check for failures, use the hdc list-clusters and hdc describe-cluster commands.
To repair your cluster in case of a node failure, use the hdc repair-cluster command.
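
For example, a manual check-and-repair session from the CLI might look like the following sketch. The --cluster-name option is shown for illustration; the exact flags accepted by describe-cluster and repair-cluster depend on your CLI version, so check the built-in help before running these commands.

  # List all clusters and look for the one reporting unhealthy nodes.
  hdc list-clusters

  # Inspect the affected cluster and the status of its nodes in detail.
  hdc describe-cluster --cluster-name my-cluster

  # Trigger a repair of the failed nodes (the flag is illustrative; see the CLI
  # help for the repair options your version supports).
  hdc repair-cluster --cluster-name my-cluster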

For more information, refer to Monitoring and Repairing Your Cluster in the CLI documentation.