---
title: Node Provider Maintenance Guide
slug: node-provider-maintenance
description: Day-to-day responsibilities for keeping node machines healthy on the Internet Computer — monitoring, common maintenance tasks, scheduled outages, and peer support.
tags:
  - node-provider
  - maintenance
  - operations
  - runbook
date: 2026-05-04
related:
  - node-provider-documentation
  - node-provider-troubleshooting
  - node-alerting-options
---

This guide is the operational handbook for an active node provider. It
covers the recurring work — monitoring, common machine-level tasks,
coordinating around data-center outages, and where to go for peer
support — that keeps a fleet healthy between major lifecycle events.
For the role as a whole, see [Node Provider Documentation](/wiki/node-provider-documentation/).

## Troubleshooting

When something is wrong with a node, start with the
[Node Provider Troubleshooting](/wiki/node-provider-troubleshooting/)
guide. It indexes the deployment-error, unhealthy-node, networking, and
NNS-proposal subguides.

## Submitting NNS proposals

Many maintenance actions — onboarding a new machine, updating a node's
IPv4 address, retiring a node, changing principals — are carried out
through proposals to the Network Nervous System (NNS). When a proposal
affects rewards, check the next minting date so you understand which
period the change will land in.

If a proposal you submitted does not adopt cleanly, follow
[Troubleshooting Failed NNS Proposals](/wiki/troubleshooting-failed-nns-proposals/).

## Monitoring

You are expected to monitor your nodes continuously. Public dashboards
(such as the [IC Dashboard](https://dashboard.internetcomputer.org/))
expose per-node health, and the IC observability stack provides
machine-readable metrics.

Several community-built tools make this easier:

- **Aviate Labs Node Monitor** — turnkey email alerts for unhealthy
  nodes.
- **DIY Node Monitoring** — a community-shared GitHub repository with
  scripts you can adapt.
- **Prometheus exporter for node status** — exports node health in a
  Prometheus-compatible format so it can plug into your existing
  observability stack.

For more detail and links, see [Node Provider Alerting Options](/wiki/node-alerting-options/).
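
If you feed node health into your own observability stack, a small filter
over the Prometheus text exposition format can be enough for a first
alert. The sketch below assumes a gauge named `ic_node_up` carrying a
`node_id` label. Both names are hypothetical and not taken from any
specific exporter, so substitute whatever yours emits.

```bash
# List node IDs whose health gauge is 0 in Prometheus text exposition format.
# Reads metrics on stdin. The metric name (default "ic_node_up") and the
# "node_id" label are hypothetical placeholders.
# Assumes samples carry no trailing timestamp, so the value is the last field.
down_nodes() {
  local metric="${1:-ic_node_up}"
  awk -v m="$metric" '
    index($0, m "{") == 1 && $NF == 0 {
      # Pull the value out of node_id="..." (the prefix is 9 characters long).
      if (match($0, /node_id="[^"]*"/))
        print substr($0, RSTART + 9, RLENGTH - 10)
    }
  '
}
```

Typical use would pipe an exporter's metrics endpoint into the function,
for example `curl -s "$EXPORTER_URL" | down_nodes`, where `EXPORTER_URL`
is a placeholder for your own endpoint.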

## Common maintenance tasks

The following tasks recur across the lifetime of a node. Each has
its own detailed procedure:

- [Removing a node from the registry](/wiki/removing-node-from-registry/)
- [Adding additional node machines to an existing node allowance](/wiki/adding-additional-nodes/)
- [Updating a node's IPv4 address and domain name](/wiki/updating-node-ipv4-and-domain/)
- [Changing the IPv6 addresses of nodes](/wiki/changing-node-ipv6/)
- [Moving a node between data centers](/wiki/moving-node-between-data-centers/)
- [iDRAC access and TSR logs](/wiki/idrac-access-and-tsr-logs/)
- [Checking node CPU and memory speed](/wiki/checking-node-cpu-and-memory/)
- [Updating firmware](/wiki/updating-node-firmware/)
- [Changing the node provider or data-center principal](/wiki/changing-node-provider-or-dc-principal/)

## Limited node provider console

Technicians interact with a running node through a deliberately
restricted console. The console exposes only the operations a node
provider needs — diagnostics, recovery initiation, registry-related
queries — and nothing else.

## Permitted tools

> [!WARNING]
> For security and confidentiality reasons, no other software is
> permitted to run on a node alongside the replica. Do not install
> diagnostic tools, monitoring agents, or shells onto the node itself.

Run troubleshooting tooling from a USB-booted Linux distribution or
from a separate auxiliary machine where you have full administrative
control. See *Setting up an auxiliary machine for network diagnostics*
below.

## Scheduled data-center outages

When a data center announces planned downtime that will affect your
nodes:

1. Notify DFINITY in the Node Provider Matrix channel ahead of the
   outage.
2. After the outage, verify that every affected node returns to a
   healthy state. The two acceptable states are:
   - **Active in subnet** — the node is healthy and serving traffic.
   - **Awaiting subnet** — the node is operational and ready to be
     assigned.
3. If a node remains degraded, give it time to catch up — but confirm
   it has reached one of the two healthy states before considering the
   outage closed.
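
The post-outage check in step 2 can be sketched as a small filter. It
assumes you keep a tab-separated list of `<node-id>` and `<status>` for
the affected nodes. That input format is hypothetical, so adapt the
parsing to however you record what the dashboard shows.

```bash
# Print the IDs of nodes that are NOT in one of the two acceptable states.
# Input (stdin): one node per line, "<node-id><TAB><status>".
unhealthy_nodes() {
  awk -F'\t' '
    NF == 0 { next }   # skip blank lines
    $2 != "Active in subnet" && $2 != "Awaiting subnet" { print $1 }
  '
}

# Return 0 only when every node read from stdin has recovered.
outage_closed() {
  local pending
  pending="$(unhealthy_nodes)"
  if [ -n "$pending" ]; then
    printf 'Still degraded:\n%s\n' "$pending" >&2
    return 1
  fi
  return 0
}
```

Running `outage_closed < affected-nodes.tsv` (a hypothetical file name)
after the maintenance window gives a single pass/fail answer and names
the nodes that still need attention.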

## Node rewards based on useful work

> [!NOTE]
> The Internet Computer protocol can tolerate up to one third of nodes
> misbehaving, but providers are expected to keep their machines
> online and healthy regardless.

Automatic reward mechanisms tied to *useful work* are under
development. Until they ship, the operational expectation is unchanged:
keep nodes online and healthy. Once the new mechanisms are in effect,
unhealthy nodes will be penalised automatically.

## Subnet recovery

Occasionally a subnet will need recovery, and the recovery procedure
will require participation from the providers running its nodes. When
that happens, instructions are issued in the Node Provider Matrix
channel.

Follow the dedicated [Manual Node Recovery Guide](/wiki/manual-node-recovery/)
when instructed. Enable notifications on the Matrix channel so you do
not miss direct mentions during an active recovery.

## General best practice

1. Keep a separate diagnostic machine in the same rack as your nodes,
   so you can investigate problems without depending on the node
   itself or on remote access.
2. Engage with peers in the Node Provider Matrix channel — most
   problems have been seen before.

## Setting up an auxiliary machine for network diagnostics

An auxiliary machine sits next to your nodes in the rack and gives you
full-control Linux tooling without touching the replica. Provision it
as follows.

### Hardware

- Use any appropriately resourced server. There is no requirement for
  Gen-1 or Gen-2 hardware — this machine is not part of the network.
- Apply physical security controls equivalent to those you apply to
  the nodes themselves.

### Operating system and software

1. Install a minimal, hardened Linux distribution. Ubuntu 22.04 LTS is
   a good default.
2. Apply the latest security patches and firmware updates before
   placing the machine in service.

### Network configuration

- Assign an IPv6 address from the same range as your IC nodes so the
  diagnostic machine can talk to them on the data plane.
- Apply a restrictive firewall — allow only the traffic you actively
  need.
- Consider running the auxiliary machine behind a VPN except during
  active troubleshooting.
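
As one possible shape for a restrictive firewall, the sketch below
prints (without running) `ufw` commands for a default-deny inbound
policy that admits SSH from a single management prefix. The choice of
`ufw`, the SSH-only rule, and the prefix are all assumptions made for
illustration. The example prefix comes from the IPv6 documentation range
`2001:db8::/32` and must be replaced with your own.

```bash
# Print a reviewable list of ufw commands; nothing is applied here.
# $1: management prefix in CIDR form (placeholder, supply your own).
firewall_plan() {
  local mgmt_prefix="$1"
  cat <<EOF
ufw default deny incoming
ufw default allow outgoing
ufw allow from ${mgmt_prefix} to any port 22 proto tcp
ufw --force enable
EOF
}
```

Review the output of `firewall_plan 2001:db8:1::/64` first. Once it
matches the traffic you actually need, it could be applied with
`firewall_plan 2001:db8:1::/64 | sudo sh`.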

### Diagnostic tools

Install at minimum:

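```bash
# Debian/Ubuntu package names; adjust for other distributions.
sudo apt-get update
sudo apt-get install -y iputils-ping traceroute nmap tcpdump iperf
```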

If you run monitoring agents that simulate node-side traffic, configure
them now so you have a baseline of the network under normal conditions
to compare against during an incident.
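
One lightweight way to record a baseline is to log the average
round-trip time to each node at regular intervals. The sketch below
assumes the summary line printed by the Linux `iputils` version of
`ping`. The log file name and its column layout are hypothetical.

```bash
# Extract the average round-trip time (ms) from ping's summary line, e.g.
#   rtt min/avg/max/mdev = 0.045/0.052/0.061/0.006 ms
avg_rtt_ms() {
  awk -F'/' '/^rtt|^round-trip/ { print $5 }'
}

# Append "<UTC timestamp> <node> <avg rtt>" to a tab-separated log.
# $1: node IPv6 address; $2: log file (default baseline.tsv, a placeholder).
baseline_node() {
  local node="$1" log="${2:-baseline.tsv}" rtt
  rtt="$(ping -6 -c 5 -q "$node" | avg_rtt_ms)"
  printf '%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$node" "${rtt:-unreachable}" >> "$log"
}
```

Collected while the network is healthy, these samples give you a
reference to compare against when a node later looks slow or
unreachable.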

### Access control

- Use strong, unique passwords; prefer SSH key-based authentication.
- Disable root SSH login.
- Review authentication and command logs regularly for anomalies.
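
The SSH items above can be spot-checked with a short audit function.
This is a sketch under stated assumptions: it inspects a single
`sshd_config`-style file, ignores `Include` and `Match` blocks, and
flags directives that are not set explicitly. Treating
`PasswordAuthentication no` as the target goes one step beyond the list
above and is an assumption you may not share.

```bash
# Report directives that differ from the expected hardening values.
# $1: path to an sshd_config-style file (default /etc/ssh/sshd_config).
# Returns 0 when all three directives match, 1 otherwise.
audit_sshd_config() {
  local file="${1:-/etc/ssh/sshd_config}" status=0 key want got
  while read -r key want; do
    # The first uncommented occurrence wins, as it does in sshd itself.
    got="$(awk -v k="$key" \
      'tolower($1) == tolower(k) { print tolower($2); exit }' "$file")"
    if [ "$got" != "$want" ]; then
      printf '%s: expected %s, found %s\n' "$key" "$want" "${got:-<unset>}"
      status=1
    fi
  done <<'EOF'
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
EOF
  return "$status"
}
```

For the effective configuration, including defaults and included files,
`sudo sshd -T` prints what the daemon will actually use.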

### Maintenance

- Update the operating system and tooling on a regular cadence.
- Periodically re-run the diagnostic tools against a known-good node
  to confirm the auxiliary machine itself is still working.
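
A quick self-test before each maintenance pass is to confirm the
diagnostic tools are still installed and on the `PATH`. The helper name
below is illustrative.

```bash
# Print the name of every tool in the argument list that is not on PATH.
# Prints nothing when all tools are available.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Calling `missing_tools ping traceroute nmap tcpdump iperf` should print
nothing on a correctly provisioned auxiliary machine.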

## Peer support: Node Provider Matrix channel

The Node Provider Matrix channel is the primary venue for
maintenance-related questions, peer assistance, and incident reporting.
Search the channel before posting — many issues have been resolved
there before.

When using the channel:

- Enable notifications so you receive direct mentions promptly,
  especially during incidents.
- Include your node-provider name in your Matrix alias so other
  providers and DFINITY engineers can identify you quickly.

Always consult [Node Provider Troubleshooting](/wiki/node-provider-troubleshooting/)
before asking for help.

## Related

- [Node Provider Documentation](/wiki/node-provider-documentation/) — the role overview.
- [Node Provider Troubleshooting](/wiki/node-provider-troubleshooting/) — the troubleshooting index.
- [Node Provider Alerting Options](/wiki/node-alerting-options/) — monitoring tools in detail.
- [Manual Node Recovery Guide](/wiki/manual-node-recovery/) — what to do when a subnet recovery is called.