Planned Maintenance Procedures for Live Networks

Taking a backbone node offline for maintenance, whether for a firmware update, hardware replacement, or antenna adjustment, interrupts every user whose traffic routes through it. With proper planning, that impact can be limited to a brief interruption rather than a prolonged outage. This page describes a repeatable maintenance procedure for Meshtastic backbone nodes in active community networks.

Pre-Maintenance Checklist

Complete these steps before any planned maintenance window:

  • Notify the community: Post advance notice (24-48 hours) in your community communication channels - Discord, Signal group, or whatever your community uses. Include the node name, scheduled time, and estimated duration. Users who depend on that backbone link can plan accordingly.
  • Verify backup paths: Run traceroutes from nodes on either side of the target to confirm that alternate routing exists. If no backup path exists, do NOT take the node offline for non-emergency maintenance: either deploy and verify a temporary relay first, or defer the work. Maintenance on a sole-path backbone node without a backup path is itself a planned outage. Schedule it only when the network is not needed, and notify all users that the network will be DOWN, not merely degraded. Always have a rollback plan: a failed firmware flash can leave a node crash-looping, so never update or remove the only repeater on a path without a verified way to restore it.
  • Schedule during low-traffic hours: 2-5 AM local time is typically the quietest window for community mesh networks. Emergency networks may have different quiet windows - check your message logs to identify the lowest-traffic period.
  • Document current configuration: If replacing hardware, record all node settings (node name, channel configuration, role, hop limit) so the replacement can be configured identically before going live.
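
The backup-path check above can also be rehearsed on paper before the window. The following is a minimal sketch, assuming you keep a hand-maintained list of known links between nodes; the node names, the link list, and the has_backup_path helper are all hypothetical and are not part of Meshtastic. It removes the node under maintenance from the link map and checks whether two endpoints can still reach each other.

```python
from collections import deque


def has_backup_path(links, source, target, node_under_maintenance):
    """Return True if source can still reach target with one node removed."""
    # Build an undirected neighbor map, skipping every link that
    # touches the node being taken offline.
    neighbors = {}
    for a, b in links:
        if node_under_maintenance in (a, b):
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Breadth-first search from source over the remaining links.
    seen = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        for nxt in neighbors.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical link list for illustration only.
links = [
    ("valley", "ridge"),
    ("ridge", "harbor"),
    ("valley", "watertower"),
    ("watertower", "harbor"),
    ("harbor", "island"),
]

print(has_backup_path(links, "valley", "harbor", "ridge"))   # True: watertower covers it
print(has_backup_path(links, "valley", "island", "harbor"))  # False: harbor is the sole path
```

This sketch ignores hop limits and link quality, so a True result only suggests that a backup path may exist. Traceroutes on the live network remain the actual check.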

During Maintenance

Where possible, power down gracefully rather than performing a hard power-off. A graceful shutdown lets the node stop transmitting cleanly and avoids cutting off a packet mid-transmission or corrupting the node's stored configuration during a flash or write. For solar-powered nodes, the practical approach is to disconnect the load output of the charge controller rather than physically unplugging the node in the dark.

Perform the maintenance task - firmware flash, hardware swap, antenna replacement - as quickly as practical. Every additional minute of downtime increases the chance of a user encountering a failed message delivery.

Post-Maintenance Verification

Before declaring the node returned to service, verify it from multiple directions:

  1. Send a test message from a node that previously routed through the maintained node and confirm delivery.
  2. Run traceroutes from multiple directions to confirm the node is routing normally.
  3. If you run a monitoring system (for example MQTT/Grafana or a third-party map such as meshmap.net), confirm that it shows the node online AND recently active. Seeing a node online is necessary but NOT sufficient: a node can appear up while failing to relay (for example, because of a wrong role or rebroadcast mode). Always complete step 1 (a real test message routed through the node) as well before declaring it back in service.
  4. Update your network maintenance log with the date, work performed, and any configuration changes.
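
The verification steps above can be expressed as a simple return-to-service gate. This is a minimal sketch, assuming the results are recorded by hand; the VerificationResult fields and the ready_for_service function are hypothetical names chosen for illustration. It treats a delivered test message as mandatory, so a node that merely appears online never passes on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerificationResult:
    test_message_delivered: bool          # step 1
    traceroutes_ok: int                   # step 2: directions that succeeded
    traceroutes_run: int                  # step 2: directions attempted
    monitoring_shows_active: Optional[bool] = None  # step 3: None if no monitoring


def ready_for_service(result):
    """Return (ok, problems) for a set of post-maintenance checks."""
    problems = []
    if not result.test_message_delivered:
        problems.append("test message was not delivered through the node")
    if result.traceroutes_run < 2:
        problems.append("traceroutes were run from fewer than two directions")
    elif result.traceroutes_ok < result.traceroutes_run:
        problems.append("at least one traceroute failed")
    # Monitoring is optional, but an explicit False counts against the node.
    if result.monitoring_shows_active is False:
        problems.append("monitoring does not show the node online and recently active")
    return (not problems, problems)


# The node looks healthy on the map, but the real test message failed.
ok, problems = ready_for_service(
    VerificationResult(test_message_delivered=False,
                       traceroutes_ok=2, traceroutes_run=2,
                       monitoring_shows_active=True)
)
print(ok)        # False
print(problems)  # ['test message was not delivered through the node']
```

The list of problems can be copied into the maintenance log entry alongside the date and the work performed.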

Emergency Rollback Procedure

If a replacement node does not work as expected - wrong firmware, hardware fault, or configuration error - act quickly. Restore the original node if it is still functional, or connect a known-good spare configured to match the original settings. If neither is possible, notify the community immediately that the outage is extended and provide an estimated restoration time. For any network used in emergencies, keep at least one spare node per critical site, PRE-CONFIGURED to match that site (role, channel/PSK, hop limit, fixed position). An unconfigured spare is not a substitute - configuring on-site under incident pressure wastes the very time the spare was meant to save.
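
One way to confirm that a spare really is pre-configured is to compare its recorded settings against the site record before it goes on the shelf. The sketch below assumes both are kept as plain dictionaries; the field names, example values, and the spare_mismatches helper are hypothetical and do not correspond to actual Meshtastic configuration keys.

```python
# Settings the spare must share with the site it backs up (hypothetical names).
REQUIRED_FIELDS = ("role", "channel_name", "psk", "hop_limit", "fixed_position")


def spare_mismatches(site_config, spare_config, fields=REQUIRED_FIELDS):
    """Return {field: (expected, actual)} for every field where the spare differs."""
    mismatches = {}
    for field in fields:
        expected = site_config.get(field)
        actual = spare_config.get(field)
        if expected != actual:
            mismatches[field] = (expected, actual)
    return mismatches


# Illustrative site record; the PSK is a placeholder, not a real key.
site = {
    "role": "ROUTER",
    "channel_name": "CommunityMesh",
    "psk": "placeholder-psk",
    "hop_limit": 3,
    "fixed_position": (47.6, -122.3),
}

# A spare that was left on different role and hop-limit settings.
spare = dict(site, role="CLIENT", hop_limit=7)

print(spare_mismatches(site, spare))
# {'role': ('ROUTER', 'CLIENT'), 'hop_limit': (3, 7)}
```

An empty result means the spare matches the site on every listed field. Any mismatch is cheaper to fix on the bench than on-site during an incident.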

After any unplanned extended outage, conduct a brief post-incident review: what failed, why, and what process change would prevent recurrence. Even a one-paragraph note in the maintenance log is valuable for future operators.