Your node agent cannot tell you it crashed. That sentence sounds obvious until you watch it happen.
A platform team at a logistics startup deployed a "smart node" orchestration framework across forty edge devices sitting in distribution centers. The rollout looked clean. Dashboards green, agents reporting in, everyone moved on to the next sprint. Two weeks later half the fleet went silent. The on-call engineer SSH'd into one box and found the agent process still running, still listening on port 9100, still returning {"status":"ok"}. The catch: that JSON came from a file written at boot. The worker thread had died days earlier. The HTTP server kept serving a cached lie, and the dashboard kept showing a green light for a node that had not done real work in over a week.
Here is what most teams get wrong with self-managed node fleets. They let the agent grade its own homework. If the agent is the only thing reporting "I am healthy," then the one failure you care about most, the agent itself breaking, is the exact failure that cannot raise its hand. You have built a smoke detector wired to the same fuse as the stove.
1. Self-reported health is a lie waiting to happen
The fix is an external prober. Something that lives off the node, calls the node on a fixed interval, and checks not just that you got a 200 but that the answer is fresh.
    curl -fsS http://node-17:9100/healthz | jq -e '.ts > (now - 30)'
If that timestamp is older than your probe interval, the node is dead to you. No appeals. A 200 with a stale body is worse than a connection refused, because connection refused at least tells the truth.
The reason the logistics fleet stayed green so long is that nobody ever asked "when was this answer computed." They asked "did it return ok," and a file on disk will happily return ok forever.
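The same check can live in a small prober script instead of a shell one-liner. What follows is a minimal Python sketch, not a finished tool: the function names, the 30-second default, and the 5-second timeout are assumptions, and the timestamp field name is passed in because it depends on what your handler emits.

```python
import json
import time
import urllib.request


def body_is_fresh(body: bytes, field: str, max_age: float, now: float) -> bool:
    """Return True only if the body is JSON and carries a recent timestamp."""
    try:
        stamp = json.loads(body)[field]
    except (ValueError, KeyError, TypeError):
        return False  # unparseable body or missing field counts as dead
    if isinstance(stamp, bool) or not isinstance(stamp, (int, float)):
        return False  # a non-numeric timestamp cannot prove freshness
    return now - stamp <= max_age


def probe(url: str, field: str, max_age: float = 30.0, timeout: float = 5.0) -> bool:
    """Call the health endpoint from off the node and judge the answer's age."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and body_is_fresh(
                resp.read(), field, max_age, time.time()
            )
    except OSError:
        return False  # refused, timed out, DNS failure, HTTP error status
```

Splitting the freshness rule from the network call keeps the rule testable without a live node. The cached {"status":"ok"} from the logistics fleet fails this check on the first probe, because it has no timestamp at all.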
2. Cache the config, never cache the heartbeat
Caching is fine for things that change rarely and are not desired state: TLS certs, peer lists, feature flags. It is poison for liveness. The bug above happened because someone wrote boot-time state to /var/lib/agent/state.json and the health handler read from that file instead of recomputing anything.
The rule that holds up in production: anything answering "am I alive right now" must execute fresh work inside the handler. Hit the database. Read the queue depth off the broker. Touch a file and read it back. Compute a checksum. If the handler can return success without doing a single piece of real work, it will eventually return success when nothing is working.
    import time

    def healthz():
        # bad: returns the same answer whether or not the worker is alive
        return read_json("/var/lib/agent/state.json")

    def healthz():
        # good: proves the pipeline is live this very second
        depth = broker.queue_depth("ingest")  # real network call
        last = worker.last_processed_unix()   # set by the worker loop
        now = time.time()                     # one clock read for check and reply
        fresh = (now - last) < 30
        return {"ok": fresh, "queue": depth, "ts": now}
The second handler cannot pass while the worker is dead, because last_processed_unix stops moving the moment the worker stops.
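The worker side of that contract is not shown above, so here is one way it might look. This is a sketch under assumptions: the Worker class, its process method, and the injectable now argument are hypothetical, and the broker call is left out to keep the example self-contained.

```python
import threading
import time


class Worker:
    """Hypothetical worker that stamps the clock after each processed item."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = 0.0  # never processed anything yet

    def process(self, item, now=None):
        # ... real work on `item` would happen here ...
        with self._lock:
            self._last = time.time() if now is None else now

    def last_processed_unix(self):
        with self._lock:
            return self._last


def healthz(worker, now=None, max_age=30):
    """Report ok only if the worker stamped the clock within max_age seconds."""
    now = time.time() if now is None else now
    fresh = (now - worker.last_processed_unix()) < max_age
    return {"ok": fresh, "ts": now}
```

One caveat with this sketch: a worker sitting on an empty queue also reads as stale. In practice the loop would probably stamp on every iteration, not only when an item arrives.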
3. Heartbeats need a deadman, not a counter
A counter of "heartbeats received" climbs forever and looks healthy in Grafana even while half the fleet is cold. Counters measure presence. You need to alert on absence.
Run a deadman timer per node. The signal is silence.
    - alert: NodeSilent
      expr: time() - node_last_heartbeat_timestamp_seconds > 90
      for: 2m
      labels:
        severity: page
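Now a node that stops emitting pages you within about three and a half minutes: 90 seconds of silence to cross the threshold, then the 2-minute for window. This assumes the timestamp series is recorded somewhere that outlives the node, such as the prober or a push gateway. A series scraped straight from the dead node goes stale, and the expression then returns nothing at all. You are no longer waiting for a node to report that it is broken. You are watching for the moment it stops reporting at all, which is the only thing a truly dead node can "tell" you.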
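If you are not running an alerting system that can express this rule, the same deadman logic fits in a few lines. A minimal Python sketch, assuming heartbeats arrive at some off-node collector; the Deadman class and its method names are hypothetical, and the 90-second timeout mirrors the rule above.

```python
import time


class Deadman:
    """Tracks the last heartbeat per node and reports the ones gone quiet."""

    def __init__(self, timeout=90.0):
        self.timeout = timeout
        self.last_seen = {}

    def expect(self, node_id, now=None):
        # Register a node up front so it is flagged even if it never reports.
        self.last_seen.setdefault(node_id, time.time() if now is None else now)

    def beat(self, node_id, now=None):
        # Record a heartbeat; this resets the node's timer.
        self.last_seen[node_id] = time.time() if now is None else now

    def silent(self, now=None):
        # Nodes whose last heartbeat is older than the timeout, sorted by name.
        now = time.time() if now is None else now
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)
```

The expect call matters more than it looks. A tracker that only learns about nodes from their heartbeats can never flag a node that died before its first one.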
4. Pull from source of truth every cycle
The few milliseconds you save by caching desired state inside the agent are exactly what burn you when state diverges. If the orchestrator keeps desired state in etcd or behind a control plane API, the agent should reconcile against that store on every loop iteration, not against a copy it grabbed at startup.
Yes, this means more network calls. That is the point. Cheap, frequent, idempotent reconciliation beats clever caching every time, because drift cannot accumulate if you re-read the truth every thirty seconds.
    for range time.Tick(30 * time.Second) {
        desired, err := controlPlane.GetDesired(nodeID) // every loop, no cache
        if err != nil {
            metrics.ReconcileErrors.Inc()
            continue
        }
        reconcile(currentState(), desired)
        heartbeat.Touch() // updates node_last_heartbeat_timestamp_seconds
    }
Run that loop fast enough that drift stays small. Thirty seconds is usually fine. Five minutes is usually not, because five minutes of divergence on forty nodes is a long time for a wrong config to do damage.
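The loop above calls reconcile without showing it. Here is a rough Python sketch of what an idempotent reconcile step might compute, assuming node state can be flattened into a dictionary of settings. The names plan and apply_plan are hypothetical, and a real agent would execute the actions against the machine instead of a dict.

```python
def plan(current: dict, desired: dict) -> list:
    """Return the changes needed to move current to desired, in a stable order."""
    actions = []
    for key in sorted(desired):
        if key not in current or current[key] != desired[key]:
            actions.append(("set", key, desired[key]))
    for key in sorted(set(current) - set(desired)):
        actions.append(("remove", key))  # present locally, absent from the truth
    return actions


def apply_plan(current: dict, actions: list) -> dict:
    """Apply the actions to a copy of current and return the new state."""
    result = dict(current)
    for action in actions:
        if action[0] == "set":
            result[action[1]] = action[2]
        else:
            result.pop(action[1], None)
    return result
```

Idempotence is what makes the thirty-second cadence cheap. Once the node matches desired state, plan returns an empty list and the loop does nothing until the truth changes.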
5. Test your prober by killing things
An alerting stack that has never fired is not proven, it is just untested. Once a quarter, pick three random nodes and kill -9 the agent. Your NodeSilent alert should fire inside the deadman window. If it does not, your prober has a gap and you only found out because you went looking, instead of finding out during a real outage at 3am.
This is the one honest test of an observability stack: break something on purpose and confirm the page lands. Everything else is hope.
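The bookkeeping for that drill is easy to script, even if the kill itself stays manual. A small Python sketch with assumed names: pick_victims chooses the nodes, and late_or_missing compares kill times against page times. The 210-second default window is an assumption derived from the NodeSilent rule (90 seconds plus the 2-minute for clause); add slack for your own evaluation interval.

```python
import random


def pick_victims(nodes, k=3, rng=None):
    """Choose up to k distinct nodes at random for the drill."""
    rng = rng or random.Random()
    pool = list(nodes)
    return rng.sample(pool, min(k, len(pool)))


def late_or_missing(killed_at: dict, paged_at: dict, window: float = 210.0) -> list:
    """Return nodes whose page never arrived or arrived after the window.

    killed_at maps node -> unix time of the kill -9.
    paged_at maps node -> unix time the page landed; absent means no page.
    """
    failures = []
    for node, t_kill in sorted(killed_at.items()):
        t_page = paged_at.get(node)
        if t_page is None or t_page - t_kill > window:
            failures.append(node)
    return failures
```

An empty list from late_or_missing means the drill passed. Anything else names the nodes where the prober has a gap.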
Why this matters in production
A dead node that reads green does not cause a clean outage. It causes a slow one. Orders pile up in a queue nobody is draining. A scanner in a distribution center quietly stops syncing while the dashboard insists everything is fine. By the time a human notices, you are debugging a week of backlog instead of a single failed process. The cost is not the crash. The cost is the hours between the crash and the moment anyone believed it.
Done right
Five pieces, none of them clever, all of them load-bearing: an external prober that checks freshness, a health handler that does real work every call, deadman alerts that page on silence, a reconcile loop that pulls desired state every cycle, and a quarterly chaos drill that proves the alerts still fire. Trust the absence of data, never the presence of a cached "ok."

