tests/nodeagent: e2e coverage for reboot paths, restart counter, and disk-space maintenance by eriknordmark · Pull Request #1162 · lf-edge/eden

eriknordmark · 2026-05-07T23:23:42Z

Summary

Adds a new tests/nodeagent/ suite with six e2e tests that exercise nodeagent's safety-net reboot paths, restart-counter persistence, and the maintenance-mode-on-no-disk-space chain.

Cloud-disconnect reboot paths

Each of the two reboot paths is tested in two distinct network-failure modes:

| Test | Path tested | Failure mode |
| --- | --- | --- |
| `reset_on_disconnect_link_down` | `BootReasonDisconnect` | eth0 link-down (`eden eve link down` / QEMU `set_link off`) |
| `reset_on_disconnect_blackhole` | `BootReasonDisconnect` | controller IP silently dropped via in-EVE `iptables -A OUTPUT -d <adam-ip> -j DROP`; link, IP and lease all stay valid |
| `baseos_fallback_link_down` | `BootReasonFallback` | eth0 link-down during the post-update test window |
| `baseos_fallback_blackhole` | `BootReasonFallback` | controller IP black-holed during the test window |

The two failure modes exercise different code paths: link-down trips NIM's DPC verification first; the black-hole keeps the interface healthy so nodeagent's reset / fallback timer is the primary defence. The black-hole case is closer to common real-world failures (ISP routing change, controller IP migration, upstream firewall) than yanking the cable.

Restart counter + disk-space maintenance

| Test | Path tested |
| --- | --- |
| `restart_counter_monotonic` | `RestartCounter` increments by exactly +1 across an actual EVE reboot, end-to-end through `/persist/status/restartcounter` → `NodeAgentStatus` → zedagent info → controller |
| `maintenance_no_disk_space` | `RemainingSpace=0` (driven by a blank volume + filler outside AppPersistPaths) propagates through `volumemgr → nodeagent → zedagent → controller` as `MaintenanceMode:true` + `MaintenanceModeReasons:[LOW_DISK_SPACE]`, and clears cleanly when pressure is removed |
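
The "exactly +1" check for `restart_counter_monotonic` can be sketched as a small shell helper. This is an illustration only: it assumes the two `RestartCounter` values have already been extracted from the info stream (that step is not shown), and the names `is_uint` and `assert_counter_incremented` are hypothetical, not taken from the test script.

```shell
# Hypothetical helper: true if the argument is a non-negative decimal integer.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical helper: succeed only if the counter grew by exactly one.
# $1 = RestartCounter captured before the reboot, $2 = value captured after.
assert_counter_incremented() {
    local before="$1" after="$2"
    if ! is_uint "$before" || ! is_uint "$after"; then
        echo "non-numeric counter: '$before' -> '$after'" >&2
        return 2
    fi
    local expected=$((before + 1))
    if [ "$after" -ne "$expected" ]; then
        echo "RestartCounter went $before -> $after, expected $expected" >&2
        return 1
    fi
    echo "RestartCounter $before -> $after: ok"
}
```

A strict equality check, rather than "greater than", is what catches both a counter that failed to persist and one that was bumped twice across a single reboot.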

Test plan

All six tests pass locally against a coverage-instrumented EVE under QEMU (ZedVirtual-4G).
Each black-hole rule is wiped naturally on the post-disconnect reboot — no cleanup needed.
Disk-space maintenance test cleans up filler + volume; device returns to ZDEVICE_STATE_ONLINE.
CI run on lf-edge/eden's harness.

Per-test runtime (laptop, KVM-accelerated): 4–8 min each.

Implementation notes

Freshness guard for reboot tests: each `reset_on_disconnect_*` test begins with `eden controller edge-node reboot` to establish a known `LastBootReason` baseline, so the final `lim.test` match cannot succeed against stale info from a prior run. The `baseos_fallback_*` tests get the same freshness for free, since the upgrade flow always transits `BootReasonUpdate` before reaching `BootReasonFallback`.
Info publish cadence: `timer.deviceinfo.interval=30` makes the rebooted device publish a fresh info message quickly. `lim.test`'s `TestInfo` matches only new info (`einfo.InfoNew`), so without this setting the test can wait up to 10 min for the next periodic push.
Config propagation timing: `timer.config.interval=10` shortens the controller-config polling interval. The fallback tests then run `exec sleep 70`, because the device is still polling at the default 60 s when the new value is pushed; the pause gives the shorter interval time to take effect before any subsequent config push.
Idempotency across runs (fallback tests): `eveimage-remove` plus a brief wait precedes `eveimage-update`, so re-running with the same EVE version still triggers a fresh update. (EVE refuses to retry a previously-failed version unless the retry counter is bumped or the baseos config is removed first.)
Black-hole mechanism: `eden eve ssh -- iptables -A OUTPUT -d <adam-ip> -j DROP`. The `--` terminates eden's cobra flag parsing so the iptables flags pass through cleanly. Empirically verified that the rule lands in pillar/zedbox's network namespace (it shares `net:[4026531840]` with the dom0 shell), so it does silently drop zedagent's controller-bound traffic.
Disk-pressure mechanism: a blank volume at 60 % of `/persist` (sparse: the declared size feeds `reservedAppDiskUsage`, while the on-disk cost is small) plus a 70 % filler in `/persist/log/` to push `dynamicUsedByDom0` past the 2 GiB `Dom0DiskUsageMaxBytes` static cap. The filler MUST NOT go to `/persist/newlog/`: that directory is in volumemgr's `excludeDirs` and so cancels out in `usedByDom0 = device.Used - appUsage`. `timer.metric.diskscan.interval=30` keeps volumemgr's recompute window short enough for a test.
Disk-pressure cleanup: the volume-create / fill / assert sequence runs inside a bash wrapper invoked via `exec -t 30m bash run_test.sh`, with `trap cleanup EXIT` tearing down the multi-GiB filler and the blank volume regardless of which assertion fails. The escript dialect has no defer of its own, so without the wrapper a flaky `MaintenanceMode:true` match (info-publish lag, propagation timeout) would halt the script with the filler still pinned to disk and wedge `/persist` for any subsequent run.
Device-model gating: tests skip with a message line on device models other than ZedVirtual-4G / VBox (no `eden eve link` / `eden eve ssh` support).
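
The `trap cleanup EXIT` pattern from the disk-pressure cleanup note can be sketched in isolation. This is not the PR's `run_test.sh`: a small scratch file stands in for the multi-GiB filler and the blank volume, the eden commands are omitted, and `run_guarded` is a hypothetical name chosen for the illustration.

```shell
# The function body is a subshell, so the EXIT trap fires when that subshell
# ends, whether the guarded step passed or failed.
run_guarded() (
    workdir="$1"; shift
    filler="$workdir/filler.bin"

    cleanup() {
        # Stand-in for removing the filler and deleting the blank volume.
        rm -f "$filler"
    }
    trap cleanup EXIT

    # Stand-in for the fill step: one byte instead of a multi-GiB file.
    printf 'x' > "$filler"

    # Stand-in for the assertion step: run whatever command was passed in.
    # Its exit status becomes the status of run_guarded.
    "$@"
)
```

Calling `run_guarded /tmp/scratch false` returns non-zero and still leaves no `filler.bin` behind, which is the property the wrapper is there to provide.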

Suite structure

tests/nodeagent/
├── Makefile
├── eden-config.yml
├── eden.nodeagent.tests.txt
└── testdata/
    ├── baseos_fallback_blackhole.txt
    ├── baseos_fallback_link_down.txt
    ├── maintenance_no_disk_space.txt
    ├── reset_on_disconnect_blackhole.txt
    ├── reset_on_disconnect_link_down.txt
    └── restart_counter_monotonic.txt

The suite is not wired into tests/workflow/smoke.tests.txt or eve-upgrade.tests.txt in this PR — the six tests are addressable on their own (e.g. eden.escript.test -test.run TestEdenScripts/reset_on_disconnect_link_down -testdata tests/nodeagent/testdata/). Folding them into a broader suite is a follow-up after we see how they behave under CI.
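
For running the six tests one at a time, the invocation form quoted above can be expanded over the file names in the testdata/ listing. The sketch below only prints the command lines; it assumes the `eden.escript.test` binary is reachable as written and that no further flags are needed, neither of which this description confirms. `print_nodeagent_commands` is a hypothetical name.

```shell
# Test names taken from the testdata/ listing above (file names minus .txt).
NODEAGENT_TESTS="baseos_fallback_blackhole baseos_fallback_link_down maintenance_no_disk_space reset_on_disconnect_blackhole reset_on_disconnect_link_down restart_counter_monotonic"

# Print one invocation per test, in the form quoted in the text.
print_nodeagent_commands() {
    local name
    for name in $NODEAGENT_TESTS; do
        printf 'eden.escript.test -test.run TestEdenScripts/%s -testdata tests/nodeagent/testdata/\n' "$name"
    done
}
```

Printing rather than executing keeps the sketch safe to run anywhere; the output can be inspected first and then fed to a shell on a machine where the eden environment is up.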

Why draft

Want a chance for someone familiar with eden's harness to look at the patterns before marking ready — particularly the per-test timer tuning, the idempotency dance for the fallback tests, and the blank-volume-plus-filler trick for the disk-space test.

Adds a new tests/nodeagent/ suite with four scripts that exercise the two safety-net reboot paths in pkg/pillar/cmd/nodeagent, each in two network-failure modes:

- reset_on_disconnect_link_down (BootReasonDisconnect, eth0 down)
- reset_on_disconnect_blackhole (BootReasonDisconnect, controller IP silently dropped via in-EVE iptables; link, IP and lease all stay valid so NIM is happy and nodeagent's timer is the primary defence)
- baseos_fallback_link_down (BootReasonFallback, eth0 down during the post-update test window)
- baseos_fallback_blackhole (BootReasonFallback, controller IP silently dropped during the test window)

The two failure modes exercise different code paths: link-down trips NIM's DPC verification first, while the black-hole keeps the interface healthy so the disconnect / fallback timer in nodeagent fires alone.

Implementation notes:

- Every test prefixes with eden controller edge-node reboot to establish a known LastBootReason baseline, so the final lim.test match cannot succeed against stale info from a prior run.
- timer.deviceinfo.interval is shortened to 30s so the rebooted device publishes a fresh info quickly after the disconnect-driven reboot. lim.test's TestInfo uses einfo.InfoNew (only new messages match); without this the test can wait up to 10min for the next periodic push.
- timer.config.interval is shortened to 10s so subsequent config pushes propagate to the device promptly. Both fallback tests then pause 70s after the initial timer push so the new shorter interval has time to take effect under the default 60s polling.
- The fallback tests issue eveimage-remove + a brief wait before eveimage-update so re-running with the same EVE version still triggers a fresh update; without this, EVE refuses to retry a previously-failed version unless the retry counter is bumped.
- Black-hole tests use eden eve ssh -- iptables -A OUTPUT -d <adam-ip> -j DROP. The "--" terminates eden's flag parsing so the iptables flags pass through cleanly. After the disconnect-driven reboot the rule dies with the previous boot's in-memory iptables, so no cleanup is needed.

All four tests pass locally against a coverage-instrumented EVE under QEMU (ZedVirtual-4G).

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two more nodeagent e2e tests:

- restart_counter_monotonic: capture RestartCounter from Adam's info stream, force a controller-issued reboot, capture again, assert the new value is exactly previous + 1. Closes the e2e gap left by the unit test (which only covers the in-memory increment+write logic against a temp file): the persistence path actually writes to /persist/status/restartcounter, the value survives an EVE reboot, and reaches the controller through the NodeAgentStatus -> zedagent -> info chain.
- maintenance_no_disk_space: drive volumemgr to RemainingSpace=0 by combining a blank volume sized at 60% of /persist (declared size is sparse-allocated so on-disk cost is small but contributes to reservedAppDiskUsage in getRemainingDiskSpace) with a 70% /persist/log filler that pushes dynamicUsedByDom0 past the 2 GiB Dom0DiskUsageMaxBytes static cap. nodeagent's handleVolumeMgrStatusImpl then sets MaintenanceModeReasonNoDiskSpace, which surfaces in the controller's info as MaintenanceMode=true and MaintenanceModeReasons:[MAINTENANCE_MODE_REASON_LOW_DISK_SPACE].

The volume-create / fill / assert sequence runs inside a bash wrapper invoked via 'exec -t 30m bash run_test.sh', so that 'trap cleanup EXIT' tears down the multi-GiB filler and the blank volume regardless of which assertion fails -- the escript dialect has no defer of its own, and a leaked filler would wedge /persist for any subsequent run. Final testscript-level assertion is that the device returns to ZDEVICE_STATE_ONLINE after pressure clears.

Note: filler must go to /persist/log (or any path NOT in AppPersistPaths and NOT NewlogDir). Files in /persist/newlog/ are in volumemgr's excludeDirs and so cancel out in usedByDom0 = device.Used - appUsage; they do not push the dom0 reservation.

Both tests use the timer.config.interval=10 and timer.deviceinfo.interval=30 pattern from the existing tests so config and post-event info propagate quickly. The maintenance test additionally sets timer.metric.diskscan.interval=30 so volumemgr recomputes RemainingSpace within ~30s of disk usage changing.

Verified locally on a coverage-instrumented EVE under QEMU (ZedVirtual-4G). restart_counter_monotonic ~3.6 min, maintenance_no_disk_space ~6 min.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
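
The disk-pressure reasoning in the commit message above can be checked with a toy calculation. This is a deliberately simplified model and not volumemgr's actual `getRemainingDiskSpace` code: it assumes remaining space is the total minus the declared app reservation minus the larger of the 2 GiB static cap and the dom0 usage, floored at zero. The 10 GiB `/persist` size and the name `remaining_space_mib` are hypothetical.

```shell
# Toy model only, NOT volumemgr's actual code. All sizes are in MiB.
# remaining = total - reserved_for_apps - max(static_cap, dom0_used), floored at 0.
remaining_space_mib() {
    local total="$1" reserved_apps="$2" dom0_used="$3"
    local static_cap=2048   # 2 GiB, the Dom0DiskUsageMaxBytes value quoted in the text
    local dom0_reserve="$static_cap"
    if [ "$dom0_used" -gt "$static_cap" ]; then
        dom0_reserve="$dom0_used"
    fi
    local remaining=$((total - reserved_apps - dom0_reserve))
    if [ "$remaining" -lt 0 ]; then
        remaining=0
    fi
    echo "$remaining"
}

# Hypothetical 10 GiB /persist: the blank volume declares 60 %, the filler is 70 %.
remaining_space_mib 10240 6144 0      # blank volume only
remaining_space_mib 10240 6144 7168   # blank volume + filler counted as dom0 usage
```

Under these assumptions the blank volume alone leaves 2048 MiB, while adding the filler drives the result to 0. A filler placed in `/persist/newlog/` would not count toward dom0 usage, so it corresponds to the first call and would not trigger maintenance mode.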

…back tests

baseos_fallback_link_down and baseos_fallback_blackhole push a BaseOsConfig for 12.1.0 to exercise the upgrade-window fallback path, but they only `eden eve reset` (clears device config) at the end, not `eveimage-remove`. The 12.1.0 BaseOsConfig therefore lingers in adam after the test finishes, which leaks state into any subsequent test or suite that touches baseos behaviour.

Add a final eveimage-remove of {{ $short_version }} before the eden eve reset so adam returns to a clean BaseOsConfig list.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two follow-ups for the nodeagent suite:

1. Raise outer -test.timeout values in eden.nodeagent.tests.txt so the test-framework wrapper doesn't kill a healthy run before its inner lim.test waits resolve. On the coverage-instrumented EVE build, post-reboot Info republish takes ~12 min, so the prior 15-minute timeouts on reset_on_disconnect_link_down / _blackhole could not accommodate even one of the three sequential lim.test waits each test uses (the test-script -timewait stays at 30m so a healthy device still resolves quickly; the change only relaxes the outer cap). New values: 45m for the disconnect / fallback tests, 20m for restart_counter_monotonic, 40m for maintenance_no_disk_space.
2. Add a get-config assertion after every eveimage-remove call so the test fails loudly if the controller config still references the removed image. The current eden CLI EdgeNodeEVEImageRemove only removes the legacy baseosconfig list entry and leaves the modern single-block baseos field + contentInfo[] populated; this assertion surfaces that bug end-to-end. (The corresponding eden CLI fix lands in PR lf-edge#1172 alongside the update_eve_image cleanup tests.)

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
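
The timeout change in item 1 comes down to a budget check, sketched here under a rough assumption: each of the three sequential lim.test waits costs about one ~12 min Info republish, and all other steps are ignored. `fits_timeout` is a hypothetical helper, not part of the suite.

```shell
# Succeed if `waits` sequential waits of `per_wait` minutes each fit within
# an outer timeout of `timeout` minutes. Other test steps are ignored.
fits_timeout() {
    local waits="$1" per_wait="$2" timeout="$3"
    [ $((waits * per_wait)) -le "$timeout" ]
}

# Prior 15-minute cap vs. the new 45-minute cap for the disconnect tests.
fits_timeout 3 12 15 || echo "15m cap: too short for 3 x 12 min"
fits_timeout 3 12 45 && echo "45m cap: fits 3 x 12 min"
```

Three waits at about 12 min each come to 36 min, which is why a 15-minute cap cannot hold and 45 minutes leaves some headroom.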

eriknordmark mentioned this pull request May 8, 2026

tests/baseosmgr: e2e coverage for force-fallback and retry-update #1164

Open

2 tasks

eriknordmark changed the title ~~tests/nodeagent: e2e coverage for cloud-disconnect reboots~~ tests/nodeagent: e2e coverage for reboot paths, restart counter, and disk-space maintenance May 8, 2026

eriknordmark force-pushed the nodeagent-tests branch from 13ba5ce to d2f5ad7 Compare May 8, 2026 21:13

eriknordmark marked this pull request as ready for review May 8, 2026 21:23

eriknordmark requested a review from uncleDecart as a code owner May 8, 2026 21:23

eriknordmark requested review from europaul and rene May 10, 2026 08:36

eriknordmark mentioned this pull request May 10, 2026

lim.test TestInfo times out matching dinfo.systemAdapter.status.ports.ifname even though EVE publishes valid systemAdapter info #1166

Closed

eriknordmark and others added 4 commits May 15, 2026 11:18

eriknordmark force-pushed the nodeagent-tests branch from ae02bb8 to c80f311 Compare May 15, 2026 09:21

eriknordmark wants to merge 4 commits into lf-edge:master from eriknordmark:nodeagent-tests
