
tests/nodeagent: e2e coverage for reboot paths, restart counter, and disk-space maintenance#1162

Open
eriknordmark wants to merge 4 commits into lf-edge:master from eriknordmark:nodeagent-tests


eriknordmark commented May 7, 2026

Summary

Adds a new tests/nodeagent/ suite with six e2e tests that exercise nodeagent's safety-net reboot paths, restart-counter persistence, and the maintenance-mode-on-no-disk-space chain.

Cloud-disconnect reboot paths

Each in two distinct network-failure modes:

| Test | Path tested | Failure mode |
|---|---|---|
| reset_on_disconnect_link_down | BootReasonDisconnect | eth0 link-down (eden eve link down / QEMU set_link off) |
| reset_on_disconnect_blackhole | BootReasonDisconnect | controller IP silently dropped via in-EVE `iptables -A OUTPUT -d <adam-ip> -j DROP`; link, IP and lease all stay valid |
| baseos_fallback_link_down | BootReasonFallback | eth0 link-down during the post-update test window |
| baseos_fallback_blackhole | BootReasonFallback | controller IP black-holed during the test window |

The two failure modes exercise different code paths: link-down trips NIM's DPC verification first; the black-hole keeps the interface healthy so nodeagent's reset / fallback timer is the primary defence. The black-hole case is closer to common real-world failures (ISP routing change, controller IP migration, upstream firewall) than yanking the cable.
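
The two injection mechanisms can be sketched as a small dry-run helper. This is a minimal sketch, not the PR's escript: `inject_cmd` and its mode names are hypothetical, the controller address is passed in as an argument, and the function only prints the command quoted in the table above rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the eden command that would inject
# each failure mode. It does not run eden or touch the device.
inject_cmd() {
    local mode="$1" adam_ip="${2:-}"
    case "$mode" in
        link_down)
            # Takes eth0 down; NIM's DPC verification notices this first.
            echo "eden eve link down"
            ;;
        blackhole)
            if [ -z "$adam_ip" ]; then
                echo "blackhole mode needs the controller IP" >&2
                return 1
            fi
            # '--' stops eden's flag parsing so the iptables flags pass through.
            echo "eden eve ssh -- iptables -A OUTPUT -d $adam_ip -j DROP"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}

inject_cmd link_down
inject_cmd blackhole 192.0.2.10   # 192.0.2.10 is a documentation-only address
```

Printing instead of executing keeps the sketch safe to run anywhere; a real script would run the command and then wait for the resulting boot reason.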

Restart counter + disk-space maintenance

| Test | Path tested |
|---|---|
| restart_counter_monotonic | RestartCounter increments by exactly +1 across an actual EVE reboot, end-to-end through /persist/status/restartcounter → NodeAgentStatus → zedagent info → controller |
| maintenance_no_disk_space | RemainingSpace=0 (driven by a blank volume + filler outside AppPersistPaths) propagates through volumemgr → nodeagent → zedagent → controller as MaintenanceMode:true + MaintenanceModeReasons:[LOW_DISK_SPACE], and clears cleanly when pressure is removed |
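
The "exactly +1" condition for restart_counter_monotonic can be illustrated with a small check. This is a sketch under assumptions: `assert_counter_incremented` is a hypothetical helper, and the two values are assumed to have already been extracted from the controller's info stream before and after the reboot.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeeds only if the counter rose by exactly one.
assert_counter_incremented() {
    local before="$1" after="$2" v
    # Reject anything that is not a non-negative decimal integer.
    for v in "$before" "$after"; do
        case "$v" in
            ''|*[!0-9]*)
                echo "not a counter value: '$v'" >&2
                return 2
                ;;
        esac
    done
    # 10# forces base 10 so a value with leading zeros is not read as octal.
    if [ $((10#$after)) -ne $((10#$before + 1)) ]; then
        echo "RestartCounter went $before -> $after, expected +1" >&2
        return 1
    fi
}

assert_counter_incremented 41 42 && echo "counter advanced by one"
```

A strict equality check, rather than "greater than", also catches an unexpected extra reboot between the two samples.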

Test plan

  • All six tests pass locally against a coverage-instrumented EVE under QEMU (ZedVirtual-4G).
  • Each black-hole rule is wiped naturally on the post-disconnect reboot — no cleanup needed.
  • Disk-space maintenance test cleans up filler + volume; device returns to ZDEVICE_STATE_ONLINE.
  • CI run on lf-edge/eden's harness.

Per-test runtime (laptop, KVM-accelerated): 4–8 min each.

Implementation notes

  • Freshness guard for reboot tests: each reset_on_disconnect_* test begins with eden controller edge-node reboot to establish a known LastBootReason baseline, so the final lim.test match cannot succeed against stale info from a prior run. The baseos_fallback_* tests get the same freshness for free, since the upgrade flow always transits BootReasonUpdate before reaching BootReasonFallback.
  • Info publish cadence: timer.deviceinfo.interval=30 so the rebooted device publishes a fresh info message quickly. lim.test's TestInfo matches only new info (einfo.InfoNew), so without this it can wait up to 10 min for the next periodic push.
  • Config propagation timing: timer.config.interval=10 shortens the controller-config polling interval. The device only picks up the new interval at its next poll under the default 60 s interval, so the fallback tests exec sleep 70 before any subsequent config push to give the shorter interval time to take effect.
  • Idempotency across runs (fallback tests): eveimage-remove + brief wait before eveimage-update, so re-running with the same EVE version still triggers a fresh update. (EVE refuses to retry a previously-failed version unless the retry counter is bumped or the baseos config is removed first.)
  • Black-hole mechanism: eden eve ssh -- iptables -A OUTPUT -d <adam-ip> -j DROP. The -- terminates eden's cobra flag parsing so the iptables flags pass through cleanly. Empirically verified the rule lands in pillar/zedbox's network namespace (it shares net:[4026531840] with the dom0 shell), so it does silently drop zedagent's controller-bound traffic.
  • Disk-pressure mechanism: a blank volume at 60 % of /persist (sparse: the declared size feeds reservedAppDiskUsage while the on-disk cost stays small) plus a 70 % filler in /persist/log/ to push dynamicUsedByDom0 past the 2 GiB Dom0DiskUsageMaxBytes static cap. The filler MUST NOT go to /persist/newlog/: that directory is in volumemgr's excludeDirs and so cancels out in usedByDom0 = device.Used - appUsage. timer.metric.diskscan.interval=30 keeps volumemgr's recompute window short enough for a test run.
  • Disk-pressure cleanup: the volume-create / fill / assert sequence runs inside a bash wrapper invoked via exec -t 30m bash run_test.sh, with trap cleanup EXIT tearing down the multi-GiB filler and the blank volume regardless of which assertion fails. The escript dialect has no defer of its own, so without the wrapper a flaky MaintenanceMode:true match (info-publish lag, propagation timeout) would halt the script with the filler still pinned to disk and wedge /persist for any subsequent run.
  • Device-model gating: tests skip with a message line on device models other than ZedVirtual-4G / VBox (no eden eve link / eden eve ssh support).
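
The cleanup-on-any-exit pattern described for the disk-pressure test can be sketched as follows. This is not the PR's run_test.sh: `run_with_cleanup` is a hypothetical name, a scratch directory stands in for /persist/log, a 1 KiB file stands in for the multi-GiB filler, and the `outcome` argument simulates whether the MaintenanceMode assertion passes.

```shell
#!/usr/bin/env bash
# Sketch of the wrapper pattern: cleanup runs on every exit path.
run_with_cleanup() {
    local workdir="$1" outcome="$2"
    (
        # Subshell, so the EXIT trap fires when this block ends rather than
        # when the calling shell exits.
        filler="$workdir/filler.bin"
        cleanup() { rm -f "$filler"; }
        trap cleanup EXIT

        # Stand-in for writing the filler.
        head -c 1024 /dev/zero > "$filler"

        # Stand-in for the assertion; any outcome other than "pass"
        # simulates a failed or timed-out match.
        [ "$outcome" = "pass" ]
    )
}

scratch="$(mktemp -d)"
run_with_cleanup "$scratch" fail || echo "assertion failed, filler still cleaned up"
ls -A "$scratch"    # prints nothing: the filler was removed despite the failure
rmdir "$scratch"
```

The trap leaves the subshell's exit status untouched, so the caller still sees the failure while the filler is gone.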

Suite structure

tests/nodeagent/
├── Makefile
├── eden-config.yml
├── eden.nodeagent.tests.txt
└── testdata/
    ├── baseos_fallback_blackhole.txt
    ├── baseos_fallback_link_down.txt
    ├── maintenance_no_disk_space.txt
    ├── reset_on_disconnect_blackhole.txt
    ├── reset_on_disconnect_link_down.txt
    └── restart_counter_monotonic.txt

The suite is not wired into tests/workflow/smoke.tests.txt or eve-upgrade.tests.txt in this PR — the six tests are addressable on their own (e.g. eden.escript.test -test.run TestEdenScripts/reset_on_disconnect_link_down -testdata tests/nodeagent/testdata/). Folding them into a broader suite is a follow-up after we see how they behave under CI.
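
For running the tests one at a time, the example invocation above generalizes to all six scripts. The sketch below only prints the command lines; `print_invocations` is a hypothetical helper, and the test names are taken from the testdata/ listing.

```shell
#!/usr/bin/env bash
# The six script names from tests/nodeagent/testdata/ (without .txt).
NODEAGENT_TESTS="
reset_on_disconnect_link_down
reset_on_disconnect_blackhole
baseos_fallback_link_down
baseos_fallback_blackhole
restart_counter_monotonic
maintenance_no_disk_space
"

# Print one eden.escript.test command line per script; nothing is executed.
print_invocations() {
    local name
    for name in $NODEAGENT_TESTS; do
        echo "eden.escript.test -test.run TestEdenScripts/$name -testdata tests/nodeagent/testdata/"
    done
}

print_invocations
```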

Why draft

Want a chance for someone familiar with eden's harness to look at the patterns before marking ready — particularly the per-test timer tuning, the idempotency dance for the fallback tests, and the blank-volume-plus-filler trick for the disk-space test.

eriknordmark changed the title from "tests/nodeagent: e2e coverage for cloud-disconnect reboots" to "tests/nodeagent: e2e coverage for reboot paths, restart counter, and disk-space maintenance" May 8, 2026
eriknordmark marked this pull request as ready for review May 8, 2026 21:23
eriknordmark requested a review from uncleDecart as a code owner May 8, 2026 21:23
eriknordmark requested review from europaul and rene May 10, 2026 08:36
eriknordmark and others added 4 commits May 15, 2026 11:18
Adds a new tests/nodeagent/ suite with four scripts that exercise the
two safety-net reboot paths in pkg/pillar/cmd/nodeagent — each in two
network-failure modes:

- reset_on_disconnect_link_down  (BootReasonDisconnect, eth0 down)
- reset_on_disconnect_blackhole  (BootReasonDisconnect, controller IP
                                   silently dropped via in-EVE
                                   iptables; link, IP and lease all
                                   stay valid so NIM is happy and
                                   nodeagent's timer is the primary
                                   defence)
- baseos_fallback_link_down      (BootReasonFallback, eth0 down during
                                   the post-update test window)
- baseos_fallback_blackhole      (BootReasonFallback, controller IP
                                   silently dropped during the test
                                   window)

The two failure modes exercise different code paths: link-down trips
NIM's DPC verification first, while the black-hole keeps the interface
healthy so the disconnect / fallback timer in nodeagent fires alone.

Implementation notes:

- Every test prefixes with eden controller edge-node reboot to
  establish a known LastBootReason baseline, so the final lim.test
  match cannot succeed against stale info from a prior run.
- timer.deviceinfo.interval is shortened to 30s so the rebooted device
  publishes a fresh info quickly after the disconnect-driven reboot.
  lim.test's TestInfo uses einfo.InfoNew (only new messages match);
  without this the test can wait up to 10min for the next periodic
  push.
- timer.config.interval is shortened to 10s so subsequent config
  pushes propagate to the device promptly. Both fallback tests then
  pause 70s after the initial timer push so the new shorter interval
  has time to take effect under the default 60s polling.
- The fallback tests issue eveimage-remove + a brief wait before
  eveimage-update so re-running with the same EVE version still
  triggers a fresh update — without this, EVE refuses to retry a
  previously-failed version unless the retry counter is bumped.
- Black-hole tests use eden eve ssh -- iptables -A OUTPUT -d <adam-ip>
  -j DROP. The "--" terminates eden's flag parsing so the iptables
  flags pass through cleanly. After the disconnect-driven reboot the
  rule dies with the previous boot's in-memory iptables, so no
  cleanup is needed.

All four tests pass locally against a coverage-instrumented EVE under
QEMU (ZedVirtual-4G).

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two more nodeagent e2e tests:

- restart_counter_monotonic: capture RestartCounter from Adam's info
  stream, force a controller-issued reboot, capture again, assert
  the new value is exactly previous + 1. Closes the e2e gap left
  by the unit test (which only covers the in-memory increment+write
  logic against a temp file): the persistence path actually writes
  to /persist/status/restartcounter, the value survives an EVE
  reboot, and reaches the controller through the
  NodeAgentStatus -> zedagent -> info chain.

- maintenance_no_disk_space: drive volumemgr to RemainingSpace=0 by
  combining a blank volume sized at 60% of /persist (declared size
  is sparse-allocated so on-disk cost is small but contributes to
  reservedAppDiskUsage in getRemainingDiskSpace) with a 70%
  /persist/log filler that pushes dynamicUsedByDom0 past the 2 GiB
  Dom0DiskUsageMaxBytes static cap. nodeagent's
  handleVolumeMgrStatusImpl then sets
  MaintenanceModeReasonNoDiskSpace, which surfaces in the controller's
  info as MaintenanceMode=true and
  MaintenanceModeReasons:[MAINTENANCE_MODE_REASON_LOW_DISK_SPACE].

  The volume-create / fill / assert sequence runs inside a bash
  wrapper invoked via 'exec -t 30m bash run_test.sh', so that
  'trap cleanup EXIT' tears down the multi-GiB filler and the blank
  volume regardless of which assertion fails -- the escript dialect
  has no defer of its own, and a leaked filler would wedge /persist
  for any subsequent run. Final testscript-level assertion is that
  the device returns to ZDEVICE_STATE_ONLINE after pressure clears.

  Note: filler must go to /persist/log (or any path NOT in
  AppPersistPaths and NOT NewlogDir). Files in /persist/newlog/ are
  in volumemgr's excludeDirs and so cancel out in
  usedByDom0 = device.Used - appUsage; they do not push the dom0
  reservation.
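
The cancellation described in the note can be illustrated with plain arithmetic. This is a simplified model, not volumemgr's code: it assumes, as the note states, that usage under excludeDirs is counted in appUsage, and every figure (in MiB) is invented for the example.

```shell
#!/usr/bin/env bash
# Simplified model of usedByDom0 = device.Used - appUsage, in MiB.
used_by_dom0() {
    local device_used="$1" app_usage="$2"
    echo $((device_used - app_usage))
}

# Made-up baseline figures and filler size.
base_used=3000
base_app=1000
filler=5000

# Filler under /persist/log: only device.Used grows.
used_by_dom0 $((base_used + filler)) "$base_app"                 # prints 7000

# Filler under /persist/newlog/: appUsage grows by the same amount,
# so the result is unchanged from the 2000 MiB baseline.
used_by_dom0 $((base_used + filler)) $((base_app + filler))      # prints 2000
```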

Both tests use the timer.config.interval=10 and
timer.deviceinfo.interval=30 pattern from the existing tests so
config and post-event info propagate quickly. The maintenance test
additionally sets timer.metric.diskscan.interval=30 so volumemgr
recomputes RemainingSpace within ~30s of disk usage changing.

Verified locally on a coverage-instrumented EVE under QEMU
(ZedVirtual-4G). restart_counter_monotonic ~3.6 min,
maintenance_no_disk_space ~6 min.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…back tests

baseos_fallback_link_down and baseos_fallback_blackhole push a
BaseOsConfig for 12.1.0 to exercise the upgrade-window fallback
path, but they only `eden eve reset` (clears device config) at the
end, not `eveimage-remove`. The 12.1.0 BaseOsConfig therefore
lingers in adam after the test finishes, which leaks state into any
subsequent test or suite that touches baseos behaviour.

Add a final eveimage-remove of {{ $short_version }} before the
eden eve reset so adam returns to a clean BaseOsConfig list.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two follow-ups for the nodeagent suite:

1. Raise outer -test.timeout values in eden.nodeagent.tests.txt so the
   test-framework wrapper doesn't kill a healthy run before its inner
   lim.test waits resolve. On the coverage-instrumented EVE build,
   post-reboot Info republish takes ~12 min, so the prior 15-minute
   timeouts on reset_on_disconnect_link_down / _blackhole could not
   accommodate even one of the three sequential lim.test waits each
   test uses (the test-script -timewait stays at 30m so a healthy
   device still resolves quickly; the change only relaxes the outer
   cap). New values: 45m for the disconnect / fallback tests, 20m for
   restart_counter_monotonic, 40m for maintenance_no_disk_space.

2. Add a get-config assertion after every eveimage-remove call so the
   test fails loudly if the controller config still references the
   removed image. The current eden CLI EdgeNodeEVEImageRemove only
   removes the legacy baseosconfig list entry and leaves the modern
   single-block baseos field + contentInfo[] populated; this assertion
   surfaces that bug end-to-end. (The corresponding eden CLI fix lands
   in PR lf-edge#1172 alongside the update_eve_image cleanup tests.)
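
The timeout reasoning in item 1 can be checked with a worst-case budget. The sketch assumes every one of the three sequential lim.test waits takes the full ~12 min quoted above and ignores all other steps in the script; `fits_outer_timeout` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: does an outer timeout (minutes) cover N sequential
# waits of a given worst-case length (minutes)?
fits_outer_timeout() {
    local waits="$1" per_wait_min="$2" outer_min="$3"
    [ $((waits * per_wait_min)) -le "$outer_min" ]
}

# Figures from the commit message: 3 waits of ~12 min each, so 36 min.
for outer in 15 45; do
    if fits_outer_timeout 3 12 "$outer"; then
        echo "${outer}m outer timeout: fits"
    else
        echo "${outer}m outer timeout: too short"
    fi
done
```

Under these assumptions the 15-minute cap is too short and the 45-minute cap leaves about 9 minutes of headroom.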

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>