tests/nodeagent: e2e coverage for reboot paths, restart counter, and disk-space maintenance #1162
Open
eriknordmark wants to merge 4 commits into
Force-pushed from 13ba5ce to d2f5ad7.
Adds a new tests/nodeagent/ suite with four scripts that exercise the
two safety-net reboot paths in pkg/pillar/cmd/nodeagent — each in two
network-failure modes:
- reset_on_disconnect_link_down (BootReasonDisconnect, eth0 down)
- reset_on_disconnect_blackhole (BootReasonDisconnect, controller IP
  silently dropped via in-EVE iptables; link, IP and lease all stay
  valid so NIM is happy and nodeagent's timer is the primary defence)
- baseos_fallback_link_down (BootReasonFallback, eth0 down during the
  post-update test window)
- baseos_fallback_blackhole (BootReasonFallback, controller IP silently
  dropped during the test window)
The two failure modes exercise different code paths: link-down trips
NIM's DPC verification first, while the black-hole keeps the interface
healthy so the disconnect / fallback timer in nodeagent fires alone.
Implementation notes:
- Every test prefixes with eden controller edge-node reboot to
establish a known LastBootReason baseline, so the final lim.test
match cannot succeed against stale info from a prior run.
- timer.deviceinfo.interval is shortened to 30s so the rebooted device
publishes a fresh info quickly after the disconnect-driven reboot.
lim.test's TestInfo uses einfo.InfoNew (only new messages match);
without this the test can wait up to 10min for the next periodic
push.
- timer.config.interval is shortened to 10s so subsequent config
pushes propagate to the device promptly. Both fallback tests then
pause 70s after the initial timer push so the new shorter interval
has time to take effect under the default 60s polling.
- The fallback tests issue eveimage-remove + a brief wait before
eveimage-update so re-running with the same EVE version still
triggers a fresh update — without this, EVE refuses to retry a
previously-failed version unless the retry counter is bumped.
- Black-hole tests use eden eve ssh -- iptables -A OUTPUT -d <adam-ip>
-j DROP. The "--" terminates eden's flag parsing so the iptables
flags pass through cleanly. After the disconnect-driven reboot the
rule dies with the previous boot's in-memory iptables, so no
cleanup is needed.
All four tests pass locally against a coverage-instrumented EVE under
QEMU (ZedVirtual-4G).
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two more nodeagent e2e tests:
- restart_counter_monotonic: capture RestartCounter from Adam's info
  stream, force a controller-issued reboot, capture again, assert the
  new value is exactly previous + 1. Closes the e2e gap left by the
  unit test (which only covers the in-memory increment+write logic
  against a temp file): the persistence path actually writes to
  /persist/status/restartcounter, the value survives an EVE reboot,
  and reaches the controller through the NodeAgentStatus -> zedagent
  -> info chain.
- maintenance_no_disk_space: drive volumemgr to RemainingSpace=0 by
  combining a blank volume sized at 60% of /persist (declared size is
  sparse-allocated so on-disk cost is small but contributes to
  reservedAppDiskUsage in getRemainingDiskSpace) with a 70%
  /persist/log filler that pushes dynamicUsedByDom0 past the 2 GiB
  Dom0DiskUsageMaxBytes static cap. nodeagent's
  handleVolumeMgrStatusImpl then sets MaintenanceModeReasonNoDiskSpace,
  which surfaces in the controller's info as MaintenanceMode=true and
  MaintenanceModeReasons:[MAINTENANCE_MODE_REASON_LOW_DISK_SPACE].
The volume-create / fill / assert sequence runs inside a bash wrapper
invoked via 'exec -t 30m bash run_test.sh', so that 'trap cleanup EXIT'
tears down the multi-GiB filler and the blank volume regardless of
which assertion fails -- the escript dialect has no defer of its own,
and a leaked filler would wedge /persist for any subsequent run. Final
testscript-level assertion is that the device returns to
ZDEVICE_STATE_ONLINE after pressure clears.
Note: filler must go to /persist/log (or any path NOT in
AppPersistPaths and NOT NewlogDir). Files in /persist/newlog/ are in
volumemgr's excludeDirs and so cancel out in
usedByDom0 = device.Used - appUsage; they do not push the dom0
reservation.
Both tests use the timer.config.interval=10 and
timer.deviceinfo.interval=30 pattern from the existing tests so config
and post-event info propagate quickly. The maintenance test
additionally sets timer.metric.diskscan.interval=30 so volumemgr
recomputes RemainingSpace within ~30s of disk usage changing.
Verified locally on a coverage-instrumented EVE under QEMU
(ZedVirtual-4G). restart_counter_monotonic ~3.6 min,
maintenance_no_disk_space ~6 min.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
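The cleanup-on-exit wrapper described in this commit can be sketched roughly as follows. This is a minimal illustration, not the contents of the real run_test.sh: the function name, the temp directory and the tiny filler file are hypothetical stand-ins for the blank volume and the multi-GiB filler under /persist/log.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a 'trap cleanup EXIT' wrapper.
# Temp files stand in for the EVE volume and the /persist/log filler.

run_with_cleanup() (
    # The body runs in a subshell, so the EXIT trap fires when the
    # subshell ends, whether the steps pass or an assertion fails.
    workdir=$(mktemp -d) || exit 1
    filler="$workdir/filler"

    cleanup() {
        rm -rf "$workdir"
        echo "cleanup ran"
    }
    trap cleanup EXIT

    # Stand-in for creating the filler.
    printf 'filler data' > "$filler"

    # Stand-in for the assertion step; "$1" decides pass or fail.
    [ "$1" = "pass" ] || exit 1
    echo "assertions passed"
)

run_with_cleanup pass
run_with_cleanup fail || echo "test failed, but nothing leaked"
```

Both calls print `cleanup ran`, which is the point of the pattern: the teardown does not depend on reaching the end of the script.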
…back tests
baseos_fallback_link_down and baseos_fallback_blackhole push a
BaseOsConfig for 12.1.0 to exercise the upgrade-window fallback
path, but they only `eden eve reset` (clears device config) at the
end, not `eveimage-remove`. The 12.1.0 BaseOsConfig therefore
lingers in adam after the test finishes, which leaks state into any
subsequent test or suite that touches baseos behaviour.
Add a final eveimage-remove of {{ $short_version }} before the
eden eve reset so adam returns to a clean BaseOsConfig list.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two follow-ups for the nodeagent suite:
1. Raise outer -test.timeout values in eden.nodeagent.tests.txt so the
   test-framework wrapper doesn't kill a healthy run before its inner
   lim.test waits resolve. On the coverage-instrumented EVE build,
   post-reboot Info republish takes ~12 min, so the prior 15-minute
   timeouts on reset_on_disconnect_link_down / _blackhole could not
   accommodate even one of the three sequential lim.test waits each
   test uses (the test-script -timewait stays at 30m so a healthy
   device still resolves quickly; the change only relaxes the outer
   cap). New values: 45m for the disconnect / fallback tests, 20m for
   restart_counter_monotonic, 40m for maintenance_no_disk_space.
2. Add a get-config assertion after every eveimage-remove call so the
   test fails loudly if the controller config still references the
   removed image. The current eden CLI EdgeNodeEVEImageRemove only
   removes the legacy baseosconfig list entry and leaves the modern
   single-block baseos field + contentInfo[] populated; this assertion
   surfaces that bug end-to-end. (The corresponding eden CLI fix lands
   in PR lf-edge#1172 alongside the update_eve_image cleanup tests.)
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
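The timeout reasoning in item 1 can be checked with a little arithmetic. The sketch below assumes the worst case, in which each of the three sequential lim.test waits needs a full ~12 min republish. The `fits` helper is hypothetical and not part of eden.

```shell
# Figures taken from the commit message; worst-case model is an assumption.
republish_min=12   # post-reboot Info republish on the instrumented build
waits=3            # sequential lim.test waits per test

# Worst case: every wait has to sit through a full republish.
worst_case_min=$((republish_min * waits))

# Hypothetical helper: succeeds when an outer cap (minutes) covers the worst case.
fits() {
    [ "$1" -ge "$worst_case_min" ]
}

for cap in 15 45; do
    if fits "$cap"; then
        echo "${cap}m cap: ok (worst case ${worst_case_min}m)"
    else
        echo "${cap}m cap: too short (worst case ${worst_case_min}m)"
    fi
done
```

Under that assumption the old 15m cap falls well short of the 36 min worst case, while the new 45m cap leaves some headroom.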
Force-pushed from ae02bb8 to c80f311.
Summary
Adds a new `tests/nodeagent/` suite with six e2e tests that exercise nodeagent's safety-net reboot paths, restart-counter persistence, and the maintenance-mode-on-no-disk-space chain.
Cloud-disconnect reboot paths
Each in two distinct network-failure modes:
- `reset_on_disconnect_link_down`: `BootReasonDisconnect`; eth0 down (`eden eve link down` / QEMU `set_link off`)
- `reset_on_disconnect_blackhole`: `BootReasonDisconnect`; `iptables -A OUTPUT -d <adam-ip> -j DROP`; link, IP and lease all stay valid
- `baseos_fallback_link_down`: `BootReasonFallback`; eth0 down during the post-update test window
- `baseos_fallback_blackhole`: `BootReasonFallback`; controller IP silently dropped during the test window
The two failure modes exercise different code paths: link-down trips NIM's DPC verification first; the black-hole keeps the interface healthy so nodeagent's reset / fallback timer is the primary defence. The black-hole case is closer to common real-world failures (ISP routing change, controller IP migration, upstream firewall) than yanking the cable.
Restart counter + disk-space maintenance
- `restart_counter_monotonic`: `RestartCounter` increments by exactly +1 across an actual EVE reboot, end-to-end through `/persist/status/restartcounter` → `NodeAgentStatus` → zedagent info → controller
- `maintenance_no_disk_space`: `RemainingSpace=0` (driven by a blank volume + filler outside AppPersistPaths) propagates through `volumemgr → nodeagent → zedagent → controller` as `MaintenanceMode:true` + `MaintenanceModeReasons:[LOW_DISK_SPACE]`, and clears cleanly when pressure is removed
Test plan
- … `ZedVirtual-4G`).
- … `ZDEVICE_STATE_ONLINE`.
Per-test runtime (laptop, KVM-accelerated): 4–8 min each.
Implementation notes
- `reset_on_disconnect_*` prefixes with `eden controller edge-node reboot` to establish a known `LastBootReason` baseline. The final `lim.test` match thus cannot succeed against stale info from a prior run. The `baseos_fallback_*` tests get the same freshness for free since the upgrade flow always transits `BootReasonUpdate` before reaching `BootReasonFallback`.
- `timer.deviceinfo.interval=30` so the rebooted device publishes a fresh info message quickly. `lim.test`'s `TestInfo` matches only new info (`einfo.InfoNew`), so without this it can wait up to 10 min for the next periodic push.
- `timer.config.interval=10` shortens the controller-config polling. The fallback tests then `exec sleep 70` so the new shorter interval has time to take effect under the default 60 s polling, before any subsequent config push.
- The fallback tests issue `eveimage-remove` + brief wait before `eveimage-update`, so re-running with the same EVE version still triggers a fresh update. (EVE refuses to retry a previously-failed version unless the retry counter is bumped or the baseos config is removed first.)
- Black-hole tests use `eden eve ssh -- iptables -A OUTPUT -d <adam-ip> -j DROP`. The `--` terminates eden's cobra flag parsing so the `iptables` flags pass through cleanly. Empirically verified the rule lands in pillar/zedbox's network namespace (it shares `net:[4026531840]` with the dom0 shell), so it does silently drop zedagent's controller-bound traffic.
- Blank volume sized at 60 % of `/persist` (sparse: declared size feeds `reservedAppDiskUsage`, on-disk cost is small) + 70 % filler in `/persist/log/` to push `dynamicUsedByDom0` past the 2 GiB `Dom0DiskUsageMaxBytes` static cap. The filler MUST NOT go to `/persist/newlog/`: that directory is in volumemgr's `excludeDirs` and so cancels out in `usedByDom0 = device.Used - appUsage`.
- `timer.metric.diskscan.interval=30` makes volumemgr's recompute window a tractable test duration.
- The volume-create / fill / assert sequence runs inside a bash wrapper invoked via `exec -t 30m bash run_test.sh`, with `trap cleanup EXIT` tearing down the multi-GiB filler and the blank volume regardless of which assertion fails. The escript dialect has no `defer` of its own, so without the wrapper a flaky `MaintenanceMode:true` match (info-publish lag, propagation timeout) would halt the script with the filler still pinned to disk and wedge `/persist` for any subsequent run.
- … `message` line on device models other than `ZedVirtual-4G` / `VBox` (no `eden eve link` / `eden eve ssh` support).
Suite structure
The suite is not wired into `tests/workflow/smoke.tests.txt` or `eve-upgrade.tests.txt` in this PR; the six tests are addressable on their own (e.g. `eden.escript.test -test.run TestEdenScripts/reset_on_disconnect_link_down -testdata tests/nodeagent/testdata/`). Folding them into a broader suite is a follow-up after we see how they behave under CI.
Why draft
Want a chance for someone familiar with eden's harness to look at the patterns before marking ready — particularly the per-test timer tuning, the idempotency dance for the fallback tests, and the blank-volume-plus-filler trick for the disk-space test.
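For reviewers weighing the blank-volume-plus-filler trick, here is a deliberately simplified model of why both pieces are needed. It is not volumemgr's actual `getRemainingDiskSpace`: the formula (dom0 reservation taken as the larger of the 2 GiB static cap and the dynamic usage, then subtracted from the total together with the reserved app space) is an assumption for illustration, and the 10 GiB `/persist` size is hypothetical.

```shell
GiB=$((1024 * 1024 * 1024))
DOM0_STATIC_CAP=$((2 * GiB))   # the 2 GiB static cap named in the notes

# remaining_space TOTAL RESERVED_APP DYNAMIC_DOM0   (all in bytes)
# Simplified, assumed model; prints the remaining space, clamped at 0.
remaining_space() {
    rs_dom0=$DOM0_STATIC_CAP
    # Assumption: once dom0 usage exceeds the static cap, usage wins.
    if [ "$3" -gt "$rs_dom0" ]; then
        rs_dom0=$3
    fi
    rs_left=$(($1 - $2 - rs_dom0))
    if [ "$rs_left" -lt 0 ]; then
        rs_left=0
    fi
    echo "$rs_left"
}

persist_total=$((10 * GiB))                # hypothetical /persist size
blank_volume=$((persist_total * 60 / 100)) # declared size, sparse on disk
log_filler=$((persist_total * 70 / 100))   # real bytes in /persist/log

echo "volume only:     $(remaining_space "$persist_total" "$blank_volume" 0)"
echo "volume + filler: $(remaining_space "$persist_total" "$blank_volume" "$log_filler")"
```

In this model the volume alone still leaves space, because the dom0 share stays at the static cap. Adding the filler raises the dom0 share past the cap, and the result clamps to zero.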