[14.5-stable] tests/zfs: bump /persist metric wait budgets#1184

Open
eriknordmark wants to merge 1 commit into lf-edge:EVE-14.5-stable from eriknordmark:backport-zfs-baseline-timeout-14.5

@eriknordmark
Contributor

@eriknordmark eriknordmark commented May 19, 2026

Backport of #1181.

How to test and validate this PR

Covered by the eden CI matrix on this PR: the Storage (zfs) and Smoke (zfs, *) jobs exercise state_and_layout_check.txt directly. The fix bumps the inner shell DEADLINEs and the outer exec -t budgets in two embedded scripts. On a slow-boot run, volumemgr's diskMetricsTimerTask can land ~45 s after onboarding, and zedagent ships the next metric tick ~60 s later; with the larger budgets such a run no longer trips the deadline on capture-persist-baseline.sh, which was previously 60 s. Both scripts exit on the first successful poll, so the happy path is unchanged.
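For readers unfamiliar with the pattern, the deadline-bounded poll described above can be sketched roughly as follows. This is a minimal illustration, not the embedded scripts from state_and_layout_check.txt: the helper name `poll_until`, its arguments, and the example probe `persist_metric_present` are all hypothetical, and only the `DEADLINE` variable name is taken from the description.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): poll_until BUDGET_SECONDS INTERVAL_SECONDS COMMAND...
# Returns 0 as soon as COMMAND succeeds, or 1 once the budget has elapsed.
poll_until() {
    budget=$1
    interval=$2
    shift 2
    # Absolute deadline in epoch seconds, computed once up front.
    DEADLINE=$(( $(date +%s) + budget ))
    while :; do
        if "$@"; then
            return 0    # first successful poll: exit immediately
        fi
        if [ "$(date +%s)" -ge "$DEADLINE" ]; then
            return 1    # budget exhausted without a successful poll
        fi
        sleep "$interval"
    done
}
```

A call such as `poll_until 300 5 persist_metric_present` (with `persist_metric_present` standing in for whatever check the real script performs) would return as soon as the check passes, so a larger budget only costs time on runs where the metric is genuinely late.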

Cherry-pick applied cleanly from master with no conflicts; nothing in the surrounding test depends on master-only infrastructure.

The (cherry picked from commit ...) line here points at the topic-branch SHA (0f6a0db). Once #1181 is merged on master, that SHA will differ from the master squash SHA; the content is identical, and the line can be amended on merge if preferred.

Changelog notes

No user-facing changes.

PR Backports

The 1-min budget on capture-persist-baseline.sh was too tight against
EVE's actual time-to-first-metric. On slow-boot CI runs volumemgr's
diskMetricsTimerTask doesn't tick for ~45 s after onboarding completes,
and zedagent ships device metrics on its own 60 s ticker that the test
doesn't reconfigure — so the first /persist entry can land ~100 s after
onboarding even on a healthy run. Two recent failures of
state_and_layout_check.txt:25 (lf-edge/eve runs 25730296544 on 2026-05-12
and 26033232173 on 2026-05-18) hit the deadline at exactly 60 s, with
zedagent's metric.log showing four metric reports in the window all
carrying empty dm.disk arrays.
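
The timing argument above can be checked with a little shell arithmetic. The 45 s and 60 s figures come from the text; the variable names are illustrative only, and the sum assumes the zedagent tick aligns as badly as possible with the volumemgr tick.

```shell
#!/bin/sh
# Illustrative variable names; the numbers are the ones quoted in the commit message.
disk_metrics_delay=45   # volumemgr diskMetricsTimerTask first tick after onboarding (s)
zedagent_tick=60        # zedagent device-metrics ticker period (s)
old_budget=60           # previous baseline-capture budget (s)

# Worst case: the disk metrics land just after a zedagent tick,
# so the first /persist entry waits one further full tick.
worst_case=$(( disk_metrics_delay + zedagent_tick ))

echo "worst-case first /persist entry: ${worst_case} s (old budget: ${old_budget} s)"
```

Under these assumptions the worst case is about 105 s, in line with the ~100 s quoted above and well past the old 60 s budget.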

Bump the baseline capture from 1 min to 5 min, and the post-resize
wait-for-persist-grew from 4 min to 5.5 min so a worst-case zedagent
metric-tick alignment after the perturbation doesn't trip it either.
Both scripts exit on the first successful poll, so the happy path is
unchanged; only genuinely-slow boots burn the extra time.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 0f6a0db)