eve-k: kube/longhorn: tune replica rebuild, add snapshot management, fix replica/PVC size reporting by andrewd-zededa · Pull Request #5955 · lf-edge/eve

andrewd-zededa · 2026-05-13T21:28:26Z

Description

Longhorn settings (pkg/kube/cfg-manifests/longhorn-cfg.yaml):

- Set replica-replenishment-wait-interval=600: prevents Longhorn from creating a
  replacement replica object while original nodes are recovering from a simultaneous
  power reset. Once a replacement is created Longhorn marks the slot as usable, blocking
  CheckAndReuseFailedReplica and forcing a full rebuild instead of a cheap delta sync.
  The 600 s default is sized against EVE's 3-node boot stagger: the last node is
  Longhorn-ready at ~T=3:50 from power-on; the interval clock starts at ~T=2:00.
- Set concurrent-replica-rebuild-per-node-limit=2: caps parallel rebuilds to avoid
  saturating disk I/O on edge hardware when all volumes need resync simultaneously.
- Set auto-salvage=true: enables self-recovery for faulted volumes on unattended devices.
  upgrade-checker settings.
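
The sizing argument for the 600 s interval can be checked with a little arithmetic. The sketch below uses only the figures quoted above (clock start at ~T=2:00, last node ready at ~T=3:50, 600 s wait); the constant and function names are illustrative and do not come from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// Figures quoted in the PR description; the names are illustrative only.
const (
	intervalClockStart = 2 * time.Minute                // ~T=2:00, wait interval starts counting
	lastNodeReady      = 3*time.Minute + 50*time.Second // ~T=3:50, last node is Longhorn-ready
	replenishmentWait  = 600 * time.Second              // replica-replenishment-wait-interval
)

// replenishmentMargin returns how much of the wait interval is still left
// at the moment the last node becomes Longhorn-ready.
func replenishmentMargin(clockStart, lastReady, wait time.Duration) time.Duration {
	return clockStart + wait - lastReady
}

func main() {
	deadline := intervalClockStart + replenishmentWait
	margin := replenishmentMargin(intervalClockStart, lastNodeReady, replenishmentWait)
	fmt.Printf("interval expires at T=%v, margin after last node ready: %v\n", deadline, margin)
}
```

With these numbers the interval expires at T=12:00, leaving 490 s between the last node becoming ready and Longhorn creating a replacement replica.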

Storage classes (pkg/kube/manifests/storage-classes.yaml):

- Add recurringJobSelector to all three storage classes so that volumes pick up the
  recurring snapshot job automatically via the default group.

Recurring snapshot management:

- Add kubeapi.SetLonghornRecurringSnapshot() (longhornsnap.go): creates, updates, or
  deletes a cluster-scoped Longhorn RecurringJob CR matching the LonghornSnapshotCron
  GlobalConfig value. Empty string disables the feature.
- Add cronValidator in types/global.go: validates that LonghornSnapshotCron is either a
  5-field cron expression or an empty string; rejects @daily/@hourly shorthands and
  6-field Quartz syntax.
- Wire applyLonghornRecurringSnapshot() into zedkube's main loop and GlobalConfig change
  handler (zedkube.go); idempotent via longhornSnapshotSet flag.
- Add LonghornSnapshotCron to CONFIG-PROPERTIES.md and EVE-K.md.
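
To make the validation rule concrete, here is a minimal sketch of a validator with the behavior described above: empty string accepted, exactly five fields required, `@` shorthands and 6-field expressions rejected. The function name `validateFiveFieldCron` and the per-field character check are assumptions for illustration; the actual cronValidator in types/global.go may differ.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// validateFiveFieldCron is a hypothetical stand-in for cronValidator.
// It accepts the empty string (feature disabled) or exactly five
// whitespace-separated fields, and rejects "@" shorthands.
func validateFiveFieldCron(expr string) error {
	if expr == "" {
		return nil
	}
	if strings.HasPrefix(strings.TrimSpace(expr), "@") {
		return errors.New("cron shorthands such as @daily are not accepted")
	}
	fields := strings.Fields(expr)
	if len(fields) != 5 {
		return fmt.Errorf("expected 5 cron fields, got %d", len(fields))
	}
	for _, f := range fields {
		// Conservative character check (an assumption, not from the PR):
		// only digits and the operators * , / - are allowed.
		if strings.Trim(f, "0123456789*,/-") != "" {
			return fmt.Errorf("unsupported characters in field %q", f)
		}
	}
	return nil
}

func main() {
	for _, expr := range []string{"", "0 */6 * * *", "@hourly", "0 0 */6 * * *"} {
		fmt.Printf("%q -> %v\n", expr, validateFiveFieldCron(expr))
	}
}
```

Treating the empty string as valid mirrors the "empty string disables the feature" behavior of SetLonghornRecurringSnapshot().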

Replica progress reporting fix (longhorninfo.go):

- Fix false-100% progress for WO replicas that have no RebuildStatus entry yet.
  Previously replicaModeProgress returned 100% for WO mode when the rebuild entry was
  absent (transfer queued but not yet started). Now returns 0% / Rebuilding status.
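
The corrected behavior can be sketched as a small pure function. Everything here is hypothetical: the type names, the status strings, and the treatment of modes other than WO are assumptions, since the description only states the WO-without-rebuild-entry case.

```go
package main

import "fmt"

// replicaStatus and rebuildEntry are illustrative types, not the ones in longhorninfo.go.
type replicaStatus string

const (
	statusOnline     replicaStatus = "Online"
	statusRebuilding replicaStatus = "Rebuilding"
	statusUnknown    replicaStatus = "Unknown"
)

type rebuildEntry struct {
	Progress int // percent, 0..100
}

// replicaProgress sketches the fixed logic: a WO replica with no rebuild
// entry (transfer queued but not started) reports 0% / Rebuilding instead of 100%.
func replicaProgress(mode string, rebuild *rebuildEntry) (int, replicaStatus) {
	switch mode {
	case "RW":
		return 100, statusOnline
	case "WO":
		if rebuild == nil {
			return 0, statusRebuilding // previously reported as 100%
		}
		return rebuild.Progress, statusRebuilding
	default:
		// Assumption: any other mode is reported as unknown with no progress.
		return 0, statusUnknown
	}
}

func main() {
	p, s := replicaProgress("WO", nil)
	fmt.Println(p, s)
	p, s = replicaProgress("WO", &rebuildEntry{Progress: 42})
	fmt.Println(p, s)
}
```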

Snapshot bytes in PVC size reporting (longhorninfo.go, vitoapiserver.go):

- Add LonghornVolumeSnapshotBytes(): queries Longhorn snapshot CRDs via the
  longhornvolume= label selector and sums Status.Size for all non-head snapshots.
  MarkRemoved snapshots are included intentionally: they consume disk until GC completes.
- Add snapshot bytes to VolumeStatus.CurrentSize via GetPVCInfo() (vitoapiserver.go).
- Add snapshot bytes to KubeClusterInfo.AllocatedBytes via populateKVIFromPVCName()
  (longhorninfo.go). Both paths previously reported live data only (status.actualSize),
  understating real on-disk usage when snapshot chains accumulate.
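
As an illustration of the summing rule, the sketch below adds up snapshot sizes while skipping the volume head and keeping MarkRemoved snapshots. The `snapshot` struct is a simplified stand-in for the Longhorn snapshot CR, and identifying the head by the name `volume-head` is an assumption; the real sumSnapshotBytes helper may use different types and a different head check.

```go
package main

import "fmt"

// snapshot is a simplified, hypothetical view of a Longhorn snapshot CR.
type snapshot struct {
	Name        string
	Size        int64 // stands in for Status.Size, in bytes
	MarkRemoved bool  // removal requested, but space not yet reclaimed
}

// volumeHeadName is an assumption about how the live head is identified.
const volumeHeadName = "volume-head"

// sumSnapshotBytes totals the size of every non-head snapshot.
// MarkRemoved snapshots are counted because they occupy disk until GC finishes.
func sumSnapshotBytes(snaps []snapshot) uint64 {
	var total uint64
	for _, s := range snaps {
		if s.Name == volumeHeadName {
			continue // live data is already covered by status.actualSize
		}
		if s.Size > 0 {
			total += uint64(s.Size)
		}
	}
	return total
}

func main() {
	snaps := []snapshot{
		{Name: volumeHeadName, Size: 4096},
		{Name: "snap-a", Size: 1024},
		{Name: "snap-b", Size: 2048, MarkRemoved: true},
	}
	fmt.Println(sumSnapshotBytes(snaps))
}
```

Under these assumptions the reported size would be status.actualSize plus this total, which is what both GetPVCInfo() and populateKVIFromPVCName() are described as doing.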

Observability:

- Add longhorn-snapshot-overhead.sh: shell tool reporting per-volume snapshot CoW
  overhead; installed into the kube container image via Dockerfile.
- Add longhorn-snapshot-overhead.sh -v to collect-info.sh longhorn_info section
  (VERSION 43→44).

Tests:

- TestSumSnapshotBytes: unit test for the pure sumSnapshotBytes helper covering empty
  list, volume-head exclusion, MarkRemoved inclusion, and mixed cases.
- TestReplicaModeProgress: unit tests for replicaModeProgress covering all mode
  combinations including the WO-without-rebuild-entry false-100% bug case.
- TestCronValidator: unit tests for cronValidator covering valid and invalid inputs.
- TestSetLonghornRecurringSnapshotIntegration: integration test for the
  create/update/delete state machine; skips when no cluster is reachable.
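
The create/update/delete state machine exercised by the integration test can be pictured as a pure decision function over the desired cron string and the current state of the RecurringJob. This is a sketch only: `decideAction` and the `action` type are hypothetical names, and the real SetLonghornRecurringSnapshot() talks to the cluster rather than returning a value.

```go
package main

import "fmt"

// action is a hypothetical enumeration of what a reconcile step would do.
type action int

const (
	actionNoop action = iota
	actionCreate
	actionUpdate
	actionDelete
)

var actionNames = [...]string{"noop", "create", "update", "delete"}

func (a action) String() string { return actionNames[a] }

// decideAction picks the step needed to make the cluster match desiredCron.
// exists and existingCron describe the RecurringJob currently in the cluster.
func decideAction(desiredCron string, exists bool, existingCron string) action {
	switch {
	case desiredCron == "" && exists:
		return actionDelete // empty string disables the feature
	case desiredCron == "":
		return actionNoop
	case !exists:
		return actionCreate
	case existingCron != desiredCron:
		return actionUpdate
	default:
		return actionNoop
	}
}

func main() {
	fmt.Println(decideAction("0 */6 * * *", false, ""))
	fmt.Println(decideAction("0 */6 * * *", true, "0 0 * * *"))
	fmt.Println(decideAction("", true, "0 0 * * *"))
}
```

Keeping the decision separate from the API calls would let the state machine be unit-tested without a reachable cluster, which the integration test currently requires.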

Verification tooling (pkg/kube/test/):

- Add kube-test-longhorn-pvc-size.sh: audit script comparing Longhorn ground-truth PVC
  sizes (actualSize + snapshot chain) against EVE's pubsub-reported CurrentSize and
  AllocatedBytes, with configurable drift tolerance for EVE's ~60 s polling cycle.
- Add README explaining pkg/kube/test/ as a home for ad-hoc test tools not installed
  via the Dockerfile, intended for bind-mount use during cluster debugging.
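
The comparison the audit script performs can be sketched in Go as follows. The script itself is shell; this is only an illustration of the check, and the function names as well as the choice of an absolute byte tolerance (rather than, say, a percentage) are assumptions.

```go
package main

import "fmt"

// groundTruthSize is the Longhorn-side figure: live data plus the snapshot chain.
func groundTruthSize(actualSize, snapshotBytes uint64) uint64 {
	return actualSize + snapshotBytes
}

// withinDrift reports whether the EVE-reported size is close enough to the
// ground truth. toleranceBytes absorbs changes that happen between two
// polling cycles; an absolute tolerance is an assumption made for this sketch.
func withinDrift(truth, reported, toleranceBytes uint64) bool {
	var diff uint64
	if truth > reported {
		diff = truth - reported
	} else {
		diff = reported - truth
	}
	return diff <= toleranceBytes
}

func main() {
	truth := groundTruthSize(10<<20, 3<<20) // 10 MiB live + 3 MiB snapshots
	fmt.Println(withinDrift(truth, 13<<20, 1<<20))
	fmt.Println(withinDrift(truth, 10<<20, 1<<20)) // snapshots missing from the report
}
```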

PR dependencies

None

How to test and validate this PR

TODO

Changelog notes

TODO

PR Backports

16.0-stable: No
14.5-stable: No, as the feature is not available there.
13.4-stable: No, as the feature is not available there.

Checklist

I've provided a proper description
I've added the proper documentation
I've tested my PR on amd64 device
I've tested my PR on arm64 device
I've written the test verification instructions
I've set the proper labels to this PR

And last but not least:

I've checked the boxes above, or I've provided a good reason why I didn't
check them.

Please, check the boxes above after submitting the PR in interactive mode.

codecov · 2026-05-15T19:49:30Z

Codecov Report

❌ Patch coverage is 37.12575% with 105 lines in your changes missing coverage. Please review.
✅ Project coverage is 21.11%. Comparing base (aa7ce4c) to head (3ad5743).
⚠️ Report is 1 commit behind head on master.

Files with missing lines	Patch %	Lines
pkg/pillar/kubeapi/longhornsnap.go	0.00%	47 Missing ⚠️
pkg/pillar/kubeapi/longhorninfo.go	55.55%	32 Missing ⚠️
pkg/pillar/cmd/zedkube/zedkube.go	0.00%	17 Missing ⚠️
pkg/pillar/kubeapi/vitoapiserver.go	0.00%	5 Missing ⚠️
pkg/pillar/types/global.go	84.61%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5955      +/-   ##
==========================================
+ Coverage   20.67%   21.11%   +0.44%     
==========================================
  Files         490      501      +11     
  Lines       90460    92297    +1837     
==========================================
+ Hits        18699    19488     +789     
- Misses      70186    71049     +863     
- Partials     1575     1760     +185


eriknordmark

Kick off tests (but build might fail due to the docker hashes issue)

andrewd-zededa · 2026-05-21T09:52:51Z

hash cleaned up, awaiting new CI checks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andrew Durbin <andrewd@zededa.com>

andrewd-zededa · 2026-05-21T13:49:25Z

hash fixed again, awaiting tests run

github-actions Bot requested review from eriknordmark, naiming-zededa and zedi-pramodh May 13, 2026 21:29

andrewd-zededa force-pushed the eve-k-lh-replica-tuning branch 4 times, most recently from aa728f7 to d3d9207 Compare May 15, 2026 18:33

andrewd-zededa changed the title ~~eve-k: kube/longhorn tune replica rebuild and add configurable recurring snapshots~~ eve-k: kube/longhorn: tune replica rebuild, add snapshot management, fix replica/PVC size reporting May 15, 2026

andrewd-zededa marked this pull request as ready for review May 21, 2026 08:32

eriknordmark approved these changes May 21, 2026

andrewd-zededa force-pushed the eve-k-lh-replica-tuning branch from d3d9207 to 02c16d8 Compare May 21, 2026 08:42

github-actions Bot requested a review from eriknordmark May 21, 2026 08:42

andrewd-zededa force-pushed the eve-k-lh-replica-tuning branch from 02c16d8 to b7a32e4 Compare May 21, 2026 09:38

andrewd-zededa force-pushed the eve-k-lh-replica-tuning branch from b7a32e4 to 3ad5743 Compare May 21, 2026 13:46

eriknordmark merged commit 324bd3c into lf-edge:master May 21, 2026
21 checks passed

eriknordmark merged 1 commit into lf-edge:master from andrewd-zededa:eve-k-lh-replica-tuning
