refactor: per-patch lifecycle state machine by eseidel · Pull Request #352 · shorebirdtech/updater

eseidel · 2026-05-05T00:05:19Z

Why

Per-patch state was scattered across six stores — the download_state.rs sidecar JSON, the bare files in downloads/, the inflated .full file, the patches/{N}/dlc.vmcode artifact, the next_boot_patch / last_booted_patch / currently_booting_patch / known_bad_patches fields of PatchesState, and the session-only running_patch global. There was no single source of truth for "what state is patch N in?"; every behavior change (most recently the 416 fix in #351) added another orthogonal axis instead of moving the patch through defined transitions.

This PR consolidates the persistent per-patch state into a single document at {cache}/patches/{N}/state.json driven by an explicit state machine.

On-disk layout (per release)

{cache}/
  state.json                    # {client_id, release_version, queued_events}
  pointers.json                 # ReleasePointers (patch numbers only)
  patches/
    {N}/
      state.json                # PatchState — source of truth for patch N
      download                  # compressed bytes (Downloading/Downloaded only)
      dlc.vmcode                # installed artifact (Installed only)

Everything under patches/ is wiped on release-version change. Within a release, state.json survives as a Bad tombstone even after artifact files are removed — that's how rollback to a previously-known-bad patch is prevented when newer patches fail.

State machine

enum PatchState {
    Downloading { url, hash, signature, partial_size },
    Downloaded  { url, hash, signature, size },
    Installed   { hash, signature, size },
    Bad         { reason, hash?, signature?, size? },   // tombstone
}

enum BadReason {
    BootCrash,            // boot started but never recorded success
    InvalidPatchBytes,    // inflate failed (zstd magic / decompression)
    InstallHashMismatch,  // bytes decompressed but produced the wrong hash
    ValidationFailed,     // size or signature check failed at boot time
}

pointers.json carries only patch numbers (next_boot_patch, last_booted_patch, currently_booting_patch, boot_started_at). Metadata lives once per patch in its state.json.

API

The lifecycle exposes two state-aware operations callers don't have to choose between:

mark_bad(n, reason) — writes a Bad tombstone and removes artifact files. Sugar over write_state(Bad{...}) + cleanup.
cleanup(n) — state-aware retirement. If currently Bad, keeps the tombstone and removes only artifact files. Otherwise removes the entire patch directory.

Plus the lifecycle transitions used by update_internal and report_launch_*:

decide_start(n, url, hash) → DownloadAction { Fresh, Resume{offset}, Complete, Skip }
record_download_started, record_download_complete, record_install_complete
record_boot_start, record_boot_success, record_boot_failure
detect_boot_crash_on_init (runs on init; promotes a stranded currently_booting_patch to Bad{BootCrash})
validate_next_boot_patch (size + signature; marks Bad{ValidationFailed} on failure)
recompute_next_boot (state-aware fallback walk that doesn't clobber a valid newer next_boot_patch)

Behaviors that fall out

416 redownload loop (fix: skip fetch when prior download is already complete on disk #351): decide_start returning Complete skips the network when bytes are already on disk; mark_bad(InstallHashMismatch) short-circuits subsequent cycles via Skip(KnownBad). The post-fix: skip fetch when prior download is already complete on disk #351 redownload loop (every cycle: download → hash mismatch → cleanup → repeat) is now bounded to one cycle per (release, patch_number).
Server rollback un-rollback: uninstall_patch calls cleanup (state-aware), not the old "always-add-to-known-bad" path. A patch the server later un-rolls-back will be redownloaded — but a patch that crashed locally and is then rolled back keeps its Bad{BootCrash} tombstone, so the server's "it's fine again" can't override device-local "this crashed."
Chunked-encoding resume: decide_start reads the on-disk file size as the resume offset, not a recorded count, so chunked-encoded responses (no Content-Length) work the same as fixed-length ones. The state.json is written before the download starts.
OS-evicted partial files: decide_start cross-checks the download file's existence for Downloading / Downloaded states. If the file was evicted, the state is unrecoverable — fall through to Fresh.

Diff scope

library/src/cache/lifecycle.rs     | 1365 +++  (new)
library/src/cache/patch_manager.rs | 1615 ---  (deleted)
library/src/download_state.rs      |  126 ---  (deleted)
library/src/cache/updater_state.rs |  622 +-   (rewritten on top of lifecycle)
library/src/updater.rs             |  914 +-   (update_internal cutover, helpers removed)
library/src/cache/mod.rs           |    4 +-
library/src/lib.rs                 |    1 -
7 files changed, 1890 insertions(+), 2757 deletions(-)

Net −867 lines of code despite adding a full state-machine module — the patch_manager / download_state / scattered-state-helpers tonnage outweighs it.

Test plan

cargo test — 213 tests pass (down from 257 because the patch_manager unit tests are gone with the module). End-to-end coverage in updater.rs::tests is unchanged and all green.
67 new lifecycle unit tests covering: serde roundtrip, corrupt state.json fail-safe, mark_bad metadata preservation across all source states, cleanup state-awareness (Bad → tombstone, else → forget), idempotency, decide_start across all states + URL/hash mismatches + missing-file recovery, all download/install transitions, full boot lifecycle (start/success/failure), crash-on-init detection, fallback walk, validation.
Behavior of every updater::tests resume / rollback / state-recovery / install-failure scenario reproduced end-to-end against the new layout.
cargo fmt --check clean.

Caveats

Bumps the on-disk format. Per the design discussion, everything under {cache}/patches/ is wiped on release-version change — no migration needed because per-release state can change freely between releases. Devices upgrading to a release shipped with this PR will redownload any in-flight patch once. This is consistent with how per-release state was always handled.
config.download_dir is now vestigial. The lifecycle uses {storage_dir}/patches/{N}/ so the prior code_cache_dir/downloads location is no longer used. The field is left in UpdateConfig to keep the C API stable; it can be removed in a follow-up.
FileOperation::DeleteDir is now only used by tests. Compile-time warning; safe to ignore or remove the variant in a follow-up.

codecov · 2026-05-05T00:07:15Z

Codecov Report

❌ Patch coverage is 97.46105% with 44 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
library/src/cache/lifecycle.rs	97.98%	24 Missing ⚠️
library/src/cache/updater_state.rs	96.33%	11 Missing ⚠️
library/src/updater.rs	94.97%	9 Missing ⚠️

📢 Thoughts on this report? Let us know!

Nine fixes from the self-review pass on #352: - Drop `partial_size` from PatchState::Downloading. The field was misleading (decide_start reads from disk, not from the recorded value) and unused. record_download_started loses its 5th arg. - Gate `UpdaterState::install_patch` to `#[cfg(test)]`. Production no longer routes through it; only test_utils and the updater_state tests do. The gate makes the divergence intentional and prevents future production callers. - install_patch defensively removes any prior `dlc.vmcode` before rename so behavior is OS-agnostic (POSIX rename overwrites silently; Windows fails). Also mirrors record_install_complete's cleanup of a stale `download` file in the patch dir. - Document on `lifecycle()` / `lifecycle_mut()` that the direct accessors are intentional — wrapping every transition would be churn for no reader benefit. - recompute_next_boot now clears `last_booted_patch` when its on-disk record is gone (Unknown), so pointers.json doesn't accumulate references to nothing. A `Bad` last_booted is left alone — that's a useful breadcrumb and recompute simply doesn't promote it. - Promote `download_artifact_path` and `installed_artifact_path` to methods on PatchLifecycle for symmetry with state_path / pointers_path. Updates all call sites. - Document mark_bad-on-Bad merge semantics: latest reason wins, other fields preserved. Hypothetical in practice but no longer silent. - Add tests for recompute_next_boot's stale-pointer clearing (Unknown → cleared, Bad → kept). Brings the suite to 214 tests. The mark_bad-cleanup-stale-bytes recovery path I flagged was already covered by `cleanup_on_bad_patch_keeps_tombstone` — no new test needed.

eseidel · 2026-05-05T20:13:14Z

@bdero asked for us to stand up end-to-end testing first.

eseidel · 2026-05-05T23:04:58Z

Built an integration test system per your request: #353

…tence) First slice of shorebirdtech/shorebird#3737 — replace the scattered storage of per-patch state across download_state.rs sidecars, bare files in downloads/, and PatchesState fields (next_boot_patch, last_booted_patch, known_bad_patches) with a single per-patch state.json driven by an explicit state machine. This commit only adds the types and the storage layer; nothing wires into update_internal or report_launch_* yet. Subsequent commits will build out transitions and migrate the call sites. States: - Downloading { url, hash, signature, partial_size } - Downloaded { url, hash, signature, size } - Installed { hash, signature, size } - Bad { reason, hash?, signature?, size? } // tombstone Operations: - mark_bad(n, reason) — sugar over write Bad{} + cleanup - cleanup(n) — state-aware: keeps Bad tombstone, else forgets dir Per-release pointers (next_boot, last_booted, currently_booting, boot_started_at) move to a separate pointers.json holding patch numbers; metadata lives once per patch in state.json instead of duplicated across pointers. Persistence rides on the existing atomic disk_io::write (sibling- write + rename) so partial writes can't leave torn files.

Extends the lifecycle module with the methods update_internal and report_launch_* will need: - decide_start(n, url, hash) → DownloadAction (Fresh, Resume, Complete, Skip) - record_download_started / _complete - record_install_complete (transitions Downloaded → Installed, removes the now-unneeded compressed download file) - promote_to_next_boot (transitions a freshly Installed patch into pointers.next_boot, retiring any unbooted predecessor) - record_boot_start / _success / _failure - detect_boot_crash_on_init (handles the breadcrumb left when a prior process crashed during boot) - validate_next_boot_patch (size + signature checks; marks Bad{ValidationFailed} on failure) - recompute_next_boot Nothing wires into the existing updater.rs / report_launch_* yet — those changes follow in the cutover commit.

…ycle Drops the patch_manager dependency from updater_state.rs. UpdaterState now owns a PatchLifecycle directly, with all patch-related methods delegating to it: install_patch → write Installed state + promote to next_boot next_boot_patch → pointers.next_boot_patch + installed_artifact_path is_known_bad_patch → read_state matches Bad uninstall_patch → cleanup + recompute_next_boot record_boot_* → lifecycle's boot transitions validate_next_boot_patch → lifecycle's signature/size check UpdaterState's own state.json shrinks to {client_id, release_version, queued_events}. Per-release per-patch state lives entirely in the lifecycle (pointers.json + patches/{N}/state.json). Other notable behavior changes: - record_boot_failure_for_patch no longer requires currently_booting_patch to match. Matches the prior PatchManager semantics: clear breadcrumb, mark Bad{BootCrash}, recompute. - recompute_next_boot leaves a valid Installed next_boot_patch alone instead of promoting last_booted over it. Without this, processing server rollbacks would clobber a freshly-installed newer patch. Tests in updater.rs that asserted file paths (patches_state.json) update to the new layout (pointers.json). The MockManagePatches-backed unit tests in updater_state.rs are gone — replaced by direct end-to-end tests that exercise PatchLifecycle through UpdaterState. patch_manager.rs is still on disk and compiled (nothing references it from cache/mod.rs anymore) — deletion happens after the updater.rs cutover.

Replaces the download_state.rs sidecar machinery and the bespoke DownloadStartState / should_install_patch helpers in update_internal with calls into PatchLifecycle. Concrete changes: - Drops compute_resume_offset / determine_download_start_state / DownloadStartState — replaced by PatchLifecycle::decide_start returning DownloadAction. - Drops should_install_patch / ShouldInstallPatchCheckResult — folded into the same DownloadAction match. KnownBad → UpdateIsBadPatch, AlreadyInstalled → NoUpdate, the rest proceed. - Drops install_downloaded_patch's bespoke "stage in downloads/, rename on install" choreography. The lifecycle owns the per-patch directory; inflate writes directly into patches/{N}/dlc.vmcode and the Downloaded → Installed transition removes the now-unneeded compressed download file. - Drops cleanup_download_artifacts and clean_download_dir. mark_bad handles tombstone-aware cleanup for failures, and the lifecycle owns its directory. - Marks the patch Bad{InvalidPatchBytes} when inflate fails and Bad{InstallHashMismatch} when check_hash fails. Subsequent attempts short-circuit at decide_start (Skip(KnownBad)) instead of re-downloading and re-failing on every cycle. This is the marks-bad- on-install-failure behavior we deferred from #351. - Preserves the explicit "server-side Content-Length vs total_bytes" mismatch check — surfaces the contract violation directly instead of obliquely via inflate failure. decide_start consults the on-disk download file as the source of truth for "how many bytes do we have so far." The denormalized partial_size in PatchState::Downloading is kept for diagnostics/serde stability but not consulted for the resume offset; this matches the prior file-size-on-disk behavior and avoids needing to update state.json mid-stream from the network layer. For Downloading / Downloaded states where the OS evicted the artifact file out from under us (e.g. iOS code-cache eviction), decide_start falls through to Fresh so the next attempt re-downloads from scratch. The failing-test fixes update file paths and pre-staged state from the old layout (downloads/{N} + downloads/{N}.download.json) to the new layout (patches/{N}/download + patches/{N}/state.json). Several tests that targeted the now-deleted helpers directly are removed; their behavior is covered by the new lifecycle unit tests.

Both modules are now subsumed by PatchLifecycle: - patch_manager.rs (~1600 lines) managed the per-release `patches/` tree, the `next_boot_patch` / `last_booted_patch` / `currently_booting_patch` / `known_bad_patches` fields of `PatchesState`, and the fall-back-from-bad-patch logic. All of this is now in lifecycle.rs split between `PatchLifecycle` (per-patch state.json) and `ReleasePointers` (a single pointers.json). - download_state.rs (~125 lines) wrote the per-download sidecar JSON. Now folded into PatchState::Downloading / PatchState::Downloaded variants in state.json, owned by the lifecycle module. The MockManagePatches / ManagePatches trait machinery in updater_state.rs's tests goes away with patch_manager. The lifecycle operations are tested directly in cache::lifecycle::tests against a real on-disk filesystem under TempDir, which catches issues the mocks couldn't (e.g. file-existence cross-checks in `decide_start`). The cargo workspace now has 212 passing tests vs. 257 before, the delta being the patch_manager unit tests whose subjects no longer exist. End-to-end coverage in `updater.rs::tests` is unchanged and all green.

Nine fixes from the self-review pass on #352: - Drop `partial_size` from PatchState::Downloading. The field was misleading (decide_start reads from disk, not from the recorded value) and unused. record_download_started loses its 5th arg. - Gate `UpdaterState::install_patch` to `#[cfg(test)]`. Production no longer routes through it; only test_utils and the updater_state tests do. The gate makes the divergence intentional and prevents future production callers. - install_patch defensively removes any prior `dlc.vmcode` before rename so behavior is OS-agnostic (POSIX rename overwrites silently; Windows fails). Also mirrors record_install_complete's cleanup of a stale `download` file in the patch dir. - Document on `lifecycle()` / `lifecycle_mut()` that the direct accessors are intentional — wrapping every transition would be churn for no reader benefit. - recompute_next_boot now clears `last_booted_patch` when its on-disk record is gone (Unknown), so pointers.json doesn't accumulate references to nothing. A `Bad` last_booted is left alone — that's a useful breadcrumb and recompute simply doesn't promote it. - Promote `download_artifact_path` and `installed_artifact_path` to methods on PatchLifecycle for symmetry with state_path / pointers_path. Updates all call sites. - Document mark_bad-on-Bad merge semantics: latest reason wins, other fields preserved. Hypothetical in practice but no longer silent. - Add tests for recompute_next_boot's stale-pointer clearing (Unknown → cleared, Bad → kept). Brings the suite to 214 tests. The mark_bad-cleanup-stale-bytes recovery path I flagged was already covered by `cleanup_on_bad_patch_keeps_tombstone` — no new test needed.

Two issues introduced by 07ea84a that the second self-review caught: - update_internal computed download_path / installed_path via `with_state(...).lifecycle().download_artifact_path(n)`. That routed a pure-function path computation through a state read, serializing it against other state operations and adding a disk read per call. Restore the free `download_artifact_path` / `installed_artifact_path` functions as the canonical entry point; the methods on PatchLifecycle are kept as thin wrappers for callers that already hold a lifecycle handle. - install_patch's "defensive remove_file before rename" added a TOCTOU window between the exists() check and the rename for no real benefit — install_patch is now `#[cfg(test)]`-only, the Windows-rename concern was theoretical for tests, and POSIX rename atomically overwrites. Drop the explicit remove_file.

Three CI failures from the prior push: - `Context` and `bail` imports in updater_state.rs are only used inside the `#[cfg(test)] install_patch` helper. Moved them under `#[cfg(test)]` so non-test builds don't see unused imports. - `PathBuf` import in updater.rs became unused after switching the download/install paths to take `Path::new(&config.storage_dir)` directly. Dropped. - `FileOperation::DeleteDir` is no longer constructed by production code (was used by the deleted patch_manager.rs). Mirrored the existing `#[allow(dead_code)]` annotation already on `DeleteFile` rather than removing the variant — the format/handling code for it is still useful for any future code that needs to surface a "delete dir failed" error. CSpell additions: `unparseable`, `tombstoned`, `roundtrips` — words introduced by the lifecycle module's docs and test names.

The previous fix removed `PathBuf` from the top-level use, which broke the lib build on non-android non-test platforms (macOS, Linux, Windows) where `libapp_path_from_settings` uses `PathBuf`. That function is itself gated to `cfg(not(any(target_os = \"android\", test)))`, so the PathBuf import needs the same gate. Restoring it as a cfg-gated separate import — keeps the lib build green on all three runners and keeps the lib-test build clean under -D warnings.

Adds 7 tests targeting branches that were uncovered: - mark_bad_from_downloading_records_partial_file_size: the Downloading source state in mark_bad reads bytes-on-disk for the recorded `size` field; the existing tests only covered Installed → Bad. - mark_bad_overwrites_reason_when_already_bad: documented merge semantics (latest reason wins, other fields preserved) now has a test backing the doc. - validate_next_boot_patch_marks_bad_when_artifact_missing: the "Installed state but dlc.vmcode is gone" path. Existing test covered size mismatch; this covers the missing-file branch. - validate_next_boot_patch_marks_bad_when_pointer_targets_non_installed: the case where the next_boot pointer references a state other than Installed (state.json + pointers can diverge through corruption). - install_patch_install_only_{accepts_valid,rejects_missing,rejects_bad}_signature: cover the InstallOnly verification path in UpdaterState::install_patch using the existing test keypair from signing.rs. Coverage improvements: - cache/lifecycle.rs: 94.65% → 96.06% lines, 98.72% → 100% fns - cache/updater_state.rs: 94.34% → 96.94% lines, 95.65% → 98% fns - total: 94.70% → 95.11% lines 214 → 221 tests, all passing under -D warnings.

Goes through every test deleted with patch_manager.rs (~40) and either maps it to a new test (with a `Ports patch_manager.rs::mod::name` comment) or adds a port. Three behavioral regressions caught and fixed: - record_boot_start now requires next_boot_patch to match the arg. Old PatchManager had this defensive check; my refactor dropped it. Carries forward the engine-vs-state agreement guard. - Added strict-mode signature tests for validate_next_boot_patch (valid, missing, invalid, no public key, bad public key). Ports five `validate_next_boot_patch_tests::strict_mode_*` tests. - Added InstallOnly + no public_key test (ports add_patch_tests::install_only_succeeds_with_any_signature_if_no_public_key). Plus two scenario tests that didn't have direct coverage: - rolled_back_patch_not_resurrected_when_replacement_fails: ports fall_back_tests::rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch against the new state-based fallback (cleanup forgets, recompute won't promote a None state). - validate_then_promote_catches_corrupted_last_booted: ports validate_next_boot_patch_tests::does_not_fall_back_to_last_booted_patch_if_corrupted with a note: new code catches the corruption at the next validate pass instead of at fall-back time, but the user-visible outcome (boot the base release) is the same. Test helper split: `install_state` (writes Installed without touching pointers) vs callers explicitly setting pointers. Avoids the prior helper's auto-promote masking pointer-management behavior the tests were trying to exercise. Tests deliberately not ported (with reasoning): - debug_tests::manage_patches_is_debug, patch_manager_is_debug: Trait-Debug tests for types that no longer exist. Replaced implicitly by `#[derive(Debug)]` on PatchLifecycle. - fall_back_tests::succeeds_if_deleting_artifacts_fails: needs filesystem fault injection that wasn't worth the test infrastructure to bring back. - record_boot_success_for_patch_tests::deletes_unrecognized_directories_in_patches_dir: behavior change — new cleanup_older_than skips non-numeric directory names rather than deleting them. Defensive; the prior `rm -rf` was overly aggressive. Coverage: 222 → 230 tests, all green under -D warnings.

Both turned out trivial to port — neither was actually doing fault injection, just exploiting graceful-failure paths. - record_boot_success_deletes_unrecognized_directories_in_patches_dir: pre-creates a junk dir and a stray file in patches/ before the install/boot, asserts they're gone after record_boot_success. Restores the prior `cleanup_older_than` behavior of removing non-numeric entries — we own patches/, so anything not named like a patch number is corruption to sweep up. - record_boot_failure_succeeds_if_artifact_dirs_are_already_gone: pre-deletes both patch dirs and verifies record_boot_failure still completes correctly (mark_bad recreates the dir for the tombstone; cleanup paths are graceful when the dir is missing). Reverts the prior behavior change in cleanup_older_than that skipped non-numeric entries instead of deleting them. The "defensive" framing was wrong — we own patches/, anything unexpected there came from a different version of our code or actual corruption, and leaving it behind is the wrong default.

- validate_next_boot_strict_mode_succeeds_with_valid_signature: ports `patch_manager.rs::validate_next_boot_patch_tests::strict_mode_succeeds_with_valid_signature_at_boot_time`. Required reusing the prior test fixture's INFLATED_PATCH_HASH / INFLATED_PATCH_SIGNATURE constants (a real signature for the 1-byte file content `b"1"`, matching TEST_PUBLIC_KEY's private half). - record_boot_failure_keeps_last_booted_pointer_when_failed_patch_was_last_booted: ports `patch_manager.rs::record_boot_failure_for_patch_tests::preserves_last_booted_patch_on_failure_but_marks_bad`. The new behavior is similar but more explicit: last_booted's *pointer* is preserved (with the patch now in Bad state); the underlying intent — "we still know what last booted, even after that patch failed" — survives. Test helper split: `install_signed` (failure paths, mismatched hash OK) vs `install_with_valid_signature` (happy path, real fixture). Now every behavioral test from the deleted patch_manager.rs has a corresponding new test with a porting comment. Two trivial ones not ported: the `Debug` impl tests for traits/types that no longer exist. 236 tests, all green under -D warnings.

Reverts an unintended policy change in the cutover: my refactor moved compressed download bytes from `{code_cache_dir}/downloads/{N}` (the OS-managed cache dir, evictable under storage pressure) into `{storage_dir}/patches/{N}/download` (persistent app storage). That made downloads count against persistent storage and survive in iCloud backups — neither of which we want for transient bytes. Restoring the original split: - `{state_root}/patches/{N}/{state.json,dlc.vmcode}` — persistent state and the installed artifact. - `{download_root}/{N}` — flat, in the OS cache dir. `PatchLifecycle::load_or_default` now takes both roots. The two artifact-path helpers each route to their respective root, the free-function variants used by `update_internal` now take the right root, and `mark_bad`/`cleanup`/`record_install_complete` clean both locations as appropriate. `UpdaterState::create_new_and_save` also wipes `download_dir` on release-version change, in addition to the persistent patches/ tree. `update_internal` and tests updated to use the right root for each file type. Test fixtures pass `tmp.path().join("downloads")` as the download root (separate subdir of the same TempDir). Net change in test surface: moved a handful of `patch_dir/download` references to `downloads_dir/N`, no behavioral changes to tests themselves. 236 tests pass under -D warnings.

Self-review caught two gaps in the download-root split: - release_version_change_wipes_download_dir: the cache-rooted download dir should be wiped along with the persistent patches/ tree on a release-version mismatch. Adds explicit coverage — the code path was already in create_new_and_save but not directly tested. - record_boot_success_promotes_and_cleans_older: extends the existing test to drop a stale download in download_root alongside an older patch's state, asserts that cleanup_older_than's chain (cleanup → forget_dir) deletes both roots' artifacts. Also adds a comment on cleanup_older_than noting that it only walks the persistent patches/ tree — orphan downloads (no state.json) would persist until the OS evicts them. Noted as a known limitation, not blocking.

… walk Two fixes to the release-version-change wipe and one to the boot- success cleanup walk. - The "wipe everything in cache_dir" approach broke 27 tests whose setup uses cache_dir as the test temp dir (with base.apk and libapp.so as siblings). Production gives shorebird a dedicated subdir but the API doesn't enforce that. Switched to a targeted wipe with a documented allowlist of paths shorebird has ever written under cache_dir: `SHOREBIRD_OWNED_PATHS`. - Added `patches_state.json` to the wipe list — that's the legacy file from the prior `PatchManager` implementation, and devices upgrading through this PR would otherwise orphan it. New test `release_version_change_wipes_legacy_patches_state_json` covers it. - `cleanup_older_than` now also calls `cleanup_orphan_downloads`, which walks `download_root/` and removes any file that doesn't correspond to a patch in `Downloading`/`Downloaded` state. We own the cache root and shouldn't rely on OS eviction. Catches the "state.json gone but download lingered" case the prior comment handwaved. Adding a new file under cache_dir means adding it to SHOREBIRD_OWNED_PATHS — small bookkeeping cost, but it lets us keep cache_dir co-tenant with embedder-provided files.

The new orphan-download sweep distinguishes four cases but only the "older patch via cleanup chain" branch was being exercised. Single test now drops one of each kind of file into download_root before a boot success and asserts which survive: - orphan (numeric, no state.json): removed - stale (numeric, state is Installed): removed - non-numeric name: removed - live (numeric, state is Downloading): preserved

bdero

Overall

Right architecture, right time. Consolidating the six scattered state stores into one per-patch state.json driven by an explicit machine is the move, and cleanup being state-aware (Bad → keep tombstone, else → forget entirely) cleanly separates the two prior conflated concerns ("server told us not to use this" vs "this actually crashed on this device"). #356 falls out as a strict subset once this lands — happy to close it.

A handful of suggestions, no blockers. Inline comments tagged 💡 (suggestion) and 🧹 (nit).

💡 No end-to-end test for the customer scenario

library/src/updater.rs doesn't have an end-to-end test that pulls the customer scenario in shorebirdtech/shorebird#3728 through update_internal (install → server rolled_back: [1] → server returns patch 1 again with rolled_back: [] → assert reinstall). The behavior is correct at the lifecycle unit level (cleanup_on_non_bad_patch_forgets_entirely), but the customer reported it through the FFI, so a test at that level would directly nail the regression we just shipped a hotfix for in #356. Cheap to port from #356.

(Inlining this in the body since updater.rs line ~2229 — the natural anchor next to the existing rollback tests — isn't in a diff hunk.)

bdero

passes onceover for me, bot found some minor suggestions. Let's merge today and I'll launch a prod smoke test for it.

…eview bdero's review on #352 surfaced six items worth landing inline: 1. record_boot_failure used to clear `currently_booting_patch` before `mark_bad`. A crash between the two left the patch `Installed` with no breadcrumb, so `detect_boot_crash_on_init` wouldn't fire next boot — the patch silently retried. Flipped: mark_bad first, then clear. mark_bad on Bad is idempotent, so the worst case is a redundant tombstone-rewrite on next init. 2. Drop the `hash` field from `PatchState::Installed`. It was recorded at install but never read at boot — Strict-mode validation recomputes the hash from the artifact's bytes and feeds it into `check_signature`. A hash that lives only on disk can't be trusted as a security input; deleting the field removes the temptation. `Downloading.hash`/`Downloaded.hash` stay as comparators against the server's freshly-delivered hash (a tampered on-disk hash there just causes a redownload). 3. Port the customer-scenario test from #356 into `c_api/mod.rs`: server rolls patch 1 back, then forward again with the same number and hash. Pre-lifecycle this was permanently dropped via `known_bad_patches`. Post-lifecycle `cleanup` is state-aware and forgets the patch entirely on server-driven rollback, so the rollforward installs cleanly. End-to-end coverage of the regression #356 hotfixed. 4. Add a TODO + target version on the `patches_state.json` entry in `SHOREBIRD_OWNED_PATHS` so the legacy wipe doesn't outlive its usefulness. 5. Document that `running_patch` lives in `config.rs` (session-scoped), not in this module — saves a future grep. 6. Clarify `recompute_next_boot`'s fallback policy: it's fall-back-to-`last_booted_patch`-only, not fall-back-to-anything- bootable. A fresh release whose first Installed patch fails validation goes to base, even if older patches sit in `patches/`. cargo test: 242 passed (was 241).

eseidel marked this pull request as ready for review May 5, 2026 20:05

eseidel marked this pull request as draft May 5, 2026 20:13

eseidel marked this pull request as ready for review May 5, 2026 23:24

eseidel requested a review from bdero May 7, 2026 21:52

eseidel added 19 commits May 7, 2026 14:53

ci: add 'keypair' to cspell dictionary

d23b2d5

ci: add 'embedder' to cspell dictionary

b189390

eseidel force-pushed the refactor/patch-lifecycle-state-machine branch from 79848b0 to b189390 Compare May 7, 2026 21:55

eseidel mentioned this pull request May 7, 2026

fix: don't permanently ban patch numbers on server-driven rollback #356

Open

6 tasks

bdero reviewed May 7, 2026

View reviewed changes

bdero approved these changes May 7, 2026

View reviewed changes

eseidel enabled auto-merge (squash) May 7, 2026 22:45

ci: add 'rollforward' to cspell dictionary

d4384a0

eseidel merged commit 90c3492 into main May 7, 2026
18 checks passed

eseidel deleted the refactor/patch-lifecycle-state-machine branch May 7, 2026 23:01

lukemmtt mentioned this pull request May 10, 2026

feat(prefetch): Pre-fetch patches for sibling releases on the same track #357

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: per-patch lifecycle state machine#352

refactor: per-patch lifecycle state machine#352
eseidel merged 21 commits into
mainfrom
refactor/patch-lifecycle-state-machine

eseidel commented May 5, 2026

Uh oh!

codecov Bot commented May 5, 2026 •

edited

Loading

Uh oh!

eseidel commented May 5, 2026

Uh oh!

eseidel commented May 5, 2026

Uh oh!

bdero left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bdero left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

eseidel commented May 5, 2026

Why

On-disk layout (per release)

State machine

API

Behaviors that fall out

Diff scope

Test plan

Caveats

Uh oh!

codecov Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

eseidel commented May 5, 2026

Uh oh!

eseidel commented May 5, 2026

Uh oh!

bdero left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Overall

💡 No end-to-end test for the customer scenario

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bdero left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 5, 2026 •

edited

Loading

bdero left a comment •

edited

Loading