Skip to content

refactor: per-patch lifecycle state machine#352

Merged
eseidel merged 21 commits into
mainfrom
refactor/patch-lifecycle-state-machine
May 7, 2026
Merged

refactor: per-patch lifecycle state machine#352
eseidel merged 21 commits into
mainfrom
refactor/patch-lifecycle-state-machine

Conversation

@eseidel
Copy link
Copy Markdown
Contributor

@eseidel eseidel commented May 5, 2026

Closes shorebirdtech/shorebird#3737.

Why

Per-patch state was scattered across six stores — the download_state.rs sidecar JSON, the bare files in downloads/, the inflated .full file, the patches/{N}/dlc.vmcode artifact, the next_boot_patch / last_booted_patch / currently_booting_patch / known_bad_patches fields of PatchesState, and the session-only running_patch global. There was no single source of truth for "what state is patch N in?"; every behavior change (most recently the 416 fix in #351) added another orthogonal axis instead of moving the patch through defined transitions.

This PR consolidates the persistent per-patch state into a single document at {cache}/patches/{N}/state.json driven by an explicit state machine.

On-disk layout (per release)

{cache}/
  state.json                    # {client_id, release_version, queued_events}
  pointers.json                 # ReleasePointers (patch numbers only)
  patches/
    {N}/
      state.json                # PatchState — source of truth for patch N
      download                  # compressed bytes (Downloading/Downloaded only)
      dlc.vmcode                # installed artifact (Installed only)

Everything under patches/ is wiped on release-version change. Within a release, state.json survives as a Bad tombstone even after artifact files are removed — that's how rollback to a previously-known-bad patch is prevented when newer patches fail.

State machine

enum PatchState {
    Downloading { url, hash, signature, partial_size },
    Downloaded  { url, hash, signature, size },
    Installed   { hash, signature, size },
    Bad         { reason, hash?, signature?, size? },   // tombstone
}

enum BadReason {
    BootCrash,            // boot started but never recorded success
    InvalidPatchBytes,    // inflate failed (zstd magic / decompression)
    InstallHashMismatch,  // bytes decompressed but produced the wrong hash
    ValidationFailed,     // size or signature check failed at boot time
}

pointers.json carries only patch numbers (next_boot_patch, last_booted_patch, currently_booting_patch, boot_started_at). Metadata lives once per patch in its state.json.

API

The lifecycle exposes two state-aware operations callers don't have to choose between:

  • mark_bad(n, reason) — writes a Bad tombstone and removes artifact files. Sugar over write_state(Bad{...}) + cleanup.
  • cleanup(n) — state-aware retirement. If currently Bad, keeps the tombstone and removes only artifact files. Otherwise removes the entire patch directory.

Plus the lifecycle transitions used by update_internal and report_launch_*:

  • decide_start(n, url, hash) → DownloadAction { Fresh, Resume{offset}, Complete, Skip }
  • record_download_started, record_download_complete, record_install_complete
  • record_boot_start, record_boot_success, record_boot_failure
  • detect_boot_crash_on_init (runs on init; promotes a stranded currently_booting_patch to Bad{BootCrash})
  • validate_next_boot_patch (size + signature; marks Bad{ValidationFailed} on failure)
  • recompute_next_boot (state-aware fallback walk that doesn't clobber a valid newer next_boot_patch)

Behaviors that fall out

  • 416 redownload loop (fix: skip fetch when prior download is already complete on disk #351): decide_start returning Complete skips the network when bytes are already on disk; mark_bad(InstallHashMismatch) short-circuits subsequent cycles via Skip(KnownBad). The post-fix: skip fetch when prior download is already complete on disk #351 redownload loop (every cycle: download → hash mismatch → cleanup → repeat) is now bounded to one cycle per (release, patch_number).
  • Server rollback un-rollback: uninstall_patch calls cleanup (state-aware), not the old "always-add-to-known-bad" path. A patch the server later un-rolls-back will be redownloaded — but a patch that crashed locally and is then rolled back keeps its Bad{BootCrash} tombstone, so the server's "it's fine again" can't override device-local "this crashed."
  • Chunked-encoding resume: decide_start reads the on-disk file size as the resume offset, not a recorded count, so chunked-encoded responses (no Content-Length) work the same as fixed-length ones. The state.json is written before the download starts.
  • OS-evicted partial files: decide_start cross-checks the download file's existence for Downloading / Downloaded states. If the file was evicted, the state is unrecoverable — fall through to Fresh.

Diff scope

library/src/cache/lifecycle.rs     | 1365 +++  (new)
library/src/cache/patch_manager.rs | 1615 ---  (deleted)
library/src/download_state.rs      |  126 ---  (deleted)
library/src/cache/updater_state.rs |  622 +-   (rewritten on top of lifecycle)
library/src/updater.rs             |  914 +-   (update_internal cutover, helpers removed)
library/src/cache/mod.rs           |    4 +-
library/src/lib.rs                 |    1 -
7 files changed, 1890 insertions(+), 2757 deletions(-)

Net −867 lines of code despite adding a full state-machine module — the patch_manager / download_state / scattered-state-helpers tonnage outweighs it.

Test plan

  • cargo test — 213 tests pass (down from 257 because the patch_manager unit tests are gone with the module). End-to-end coverage in updater.rs::tests is unchanged and all green.
  • 67 new lifecycle unit tests covering: serde roundtrip, corrupt state.json fail-safe, mark_bad metadata preservation across all source states, cleanup state-awareness (Bad → tombstone, else → forget), idempotency, decide_start across all states + URL/hash mismatches + missing-file recovery, all download/install transitions, full boot lifecycle (start/success/failure), crash-on-init detection, fallback walk, validation.
  • Behavior of every updater::tests resume / rollback / state-recovery / install-failure scenario reproduced end-to-end against the new layout.
  • cargo fmt --check clean.

Caveats

  • Bumps the on-disk format. Per the design discussion, everything under {cache}/patches/ is wiped on release-version change — no migration needed because per-release state can change freely between releases. Devices upgrading to a release shipped with this PR will redownload any in-flight patch once. This is consistent with how per-release state was always handled.
  • config.download_dir is now vestigial. The lifecycle uses {storage_dir}/patches/{N}/ so the prior code_cache_dir/downloads location is no longer used. The field is left in UpdateConfig to keep the C API stable; it can be removed in a follow-up.
  • FileOperation::DeleteDir is now only used by tests. Compile-time warning; safe to ignore or remove the variant in a follow-up.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 97.46105% with 44 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
library/src/cache/lifecycle.rs 97.98% 24 Missing ⚠️
library/src/cache/updater_state.rs 96.33% 11 Missing ⚠️
library/src/updater.rs 94.97% 9 Missing ⚠️

📢 Thoughts on this report? Let us know!

eseidel added a commit that referenced this pull request May 5, 2026
Nine fixes from the self-review pass on #352:

  - Drop `partial_size` from PatchState::Downloading. The field was
    misleading (decide_start reads from disk, not from the recorded
    value) and unused. record_download_started loses its 5th arg.
  - Gate `UpdaterState::install_patch` to `#[cfg(test)]`. Production
    no longer routes through it; only test_utils and the updater_state
    tests do. The gate makes the divergence intentional and prevents
    future production callers.
  - install_patch defensively removes any prior `dlc.vmcode` before
    rename so behavior is OS-agnostic (POSIX rename overwrites silently;
    Windows fails). Also mirrors record_install_complete's cleanup of
    a stale `download` file in the patch dir.
  - Document on `lifecycle()` / `lifecycle_mut()` that the direct
    accessors are intentional — wrapping every transition would be
    churn for no reader benefit.
  - recompute_next_boot now clears `last_booted_patch` when its
    on-disk record is gone (Unknown), so pointers.json doesn't
    accumulate references to nothing. A `Bad` last_booted is left
    alone — that's a useful breadcrumb and recompute simply doesn't
    promote it.
  - Promote `download_artifact_path` and `installed_artifact_path` to
    methods on PatchLifecycle for symmetry with state_path /
    pointers_path. Updates all call sites.
  - Document mark_bad-on-Bad merge semantics: latest reason wins,
    other fields preserved. Hypothetical in practice but no longer
    silent.
  - Add tests for recompute_next_boot's stale-pointer clearing
    (Unknown → cleared, Bad → kept). Brings the suite to 214 tests.

The mark_bad-cleanup-stale-bytes recovery path I flagged was already
covered by `cleanup_on_bad_patch_keeps_tombstone` — no new test needed.
@eseidel eseidel marked this pull request as ready for review May 5, 2026 20:05
@eseidel eseidel marked this pull request as draft May 5, 2026 20:13
@eseidel
Copy link
Copy Markdown
Contributor Author

eseidel commented May 5, 2026

@bdero asked for us to stand up end-to-end testing first.

@eseidel
Copy link
Copy Markdown
Contributor Author

eseidel commented May 5, 2026

Built an integration test system per your request: #353

@eseidel eseidel marked this pull request as ready for review May 5, 2026 23:24
@eseidel eseidel requested a review from bdero May 7, 2026 21:52
eseidel added 19 commits May 7, 2026 14:53
…tence)

First slice of shorebirdtech/shorebird#3737 — replace the scattered
storage of per-patch state across download_state.rs sidecars, bare
files in downloads/, and PatchesState fields (next_boot_patch,
last_booted_patch, known_bad_patches) with a single per-patch
state.json driven by an explicit state machine.

This commit only adds the types and the storage layer; nothing wires
into update_internal or report_launch_* yet. Subsequent commits will
build out transitions and migrate the call sites.

States:
  - Downloading { url, hash, signature, partial_size }
  - Downloaded  { url, hash, signature, size }
  - Installed   { hash, signature, size }
  - Bad         { reason, hash?, signature?, size? }      // tombstone

Operations:
  - mark_bad(n, reason) — sugar over write Bad{} + cleanup
  - cleanup(n) — state-aware: keeps Bad tombstone, else forgets dir

Per-release pointers (next_boot, last_booted, currently_booting,
boot_started_at) move to a separate pointers.json holding patch
numbers; metadata lives once per patch in state.json instead of
duplicated across pointers.

Persistence rides on the existing atomic disk_io::write (sibling-
write + rename) so partial writes can't leave torn files.
Extends the lifecycle module with the methods update_internal and
report_launch_* will need:

  - decide_start(n, url, hash) → DownloadAction (Fresh, Resume, Complete, Skip)
  - record_download_started / _complete
  - record_install_complete (transitions Downloaded → Installed,
    removes the now-unneeded compressed download file)
  - promote_to_next_boot (transitions a freshly Installed patch into
    pointers.next_boot, retiring any unbooted predecessor)
  - record_boot_start / _success / _failure
  - detect_boot_crash_on_init (handles the breadcrumb left when a
    prior process crashed during boot)
  - validate_next_boot_patch (size + signature checks; marks
    Bad{ValidationFailed} on failure)
  - recompute_next_boot

Nothing wires into the existing updater.rs / report_launch_* yet —
those changes follow in the cutover commit.
…ycle

Drops the patch_manager dependency from updater_state.rs. UpdaterState
now owns a PatchLifecycle directly, with all patch-related methods
delegating to it:

  install_patch        → write Installed state + promote to next_boot
  next_boot_patch      → pointers.next_boot_patch + installed_artifact_path
  is_known_bad_patch   → read_state matches Bad
  uninstall_patch      → cleanup + recompute_next_boot
  record_boot_*        → lifecycle's boot transitions
  validate_next_boot_patch → lifecycle's signature/size check

UpdaterState's own state.json shrinks to {client_id, release_version,
queued_events}. Per-release per-patch state lives entirely in the
lifecycle (pointers.json + patches/{N}/state.json).

Other notable behavior changes:

  - record_boot_failure_for_patch no longer requires
    currently_booting_patch to match. Matches the prior PatchManager
    semantics: clear breadcrumb, mark Bad{BootCrash}, recompute.
  - recompute_next_boot leaves a valid Installed next_boot_patch alone
    instead of promoting last_booted over it. Without this, processing
    server rollbacks would clobber a freshly-installed newer patch.

Tests in updater.rs that asserted file paths (patches_state.json) update
to the new layout (pointers.json). The MockManagePatches-backed unit
tests in updater_state.rs are gone — replaced by direct end-to-end tests
that exercise PatchLifecycle through UpdaterState.

patch_manager.rs is still on disk and compiled (nothing references it
from cache/mod.rs anymore) — deletion happens after the updater.rs
cutover.
Replaces the download_state.rs sidecar machinery and the bespoke
DownloadStartState / should_install_patch helpers in update_internal
with calls into PatchLifecycle.

Concrete changes:

  - Drops compute_resume_offset / determine_download_start_state /
    DownloadStartState — replaced by PatchLifecycle::decide_start
    returning DownloadAction.
  - Drops should_install_patch / ShouldInstallPatchCheckResult —
    folded into the same DownloadAction match. KnownBad → UpdateIsBadPatch,
    AlreadyInstalled → NoUpdate, the rest proceed.
  - Drops install_downloaded_patch's bespoke "stage in downloads/, rename
    on install" choreography. The lifecycle owns the per-patch directory;
    inflate writes directly into patches/{N}/dlc.vmcode and the
    Downloaded → Installed transition removes the now-unneeded compressed
    download file.
  - Drops cleanup_download_artifacts and clean_download_dir. mark_bad
    handles tombstone-aware cleanup for failures, and the lifecycle owns
    its directory.
  - Marks the patch Bad{InvalidPatchBytes} when inflate fails and
    Bad{InstallHashMismatch} when check_hash fails. Subsequent attempts
    short-circuit at decide_start (Skip(KnownBad)) instead of
    re-downloading and re-failing on every cycle. This is the marks-bad-
    on-install-failure behavior we deferred from #351.
  - Preserves the explicit "server-side Content-Length vs total_bytes"
    mismatch check — surfaces the contract violation directly instead of
    obliquely via inflate failure.

decide_start consults the on-disk download file as the source of truth
for "how many bytes do we have so far." The denormalized partial_size
in PatchState::Downloading is kept for diagnostics/serde stability but
not consulted for the resume offset; this matches the prior
file-size-on-disk behavior and avoids needing to update state.json
mid-stream from the network layer.

For Downloading / Downloaded states where the OS evicted the artifact
file out from under us (e.g. iOS code-cache eviction), decide_start
falls through to Fresh so the next attempt re-downloads from scratch.

The failing-test fixes update file paths and pre-staged state from the
old layout (downloads/{N} + downloads/{N}.download.json) to the new
layout (patches/{N}/download + patches/{N}/state.json). Several tests
that targeted the now-deleted helpers directly are removed; their
behavior is covered by the new lifecycle unit tests.
Both modules are now subsumed by PatchLifecycle:

  - patch_manager.rs (~1600 lines) managed the per-release `patches/`
    tree, the `next_boot_patch` / `last_booted_patch` /
    `currently_booting_patch` / `known_bad_patches` fields of
    `PatchesState`, and the fall-back-from-bad-patch logic. All of
    this is now in lifecycle.rs split between `PatchLifecycle`
    (per-patch state.json) and `ReleasePointers` (a single
    pointers.json).
  - download_state.rs (~125 lines) wrote the per-download sidecar
    JSON. Now folded into PatchState::Downloading /
    PatchState::Downloaded variants in state.json, owned by the
    lifecycle module.

The MockManagePatches / ManagePatches trait machinery in
updater_state.rs's tests goes away with patch_manager. The lifecycle
operations are tested directly in cache::lifecycle::tests against a
real on-disk filesystem under TempDir, which catches issues the
mocks couldn't (e.g. file-existence cross-checks in `decide_start`).

The cargo workspace now has 212 passing tests vs. 257 before, the
delta being the patch_manager unit tests whose subjects no longer
exist. End-to-end coverage in `updater.rs::tests` is unchanged and
all green.
Nine fixes from the self-review pass on #352:

  - Drop `partial_size` from PatchState::Downloading. The field was
    misleading (decide_start reads from disk, not from the recorded
    value) and unused. record_download_started loses its 5th arg.
  - Gate `UpdaterState::install_patch` to `#[cfg(test)]`. Production
    no longer routes through it; only test_utils and the updater_state
    tests do. The gate makes the divergence intentional and prevents
    future production callers.
  - install_patch defensively removes any prior `dlc.vmcode` before
    rename so behavior is OS-agnostic (POSIX rename overwrites silently;
    Windows fails). Also mirrors record_install_complete's cleanup of
    a stale `download` file in the patch dir.
  - Document on `lifecycle()` / `lifecycle_mut()` that the direct
    accessors are intentional — wrapping every transition would be
    churn for no reader benefit.
  - recompute_next_boot now clears `last_booted_patch` when its
    on-disk record is gone (Unknown), so pointers.json doesn't
    accumulate references to nothing. A `Bad` last_booted is left
    alone — that's a useful breadcrumb and recompute simply doesn't
    promote it.
  - Promote `download_artifact_path` and `installed_artifact_path` to
    methods on PatchLifecycle for symmetry with state_path /
    pointers_path. Updates all call sites.
  - Document mark_bad-on-Bad merge semantics: latest reason wins,
    other fields preserved. Hypothetical in practice but no longer
    silent.
  - Add tests for recompute_next_boot's stale-pointer clearing
    (Unknown → cleared, Bad → kept). Brings the suite to 214 tests.

The mark_bad-cleanup-stale-bytes recovery path I flagged was already
covered by `cleanup_on_bad_patch_keeps_tombstone` — no new test needed.
Two issues introduced by 07ea84a that the second self-review caught:

  - update_internal computed download_path / installed_path via
    `with_state(...).lifecycle().download_artifact_path(n)`. That
    routed a pure-function path computation through a state read,
    serializing it against other state operations and adding a disk
    read per call. Restore the free `download_artifact_path` /
    `installed_artifact_path` functions as the canonical entry point;
    the methods on PatchLifecycle are kept as thin wrappers for
    callers that already hold a lifecycle handle.
  - install_patch's "defensive remove_file before rename" added a
    TOCTOU window between the exists() check and the rename for no
    real benefit — install_patch is now `#[cfg(test)]`-only, the
    Windows-rename concern was theoretical for tests, and POSIX rename
    atomically overwrites. Drop the explicit remove_file.
Three CI failures from the prior push:

  - `Context` and `bail` imports in updater_state.rs are only used
    inside the `#[cfg(test)] install_patch` helper. Moved them under
    `#[cfg(test)]` so non-test builds don't see unused imports.
  - `PathBuf` import in updater.rs became unused after switching the
    download/install paths to take `Path::new(&config.storage_dir)`
    directly. Dropped.
  - `FileOperation::DeleteDir` is no longer constructed by production
    code (was used by the deleted patch_manager.rs). Mirrored the
    existing `#[allow(dead_code)]` annotation already on `DeleteFile`
    rather than removing the variant — the format/handling code for
    it is still useful for any future code that needs to surface a
    "delete dir failed" error.

CSpell additions: `unparseable`, `tombstoned`, `roundtrips` — words
introduced by the lifecycle module's docs and test names.
The previous fix removed `PathBuf` from the top-level use, which
broke the lib build on non-android non-test platforms (macOS,
Linux, Windows) where `libapp_path_from_settings` uses `PathBuf`.
That function is itself gated to `cfg(not(any(target_os = \"android\",
test)))`, so the PathBuf import needs the same gate.

Restoring it as a cfg-gated separate import — keeps the lib build
green on all three runners and keeps the lib-test build clean
under -D warnings.
Adds 7 tests targeting branches that were uncovered:

  - mark_bad_from_downloading_records_partial_file_size: the
    Downloading source state in mark_bad reads bytes-on-disk for
    the recorded `size` field; the existing tests only covered
    Installed → Bad.
  - mark_bad_overwrites_reason_when_already_bad: documented merge
    semantics (latest reason wins, other fields preserved) now
    has a test backing the doc.
  - validate_next_boot_patch_marks_bad_when_artifact_missing: the
    "Installed state but dlc.vmcode is gone" path. Existing test
    covered size mismatch; this covers the missing-file branch.
  - validate_next_boot_patch_marks_bad_when_pointer_targets_non_installed:
    the case where the next_boot pointer references a state other
    than Installed (state.json + pointers can diverge through
    corruption).
  - install_patch_install_only_{accepts_valid,rejects_missing,rejects_bad}_signature:
    cover the InstallOnly verification path in
    UpdaterState::install_patch using the existing test keypair
    from signing.rs.

Coverage improvements:
  - cache/lifecycle.rs:    94.65% → 96.06% lines, 98.72% → 100% fns
  - cache/updater_state.rs: 94.34% → 96.94% lines, 95.65% → 98% fns
  - total:                 94.70% → 95.11% lines

214 → 221 tests, all passing under -D warnings.
Goes through every test deleted with patch_manager.rs (~40) and either
maps it to a new test (with a `Ports patch_manager.rs::mod::name`
comment) or adds a port. Three behavioral regressions caught and fixed:

  - record_boot_start now requires next_boot_patch to match the arg.
    Old PatchManager had this defensive check; my refactor dropped it.
    Carries forward the engine-vs-state agreement guard.
  - Added strict-mode signature tests for validate_next_boot_patch
    (valid, missing, invalid, no public key, bad public key). Ports
    five `validate_next_boot_patch_tests::strict_mode_*` tests.
  - Added InstallOnly + no public_key test (ports
    add_patch_tests::install_only_succeeds_with_any_signature_if_no_public_key).

Plus two scenario tests that didn't have direct coverage:

  - rolled_back_patch_not_resurrected_when_replacement_fails: ports
    fall_back_tests::rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch
    against the new state-based fallback (cleanup forgets, recompute
    won't promote a None state).
  - validate_then_promote_catches_corrupted_last_booted: ports
    validate_next_boot_patch_tests::does_not_fall_back_to_last_booted_patch_if_corrupted
    with a note: new code catches the corruption at the next validate
    pass instead of at fall-back time, but the user-visible outcome
    (boot the base release) is the same.

Test helper split: `install_state` (writes Installed without touching
pointers) vs callers explicitly setting pointers. Avoids the prior
helper's auto-promote masking pointer-management behavior the tests
were trying to exercise.

Tests deliberately not ported (with reasoning):

  - debug_tests::manage_patches_is_debug, patch_manager_is_debug:
    Trait-Debug tests for types that no longer exist. Replaced
    implicitly by `#[derive(Debug)]` on PatchLifecycle.
  - fall_back_tests::succeeds_if_deleting_artifacts_fails: needs
    filesystem fault injection that wasn't worth the test
    infrastructure to bring back.
  - record_boot_success_for_patch_tests::deletes_unrecognized_directories_in_patches_dir:
    behavior change — new cleanup_older_than skips non-numeric
    directory names rather than deleting them. Defensive; the prior
    `rm -rf` was overly aggressive.

Coverage: 222 → 230 tests, all green under -D warnings.
Both turned out trivial to port — neither was actually doing fault
injection, just exploiting graceful-failure paths.

  - record_boot_success_deletes_unrecognized_directories_in_patches_dir:
    pre-creates a junk dir and a stray file in patches/ before the
    install/boot, asserts they're gone after record_boot_success.
    Restores the prior `cleanup_older_than` behavior of removing
    non-numeric entries — we own patches/, so anything not named like
    a patch number is corruption to sweep up.
  - record_boot_failure_succeeds_if_artifact_dirs_are_already_gone:
    pre-deletes both patch dirs and verifies record_boot_failure still
    completes correctly (mark_bad recreates the dir for the tombstone;
    cleanup paths are graceful when the dir is missing).

Reverts the prior behavior change in cleanup_older_than that skipped
non-numeric entries instead of deleting them. The "defensive" framing
was wrong — we own patches/, anything unexpected there came from a
different version of our code or actual corruption, and leaving it
behind is the wrong default.
  - validate_next_boot_strict_mode_succeeds_with_valid_signature: ports
    `patch_manager.rs::validate_next_boot_patch_tests::strict_mode_succeeds_with_valid_signature_at_boot_time`.
    Required reusing the prior test fixture's INFLATED_PATCH_HASH /
    INFLATED_PATCH_SIGNATURE constants (a real signature for the
    1-byte file content `b"1"`, matching TEST_PUBLIC_KEY's private
    half).
  - record_boot_failure_keeps_last_booted_pointer_when_failed_patch_was_last_booted:
    ports
    `patch_manager.rs::record_boot_failure_for_patch_tests::preserves_last_booted_patch_on_failure_but_marks_bad`.
    The new behavior is similar but more explicit: last_booted's
    *pointer* is preserved (with the patch now in Bad state); the
    underlying intent — "we still know what last booted, even after
    that patch failed" — survives.

Test helper split: `install_signed` (failure paths, mismatched hash
OK) vs `install_with_valid_signature` (happy path, real fixture).

Now every behavioral test from the deleted patch_manager.rs has a
corresponding new test with a porting comment. Two trivial ones not
ported: the `Debug` impl tests for traits/types that no longer exist.

236 tests, all green under -D warnings.
Reverts an unintended policy change in the cutover: my refactor moved
compressed download bytes from `{code_cache_dir}/downloads/{N}` (the
OS-managed cache dir, evictable under storage pressure) into
`{storage_dir}/patches/{N}/download` (persistent app storage). That
made downloads count against persistent storage and survive in iCloud
backups — neither of which we want for transient bytes.

Restoring the original split:
  - `{state_root}/patches/{N}/{state.json,dlc.vmcode}` — persistent
    state and the installed artifact.
  - `{download_root}/{N}` — flat, in the OS cache dir.

`PatchLifecycle::load_or_default` now takes both roots. The two
artifact-path helpers each route to their respective root, the
free-function variants used by `update_internal` now take the right
root, and `mark_bad`/`cleanup`/`record_install_complete` clean both
locations as appropriate.

`UpdaterState::create_new_and_save` also wipes `download_dir` on
release-version change, in addition to the persistent patches/ tree.

`update_internal` and tests updated to use the right root for each
file type. Test fixtures pass `tmp.path().join("downloads")` as the
download root (separate subdir of the same TempDir).

Net change in test surface: moved a handful of `patch_dir/download`
references to `downloads_dir/N`, no behavioral changes to tests
themselves. 236 tests pass under -D warnings.
Self-review caught two gaps in the download-root split:

  - release_version_change_wipes_download_dir: the cache-rooted
    download dir should be wiped along with the persistent patches/
    tree on a release-version mismatch. Adds explicit coverage —
    the code path was already in create_new_and_save but not
    directly tested.
  - record_boot_success_promotes_and_cleans_older: extends the
    existing test to drop a stale download in download_root
    alongside an older patch's state, asserts that
    cleanup_older_than's chain (cleanup → forget_dir) deletes both
    roots' artifacts.

Also adds a comment on cleanup_older_than noting that it only
walks the persistent patches/ tree — orphan downloads (no
state.json) would persist until the OS evicts them. Noted as a
known limitation, not blocking.
… walk

Two fixes to the release-version-change wipe and one to the boot-
success cleanup walk.

  - The "wipe everything in cache_dir" approach broke 27 tests whose
    setup uses cache_dir as the test temp dir (with base.apk and
    libapp.so as siblings). Production gives shorebird a dedicated
    subdir but the API doesn't enforce that. Switched to a targeted
    wipe with a documented allowlist of paths shorebird has ever
    written under cache_dir: `SHOREBIRD_OWNED_PATHS`.
  - Added `patches_state.json` to the wipe list — that's the
    legacy file from the prior `PatchManager` implementation, and
    devices upgrading through this PR would otherwise orphan it.
    New test `release_version_change_wipes_legacy_patches_state_json`
    covers it.
  - `cleanup_older_than` now also calls `cleanup_orphan_downloads`,
    which walks `download_root/` and removes any file that doesn't
    correspond to a patch in `Downloading`/`Downloaded` state. We own
    the cache root and shouldn't rely on OS eviction. Catches the
    "state.json gone but download lingered" case the prior comment
    handwaved.

Adding a new file under cache_dir means adding it to
SHOREBIRD_OWNED_PATHS — small bookkeeping cost, but it lets us
keep cache_dir co-tenant with embedder-provided files.
The new orphan-download sweep distinguishes four cases but only the
"older patch via cleanup chain" branch was being exercised. Single
test now drops one of each kind of file into download_root before a
boot success and asserts which survive:

  - orphan (numeric, no state.json): removed
  - stale (numeric, state is Installed): removed
  - non-numeric name: removed
  - live (numeric, state is Downloading): preserved
@eseidel eseidel force-pushed the refactor/patch-lifecycle-state-machine branch from 79848b0 to b189390 Compare May 7, 2026 21:55
Copy link
Copy Markdown
Member

@bdero bdero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall

Right architecture, right time. Consolidating the six scattered state stores into one per-patch state.json driven by an explicit machine is the move, and cleanup being state-aware (Bad → keep tombstone, else → forget entirely) cleanly separates the two prior conflated concerns ("server told us not to use this" vs "this actually crashed on this device"). #356 falls out as a strict subset once this lands — happy to close it.

A handful of suggestions, no blockers. Inline comments tagged 💡 (suggestion) and 🧹 (nit).

💡 No end-to-end test for the customer scenario

library/src/updater.rs doesn't have an end-to-end test that pulls the customer scenario in shorebirdtech/shorebird#3728 through update_internal (install → server rolled_back: [1] → server returns patch 1 again with rolled_back: [] → assert reinstall). The behavior is correct at the lifecycle unit level (cleanup_on_non_bad_patch_forgets_entirely), but the customer reported it through the FFI, so a test at that level would directly nail the regression we just shipped a hotfix for in #356. Cheap to port from #356.

(Inlining this in the body since updater.rs line ~2229 — the natural anchor next to the existing rollback tests — isn't in a diff hunk.)

Comment thread library/src/cache/lifecycle.rs
Comment thread library/src/cache/lifecycle.rs
Comment thread library/src/cache/lifecycle.rs
Comment thread library/src/cache/lifecycle.rs
Comment thread library/src/cache/updater_state.rs
Comment thread library/src/cache/lifecycle.rs
Comment thread library/src/cache/lifecycle.rs
Comment thread library/src/cache/updater_state.rs
Comment thread library/src/cache/lifecycle.rs
Comment thread library/src/cache/lifecycle.rs
Copy link
Copy Markdown
Member

@bdero bdero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passes onceover for me, bot found some minor suggestions. Let's merge today and I'll launch a prod smoke test for it.

…eview

bdero's review on #352 surfaced six items worth landing inline:

1. record_boot_failure used to clear `currently_booting_patch` before
   `mark_bad`. A crash between the two left the patch `Installed` with
   no breadcrumb, so `detect_boot_crash_on_init` wouldn't fire next
   boot — the patch silently retried. Flipped: mark_bad first, then
   clear. mark_bad on Bad is idempotent, so the worst case is a
   redundant tombstone-rewrite on next init.

2. Drop the `hash` field from `PatchState::Installed`. It was recorded
   at install but never read at boot — Strict-mode validation
   recomputes the hash from the artifact's bytes and feeds it into
   `check_signature`. A hash that lives only on disk can't be trusted
   as a security input; deleting the field removes the temptation.
   `Downloading.hash`/`Downloaded.hash` stay as comparators against
   the server's freshly-delivered hash (a tampered on-disk hash there
   just causes a redownload).

3. Port the customer-scenario test from #356 into `c_api/mod.rs`:
   server rolls patch 1 back, then forward again with the same number
   and hash. Pre-lifecycle this was permanently dropped via
   `known_bad_patches`. Post-lifecycle `cleanup` is state-aware and
   forgets the patch entirely on server-driven rollback, so the
   rollforward installs cleanly. End-to-end coverage of the regression
   #356 hotfixed.

4. Add a TODO + target version on the `patches_state.json` entry in
   `SHOREBIRD_OWNED_PATHS` so the legacy wipe doesn't outlive its
   usefulness.

5. Document that `running_patch` lives in `config.rs` (session-scoped),
   not in this module — saves a future grep.

6. Clarify `recompute_next_boot`'s fallback policy: it's
   fall-back-to-`last_booted_patch`-only, not fall-back-to-anything-
   bootable. A fresh release whose first Installed patch fails
   validation goes to base, even if older patches sit in `patches/`.

cargo test: 242 passed (was 241).
@eseidel eseidel enabled auto-merge (squash) May 7, 2026 22:45
@eseidel eseidel merged commit 90c3492 into main May 7, 2026
18 checks passed
@eseidel eseidel deleted the refactor/patch-lifecycle-state-machine branch May 7, 2026 23:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tech debt: consolidate per-patch state machine in updater

2 participants