refactor: per-patch lifecycle state machine#352
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Nine fixes from the self-review pass on #352: - Drop `partial_size` from PatchState::Downloading. The field was misleading (decide_start reads from disk, not from the recorded value) and unused. record_download_started loses its 5th arg. - Gate `UpdaterState::install_patch` to `#[cfg(test)]`. Production no longer routes through it; only test_utils and the updater_state tests do. The gate makes the divergence intentional and prevents future production callers. - install_patch defensively removes any prior `dlc.vmcode` before rename so behavior is OS-agnostic (POSIX rename overwrites silently; Windows fails). Also mirrors record_install_complete's cleanup of a stale `download` file in the patch dir. - Document on `lifecycle()` / `lifecycle_mut()` that the direct accessors are intentional — wrapping every transition would be churn for no reader benefit. - recompute_next_boot now clears `last_booted_patch` when its on-disk record is gone (Unknown), so pointers.json doesn't accumulate references to nothing. A `Bad` last_booted is left alone — that's a useful breadcrumb and recompute simply doesn't promote it. - Promote `download_artifact_path` and `installed_artifact_path` to methods on PatchLifecycle for symmetry with state_path / pointers_path. Updates all call sites. - Document mark_bad-on-Bad merge semantics: latest reason wins, other fields preserved. Hypothetical in practice but no longer silent. - Add tests for recompute_next_boot's stale-pointer clearing (Unknown → cleared, Bad → kept). Brings the suite to 214 tests. The mark_bad-cleanup-stale-bytes recovery path I flagged was already covered by `cleanup_on_bad_patch_keeps_tombstone` — no new test needed.
|
@bdero asked for us to stand up end-to-end testing first. |
|
Built an integration test system per your request: #353 |
…tence) First slice of shorebirdtech/shorebird#3737 — replace the scattered storage of per-patch state across download_state.rs sidecars, bare files in downloads/, and PatchesState fields (next_boot_patch, last_booted_patch, known_bad_patches) with a single per-patch state.json driven by an explicit state machine. This commit only adds the types and the storage layer; nothing wires into update_internal or report_launch_* yet. Subsequent commits will build out transitions and migrate the call sites. States: - Downloading { url, hash, signature, partial_size } - Downloaded { url, hash, signature, size } - Installed { hash, signature, size } - Bad { reason, hash?, signature?, size? } // tombstone Operations: - mark_bad(n, reason) — sugar over write Bad{} + cleanup - cleanup(n) — state-aware: keeps Bad tombstone, else forgets dir Per-release pointers (next_boot, last_booted, currently_booting, boot_started_at) move to a separate pointers.json holding patch numbers; metadata lives once per patch in state.json instead of duplicated across pointers. Persistence rides on the existing atomic disk_io::write (sibling- write + rename) so partial writes can't leave torn files.
Extends the lifecycle module with the methods update_internal and
report_launch_* will need:
- decide_start(n, url, hash) → DownloadAction (Fresh, Resume, Complete, Skip)
- record_download_started / _complete
- record_install_complete (transitions Downloaded → Installed,
removes the now-unneeded compressed download file)
- promote_to_next_boot (transitions a freshly Installed patch into
pointers.next_boot, retiring any unbooted predecessor)
- record_boot_start / _success / _failure
- detect_boot_crash_on_init (handles the breadcrumb left when a
prior process crashed during boot)
- validate_next_boot_patch (size + signature checks; marks
Bad{ValidationFailed} on failure)
- recompute_next_boot
Nothing wires into the existing updater.rs / report_launch_* yet —
those changes follow in the cutover commit.
…ycle
Drops the patch_manager dependency from updater_state.rs. UpdaterState
now owns a PatchLifecycle directly, with all patch-related methods
delegating to it:
install_patch → write Installed state + promote to next_boot
next_boot_patch → pointers.next_boot_patch + installed_artifact_path
is_known_bad_patch → read_state matches Bad
uninstall_patch → cleanup + recompute_next_boot
record_boot_* → lifecycle's boot transitions
validate_next_boot_patch → lifecycle's signature/size check
UpdaterState's own state.json shrinks to {client_id, release_version,
queued_events}. Per-release per-patch state lives entirely in the
lifecycle (pointers.json + patches/{N}/state.json).
Other notable behavior changes:
- record_boot_failure_for_patch no longer requires
currently_booting_patch to match. Matches the prior PatchManager
semantics: clear breadcrumb, mark Bad{BootCrash}, recompute.
- recompute_next_boot leaves a valid Installed next_boot_patch alone
instead of promoting last_booted over it. Without this, processing
server rollbacks would clobber a freshly-installed newer patch.
Tests in updater.rs that asserted file paths (patches_state.json) update
to the new layout (pointers.json). The MockManagePatches-backed unit
tests in updater_state.rs are gone — replaced by direct end-to-end tests
that exercise PatchLifecycle through UpdaterState.
patch_manager.rs is still on disk and compiled (nothing references it
from cache/mod.rs anymore) — deletion happens after the updater.rs
cutover.
Replaces the download_state.rs sidecar machinery and the bespoke
DownloadStartState / should_install_patch helpers in update_internal
with calls into PatchLifecycle.
Concrete changes:
- Drops compute_resume_offset / determine_download_start_state /
DownloadStartState — replaced by PatchLifecycle::decide_start
returning DownloadAction.
- Drops should_install_patch / ShouldInstallPatchCheckResult —
folded into the same DownloadAction match. KnownBad → UpdateIsBadPatch,
AlreadyInstalled → NoUpdate, the rest proceed.
- Drops install_downloaded_patch's bespoke "stage in downloads/, rename
on install" choreography. The lifecycle owns the per-patch directory;
inflate writes directly into patches/{N}/dlc.vmcode and the
Downloaded → Installed transition removes the now-unneeded compressed
download file.
- Drops cleanup_download_artifacts and clean_download_dir. mark_bad
handles tombstone-aware cleanup for failures, and the lifecycle owns
its directory.
- Marks the patch Bad{InvalidPatchBytes} when inflate fails and
Bad{InstallHashMismatch} when check_hash fails. Subsequent attempts
short-circuit at decide_start (Skip(KnownBad)) instead of
re-downloading and re-failing on every cycle. This is the marks-bad-
on-install-failure behavior we deferred from #351.
- Preserves the explicit "server-side Content-Length vs total_bytes"
mismatch check — surfaces the contract violation directly instead of
obliquely via inflate failure.
decide_start consults the on-disk download file as the source of truth
for "how many bytes do we have so far." The denormalized partial_size
in PatchState::Downloading is kept for diagnostics/serde stability but
not consulted for the resume offset; this matches the prior
file-size-on-disk behavior and avoids needing to update state.json
mid-stream from the network layer.
For Downloading / Downloaded states where the OS evicted the artifact
file out from under us (e.g. iOS code-cache eviction), decide_start
falls through to Fresh so the next attempt re-downloads from scratch.
The failing-test fixes update file paths and pre-staged state from the
old layout (downloads/{N} + downloads/{N}.download.json) to the new
layout (patches/{N}/download + patches/{N}/state.json). Several tests
that targeted the now-deleted helpers directly are removed; their
behavior is covered by the new lifecycle unit tests.
Both modules are now subsumed by PatchLifecycle:
- patch_manager.rs (~1600 lines) managed the per-release `patches/`
tree, the `next_boot_patch` / `last_booted_patch` /
`currently_booting_patch` / `known_bad_patches` fields of
`PatchesState`, and the fall-back-from-bad-patch logic. All of
this is now in lifecycle.rs split between `PatchLifecycle`
(per-patch state.json) and `ReleasePointers` (a single
pointers.json).
- download_state.rs (~125 lines) wrote the per-download sidecar
JSON. Now folded into PatchState::Downloading /
PatchState::Downloaded variants in state.json, owned by the
lifecycle module.
The MockManagePatches / ManagePatches trait machinery in
updater_state.rs's tests goes away with patch_manager. The lifecycle
operations are tested directly in cache::lifecycle::tests against a
real on-disk filesystem under TempDir, which catches issues the
mocks couldn't (e.g. file-existence cross-checks in `decide_start`).
The cargo workspace now has 212 passing tests vs. 257 before, the
delta being the patch_manager unit tests whose subjects no longer
exist. End-to-end coverage in `updater.rs::tests` is unchanged and
all green.
Nine fixes from the self-review pass on #352: - Drop `partial_size` from PatchState::Downloading. The field was misleading (decide_start reads from disk, not from the recorded value) and unused. record_download_started loses its 5th arg. - Gate `UpdaterState::install_patch` to `#[cfg(test)]`. Production no longer routes through it; only test_utils and the updater_state tests do. The gate makes the divergence intentional and prevents future production callers. - install_patch defensively removes any prior `dlc.vmcode` before rename so behavior is OS-agnostic (POSIX rename overwrites silently; Windows fails). Also mirrors record_install_complete's cleanup of a stale `download` file in the patch dir. - Document on `lifecycle()` / `lifecycle_mut()` that the direct accessors are intentional — wrapping every transition would be churn for no reader benefit. - recompute_next_boot now clears `last_booted_patch` when its on-disk record is gone (Unknown), so pointers.json doesn't accumulate references to nothing. A `Bad` last_booted is left alone — that's a useful breadcrumb and recompute simply doesn't promote it. - Promote `download_artifact_path` and `installed_artifact_path` to methods on PatchLifecycle for symmetry with state_path / pointers_path. Updates all call sites. - Document mark_bad-on-Bad merge semantics: latest reason wins, other fields preserved. Hypothetical in practice but no longer silent. - Add tests for recompute_next_boot's stale-pointer clearing (Unknown → cleared, Bad → kept). Brings the suite to 214 tests. The mark_bad-cleanup-stale-bytes recovery path I flagged was already covered by `cleanup_on_bad_patch_keeps_tombstone` — no new test needed.
Two issues introduced by 07ea84a that the second self-review caught: - update_internal computed download_path / installed_path via `with_state(...).lifecycle().download_artifact_path(n)`. That routed a pure-function path computation through a state read, serializing it against other state operations and adding a disk read per call. Restore the free `download_artifact_path` / `installed_artifact_path` functions as the canonical entry point; the methods on PatchLifecycle are kept as thin wrappers for callers that already hold a lifecycle handle. - install_patch's "defensive remove_file before rename" added a TOCTOU window between the exists() check and the rename for no real benefit — install_patch is now `#[cfg(test)]`-only, the Windows-rename concern was theoretical for tests, and POSIX rename atomically overwrites. Drop the explicit remove_file.
Three CI failures from the prior push:
- `Context` and `bail` imports in updater_state.rs are only used
inside the `#[cfg(test)] install_patch` helper. Moved them under
`#[cfg(test)]` so non-test builds don't see unused imports.
- `PathBuf` import in updater.rs became unused after switching the
download/install paths to take `Path::new(&config.storage_dir)`
directly. Dropped.
- `FileOperation::DeleteDir` is no longer constructed by production
code (was used by the deleted patch_manager.rs). Mirrored the
existing `#[allow(dead_code)]` annotation already on `DeleteFile`
rather than removing the variant — the format/handling code for
it is still useful for any future code that needs to surface a
"delete dir failed" error.
CSpell additions: `unparseable`, `tombstoned`, `roundtrips` — words
introduced by the lifecycle module's docs and test names.
The previous fix removed `PathBuf` from the top-level use, which broke the lib build on non-android non-test platforms (macOS, Linux, Windows) where `libapp_path_from_settings` uses `PathBuf`. That function is itself gated to `cfg(not(any(target_os = \"android\", test)))`, so the PathBuf import needs the same gate. Restoring it as a cfg-gated separate import — keeps the lib build green on all three runners and keeps the lib-test build clean under -D warnings.
Adds 7 tests targeting branches that were uncovered:
- mark_bad_from_downloading_records_partial_file_size: the
Downloading source state in mark_bad reads bytes-on-disk for
the recorded `size` field; the existing tests only covered
Installed → Bad.
- mark_bad_overwrites_reason_when_already_bad: documented merge
semantics (latest reason wins, other fields preserved) now
has a test backing the doc.
- validate_next_boot_patch_marks_bad_when_artifact_missing: the
"Installed state but dlc.vmcode is gone" path. Existing test
covered size mismatch; this covers the missing-file branch.
- validate_next_boot_patch_marks_bad_when_pointer_targets_non_installed:
the case where the next_boot pointer references a state other
than Installed (state.json + pointers can diverge through
corruption).
- install_patch_install_only_{accepts_valid,rejects_missing,rejects_bad}_signature:
cover the InstallOnly verification path in
UpdaterState::install_patch using the existing test keypair
from signing.rs.
Coverage improvements:
- cache/lifecycle.rs: 94.65% → 96.06% lines, 98.72% → 100% fns
- cache/updater_state.rs: 94.34% → 96.94% lines, 95.65% → 98% fns
- total: 94.70% → 95.11% lines
214 → 221 tests, all passing under -D warnings.
Goes through every test deleted with patch_manager.rs (~40) and either
maps it to a new test (with a `Ports patch_manager.rs::mod::name`
comment) or adds a port. Three behavioral regressions caught and fixed:
- record_boot_start now requires next_boot_patch to match the arg.
Old PatchManager had this defensive check; my refactor dropped it.
Carries forward the engine-vs-state agreement guard.
- Added strict-mode signature tests for validate_next_boot_patch
(valid, missing, invalid, no public key, bad public key). Ports
five `validate_next_boot_patch_tests::strict_mode_*` tests.
- Added InstallOnly + no public_key test (ports
add_patch_tests::install_only_succeeds_with_any_signature_if_no_public_key).
Plus two scenario tests that didn't have direct coverage:
- rolled_back_patch_not_resurrected_when_replacement_fails: ports
fall_back_tests::rollback_then_failed_replacement_does_not_resurrect_rolled_back_patch
against the new state-based fallback (cleanup forgets, recompute
won't promote a None state).
- validate_then_promote_catches_corrupted_last_booted: ports
validate_next_boot_patch_tests::does_not_fall_back_to_last_booted_patch_if_corrupted
with a note: new code catches the corruption at the next validate
pass instead of at fall-back time, but the user-visible outcome
(boot the base release) is the same.
Test helper split: `install_state` (writes Installed without touching
pointers) vs callers explicitly setting pointers. Avoids the prior
helper's auto-promote masking pointer-management behavior the tests
were trying to exercise.
Tests deliberately not ported (with reasoning):
- debug_tests::manage_patches_is_debug, patch_manager_is_debug:
Trait-Debug tests for types that no longer exist. Replaced
implicitly by `#[derive(Debug)]` on PatchLifecycle.
- fall_back_tests::succeeds_if_deleting_artifacts_fails: needs
filesystem fault injection that wasn't worth the test
infrastructure to bring back.
- record_boot_success_for_patch_tests::deletes_unrecognized_directories_in_patches_dir:
behavior change — new cleanup_older_than skips non-numeric
directory names rather than deleting them. Defensive; the prior
`rm -rf` was overly aggressive.
Coverage: 222 → 230 tests, all green under -D warnings.
Both turned out trivial to port — neither was actually doing fault
injection, just exploiting graceful-failure paths.
- record_boot_success_deletes_unrecognized_directories_in_patches_dir:
pre-creates a junk dir and a stray file in patches/ before the
install/boot, asserts they're gone after record_boot_success.
Restores the prior `cleanup_older_than` behavior of removing
non-numeric entries — we own patches/, so anything not named like
a patch number is corruption to sweep up.
- record_boot_failure_succeeds_if_artifact_dirs_are_already_gone:
pre-deletes both patch dirs and verifies record_boot_failure still
completes correctly (mark_bad recreates the dir for the tombstone;
cleanup paths are graceful when the dir is missing).
Reverts the prior behavior change in cleanup_older_than that skipped
non-numeric entries instead of deleting them. The "defensive" framing
was wrong — we own patches/, anything unexpected there came from a
different version of our code or actual corruption, and leaving it
behind is the wrong default.
- validate_next_boot_strict_mode_succeeds_with_valid_signature: ports
`patch_manager.rs::validate_next_boot_patch_tests::strict_mode_succeeds_with_valid_signature_at_boot_time`.
Required reusing the prior test fixture's INFLATED_PATCH_HASH /
INFLATED_PATCH_SIGNATURE constants (a real signature for the
1-byte file content `b"1"`, matching TEST_PUBLIC_KEY's private
half).
- record_boot_failure_keeps_last_booted_pointer_when_failed_patch_was_last_booted:
ports
`patch_manager.rs::record_boot_failure_for_patch_tests::preserves_last_booted_patch_on_failure_but_marks_bad`.
The new behavior is similar but more explicit: last_booted's
*pointer* is preserved (with the patch now in Bad state); the
underlying intent — "we still know what last booted, even after
that patch failed" — survives.
Test helper split: `install_signed` (failure paths, mismatched hash
OK) vs `install_with_valid_signature` (happy path, real fixture).
Now every behavioral test from the deleted patch_manager.rs has a
corresponding new test with a porting comment. Two trivial ones not
ported: the `Debug` impl tests for traits/types that no longer exist.
236 tests, all green under -D warnings.
Reverts an unintended policy change in the cutover: my refactor moved
compressed download bytes from `{code_cache_dir}/downloads/{N}` (the
OS-managed cache dir, evictable under storage pressure) into
`{storage_dir}/patches/{N}/download` (persistent app storage). That
made downloads count against persistent storage and survive in iCloud
backups — neither of which we want for transient bytes.
Restoring the original split:
- `{state_root}/patches/{N}/{state.json,dlc.vmcode}` — persistent
state and the installed artifact.
- `{download_root}/{N}` — flat, in the OS cache dir.
`PatchLifecycle::load_or_default` now takes both roots. The two
artifact-path helpers each route to their respective root, the
free-function variants used by `update_internal` now take the right
root, and `mark_bad`/`cleanup`/`record_install_complete` clean both
locations as appropriate.
`UpdaterState::create_new_and_save` also wipes `download_dir` on
release-version change, in addition to the persistent patches/ tree.
`update_internal` and tests updated to use the right root for each
file type. Test fixtures pass `tmp.path().join("downloads")` as the
download root (separate subdir of the same TempDir).
Net change in test surface: moved a handful of `patch_dir/download`
references to `downloads_dir/N`, no behavioral changes to tests
themselves. 236 tests pass under -D warnings.
Self-review caught two gaps in the download-root split:
- release_version_change_wipes_download_dir: the cache-rooted
download dir should be wiped along with the persistent patches/
tree on a release-version mismatch. Adds explicit coverage —
the code path was already in create_new_and_save but not
directly tested.
- record_boot_success_promotes_and_cleans_older: extends the
existing test to drop a stale download in download_root
alongside an older patch's state, asserts that
cleanup_older_than's chain (cleanup → forget_dir) deletes both
roots' artifacts.
Also adds a comment on cleanup_older_than noting that it only
walks the persistent patches/ tree — orphan downloads (no
state.json) would persist until the OS evicts them. Noted as a
known limitation, not blocking.
… walk
Two fixes to the release-version-change wipe and one to the boot-
success cleanup walk.
- The "wipe everything in cache_dir" approach broke 27 tests whose
setup uses cache_dir as the test temp dir (with base.apk and
libapp.so as siblings). Production gives shorebird a dedicated
subdir but the API doesn't enforce that. Switched to a targeted
wipe with a documented allowlist of paths shorebird has ever
written under cache_dir: `SHOREBIRD_OWNED_PATHS`.
- Added `patches_state.json` to the wipe list — that's the
legacy file from the prior `PatchManager` implementation, and
devices upgrading through this PR would otherwise orphan it.
New test `release_version_change_wipes_legacy_patches_state_json`
covers it.
- `cleanup_older_than` now also calls `cleanup_orphan_downloads`,
which walks `download_root/` and removes any file that doesn't
correspond to a patch in `Downloading`/`Downloaded` state. We own
the cache root and shouldn't rely on OS eviction. Catches the
"state.json gone but download lingered" case the prior comment
handwaved.
Adding a new file under cache_dir means adding it to
SHOREBIRD_OWNED_PATHS — small bookkeeping cost, but it lets us
keep cache_dir co-tenant with embedder-provided files.
The new orphan-download sweep distinguishes four cases but only the "older patch via cleanup chain" branch was being exercised. Single test now drops one of each kind of file into download_root before a boot success and asserts which survive: - orphan (numeric, no state.json): removed - stale (numeric, state is Installed): removed - non-numeric name: removed - live (numeric, state is Downloading): preserved
79848b0 to
b189390
Compare
There was a problem hiding this comment.
Overall
Right architecture, right time. Consolidating the six scattered state stores into one per-patch state.json driven by an explicit machine is the move, and cleanup being state-aware (Bad → keep tombstone, else → forget entirely) cleanly separates the two prior conflated concerns ("server told us not to use this" vs "this actually crashed on this device"). #356 falls out as a strict subset once this lands — happy to close it.
A handful of suggestions, no blockers. Inline comments tagged 💡 (suggestion) and 🧹 (nit).
💡 No end-to-end test for the customer scenario
library/src/updater.rs doesn't have an end-to-end test that pulls the customer scenario in shorebirdtech/shorebird#3728 through update_internal (install → server rolled_back: [1] → server returns patch 1 again with rolled_back: [] → assert reinstall). The behavior is correct at the lifecycle unit level (cleanup_on_non_bad_patch_forgets_entirely), but the customer reported it through the FFI, so a test at that level would directly nail the regression we just shipped a hotfix for in #356. Cheap to port from #356.
(Inlining this in the body since updater.rs line ~2229 — the natural anchor next to the existing rollback tests — isn't in a diff hunk.)
bdero
left a comment
There was a problem hiding this comment.
passes onceover for me, bot found some minor suggestions. Let's merge today and I'll launch a prod smoke test for it.
…eview bdero's review on #352 surfaced six items worth landing inline: 1. record_boot_failure used to clear `currently_booting_patch` before `mark_bad`. A crash between the two left the patch `Installed` with no breadcrumb, so `detect_boot_crash_on_init` wouldn't fire next boot — the patch silently retried. Flipped: mark_bad first, then clear. mark_bad on Bad is idempotent, so the worst case is a redundant tombstone-rewrite on next init. 2. Drop the `hash` field from `PatchState::Installed`. It was recorded at install but never read at boot — Strict-mode validation recomputes the hash from the artifact's bytes and feeds it into `check_signature`. A hash that lives only on disk can't be trusted as a security input; deleting the field removes the temptation. `Downloading.hash`/`Downloaded.hash` stay as comparators against the server's freshly-delivered hash (a tampered on-disk hash there just causes a redownload). 3. Port the customer-scenario test from #356 into `c_api/mod.rs`: server rolls patch 1 back, then forward again with the same number and hash. Pre-lifecycle this was permanently dropped via `known_bad_patches`. Post-lifecycle `cleanup` is state-aware and forgets the patch entirely on server-driven rollback, so the rollforward installs cleanly. End-to-end coverage of the regression #356 hotfixed. 4. Add a TODO + target version on the `patches_state.json` entry in `SHOREBIRD_OWNED_PATHS` so the legacy wipe doesn't outlive its usefulness. 5. Document that `running_patch` lives in `config.rs` (session-scoped), not in this module — saves a future grep. 6. Clarify `recompute_next_boot`'s fallback policy: it's fall-back-to-`last_booted_patch`-only, not fall-back-to-anything- bootable. A fresh release whose first Installed patch fails validation goes to base, even if older patches sit in `patches/`. cargo test: 242 passed (was 241).
Closes shorebirdtech/shorebird#3737.
Why
Per-patch state was scattered across six stores — the
download_state.rssidecar JSON, the bare files indownloads/, the inflated.fullfile, thepatches/{N}/dlc.vmcodeartifact, thenext_boot_patch/last_booted_patch/currently_booting_patch/known_bad_patchesfields ofPatchesState, and the session-onlyrunning_patchglobal. There was no single source of truth for "what state is patch N in?"; every behavior change (most recently the 416 fix in #351) added another orthogonal axis instead of moving the patch through defined transitions.This PR consolidates the persistent per-patch state into a single document at
{cache}/patches/{N}/state.jsondriven by an explicit state machine.On-disk layout (per release)
Everything under
patches/is wiped on release-version change. Within a release,state.jsonsurvives as aBadtombstone even after artifact files are removed — that's how rollback to a previously-known-bad patch is prevented when newer patches fail.State machine
pointers.jsoncarries only patch numbers (next_boot_patch,last_booted_patch,currently_booting_patch,boot_started_at). Metadata lives once per patch in itsstate.json.API
The lifecycle exposes two state-aware operations callers don't have to choose between:
mark_bad(n, reason)— writes aBadtombstone and removes artifact files. Sugar overwrite_state(Bad{...}) + cleanup.cleanup(n)— state-aware retirement. If currentlyBad, keeps the tombstone and removes only artifact files. Otherwise removes the entire patch directory.Plus the lifecycle transitions used by
update_internalandreport_launch_*:decide_start(n, url, hash) → DownloadAction { Fresh, Resume{offset}, Complete, Skip }record_download_started,record_download_complete,record_install_completerecord_boot_start,record_boot_success,record_boot_failuredetect_boot_crash_on_init(runs on init; promotes a strandedcurrently_booting_patchtoBad{BootCrash})validate_next_boot_patch(size + signature; marksBad{ValidationFailed}on failure)recompute_next_boot(state-aware fallback walk that doesn't clobber a valid newernext_boot_patch)Behaviors that fall out
decide_startreturningCompleteskips the network when bytes are already on disk;mark_bad(InstallHashMismatch)short-circuits subsequent cycles viaSkip(KnownBad). The post-fix: skip fetch when prior download is already complete on disk #351 redownload loop (every cycle: download → hash mismatch → cleanup → repeat) is now bounded to one cycle per(release, patch_number).uninstall_patchcallscleanup(state-aware), not the old "always-add-to-known-bad" path. A patch the server later un-rolls-back will be redownloaded — but a patch that crashed locally and is then rolled back keeps itsBad{BootCrash}tombstone, so the server's "it's fine again" can't override device-local "this crashed."decide_startreads the on-disk file size as the resume offset, not a recorded count, so chunked-encoded responses (noContent-Length) work the same as fixed-length ones. The state.json is written before the download starts.decide_startcross-checks the download file's existence forDownloading/Downloadedstates. If the file was evicted, the state is unrecoverable — fall through toFresh.Diff scope
Net
−867lines of code despite adding a full state-machine module — the patch_manager / download_state / scattered-state-helpers tonnage outweighs it.Test plan
cargo test— 213 tests pass (down from 257 because the patch_manager unit tests are gone with the module). End-to-end coverage inupdater.rs::testsis unchanged and all green.updater::testsresume / rollback / state-recovery / install-failure scenario reproduced end-to-end against the new layout.cargo fmt --checkclean.Caveats
{cache}/patches/is wiped on release-version change — no migration needed because per-release state can change freely between releases. Devices upgrading to a release shipped with this PR will redownload any in-flight patch once. This is consistent with how per-release state was always handled.config.download_diris now vestigial. The lifecycle uses{storage_dir}/patches/{N}/so the priorcode_cache_dir/downloadslocation is no longer used. The field is left inUpdateConfigto keep the C API stable; it can be removed in a follow-up.FileOperation::DeleteDiris now only used by tests. Compile-time warning; safe to ignore or remove the variant in a follow-up.