
feat: recover crashed evals from buffer segment files #25

Draft

rasmusfaber wants to merge 1055 commits into main from buffer-segment-recovery

Conversation

@rasmusfaber

This PR contains:

  • New features
  • Changes to dev-tools e.g. CI config / github tooling
  • Docs
  • Bug fixes
  • Code refactor

What is the current behavior? (You can also link to an open issue here)

inspect log recover requires a local SQLite buffer database to recover crashed evals. When the buffer DB is lost (e.g., OOMKilled pod with ephemeral storage), recovery is impossible even though the same data exists as compressed segment files in a .buffer/ directory alongside the .eval file (potentially on S3).

What is the new behavior?

inspect log recover automatically falls back to reading .buffer segment files when no local SQLite buffer database exists. Both SampleBufferDatabase (SQLite) and SampleBufferFilestore (segments) implement the SampleBuffer interface, so the filestore slots into the existing recovery pipeline.

Changes:

  • SampleBuffer ABC gains cleanup() as an abstract method
  • SampleBufferFilestore gains iter_sample_segments() for streaming segment reads with best-effort error handling (missing/corrupt segments are skipped with warnings)
  • read_buffer_recovery_data() falls back to filestore when no SQLite DB exists
  • recover_eval_log() / recover_eval_log_async() gain a no_events parameter
  • --no-events CLI flag excludes event transcript from recovered samples (reduces output size)
  • --list output now shows recovery source (database or filestore)
  • RecoverableEvalLog gains a source field
  • reconstruct_eval_sample() gains an include_events parameter
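The interface pattern behind the fallback can be sketched as follows. This is a minimal, self-contained illustration: `SampleBuffer` and `cleanup()` are named in the PR, but the method bodies and the `pick_buffer` helper are hypothetical stand-ins, not inspect_ai's real implementation.

```python
# Illustrative sketch only -- real classes live in inspect_ai and differ.
from abc import ABC, abstractmethod


class SampleBuffer(ABC):
    """Common interface the recovery pipeline consumes."""

    @abstractmethod
    def cleanup(self) -> None: ...


class SampleBufferDatabase(SampleBuffer):
    def cleanup(self) -> None:
        pass  # e.g. delete the local SQLite buffer file


class SampleBufferFilestore(SampleBuffer):
    def cleanup(self) -> None:
        pass  # e.g. remove the .buffer/ segment directory


def pick_buffer(db_exists: bool) -> SampleBuffer:
    # SQLite takes priority; segment files are the fallback
    # when the local DB was lost (e.g. OOMKilled pod).
    return SampleBufferDatabase() if db_exists else SampleBufferFilestore()
```

Because both implementations satisfy `SampleBuffer`, the recovery code never needs to branch on which backing store it got.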

Does this PR introduce a breaking change?

No. The BufferRecoveryData.buffer field type widens from SampleBufferDatabase | None to SampleBuffer | None, which is backwards-compatible for consumers that only use the SampleBuffer interface. All existing tests pass unchanged.

Other information:

17 new tests covering: filestore fallback, streaming segments, missing/corrupt segment handling, multi-sample recovery, cleanup behavior, --no-events flag, --list integration, empty manifest edge case, and SQLite-takes-priority behavior.

🤖 Generated with Claude Code

bsnodin and others added 30 commits April 8, 2026 09:12
Match the Anthropic provider's tool_calls guard pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When an explicit --id is provided to eval-set, display it in brackets
before the task name in both single-task and multi-task panel headers.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
…EIS#3646)

* fix: handle special token strings in tiktoken encoding

count_text_tokens() crashes with ValueError when text contains tiktoken
special tokens like <|endoftext|>. Pass disallowed_special=() to treat
them as normal text, which is safe since this function is only used for
approximate length estimates in context compaction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: apply same disallowed_special fix to OpenAI provider

The OpenAI provider has its own count_text_tokens override that calls
enc.encode(text) without disallowed_special=(), making it vulnerable
to the same ValueError on special token strings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update CHANGELOG with recent changes

Updated CHANGELOG to reflect recent changes in various components.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
…rnmentBEIS#3651)

* fix: include LoRA adapter in vLLM model name for eval logs/UI

When using a LoRA adapter (e.g. `vllm/base-model:adapter`), the adapter
suffix was stripped from `model_name` at init, so eval logs, the UI
header, log filenames, and model_usage keys all showed only the base
model. The adapter info was only visible in individual sample completions.

Keep the original `model_name` (including `:adapter` suffix) so it flows
through `Model.name` → `ModelName` → `EvalSpec.model` → logs/UI. The
OpenAI client init and API routing are unaffected — they already use
`self.base_model` and `service_model_name()` respectively.

Fixes UKGovernmentBEIS#3648

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: update assertion to expect full model name with adapter suffix

The test was asserting `model_name == "base-model"` but after the fix,
`model_name` correctly retains the adapter suffix as
`"base-model:some-adapter"`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: terminate vLLM server processes in test cleanup fixture

The _clean_vllm_servers fixture was only calling _vllm_servers.clear(),
which orphaned running server processes.  These zombies held GPU memory,
causing subsequent tests to OOM when starting new servers with
--enable-lora.  Use cleanup_servers() which terminates processes before
clearing the registry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update CHANGELOG with new bugfix entry

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
…GovernmentBEIS#3652)

* design doc

* step 1: read crashed eval logs

* update step 1

* step 2: read recovery data from buffer

* update doc with step 2

* step 3: sample reconstruction

* update step 3

* step 4, write recovered log

* update step 5

* step 5: recovery api

* update design doc

* step 6 implementation

* update recover.md

* more e2e tests

* code review feedback

* code review feedback

* implement streaming

* don't use connections as high throughput trigger

* integrate with retry

* update doc

* add overwrite parameter

* update plan for eval-set

* integration with eval set

* doc updates

* doc tweaks

* address review feedback

* additional review feedback

* improve overwrite guard

* improve display

* ruff format

* use model_copy
…S#3654)

* Remove docker-sandbox unhealthy_services computation

This is dead code: the unhealthy_services list is computed, then
discarded. This is causing an error in cases where the
successful_service["Service"] is not a key in services
(aliased to unhealthy_services), which can be the case for task-oriented
docker containers that exit after their work is complete as normal
behavior.

* Update CHANGELOG.md

---------

Co-authored-by: jjallaire <jj.allaire@gmail.com>
…S#3647)

* Allow Score.value to be None for intermediate scores that errored

Intermediate ScoreEvents can have value=None when scoring encounters
an error during evaluation (e.g. ValueError during an intermediate
scoring check). The Score model previously rejected None values,
causing deserialization of entire eval files to fail even when all
final sample scores were valid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Coerce null score values to NaN in ScoreEvent deserialization

Instead of changing Score.value to accept None (which caused 14 mypy
errors across metrics, reducers, and agents), add a model_validator on
ScoreEvent that converts null score values to NaN before Pydantic
validates. This handles older eval logs with intermediate ScoreEvents
where scoring errored, without affecting downstream code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
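The `mode="before"` validator approach described above can be sketched like this (Pydantic's `model_validator` is real API; the field names here are simplified stand-ins for inspect_ai's actual schema):

```python
import math
from typing import Any

from pydantic import BaseModel, model_validator


class Score(BaseModel):
    value: float  # unchanged: still rejects None after validation


class ScoreEvent(BaseModel):
    score: Score

    @model_validator(mode="before")
    @classmethod
    def _null_score_to_nan(cls, data: Any) -> Any:
        # Coerce a null score value to NaN before Pydantic validates,
        # so older logs with errored intermediate scores still load.
        if isinstance(data, dict):
            score = data.get("score")
            if isinstance(score, dict) and score.get("value") is None:
                score["value"] = math.nan
        return data


event = ScoreEvent.model_validate({"score": {"value": None}})
assert math.isnan(event.score.value)
```

Keeping the coercion on the event (rather than widening `Score.value` to allow `None`) is what avoids rippling type changes through metrics, reducers, and agents.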

* Update CHANGELOG with recent feature additions

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
* initial work on custom output limits

* code review feedback
…overnmentBEIS#3658)

Path.as_uri() encodes @ to %40 which is unnecessary (@ is valid in URI
path components per RFC 3986) and breaks round-tripping through
filesystem()/local_path().

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
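The quoting behavior described above is easy to demonstrate with the stdlib (the path here is made up for illustration):

```python
from pathlib import PurePosixPath

# pathlib percent-encodes "@" even though RFC 3986 permits it
# unencoded in URI path segments, breaking round-tripping.
uri = PurePosixPath("/logs/user@example/run.eval").as_uri()
assert uri == "file:///logs/user%40example/run.eval"
```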
jjallaire and others added 30 commits April 22, 2026 20:34
…files (UKGovernmentBEIS#3746)

* test: add failing test for multi-frame zstd zip entries

* feat: cap zstd frames at 200 MiB input for JS-decoder compatibility

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: verify small zstd entries stay single-frame

* test: verify multi-frame zstd round-trip preserves bytes

The round-trip test revealed that the decompressor path also needed
fixing: the previous ZstdDecompressObjWrapper (from zipfile_zstd) used
a decompressobj that stopped after the first zstd frame, leaving
subsequent frames unread and causing a CRC mismatch on read-back.

Add _MultiFrameZstdDecompressObj, which chains fresh inner decompressobj
instances across frame boundaries, and patch _get_decompressor alongside
_get_compressor so both directions handle multi-frame streams correctly.

* perf: avoid O(n^2) concat and repeated compressor construction

Two issues on multi-frame writes:

1. The wrapper's compress() accumulated output with += b"...", which
   is O(n^2) across the 1 GiB+ compressed output of a 1.38 GiB entry —
   gigabytes of needless copying. Use a list + b"".join, and a
   memoryview over the input to avoid slice copies.

2. The factory delegated to zipfile_zstd's _get_compressor, which
   constructs a fresh ZstdCompressor(threads=12) per compressobj call.
   Share a single compressor across all frames of an entry.

Before: ~9.5 s for the 1.38 GiB entry (~3x baseline).
After:  ~3.9 s (+25% vs single-frame baseline, matching the inherent
zstd multi-frame dictionary-reset overhead).
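The two accumulation patterns contrasted in point 1 above, in miniature (synthetic data, not the wrapper's actual code):

```python
# Quadratic: each += copies the entire accumulated buffer.
out = b""
for part in (b"frame1", b"frame2", b"frame3"):
    out += part  # O(total) copy per iteration

# Linear: collect chunks, join once at the end.
chunks: list[bytes] = []
for part in (b"frame1", b"frame2", b"frame3"):
    chunks.append(part)      # amortized O(1), no copying
joined = b"".join(chunks)    # single final copy

assert out == joined == b"frame1frame2frame3"

# A memoryview slices without copying, unlike bytes slicing.
data = b"0123456789" * 3
view = memoryview(data)[:10]
assert bytes(view) == b"0123456789"
```

Over a 1 GiB+ compressed stream, the difference between the two loops is gigabytes of redundant copying, which matches the ~9.5 s to ~3.9 s improvement reported above.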

* refactor: clean up zstd patching

- Resolve zipfile.ZIP_ZSTANDARD once at import time into _ZIP_ZSTANDARD,
  collapsing three # type: ignore suppressions into one and surfacing a
  missing attribute at import time instead of at call time.
- Extract the threads count zipfile_zstd hardcodes (12) into _ZSTD_THREADS
  with a comment pointing at the source of truth, so our value and theirs
  can't silently drift apart.
- Skip the trailing frame flush when the last compress() call landed
  exactly on a frame boundary; otherwise we appended an empty 9-byte
  zstd frame to such entries.

* fix: satisfy mypy on Python 3.10 / 3.11

- Type the compressor and decompressor factories with the concrete
  zstandard classes instead of Any, so self._obj.flush() no longer
  leaks Any into a function declared to return bytes.
- In the test file, resolve ZIP_ZSTANDARD via getattr like the existing
  test_async_zip.py does, since on Python < 3.14 stdlib zipfile does
  not declare the attribute and type checkers don't see zipfile_zstd's
  runtime monkey-patch.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The test was calling openai/gpt-4, which maps to the retired gpt-4-0613
snapshot. Every eval call returned status="error" and results=None,
causing the assertion on results.scores to fail. gpt-4o is the direct
successor and supports the same tool-calling interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/triage-24841858600

fix: use gpt-4o in test_openai_tools (gpt-4 is retired)
Add note about handling null response.tools in OpenAI.
…l-tools

Responses API BugFix: Gracefully catch when response.tools is null and normalize it to []
…upgrade-deps

chore(deps): upgrade dependencies in uv.lock
…ew-latest

Improved Column selection and rendering in Folders / Tasks View
Added a log recovery feature to manage memory during large evaluations.