
feat: recover crashed evals from buffer segment files #25

Draft

rasmusfaber wants to merge 1055 commits into main from buffer-segment-recovery

Conversation

@rasmusfaber

This PR contains:

  • New features
  • Changes to dev-tools e.g. CI config / github tooling
  • Docs
  • Bug fixes
  • Code refactor

What is the current behavior? (You can also link to an open issue here)

inspect log recover requires a local SQLite buffer database to recover crashed evals. When the buffer DB is lost (e.g., OOMKilled pod with ephemeral storage), recovery is impossible even though the same data exists as compressed segment files in a .buffer/ directory alongside the .eval file (potentially on S3).

What is the new behavior?

inspect log recover automatically falls back to reading .buffer segment files when no local SQLite buffer database exists. Both SampleBufferDatabase (SQLite) and SampleBufferFilestore (segments) implement the SampleBuffer interface, so the filestore slots into the existing recovery pipeline.

Changes:

  • SampleBuffer ABC gains cleanup() as an abstract method
  • SampleBufferFilestore gains iter_sample_segments() for streaming segment reads with best-effort error handling (missing/corrupt segments are skipped with warnings)
  • read_buffer_recovery_data() falls back to filestore when no SQLite DB exists
  • recover_eval_log() / recover_eval_log_async() gain a no_events parameter
  • --no-events CLI flag excludes event transcript from recovered samples (reduces output size)
  • --list output now shows recovery source (database or filestore)
  • RecoverableEvalLog gains a source field
  • reconstruct_eval_sample() gains an include_events parameter
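The interface pattern behind the fallback can be sketched as follows. This is a minimal, self-contained illustration: `SampleBuffer` and `cleanup()` are named in the PR, but the method bodies and the `pick_buffer` helper are hypothetical stand-ins, not inspect_ai's real implementation.

```python
# Illustrative sketch only -- real classes live in inspect_ai and differ.
from abc import ABC, abstractmethod


class SampleBuffer(ABC):
    """Common interface the recovery pipeline consumes."""

    @abstractmethod
    def cleanup(self) -> None: ...


class SampleBufferDatabase(SampleBuffer):
    def cleanup(self) -> None:
        pass  # e.g. delete the local SQLite buffer file


class SampleBufferFilestore(SampleBuffer):
    def cleanup(self) -> None:
        pass  # e.g. remove the .buffer/ segment directory


def pick_buffer(db_exists: bool) -> SampleBuffer:
    # SQLite takes priority; segment files are the fallback
    # when the local DB was lost (e.g. OOMKilled pod).
    return SampleBufferDatabase() if db_exists else SampleBufferFilestore()
```

Because both implementations satisfy `SampleBuffer`, the recovery code never needs to branch on which backing store it got.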

Does this PR introduce a breaking change?

No. The BufferRecoveryData.buffer field type widens from SampleBufferDatabase | None to SampleBuffer | None, which is backwards-compatible for consumers that only use the SampleBuffer interface. All existing tests pass unchanged.

Other information:

17 new tests covering: filestore fallback, streaming segments, missing/corrupt segment handling, multi-sample recovery, cleanup behavior, --no-events flag, --list integration, empty manifest edge case, and SQLite-takes-priority behavior.

🤖 Generated with Claude Code

bsnodin and others added 30 commits April 8, 2026 09:12
Match the Anthropic provider's tool_calls guard pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When an explicit --id is provided to eval-set, display it in brackets
before the task name in both single-task and multi-task panel headers.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
…EIS#3646)

* fix: handle special token strings in tiktoken encoding

count_text_tokens() crashes with ValueError when text contains tiktoken
special tokens like <|endoftext|>. Pass disallowed_special=() to treat
them as normal text, which is safe since this function is only used for
approximate length estimates in context compaction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: apply same disallowed_special fix to OpenAI provider

The OpenAI provider has its own count_text_tokens override that calls
enc.encode(text) without disallowed_special=(), making it vulnerable
to the same ValueError on special token strings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update CHANGELOG with recent changes

Updated CHANGELOG to reflect recent changes in various components.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
…rnmentBEIS#3651)

* fix: include LoRA adapter in vLLM model name for eval logs/UI

When using a LoRA adapter (e.g. `vllm/base-model:adapter`), the adapter
suffix was stripped from `model_name` at init, so eval logs, the UI
header, log filenames, and model_usage keys all showed only the base
model. The adapter info was only visible in individual sample completions.

Keep the original `model_name` (including `:adapter` suffix) so it flows
through `Model.name` → `ModelName` → `EvalSpec.model` → logs/UI. The
OpenAI client init and API routing are unaffected — they already use
`self.base_model` and `service_model_name()` respectively.

Fixes UKGovernmentBEIS#3648

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: update assertion to expect full model name with adapter suffix

The test was asserting `model_name == "base-model"` but after the fix,
`model_name` correctly retains the adapter suffix as
`"base-model:some-adapter"`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: terminate vLLM server processes in test cleanup fixture

The _clean_vllm_servers fixture was only calling _vllm_servers.clear(),
which orphaned running server processes.  These zombies held GPU memory,
causing subsequent tests to OOM when starting new servers with
--enable-lora.  Use cleanup_servers() which terminates processes before
clearing the registry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update CHANGELOG with new bugfix entry

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
…GovernmentBEIS#3652)

* design doc

* step 1: read crashed eval logs

* update step 1

* step 2: read recovery data from buffer

* update doc with step 2

* step 3: sample reconstruction

* update step 3

* step 4, write recovered log

* update step 5

* step 5: recovery api

* update design doc

* step 6 implementation

* update recover.md

* more e2e tests

* code review feedback

* code review feedback

* implement streaming

* don't use connections as high throughput trigger

* integrate with retry

* update doc

* add overwrite parameter

* update plan for eval-set

* integration with eval set

* doc updates

* doc tweaks

* address review feedback

* additional review feedback

* improve overwrite guard

* improve display

* ruff format

* use model_copy
…S#3654)

* Remove docker-sandbox unhealthy_services computation

This is dead code: the unhealthy_services list is computed, then
discarded. This is causing an error in cases where the
successful_service["Service"] is not a key in services
(aliased to unhealthy_services), which can be the case for task-oriented
docker containers that exit after their work is complete as normal
behavior.

* Update CHANGELOG.md

---------

Co-authored-by: jjallaire <jj.allaire@gmail.com>
…S#3647)

* Allow Score.value to be None for intermediate scores that errored

Intermediate ScoreEvents can have value=None when scoring encounters
an error during evaluation (e.g. ValueError during an intermediate
scoring check). The Score model previously rejected None values,
causing deserialization of entire eval files to fail even when all
final sample scores were valid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Coerce null score values to NaN in ScoreEvent deserialization

Instead of changing Score.value to accept None (which caused 14 mypy
errors across metrics, reducers, and agents), add a model_validator on
ScoreEvent that converts null score values to NaN before Pydantic
validates. This handles older eval logs with intermediate ScoreEvents
where scoring errored, without affecting downstream code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
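The `mode="before"` validator approach described above can be sketched like this (Pydantic's `model_validator` is real API; the field names here are simplified stand-ins for inspect_ai's actual schema):

```python
import math
from typing import Any

from pydantic import BaseModel, model_validator


class Score(BaseModel):
    value: float  # unchanged: still rejects None after validation


class ScoreEvent(BaseModel):
    score: Score

    @model_validator(mode="before")
    @classmethod
    def _null_score_to_nan(cls, data: Any) -> Any:
        # Coerce a null score value to NaN before Pydantic validates,
        # so older logs with errored intermediate scores still load.
        if isinstance(data, dict):
            score = data.get("score")
            if isinstance(score, dict) and score.get("value") is None:
                score["value"] = math.nan
        return data


event = ScoreEvent.model_validate({"score": {"value": None}})
assert math.isnan(event.score.value)
```

Keeping the coercion on the event (rather than widening `Score.value` to allow `None`) is what avoids rippling type changes through metrics, reducers, and agents.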

* Update CHANGELOG with recent feature additions

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
* initial work on custom output limits

* code review feedback
…overnmentBEIS#3658)

Path.as_uri() encodes @ to %40 which is unnecessary (@ is valid in URI
path components per RFC 3986) and breaks round-tripping through
filesystem()/local_path().

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
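The quoting behavior described above is easy to demonstrate with the stdlib (the path here is made up for illustration):

```python
from pathlib import PurePosixPath

# pathlib percent-encodes "@" even though RFC 3986 permits it
# unencoded in URI path segments, breaking round-tripping.
uri = PurePosixPath("/logs/user@example/run.eval").as_uri()
assert uri == "file:///logs/user%40example/run.eval"
```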
jjallaire and others added 30 commits April 22, 2026 20:34
…files (UKGovernmentBEIS#3746)

* test: add failing test for multi-frame zstd zip entries

* feat: cap zstd frames at 200 MiB input for JS-decoder compatibility

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: verify small zstd entries stay single-frame

* test: verify multi-frame zstd round-trip preserves bytes

The round-trip test revealed that the decompressor path also needed
fixing: the previous ZstdDecompressObjWrapper (from zipfile_zstd) used
a decompressobj that stopped after the first zstd frame, leaving
subsequent frames unread and causing a CRC mismatch on read-back.

Add _MultiFrameZstdDecompressObj, which chains fresh inner decompressobj
instances across frame boundaries, and patch _get_decompressor alongside
_get_compressor so both directions handle multi-frame streams correctly.

* perf: avoid O(n^2) concat and repeated compressor construction

Two issues on multi-frame writes:

1. The wrapper's compress() accumulated output with += b"...", which
   is O(n^2) across the 1 GiB+ compressed output of a 1.38 GiB entry —
   gigabytes of needless copying. Use a list + b"".join, and a
   memoryview over the input to avoid slice copies.

2. The factory delegated to zipfile_zstd's _get_compressor, which
   constructs a fresh ZstdCompressor(threads=12) per compressobj call.
   Share a single compressor across all frames of an entry.

Before: ~9.5 s for the 1.38 GiB entry (~3x baseline).
After:  ~3.9 s (+25% vs single-frame baseline, matching the inherent
zstd multi-frame dictionary-reset overhead).
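The two accumulation patterns contrasted in point 1 above, in miniature (synthetic data, not the wrapper's actual code):

```python
# Quadratic: each += copies the entire accumulated buffer.
out = b""
for part in (b"frame1", b"frame2", b"frame3"):
    out += part  # O(total) copy per iteration

# Linear: collect chunks, join once at the end.
chunks: list[bytes] = []
for part in (b"frame1", b"frame2", b"frame3"):
    chunks.append(part)      # amortized O(1), no copying
joined = b"".join(chunks)    # single final copy

assert out == joined == b"frame1frame2frame3"

# A memoryview slices without copying, unlike bytes slicing.
data = b"0123456789" * 3
view = memoryview(data)[:10]
assert bytes(view) == b"0123456789"
```

Over a 1 GiB+ compressed stream, the difference between the two loops is gigabytes of redundant copying, which matches the ~9.5 s to ~3.9 s improvement reported above.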

* refactor: clean up zstd patching

- Resolve zipfile.ZIP_ZSTANDARD once at import time into _ZIP_ZSTANDARD,
  collapsing three # type: ignore suppressions into one and surfacing a
  missing attribute at import time instead of at call time.
- Extract the threads count zipfile_zstd hardcodes (12) into _ZSTD_THREADS
  with a comment pointing at the source of truth, so our value and theirs
  can't silently drift apart.
- Skip the trailing frame flush when the last compress() call landed
  exactly on a frame boundary; otherwise we appended an empty 9-byte
  zstd frame to such entries.

* fix: satisfy mypy on Python 3.10 / 3.11

- Type the compressor and decompressor factories with the concrete
  zstandard classes instead of Any, so self._obj.flush() no longer
  leaks Any into a function declared to return bytes.
- In the test file, resolve ZIP_ZSTANDARD via getattr like the existing
  test_async_zip.py does, since on Python < 3.14 stdlib zipfile does
  not declare the attribute and type checkers don't see zipfile_zstd's
  runtime monkey-patch.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The test was calling openai/gpt-4, which maps to the retired gpt-4-0613
snapshot. Every eval call returned status="error" and results=None,
causing the assertion on results.scores to fail. gpt-4o is the direct
successor and supports the same tool-calling interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/triage-24841858600

fix: use gpt-4o in test_openai_tools (gpt-4 is retired)
Add note about handling null response.tools in OpenAI.
…l-tools

Responses API BugFix: Gracefully catch when response.tools is null and normalize it to []
…upgrade-deps

chore(deps): upgrade dependencies in uv.lock
…ew-latest

Improved Column selection and rendering in Folders / Tasks View
Added a log recovery feature to manage memory during large evaluations.