feat: recover crashed evals from buffer segment files #25
Draft
rasmusfaber wants to merge 1055 commits into
Conversation
Match the Anthropic provider's tool_calls guard pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When an explicit --id is provided to eval-set, display it in brackets before the task name in both single-task and multi-task panel headers. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: jjallaire <jj.allaire@gmail.com>
…EIS#3646)

* fix: handle special token strings in tiktoken encoding

  count_text_tokens() crashes with ValueError when text contains tiktoken special tokens like <|endoftext|>. Pass disallowed_special=() to treat them as normal text, which is safe since this function is only used for approximate length estimates in context compaction.

* fix: apply same disallowed_special fix to OpenAI provider

  The OpenAI provider has its own count_text_tokens override that calls enc.encode(text) without disallowed_special=(), making it vulnerable to the same ValueError on special token strings.

* Update CHANGELOG with recent changes

  Updated CHANGELOG to reflect recent changes in various components.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
…rnmentBEIS#3651)

* fix: include LoRA adapter in vLLM model name for eval logs/UI

  When using a LoRA adapter (e.g. `vllm/base-model:adapter`), the adapter suffix was stripped from `model_name` at init, so eval logs, the UI header, log filenames, and model_usage keys all showed only the base model. The adapter info was only visible in individual sample completions. Keep the original `model_name` (including the `:adapter` suffix) so it flows through `Model.name` → `ModelName` → `EvalSpec.model` → logs/UI. The OpenAI client init and API routing are unaffected; they already use `self.base_model` and `service_model_name()` respectively.

  Fixes UKGovernmentBEIS#3648

* test: update assertion to expect full model name with adapter suffix

  The test was asserting `model_name == "base-model"`, but after the fix `model_name` correctly retains the adapter suffix as `"base-model:some-adapter"`.

* fix: terminate vLLM server processes in test cleanup fixture

  The _clean_vllm_servers fixture was only calling _vllm_servers.clear(), which orphaned running server processes. These zombies held GPU memory, causing subsequent tests to OOM when starting new servers with --enable-lora. Use cleanup_servers(), which terminates processes before clearing the registry.

* Update CHANGELOG with new bugfix entry

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
…GovernmentBEIS#3652)

* design doc
* step 1: read crashed eval logs
* update step 1
* step 2: read recovery data from buffer
* update doc with step 2
* step 3: sample reconstruction
* update step 3
* step 4: write recovered log
* update step 5
* step 5: recovery api
* update design doc
* step 6 implementation
* update recover.md
* more e2e tests
* code review feedback
* code review feedback
* implement streaming
* don't use connections as high-throughput trigger
* integrate with retry
* update doc
* add overwrite parameter
* update plan for eval-set
* integration with eval set
* doc updates
* doc tweaks
* address review feedback
* additional review feedback
* improve overwrite guard
* improve display
* ruff format
* use model_copy
…S#3654)

* Remove docker-sandbox unhealthy_services computation

  This is dead code: the unhealthy_services list is computed, then discarded. It was causing an error in cases where successful_service["Service"] is not a key in services (aliased to unhealthy_services), which can be the case for task-oriented docker containers that exit after their work is complete as normal behavior.

* Update CHANGELOG.md

---------

Co-authored-by: jjallaire <jj.allaire@gmail.com>
…S#3647)

* Allow Score.value to be None for intermediate scores that errored

  Intermediate ScoreEvents can have value=None when scoring encounters an error during evaluation (e.g. a ValueError during an intermediate scoring check). The Score model previously rejected None values, causing deserialization of entire eval files to fail even when all final sample scores were valid.

* Coerce null score values to NaN in ScoreEvent deserialization

  Instead of changing Score.value to accept None (which caused 14 mypy errors across metrics, reducers, and agents), add a model_validator on ScoreEvent that converts null score values to NaN before Pydantic validates. This handles older eval logs with intermediate ScoreEvents where scoring errored, without affecting downstream code.

* Update CHANGELOG with recent feature additions

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: jjallaire <jj.allaire@gmail.com>
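The validator approach described above can be sketched with Pydantic v2's `model_validator(mode="before")`. The class and field names below are simplified stand-ins for inspect_ai's actual `ScoreEvent`/`Score` models:

```python
import math
from typing import Any

from pydantic import BaseModel, model_validator

class Score(BaseModel):
    value: float  # the real Score accepts richer value types

class ScoreEvent(BaseModel):
    score: Score

    @model_validator(mode="before")
    @classmethod
    def _coerce_null_score_value(cls, data: Any) -> Any:
        # Older logs can contain {"score": {"value": null}} when an
        # intermediate scoring step errored. Coerce to NaN here, before
        # Pydantic validates, so the rest of the file still deserializes
        # and Score.value keeps its non-optional type downstream.
        if isinstance(data, dict):
            score = data.get("score")
            if isinstance(score, dict) and score.get("value") is None:
                score["value"] = math.nan
        return data

event = ScoreEvent.model_validate({"score": {"value": None}})
assert math.isnan(event.score.value)
```

A `mode="before"` validator sees the raw input dict, which is what makes this possible without widening the field's type annotation.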
Co-authored-by: jjallaire <jj.allaire@gmail.com>
* initial work on custom output limits
* code review feedback
…haracters (e.g. ☆, ○, ◎)
…overnmentBEIS#3658) Path.as_uri() encodes @ to %40, which is unnecessary (@ is valid in URI path components per RFC 3986) and breaks round-tripping through filesystem()/local_path(). Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: jjallaire <jj.allaire@gmail.com>
…bump-again Bump ts-mono version
…files (UKGovernmentBEIS#3746)

* test: add failing test for multi-frame zstd zip entries

* feat: cap zstd frames at 200 MiB input for JS-decoder compatibility

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: verify small zstd entries stay single-frame

* test: verify multi-frame zstd round-trip preserves bytes

  The round-trip test revealed that the decompressor path also needed fixing: the previous ZstdDecompressObjWrapper (from zipfile_zstd) used a decompressobj that stopped after the first zstd frame, leaving subsequent frames unread and causing a CRC mismatch on read-back. Add _MultiFrameZstdDecompressObj, which chains fresh inner decompressobj instances across frame boundaries, and patch _get_decompressor alongside _get_compressor so both directions handle multi-frame streams correctly.

* perf: avoid O(n^2) concat and repeated compressor construction

  Two issues on multi-frame writes:

  1. The wrapper's compress() accumulated output with += b"...", which is O(n^2) across the 1 GiB+ compressed output of a 1.38 GiB entry: gigabytes of needless copying. Use a list + b"".join, and a memoryview over the input to avoid slice copies.

  2. The factory delegated to zipfile_zstd's _get_compressor, which constructs a fresh ZstdCompressor(threads=12) per compressobj call. Share a single compressor across all frames of an entry.

  Before: ~9.5 s for the 1.38 GiB entry (~3x baseline). After: ~3.9 s (+25% vs the single-frame baseline, matching the inherent zstd multi-frame dictionary-reset overhead).

* refactor: clean up zstd patching

  - Resolve zipfile.ZIP_ZSTANDARD once at import time into _ZIP_ZSTANDARD, collapsing three # type: ignore suppressions into one and surfacing a missing attribute at import time instead of at call time.
  - Extract the threads count zipfile_zstd hardcodes (12) into _ZSTD_THREADS with a comment pointing at the source of truth, so our value and theirs can't silently drift apart.
  - Skip the trailing frame flush when the last compress() call landed exactly on a frame boundary; otherwise we appended an empty 9-byte zstd frame to such entries.

* fix: satisfy mypy on Python 3.10 / 3.11

  - Type the compressor and decompressor factories with the concrete zstandard classes instead of Any, so self._obj.flush() no longer leaks Any into a function declared to return bytes.
  - In the test file, resolve ZIP_ZSTANDARD via getattr like the existing test_async_zip.py does, since on Python < 3.14 the stdlib zipfile does not declare the attribute and type checkers don't see zipfile_zstd's runtime monkey-patch.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
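The accumulation fix in point 1 above is a general pattern: collect chunks in a list and join once, instead of repeated `bytes +=`, which copies the whole buffer on every call. A self-contained sketch using zlib as a stand-in codec (the real wrapper uses zstandard; the class name and frame size here are illustrative):

```python
import zlib  # stand-in codec; the actual wrapper compresses with zstandard

class ChunkingCompressObj:
    """Feed input to a compressor in fixed-size chunks, O(n) overall."""

    def __init__(self, chunk_size: int = 1 << 16) -> None:
        self._chunk_size = chunk_size
        self._obj = zlib.compressobj()  # one shared compressor, not one per chunk

    def compress(self, data: bytes) -> bytes:
        out: list[bytes] = []       # O(n) accumulation: list + single join
        view = memoryview(data)     # memoryview slices avoid copying the input
        for start in range(0, len(view), self._chunk_size):
            out.append(self._obj.compress(view[start:start + self._chunk_size]))
        return b"".join(out)

    def flush(self) -> bytes:
        return self._obj.flush()

obj = ChunkingCompressObj()
payload = b"x" * 300_000
compressed = obj.compress(payload) + obj.flush()
assert zlib.decompress(compressed) == payload
```

The `+=` version is quadratic because each concatenation reallocates and copies everything accumulated so far; `b"".join` allocates the final buffer once.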
The test was calling openai/gpt-4, which maps to the retired gpt-4-0613 snapshot. Every eval call returned status="error" and results=None, causing the assertion on results.scores to fail. gpt-4o is the direct successor and supports the same tool-calling interface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/triage-24841858600 fix: use gpt-4o in test_openai_tools (gpt-4 is retired)
…iew-config View Configuration
Add note about handling null response.tools in OpenAI.
…l-tools Responses API BugFix: Gracefully catch when response.tools is null and normalize it to []
…upgrade-deps chore(deps): upgrade dependencies in uv.lock
…ew-latest Improved Column selection and rendering in Folders / Tasks View
Added a log recovery feature to manage memory during large evaluations.
This PR contains:
What is the current behavior? (You can also link to an open issue here)
`inspect log recover` requires a local SQLite buffer database to recover crashed evals. When the buffer DB is lost (e.g., an OOMKilled pod with ephemeral storage), recovery is impossible even though the same data exists as compressed segment files in a `.buffer/` directory alongside the `.eval` file (potentially on S3).

What is the new behavior?
`inspect log recover` automatically falls back to reading `.buffer` segment files when no local SQLite buffer database exists. Both `SampleBufferDatabase` (SQLite) and `SampleBufferFilestore` (segments) implement the `SampleBuffer` interface, so the filestore slots into the existing recovery pipeline.

Changes:
- `SampleBuffer` ABC gains `cleanup()` as an abstract method
- `SampleBufferFilestore` gains `iter_sample_segments()` for streaming segment reads with best-effort error handling (missing/corrupt segments are skipped with warnings)
- `read_buffer_recovery_data()` falls back to the filestore when no SQLite DB exists
- `recover_eval_log()` / `recover_eval_log_async()` gain a `no_events` parameter
- `--no-events` CLI flag excludes the event transcript from recovered samples (reduces output size)
- `--list` output now shows the recovery source (`database` or `filestore`)
- `RecoverableEvalLog` gains a `source` field
- `reconstruct_eval_sample()` gains an `include_events` parameter

Does this PR introduce a breaking change?
No. The `BufferRecoveryData.buffer` field type widens from `SampleBufferDatabase | None` to `SampleBuffer | None`, which is backwards-compatible for consumers that only use the `SampleBuffer` interface. All existing tests pass unchanged.

Other information:
17 new tests covering: filestore fallback, streaming segments, missing/corrupt segment handling, multi-sample recovery, cleanup behavior, the `--no-events` flag, `--list` integration, the empty-manifest edge case, and SQLite-takes-priority behavior.

🤖 Generated with Claude Code
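The fallback and priority behavior the PR describes can be sketched as below. The class names (`SampleBuffer`, `SampleBufferDatabase`, `SampleBufferFilestore`) follow the PR text, but the bodies are illustrative stubs and `open_sample_buffer` is a hypothetical helper, not the actual `read_buffer_recovery_data` implementation:

```python
from __future__ import annotations

import os
from typing import Protocol

class SampleBuffer(Protocol):
    """Common interface both buffer backends implement (per the PR)."""
    source: str
    def cleanup(self) -> None: ...

class SampleBufferDatabase:
    """Stub standing in for the SQLite-backed buffer."""
    source = "database"
    def __init__(self, db_path: str) -> None:
        self.db_path = db_path
    def cleanup(self) -> None:
        pass

class SampleBufferFilestore:
    """Stub standing in for the .buffer segment-file reader."""
    source = "filestore"
    def __init__(self, buffer_dir: str) -> None:
        self.buffer_dir = buffer_dir
    def cleanup(self) -> None:
        pass

def open_sample_buffer(db_path: str, buffer_dir: str) -> SampleBuffer | None:
    # SQLite takes priority when the local database exists; otherwise
    # fall back to the .buffer segment directory next to the .eval file.
    if os.path.exists(db_path):
        return SampleBufferDatabase(db_path)
    if os.path.isdir(buffer_dir):
        return SampleBufferFilestore(buffer_dir)
    return None
```

Because both backends satisfy the same `SampleBuffer` protocol, the rest of the recovery pipeline can consume whichever one `open_sample_buffer` returns without caring about the source, which is exactly why widening `BufferRecoveryData.buffer` to `SampleBuffer | None` is non-breaking.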