fix: lower compaction threshold to 75% and add token estimation fallback#250
fix: lower compaction threshold to 75% and add token estimation fallback#250
Conversation
Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #249
… overflow The previous 15% safety margin (85% threshold) was insufficient because: - Providers may return 0 token counts, making overflow detection unreliable - Token estimation has inherent inaccuracy - OpenCode upstream uses 75%, Gemini CLI uses 50% Increase margin to 25% (75% threshold) across JS and Rust implementations. Fixes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When providers (e.g., Nvidia/nemotron via OpenCode) return 0 for all token counts in usage data, the overflow check always reports no overflow, preventing compaction from ever triggering. This causes context to grow unbounded until hitting the model's hard limit. Add an estimatedInputTokens parameter to isOverflow() that estimates tokens from message content using the 4-chars-per-token heuristic. When provider tokens are all zeros, this estimate is used instead. Fixes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When estimated input tokens approach the model's context limit, cap maxOutputTokens to ensure input + output never exceeds the context limit. This prevents "context length exceeded" errors like the one observed where 230145 input + 32000 output = 262145 exceeded the nemotron-3-super-free limit of 262144. The cap uses the same token estimation used for the overflow check. A minimum of 1024 output tokens is preserved to avoid degenerate cases. Fixes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n tests Update existing test expectations from 85%/15% to 75%/25%. Add new tests for the estimatedInputTokens fallback: - Uses estimate when provider returns 0 tokens → triggers overflow - Ignores estimate when provider returns non-zero tokens - No overflow with 0 tokens and no estimate (backward compat) Fixes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document root causes (0 token counts from provider, uncapped max_tokens, narrow safety margin), research across industry tools (Gemini CLI, OpenCode, Claude Code, Codex CLI), and the three fixes applied. Includes the full solution draft log from the failed session for reference. Fixes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ault Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🤖 Solution Draft LogThis log file contains the complete execution trace of the AI solution draft process. 💰 Cost estimation:
📊 Context and tokens usage:Claude Opus 4.6:
Total: (156.4K + 21.1M cached) input tokens, 39.3K output tokens, $12.492750 cost Claude Haiku 4.5: Total: (223.9K + 1.0M cached) input tokens, 14.4K / 64K (23%) output tokens, $0.447609 cost 🤖 Models used:
📎 Log file uploaded as Gist (5604KB)Now working session is ended, feel free to review and add any feedback on the solution draft. |
✅ Ready to mergeThis pull request is now ready to be merged:
Monitored by hive-mind with --auto-restart-until-mergeable flag |
This reverts commit 94b9f68.
|
Are we sure provider does not return usage tokens? And also can be use real byte pair encoding, or real encoder per model to have precise results? Like there are different encoders for ChatGPT. We need to ensure all changes are correct, consistent, validated, tested, logged and fully meet each and all discussed requirements (check issue description and all comments in issue and in pull request). Ensure all CI/CD checks pass. |
|
🤖 AI Work Session Started Starting automated work session at 2026-04-11T20:01:31.353Z The PR has been converted to draft mode while work is in progress. This comment marks the beginning of an AI work session. Please wait for the session to finish, and provide your feedback. |
…counting Add Token.countTokens() that uses gpt-tokenizer (o200k_base encoding) for real BPE tokenization when available, falling back to the character-based heuristic (~4 chars/token) for models with unavailable tokenizers. This addresses the reviewer feedback asking whether we can use real byte pair encoding instead of the heuristic. Research found that: - gpt-tokenizer works for OpenAI-compatible models (GPT-4o, GPT-4.1, GPT-5) - Nvidia Nemotron, Google Gemini, Meta Llama use custom SentencePiece tokenizers with no JS libraries available - The 75% safety margin absorbs the ±20% estimation error for all models Changes: - js/src/util/token.ts: Add countTokens() with lazy-loaded BPE encoder - js/src/session/prompt.ts: Use Token.countTokens() for overflow detection - js/tests/token.test.ts: Tests for both estimate() and countTokens() - js/package.json: Add gpt-tokenizer dependency - docs/case-studies/issue-249/: Update analysis, add original failure log Closes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addressing feedback1. "Are we sure provider does not return usage tokens?"Yes, confirmed from the original failure log. Every single "tokens": { "input": 0, "output": 0, "reasoning": 0, "cache": { "read": 0, "write": 0 } }This occurred for all 10+ overflow checks. The 2. "Can we use real byte pair encoding, or real encoder per model?"Added However, real BPE cannot cover all models because different providers use incompatible tokenizers:
For these models, using the wrong tokenizer (e.g., Changes in this commit
|
🤖 Solution Draft LogThis log file contains the complete execution trace of the AI solution draft process. 💰 Cost estimation:
📊 Context and tokens usage:Claude Opus 4.6:
Total: (113.1K + 5.5M cached) input tokens, 23.4K output tokens, $4.034820 cost Claude Haiku 4.5: Total: (176.0K + 817.5K cached) input tokens, 8.5K / 64K (13%) output tokens, $0.337771 cost 🤖 Models used:
📎 Log file uploaded as Gist (3031KB)Now working session is ended, feel free to review and add any feedback on the solution draft. |
✅ Ready to mergeThis pull request is now ready to be merged:
Monitored by hive-mind with --auto-restart-until-mergeable flag |
|
But do we also have raw requests and responses logged? Not only step-finish events? May be there is a deeper bug we have, and just don't correctly get data from provider's response? Double check source code of OpenCode and other agentic CLIs. If we don't have enough raw requests and responses logged, that also should be fixed. We need to ensure all changes are correct, consistent, validated, tested, logged and fully meet each and all discussed requirements (check issue description and all comments in issue and in pull request). Ensure all CI/CD checks pass. |
|
🤖 AI Work Session Started Starting automated work session at 2026-04-11T20:19:56.893Z The PR has been converted to draft mode while work is in progress. This comment marks the beginning of an AI work session. Please wait for the session to finish, and provide your feedback. |
Investigation of the original failure log revealed that the raw HTTP response from OpenRouter DOES contain valid usage data (prompt_tokens: 15506, completion_tokens: 80), but the AI SDK's @ai-sdk/openai-compatible v1.0.33 fails to propagate these values to the finish-step event, resulting in 0 for all token counts. - Add verbose step-finish raw usage diagnostics in processor.ts - Enhance zero-token warning to fire regardless of finishReason - Update case study with deeper root cause analysis and upstream issue recommendations Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addressing feedback: raw requests/responses and deeper bug investigation1. "Do we also have raw requests and responses logged?"Yes. The codebase has
This logging proved essential — it captured the raw evidence showing the provider returns valid tokens. 2. "May be there is a deeper bug we have?"Yes, confirmed. Deep investigation of the original failure log revealed: The raw HTTP response from OpenRouter DOES contain valid usage data: But the AI SDK's step-finish event reports all zeros: "tokens": {"input": 0, "output": 0, "reasoning": 0}The request correctly includes 3. "Double check source code of OpenCode and other agentic CLIs"Checked. OpenCode (sst/opencode) uses essentially the same pattern:
OpenCode would have the same bug when using Changes in this commit
All tests pass
|
🤖 Solution Draft LogThis log file contains the complete execution trace of the AI solution draft process. 💰 Cost: $5.844440📊 Context and tokens usage:Claude Opus 4.6:
Total: (116.8K + 8.3M cached) input tokens, 23.5K output tokens, $5.454436 cost Claude Haiku 4.5: Total: 356.6K input tokens, 6.7K / 64K (10%) output tokens, $0.390004 cost 🤖 Models used:
📎 Log file uploaded as Gist (2792KB)Now working session is ended, feel free to review and add any feedback on the solution draft. |
✅ Ready to mergeThis pull request is now ready to be merged:
Monitored by hive-mind with --auto-restart-until-mergeable flag |
Can we fix that by using other popular AI API SDK, like Claude Code SDK? Or can we fallback to regular fetch/axios to starting making our own AI sdk? What source code of OpenCode, Codex, Gemini, Qwen are using at the moment? Can we have implemented all the ways so we can switch between them using CLI option and envs? And by default we should select the most reliable way to get all the data from each request and response, while also logging full http request and response with all statuses, headers, and so on, not just body. Or may be latest version AI SDK will work? We need to guarantee we actually get real usage data, and actually all the data we can from responses to propagate these for users and use in our Agent CLI itself. We need to ensure all changes are correct, consistent, validated, tested, logged and fully meet each and all discussed requirements (check issue description and all comments in issue and in pull request |
|
🤖 AI Work Session Started Starting automated work session at 2026-04-12T09:33:46.835Z The PR has been converted to draft mode while work is in progress. This comment marks the beginning of an AI work session. Please wait for the session to finish, and provide your feedback. |
When the AI SDK drops token usage data from its finish-step event (a known upstream bug in @ai-sdk/openai-compatible), the new SSE usage extractor intercepts the raw HTTP streaming response and parses usage tokens directly from SSE data chunks. This works for all providers (OpenAI, Anthropic, OpenRouter, Groq formats) and operates in both verbose and non-verbose modes. The processor now automatically recovers usage when AI SDK returns zeros, logging the recovery for diagnostics. This eliminates the root cause of issue #249 where compaction never triggered because token counts were always 0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Added comparison of SDK choices across agentic CLIs (OpenCode, Codex, Gemini CLI, Claude Code, Aider, Qwen Agent), documented the SSE usage recovery mechanism, and outlined future work for switchable SDK backends. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addressing feedback: SDK alternatives and guaranteed usage data1. "Can we fix that by using other popular AI API SDK, like Claude Code SDK?"Researched. Each major agentic CLI uses a different approach:
None of these are drop-in replacements because our codebase supports 11+ providers via the AI SDK unified interface. 2. "Can we fallback to regular fetch/axios?"Yes — implemented. The new The extractor works in both verbose and non-verbose modes — it tee()s the stream, reads one copy for usage extraction, and gives the other copy to the AI SDK. 3. "Or may be latest version AI SDK will work?"Unlikely. The streaming usage bug is a known open issue in the Vercel AI SDK:
4. "We need to guarantee we actually get real usage data"Now guaranteed through defense-in-depth:
5. Full switchable SDK backendsFull switchable backends (direct fetch, provider-specific SDKs, LiteLLM bridge) would require significant architectural work. This is documented as a future feature request in the case study. Changes in this push
All 54 JS tests + Rust tests pass. |
🤖 Solution Draft LogThis log file contains the complete execution trace of the AI solution draft process. 💰 Cost estimation:
📊 Context and tokens usage:Claude Opus 4.6:
Total: (169.8K + 6.8M cached) input tokens, 34.2K output tokens, $5.307618 cost Claude Haiku 4.5: Total: (200.0K + 701.9K cached) input tokens, 10.6K / 64K (17%) output tokens, $0.365299 cost 🤖 Models used:
📎 Log file uploaded as Gist (3725KB)Now working session is ended, feel free to review and add any feedback on the solution draft. |
✅ Ready to mergeThis pull request is now ready to be merged:
Monitored by hive-mind with --auto-restart-until-mergeable flag |
Summary
Fixes #249 — Context overflow error when AI SDK drops token usage from raw provider response, causing compaction to never trigger.
Root Cause (verified from raw HTTP logs)
The provider DOES return valid usage data, but the AI SDK fails to propagate it:
"prompt_tokens": 15506, "completion_tokens": 80in the final chunk@ai-sdk/openai-compatibledrops usage: Despite receiving valid usage, the AI SDK'sfinish-stepevent reports 0 for all token counts. This is a known upstream bug (vercel/ai #9921, #12477, #7412)max_tokensnot capped to available context: Always requested 32,000 output tokens even when only 31,999 were availableChanges
Fix 1: Lower safety margin from 85% to 75%
Matches OpenCode upstream; provides larger buffer for inaccurate/missing token counts (Gemini CLI uses 50%, Claude Code uses ~83.5%)
Fix 2: Real BPE tokenization via
gpt-tokenizerToken.countTokens()useso200k_baseencoding for accurate token counting on OpenAI-compatible models, with character-based heuristic fallback (~4 chars/token) for models with unavailable tokenizersFix 3: Cap
maxOutputTokensEnsure
estimatedInput + maxOutputnever exceeds model context limitFix 4: Raw SSE usage recovery (AI SDK bypass) ← NEW
SSEUsageExtractorintercepts raw HTTP SSE streaming responses at thefetch()level and parses usage tokens directly from SSEdata:chunks. When the AI SDK'sfinish-stepevent reports zero tokens, the processor automatically recovers usage from the raw SSE data.Enhanced diagnostics
Verbose mode logs raw usage vs parsed usage to detect AI SDK propagation failures; zero-token warning now fires regardless of finishReason
SDK Research: What Other CLIs Use
aiv6)streamTextusage — same bug exposure@google/genaiv1.30usageMetadatafrom SDK@anthropic-ai/sdklitellmv1.82completion.usagevia litellmThe Vercel AI SDK streaming usage bug is a known open issue (vercel/ai #9921, #12477). Our SSE usage recovery provides the reliability of Codex's approach while keeping the AI SDK for its provider abstraction.
Future work: Switchable SDK backends
Full backend alternatives (direct fetch, provider-specific SDKs, LiteLLM) require significant architectural work and are tracked as a separate feature request.
Files Changed
js/src/util/sse-usage-extractor.tsjs/src/provider/provider.tsjs/src/session/processor.tsjs/src/util/token.tscountTokens()with real BPE via gpt-tokenizerjs/src/session/compaction.tsjs/src/session/prompt.tsToken.countTokens()for overflow, cap output tokensjs/src/cli/defaults.tsrust/src/cli.rsjs/package.jsongpt-tokenizerdependencyjs/tests/sse-usage-extractor.test.tsjs/tests/token.test.tsjs/tests/compaction-model.test.tsdocs/case-studies/issue-249/Test plan
🤖 Generated with Claude Code