fix: lower compaction threshold to 75% and add token estimation fallback by konard · Pull Request #250 · link-assistant/agent

konard · 2026-04-11T12:21:29Z

Summary

Fixes #249 — Context overflow error when AI SDK drops token usage from raw provider response, causing compaction to never trigger.

Root Cause (verified from raw HTTP logs)

The provider DOES return valid usage data, but the AI SDK fails to propagate it:

Raw HTTP response contains valid tokens: OpenRouter SSE stream includes "prompt_tokens": 15506, "completion_tokens": 80 in the final chunk
AI SDK @ai-sdk/openai-compatible drops usage: Despite receiving valid usage, the AI SDK's finish-step event reports 0 for all token counts. This is a known upstream bug (vercel/ai #9921, #12477, #7412)
max_tokens not capped to available context: Always requested 32,000 output tokens even when only 31,999 were available
Safety margin too narrow: 85% threshold (15% margin) assumed reliable token counts

Changes

Fix 1: Lower safety margin from 85% to 75%

Matches OpenCode upstream; provides larger buffer for inaccurate/missing token counts (Gemini CLI uses 50%, Claude Code uses ~83.5%)

Fix 2: Real BPE tokenization via `gpt-tokenizer`

Token.countTokens() uses o200k_base encoding for accurate token counting on OpenAI-compatible models, with character-based heuristic fallback (~4 chars/token) for models with unavailable tokenizers

Fix 3: Cap `maxOutputTokens`

Ensure estimatedInput + maxOutput never exceeds model context limit

Fix 4: Raw SSE usage recovery (AI SDK bypass) ← NEW

SSEUsageExtractor intercepts raw HTTP SSE streaming responses at the fetch() level and parses usage tokens directly from SSE data: chunks. When the AI SDK's finish-step event reports zero tokens, the processor automatically recovers usage from the raw SSE data.

Works in both verbose and non-verbose modes
Supports OpenAI, Anthropic, Groq, and generic SSE usage formats
Modeled after OpenAI Codex's approach (raw HTTP + manual SSE parsing)
Provides Codex-like reliability while keeping the AI SDK abstraction

Enhanced diagnostics

Verbose mode logs raw usage vs parsed usage to detect AI SDK propagation failures; zero-token warning now fires regardless of finishReason

SDK Research: What Other CLIs Use

Tool	SDK	Usage Source
OpenCode	Vercel AI SDK (`ai` v6)	`streamText` usage — same bug exposure
Codex	Custom Rust (reqwest + SSE)	Hand-parsed SSE events — most reliable
Gemini CLI	`@google/genai` v1.30	`usageMetadata` from SDK
Claude Code	`@anthropic-ai/sdk`	Messages API usage field
Aider	`litellm` v1.82	`completion.usage` via litellm

The Vercel AI SDK streaming usage bug is a known open issue (vercel/ai #9921, #12477). Our SSE usage recovery provides the reliability of Codex's approach while keeping the AI SDK for its provider abstraction.

Future work: Switchable SDK backends

Full backend alternatives (direct fetch, provider-specific SDKs, LiteLLM) require significant architectural work and are tracked as a separate feature request.

Files Changed

File	Change
`js/src/util/sse-usage-extractor.ts`	NEW: Raw SSE stream usage parser (AI SDK bypass)
`js/src/provider/provider.ts`	SSE stream interception for usage extraction
`js/src/session/processor.ts`	SSE usage recovery + enhanced zero-token diagnostics
`js/src/util/token.ts`	`countTokens()` with real BPE via gpt-tokenizer
`js/src/session/compaction.ts`	Lower margin 0.85→0.75, estimation fallback
`js/src/session/prompt.ts`	`Token.countTokens()` for overflow, cap output tokens
`js/src/cli/defaults.ts`	Update default margin 15%→25%
`rust/src/cli.rs`	Sync Rust default 15%→25%
`js/package.json`	Add `gpt-tokenizer` dependency
`js/tests/sse-usage-extractor.test.ts`	17 tests for SSE usage extraction
`js/tests/token.test.ts`	11 tests for Token.estimate and Token.countTokens
`js/tests/compaction-model.test.ts`	Update expectations, add estimation tests
`docs/case-studies/issue-249/`	Case study with deep root cause, SDK research

Test plan

All 54 JS tests pass (26 compaction + 11 token + 17 SSE usage extractor)
All Rust tests pass
Prettier formatting verified
CI passes

🤖 Generated with Claude Code

Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #249

… overflow The previous 15% safety margin (85% threshold) was insufficient because: - Providers may return 0 token counts, making overflow detection unreliable - Token estimation has inherent inaccuracy - OpenCode upstream uses 75%, Gemini CLI uses 50% Increase margin to 25% (75% threshold) across JS and Rust implementations. Fixes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When providers (e.g., Nvidia/nemotron via OpenCode) return 0 for all token counts in usage data, the overflow check always reports no overflow, preventing compaction from ever triggering. This causes context to grow unbounded until hitting the model's hard limit. Add an estimatedInputTokens parameter to isOverflow() that estimates tokens from message content using the 4-chars-per-token heuristic. When provider tokens are all zeros, this estimate is used instead. Fixes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When estimated input tokens approach the model's context limit, cap maxOutputTokens to ensure input + output never exceeds the context limit. This prevents "context length exceeded" errors like the one observed where 230145 input + 32000 output = 262145 exceeded the nemotron-3-super-free limit of 262144. The cap uses the same token estimation used for the overflow check. A minimum of 1024 output tokens is preserved to avoid degenerate cases. Fixes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…n tests Update existing test expectations from 85%/15% to 75%/25%. Add new tests for the estimatedInputTokens fallback: - Uses estimate when provider returns 0 tokens → triggers overflow - Ignores estimate when provider returns non-zero tokens - No overflow with 0 tokens and no estimate (backward compat) Fixes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Document root causes (0 token counts from provider, uncapped max_tokens, narrow safety margin), research across industry tools (Gemini CLI, OpenCode, Claude Code, Codex CLI), and the three fixes applied. Includes the full solution draft log from the failed session for reference. Fixes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ault Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

konard · 2026-04-11T12:50:31Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $12.940359
Calculated by Anthropic: $13.000359
Difference: $0.060000 (+0.46%)

📊 Context and tokens usage:

Claude Opus 4.6:

Context window: 156.2K / 1M (16%) input tokens, 39.3K / 128K (31%) output tokens

Total: (156.4K + 21.1M cached) input tokens, 39.3K output tokens, $12.492750 cost

Claude Haiku 4.5:

Total: (223.9K + 1.0M cached) input tokens, 14.4K / 64K (23%) output tokens, $0.447609 cost

🤖 Models used:

Tool: Anthropic Claude Code
Requested: opus
Main model: Claude Opus 4.6 (claude-opus-4-6)
Additional models:
- Claude Haiku 4.5 (claude-haiku-4-5-20251001)

📎 Log file uploaded as Gist (5604KB)

View complete solution draft log

Now working session is ended, feel free to review and add any feedback on the solution draft.

konard · 2026-04-11T12:52:51Z

✅ Ready to merge

This pull request is now ready to be merged:

All CI checks have passed
No merge conflicts
No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

This reverts commit 94b9f68.

konard · 2026-04-11T19:54:10Z

Are we sure provider does not return usage tokens? And also can be use real byte pair encoding, or real encoder per model to have precise results? Like there are different encoders for ChatGPT.

We need to ensure all changes are correct, consistent, validated, tested, logged and fully meet each and all discussed requirements (check issue description and all comments in issue and in pull request). Ensure all CI/CD checks pass.

konard · 2026-04-11T20:01:34Z

🤖 AI Work Session Started

Starting automated work session at 2026-04-11T20:01:31.353Z

The PR has been converted to draft mode while work is in progress.

This comment marks the beginning of an AI work session. Please wait for the session to finish, and provide your feedback.

…counting Add Token.countTokens() that uses gpt-tokenizer (o200k_base encoding) for real BPE tokenization when available, falling back to the character-based heuristic (~4 chars/token) for models with unavailable tokenizers. This addresses the reviewer feedback asking whether we can use real byte pair encoding instead of the heuristic. Research found that: - gpt-tokenizer works for OpenAI-compatible models (GPT-4o, GPT-4.1, GPT-5) - Nvidia Nemotron, Google Gemini, Meta Llama use custom SentencePiece tokenizers with no JS libraries available - The 75% safety margin absorbs the ±20% estimation error for all models Changes: - js/src/util/token.ts: Add countTokens() with lazy-loaded BPE encoder - js/src/session/prompt.ts: Use Token.countTokens() for overflow detection - js/tests/token.test.ts: Tests for both estimate() and countTokens() - js/package.json: Add gpt-tokenizer dependency - docs/case-studies/issue-249/: Update analysis, add original failure log Closes #249 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

konard · 2026-04-11T20:14:06Z

Addressing feedback

1. "Are we sure provider does not return usage tokens?"

Yes, confirmed from the original failure log. Every single step-finish event across the entire session showed all zeros:

"tokens": { "input": 0, "output": 0, "reasoning": 0, "cache": { "read": 0, "write": 0 } }

This occurred for all 10+ overflow checks. The nvidia/nemotron-3-super-free model via OpenCode consistently returned 0 for all usage fields. The original failure log is now preserved at docs/case-studies/issue-249/original-failure-log.txt for reference (lines 3203-3210, 4553, 6010, etc.).

2. "Can we use real byte pair encoding, or real encoder per model?"

Added gpt-tokenizer with o200k_base encoding — this provides real BPE tokenization for OpenAI-compatible models (GPT-4o, GPT-4.1, GPT-5). The new Token.countTokens() function uses it when available and falls back to the character heuristic.

However, real BPE cannot cover all models because different providers use incompatible tokenizers:

Nvidia Nemotron uses a custom SentencePiece BPE (256K vocab) — no JS library exists
Google Gemini uses SentencePiece — no JS library exists
Meta Llama uses SentencePiece — no JS library exists
Anthropic Claude uses a proprietary BPE — not publicly available

For these models, using the wrong tokenizer (e.g., o200k_base for Nemotron) would be approximately as accurate as the character heuristic (~4 chars/token), since all major LLM tokenizers average 3.5–4.5 chars/token for English text. The 75% safety margin (25% buffer) absorbs this ±20% estimation error.

Changes in this commit

js/src/util/token.ts: New Token.countTokens() with lazy-loaded gpt-tokenizer
js/src/session/prompt.ts: Uses Token.countTokens() for overflow detection
js/tests/token.test.ts: 11 new tests covering both estimate and countTokens
js/package.json: Added gpt-tokenizer dependency
Case study updated with tokenizer research findings

konard · 2026-04-11T20:15:53Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $4.372591
Calculated by Anthropic: $4.452591
Difference: $0.080000 (+1.83%)

📊 Context and tokens usage:

Claude Opus 4.6:

Context window: 92.1K / 1M (9%) input tokens, 23.4K / 128K (18%) output tokens

Total: (113.1K + 5.5M cached) input tokens, 23.4K output tokens, $4.034820 cost

Claude Haiku 4.5:

Total: (176.0K + 817.5K cached) input tokens, 8.5K / 64K (13%) output tokens, $0.337771 cost

🤖 Models used:

Tool: Anthropic Claude Code
Requested: opus
Main model: Claude Opus 4.6 (claude-opus-4-6)
Additional models:
- Claude Haiku 4.5 (claude-haiku-4-5-20251001)

📎 Log file uploaded as Gist (3031KB)

View complete solution draft log

Now working session is ended, feel free to review and add any feedback on the solution draft.

konard · 2026-04-11T20:18:15Z

✅ Ready to merge

This pull request is now ready to be merged:

All CI checks have passed
No merge conflicts
No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

konard · 2026-04-11T20:19:07Z

But do we also have raw requests and responses logged? Not only step-finish events? May be there is a deeper bug we have, and just don't correctly get data from provider's response? Double check source code of OpenCode and other agentic CLIs.

If we don't have enough raw requests and responses logged, that also should be fixed.

We need to ensure all changes are correct, consistent, validated, tested, logged and fully meet each and all discussed requirements (check issue description and all comments in issue and in pull request). Ensure all CI/CD checks pass.

konard · 2026-04-11T20:19:58Z

🤖 AI Work Session Started

Starting automated work session at 2026-04-11T20:19:56.893Z

The PR has been converted to draft mode while work is in progress.

This comment marks the beginning of an AI work session. Please wait for the session to finish, and provide your feedback.

Investigation of the original failure log revealed that the raw HTTP response from OpenRouter DOES contain valid usage data (prompt_tokens: 15506, completion_tokens: 80), but the AI SDK's @ai-sdk/openai-compatible v1.0.33 fails to propagate these values to the finish-step event, resulting in 0 for all token counts. - Add verbose step-finish raw usage diagnostics in processor.ts - Enhance zero-token warning to fire regardless of finishReason - Update case study with deeper root cause analysis and upstream issue recommendations Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

konard · 2026-04-11T20:43:48Z

Addressing feedback: raw requests/responses and deeper bug investigation

1. "Do we also have raw requests and responses logged?"

Yes. The codebase has verbose-fetch.ts which captures full HTTP request/response data when --verbose is enabled. This includes:

Request: method, URL, sanitized headers, body preview
Response: status, headers, duration, full body preview
Streaming responses: Full SSE stream captured via response.body.tee()

This logging proved essential — it captured the raw evidence showing the provider returns valid tokens.

2. "May be there is a deeper bug we have?"

Yes, confirmed. Deep investigation of the original failure log revealed:

The raw HTTP response from OpenRouter DOES contain valid usage data:

"usage": {"prompt_tokens": 15506, "completion_tokens": 80, "total_tokens": 15586}

But the AI SDK's step-finish event reports all zeros:

"tokens": {"input": 0, "output": 0, "reasoning": 0}

The request correctly includes "stream_options": {"include_usage": true}, and the provider's SSE response correctly includes usage in the final chunk. The bug is in the AI SDK's @ai-sdk/openai-compatible v1.0.33 — it receives the streaming usage data but fails to propagate it to the finish-step event consumed by our application.

3. "Double check source code of OpenCode and other agentic CLIs"

Checked. OpenCode (sst/opencode) uses essentially the same pattern:

Same AI SDK (ai package) with streamText()
Same Session.getUsage() pattern to extract usage from value.usage
No custom raw HTTP response parsing — relies entirely on AI SDK

OpenCode would have the same bug when using @ai-sdk/openai-compatible with providers that return streaming usage. The difference is that OpenCode's upstream uses a 75% safety margin (which we've now adopted), providing more buffer.

Changes in this commit

processor.ts: Added verbose step-finish raw usage diagnostics that log both the AI SDK's parsed usage AND the raw value.usage from the finish-step event. Enhanced zero-token warning to fire regardless of finishReason.
Case study: Updated with the deeper root cause analysis, raw HTTP evidence, timeline, and upstream issue recommendations for vercel/ai.

All tests pass

37/37 JS tests (compaction + token)
18/18 Rust tests
Prettier formatting verified

konard · 2026-04-11T20:44:28Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost: $5.844440

📊 Context and tokens usage:

Claude Opus 4.6:

Context window: 125.9K / 1M (13%) input tokens, 23.5K / 128K (18%) output tokens

Total: (116.8K + 8.3M cached) input tokens, 23.5K output tokens, $5.454436 cost

Claude Haiku 4.5:

Total: 356.6K input tokens, 6.7K / 64K (10%) output tokens, $0.390004 cost

🤖 Models used:

Tool: Anthropic Claude Code
Requested: opus
Main model: Claude Opus 4.6 (claude-opus-4-6)
Additional models:
- Claude Haiku 4.5 (claude-haiku-4-5-20251001)

📎 Log file uploaded as Gist (2792KB)

View complete solution draft log

Now working session is ended, feel free to review and add any feedback on the solution draft.

konard · 2026-04-11T20:46:51Z

✅ Ready to merge

This pull request is now ready to be merged:

All CI checks have passed
No merge conflicts
No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

konard · 2026-04-12T09:33:03Z

AI SDK @ai-sdk/openai-compatible v1.0.33 drops usage: Despite receiving valid usage in the raw stream, the AI SDK's finish-step event reports 0 for all token counts. This is an upstream bug in the AI SDK's streaming usage propagation.

Can we fix that by using other popular AI API SDK, like Claude Code SDK? Or can we fallback to regular fetch/axios to starting making our own AI sdk? What source code of OpenCode, Codex, Gemini, Qwen are using at the moment? Can we have implemented all the ways so we can switch between them using CLI option and envs? And by default we should select the most reliable way to get all the data from each request and response, while also logging full http request and response with all statuses, headers, and so on, not just body.

Or may be latest version AI SDK will work?

We need to guarantee we actually get real usage data, and actually all the data we can from responses to propagate these for users and use in our Agent CLI itself.

We need to ensure all changes are correct, consistent, validated, tested, logged and fully meet each and all discussed requirements (check issue description and all comments in issue and in pull request
). Ensure all CI/CD checks pass.

konard · 2026-04-12T09:33:48Z

🤖 AI Work Session Started

Starting automated work session at 2026-04-12T09:33:46.835Z

The PR has been converted to draft mode while work is in progress.

This comment marks the beginning of an AI work session. Please wait for the session to finish, and provide your feedback.

When the AI SDK drops token usage data from its finish-step event (a known upstream bug in @ai-sdk/openai-compatible), the new SSE usage extractor intercepts the raw HTTP streaming response and parses usage tokens directly from SSE data chunks. This works for all providers (OpenAI, Anthropic, OpenRouter, Groq formats) and operates in both verbose and non-verbose modes. The processor now automatically recovers usage when AI SDK returns zeros, logging the recovery for diagnostics. This eliminates the root cause of issue #249 where compaction never triggered because token counts were always 0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Added comparison of SDK choices across agentic CLIs (OpenCode, Codex, Gemini CLI, Claude Code, Aider, Qwen Agent), documented the SSE usage recovery mechanism, and outlined future work for switchable SDK backends. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

konard · 2026-04-12T09:47:48Z

Addressing feedback: SDK alternatives and guaranteed usage data

1. "Can we fix that by using other popular AI API SDK, like Claude Code SDK?"

Researched. Each major agentic CLI uses a different approach:

Tool	SDK	Approach
Codex	Custom Rust (reqwest + SSE)	Raw HTTP — most reliable, but only supports OpenAI API
Gemini CLI	`@google/genai`	Provider-specific SDK — only works with Google models
Claude Code	`@anthropic-ai/sdk`	Provider-specific SDK — only works with Anthropic
Aider	`litellm` (Python)	Universal gateway — would need a Python bridge
OpenCode	Vercel AI SDK (same as us)	Same bug exposure

None of these are drop-in replacements because our codebase supports 11+ providers via the AI SDK unified interface.

2. "Can we fallback to regular fetch/axios?"

Yes — implemented. The new SSEUsageExtractor does exactly this. It intercepts the raw HTTP response at the fetch() level (before the AI SDK processes it) and parses usage tokens directly from SSE data: chunks. This provides Codex-like reliability without abandoning the AI SDK.

The extractor works in both verbose and non-verbose modes — it tee()s the stream, reads one copy for usage extraction, and gives the other copy to the AI SDK.

3. "Or may be latest version AI SDK will work?"

Unlikely. The streaming usage bug is a known open issue in the Vercel AI SDK:

vercel/ai #9921 — Token usage normalization across providers (Open, Oct 2025)
vercel/ai #12477 — RetryError crash on missing usage (Open, Feb 2026)
vercel/ai #7412 — Zero token usage in OTEL traces (Jul 2025)

4. "We need to guarantee we actually get real usage data"

Now guaranteed through defense-in-depth:

Layer 1 — AI SDK (primary): Standard finish-step event usage
Layer 2 — SSE recovery (NEW): Raw HTTP stream parsing when AI SDK returns zeros
Layer 3 — Token estimation (fallback): BPE tokenization / character heuristic when both above fail
Layer 4 — Safety margin (last resort): 75% threshold provides 25% buffer

5. Full switchable SDK backends

Full switchable backends (direct fetch, provider-specific SDKs, LiteLLM bridge) would require significant architectural work. This is documented as a future feature request in the case study.

Changes in this push

js/src/util/sse-usage-extractor.ts — New raw SSE usage parser
js/src/provider/provider.ts — SSE stream interception in fetch chain
js/src/session/processor.ts — Automatic SSE usage recovery when AI SDK returns zeros
js/tests/sse-usage-extractor.test.ts — 17 tests
Case study updated with SDK research findings

All 54 JS tests + Rust tests pass.

konard · 2026-04-12T09:49:27Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $5.672917
Calculated by Anthropic: $5.752917
Difference: $0.080000 (+1.41%)

📊 Context and tokens usage:

Claude Opus 4.6:

Context window: 119.4K / 1M (12%) input tokens, 34.2K / 128K (27%) output tokens

Total: (169.8K + 6.8M cached) input tokens, 34.2K output tokens, $5.307618 cost

Claude Haiku 4.5:

Total: (200.0K + 701.9K cached) input tokens, 10.6K / 64K (17%) output tokens, $0.365299 cost

🤖 Models used:

Tool: Anthropic Claude Code
Requested: opus
Main model: Claude Opus 4.6 (claude-opus-4-6)
Additional models:
- Claude Haiku 4.5 (claude-haiku-4-5-20251001)

📎 Log file uploaded as Gist (3725KB)

View complete solution draft log

Now working session is ended, feel free to review and add any feedback on the solution draft.

konard · 2026-04-12T09:51:50Z

✅ Ready to merge

This pull request is now ready to be merged:

All CI checks have passed
No merge conflicts
No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

Initial commit with task details

94b9f68

Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #249

konard self-assigned this Apr 11, 2026

konard and others added 5 commits April 11, 2026 12:29

konard changed the title ~~[WIP] Double check best practices on the threshold of compaction in OpenCode, Codex, Gemini~~ fix: lower compaction threshold to 75% and add token estimation fallback Apr 11, 2026

konard and others added 3 commits April 11, 2026 12:41

chore: add changeset for compaction threshold fix

df87e66

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

test(rust): update compaction safety margin test to match new 25% def…

741fff9

…ault Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

test(rust): update cli_options test for new 25% compaction safety margin

490713e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

konard marked this pull request as ready for review April 11, 2026 12:45

style: fix prettier formatting in compaction.ts and prompt.ts

5655843

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Revert "Initial commit with task details"

2355071

This reverts commit 94b9f68.

konard marked this pull request as draft April 11, 2026 20:01

konard marked this pull request as ready for review April 11, 2026 20:15

konard marked this pull request as draft April 11, 2026 20:19

konard marked this pull request as ready for review April 11, 2026 20:43

konard marked this pull request as draft April 12, 2026 09:33

konard and others added 2 commits April 12, 2026 09:45

konard marked this pull request as ready for review April 12, 2026 09:47

konard merged commit 996496b into main Apr 12, 2026
16 checks passed

Conversation

konard commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause (verified from raw HTTP logs)

Changes

Fix 1: Lower safety margin from 85% to 75%

Fix 2: Real BPE tokenization via gpt-tokenizer

Fix 3: Cap maxOutputTokens

Fix 4: Raw SSE usage recovery (AI SDK bypass) ← NEW

Enhanced diagnostics

SDK Research: What Other CLIs Use

Future work: Switchable SDK backends

Files Changed

Test plan

Uh oh!

konard commented Apr 11, 2026

🤖 Solution Draft Log

💰 Cost estimation:

📊 Context and tokens usage:

🤖 Models used:

📎 Log file uploaded as Gist (5604KB)

Uh oh!

konard commented Apr 11, 2026

✅ Ready to merge

Uh oh!

konard commented Apr 11, 2026

Uh oh!

konard commented Apr 11, 2026

Uh oh!

konard commented Apr 11, 2026

Addressing feedback

1. "Are we sure provider does not return usage tokens?"

2. "Can we use real byte pair encoding, or real encoder per model?"

Changes in this commit

Uh oh!

konard commented Apr 11, 2026

🤖 Solution Draft Log

💰 Cost estimation:

📊 Context and tokens usage:

🤖 Models used:

📎 Log file uploaded as Gist (3031KB)

Uh oh!

konard commented Apr 11, 2026

✅ Ready to merge

Uh oh!

konard commented Apr 11, 2026

Uh oh!

konard commented Apr 11, 2026

Uh oh!

konard commented Apr 11, 2026

Addressing feedback: raw requests/responses and deeper bug investigation

1. "Do we also have raw requests and responses logged?"

2. "May be there is a deeper bug we have?"

3. "Double check source code of OpenCode and other agentic CLIs"

Changes in this commit

All tests pass

Uh oh!

konard commented Apr 11, 2026

🤖 Solution Draft Log

💰 Cost: $5.844440

📊 Context and tokens usage:

🤖 Models used:

📎 Log file uploaded as Gist (2792KB)

Uh oh!

konard commented Apr 11, 2026

✅ Ready to merge

Uh oh!

konard commented Apr 12, 2026

Uh oh!

konard commented Apr 12, 2026

Uh oh!

konard commented Apr 12, 2026

Addressing feedback: SDK alternatives and guaranteed usage data

1. "Can we fix that by using other popular AI API SDK, like Claude Code SDK?"

2. "Can we fallback to regular fetch/axios?"

3. "Or may be latest version AI SDK will work?"

4. "We need to guarantee we actually get real usage data"

5. Full switchable SDK backends

Changes in this push

konard commented Apr 11, 2026 •

edited

Loading

Fix 2: Real BPE tokenization via `gpt-tokenizer`

Fix 3: Cap `maxOutputTokens`