Skip to content

fix(chat): reset web session after rate limits#2412

Open
aqilaziz wants to merge 1 commit into
tinyhumansai:mainfrom
aqilaziz:codex/2364-rate-limit-session-reset
Open

fix(chat): reset web session after rate limits#2412
aqilaziz wants to merge 1 commit into
tinyhumansai:mainfrom
aqilaziz:codex/2364-rate-limit-session-reset

Conversation

@aqilaziz
Copy link
Copy Markdown
Contributor

@aqilaziz aqilaziz commented May 21, 2026

Summary - Stops caching web-chat agent sessions after transient chat errors, including rate_limited, timeout, network, and provider_error. - Rebuilds the same thread's session on retry so stale failed turn state cannot keep the thread stuck. - Classifies provider 429 balance failures as budget_exhausted instead of transient rate-limit copy. - Adds focused Rust regression coverage for rate-limit chat errors, budget-vs-rate-limit classification, network classification, and transient session-reset policy. ## Problem A thread can keep showing rate-limit copy while a new thread works, which points to transient request state being retained in the per-thread web-channel session cache. The existing classifier also treated any 429 as a transient rate limit before checking budget-like provider messages. ## Solution - Skip storing web-channel sessions after transient error classes (rate_limited, timeout, network, provider_error). - Keep non-transient classes such as auth, model, context, and budget errors out of that transient reset path. - Check budget exhaustion markers before the generic 429/rate-limit classifier. - Map transport/connectivity failures to network so the transient cache-clear policy is reachable. ## Submission Checklist - [x] Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy - [x] Diff coverage >= 80% - focused Rust tests added for changed behavior; CI coverage gate will enforce merged diff coverage. - [x] Coverage matrix updated - N/A: behavior-only web-channel error/session handling, no feature row added/removed/renamed - [x] All affected feature IDs from the matrix are listed in the PR description under ## Related - N/A: no matrix feature ID changed - [x] No new external network dependencies introduced (mock backend used per Testing Strategy) - [x] Manual smoke checklist updated if this touches release-cut surfaces - N/A: no release-cut smoke surface changed - [x] Linked issue closed via Closes #NNN in the ## Related section ## Impact - Desktop chat retry behavior changes only after transient web-channel failures. - No migration or new dependency. - True budget/auth/model/context failures keep their existing non-transient classification. ## Related - Closes #2364 - Follow-up PR(s)/TODOs: N/A --- ## AI Authored PR Metadata (required for Codex/Linear PRs) ### Linear Issue - Key: N/A - URL: N/A ### Commit & Branch - Branch: codex/2364-rate-limit-session-reset - Commit SHA: fee189461b77cecf4cd5deca63f9b98e8f903f58 ### Validation Run - [x] cargo fmt -- --check - [x] git diff --check - [x] Focused tests attempted locally: cargo test --lib transient_chat_errors_skip_cached_web_session_storage -- --nocapture (blocked by missing Windows libclang; see below) - [x] Rust fmt/check (if changed): cargo fmt -- --check - [x] Tauri fmt/check (if changed): N/A ### Validation Blocked - command: cargo test --lib transient_chat_errors_skip_cached_web_session_storage -- --nocapture - error: Unable to find libclang: could not find clang.dll/libclang.dll; set LIBCLANG_PATH - impact: Local Windows Rust test run blocked before executing tests; CI Linux Rust jobs should validate the added tests. ### Behavior Changes - Intended behavior change: retrying the same thread after a transient rate-limit/provider/network failure starts from a fresh web-channel session instead of reusing a possibly poisoned cached agent. - User-visible effect: users can recover in the same thread after transient throttling clears. ### Parity Contract - Legacy behavior preserved: non-transient auth, budget, model, context, and generic inference errors are not treated as transient session poison. - Guard/fallback/dispatch parity checks: event path still emits the same chat_error event shape and message fields. ### Duplicate / Superseded PR Handling - Duplicate PR(s): N/A - Canonical PR: this PR - Resolution (closed/superseded/updated): N/A ## Summary by CodeRabbit * Bug Fixes * Better separates payment/budget failures from rate limits and adds explicit network connectivity classification; budget messages now suggest topping up while network errors show a sanitized connectivity message. * Clears cached conversation sessions for transient failures (rate limits, timeouts, network, provider errors) to avoid sticky sessions. * Tests * Added tests for error classification, sanitized messages, session-cache invalidation, and transient-error behavior. Review Change Stack

@aqilaziz aqilaziz requested a review from a team May 21, 2026 05:19
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Detects budget/credit exhaustion earlier, adds a network transport-failure classification, prevents transient errors from populating per-thread cached sessions, integrates the cache-skip into the chat handler, and adds tests plus a process-global test lock for forced-error tests.

Changes

Prevent Rate-Limit State from Sticking to Chat Threads

Layer / File(s) Summary
Error classification updates
src/openhuman/channels/providers/web.rs
classify_inference_error moves budget/credit-exhaustion matching earlier, removes the prior duplicate budget arm, and adds a network arm that recognizes transport/DNS/connectivity failures and enriches messages with provider details.
Session cache invalidation helpers
src/openhuman/channels/providers/web.rs
Adds predicates/helpers to classify transient error types (rate_limited, timeout, network, provider_error) and to derive an optional session-reset error type from run_chat_task results.
Chat failure handler integration
src/openhuman/channels/providers/web.rs
After run_chat_task completes, the start_chat handler skips inserting the rebuilt SessionEntry into THREAD_SESSIONS when the error is transient (logs the discard); otherwise it inserts the session as before.
Tests and test lock
src/openhuman/channels/providers/web_tests.rs
Introduces a process-global FORCED_RUN_CHAT_TASK_ERROR_LOCK to serialize forced-error setup/cleanup, updates transport-failure test to assert network + sanitized message, adds a 429 rate_limited chat test, and adds unit tests covering classify_inference_error, should_clear_cached_session_after_error, and session_reset_error_type.

Sequence Diagram(s)

sequenceDiagram
  participant start_chat_handler
  participant run_chat_task
  participant classify_inference_error
  participant cache_logic
  participant THREAD_SESSIONS
  participant WebChannelEvent
  start_chat_handler->>run_chat_task: spawn run_chat_task
  run_chat_task->>start_chat_handler: returns Error
  start_chat_handler->>classify_inference_error: classify(error)
  classify_inference_error-->>start_chat_handler: classified_type
  start_chat_handler->>cache_logic: should_skip_insert?(classified_type)
  cache_logic-->>start_chat_handler: skip / insert decision
  alt skip
    start_chat_handler->>THREAD_SESSIONS: do not insert (log discard)
  else insert
    start_chat_handler->>THREAD_SESSIONS: insert rebuilt SessionEntry
  end
  start_chat_handler->>WebChannelEvent: emit chat_error
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • tinyhumansai/openhuman#2239: Both PRs modify src/openhuman/channels/providers/web.rs's error classification to add new actionable mapping arms.

Suggested labels

agent, bug

Suggested reviewers

  • senamakel
  • graycyrus

Poem

🐇 I nibbled logs and chased a thread,
Cleared a stuck rate-limit from where it spread,
Network hops and budget flags I sort,
Sessions refreshed — the chat resumes its part,
Hooray, a bunny-fixed restart!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(chat): reset web session after rate limits' accurately captures the main change: clearing cached web-chat sessions after transient errors including rate limits.
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 21, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/openhuman/channels/providers/web_tests.rs (1)

124-163: 💤 Low value

Test name implies session-cache verification, but only checks the emitted event.

The test name start_chat_emits_rate_limit_error_without_sticky_session_state suggests it validates that session state isn't sticky after a rate-limit error. However, the test only verifies that the chat_error event has error_type == "rate_limited" and the correct user-facing message — it doesn't actually confirm that THREAD_SESSIONS was cleared.

Consider either:

  1. Renaming to start_chat_emits_rate_limit_chat_error_event to reflect what it actually tests, or
  2. Adding an assertion that the session cache no longer contains an entry for "rate-limit-client::rate-limit-thread" after the error.

The transient_errors_clear_cached_web_session test at line 198 does cover the predicate, but an integration-level assertion here would strengthen the regression coverage for issue #2364.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/channels/providers/web_tests.rs` around lines 124 - 163, The
test start_chat_emits_rate_limit_error_without_sticky_session_state currently
only asserts the emitted chat_error event but not whether the session cache was
cleared; either rename the test to start_chat_emits_rate_limit_chat_error_event
to match behavior, or (preferred) add an assertion after receiving recv that
THREAD_SESSIONS does not contain the key format
"rate-limit-client::rate-limit-thread" (use the same cache access utility used
elsewhere in tests, e.g., checking THREAD_SESSIONS or the helper used in
transient_errors_clear_cached_web_session) to ensure the sticky session was
removed after the rate-limit error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 328-333: The match arm "network" in
should_clear_cached_session_after_error is unreachable because
classify_inference_error never returns "network"; either remove "network" from
should_clear_cached_session_after_error to avoid dead code, or add a
corresponding branch in classify_inference_error that maps
DNS/connect/transport-level failures to "network" (e.g., detect IO/connection
errors and return "network") so the variant becomes reachable; update the
function should_clear_cached_session_after_error and/or classify_inference_error
accordingly, referencing should_clear_cached_session_after_error and
classify_inference_error to locate the changes.

---

Nitpick comments:
In `@src/openhuman/channels/providers/web_tests.rs`:
- Around line 124-163: The test
start_chat_emits_rate_limit_error_without_sticky_session_state currently only
asserts the emitted chat_error event but not whether the session cache was
cleared; either rename the test to start_chat_emits_rate_limit_chat_error_event
to match behavior, or (preferred) add an assertion after receiving recv that
THREAD_SESSIONS does not contain the key format
"rate-limit-client::rate-limit-thread" (use the same cache access utility used
elsewhere in tests, e.g., checking THREAD_SESSIONS or the helper used in
transient_errors_clear_cached_web_session) to ensure the sticky session was
removed after the rate-limit error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 99b4deb2-c94f-466a-93ef-79965915a715

📥 Commits

Reviewing files that changed from the base of the PR and between 6281aea and 71a7f72.

📒 Files selected for processing (2)
  • src/openhuman/channels/providers/web.rs
  • src/openhuman/channels/providers/web_tests.rs

Comment thread src/openhuman/channels/providers/web.rs
@aqilaziz aqilaziz force-pushed the codex/2364-rate-limit-session-reset branch from 71a7f72 to 4dfd2b9 Compare May 21, 2026 05:30
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 21, 2026
@aqilaziz aqilaziz force-pushed the codex/2364-rate-limit-session-reset branch from 4dfd2b9 to 0ce8558 Compare May 21, 2026 05:49
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
src/openhuman/channels/providers/web.rs (1)

344-368: 💤 Low value

Consider using debug level for the session-clearing log.

The coding guidelines specify using debug or trace level for state transitions. This info-level log will appear in production output on every transient error recovery, which may be noisier than intended for routine retry scenarios.

♻️ Suggested change
     if removed {
-        log::info!(
+        log::debug!(
             "[web-channel] cleared cached session after transient error map_key={} error_type={}",
             map_key,
             error_type
         );
     }

As per coding guidelines: "Use log / tracing at debug or trace level on RPC entry and exit, error paths, state transitions."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/channels/providers/web.rs` around lines 344 - 368, Change the
log level in clear_cached_session_after_error to debug (or trace) so clearing
cached sessions on transient errors doesn't spam production logs: in the
clear_cached_session_after_error function (which uses
should_clear_cached_session_after_error and THREAD_SESSIONS) replace the
log::info! call with log::debug! (or log::trace!) and keep the same message and
variables (map_key, error_type) to preserve context.
src/openhuman/channels/providers/web_tests.rs (1)

125-170: ⚖️ Poor tradeoff

Session-clearing assertion passes vacuously — no session was created.

The forced error triggers inside run_chat_task before any session is cached (the #[cfg(test)] block returns Err at line 724 of web.rs, before the session insert at lines 909-917). The assertion at lines 166-168 verifies the key doesn't exist, which is true because it was never inserted — not because it was cleared.

The unit test transient_errors_clear_cached_web_session (lines 214-234) correctly tests the predicate, but this integration test doesn't verify the actual removal behavior. Consider adding a follow-up test that:

  1. Completes a successful turn first (creating a cached session)
  2. Forces a transient error on the next turn
  3. Verifies the session is removed

This would confirm the end-to-end cache-clearing flow.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/channels/providers/web_tests.rs` around lines 125 - 170, The
current integration test
start_chat_emits_rate_limit_error_without_sticky_session_state never creates a
session, so the "must clear cached session" assertion is vacuous; change the
test (or add a new one) to first create and confirm a cached session via
start_chat (call start_chat once under normal behavior and assert
THREAD_SESSIONS contains key_for("rate-limit-client","rate-limit-thread")), then
enable the forced transient error using set_test_forced_run_chat_task_error /
TestForcedRunChatTaskErrorGuard and call start_chat again to trigger the
rate-limit path in run_chat_task, then assert THREAD_SESSIONS no longer contains
that map_key—this verifies end-to-end cache-clearing rather than relying on the
session never having been inserted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/openhuman/channels/providers/web_tests.rs`:
- Around line 125-170: The current integration test
start_chat_emits_rate_limit_error_without_sticky_session_state never creates a
session, so the "must clear cached session" assertion is vacuous; change the
test (or add a new one) to first create and confirm a cached session via
start_chat (call start_chat once under normal behavior and assert
THREAD_SESSIONS contains key_for("rate-limit-client","rate-limit-thread")), then
enable the forced transient error using set_test_forced_run_chat_task_error /
TestForcedRunChatTaskErrorGuard and call start_chat again to trigger the
rate-limit path in run_chat_task, then assert THREAD_SESSIONS no longer contains
that map_key—this verifies end-to-end cache-clearing rather than relying on the
session never having been inserted.

In `@src/openhuman/channels/providers/web.rs`:
- Around line 344-368: Change the log level in clear_cached_session_after_error
to debug (or trace) so clearing cached sessions on transient errors doesn't spam
production logs: in the clear_cached_session_after_error function (which uses
should_clear_cached_session_after_error and THREAD_SESSIONS) replace the
log::info! call with log::debug! (or log::trace!) and keep the same message and
variables (map_key, error_type) to preserve context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7e5829cd-d64c-4503-80ab-5ee955b3aa14

📥 Commits

Reviewing files that changed from the base of the PR and between 4dfd2b9 and 0ce8558.

📒 Files selected for processing (2)
  • src/openhuman/channels/providers/web.rs
  • src/openhuman/channels/providers/web_tests.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 21, 2026
@aqilaziz aqilaziz force-pushed the codex/2364-rate-limit-session-reset branch from 0ce8558 to b4f754d Compare May 21, 2026 06:11
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/openhuman/channels/providers/web.rs (1)

361-366: ⚡ Quick win

Use debug! for this cache-reset diagnostic log.

This is a state-transition diagnostic, so info! will add noise on every transient provider blip. debug! fits the repo guidance better.

♻️ Proposed fix
-        log::info!(
+        log::debug!(
             "[web-channel] cleared cached session after transient error map_key={} error_type={}",
             map_key,
             error_type
         );

As per coding guidelines, "Use log / tracing at debug or trace level on RPC entry and exit, error paths, state transitions, and any branch that is hard to infer from tests alone."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/channels/providers/web.rs` around lines 361 - 366, Replace the
informational log call used when clearing the cached session with a debug-level
log: in the block that checks the removed flag (variable removed) where it
currently invokes log::info! with map_key and error_type, change the macro to
log::debug! so this state-transition diagnostic uses debug level for the
transient provider cache-reset message.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 525-527: The transient-session cleanup
(clear_cached_session_after_error) must not run in an abortable path because
run_chat_task re-inserts SessionEntry into THREAD_SESSIONS and a concurrent
start_chat can abort this task, leaving a poisoned session cached; modify
run_chat_task (and callers like start_chat) so that session caching and cleanup
are decided and executed outside the abortable spawned task: either have
run_chat_task take responsibility for caching the SessionEntry internally and
perform any clear_cached_session_after_error synchronously before yielding
control, or change run_chat_task to return a result that includes a cleanup
decision/flag (e.g., ErrorWithCleanup { err, should_clear_session }) so the
caller can call clear_cached_session_after_error(&map_key_task,
classified_type).await outside the spawn/abort boundary; ensure references to
THREAD_SESSIONS, SessionEntry, clear_cached_session_after_error, run_chat_task
and start_chat are updated accordingly so no transient cleanup awaits inside a
task that can be aborted.

---

Nitpick comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 361-366: Replace the informational log call used when clearing the
cached session with a debug-level log: in the block that checks the removed flag
(variable removed) where it currently invokes log::info! with map_key and
error_type, change the macro to log::debug! so this state-transition diagnostic
uses debug level for the transient provider cache-reset message.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2392b64e-0c38-41cc-94d6-efa6afd856c4

📥 Commits

Reviewing files that changed from the base of the PR and between 0ce8558 and b4f754d.

📒 Files selected for processing (2)
  • src/openhuman/channels/providers/web.rs
  • src/openhuman/channels/providers/web_tests.rs

Comment thread src/openhuman/channels/providers/web.rs Outdated
@aqilaziz aqilaziz force-pushed the codex/2364-rate-limit-session-reset branch from b4f754d to 5221ba9 Compare May 21, 2026 07:02
@coderabbitai coderabbitai Bot added agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. bug labels May 21, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 21, 2026
@aqilaziz aqilaziz force-pushed the codex/2364-rate-limit-session-reset branch from 5221ba9 to fee1894 Compare May 21, 2026 07:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. bug working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rate limit state sticks to one chat thread

1 participant