fix(chat): reset web session after rate limits#2412
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughDetects budget/credit exhaustion earlier, adds a ChangesPrevent Rate-Limit State from Sticking to Chat Threads
Sequence Diagram(s)sequenceDiagram
participant start_chat_handler
participant run_chat_task
participant classify_inference_error
participant cache_logic
participant THREAD_SESSIONS
participant WebChannelEvent
start_chat_handler->>run_chat_task: spawn run_chat_task
run_chat_task->>start_chat_handler: returns Error
start_chat_handler->>classify_inference_error: classify(error)
classify_inference_error-->>start_chat_handler: classified_type
start_chat_handler->>cache_logic: should_skip_insert?(classified_type)
cache_logic-->>start_chat_handler: skip / insert decision
alt skip
start_chat_handler->>THREAD_SESSIONS: do not insert (log discard)
else insert
start_chat_handler->>THREAD_SESSIONS: insert rebuilt SessionEntry
end
start_chat_handler->>WebChannelEvent: emit chat_error
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/openhuman/channels/providers/web_tests.rs (1)
124-163: 💤 Low valueTest name implies session-cache verification, but only checks the emitted event.
The test name
start_chat_emits_rate_limit_error_without_sticky_session_statesuggests it validates that session state isn't sticky after a rate-limit error. However, the test only verifies that thechat_errorevent haserror_type == "rate_limited"and the correct user-facing message — it doesn't actually confirm thatTHREAD_SESSIONSwas cleared.Consider either:
- Renaming to
start_chat_emits_rate_limit_chat_error_eventto reflect what it actually tests, or- Adding an assertion that the session cache no longer contains an entry for
"rate-limit-client::rate-limit-thread"after the error.The
transient_errors_clear_cached_web_sessiontest at line 198 does cover the predicate, but an integration-level assertion here would strengthen the regression coverage for issue#2364.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/channels/providers/web_tests.rs` around lines 124 - 163, The test start_chat_emits_rate_limit_error_without_sticky_session_state currently only asserts the emitted chat_error event but not whether the session cache was cleared; either rename the test to start_chat_emits_rate_limit_chat_error_event to match behavior, or (preferred) add an assertion after receiving recv that THREAD_SESSIONS does not contain the key format "rate-limit-client::rate-limit-thread" (use the same cache access utility used elsewhere in tests, e.g., checking THREAD_SESSIONS or the helper used in transient_errors_clear_cached_web_session) to ensure the sticky session was removed after the rate-limit error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 328-333: The match arm "network" in
should_clear_cached_session_after_error is unreachable because
classify_inference_error never returns "network"; either remove "network" from
should_clear_cached_session_after_error to avoid dead code, or add a
corresponding branch in classify_inference_error that maps
DNS/connect/transport-level failures to "network" (e.g., detect IO/connection
errors and return "network") so the variant becomes reachable; update the
function should_clear_cached_session_after_error and/or classify_inference_error
accordingly, referencing should_clear_cached_session_after_error and
classify_inference_error to locate the changes.
---
Nitpick comments:
In `@src/openhuman/channels/providers/web_tests.rs`:
- Around line 124-163: The test
start_chat_emits_rate_limit_error_without_sticky_session_state currently only
asserts the emitted chat_error event but not whether the session cache was
cleared; either rename the test to start_chat_emits_rate_limit_chat_error_event
to match behavior, or (preferred) add an assertion after receiving recv that
THREAD_SESSIONS does not contain the key format
"rate-limit-client::rate-limit-thread" (use the same cache access utility used
elsewhere in tests, e.g., checking THREAD_SESSIONS or the helper used in
transient_errors_clear_cached_web_session) to ensure the sticky session was
removed after the rate-limit error.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 99b4deb2-c94f-466a-93ef-79965915a715
📒 Files selected for processing (2)
src/openhuman/channels/providers/web.rssrc/openhuman/channels/providers/web_tests.rs
71a7f72 to
4dfd2b9
Compare
4dfd2b9 to
0ce8558
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/openhuman/channels/providers/web.rs (1)
344-368: 💤 Low valueConsider using
debuglevel for the session-clearing log.The coding guidelines specify using
debugortracelevel for state transitions. This info-level log will appear in production output on every transient error recovery, which may be noisier than intended for routine retry scenarios.♻️ Suggested change
if removed { - log::info!( + log::debug!( "[web-channel] cleared cached session after transient error map_key={} error_type={}", map_key, error_type ); }As per coding guidelines: "Use
log/tracingatdebugortracelevel on RPC entry and exit, error paths, state transitions."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/channels/providers/web.rs` around lines 344 - 368, Change the log level in clear_cached_session_after_error to debug (or trace) so clearing cached sessions on transient errors doesn't spam production logs: in the clear_cached_session_after_error function (which uses should_clear_cached_session_after_error and THREAD_SESSIONS) replace the log::info! call with log::debug! (or log::trace!) and keep the same message and variables (map_key, error_type) to preserve context.src/openhuman/channels/providers/web_tests.rs (1)
125-170: ⚖️ Poor tradeoffSession-clearing assertion passes vacuously — no session was created.
The forced error triggers inside
run_chat_taskbefore any session is cached (the#[cfg(test)]block returnsErrat line 724 ofweb.rs, before the session insert at lines 909-917). The assertion at lines 166-168 verifies the key doesn't exist, which is true because it was never inserted — not because it was cleared.The unit test
transient_errors_clear_cached_web_session(lines 214-234) correctly tests the predicate, but this integration test doesn't verify the actual removal behavior. Consider adding a follow-up test that:
- Completes a successful turn first (creating a cached session)
- Forces a transient error on the next turn
- Verifies the session is removed
This would confirm the end-to-end cache-clearing flow.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/channels/providers/web_tests.rs` around lines 125 - 170, The current integration test start_chat_emits_rate_limit_error_without_sticky_session_state never creates a session, so the "must clear cached session" assertion is vacuous; change the test (or add a new one) to first create and confirm a cached session via start_chat (call start_chat once under normal behavior and assert THREAD_SESSIONS contains key_for("rate-limit-client","rate-limit-thread")), then enable the forced transient error using set_test_forced_run_chat_task_error / TestForcedRunChatTaskErrorGuard and call start_chat again to trigger the rate-limit path in run_chat_task, then assert THREAD_SESSIONS no longer contains that map_key—this verifies end-to-end cache-clearing rather than relying on the session never having been inserted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/openhuman/channels/providers/web_tests.rs`:
- Around line 125-170: The current integration test
start_chat_emits_rate_limit_error_without_sticky_session_state never creates a
session, so the "must clear cached session" assertion is vacuous; change the
test (or add a new one) to first create and confirm a cached session via
start_chat (call start_chat once under normal behavior and assert
THREAD_SESSIONS contains key_for("rate-limit-client","rate-limit-thread")), then
enable the forced transient error using set_test_forced_run_chat_task_error /
TestForcedRunChatTaskErrorGuard and call start_chat again to trigger the
rate-limit path in run_chat_task, then assert THREAD_SESSIONS no longer contains
that map_key—this verifies end-to-end cache-clearing rather than relying on the
session never having been inserted.
In `@src/openhuman/channels/providers/web.rs`:
- Around line 344-368: Change the log level in clear_cached_session_after_error
to debug (or trace) so clearing cached sessions on transient errors doesn't spam
production logs: in the clear_cached_session_after_error function (which uses
should_clear_cached_session_after_error and THREAD_SESSIONS) replace the
log::info! call with log::debug! (or log::trace!) and keep the same message and
variables (map_key, error_type) to preserve context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7e5829cd-d64c-4503-80ab-5ee955b3aa14
📒 Files selected for processing (2)
src/openhuman/channels/providers/web.rssrc/openhuman/channels/providers/web_tests.rs
0ce8558 to
b4f754d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/openhuman/channels/providers/web.rs (1)
361-366: ⚡ Quick winUse
debug!for this cache-reset diagnostic log.This is a state-transition diagnostic, so
info!will add noise on every transient provider blip.debug!fits the repo guidance better.♻️ Proposed fix
- log::info!( + log::debug!( "[web-channel] cleared cached session after transient error map_key={} error_type={}", map_key, error_type );As per coding guidelines, "Use
log/tracingatdebugortracelevel on RPC entry and exit, error paths, state transitions, and any branch that is hard to infer from tests alone."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/channels/providers/web.rs` around lines 361 - 366, Replace the informational log call used when clearing the cached session with a debug-level log: in the block that checks the removed flag (variable removed) where it currently invokes log::info! with map_key and error_type, change the macro to log::debug! so this state-transition diagnostic uses debug level for the transient provider cache-reset message.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 525-527: The transient-session cleanup
(clear_cached_session_after_error) must not run in an abortable path because
run_chat_task re-inserts SessionEntry into THREAD_SESSIONS and a concurrent
start_chat can abort this task, leaving a poisoned session cached; modify
run_chat_task (and callers like start_chat) so that session caching and cleanup
are decided and executed outside the abortable spawned task: either have
run_chat_task take responsibility for caching the SessionEntry internally and
perform any clear_cached_session_after_error synchronously before yielding
control, or change run_chat_task to return a result that includes a cleanup
decision/flag (e.g., ErrorWithCleanup { err, should_clear_session }) so the
caller can call clear_cached_session_after_error(&map_key_task,
classified_type).await outside the spawn/abort boundary; ensure references to
THREAD_SESSIONS, SessionEntry, clear_cached_session_after_error, run_chat_task
and start_chat are updated accordingly so no transient cleanup awaits inside a
task that can be aborted.
---
Nitpick comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 361-366: Replace the informational log call used when clearing the
cached session with a debug-level log: in the block that checks the removed flag
(variable removed) where it currently invokes log::info! with map_key and
error_type, change the macro to log::debug! so this state-transition diagnostic
uses debug level for the transient provider cache-reset message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2392b64e-0c38-41cc-94d6-efa6afd856c4
📒 Files selected for processing (2)
src/openhuman/channels/providers/web.rssrc/openhuman/channels/providers/web_tests.rs
b4f754d to
5221ba9
Compare
5221ba9 to
fee1894
Compare
Summary - Stops caching web-chat agent sessions after transient chat errors, including
rate_limited,timeout,network, andprovider_error. - Rebuilds the same thread's session on retry so stale failed turn state cannot keep the thread stuck. - Classifies provider 429 balance failures asbudget_exhaustedinstead of transient rate-limit copy. - Adds focused Rust regression coverage for rate-limit chat errors, budget-vs-rate-limit classification, network classification, and transient session-reset policy. ## Problem A thread can keep showing rate-limit copy while a new thread works, which points to transient request state being retained in the per-thread web-channel session cache. The existing classifier also treated any 429 as a transient rate limit before checking budget-like provider messages. ## Solution - Skip storing web-channel sessions after transient error classes (rate_limited,timeout,network,provider_error). - Keep non-transient classes such as auth, model, context, and budget errors out of that transient reset path. - Check budget exhaustion markers before the generic 429/rate-limit classifier. - Map transport/connectivity failures tonetworkso the transient cache-clear policy is reachable. ## Submission Checklist - [x] Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy - [x] Diff coverage >= 80% - focused Rust tests added for changed behavior; CI coverage gate will enforce merged diff coverage. - [x] Coverage matrix updated - N/A: behavior-only web-channel error/session handling, no feature row added/removed/renamed - [x] All affected feature IDs from the matrix are listed in the PR description under ## Related - N/A: no matrix feature ID changed - [x] No new external network dependencies introduced (mock backend used per Testing Strategy) - [x] Manual smoke checklist updated if this touches release-cut surfaces - N/A: no release-cut smoke surface changed - [x] Linked issue closed via Closes #NNN in the ## Related section ## Impact - Desktop chat retry behavior changes only after transient web-channel failures. - No migration or new dependency. - True budget/auth/model/context failures keep their existing non-transient classification. ## Related - Closes #2364 - Follow-up PR(s)/TODOs: N/A --- ## AI Authored PR Metadata (required for Codex/Linear PRs) ### Linear Issue - Key: N/A - URL: N/A ### Commit & Branch - Branch: codex/2364-rate-limit-session-reset - Commit SHA: fee189461b77cecf4cd5deca63f9b98e8f903f58 ### Validation Run - [x]cargo fmt -- --check- [x]git diff --check- [x] Focused tests attempted locally:cargo test --lib transient_chat_errors_skip_cached_web_session_storage -- --nocapture(blocked by missing Windows libclang; see below) - [x] Rust fmt/check (if changed):cargo fmt -- --check- [x] Tauri fmt/check (if changed): N/A ### Validation Blocked - command:cargo test --lib transient_chat_errors_skip_cached_web_session_storage -- --nocapture- error: Unable to find libclang: could not findclang.dll/libclang.dll; setLIBCLANG_PATH- impact: Local Windows Rust test run blocked before executing tests; CI Linux Rust jobs should validate the added tests. ### Behavior Changes - Intended behavior change: retrying the same thread after a transient rate-limit/provider/network failure starts from a fresh web-channel session instead of reusing a possibly poisoned cached agent. - User-visible effect: users can recover in the same thread after transient throttling clears. ### Parity Contract - Legacy behavior preserved: non-transient auth, budget, model, context, and generic inference errors are not treated as transient session poison. - Guard/fallback/dispatch parity checks: event path still emits the same chat_error event shape and message fields. ### Duplicate / Superseded PR Handling - Duplicate PR(s): N/A - Canonical PR: this PR - Resolution (closed/superseded/updated): N/A ## Summary by CodeRabbit * Bug Fixes * Better separates payment/budget failures from rate limits and adds explicit network connectivity classification; budget messages now suggest topping up while network errors show a sanitized connectivity message. * Clears cached conversation sessions for transient failures (rate limits, timeouts, network, provider errors) to avoid sticky sessions. * Tests * Added tests for error classification, sanitized messages, session-cache invalidation, and transient-error behavior.