Skip to content

Expire stale pending plan questions on cold app startup#862

Open
marvin-bitterlich wants to merge 2 commits intopingdotgg:mainfrom
marvin-bitterlich:fix/856-pending-question-restart
Open

Expire stale pending plan questions on cold app startup#862
marvin-bitterlich wants to merge 2 commits intopingdotgg:mainfrom
marvin-bitterlich:fix/856-pending-question-restart

Conversation

@marvin-bitterlich
Copy link

@marvin-bitterlich marvin-bitterlich commented Mar 10, 2026

Closes #856

Why

If T3 Code is fully closed while a thread is waiting on a plan-mode / request_user_input question, reopening the app can leave that thread stuck.

The pending question is reconstructed from persisted thread activity, but the live provider-side request context does not survive a full app restart. As a result, the UI can show a question that no longer has any valid way to be answered.

The important part here is when to recover from that state. We did not want to wait until the user tries to submit an answer, because by that point they have already spent time reading the question and thinking through a response. The correct time to reconcile this mismatch is when the app cold-starts and resumes persisted state, so the UI is already in a clean state when the user returns to the thread.

How

This adds a dedicated cold-start lifecycle step to orchestration startup.

On startup, before the normal orchestration reactors begin processing, the server:

  • checks whether the current process already has any live provider sessions
  • if not, treats the restored state as a cold-start resume path
  • scans persisted thread activity for unresolved pending user-input requests
  • expires those stale requests by appending a user-input.expired activity

This also extracts the existing pending user-input derivation logic into shared runtime code so the server and web use the same rules for deciding whether a prompt is still open. That keeps the new cold-start reconciliation behavior aligned with the existing UI state model instead of introducing a second copy of that logic.

Decisions

  • This is handled during cold-start reconciliation, not when the user submits an answer.
    Expiring the stale prompt only after the user interacts would create a poor recovery flow and would surface the problem at the latest possible moment.

  • This lives in a dedicated cold-start lifecycle, not in ProviderCommandReactor.
    The goal was not to add a one-off abstraction, but to give startup-only repair logic a clear home that is separate from normal event reaction. This keeps the boundary explicit and makes future cold-start cleanup work easier to add in the same place.

  • The fix is intentionally scoped to cold app restart.
    The reported bug is specifically about reopening the app with persisted state from a previous process. Broader in-process session-loss recovery can be handled separately if needed, but it is a different lifecycle boundary.

  • Stale prompts are expired, not resolved.
    After restart, the original provider-side request context is gone, so marking the prompt as resolved would imply a successful interaction that never actually happened.

  • This is a server-side state fix, not a UI-only workaround.
    Hiding the stale prompt in the UI would leave the persisted orchestration state inconsistent with what the user sees.

Note

Expire stale pending plan questions on cold app startup

  • Adds a ColdStartLifecycle service that runs at orchestration reactor startup, before other reactors, and expires any pending user-input.requested prompts by appending a user-input.expired activity with reason server-restart.
  • Expiration is skipped when active provider sessions exist, ensuring in-flight sessions are not disrupted.
  • Extracts shared derivePendingUserInputs logic into packages/shared/src/pendingUserInput.ts, which now includes turnId in results and treats user-input.expired as a closing event.
  • The web client's derivePendingUserInputs now delegates to the shared implementation, so expired prompts are no longer shown as pending in the UI.
  • Risk: the shared parser discards malformed questions more strictly than the previous local implementation, which may silently drop previously-visible prompts.

Macroscope summarized c521a68.

@coderabbitai
Copy link

coderabbitai bot commented Mar 10, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: bdc6f83e-c5ff-4328-959c-4ff93ac875e4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added the vouch:unvouched PR author is not yet trusted in the VOUCHED list. label Mar 10, 2026
left: OrchestrationThreadActivity,
right: OrchestrationThreadActivity,
): number {
if (left.sequence !== undefined && right.sequence !== undefined) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low src/pendingUserInput.ts:19

The compareActivitiesByOrder function sorts any activity with a sequence property after activities without one, which breaks chronological ordering in derivePendingUserInputs. For example, if a user-input.requested has a sequence but the later user-input.resolved does not (common for async acknowledgments), the resolution sorts before the request. The function processes the resolution first (no-op), then the request, leaving the input stuck as "pending" instead of "resolved".

Consider using createdAt as the primary sort key and only using sequence as a tiebreaker when both activities have it, or document if the current behavior is intentional for a specific ordering requirement.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file packages/shared/src/pendingUserInput.ts around line 19:

The `compareActivitiesByOrder` function sorts any activity with a `sequence` property *after* activities without one, which breaks chronological ordering in `derivePendingUserInputs`. For example, if a `user-input.requested` has a `sequence` but the later `user-input.resolved` does not (common for async acknowledgments), the resolution sorts before the request. The function processes the resolution first (no-op), then the request, leaving the input stuck as "pending" instead of "resolved".

Consider using `createdAt` as the primary sort key and only using `sequence` as a tiebreaker when both activities have it, or document if the current behavior is intentional for a specific ordering requirement.

Evidence trail:
packages/shared/src/pendingUserInput.ts lines 16-29: The `compareActivitiesByOrder` function returns 1 when `left.sequence !== undefined` and `right.sequence === undefined` (line 22), placing activities with sequence AFTER those without.

packages/shared/src/pendingUserInput.ts lines 77-102: `derivePendingUserInputs` uses this sort order (line 79), then processes activities sequentially - adding to map for 'user-input.requested' (lines 88-96) and deleting for 'user-input.resolved' (lines 98-100). If resolved sorts before requested due to the sequence logic, the delete is a no-op and the request gets added after, leaving the input incorrectly marked as pending.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This ordering logic was extracted from the existing web-side derivePendingUserInputs implementation into the new shared helper, so the issue is not unique to the new cold-start path, but this PR does make it shared between web and server and is the right place to fix it.

I’m going to update the comparator so createdAt remains the primary ordering key, with sequence only used as a tiebreaker when both activities have it, and add a regression test for mixed sequenced / unsequenced request/close activities.

@marvin-bitterlich marvin-bitterlich force-pushed the fix/856-pending-question-restart branch from 18a75b5 to 1808699 Compare March 10, 2026 23:01
@juliusmarminge
Copy link
Member

hmm as long as we send the right resumeCursor along and have the codex provider decode that and forward it in the right places we should be able to resume the session at the provider level without issues?

@marvin-bitterlich
Copy link
Author

hmm as long as we send the right resumeCursor along and have the codex provider decode that and forward it in the right places we should be able to resume the session at the provider level without issues?

That was my first thought too, and I might be missing something about how Codex handles resumed pending prompts.

From what I can tell, we already pass resumeCursor through the resume path, so I don’t think this is just us not forwarding the right resume state. What still seems missing is the in-memory pending user-input bookkeeping in CodexAppServerManager that respondToUserInput(...) relies on.

So my current assumption is:

  • if Codex replays pending requestUserInput calls on resume, then your approach is probably the right fix
  • if it doesn’t, then resuming the thread alone still leaves us without enough state to answer the old prompt

But I may be missing a Codex guarantee there, so if resumed sessions are supposed to replay those pending requests, I’m happy to switch direction. Most of my work has been with Claude Code which does not resume pending tool calls at all so I presumed Codex does the same 😅

@marvin-bitterlich marvin-bitterlich force-pushed the fix/856-pending-question-restart branch from 1808699 to c521a68 Compare March 10, 2026 23:16
@marvin-bitterlich
Copy link
Author

(Did a force push to actually start signing my commits again on this device)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

vouch:unvouched PR author is not yet trusted in the VOUCHED list.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: pending plan questions become stuck after app restart

2 participants