Avoid suspending ephemeral agents #3339

Merged: vigoo merged 5 commits into main from ephemeral-no-suspend on May 7, 2026

@vigoo vigoo commented May 5, 2026

Resolves #3327

Golem suspends agents in various situations, and before this PR it did so for durable and ephemeral agents alike. Ephemeral agents cannot recover from being suspended, though. This PR changes how each of these suspend cases behaves for ephemeral agents:

  • Short sleeps (but above the configured suspend threshold): ephemeral agents now sleep in-process when the requested sleep is within suspend.ephemeral_max_sleep
  • Long sleeps: if an ephemeral agent sleeps longer than suspend.ephemeral_max_sleep, the invocation fails with EphemeralSleepTooLong
  • Promise waits / all pollables blocked: ephemeral agents no longer take the durable-worker “all promise-backed pollables are blocked → suspend” shortcut. They wait in-process until the pollable becomes ready.
  • Fuel exhaustion: ephemeral agents no longer suspend when account fuel cannot be borrowed mid-invocation. If allowed, they continue using bounded local overdraft; otherwise the invocation fails with EphemeralFuelExhausted.
  • Fuel overdraft accounting: when an ephemeral invocation uses local overdraft, only the actually consumed overdraft is recorded as account debt at invocation end.
  • Monthly HTTP/RPC budget exhaustion: ephemeral agents fail immediately with EphemeralCannotSuspend instead of suspending until budget replenishment.
  • Quota throttling: ephemeral agents fail immediately with EphemeralCannotSuspend instead of suspending, and no resume action is scheduled for them.
  • External interrupt: interrupting an ephemeral agent still interrupts the running invocation; the CLI now warns/asks for confirmation because the agent cannot be resumed afterward.
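The two sleep cases in the list above can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the code from this PR: `AgentMode`, `SleepDecision`, and `decide_sleep` are hypothetical names, the exact boundary comparisons (`>=` and `<=`) are assumed, and the two duration parameters stand in for the configured suspend threshold and `suspend.ephemeral_max_sleep`.

```rust
use std::time::Duration;

/// Whether an agent can be recovered after it leaves memory.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentMode {
    Durable,
    Ephemeral,
}

/// Hypothetical outcome of a sleep request (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SleepDecision {
    /// Keep the agent in memory and sleep there.
    SleepInProcess,
    /// Drop the agent from memory and resume it later (durable only).
    Suspend,
    /// Fail the invocation, mirroring the `EphemeralSleepTooLong` error.
    FailEphemeralSleepTooLong,
}

/// Decide how to handle a sleep of `requested` length.
/// Assumption: durable agents suspend at or above `suspend_threshold`.
/// Assumption: ephemeral agents may sleep in-process up to and including
/// `ephemeral_max_sleep`, and fail beyond it.
pub fn decide_sleep(
    mode: AgentMode,
    requested: Duration,
    suspend_threshold: Duration,
    ephemeral_max_sleep: Duration,
) -> SleepDecision {
    match mode {
        AgentMode::Durable if requested >= suspend_threshold => SleepDecision::Suspend,
        AgentMode::Durable => SleepDecision::SleepInProcess,
        AgentMode::Ephemeral if requested <= ephemeral_max_sleep => SleepDecision::SleepInProcess,
        AgentMode::Ephemeral => SleepDecision::FailEphemeralSleepTooLong,
    }
}
```

With a 1-second threshold and a 10-second maximum, a 5-second sleep would suspend a durable agent but keep an ephemeral one sleeping in memory, while a 60-second sleep would fail the ephemeral invocation.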

The idea behind failing on quota exhaustion is that retrying is the caller's responsibility. If we allowed these ephemeral agents to keep running and blocked them in memory until they could resume, that could exhaust other resources, because they would have to stay in memory for a potentially long time.
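On the caller side, "retry is the caller's responsibility" could look roughly like the sketch below. It is illustrative only: `InvokeError` and `invoke_with_retry` are hypothetical, and how `EphemeralCannotSuspend` actually surfaces to a caller (error type, status code, message) is an assumption that this PR description does not specify.

```rust
use std::time::Duration;

/// Hypothetical caller-visible error type. Only the variant name
/// `EphemeralCannotSuspend` comes from the PR description.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InvokeError {
    EphemeralCannotSuspend,
    Other(String),
}

/// Call `invoke`, retrying with exponential backoff while it fails with
/// `EphemeralCannotSuspend`. Any other result is returned immediately.
/// `invoke` is always called at least once, even if `max_attempts` is 0.
pub fn invoke_with_retry<T, F>(
    mut invoke: F,
    max_attempts: u32,
    initial_backoff: Duration,
) -> Result<T, InvokeError>
where
    F: FnMut() -> Result<T, InvokeError>,
{
    let mut backoff = initial_backoff;
    let mut attempt = 1;
    loop {
        match invoke() {
            Err(InvokeError::EphemeralCannotSuspend) if attempt < max_attempts => {
                // Give the quota or budget time to replenish before retrying.
                std::thread::sleep(backoff);
                backoff = backoff.saturating_mul(2);
                attempt += 1;
            }
            // Success, a non-retryable error, or the final failed attempt.
            other => return other,
        }
    }
}
```

Because each retry is a fresh invocation, the executor holds no blocked ephemeral agent in memory between attempts; the waiting happens in the caller.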

thesparq commented May 5, 2026

Does this mean that ephemeral agents can't use promises? For instance, say I want an agent that is created, does its work, and during that work creates a promise that requires the user's action. Suppose the user delays for hours, during which the agent is suspended without consuming resources; the user later fulfills the promise, and the agent resumes, completes its work, and then dies. An example would be a multi-step form or survey agent that does not need to be completely durable or everlasting, just durable for the lifetime of the form or survey: the form state has to be saved on the backend (the agent) while the form is being filled in, and when the form is submitted, the ephemeral agent dies off.

vigoo commented May 6, 2026

Ephemeral agents are for short-lived, non-durable scenarios, such as a request handler. They have no durability guarantees, so on any crash, rebalancing, or similar event they simply fail to finish. Because of this, they should not be used for anything long-running, such as waiting for an external promise. What you are looking for is the durable agent; I think there is a misunderstanding about how durable and ephemeral agents differ.

Both durable and ephemeral agents write to the oplog and leave a "trace" that can be observed later with tools like golem agent oplog. Ephemeral agents write this oplog more lazily to achieve higher performance, so if they don't finish because of a restart or similar (as mentioned above), some entries may not have been written out yet. That's a trade-off.

Even though ephemeral agents write an oplog for observability reasons, they can never recover their state the way durable agents do. That's the difference. If an ephemeral agent goes out of memory, it's gone forever. Also, an ephemeral agent can be invoked only once (technically it can be invoked more than once, but it starts from an empty state every time, so doing so does not make much sense).

Note that the following features are available for durable agents, which may cover what you are looking for:

  • Idle agents are suspended. Whenever an agent is waiting for something, or has nothing in its invocation queue, it can go out of memory and stop consuming further resources. The only "cost" it has is storage space (but note that this is exactly the same for ephemeral agents).
  • Ephemeral agents are always phantom agents by default, which means they get an auto-generated UUID as part of their identity (besides the constructor parameters). But the phantom agent feature can also be used for durable agents, if needed.

if record_ephemeral_promise_wait {
    inc_promise_waiting();
}
let either_result = futures::future::select(poll, interrupt_signal).await;

We might need to introduce a timeout here, to avoid polluting the executor with ephemeral agents that are blocked on promises that never get completed.
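To make the suggestion concrete, here is a minimal sketch of a bounded wait. It is not the executor's code: the code under review is async and races `poll` against `interrupt_signal` with `futures::future::select`, whereas this sketch uses a blocking standard-library channel as a stand-in so that it is self-contained. `WaitOutcome` and `wait_with_timeout` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical result of a bounded wait on a promise-like source.
#[derive(Debug, PartialEq, Eq)]
pub enum WaitOutcome<T> {
    /// The value arrived within the allowed time.
    Ready(T),
    /// Nothing arrived in time; the caller can fail the invocation
    /// instead of keeping the agent in memory indefinitely.
    TimedOut,
    /// The sending side was dropped, so the value can never arrive.
    ProducerGone,
}

/// Wait for a value on `ready` for at most `max_wait`.
pub fn wait_with_timeout<T>(ready: &Receiver<T>, max_wait: Duration) -> WaitOutcome<T> {
    match ready.recv_timeout(max_wait) {
        Ok(value) => WaitOutcome::Ready(value),
        Err(RecvTimeoutError::Timeout) => WaitOutcome::TimedOut,
        Err(RecvTimeoutError::Disconnected) => WaitOutcome::ProducerGone,
    }
}
```

In the async setting the same idea would presumably be a third branch in the race (poll, interrupt, timer), with the timer branch failing the ephemeral invocation.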

let poll = Host::poll(&mut io_data, in_);
pin_mut!(poll);

if record_ephemeral_promise_wait {

Is this guaranteed to be balanced?

Contributor Author

Probably not if the executor crashes.
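One common way to keep such a counter balanced on every in-process exit path (normal return, early return, unwinding panic) is an RAII guard that decrements on drop. The sketch below is an illustration only: the excerpt above shows just the `inc_promise_waiting()` call, so the counter type, `WaitingGuard`, and `wait_counted` are all hypothetical, and this is not a claim about how the PR implements it. As the reply notes, nothing in-process can rebalance the counter if the executor itself crashes.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Increments a counter on creation and decrements it when dropped,
/// so every increment is paired with exactly one decrement.
pub struct WaitingGuard<'a> {
    counter: &'a AtomicUsize,
}

impl<'a> WaitingGuard<'a> {
    pub fn enter(counter: &'a AtomicUsize) -> Self {
        counter.fetch_add(1, Ordering::SeqCst);
        Self { counter }
    }
}

impl Drop for WaitingGuard<'_> {
    fn drop(&mut self) {
        // Runs on normal return, early return, and panic unwinding.
        self.counter.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Run `wait`, counting it as "waiting" only when `record` is true.
/// `record` plays the role of `record_ephemeral_promise_wait` in the excerpt.
pub fn wait_counted<T>(counter: &AtomicUsize, record: bool, wait: impl FnOnce() -> T) -> T {
    // `None` when `record` is false; the guard (if any) lives until return.
    let _guard = record.then(|| WaitingGuard::enter(counter));
    wait()
}
```

While `wait` runs with `record` set, the counter reads one higher; once `wait_counted` returns, it is back to its previous value.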

thesparq commented May 6, 2026

I now understand you better. I guess what I was looking for was an auto-deleting durable agent 😂: durable, but deleting itself once it finishes. So for my use cases I need to use durable agents and somehow delete them afterwards, because they serve only one particular purpose and it would not be wise to keep them around consuming storage when I know they will never be invoked again.

@vigoo vigoo merged commit ef7c1fc into main May 7, 2026
51 checks passed
@vigoo vigoo deleted the ephemeral-no-suspend branch May 7, 2026 11:17
@github-actions github-actions Bot locked and limited conversation to collaborators May 7, 2026

Linked issue: Do not suspend ephemeral agents