OpenWorkflow is a framework for building reliable, long-running applications. It allows developers to write durable functions, called workflows, that can survive process crashes, server restarts, and code deploys.
It achieves this through a worker-driven architecture. Instead of relying on a central orchestrator server, OpenWorkflow uses a pool of stateless Workers that run within the user's infrastructure. These workers communicate with a durable Backend (like a Postgres or SQLite database) which acts as the single source of truth for all workflow state. This model provides the power of durable execution with minimal operational complexity.
- Workflow: A durable function that orchestrates multiple steps. Workflows are deterministic, resumable, and versioned.
- Workflow Run: A single, complete execution instance of a workflow, from start to finish. Each run is a state machine managed by the workers.
- Step: A durable, memoized checkpoint within a workflow. A step represents a unit of work, like a database query or an API call.
- Step Attempt: A record in the Backend representing the state and result of a single step attempt within a specific workflow run.
- Worker: A long-running process in the user's application that polls the Backend for pending workflows, executes their code, and persists the results.
- Client: The part of the OpenWorkflow SDK used by application code to start and query workflow runs.
- Backend: A pluggable persistence layer (e.g., a Postgres or SQLite database) that stores all state for workflow runs and step attempts. It serves as the queue and the durable state log.
- Config: openworkflow.config.* defines backend settings, workflow discovery paths, and optional ignore patterns for CLI commands. It typically imports the shared backend from openworkflow/client.* so app code and CLI use the same connection.
- availableAt: A critical timestamp on a workflow run that controls its visibility to workers. It is used for scheduling, heartbeating, crash recovery, and durable timers.
- deadlineAt: An optional timestamp on a workflow run that specifies the deadline by which the workflow must complete. If the deadline is reached, the workflow run is marked as failed.
A workflow run can be in one of the following states:
- pending: The workflow run has been created and is waiting for a worker to claim it.
- running: The workflow run is either actively being executed by a worker or durably parked with workerId = null until availableAt.
- sleeping (deprecated): Legacy parked state kept for backward compatibility. New runs are parked in running instead.
- succeeded (deprecated): Legacy success state kept for backward compatibility. New successful runs use completed.
- completed: The workflow run has completed successfully.
- failed: The workflow run has failed after exhausting its retries or reaching its deadline.
- canceled: The workflow run has been explicitly canceled and will not be processed further.
A step attempt can be in one of the following states:
- running: The step attempt is currently being executed.
- succeeded (deprecated): Legacy success state kept for backward compatibility. New successful attempts use completed.
- completed: The step attempt completed successfully and its result is stored.
- failed: The step attempt failed. The workflow may create a new attempt if it retries.
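The two state lists can be summarized as TypeScript union types. This is an illustrative sketch: the type names and both helper functions are assumptions made for this example, not the SDK's exported types.

```typescript
// Illustrative types only; names are assumptions, not the SDK's exports.
type WorkflowRunStatus =
  | "pending"
  | "running"
  | "sleeping" // deprecated: legacy parked state
  | "succeeded" // deprecated: legacy success state
  | "completed"
  | "failed"
  | "canceled";

type StepAttemptStatus = "running" | "succeeded" | "completed" | "failed";

// Terminal runs are never matched by the workers' polling query again.
function isTerminalRunStatus(status: WorkflowRunStatus): boolean {
  return (
    status === "completed" ||
    status === "succeeded" ||
    status === "failed" ||
    status === "canceled"
  );
}

// Both the current and the legacy success state count as a stored result.
function isSuccessfulAttempt(status: StepAttemptStatus): boolean {
  return status === "completed" || status === "succeeded";
}
```

Treating the deprecated states alongside their replacements keeps older rows readable while new rows use only the current names.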
OpenWorkflow uses a worker-driven model where the database is the central point of coordination. There is no separate orchestrator server.

    +---------------------------------+      +--------------------------------+
    |                                 |      |                                |
    |      Your Application Code      |      |      OpenWorkflow Worker       |
    |      (e.g., a web server)       |      |       (Separate Process)       |
    |                                 |      |                                |
    |  +---------------------------+  |      |  +---------------------------+ |
    |  |    OpenWorkflow Client    |  |      |  |   Workflow Definitions    | |
    |  |  (Creates Workflow Runs)  |  |      |  |                           | |
    |  +---------------------------+  |      |  +---------------------------+ |
    |                |                |      |                |               |
    +----------------+----------------+      +----------------+---------------+
                     |                                        |
                     |     +----------------------------+     |
                     +-----|     Backend Interface      |-----+
                           | (e.g., Postgres / SQLite)  |
                           +----------------------------+
                                          |
                                          |
                          +------------------------------+
                          |                              |
                          |       Backend Storage        |
                          |                              |
                          |  - workflow_runs             |
                          |  - step_attempts             |
                          +------------------------------+

- Client: The entry point for an application to interact with OpenWorkflow. It is responsible for creating new workflow runs by writing to the workflow_runs table in the Backend.
- Worker: The execution engine. It contains an in-memory registry of all defined workflow code. It continuously polls the workflow_runs table for available work, executes the workflow logic, and updates the Backend with the results.
- CLI: The dev tooling that scaffolds projects, writes openworkflow.config.ts, and runs workers via npx @openworkflow/cli worker start with auto-discovery of workflow files.
- Backend: The source of truth. It stores workflow runs and step attempts. The workflow_runs table serves as the job queue for the workers, while the step_attempts table serves as a record of started and completed work, enabling memoization.
- Workflow Registration: A developer defines workflows in their code. When a Worker process starts, it automatically discovers and registers the workflow code based on openworkflow.config.ts (default openworkflow/ directory). There is no sync process with an external server.
- Workflow Invocation: The application code uses the Client to start a new workflow run. The Client creates a new entry in the workflow_runs table with a pending status.
- Job Polling: A Worker process polls the workflow_runs table, looking for runs whose availableAt timestamp is in the past and whose status is either pending (new work), running (parked or with an expired lease), or legacy sleeping. It uses an atomic FOR UPDATE SKIP LOCKED query to claim a single workflow run, setting its status to running and extending the lease.
- Code Execution (Replay Loop): The Worker loads the history of completed step_attempts for the claimed workflow. It then executes the workflow code from the beginning, using the history to memoize results of already-completed steps.
- Step Processing: When the Worker encounters a new step, it creates a step_attempt record with status running, executes the step function, and then updates the step_attempt to completed upon completion. The Worker continues executing inline until the workflow code completes or encounters a sleep.
- State Update: The Worker updates the Backend with each step_attempt as it is created and completed, and updates the status of the workflow_run (e.g., completed, running for parked waits).
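The Job Polling predicate can be expressed as a small pure function. This is a sketch under stated assumptions: RunRow is a hypothetical in-memory stand-in for a workflow_runs row, and the real Backend evaluates the equivalent condition atomically inside its FOR UPDATE SKIP LOCKED query rather than in application code.

```typescript
// Hypothetical in-memory shape of a workflow_runs row; field names are assumptions.
interface RunRow {
  id: string;
  status:
    | "pending"
    | "running"
    | "sleeping"
    | "succeeded"
    | "completed"
    | "failed"
    | "canceled";
  availableAt: Date;
  workerId: string | null;
}

// A run is claimable when its status is pending, running, or legacy sleeping
// and its availableAt timestamp is not in the future.
function isClaimable(run: RunRow, now: Date): boolean {
  const claimableStatus =
    run.status === "pending" ||
    run.status === "running" ||
    run.status === "sleeping";
  return claimableStatus && run.availableAt.getTime() <= now.getTime();
}

// Filters a snapshot of rows down to the ones a polling worker could claim.
function claimableRuns(runs: RunRow[], now: Date): RunRow[] {
  return runs.filter((run) => isClaimable(run, now));
}
```

Note that a run with status running covers two cases through the same check: a parked run whose resume time has arrived, and a run whose worker stopped heartbeating.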
OpenWorkflow treats each workflow run as a state machine. The worker's job is to advance the state of that machine from its last known checkpoint until the next one.
When a worker claims a workflow run, it always executes the code from the beginning. This is the core of the deterministic replay model.
// A worker claims a workflow run.
// It loads the step history and continues to the first step.
const user = await step.run({ name: "fetch-user" }, async () => {
// 1. The framework sees "fetch-user".
// 2. It finds a completed result in the history.
// 3. It returns the cached output immediately without executing the function.
return await db.users.findOne({ id: 1 });
});
const welcomeEmail = await step.run({ name: "welcome-email" }, async () => {
// 4. The framework sees "welcome-email".
// 5. It is NOT in the history.
// 6. It creates a step_attempt with status "running".
// 7. It executes the function and saves the result.
// 8. It updates the step_attempt to status "completed" and continues.
return await email.send(user);
});

All steps are executed synchronously by the worker. When a worker encounters a new step:
- It resolves the step's durable key for this execution pass. The first occurrence keeps its base name; later collisions are auto-indexed as name:1, name:2, and so on.
- It creates a step_attempt record with status running.
- It executes the step function inline.
- Upon completion, it updates the step_attempt to status completed with the result.
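The replay behavior described above can be sketched with an in-memory history. This is not the SDK's implementation: createStepApi, the Attempt type, and replayDemo are hypothetical names, a Map stands in for the step_attempts table, and the durable key is simply the step name (collision indexing is left out).

```typescript
// Minimal in-memory sketch of memoized replay; a Map stands in for step_attempts.
type Attempt = { status: "running" | "completed" | "failed"; output?: unknown };

function createStepApi(history: Map<string, Attempt>) {
  return {
    async run<T>(config: { name: string }, fn: () => Promise<T>): Promise<T> {
      const prior = history.get(config.name);
      if (prior !== undefined && prior.status === "completed") {
        // Replay: return the stored output without calling fn.
        return prior.output as T;
      }
      // New (or previously failed) step: record a running attempt, execute inline,
      // then persist the outcome.
      history.set(config.name, { status: "running" });
      try {
        const output = await fn();
        history.set(config.name, { status: "completed", output });
        return output;
      } catch (error) {
        history.set(config.name, { status: "failed" });
        throw error;
      }
    },
  };
}

// Runs the same two-step workflow against a history and counts real executions.
async function replayDemo(history: Map<string, Attempt>): Promise<number> {
  const step = createStepApi(history);
  let executions = 0;
  await step.run({ name: "fetch-user" }, async () => {
    executions++;
    return { id: 1 };
  });
  await step.run({ name: "welcome-email" }, async () => {
    executions++;
    return "sent";
  });
  return executions;
}
```

Calling replayDemo twice with the same Map executes both step functions on the first pass and neither on the second, which is the property that lets a recovering worker start the workflow code from the top.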
Workers can be configured with a high concurrency limit (e.g., 100 or more) to handle many workflow runs simultaneously. Each workflow run occupies a worker slot for the duration of its execution, but this is acceptable given the high concurrency capacity.
The SDK provides several step primitives to handle different workflow patterns:
step.run(config, fn): Executes a block of arbitrary code. This is the most
common step type used for database queries, API calls, and other synchronous
operations.
const user = await step.run({ name: "fetch-user" }, async () => {
return await db.users.findOne({ id: userId });
});

step.sleep(name, duration): Pauses the workflow until a specified time.
When encountered, the worker keeps the workflow run's status as running,
sets availableAt to the resume time, clears workerId, and releases the
workflow. This frees up the worker slot for other work - it's not a blocking
sleep but a durable pause.

await step.sleep("wait-one-hour", "1h");

step.runWorkflow(spec, input?, options?): Starts a child workflow and
waits for it durably. options.name sets the durable step name (defaults to the
target workflow name in spec) and options.timeout controls the wait timeout
(default 1y). When the timeout is reached, the parent step fails but the child
workflow continues running independently.
All step APIs (step.run, step.sleep, and step.runWorkflow) share the same
collision logic for durable keys. If duplicate base names are encountered in one
execution pass, OpenWorkflow auto-indexes them as name, name:1, name:2,
and so on so each step call maps to a distinct step attempt.
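A minimal sketch of this collision logic follows. The function name createKeyResolver is hypothetical, and the sketch assumes one resolver is created per execution pass so that the same call order always yields the same keys on replay. It does not handle a user-chosen base name that already looks like an indexed key.

```typescript
// Sketch of per-pass key resolution: the first use keeps the base name,
// repeats are suffixed with :1, :2, and so on.
function createKeyResolver(): (baseName: string) => string {
  const seen = new Map<string, number>();
  return (baseName) => {
    const count = seen.get(baseName) ?? 0;
    seen.set(baseName, count + 1);
    return count === 0 ? baseName : `${baseName}:${count}`;
  };
}
```

For example, three calls with "send-email" followed by one with "fetch-user" resolve to send-email, send-email:1, send-email:2, and fetch-user.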
When a step's function throws an error, the framework records the error in the
step_attempt and sets its status to failed. The error then propagates up.
Retry scheduling for that failure is driven by the failed-attempt count for that
specific stepName in the workflow run. If retryable, the workflow run is
rescheduled by setting availableAt to the computed backoff time. On the next
execution, replay reaches the failed step and re-executes its function.
To prevent runaway workflows from accumulating unbounded step history, execution enforces a default hard cap of 1000 step attempts per workflow run. When that limit is reached, the run fails immediately and is not retried.
If an error is unhandled by the workflow code, the entire workflow run fails.
Workflow-level retries are disabled by default (maximumAttempts: 1): an
unhandled error immediately marks the run as failed. To enable automatic
workflow-level retries, supply a retryPolicy when defining the workflow.
Set maximumAttempts: 0 for unlimited retries.
If the run can no longer be retried (for example, because the next
retry would exceed deadlineAt or maximumAttempts has been reached), its
status is set to failed permanently.
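The retry decision described above might be sketched as follows. Only maximumAttempts, including the meaning of 0 and 1, comes from the text; the RetryLimits interface, the canRetry helper, and the choice to treat a retry scheduled exactly at the deadline as too late are assumptions for illustration.

```typescript
// Hypothetical helper; only maximumAttempts (0 = unlimited, 1 = no retries) is from the text.
interface RetryLimits {
  maximumAttempts: number;
}

function canRetry(
  limits: RetryLimits,
  attemptsSoFar: number,
  nextAttemptAt: Date,
  deadlineAt: Date | null,
): boolean {
  // maximumAttempts of 0 disables the attempt budget entirely.
  const attemptsLeft =
    limits.maximumAttempts === 0 || attemptsSoFar < limits.maximumAttempts;
  // Assumption: a retry that would start at or after the deadline is not scheduled.
  const beforeDeadline =
    deadlineAt === null || nextAttemptAt.getTime() < deadlineAt.getTime();
  return attemptsLeft && beforeDeadline;
}
```

When canRetry returns false, the run would be marked failed permanently rather than rescheduled.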
When a worker claims a run but does not have the matching workflow definition in its registry, this is treated as a deployment concern rather than an application failure. The run is rescheduled with its own generous backoff policy (5s initial, 5min cap, unlimited attempts) so it remains available for a worker that does have the definition — for example during a rolling deploy.
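To make the backoff numbers concrete, here is a sketch of a capped exponential delay. The 5s initial delay and 5min cap come from the paragraph above; the field names and the doubling coefficient are assumptions, since the text does not state how the delay grows between those bounds.

```typescript
// Hypothetical field names; the coefficient of 2 is an assumption.
interface BackoffPolicy {
  initialIntervalMs: number;
  maximumIntervalMs: number;
  backoffCoefficient: number;
}

const missingDefinitionBackoff: BackoffPolicy = {
  initialIntervalMs: 5000, // 5s initial delay (from the text)
  maximumIntervalMs: 5 * 60 * 1000, // 5min cap (from the text)
  backoffCoefficient: 2,
};

// Delay before the next attempt, given how many attempts have already failed (>= 1).
function backoffMs(policy: BackoffPolicy, failedAttempts: number): number {
  const exponent = Math.max(0, failedAttempts - 1);
  const raw =
    policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, exponent);
  return Math.min(raw, policy.maximumIntervalMs);
}
```

Under these assumptions the delays run 5s, 10s, 20s, and so on, reaching the 5min cap on the seventh consecutive miss and staying there.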
OpenWorkflow uses the same RetryPolicy shape for two separate concerns:
- Step retry policy (step.run({ retryPolicy }) or step defaults) for step-function failures. Budgets/backoff are tracked per step name.
- Workflow retry policy (workflow.spec.retryPolicy) for workflow-level failures outside step execution.
Workflow runs can include an optional deadlineAt timestamp, specifying the
time by which the workflow must complete. Steps and retries are skipped if they
would exceed the deadline, making the run permanently failed.
Workflows can be explicitly canceled at any time via the Client API:
const handle = await workflow.run({ /* ... */ });
await handle.cancel();

Handling cancelation during execution: If a workflow is canceled while a
worker is actively processing it, the worker will detect the cancelation. The
worker will then stop further execution of the workflow code and mark the
workflow as canceled. This ensures that partial work from the canceled
workflow is not committed as a successful result.
The SDK supports parallel execution of steps via language-native constructs like
Promise.all.
const [user, settings] = await Promise.all([
step.run({ name: "fetch-user" }, ...),
step.run({ name: "fetch-settings" }, ...),
]);

When the worker encounters this, it executes all steps within the Promise.all
concurrently. It waits for all of them to complete before proceeding. Each step
attempt is persisted individually as a step_attempt.
Workers are configured with a concurrency limit (e.g., 10). A worker will
maintain up to 10 in-flight workflow runs simultaneously. It polls for new work
only when it has available capacity. The Backend's atomic dequeue operation
(FOR UPDATE SKIP LOCKED) ensures that multiple workers can poll the same table
without race conditions or processing the same run twice.
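The capacity check that gates polling can be sketched as a small slot pool. The class and method names are illustrative assumptions, and the sketch covers only the worker-local bookkeeping; cross-worker safety still comes from the Backend's atomic dequeue.

```typescript
// Sketch of capacity-gated polling; class and method names are illustrative.
class SlotPool {
  private limit: number;
  private inFlight = new Set<string>();

  constructor(limit: number) {
    this.limit = limit;
  }

  // Number of additional runs this worker may claim right now.
  available(): number {
    return this.limit - this.inFlight.size;
  }

  // Returns false (and records nothing) when the worker is at capacity.
  tryStart(runId: string): boolean {
    if (this.available() <= 0) return false;
    this.inFlight.add(runId);
    return true;
  }

  // Frees the slot once the run completes, fails, or is parked.
  finish(runId: string): void {
    this.inFlight.delete(runId);
  }
}
```

A poll loop built on this would query the Backend only while available() is greater than zero, and call finish when a run's execution slice ends.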
The availableAt heartbeat mechanism provides robust recovery. If a worker
crashes while executing parallel steps, its heartbeat stops. The availableAt
for the workflow run expires, and another worker claims it. The new worker
replays the workflow: any steps that completed before the crash will return
their cached output instantly; any that were in-flight will be re-executed.
If a workflow's code is changed while runs are in-flight, the deterministic replay can break. For example, renaming a step will cause the replaying worker to fail because the step ID in the history no longer matches the step ID in the new code.
Workflows can be made version-aware using conditional logic to handle different
historical paths within the workflow code. The workflow receives a version
parameter that can be used to determine which code path to execute.
const workflow = ow.defineWorkflow({ name: "versioned-workflow" }, async ({ step, version }) => {
if (version === "v1") {
await step.run({ name: "old-step-name" }, ...);
} else {
await step.run({ name: "new-step-name" }, ...);
}
});

This approach enables zero-downtime deployments by allowing old workflow runs to replay correctly on their original version while new runs use the updated code path.
Workers are the stateless engines of the OpenWorkflow system. They are responsible for:
- Polling the Backend for available workflow runs.
- Executing workflow code using the deterministic replay model.
- Managing a concurrency pool to process multiple runs simultaneously.
- Periodically heartbeating to maintain their claim on active runs.
- Handling errors and implementing retry logic with backoff.
- Gracefully shutting down to ensure no work is lost during deploys.
The availableAt timestamp is the core of the system's fault tolerance.
- When a worker claims a run, it sets availableAt to NOW() + visibilityTimeout.
- It must periodically UPDATE this timestamp before the timeout expires to maintain its lock. This is the heartbeat.
- If a worker crashes, its heartbeats stop.
- The availableAt timestamp expires, and the workflow_run becomes visible to other workers' polling queries again.
- Another worker claims the run and initiates recovery.
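The lease lifecycle in the list above can be modeled with three pure functions. This is a sketch, not the Backend's code: in the real system each function corresponds to an UPDATE or a query predicate on workflow_runs, and the Lease shape and the rule that only the current holder may extend the lease are assumptions.

```typescript
// Pure-function model of the lease; field and function names are assumptions.
interface Lease {
  workerId: string | null;
  availableAt: number; // epoch milliseconds
}

// Claim: availableAt = now + visibilityTimeout.
function claimLease(
  workerId: string,
  nowMs: number,
  visibilityTimeoutMs: number,
): Lease {
  return { workerId, availableAt: nowMs + visibilityTimeoutMs };
}

// Heartbeat: the holder pushes availableAt forward; other workers change nothing.
function heartbeat(
  lease: Lease,
  workerId: string,
  nowMs: number,
  visibilityTimeoutMs: number,
): Lease {
  if (lease.workerId !== workerId) return lease;
  return { ...lease, availableAt: nowMs + visibilityTimeoutMs };
}

// Once availableAt has passed, the run matches other workers' polling queries again.
function isVisibleToOthers(lease: Lease, nowMs: number): boolean {
  return lease.availableAt <= nowMs;
}
```

With a 30s timeout, a worker that heartbeats at 20s holds the run until 50s; one that crashes right after claiming loses it at 30s without any coordinator having to notice the crash.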
When a worker receives a shutdown signal (e.g., SIGTERM), it initiates a
graceful shutdown:
- It stops polling for new workflow runs.
- It waits for all its currently active workflow runs to complete their current execution slice and be safely persisted back to the Backend.
- Once all in-flight work is finished, the worker process exits.
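The shutdown sequence can be sketched as a gate that stops accepting runs and then waits for the active ones. ShutdownGate and its methods are hypothetical names, not the worker's API; wiring shutdown() to an actual SIGTERM handler and exiting the process afterwards is left out.

```typescript
// Illustrative sketch; class and method names are assumptions, not the worker's API.
class ShutdownGate {
  private accepting = true;
  private active = new Set<Promise<void>>();

  // Called by the poll loop for each claimed run. Returns false once shutdown has begun.
  track(work: Promise<void>): boolean {
    if (!this.accepting) return false;
    this.active.add(work);
    const forget = () => {
      this.active.delete(work);
    };
    work.then(forget, forget);
    return true;
  }

  // Called on a shutdown signal: stop accepting work, then wait for in-flight runs.
  async shutdown(): Promise<void> {
    this.accepting = false;
    const settled = Array.from(this.active).map((work) =>
      work.then(
        () => undefined,
        () => undefined,
      ),
    );
    await Promise.all(settled);
  }
}
```

Failures of individual runs are swallowed while waiting so that one failing run does not cut the drain short; recording the failure is the job of the execution path, not of the gate.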
- packages/openworkflow contains the SDK (client, worker, registry) and backend implementations via openworkflow/postgres and openworkflow/sqlite subpath exports.
- packages/cli is the CLI.
- packages/dashboard is the web UI for monitoring workflow runs.