diff --git a/docs/geneva/jobs/lifecycle.mdx b/docs/geneva/jobs/lifecycle.mdx
index b60f340..e26f368 100644
--- a/docs/geneva/jobs/lifecycle.mdx
+++ b/docs/geneva/jobs/lifecycle.mdx
@@ -114,6 +114,8 @@ Jobs save intermediate results to a checkpoint store. If a job fails:
 2. **Resume from checkpoint** - Restarted jobs skip already-processed data
 3. **No duplicate processing** - Each batch is processed exactly once
 
+By default, checkpoints are stored in a `_ckp/` subdirectory inside the table's storage location. At scale, you can redirect checkpoints to a separate bucket to avoid IOPS contention. See [Checkpoint Storage configuration](/geneva/udfs/advanced-configuration#checkpoint-storage) for details.
+
 ### Resuming Failed Jobs
 
 To resume a failed job, simply re-run the same backfill or refresh command. The job will automatically detect existing checkpoints, skip already-processed fragments, and continue from where it left off.
diff --git a/docs/geneva/udfs/advanced-configuration.mdx b/docs/geneva/udfs/advanced-configuration.mdx
index 76a24a6..d16a491 100644
--- a/docs/geneva/udfs/advanced-configuration.mdx
+++ b/docs/geneva/udfs/advanced-configuration.mdx
@@ -41,6 +41,43 @@ This section configures retry logic for Lance I/O operations. Retries occur on `
 | `GENEVA_RETRY_LANCE_INITIAL_SECS` | `0.5` | Initial wait time in seconds for exponential backoff when retrying Lance I/O operations. |
 | `GENEVA_RETRY_LANCE_MAX_SECS` | `120.0` | Maximum wait time in seconds for exponential backoff when retrying Lance I/O operations. |
 
+## Checkpoint Storage
+
+
+Checkpoint storage configuration is **experimental**. The environment variable names and behavior may change in a future release.
+
+
+Configure where Geneva stores checkpoint data during job execution. Checkpoints enable fault-tolerant processing by saving intermediate results so that failed jobs can resume without reprocessing completed work.
+
+By default, Geneva stores checkpoints in a `_ckp/` subdirectory inside the table's own storage location. This means checkpoints share the same bucket and IOPS budget as the table data. You can override this to store checkpoints in a separate location.
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `JOB__CHECKPOINT__OBJECT_STORE__PATH` | _(table dir)_`/_ckp/` | URI where checkpoint data is stored. When set, overrides the default in-table checkpoint location. Accepts any URI supported by Lance (e.g., `gs://bucket/path/checkpoints`, `s3://bucket/checkpoints`). |
+
+
+This variable maps to the config path `job.checkpoint.object_store.path`. It can also be set via config files in `.config/` or `pyproject.toml` under the `[geneva]` section.
+
+
+### Why use a separate checkpoint path?
+
+At scale, checkpoint I/O and data I/O compete for the same object store IOPS budget when they share a bucket prefix. Setting `JOB__CHECKPOINT__OBJECT_STORE__PATH` to a **different bucket or prefix** decouples checkpoint I/O from data I/O, giving each its own IOPS budget and preventing shared-prefix rate limiting.
+
+```bash
+# Example: separate checkpoint bucket from dataset storage
+JOB__CHECKPOINT__OBJECT_STORE__PATH=gs://my-checkpoints-bucket/ckpts
+```
+
+Equivalent programmatic configuration:
+
+```python
+from geneva.config import override_config_kv
+
+override_config_kv({
+    "job.checkpoint.object_store.path": "gs://my-checkpoints-bucket/ckpts",
+})
+```
+
 ## Other Configuration
 
 | Variable | Default | Description |