Investigate switching ingest celery pool from prefork to gevent #1230

## Context

`process_nats_pipeline_result` is the dominant task on the `antenna` celery queue during `async_api` ML jobs. Profiling on a live production backlog shows it spends most wall-clock time on:

- Postgres round-trips (saving detections / classifications, updating `Job` progress rows)
- Redis round-trips (updating `AsyncJobStateManager` state)
- One NATS ack call per task (`async_to_sync(ack_task)()` around `nats-py`)

All three are I/O. The task is almost never CPU-bound. That is the workload profile where gevent pays off: a single OS process running a few hundred greenlets can keep the DB/Redis/NATS pipelines saturated, whereas under prefork each worker process has only one task in flight at a time.

Ballpark: prefork at 16 workers means at most 16 in-flight DB round-trips. Gevent at 200 greenlets could hold up to 200 in flight, bounded by pgbouncer pool size and server cores. That could be a 5–10× throughput lever on the ingest path specifically.
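
A rough way to sanity-check that ballpark is Little's law: throughput ≈ tasks in flight / mean task duration. The sketch below assumes task duration stays constant as concurrency rises (it will not once Postgres or pgbouncer saturates), and the 50 ms duration and pool size of 100 are made-up illustration values, not measurements.

```python
def effective_in_flight(concurrency: int, pool_size: int) -> int:
    """Tasks that can actually hold a DB connection at once.

    Greenlets beyond the pgbouncer pool size just queue for a connection.
    """
    return min(concurrency, pool_size)


def tasks_per_second(in_flight: int, mean_task_ms: int) -> float:
    """Little's law: throughput = in-flight tasks / mean task duration."""
    return in_flight * 1000 / mean_task_ms


# Hypothetical inputs, for illustration only.
MEAN_TASK_MS = 50
PGBOUNCER_POOL = 100

prefork = tasks_per_second(effective_in_flight(16, PGBOUNCER_POOL), MEAN_TASK_MS)
gevent_ = tasks_per_second(effective_in_flight(200, PGBOUNCER_POOL), MEAN_TASK_MS)
print(f"prefork ~{prefork:.0f}/s, gevent ~{gevent_:.0f}/s, ratio {gevent_ / prefork:.2f}x")
```

With these inputs the pool size, not the greenlet count, sets the ceiling, which is why the pgbouncer numbers matter as much as `--concurrency`.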

## What needs verifying before we commit

This is an investigation ticket, not a "just do it" ticket. Switching a production pool class has a long tail of silent-bug risk; we need to confirm each of these:

1. **Monkey-patching at startup.** Gevent needs `gevent.monkey.patch_all()` as the first import in the worker entrypoint — before celery, django, psycopg2, redis-py. Confirm this interacts cleanly with our New Relic agent (`newrelic-admin run-program`), which wraps the celery entrypoint today.

2. **Postgres driver safety.** Django uses psycopg2 (or psycopg 3 if we've migrated; verify which). Under gevent, psycopg2's blocking libpq calls stall every greenlet in the process unless `psycogreen.gevent.patch_psycopg()` is called after monkey-patching to install a cooperative wait callback. psycopg 3 documents its own gevent support and should not need psycogreen, but check its docs for version caveats before relying on that. Get this right or we'll be slower than prefork.

3. **Pgbouncer pool sizing.** At gevent concurrency 100–200, we'll hold that many persistent DB connections if Django's `CONN_MAX_AGE` is non-zero. Pgbouncer's `default_pool_size` needs to be large enough, or `CONN_MAX_AGE` needs to be 0 for the ingest worker specifically. Measure before deploying.

4. **asyncio inside gevent tasks.** `process_nats_pipeline_result` uses `async_to_sync(ack_task)()` which creates an asyncio event loop per call to run `nats-py`. asyncio + gevent is supported but the details matter: `nats-py` uses sockets that will be monkey-patched, which can change reconnect / timeout behaviour in non-obvious ways. Test this in isolation first.

5. **Non-I/O tasks on the same worker.** If the ingest worker consumes `antenna-ingest` only (see #1229 for queue split) then we only need to audit ingest tasks. If not, every task on the shared queue needs gevent safety — an open-ended audit. This ticket depends on #1229 for that reason.

6. **PIL / numpy / torch.** `create_detection_images` runs PIL crops. PIL is blocking C code: it may release the GIL, but it never yields to the gevent hub, so a long per-image crop stalls every greenlet in that worker process for the full crop duration, not just the greenlet running it. For the ingest queue this is probably fine as long as crops are short; flagging it for awareness. The same applies to any numpy or torch work that ends up on this worker.

7. **Sentry, logging, thread-locals.** Sentry's SDK and structlog both have gevent-mode configuration — verify both are set up correctly, or exceptions and logs will get crossed between greenlets.
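
For item 1, one cheap guard is to fail fast if any of the libraries we care about were imported before patching (for example by a wrapper such as `newrelic-admin`). The sketch below is a hypothetical bootstrap helper, not existing code in the repo: the function names and the `WATCHED` list are assumptions, and the psycogreen call only applies if we are on psycopg2.

```python
import sys

# Top-level modules that should not be loaded before gevent patches the stdlib.
# Hypothetical list; extend it with whatever the audit turns up.
WATCHED = ("celery", "django", "psycopg2", "redis")


def already_imported(names=WATCHED):
    """Return the watched modules that are already present in sys.modules."""
    return [name for name in names if name in sys.modules]


def patch_for_gevent():
    """Monkey-patch for gevent, refusing to continue if we are too late."""
    early = already_imported()
    if early:
        raise RuntimeError(f"imported before gevent patching: {early}")

    # Imports are local so that nothing else is pulled in before the check.
    from gevent import monkey

    monkey.patch_all()

    # Only needed for psycopg2; skip if we turn out to be on psycopg 3.
    from psycogreen.gevent import patch_psycopg

    patch_psycopg()
```

The idea would be to call `patch_for_gevent()` as the very first statement of the worker entrypoint. Celery's own CLI also attempts to patch when it sees `--pool=gevent` in its arguments; whether that happens early enough under our New Relic wrapper is part of what needs verifying.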

## Suggested proof-of-concept

1. Land #1229 (queue split) first. Gevent only needs to be viable on the ingest queue, not on the long tail.
2. Add an optional worker service `celeryworker-ingest-gevent` in `docker-compose.worker.yml`:
   - Same image, but entrypoint calls `python -c 'from gevent import monkey; monkey.patch_all(); from psycogreen.gevent import patch_psycopg; patch_psycopg(); import celery.__main__; celery.__main__.main()' worker --pool=gevent --concurrency=100 --queues=antenna-ingest`
   - Scale set to 0 by default.
3. In staging or against a synthesised backlog, scale it to 1 and scale the prefork version to 0. Capture:
   - Tasks/min throughput vs prefork baseline
   - p50 / p95 task duration
   - pgbouncer connection count, postgres active connection count
   - Memory per process (gevent single process should be flat-ish vs prefork's N processes)
   - Any new exceptions in Sentry

4. If throughput is meaningfully better and no new errors surface, decide whether to make gevent the default for ingest, or keep it as an opt-in pool class per deployment.
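
To keep the prefork and gevent runs comparable, the throughput and latency figures in step 3 can be computed the same way for both. A minimal sketch, assuming we can export per-task durations (in seconds) for a fixed observation window; the function names are hypothetical and the percentile is a simple nearest-rank estimate:

```python
def percentile(sorted_values, pct: int):
    """Nearest-rank percentile of an ascending list; pct is an integer 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    # Integer ceiling of pct * n / 100, to avoid float rounding surprises.
    rank = (pct * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def summarise(durations_s, window_s: float) -> dict:
    """Throughput and p50/p95 for tasks completed within one window."""
    ordered = sorted(durations_s)
    return {
        "tasks_per_min": len(ordered) * 60 / window_s,
        "p50_s": percentile(ordered, 50),
        "p95_s": percentile(ordered, 95),
    }
```

Running `summarise` over the same window length for each pool class gives numbers that can be placed side by side with the pgbouncer and memory readings.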

## Why this might be worth the investigation cost

Separate from the pool-class choice, the measurement work above tells us whether the bottleneck is actually "not enough in-flight tasks" (in which case gevent wins) or "pgbouncer / postgres cap" (in which case no amount of celery concurrency helps and we need DB-side scaling instead). Either answer is useful.

## Related

- #1228 — bumps prefork concurrency as an immediate mitigation.
- #1229 — queue split. Prerequisite for a clean gevent rollout scoped to ingest only.
