Investigate switching ingest celery pool from prefork to gevent #1230

@mihow

Context

process_nats_pipeline_result is the dominant task on the antenna celery queue during async_api ML jobs. Profiling on a live production backlog shows it spends most wall-clock time on:

  • Postgres round-trips (saving detections / classifications, updating Job progress rows)
  • Redis round-trips (updating AsyncJobStateManager state)
  • One NATS ack call per task (async_to_sync(ack_task)() around nats-py)

All three are I/O. The task is almost never CPU-bound. That is the workload profile where gevent pays off: a single OS process running a few hundred greenlets can keep the DB/Redis/NATS pipelines saturated, whereas a prefork worker process handles only one in-flight task at a time.

Ballpark: prefork at 16 workers ≈ 16 in-flight DB round-trips. Gevent at 200 greenlets could hold 200 in-flight, bounded by pgbouncer pool size and server cores. That could be a 5–10× throughput lever on the ingest path specifically, but the figure is an estimate until measured.
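To make the ballpark concrete, here is a rough sketch of the arithmetic using Little's law (throughput ≈ in-flight tasks ÷ time per task). The 50 ms per-task I/O time and the pgbouncer pool size of 100 are made-up placeholders, not measurements from our system, and the function names are hypothetical.

```python
def effective_in_flight(worker_concurrency: int, db_pool_size: int) -> int:
    """In-flight tasks are capped by the smaller of worker slots and DB connections."""
    return min(worker_concurrency, db_pool_size)


def tasks_per_second(in_flight: int, task_ms: int) -> float:
    """Little's law for a purely I/O-bound task: throughput = concurrency / latency."""
    return in_flight * 1000 / task_ms


TASK_MS = 50          # assumed I/O wait per task (placeholder, not measured)
PGBOUNCER_POOL = 100  # assumed pgbouncer pool size (placeholder)

prefork_rate = tasks_per_second(effective_in_flight(16, PGBOUNCER_POOL), TASK_MS)
gevent_rate = tasks_per_second(effective_in_flight(200, PGBOUNCER_POOL), TASK_MS)

print(f"prefork x16: {prefork_rate:.0f} tasks/s")
print(f"gevent x200: {gevent_rate:.0f} tasks/s (capped by the pool)")
print(f"speedup:     {gevent_rate / prefork_rate:.2f}x")
```

With these placeholder numbers the cap comes from the connection pool rather than the greenlet count, which is the pgbouncer sizing question raised in this ticket. The model ignores CPU time and any latency growth under load, so treat it as an upper bound.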

What needs verifying before we commit

This is an investigation ticket, not a "just do it" ticket. Switching a production pool class has a long tail of silent-bug risk; we need to confirm each of these:

  1. Monkey-patching at startup. gevent.monkey.patch_all() must run in the worker entrypoint before anything imports celery, django, psycopg2, or redis-py. Confirm this interacts cleanly with our New Relic agent (newrelic-admin run-program), which wraps the celery entrypoint today and may import modules before our patching runs.

  2. Postgres driver safety. Django uses psycopg2 (or psycopg3 if we've migrated; verify). psycopg2 is a C extension, so monkey-patching alone does not make it cooperative: its blocking calls stall every greenlet in the process unless psycogreen.gevent.patch_psycopg() is called at startup, after monkey-patching. psycogreen targets psycopg2 only; if we are on psycopg3, follow its own gevent guidance instead. Get this right or we'll be slower than prefork.

  3. Pgbouncer pool sizing. At gevent concurrency 100–200, we'll hold that many persistent DB connections if Django's CONN_MAX_AGE is non-zero. Pgbouncer's default_pool_size needs to be large enough, or CONN_MAX_AGE needs to be 0 for the ingest worker specifically. Measure before deploying.

  4. asyncio inside gevent tasks. process_nats_pipeline_result uses async_to_sync(ack_task)() which creates an asyncio event loop per call to run nats-py. asyncio + gevent is supported but the details matter: nats-py uses sockets that will be monkey-patched, which can change reconnect / timeout behaviour in non-obvious ways. Test this in isolation first.

  5. Non-I/O tasks on the same worker. If the ingest worker consumes antenna-ingest only (see #1229, "Split the celery antenna queue into workload classes (ingest / housekeeping)", for the queue split) then we only need to audit ingest tasks. If not, every task on the shared queue needs gevent safety, which is an open-ended audit. This ticket depends on #1229 for that reason.

  6. PIL / numpy / torch. create_detection_images runs PIL crops. PIL is blocking C code; it releases the GIL, but that helps threads, not greenlets, and PIL is not gevent-aware. It works, but a long per-image crop never yields to the gevent hub, so it stalls every greenlet in the worker process for the full crop duration. For the ingest queue this is probably fine as long as crops stay short; flagging it for awareness.

  7. Sentry, logging, thread-locals. Sentry's SDK and structlog both have gevent-mode configuration — verify both are set up correctly, or exceptions and logs will get crossed between greenlets.
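For items 1 and 2, a small self-check run at worker startup could confirm that the patches actually took effect in the process. This is only a sketch: gevent.monkey.is_module_patched() and psycopg2.extensions.get_wait_callback() are existing calls, but the function name gevent_patch_report and the idea of logging its result at startup are assumptions, not something the codebase has today.

```python
def gevent_patch_report() -> dict:
    """Report which gevent-related patches are active in the current process.

    Safe to call anywhere: missing packages are reported as False, not raised.
    """
    report = {
        "gevent_installed": False,
        "socket_patched": False,
        "psycopg2_wait_callback": False,
    }

    try:
        from gevent import monkey  # importing this does not patch anything
    except ImportError:
        return report
    report["gevent_installed"] = True

    # True only if monkey.patch_all() / patch_socket() already ran.
    report["socket_patched"] = bool(monkey.is_module_patched("socket"))

    try:
        from psycopg2 import extensions
    except ImportError:
        return report

    # psycogreen's patch_psycopg() installs a wait callback; None means unpatched.
    report["psycopg2_wait_callback"] = extensions.get_wait_callback() is not None
    return report


if __name__ == "__main__":
    for name, active in gevent_patch_report().items():
        print(f"{name}: {active}")
```

In a correctly set up gevent worker all three values should be True; in the prefork worker the last two should be False. It does not cover psycopg3, Sentry, or structlog, which would need their own checks.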

Suggested proof-of-concept

  1. Land #1229 ("Split the celery antenna queue into workload classes (ingest / housekeeping)") first. Gevent only needs to be viable on the ingest queue, not on the long tail.

  2. Add an optional worker service celeryworker-ingest-gevent in docker-compose.worker.yml:

    • Same image, but entrypoint calls python -c 'from gevent import monkey; monkey.patch_all(); from psycogreen.gevent import patch_psycopg; patch_psycopg(); import celery.__main__; celery.__main__.main()' worker --pool=gevent --concurrency=100 --queues=antenna-ingest
    • Scale set to 0 by default.

  3. In staging or against a synthesised backlog, scale it to 1 and scale the prefork version to 0. Capture:

    • Tasks/min throughput vs prefork baseline
    • p50 / p95 task duration
    • pgbouncer connection count, postgres active connection count
    • Memory per process (gevent single process should be flat-ish vs prefork's N processes)
    • Any new exceptions in Sentry

  4. If throughput is meaningfully better and no new errors surface, decide whether to make gevent the default for ingest, or keep it as an opt-in pool class per deployment.
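The throughput and latency numbers from step 3 could be reduced to a comparable summary with something like the sketch below. It assumes we can export one (finish time in seconds, duration in milliseconds) pair per completed task from logs or task events; that export, and the names RunSummary and summarise, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RunSummary:
    tasks_per_min: float
    p50_ms: float
    p95_ms: float


def percentile(sorted_values, pct: int):
    """Nearest-rank percentile of an already-sorted, non-empty list."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarise(records) -> RunSummary:
    """records: iterable of (finish_time_s, duration_ms), one per completed task."""
    records = list(records)
    if len(records) < 2:
        raise ValueError("need at least two completed tasks")

    finish_times = [finish for finish, _ in records]
    window_s = max(finish_times) - min(finish_times)
    if window_s <= 0:
        raise ValueError("all tasks finished at the same instant")

    durations = sorted(duration for _, duration in records)
    return RunSummary(
        tasks_per_min=len(records) * 60 / window_s,
        p50_ms=percentile(durations, 50),
        p95_ms=percentile(durations, 95),
    )
```

Running summarise() once on a prefork capture and once on a gevent capture of the same backlog gives two RunSummary values to compare side by side. Connection counts and memory still have to come from pgbouncer, postgres, and the container runtime.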

Why this might be worth the investigation cost

Separate from the pool-class choice, the measurement work above tells us whether the bottleneck is actually "not enough in-flight tasks" (in which case gevent wins) or "pgbouncer / postgres cap" (in which case no amount of celery concurrency helps and we need DB-side scaling instead). Either answer is useful.
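One way to turn the measurements into that either/or answer is sketched here. It compares throughput at two concurrency levels and looks at how many clients were queued in pgbouncer (the cl_waiting column of SHOW POOLS) during the higher-concurrency run. The 0.5 scaling threshold and the function name classify_bottleneck are assumptions chosen for illustration, not calibrated values.

```python
def classify_bottleneck(
    throughput_low: float,
    throughput_high: float,
    concurrency_low: int,
    concurrency_high: int,
    cl_waiting_high: int,
    scaling_threshold: float = 0.5,  # assumed cut-off, tune after real runs
) -> str:
    """Classify a pair of runs as in-flight-limited, DB-pool-limited, or unclear."""
    if concurrency_high <= concurrency_low or throughput_low <= 0:
        raise ValueError("need concurrency_high > concurrency_low and throughput_low > 0")

    ideal_gain = concurrency_high / concurrency_low
    actual_gain = throughput_high / throughput_low
    # Fraction of the ideal (linear) speedup that was actually achieved.
    efficiency = (actual_gain - 1) / (ideal_gain - 1)

    if efficiency >= scaling_threshold:
        return "in_flight_limited"  # more concurrency helped: gevent wins
    if cl_waiting_high > 0:
        return "db_pool_limited"    # clients queued in pgbouncer: scale the DB side
    return "inconclusive"           # flat throughput with no queueing: look elsewhere


print(classify_bottleneck(1000, 5000, 16, 100, cl_waiting_high=0))
print(classify_bottleneck(1000, 1100, 16, 100, cl_waiting_high=40))
```

An "inconclusive" result would point at something other than the two hypotheses in this ticket, for example Redis, NATS, or CPU in the worker itself.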

Related
