Skip to content

Review and simplify job logs #1236

@mihow

Description

@mihow

Follow-up to #1234 / #1235. During e2e chaos testing of those fixes we noticed several log patterns worth revisiting. Not urgent — parking this until live async-job triage has stabilised, since some of this noise is load-bearing during debugging. Logging additions from #1222 (NATS lifecycle → per-job log) and #1227 (jobs_health_check periodic snapshots) deliberately optimised for "we had invisible gaps, fill them in." Now that those gaps are filled, time to tune for signal-to-noise.

Module + job logger mirroring (from #1222)

#1222's log_async helper emits every NATS lifecycle event to both the module logger and the per-job logger. That's correct from a UI perspective — the job log needs these — but in docker/container logs it shows up as two adjacent identical lines from the same worker in the same millisecond (observed for Finalizing NATS consumer … in every cleanup path during the chaos run).

Options to consider:

  • Single emit with a flag/tag so one sink can de-dupe
  • Structured prefix ([job=X]) on module-logger lines so grep-by-job stays easy without the mirror
  • Keep mirroring but drop the module-logger copy for events that are primarily operator-facing in the UI (stream/consumer lifecycle)

Per-image progress noise

Results stage emits three lines per image processed:

  • Pending images from Redis for job X stage: N/M: P%
  • Updated job X progress in stage 'stage' to P%
  • TIME: Updated job X progress in STAGE stage progress to P%: 0.0Xs

For a 100-image job with parallel workers that's ~600 interleaved lines. Collapse to one structured line per image (or one summary line per N images). The TIME: line is a dev timing breadcrumb that probably doesn't need to be INFO.

Retry storm amplification

When Celery auto-retries process_nats_pipeline_result on RedisError, each retry logs the full WARNING ... Transient Redis error ... plus a Python traceback. With parallel NATS messages hitting the same transient, you can get hundreds of identical 5-line entries in seconds (observed during Scenario E of the #1234/#1235 chaos test).

  • Drop the traceback at WARNING level (exception repr is enough), log traceback only when MaxRetriesExceeded fires
  • Aggregate per-task: `retry N/5` instead of a fresh warning each attempt

Health-check beat task noise

`check_processing_services_online` logs a full `ConnectionError` traceback per unreachable endpoint every ~15s. In dev setups that's constant connection-refused tracebacks for services that are not running. Consider logging a single warn-level line per endpoint per status change (mark live → mark offline), not per tick.

Related: `running_job_snapshots` from #1227 currently only emits when there are running async jobs, which is the right default.

What to preserve

Keep and model the cleanup on these — they're already the good template:

Scope

Observation-first ticket. Implementation is a single cleanup pass after live async-job behaviour stabilises on the combined #1234/#1235 code.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions