Review and simplify job logs

Follow-up to #1234 / #1235. During e2e chaos testing of those fixes we noticed several log patterns worth revisiting. Not urgent — parking this until live async-job triage has stabilised, since some of this noise is load-bearing during debugging. Logging additions from #1222 (NATS lifecycle → per-job log) and #1227 (`jobs_health_check` periodic snapshots) deliberately optimised for "we had invisible gaps, fill them in." Now that those gaps are filled, time to tune for signal-to-noise.

## Module + job logger mirroring (from #1222)

#1222's `log_async` helper emits every NATS lifecycle event to both the module logger and the per-job logger. That's correct from a UI perspective — the job log needs these — but in docker/container logs it shows up as two adjacent identical lines from the same worker in the same millisecond (observed for `Finalizing NATS consumer …` in every cleanup path during the chaos run).

Options to consider:
- Single emit with a flag/tag so one sink can de-dupe
- Structured prefix (`[job=X]`) on module-logger lines so grep-by-job stays easy without the mirror
- Keep mirroring but drop the module-logger copy for events that are primarily operator-facing in the UI (stream/consumer lifecycle)

## Per-image progress noise

Results stage emits three lines per image processed:
- `Pending images from Redis for job X stage: N/M: P%`
- `Updated job X progress in stage 'stage' to P%`
- `TIME: Updated job X progress in STAGE stage progress to P%: 0.0Xs`

For a 100-image job with parallel workers that's ~600 interleaved lines. Collapse to one structured line per image (or one summary line per N images). The `TIME:` line is a dev timing breadcrumb that probably doesn't need to be INFO.

## Retry storm amplification

When Celery auto-retries `process_nats_pipeline_result` on `RedisError`, each retry logs the full `WARNING ... Transient Redis error ...` plus a Python traceback. With parallel NATS messages hitting the same transient, you can get hundreds of identical 5-line entries in seconds (observed during Scenario E of the #1234/#1235 chaos test).
- Drop the traceback at WARNING level (exception repr is enough), log traceback only when `MaxRetriesExceeded` fires
- Aggregate per-task: \`retry N/5\` instead of a fresh warning each attempt

## Health-check beat task noise

\`check_processing_services_online\` logs a full \`ConnectionError\` traceback per unreachable endpoint every ~15s. In dev setups that's constant connection-refused tracebacks for services that are not running. Consider logging a single warn-level line per endpoint per status change (mark live → mark offline), not per tick.

Related: \`running_job_snapshots\` from #1227 currently only emits when there are running async jobs, which is the right default.

## What to preserve

Keep and model the cleanup on these — they're already the good template:
- \`run_job task exited for job X; async results still in-flight via NATS\` (#1234)
- \`Reaping stalled job: no progress for N min (threshold M min). previous_status=…, celery_state=…, dispatch_mode=…, stages: …\` (#1235)
- NATS setup/cleanup forensic snapshots (\`delivered=…, ack_floor=…, …\`) from #1222 — single line, every field an operator needs

## Scope

Observation-first ticket. Implementation is a single cleanup pass after live async-job behaviour stabilises on the combined #1234/#1235 code.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Review and simplify job logs #1236

Module + job logger mirroring (from #1222)

Per-image progress noise

Retry storm amplification

Health-check beat task noise

What to preserve

Scope

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Review and simplify job logs #1236

Description

Module + job logger mirroring (from #1222)

Per-image progress noise

Retry storm amplification

Health-check beat task noise

What to preserve

Scope

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions