fix: recover scans across daemon/web container boundaries #61

Merged: grossir merged 5 commits into main from fix/prod-multi-container-recovery, May 5, 2026
Conversation

@quevon24 (Member) commented May 4, 2026

In prod the daemon and web run in separate containers with separate /tmp/ volumes. The daemon writes processing files to its own /tmp/scanning/{pk}/ and pushes them to S3; the web container's local copy is empty and is populated lazily when needed. That split was breaking several code paths that assumed a shared filesystem, and a couple of unrelated daemon-shutdown and observability gaps surfaced along the way.

Bugs found

1. "Before approving you need to generate the files"

Click "Generate files" on step 3, wait for it to finish, click "Approve", and the page redirects with "Before approving you need to generate the files." Opening any opinion in the sidebar (which lazy-pulls from S3) then makes Approve work.

upload_approved_files was checking output_dir/redacted/*.pdf on the web container's local disk, which in prod is empty until something pulls from S3. The check was a poor proxy for "did generate-files run?".

2. PDF area on the process viewer shows 500

After OCR / RunPod returns successfully, the left column renders Issues / Pages correctly but the PDF panel shows Error loading PDF: Unexpected server response (500).

serve_scan_pdf only looked at local disk. With no /tmp/ files and no local FileField copy (prod uses S3-backed storage), scan.pdf_path raised FileNotFoundError and the uncaught exception became a 500.

3. Scans stuck in PROCESSING for up to an hour after a deploy

When the daemon container was restarted mid-pipeline (deploy, OOM kill, K8s eviction), the scan it was processing sat in PROCESSING until the stale-recovery sweep, which uses DAEMON_PROCESSING_TIMEOUT (default 3600s), picked it up.

The signal handler only set self.shutdown = True; it never touched the row.

4. NO_GPU log lines read like a hard failure

Sentry was firing GPU not available; skipping weight/model preload. Jobs on this worker will return error_code=NO_GPU; fitness check will prevent this worker from accepting jobs. That sounds bad, but NO_GPU is in _TRANSIENT_ERROR_CODES, so the daemon re-queues the scan automatically and the next tick lands on a different worker. The message didn't say so.

Changes

  • fix(approve): upload_approved_files now checks scan.stage != Stage.APPROVED instead of probing the filesystem. run_generate_files already sets the stage to APPROVED on success, and the actual S3 copy is server-side, so the daemon and web containers don't need to share files for approval to work (first sketch after this list).
  • fix(views): serve_scan_pdf now tries local disk, lazy-pulls from S3 via s3_sync.download_processing_files, tries local disk again, and only then raises Http404. This mirrors the pattern serve_opinionscan_pdf already uses and replaces the uncaught 500 with a 404 when S3 also has nothing (second sketch).
  • fix(daemon): the SIGTERM/SIGINT handler now atomically resets all PROCESSING scans to QUEUED with a progress message saying the daemon was stopped mid-pipeline; the next tick (~5s) re-claims them. The re-queue does NOT bump retry_count, since the scan didn't fail: that budget is reserved for real RunPod transient errors, so a string of deploys can't push a scan to ERROR_MAX_RETRIES. The handler also sends a sentry_sdk.capture_message warning when the signal arrives, so unexpected kills (deploys, OOM, evictions) become visible (third sketch).
  • docs(runpod): both NO_GPU log lines (the preload guard and the per-job belt-and-suspenders) now mention that the scan is re-queued automatically, up to RUNPOD_MAX_TRANSIENT_RETRIES attempts. The longer explanation lives in a comment above each line; the log message itself stays terse (fourth sketch).
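
First sketch: a minimal version of the stage-based guard. Scan, Stage, and the view body are assumptions reconstructed from the names in this description, not the actual diff:

    # scanning/views.py (illustrative)
    from django.contrib import messages
    from django.shortcuts import get_object_or_404, redirect

    def upload_approved_files(request, pk):
        scan = get_object_or_404(Scan, pk=pk)  # Scan/Stage names assumed
        # run_generate_files sets stage=APPROVED on success, so the DB row is
        # the source of truth; no shared /tmp/ between containers is needed.
        if scan.stage != Stage.APPROVED:
            messages.error(request, "Before approving you need to generate the files.")
            return redirect("scan-detail", pk=pk)
        ...  # server-side S3 copy of the approved files proceeds from here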
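
Second sketch: the try-local / lazy-pull / retry / 404 flow in serve_scan_pdf. s3_sync.download_processing_files exists in this repo, but its import path and signature here are assumptions, as is the local_pdf_path accessor (the real scan.pdf_path raised FileNotFoundError when the file was missing):

    # scanning/views.py (illustrative)
    import pathlib

    from django.http import FileResponse, Http404
    from django.shortcuts import get_object_or_404

    from scanning.utils import s3_sync  # assumed import path

    def serve_scan_pdf(request, pk):
        scan = get_object_or_404(Scan, pk=pk)
        path = pathlib.Path(scan.local_pdf_path)  # hypothetical accessor into /tmp/scanning/{pk}/
        if not path.exists():
            # Prod: the web container's /tmp/ is empty, so pull the daemon's
            # output for this scan down from S3 first.
            s3_sync.download_processing_files(scan)
        if not path.exists():
            raise Http404("PDF not found on local disk or in S3")
        return FileResponse(path.open("rb"), content_type="application/pdf")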
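
Third sketch: the shutdown re-queue. Model, enum, and field names (Scan, Status.PROCESSING, progress_message) follow this description; a single UPDATE keeps the reset atomic:

    # daemon signal handler (illustrative)
    import signal

    import sentry_sdk

    def _handle_signal(self, signum, frame):
        self.shutdown = True  # stop the main loop after the current tick
        # Make unexpected kills (deploys, OOM, evictions) visible in Sentry.
        sentry_sdk.capture_message(
            f"Scan daemon received {signal.Signals(signum).name}; "
            "re-queuing in-flight scans",
            level="warning",
        )
        # retry_count is deliberately not bumped: that budget is reserved for
        # real RunPod transient errors, so repeated deploys can't push a scan
        # to ERROR_MAX_RETRIES. The next tick (~5s) re-claims these rows.
        Scan.objects.filter(status=Status.PROCESSING).update(
            status=Status.QUEUED,
            progress_message="Daemon stopped mid-pipeline; scan re-queued",
        )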
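
Fourth sketch: the comment-above-the-log pattern used for both NO_GPU lines; the exact wording is illustrative:

    import logging

    logger = logging.getLogger(__name__)

    # NO_GPU is in _TRANSIENT_ERROR_CODES: the daemon re-queues the scan and a
    # later tick lands on a different worker, up to RUNPOD_MAX_TRANSIENT_RETRIES
    # attempts. Expected on CPU-only workers; not a hard failure.
    logger.warning(
        "GPU not available; skipping weight/model preload. Jobs will return "
        "error_code=NO_GPU; scan will be re-queued automatically."
    )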

Tests

  • TestUploadApprovedFiles: renamed test_missing_files_returns_error to test_pre_generation_returns_error; fixtures now seed stage=APPROVED.
  • TestApproveScan: same fixture update.
  • TestHandleSignal (new, 2 cases): verifies re-queue + Sentry capture, and that _handle_signal is a no-op on the DB side when nothing is in flight (sketched after this list).
  • TestServeScanPdfLazyPull (new, 2 cases): exercises the S3 fallback path and the 404 case.
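
A hedged sketch of the first TestHandleSignal case; ScanFactory and the ScanDaemon class are hypothetical names, not taken from the diff:

    # scanning/tests/test_daemon.py (illustrative)
    import signal
    from unittest.mock import patch

    from django.test import TestCase

    class TestHandleSignal(TestCase):
        def test_requeues_processing_scans_and_captures_warning(self):
            scan = ScanFactory(status=Status.PROCESSING)  # hypothetical factory
            daemon = ScanDaemon()  # hypothetical daemon class
            with patch("sentry_sdk.capture_message") as capture:
                daemon._handle_signal(signal.SIGTERM, None)
            scan.refresh_from_db()
            self.assertEqual(scan.status, Status.QUEUED)  # re-queued, not errored
            self.assertTrue(daemon.shutdown)
            capture.assert_called_once()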

All four commits are independent and could be reverted separately.

@grossir moved this to PRs to Review in Sprint (Case Law), May 5, 2026
@grossir self-assigned this, May 5, 2026
@grossir (Contributor) left a comment:

This looks good. Just two minor suggestions, which I will push myself, and then get this merged.

        ),
    )
    if requeued:
        self.stdout.write(
grossir (Contributor): should be logger.info?

Comment thread: scanning/tests/test_views.py (on lines +1093 to +1096, outdated)
import os
import pathlib
import tempfile
from unittest.mock import patch
grossir (Contributor): these are not heavy; should they be top-level imports?

@grossir merged commit 6228eb1 into main, May 5, 2026
7 checks passed
github-project-automation bot moved this from PRs to Review to Done in Sprint (Case Law), May 5, 2026