fix: recover scans across daemon/web container boundaries #61

Merged: grossir merged 5 commits into main from fix/prod-multi-container-recovery, May 5, 2026
Conversation

@quevon24 (Member) commented May 4, 2026

In prod the daemon and web run in separate containers with separate /tmp/ volumes. The daemon writes processing files to its own /tmp/scanning/{pk}/ and pushes them to S3; the web container's local copy is empty and is populated lazily when needed. That split was breaking several code paths that assumed a shared filesystem, and a couple of unrelated daemon-shutdown and observability gaps surfaced along the way.

Bugs found

1. "Before approving you need to generate the files"

Click "Generate files" on step 3, wait for it to finish, click "Approve", and the page redirects with "Before approving you need to generate the files." Opening any opinion in the sidebar (which lazy-pulls from S3) then makes Approve work.

upload_approved_files was checking output_dir/redacted/*.pdf on the web container's local disk, which in prod is empty until something pulls from S3. The check was a poor proxy for "did generate-files run?".

2. PDF area on the process viewer shows 500

After OCR / RunPod returns successfully, the left column renders Issues / Pages correctly but the PDF panel shows Error loading PDF: Unexpected server response (500).

serve_scan_pdf only looked at local disk. With no /tmp/ files and no local FileField copy (prod uses S3-backed storage), scan.pdf_path raised FileNotFoundError and the uncaught exception became a 500.

3. Scans stuck in PROCESSING for up to an hour after a deploy

When the daemon container was restarted mid-pipeline (deploy, OOM kill, K8s eviction), the scan it was processing sat in PROCESSING until the stale-recovery sweep, which uses DAEMON_PROCESSING_TIMEOUT (default 3600s), picked it up.

The signal handler only set self.shutdown = True; it never touched the row.

4. NO_GPU log lines read like a hard failure

Sentry was firing GPU not available; skipping weight/model preload. Jobs on this worker will return error_code=NO_GPU; fitness check will prevent this worker from accepting jobs. That sounds bad, but NO_GPU is in _TRANSIENT_ERROR_CODES, so the daemon re-queues the scan automatically and the next tick lands on a different worker. The message didn't say so.

Changes

  • fix(approve): upload_approved_files now checks scan.stage != Stage.APPROVED instead of probing the filesystem. run_generate_files already sets the stage to APPROVED on success, and the actual S3 copy is server-side, so the daemon and web containers don't need to share files for approval to work (first sketch after this list).
  • fix(views): serve_scan_pdf now tries local disk, lazy-pulls from S3 via s3_sync.download_processing_files, tries local disk again, and only then raises Http404. This mirrors the pattern serve_opinionscan_pdf already uses and replaces the uncaught 500 with a 404 when S3 also has nothing (second sketch).
  • fix(daemon): the SIGTERM/SIGINT handler now atomically resets all PROCESSING scans to QUEUED with a progress message saying the daemon was stopped mid-pipeline; the next tick (~5s) re-claims them. The re-queue does NOT bump retry_count, since the scan didn't fail: that budget is reserved for real RunPod transient errors, so a string of deploys can't push a scan to ERROR_MAX_RETRIES. The handler also sends a sentry_sdk.capture_message warning when the signal arrives, so unexpected kills (deploys, OOM, evictions) become visible (third sketch).
  • docs(runpod): both NO_GPU log lines (the preload guard and the per-job belt-and-suspenders) now mention that the scan is re-queued automatically, up to RUNPOD_MAX_TRANSIENT_RETRIES attempts. The longer explanation lives in a comment above each line; the log message itself stays terse (fourth sketch).
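
First sketch: a minimal version of the stage-based guard. Scan, Stage, and the view body are assumptions reconstructed from the names in this description, not the actual diff:

    # scanning/views.py (illustrative)
    from django.contrib import messages
    from django.shortcuts import get_object_or_404, redirect

    def upload_approved_files(request, pk):
        scan = get_object_or_404(Scan, pk=pk)  # Scan/Stage names assumed
        # run_generate_files sets stage=APPROVED on success, so the DB row is
        # the source of truth; no shared /tmp/ between containers is needed.
        if scan.stage != Stage.APPROVED:
            messages.error(request, "Before approving you need to generate the files.")
            return redirect("scan-detail", pk=pk)
        ...  # server-side S3 copy of the approved files proceeds from here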
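
Second sketch: the try-local / lazy-pull / retry / 404 flow in serve_scan_pdf. s3_sync.download_processing_files exists in this repo, but its import path and signature here are assumptions, as is the local_pdf_path accessor (the real scan.pdf_path raised FileNotFoundError when the file was missing):

    # scanning/views.py (illustrative)
    import pathlib

    from django.http import FileResponse, Http404
    from django.shortcuts import get_object_or_404

    from scanning.utils import s3_sync  # assumed import path

    def serve_scan_pdf(request, pk):
        scan = get_object_or_404(Scan, pk=pk)
        path = pathlib.Path(scan.local_pdf_path)  # hypothetical accessor into /tmp/scanning/{pk}/
        if not path.exists():
            # Prod: the web container's /tmp/ is empty, so pull the daemon's
            # output for this scan down from S3 first.
            s3_sync.download_processing_files(scan)
        if not path.exists():
            raise Http404("PDF not found on local disk or in S3")
        return FileResponse(path.open("rb"), content_type="application/pdf")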
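
Third sketch: the shutdown re-queue. Model, enum, and field names (Scan, Status.PROCESSING, progress_message) follow this description; a single UPDATE keeps the reset atomic:

    # daemon signal handler (illustrative)
    import signal

    import sentry_sdk

    def _handle_signal(self, signum, frame):
        self.shutdown = True  # stop the main loop after the current tick
        # Make unexpected kills (deploys, OOM, evictions) visible in Sentry.
        sentry_sdk.capture_message(
            f"Scan daemon received {signal.Signals(signum).name}; "
            "re-queuing in-flight scans",
            level="warning",
        )
        # retry_count is deliberately not bumped: that budget is reserved for
        # real RunPod transient errors, so repeated deploys can't push a scan
        # to ERROR_MAX_RETRIES. The next tick (~5s) re-claims these rows.
        Scan.objects.filter(status=Status.PROCESSING).update(
            status=Status.QUEUED,
            progress_message="Daemon stopped mid-pipeline; scan re-queued",
        )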
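
Fourth sketch: the comment-above-the-log pattern used for both NO_GPU lines; the exact wording is illustrative:

    import logging

    logger = logging.getLogger(__name__)

    # NO_GPU is in _TRANSIENT_ERROR_CODES: the daemon re-queues the scan and a
    # later tick lands on a different worker, up to RUNPOD_MAX_TRANSIENT_RETRIES
    # attempts. Expected on CPU-only workers; not a hard failure.
    logger.warning(
        "GPU not available; skipping weight/model preload. Jobs will return "
        "error_code=NO_GPU; scan will be re-queued automatically."
    )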

Tests

  • TestUploadApprovedFiles: renamed test_missing_files_returns_error to test_pre_generation_returns_error; fixtures now seed stage=APPROVED.
  • TestApproveScan: same fixture update.
  • TestHandleSignal (new, 2 cases): verifies re-queue + Sentry capture, and that _handle_signal is a no-op on the DB side when nothing is in flight (sketched after this list).
  • TestServeScanPdfLazyPull (new, 2 cases): exercises the S3 fallback path and the 404 case.
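
A hedged sketch of the first TestHandleSignal case; ScanFactory and the ScanDaemon class are hypothetical names, not taken from the diff:

    # scanning/tests/test_daemon.py (illustrative)
    import signal
    from unittest.mock import patch

    from django.test import TestCase

    class TestHandleSignal(TestCase):
        def test_requeues_processing_scans_and_captures_warning(self):
            scan = ScanFactory(status=Status.PROCESSING)  # hypothetical factory
            daemon = ScanDaemon()  # hypothetical daemon class
            with patch("sentry_sdk.capture_message") as capture:
                daemon._handle_signal(signal.SIGTERM, None)
            scan.refresh_from_db()
            self.assertEqual(scan.status, Status.QUEUED)  # re-queued, not errored
            self.assertTrue(daemon.shutdown)
            capture.assert_called_once()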

All four commits are independent and could be reverted separately.

@grossir moved this to PRs to Review in Sprint (Case Law), May 5, 2026
@grossir self-assigned this, May 5, 2026
@grossir (Contributor) left a comment:

This looks good. Just two minor suggestions, which I will push myself, and then get this merged.

        ),
    )
    if requeued:
        self.stdout.write(
grossir (Contributor): should be logger.info?

Comment thread: scanning/tests/test_views.py (on lines +1093 to +1096, outdated)
import os
import pathlib
import tempfile
from unittest.mock import patch
grossir (Contributor): these are not heavy; should they be top-level imports?

@grossir merged commit 6228eb1 into main, May 5, 2026
7 checks passed
github-project-automation bot moved this from PRs to Review to Done in Sprint (Case Law), May 5, 2026