fix: recover scans across daemon/web container boundaries#61
Merged
Conversation
grossir
approved these changes
May 5, 2026
Contributor
grossir
left a comment
There was a problem hiding this comment.
This looks good. Just 2 minor suggestions I will push myself and get this merged
| ), | ||
| ) | ||
| if requeued: | ||
| self.stdout.write( |
Comment on lines
+1093
to
+1096
| import os | ||
| import pathlib | ||
| import tempfile | ||
| from unittest.mock import patch |
Contributor
There was a problem hiding this comment.
these are not heavy, should be top level imports?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
In prod the daemon and web run in separate containers with separate
/tmp/volumes. The daemon writes processing files to its own/tmp/scanning/{pk}/and pushes them to S3; the web container's local copy is empty and is populated lazily when needed. That split was breaking several code paths that assumed a shared filesystem, plus a couple of unrelated daemon-shutdown / observability gaps surfaced along the way.Bugs found
1. "Before approving you need to generate the files"
Click "Generate files" on step 3, wait for it to finish, click "Approve", and the page redirects with "Before approving you need to generate the files." Opening any opinion in the sidebar (which lazy-pulls from S3) then makes Approve work.
upload_approved_fileswas checkingoutput_dir/redacted/*.pdfon the web container's local disk, which in prod is empty until something pulls from S3. The check was a poor proxy for "did generate-files run?".2. PDF area on the process viewer shows 500
After OCR / RunPod returns successfully, the left column renders Issues / Pages correctly but the PDF panel shows
Error loading PDF: Unexpected server response (500).serve_scan_pdfonly looked at local disk. With no/tmp/files and no local FileField copy (prod uses S3-backed storage),scan.pdf_pathraisedFileNotFoundErrorand the uncaught exception became a 500.3. Scans stuck in PROCESSING for up to an hour after a deploy
When the daemon container was restarted mid-pipeline (deploy, OOM, K8s eviction), the scan it was processing sat in
PROCESSINGuntil the stale-recovery sweep picked it up, which usesDAEMON_PROCESSING_TIMEOUT(default 3600s).The signal handler only set
self.shutdown = True; it never touched the row.4. NO_GPU log lines read like a hard failure
Sentry was firing
GPU not available; skipping weight/model preload. Jobs on this worker will return error_code=NO_GPU; fitness check will prevent this worker from accepting jobs.That sounds bad, butNO_GPUis in_TRANSIENT_ERROR_CODES, so the daemon re-queues the scan automatically and the next tick lands on a different worker. The message didn't say so.Changes
fix(approve):upload_approved_filesnow usesscan.stage != Stage.APPROVEDinstead of a filesystem probe.run_generate_filesalready sets stage to APPROVED on success, and the actual S3 copy is server-side, so the daemon and web containers don't need to share files for approval to work.fix(views):serve_scan_pdfnow does try-local, lazy-pull from S3 vias3_sync.download_processing_files, try-local again, thenHttp404. Mirrors the patternserve_opinionscan_pdfalready uses. Replaces the uncaught 500 with a 404 if S3 also has nothing.fix(daemon): SIGTERM/SIGINT handler now atomically resets allPROCESSINGscans toQUEUEDwith a progress message saying the daemon crashed mid-pipeline. The next tick (~5s) re-claims them. The re-queue does NOT bumpretry_count, since the scan didn't fail; that budget is reserved for real RunPod transient errors so a string of deploys can't push a scan toERROR_MAX_RETRIES. Also captures asentry_sdk.capture_messagewarning when the signal arrives, so unexpected kills (deploys, OOM, evictions) become visible.docs(runpod): both NO_GPU log lines (the preload guard and the per-job belt-and-suspenders) now mention that the scan is re-queued automatically up toRUNPOD_MAX_TRANSIENT_RETRIESattempts. The longer explanation lives in a comment above each line; the log message itself stays terse.Tests
TestUploadApprovedFiles: renamedtest_missing_files_returns_errortotest_pre_generation_returns_error; fixtures now seedstage=APPROVED.TestApproveScan: same fixture update.TestHandleSignal(new, 2 cases): verifies re-queue + Sentry capture, and that_handle_signalis a no-op on the DB side when nothing is in flight.TestServeScanPdfLazyPull(new, 2 cases): exercises the S3 fallback path and the 404 case.All four commits are independent and could be reverted separately.