
fix: flip-api Postgres auth via RDS Proxy + IAM (closes #556)#558

Merged
atriaybagur merged 8 commits into develop from claude/serene-brahmagupta-5QiT2
Jun 3, 2026

Conversation

@atriaybagur
Member

@atriaybagur atriaybagur commented May 28, 2026

Summary

Fixes the recurring production DB outage (#556). flip-api read the RDS master password from Secrets Manager once at boot and baked it into the SQLAlchemy engine URL. The RDS-managed master secret (rds!db-…) auto-rotates, so every rotation left the long-running ECS task holding a stale password — taking all DB-backed requests down until a manual force-new-deployment.

This implements Option 2 from the issue (RDS Proxy + IAM auth), which removes the static credential from the application entirely so this failure mode cannot recur.

  • In production (ENV=production), flip-api builds a passwordless engine; a SQLAlchemy do_connect hook mints a short-lived IAM auth token per physical connection (over TLS), and the engine sets pool_pre_ping and pool_recycle so stale pooled connections are replaced. Dev (ENV=development) keeps the static POSTGRES_PASSWORD path. The path is selected purely by ENV, with no separate toggle.
  • New rds_proxy.tf: RDS Proxy (IAM auth + TLS required), its IAM role scoped to the master secret + kms:Decrypt, a dedicated security group, and RDS ingress from the proxy. The proxy reaches RDS using the rotating master secret it re-reads natively, so rotation is a non-event for the app — and no rds_iam Postgres grant is needed.
  • flip-api task role gains rds-db:connect (scoped to the proxy + DB user); DB_HOST repointed at the proxy endpoint. The now-dead direct ECS→RDS ingress rule and the unused DB-master-secret GetSecretValue grants (flip-api task + execution roles) are removed; POSTGRES_SECRET_ARN is dropped from app config.
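
For readers who have not used the do_connect event before, the per-connection token mint described above can be sketched roughly as follows. This is an illustration, not the code in database.py: the function names (mint_iam_token, make_do_connect_hook, attach_iam_auth) are hypothetical, and the RDS client is passed in so the sketch does not assume how the real module constructs it. It relies only on boto3's generate_db_auth_token and SQLAlchemy's do_connect event, which receives the connection parameters as a mutable dict.

```python
from typing import Any, Callable


def mint_iam_token(rds_client: Any, host: str, port: int, user: str, region: str) -> str:
    """Ask an RDS client for a short-lived IAM auth token.

    The token is sent to the proxy in place of a password.
    """
    return rds_client.generate_db_auth_token(
        DBHostname=host, Port=port, DBUsername=user, Region=region
    )


def make_do_connect_hook(get_token: Callable[[], str]) -> Callable[..., None]:
    """Build a SQLAlchemy do_connect listener that injects a fresh token."""

    def provide_token(dialect: Any, conn_rec: Any, cargs: list, cparams: dict) -> None:
        # SQLAlchemy calls this once per physical connection, so every new
        # connection gets its own freshly minted token.
        cparams["password"] = get_token()

    return provide_token


def attach_iam_auth(engine: Any, get_token: Callable[[], str]) -> None:
    """Register the hook on an engine (SQLAlchemy is imported lazily here)."""
    from sqlalchemy import event

    event.listen(engine, "do_connect", make_do_connect_hook(get_token))
```

Because the token is minted inside the hook rather than baked into the engine URL, a long-running task never holds a credential that can go stale.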

Files

  • flip-api/src/flip_api/db/database.py — IAM-token do_connect hook; engine gated on ENV
  • flip-api/src/flip_api/config.py — removed POSTGRES_SECRET_ARN (prod no longer reads a DB password)
  • flip-api/tests/unit/db/test_database.py — new unit tests
  • deploy/providers/AWS/rds_proxy.tf (new), iam_ecs.tf, ecs_tasks.tf, locals.tf, main.tf
  • Docs: .env.development.example, root CLAUDE.md, deploy/providers/AWS/CLAUDE.md

Test plan

  • ruff clean, mypy clean
  • make unit_test pytest step: 1017 passed + new DB tests (token mint, do_connect hook, client caching, dev vs prod engine-build)
  • terraform fmt -check clean
  • terraform validate: not runnable in the authoring sandbox (provider/module registries blocked); relies on the Terraform Validate CI job
  • Acceptance criteria: after deploy, rotate the DB credential against the running service and confirm no password authentication failed errors and no elevated 5xx (requires a real apply + deploy in eu-west-2)

Rollout note

Deploy the new flip-api image and terraform apply together: a production task only works once the proxy exists and DB_HOST points at it. Branch pushes don't build images, so the flip-api image must be rebuilt before deploy.

Closes #556

https://claude.ai/code/session_01EHT7puggpRtgrk2x6XCVAy

flip-api read the RDS master password from Secrets Manager once at boot and
baked it into the SQLAlchemy engine URL. The RDS-managed master secret
auto-rotates, so every rotation left the long-running ECS task holding a stale
password and took prod DB connectivity down until a manual force-new-deployment.

Replace the static credential with RDS Proxy + IAM database authentication:

- flip-api builds a passwordless engine and a SQLAlchemy do_connect hook mints
  a short-lived IAM auth token per physical connection (gated by DB_IAM_AUTH,
  production-only; dev keeps the static POSTGRES_PASSWORD path). Adds
  pool_pre_ping + pool_recycle so recycled connections self-heal.
- New rds_proxy.tf: RDS Proxy (IAM auth + TLS required), its IAM role scoped to
  the master secret + KMS decrypt, a dedicated security group, and RDS ingress
  from the proxy. The proxy reaches RDS with the rotating master secret it
  re-reads natively, so rotation is a non-event for the app and no rds_iam
  Postgres grant is needed.
- flip-api task role gains rds-db:connect scoped to the proxy + DB user; DB_HOST
  repointed at the proxy endpoint and DB_IAM_AUTH=true.
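
As a rough illustration of what "rds-db:connect scoped to the proxy + DB user" amounts to, the policy document can be sketched as below. The actual grant lives in Terraform (iam_ecs.tf); this Python helper, its name, and all argument values are hypothetical and only show the shape of the statement. It assumes the rds-db ARN layout documented by AWS, with the proxy's resource ID in the resource position.

```python
import json


def rds_connect_policy(region: str, account_id: str, proxy_resource_id: str, db_user: str) -> dict:
    """Build an IAM policy allowing rds-db:connect for one DB user via one proxy."""
    # Assumed ARN layout: arn:aws:rds-db:<region>:<account>:dbuser:<resource-id>/<db-user>
    resource = f"arn:aws:rds-db:{region}:{account_id}:dbuser:{proxy_resource_id}/{db_user}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "rds-db:connect", "Resource": resource}
        ],
    }


if __name__ == "__main__":
    # Placeholder values only; none of these come from the real deployment.
    policy = rds_connect_policy("eu-west-2", "123456789012", "prx-EXAMPLE", "example_user")
    print(json.dumps(policy, indent=2))
```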

Unit tests cover the token mint, the do_connect hook, and the engine-build
branches. Docs and env example updated.

https://claude.ai/code/session_01EHT7puggpRtgrk2x6XCVAy
Signed-off-by: Claude <noreply@anthropic.com>
@github-actions

✅ Acceptance criteria have been automatically imported from the linked issue(s) and added to the PR description.

@codecov

codecov Bot commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


No rollback path is needed, so select IAM-via-RDS-Proxy purely on
ENV=production rather than a separate DB_IAM_AUTH flag, and remove the
static-password machinery this leaves dead:

- config: remove the DB_IAM_AUTH setting + validator and POSTGRES_SECRET_ARN
  (prod no longer reads a DB password secret; extra env vars are ignored).
- database.py: gate the IAM do_connect path on ENV == "production"; dev keeps
  the static POSTGRES_PASSWORD path. Drops the Secrets Manager password branch.
- terraform: stop injecting DB_IAM_AUTH / POSTGRES_SECRET_ARN into the task;
  drop the now-dead direct ECS->RDS ingress rule and the unused
  DB-master-secret GetSecretValue grants on the flip-api task + execution roles.
- docs/env example updated to drop the flag and rollback notes.
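
The ENV-only gating described in this commit might look roughly like the sketch below. It is not the code in database.py: engine_args is a hypothetical helper, DB_PORT and POSTGRES_DB are assumed setting names, the psycopg2 driver is an assumption, and the pool_recycle value is a placeholder. The helper only computes the URL and keyword arguments that would be handed to create_engine, which keeps the branch easy to unit-test.

```python
from typing import Any
from urllib.parse import quote_plus


def engine_args(settings: Any) -> tuple[str, dict[str, Any]]:
    """Return (url, create_engine kwargs) for the current environment."""
    target = f"{settings.DB_HOST}:{settings.DB_PORT}/{settings.POSTGRES_DB}"

    if settings.ENV == "production":
        # Passwordless URL: the do_connect hook supplies an IAM token later.
        url = f"postgresql+psycopg2://{settings.POSTGRES_USER}@{target}"
        kwargs: dict[str, Any] = {
            "pool_pre_ping": True,  # test pooled connections before use
            "pool_recycle": 600,  # placeholder value, in seconds
            "connect_args": {"sslmode": "require"},  # TLS to the proxy
        }
        return url, kwargs

    # Development: static password in the URL, no production pooling knobs.
    password = quote_plus(settings.POSTGRES_PASSWORD)
    url = f"postgresql+psycopg2://{settings.POSTGRES_USER}:{password}@{target}"
    return url, {}
```

Keeping the decision on a single variable means there is no flag combination in which production could fall back to the static-password path.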

https://claude.ai/code/session_01EHT7puggpRtgrk2x6XCVAy
Signed-off-by: Claude <noreply@anthropic.com>
…n token in tests

- add dedicated flip_api.db_auth logger; log-and-re-raise on mint failure
  so prod failures are diagnosable (no secrets logged), exception unchanged
- tests: assert sslmode=require, per-connection token mint, pool_recycle,
  dev pooling-knob isolation
- docs: correct SES file in AWS CLAUDE.md table; fix stale POSTGRES_SECRET_ARN
  comment in compose.production.yml (see #566 / #505)
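
The log-and-re-raise behaviour from the first bullet can be sketched as follows. Only the logger name flip_api.db_auth comes from this commit; the wrapper name mint_with_logging and the message format are hypothetical. The point of the pattern is that the log line carries connection context (host, user, region) but never the token, and the original exception propagates unchanged.

```python
import logging
from typing import Callable

logger = logging.getLogger("flip_api.db_auth")


def mint_with_logging(mint: Callable[[], str], host: str, user: str, region: str) -> str:
    """Call mint(); on failure, log non-secret context and re-raise as-is."""
    try:
        return mint()
    except Exception:
        # logger.exception records the traceback at ERROR level. Only
        # non-secret identifiers are included in the message.
        logger.exception(
            "IAM auth token mint failed (host=%s user=%s region=%s)",
            host, user, region,
        )
        raise  # bare raise keeps the original exception type and traceback
```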

Signed-off-by: at24_bioeng625-pc <alexandre.triay_bagur@kcl.ac.uk>
Comment thread flip-api/tests/unit/db/test_database.py Fixed
Assert the IAM token-mint failure log against the settings stub's
attributes (DB_HOST/POSTGRES_USER/AWS_REGION) instead of host/user
string literals. CodeQL's py/incomplete-url-substring-sanitization
(CWE-020) flagged the host-shaped literal in the 'in caplog.text'
check; using the stub attributes defeats the heuristic and de-dupes
the fixture values into a single source of truth.

Signed-off-by: at24_bioeng625-pc <alexandre.triay_bagur@kcl.ac.uk>
Comment thread deploy/providers/AWS/rds_proxy.tf
Comment thread flip-api/src/flip_api/db/db_auth_logger.py Outdated
Comment thread flip-api/src/flip_api/db/db_auth_logger.py
Comment thread flip-api/src/flip_api/db/db_auth_logger.py
Comment thread flip-api/tests/unit/db/test_database.py Outdated
Comment thread flip-api/tests/unit/db/test_database.py
Comment thread flip-api/src/flip_api/db/database.py
Comment thread flip-api/src/flip_api/db/database.py
@garciadias garciadias assigned atriaybagur and unassigned garciadias Jun 2, 2026
Address review feedback on #558:
- Add 2026 year and canonical blank-line separators to the Apache headers
  of database.py, db_auth_logger.py and test_database.py (match rds_proxy.tf).
- Guard the lazy _rds_client cache with a double-checked threading.Lock so a
  concurrent first-warmup of the do_connect hook can't construct (and discard)
  duplicate boto3 RDS clients.
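
A minimal sketch of the double-checked locking described in the second bullet, assuming a module-level cache. The names get_rds_client and _rds_client_lock and the factory argument are hypothetical; in the real module the factory would presumably be a boto3 client constructor, which is left out here so the sketch runs without AWS dependencies.

```python
import threading
from typing import Any, Callable, Optional

_rds_client: Optional[Any] = None
_rds_client_lock = threading.Lock()


def get_rds_client(factory: Callable[[], Any]) -> Any:
    """Return the cached client, constructing it at most once across threads."""
    global _rds_client
    if _rds_client is None:  # fast path: no lock once the cache is warm
        with _rds_client_lock:
            if _rds_client is None:  # re-check: another thread may have won
                _rds_client = factory()
    return _rds_client
```

The outer check avoids taking the lock on every connection after warm-up; the inner check is what prevents two threads that both saw None from each building a client.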

Signed-off-by: at24_bioeng625-pc <alexandre.triay_bagur@kcl.ac.uk>

Development

Successfully merging this pull request may close these issues.

flip-api (prod): DB outage on every RDS secret rotation — DB password cached at boot

5 participants