fix: flip-api Postgres auth via RDS Proxy + IAM (closes #556) #558
Merged
flip-api read the RDS master password from Secrets Manager once at boot and baked it into the SQLAlchemy engine URL. The RDS-managed master secret auto-rotates, so every rotation left the long-running ECS task holding a stale password and took prod DB connectivity down until a manual force-new-deployment.

Replace the static credential with RDS Proxy + IAM database authentication:

- flip-api builds a passwordless engine and a SQLAlchemy do_connect hook mints a short-lived IAM auth token per physical connection (gated by DB_IAM_AUTH, production-only; dev keeps the static POSTGRES_PASSWORD path). Adds pool_pre_ping + pool_recycle so recycled connections self-heal.
- New rds_proxy.tf: RDS Proxy (IAM auth + TLS required), its IAM role scoped to the master secret + KMS decrypt, a dedicated security group, and RDS ingress from the proxy. The proxy reaches RDS with the rotating master secret it re-reads natively, so rotation is a non-event for the app and no rds_iam Postgres grant is needed.
- flip-api task role gains rds-db:connect scoped to the proxy + DB user; DB_HOST repointed at the proxy endpoint and DB_IAM_AUTH=true.

Unit tests cover the token mint, the do_connect hook, and the engine-build branches. Docs and env example updated.

https://claude.ai/code/session_01EHT7puggpRtgrk2x6XCVAy
Signed-off-by: Claude <noreply@anthropic.com>
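For readers unfamiliar with the pattern, here is a minimal sketch of a do_connect hook that mints a token per physical connection. It is not the code in database.py: the helper names (mint_iam_token, make_do_connect_hook, build_iam_engine), the pool_recycle value, and the URL shape are assumptions for illustration, and the sslmode connect argument assumes a libpq-based driver. The boto3 call generate_db_auth_token and the SQLAlchemy "do_connect" event are the real APIs.

```python
from typing import Any, Callable, Dict, Tuple


def mint_iam_token(rds_client: Any, host: str, port: int, user: str) -> str:
    """Return a short-lived IAM auth token to use in place of a password."""
    # rds_client is expected to be a boto3 RDS client (boto3.client("rds")).
    return rds_client.generate_db_auth_token(
        DBHostname=host, Port=port, DBUsername=user
    )


def make_do_connect_hook(mint: Callable[[], str]) -> Callable[..., None]:
    """Build a do_connect listener that injects a fresh token per connection."""

    def provide_token(
        dialect: Any, conn_rec: Any, cargs: Tuple[Any, ...], cparams: Dict[str, Any]
    ) -> None:
        # SQLAlchemy hands cparams to the DBAPI connect() call, so setting
        # "password" here supplies the credential for this connection only.
        cparams["password"] = mint()

    return provide_token


def build_iam_engine(url: str, mint: Callable[[], str]) -> Any:
    """Create a passwordless engine and attach the hook (values are illustrative)."""
    # Imported lazily so the helpers above can be exercised without SQLAlchemy.
    from sqlalchemy import create_engine, event

    engine = create_engine(
        url,  # hypothetical shape: postgresql://user@proxy-endpoint:5432/dbname
        pool_pre_ping=True,  # check a pooled connection before handing it out
        pool_recycle=600,  # assumed value; the real setting is not shown in this PR text
        connect_args={"sslmode": "require"},  # assumes a libpq-based driver
    )
    event.listen(engine, "do_connect", make_do_connect_hook(mint))
    return engine
```

Because the hook runs for every new physical connection, a connection opened after a pool recycle or a failed pre-ping gets a freshly minted token rather than a credential captured at boot.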
✅ Acceptance criteria have been automatically imported from the linked issue(s) and added to the PR description.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
No rollback path is needed, so select IAM-via-RDS-Proxy purely on ENV=production rather than a separate DB_IAM_AUTH flag, and remove the static-password machinery this leaves dead:

- config: remove the DB_IAM_AUTH setting + validator and POSTGRES_SECRET_ARN (prod no longer reads a DB password secret; extra env vars are ignored).
- database.py: gate the IAM do_connect path on ENV == "production"; dev keeps the static POSTGRES_PASSWORD path. Drops the Secrets Manager password branch.
- terraform: stop injecting DB_IAM_AUTH / POSTGRES_SECRET_ARN into the task; drop the now-dead direct ECS->RDS ingress rule and the unused DB-master-secret GetSecretValue grants on the flip-api task + execution roles.
- docs/env example updated to drop the flag and rollback notes.

https://claude.ai/code/session_01EHT7puggpRtgrk2x6XCVAy
Signed-off-by: Claude <noreply@anthropic.com>
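The ENV-only gating described in this commit could look roughly like the sketch below. The function name engine_config, the port, and the pooling values are hypothetical, and the real database.py may structure this differently. The point is that production gets a URL with no password plus the pooling knobs, while dev keeps the static password and none of those knobs.

```python
from typing import Any, Dict, Optional, Tuple
from urllib.parse import quote_plus


def engine_config(
    env: str,
    host: str,
    user: str,
    database: str,
    password: Optional[str] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return (url, create_engine kwargs) chosen purely from ENV."""
    if env == "production":
        # Passwordless URL: the credential is supplied later, per connection,
        # by a do_connect hook.
        url = f"postgresql://{user}@{host}:5432/{database}"
        kwargs: Dict[str, Any] = {
            "pool_pre_ping": True,
            "pool_recycle": 600,  # assumed value for illustration
            "connect_args": {"sslmode": "require"},
        }
        return url, kwargs

    # Dev path: static password from POSTGRES_PASSWORD, no pooling knobs.
    if password is None:
        raise ValueError("POSTGRES_PASSWORD is required outside production")
    url = f"postgresql://{user}:{quote_plus(password)}@{host}:5432/{database}"
    return url, {}
```

Keying on ENV alone means there is no second flag that can drift out of sync with the deployment environment.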
…n token in tests

- add dedicated flip_api.db_auth logger; log-and-re-raise on mint failure so prod failures are diagnosable (no secrets logged), exception unchanged
- tests: assert sslmode=require, per-connection token mint, pool_recycle, dev pooling-knob isolation
- docs: correct SES file in AWS CLAUDE.md table; fix stale POSTGRES_SECRET_ARN comment in compose.production.yml (see #566 / #505)

Signed-off-by: at24_bioeng625-pc <alexandre.triay_bagur@kcl.ac.uk>
Assert the IAM token-mint failure log against the settings stub's attributes (DB_HOST/POSTGRES_USER/AWS_REGION) instead of host/user string literals. CodeQL's py/incomplete-url-substring-sanitization (CWE-020) flagged the host-shaped literal in the 'in caplog.text' check; using the stub attributes defeats the heuristic and de-dupes the fixture values into a single source of truth.

Signed-off-by: at24_bioeng625-pc <alexandre.triay_bagur@kcl.ac.uk>
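As an illustration of the log-and-re-raise behaviour and of asserting against stub attributes rather than literals, a self-contained sketch follows. It uses a plain logging.Handler in place of pytest's caplog so it runs without pytest. The names mint_or_log, _Capture, and check_failure_is_logged and the stub values are hypothetical; only the logger name and the three settings attributes come from the commit messages.

```python
import logging
from types import SimpleNamespace
from typing import Any, Callable, List

logger = logging.getLogger("flip_api.db_auth")


def mint_or_log(settings: Any, mint: Callable[[], str]) -> str:
    """Mint a token; on failure log non-secret context and re-raise unchanged."""
    try:
        return mint()
    except Exception as exc:
        # Only connection metadata and the exception type are logged, no token.
        logger.error(
            "IAM token mint failed (%s) host=%s user=%s region=%s",
            type(exc).__name__,
            settings.DB_HOST,
            settings.POSTGRES_USER,
            settings.AWS_REGION,
        )
        raise


class _Capture(logging.Handler):
    """Collect formatted log messages so a test can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def check_failure_is_logged() -> None:
    # The stub is the single source of truth for the expected values.
    settings = SimpleNamespace(
        DB_HOST="db-proxy", POSTGRES_USER="flip", AWS_REGION="eu-west-2"
    )

    def failing_mint() -> str:
        raise RuntimeError("boom")

    capture = _Capture()
    logger.addHandler(capture)
    try:
        try:
            mint_or_log(settings, failing_mint)
        except RuntimeError:
            pass  # the original exception must propagate
        else:
            raise AssertionError("expected the mint failure to be re-raised")
    finally:
        logger.removeHandler(capture)

    text = "\n".join(capture.messages)
    # Compare against the stub's attributes, not repeated string literals.
    for value in (settings.DB_HOST, settings.POSTGRES_USER, settings.AWS_REGION):
        assert value in text
```

Reading the expected values off the stub keeps the fixture and the assertion from drifting apart if the stub is ever changed.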
garciadias
reviewed
Jun 2, 2026
Address review feedback on #558:

- Add 2026 year and canonical blank-line separators to the Apache headers of database.py, db_auth_logger.py and test_database.py (match rds_proxy.tf).
- Guard the lazy _rds_client cache with a double-checked threading.Lock so a concurrent first-warmup of the do_connect hook can't construct (and discard) duplicate boto3 RDS clients.

Signed-off-by: at24_bioeng625-pc <alexandre.triay_bagur@kcl.ac.uk>
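The double-checked locking described here can be sketched as follows. This is an illustration of the technique, not the code in database.py: get_rds_client and its factory parameter are hypothetical (in the real module the factory would presumably be a boto3.client("rds", ...) call), and only the _rds_client name comes from the commit message.

```python
import threading
from typing import Any, Callable, Optional

_rds_client: Optional[Any] = None
_rds_client_lock = threading.Lock()


def get_rds_client(factory: Callable[[], Any]) -> Any:
    """Return the cached client, constructing it at most once across threads."""
    global _rds_client
    # First check without the lock: the common case once the cache is warm.
    if _rds_client is None:
        with _rds_client_lock:
            # Second check under the lock: another thread may have built the
            # client while this one was waiting to acquire it.
            if _rds_client is None:
                _rds_client = factory()
    return _rds_client
```

Without the second check, two threads that both saw None before either took the lock would each build a client, and one would be discarded.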
garciadias
approved these changes
Jun 3, 2026
Summary
Fixes the recurring production DB outage (#556).
flip-api read the RDS master password from Secrets Manager once at boot and baked it into the SQLAlchemy engine URL. The RDS-managed master secret (rds!db-…) auto-rotates, so every rotation left the long-running ECS task holding a stale password, taking all DB-backed requests down until a manual force-new-deployment.

This implements Option 2 from the issue (RDS Proxy + IAM auth), which removes the static credential from the application entirely so this failure mode cannot recur.

- In production (ENV=production), flip-api builds a passwordless engine; a SQLAlchemy do_connect hook mints a short-lived IAM auth token per physical connection (over TLS), plus pool_pre_ping + pool_recycle. Dev (ENV=development) keeps the static POSTGRES_PASSWORD path. The path is selected purely by ENV, with no separate toggle.
- rds_proxy.tf: RDS Proxy (IAM auth + TLS required), its IAM role scoped to the master secret + kms:Decrypt, a dedicated security group, and RDS ingress from the proxy. The proxy reaches RDS using the rotating master secret it re-reads natively, so rotation is a non-event for the app, and no rds_iam Postgres grant is needed.
- The flip-api task role gains rds-db:connect (scoped to the proxy + DB user); DB_HOST is repointed at the proxy endpoint. The now-dead direct ECS→RDS ingress rule and the unused DB-master-secret GetSecretValue grants (flip-api task + execution roles) are removed; POSTGRES_SECRET_ARN is dropped from app config.

Files
- flip-api/src/flip_api/db/database.py: IAM-token do_connect hook; engine gated on ENV
- flip-api/src/flip_api/config.py: removed POSTGRES_SECRET_ARN (prod no longer reads a DB password)
- flip-api/tests/unit/db/test_database.py: new unit tests
- deploy/providers/AWS/rds_proxy.tf (new), iam_ecs.tf, ecs_tasks.tf, locals.tf, main.tf
- .env.development.example, root CLAUDE.md, deploy/providers/AWS/CLAUDE.md

Test plan
- ruff clean, mypy clean
- make unit_test pytest step: 1017 passed + new DB tests (token mint, do_connect hook, client caching, dev vs prod engine-build)
- terraform fmt -check clean
- terraform validate: not runnable in the authoring sandbox (provider/module registries blocked); relies on the Terraform Validate CI job
- No "password authentication failed" errors and no elevated 5xx (requires a real apply + deploy in eu-west-2)

Rollout note
Deploy the new flip-api image and terraform apply together: a production task only works once the proxy exists and DB_HOST points at it. Branch pushes don't build images, so the flip-api image must be rebuilt before deploy.

Closes #556
https://claude.ai/code/session_01EHT7puggpRtgrk2x6XCVAy