[Bug]: Database healthcheck causes connection stampede and complete server lockout under I/O stall #9939

@itsbrody

Description

The default healthcheck for standalone database resources runs psql -U postgres -d postgres -c 'SELECT 1' at a 5-second interval. Under I/O stall conditions (e.g., cloud provider maintenance), this causes a cascading failure: each healthcheck opens a full PostgreSQL connection that never completes, accumulating 12 connections per minute. Within ~8 minutes the default max_connections = 100 is exhausted, locking out all services. In our case the server load spiked to 208, rendering the entire server unresponsive, including services with their own separate databases that were not connected to the affected instance. Neither docker restart, docker kill, nor kill -9 could recover the container; only a full server reboot resolved it.

The healthcheck, which should detect problems, instead becomes the primary cause of the outage.
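As a monitoring sketch (the process-name pattern is an assumption about the default PostgreSQL image), the backend count can be watched from the host even while the server is refusing new client connections:

```shell
# Count live PostgreSQL backend processes from the host. This works
# even when "too many clients" blocks new connections, since it reads
# the process table rather than connecting. The '^postgres: ' pattern
# assumes the default process titles set by the postmaster.
ps -eo cmd | grep -c '^postgres: ' || true
```

During the incident this count climbs steadily toward max_connections instead of hovering near the normal working-set size.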

Error Message and Logs

PostgreSQL log (from dependent services and direct connection attempts):

FATAL: sorry, too many clients already

Healthcheck processes visible on the host (ps -efa | grep 2077, where 2077 is the postgres main PID):

root  2011149  2077  0 08:47 ?  00:00:00 /usr/lib/postgresql/17/bin/psql -U postgres -d postgres -c SELECT 1
999   2011150  2077  0 08:47 ?  00:00:00 postgres: postgres postgres [local] initializing
root  2011455  2077  0 08:47 ?  00:00:00 /usr/lib/postgresql/17/bin/psql -U postgres -d postgres -c SELECT 1
999   2011456  2077  0 08:47 ?  00:00:00 postgres: postgres postgres [local] initializing
[... 144 such pairs accumulated over ~12 minutes ...]

Each psql healthcheck spawns a postgres backend in "initializing" state that never completes. These are not cleaned up because the 5-second timeout cannot kill a process blocked on storage I/O: the process enters uninterruptible sleep (D state), where even SIGKILL is deferred until the I/O completes.
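To confirm that the stuck backends are in uninterruptible sleep, the process table can be filtered on the D state flag (a sketch; column formatting varies slightly between ps versions):

```shell
# Print the ps header plus any processes whose state starts with 'D'
# (uninterruptible sleep). During the stall, the hung psql healthchecks
# and their "initializing" postgres backends show up here.
ps -eo pid,stat,wchan:20,cmd | awk 'NR==1 || $2 ~ /^D/'
```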

Steps to Reproduce

  1. Deploy a standalone PostgreSQL database via Coolify (default healthcheck settings apply automatically)
  2. Simulate an I/O stall on the database volume (or wait for cloud provider maintenance that blocks volume I/O)
  3. Observe: psql SELECT 1 healthcheck processes accumulate every 5 seconds, each holding a connection in "initializing" state
  4. After ~8 minutes (with default max_connections = 100), all connection slots are consumed by healthcheck processes
  5. All services depending on the database receive FATAL: sorry, too many clients already
  6. Server load climbs into the hundreds; the container becomes unrecoverable via docker restart/kill
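The timeline in steps 3 and 4 follows from simple arithmetic:

```shell
# With a 5 s interval, hung checks accumulate at 60/5 = 12 connections
# per minute; exhausting the default max_connections = 100 therefore
# takes 100/12 ≈ 8.3 minutes, matching the observed ~8-minute lockout.
awk 'BEGIN {
  rate = 60 / 5                                 # stuck connections per minute
  printf "rate: %d/min\n", rate
  printf "time to exhaustion: %.1f min\n", 100 / rate
}'
```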

The hardcoded healthcheck configuration from the generated docker-compose.yml:

healthcheck:
  test: ["CMD-SHELL", "psql -U postgres -d postgres -c 'SELECT 1' || exit 1"]
  interval: 5s
  timeout: 5s
  retries: 10
  start_period: 5s
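As a possible mitigation (a sketch, not a tested Coolify default; all parameter values here are illustrative), pg_isready performs only the connection-startup handshake and exits without executing SQL, and a longer interval bounds how fast hung checks can accumulate:

healthcheck:
  # pg_isready exits after the startup handshake instead of running a
  # query; a 30s interval caps accumulation at 2 connections/minute
  # even if every single check hangs.
  test: ["CMD-SHELL", "pg_isready -U postgres -d postgres"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s

This does not eliminate the failure mode under a sustained I/O stall, since hung checks may still hold connection slots, but it slows exhaustion from ~8 minutes to roughly 50 minutes, leaving time to intervene.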

Example Repository URL

No response

Coolify Version

v4.0.0-beta.473

Are you using Coolify Cloud?

No (self-hosted)

Operating System and Version (self-hosted)

Ubuntu 24.04

Additional Information

Workaround attempted: editing /data/coolify/databases/<uuid>/docker-compose.yml directly does not persist, because Coolify regenerates this file from its internal database on every restart/redeploy. The "Custom PostgreSQL Configuration" field in the UI applies to postgresql.conf, not to the Docker healthcheck parameters.

Related: The same class of problem was fixed for Nextcloud in #9440 (fix(service): nextcloud workers exhaustion due to low interval healthcheck). The standalone database resources have the same vulnerability.

Metadata

Labels: 🔍 Triage (Issues that need assessment and prioritization)