We had a stuck-VMs alert in one of the regions. The symptom is that VMs show a `monitor health check failed` error without the agent actively trying to reconnect.
Steps to reproduce
It looks like the start of the `monitor health check failed` errors coincided with increased CPU load on the node – perhaps some responses were slow and that triggered a rare deadlock in our code.
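The later `Received HealthCheck with id 106840 but id is unknown or already timed out waiting for a reply` message suggests the monitor's reply did eventually arrive, just after the waiter had given up. Purely as an illustration of how that race could wedge things (a hypothetical sketch, not the agent's actual dispatcher code), consider a reply delivered over an unbuffered channel with no timeout on the send side:

```go
// Hypothetical sketch, not the agent's code: shows how a late reply on an
// unbuffered channel can permanently block the goroutine that sends it.
package main

import (
	"fmt"
	"time"
)

func main() {
	replies := make(chan string) // unbuffered: a send blocks until someone receives

	// "message handler": delivers the monitor's reply once it finally arrives.
	go func() {
		time.Sleep(2 * time.Second)    // monitor responds slowly (e.g. node under CPU pressure)
		replies <- "pong"              // blocks forever: the waiter already gave up
		fmt.Println("reply delivered") // never reached in this scenario
	}()

	// "health check": waits at most 1s for the reply, then gives up and
	// never receives from the channel again.
	select {
	case r := <-replies:
		fmt.Println("got reply:", r)
	case <-time.After(1 * time.Second):
		fmt.Println("timed out waiting for monitor response")
	}

	// The handler goroutine is now stuck on its send; in the real agent an
	// analogous task getting stuck would stop making progress and never
	// trigger a reconnect.
	time.Sleep(3 * time.Second)
	fmt.Println("main exiting; handler goroutine is still blocked on its send")
}
```

If the real code has an analogous send/receive pairing, a single slow response under CPU pressure would be enough to park the handler permanently, which would match the absence of any reconnect attempts in the logs.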
Expected result
The agent should try to reconnect to the vm-monitor inside the VM. If reconnecting fails, it should retry the connection at least every minute.
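For reference, a minimal sketch of the retry cadence we expect (hypothetical code, not the agent's actual reconnect logic): keep re-dialing the vm-monitor, never waiting more than a minute between attempts.

```go
// Hypothetical sketch of the expected behaviour, not the agent's actual code:
// retry the vm-monitor connection with backoff capped at one minute, so a
// stuck runner always keeps trying to recover.
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// connectToMonitor is a placeholder for dialing the vm-monitor inside the VM.
func connectToMonitor(ctx context.Context) error {
	return errors.New("timed out waiting 5s for monitor response") // simulated failure
}

func reconnectLoop(ctx context.Context) {
	backoff := 5 * time.Second
	const maxBackoff = time.Minute // "at least every minute"

	for {
		err := connectToMonitor(ctx)
		if err == nil {
			log.Println("reconnected to vm-monitor")
			return
		}
		log.Printf("reconnect failed: %v; retrying in %s", err, backoff)

		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}

		// Exponential backoff, but never wait longer than a minute.
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	reconnectLoop(ctx)
}
```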
Actual result
The agent stopped reconnecting after the failure.
Logs
```json
{
"level": "error",
"ts": 1743949264.3987582,
"logger": "autoscaler-agent.runner.vm-monitor.health-checks",
"caller": "agent/dispatcher.go:198",
"msg": "vm-monitor health check failed",
"restarts": 0,
"taskName": "vm-monitor health checks",
"duration": 5.028509356,
"okSequence": 31,
"error": "timed out waiting 5s for monitor response",
"stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
"level": "error",
"ts": 1743949269.409596,
"logger": "autoscaler-agent.runner.vm-monitor.health-checks",
"caller": "agent/dispatcher.go:198",
"msg": "vm-monitor health check failed",
"restarts": 0,
"taskName": "vm-monitor health checks",
"duration": 5.010701255,
"failSequence": 1,
"error": "timed out waiting 5s for monitor response",
"stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
"level": "error",
"ts": 1743949274.5960033,
"logger": "autoscaler-agent.runner.vm-monitor.health-checks",
"caller": "agent/dispatcher.go:198",
"msg": "vm-monitor health check failed",
"restarts": 0,
"taskName": "vm-monitor health checks",
"duration": 5.186049581,
"failSequence": 2,
"error": "timed out waiting 5s for monitor response",
"stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
"level": "warn",
"ts": 1743949279.1002576,
"logger": "autoscaler-agent.runner",
"caller": "agent/globalstate.go:575",
"msg": "Runner with endpoint is currently stuck",
"restarts": 0,
"taskName": "podStatus updater",
"reasons": "monitor health check failed"
}
{
"level": "error",
"ts": 1743949279.7036283,
"logger": "autoscaler-agent.runner.vm-monitor.health-checks",
"caller": "agent/dispatcher.go:198",
"msg": "vm-monitor health check failed",
"restarts": 0,
"taskName": "vm-monitor health checks",
"duration": 5.107415476,
"failSequence": 3,
"error": "timed out waiting 5s for monitor response",
"stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
"level": "warn",
"ts": 1743949284.1001186,
"logger": "autoscaler-agent.runner",
"caller": "agent/globalstate.go:575",
"msg": "Runner with endpoint is currently stuck",
"restarts": 0,
"taskName": "podStatus updater",
"reasons": "monitor health check failed"
}
{
"level": "error",
"ts": 1743949284.7244496,
"logger": "autoscaler-agent.runner.vm-monitor.health-checks",
"caller": "agent/dispatcher.go:198",
"msg": "vm-monitor health check failed",
"restarts": 0,
"taskName": "vm-monitor health checks",
"duration": 5.020609716,
"failSequence": 4,
"error": "timed out waiting 5s for monitor response",
"stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
"level": "warn",
"ts": 1743949289.1036808,
"logger": "autoscaler-agent.runner",
"caller": "agent/globalstate.go:575",
"msg": "Runner with endpoint is currently stuck",
"restarts": 0,
"taskName": "podStatus updater",
"reasons": "monitor health check failed"
}
{
"level": "error",
"ts": 1743949289.7318282,
"logger": "autoscaler-agent.runner.vm-monitor.health-checks",
"caller": "agent/dispatcher.go:198",
"msg": "vm-monitor health check failed",
"restarts": 0,
"taskName": "vm-monitor health checks",
"duration": 5.00723229,
"failSequence": 5,
"error": "timed out waiting 5s for monitor response",
"stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
"level": "warn",
"ts": 1743949337.7526948,
"logger": "autoscaler-agent.runner.vm-monitor.message-handler",
"caller": "agent/dispatcher.go:577",
"msg": "Received HealthCheck with id 106840 but id is unknown or already timed out waiting for a reply",
"restarts": 0,
"taskName": "vm-monitor message handler",
"id": 106840
}
{
"level": "warn",
"ts": 1743949337.7549634,
"logger": "autoscaler-agent.runner",
"caller": "agent/globalstate.go:575",
"msg": "Runner with endpoint is currently stuck",
"restarts": 0,
"taskName": "podStatus updater",
"reasons": "monitor health check failed"
}
{
"level": "warn",
"ts": 1743949366.509335,
"logger": "autoscaler-agent.runner",
"caller": "agent/globalstate.go:575",
"msg": "Runner with endpoint is currently stuck",
"restarts": 0,
"taskName": "podStatus updater",
"reasons": "monitor health check failed"
}
{
"level": "warn",
"ts": 1743949371.4166734,
"logger": "autoscaler-agent.runner",
"caller": "agent/globalstate.go:575",
"msg": "Runner with endpoint is currently stuck",
"restarts": 0,
"taskName": "podStatus updater",
"reasons": "monitor health check failed"
}
{
"level": "warn",
"ts": 1743949376.4159663,
"logger": "autoscaler-agent.runner",
"caller": "agent/globalstate.go:575",
"msg": "Runner with endpoint is currently stuck",
"restarts": 0,
"taskName": "podStatus updater",
"reasons": "monitor health check failed"
}
{
"level": "warn",
"ts": 1743949381.4167573,
"logger": "autoscaler-agent.runner",
"caller": "agent/globalstate.go:575",
"msg": "Runner with endpoint is currently stuck",
"restarts": 0,
"taskName": "podStatus updater",
"reasons": "monitor health check failed"
}
{
"level": "warn",
"ts": 1743949386.4167109,
"logger": "autoscaler-agent.runner",
"caller": "agent/globalstate.go:575",
"msg": "Runner with endpoint is currently stuck",
"restarts": 0,
"taskName": "podStatus updater",
"reasons": "monitor health check failed"
}
```
Other logs, links
More info in this Slack thread: https://neondb.slack.com/archives/C03TN5G758R/p1744106583046229