Bug: possible deadlock in autoscaler-agent.runner.vm-monitor.health-checks #1348

@petuhovskiy

Description

We had a stuck-VMs alert in one of the regions. The symptom is that VMs show a monitor health check failed error without the agent actively trying to reconnect.

Steps to reproduce

It looks like the start of the monitor health check failed errors coincided with increased CPU load on the node – perhaps some responses were slow, which triggered a rare deadlock in our code.
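For illustration only, here is a minimal Go sketch of the kind of hazard being suspected; it is hypothetical code, not the actual dispatcher (which, per the logs below, does detect late replies). It shows how a slow reply plus a 5s timeout can leave a goroutine blocked forever: the health-check caller gives up but leaves its reply channel registered, and the delivery path then blocks on an unbuffered channel send while holding a lock. All names (waiters, deliver) are made up for the example.

// Hypothetical sketch, not the agent's actual dispatcher code.
package main

import (
	"fmt"
	"sync"
	"time"
)

type waiters struct {
	mu sync.Mutex
	m  map[int]chan string // unbuffered reply channels, keyed by message id
}

// deliver hands a monitor reply to whoever registered that id.
func (w *waiters) deliver(id int, reply string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if ch, ok := w.m[id]; ok {
		ch <- reply // blocks forever if the waiter already timed out
	}
}

func main() {
	w := &waiters{m: map[int]chan string{106840: make(chan string)}}

	// Health-check caller: times out after 5s but never deregisters its id.
	go func() {
		select {
		case r := <-w.m[106840]:
			fmt.Println("reply:", r)
		case <-time.After(5 * time.Second):
			fmt.Println("timed out waiting 5s for monitor response")
		}
	}()

	time.Sleep(6 * time.Second)        // stand-in for a response delayed by CPU load
	w.deliver(106840, "HealthCheck")   // blocks here, holding w.mu
	fmt.Println("unreachable: deliver never returns, reconnect never happens")
}

Running this prints the timeout message and then the Go runtime reports "fatal error: all goroutines are asleep - deadlock!"; the real bug may be in a different spot, but the failure shape – the component silently going quiet after slow responses – matches what we saw.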

Expected result

The agent tries to reconnect to the vm-monitor inside the VM. If the connection fails, it should retry at least every minute.
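As a reference point, a minimal sketch of that expected behavior (hypothetical names, not the agent's actual reconnect code): retry the vm-monitor connection with a backoff capped at one minute, so a broken connection is always retried at least once per minute and the loop never goes silent.

// Hypothetical sketch of the expected retry behavior.
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// connectMonitor stands in for dialing the vm-monitor inside the VM.
func connectMonitor(ctx context.Context) error {
	return errors.New("timed out waiting 5s for monitor response")
}

func reconnectLoop(ctx context.Context) {
	backoff := time.Second
	const maxBackoff = time.Minute // never wait longer than a minute between attempts

	for {
		err := connectMonitor(ctx)
		if err == nil {
			return // connected; the dispatcher takes over from here
		}
		log.Printf("vm-monitor reconnect failed: %v; retrying in %v", err, backoff)

		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	// Run the loop briefly for demonstration.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	reconnectLoop(ctx)
}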

Actual result

The agent stopped reconnecting after the failure.

Logs

{
  "level": "error",
  "ts": 1743949264.3987582,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.028509356,
  "okSequence": 31,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "error",
  "ts": 1743949269.409596,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.010701255,
  "failSequence": 1,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "error",
  "ts": 1743949274.5960033,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.186049581,
  "failSequence": 2,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "warn",
  "ts": 1743949279.1002576,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "error",
  "ts": 1743949279.7036283,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.107415476,
  "failSequence": 3,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "warn",
  "ts": 1743949284.1001186,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "error",
  "ts": 1743949284.7244496,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.020609716,
  "failSequence": 4,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "warn",
  "ts": 1743949289.1036808,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "error",
  "ts": 1743949289.7318282,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.00723229,
  "failSequence": 5,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "warn",
  "ts": 1743949337.7526948,
  "logger": "autoscaler-agent.runner.vm-monitor.message-handler",
  "caller": "agent/dispatcher.go:577",
  "msg": "Received HealthCheck with id 106840 but id is unknown or already timed out waiting for a reply",
  "restarts": 0,
  "taskName": "vm-monitor message handler",
  "id": 106840
}
{
  "level": "warn",
  "ts": 1743949337.7549634,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "warn",
  "ts": 1743949366.509335,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "warn",
  "ts": 1743949371.4166734,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "warn",
  "ts": 1743949376.4159663,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "warn",
  "ts": 1743949381.4167573,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "warn",
  "ts": 1743949386.4167109,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}

Other logs, links

More info in this slack thread: https://neondb.slack.com/archives/C03TN5G758R/p1744106583046229
