Bug: possible deadlock in autoscaler-agent.runner.vm-monitor.health-checks #1348

@petuhovskiy

Description

We had a stuck-VMs alert in one of the regions. The symptom is that VMs show a monitor health check failed error without the agent actively trying to reconnect.

Steps to reproduce

It looks like the start of the monitor health check failed errors coincided with increased CPU load on the node – perhaps some responses were slow, which triggered a rare deadlock in our code.
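For illustration only, here is a minimal Go sketch of the kind of hazard being suspected; it is hypothetical code, not the actual dispatcher (which, per the logs below, does detect late replies). It shows how a slow reply plus a 5s timeout can leave a goroutine blocked forever: the health-check caller gives up but leaves its reply channel registered, and the delivery path then blocks on an unbuffered channel send while holding a lock. All names (waiters, deliver) are made up for the example.

// Hypothetical sketch, not the agent's actual dispatcher code.
package main

import (
	"fmt"
	"sync"
	"time"
)

type waiters struct {
	mu sync.Mutex
	m  map[int]chan string // unbuffered reply channels, keyed by message id
}

// deliver hands a monitor reply to whoever registered that id.
func (w *waiters) deliver(id int, reply string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if ch, ok := w.m[id]; ok {
		ch <- reply // blocks forever if the waiter already timed out
	}
}

func main() {
	w := &waiters{m: map[int]chan string{106840: make(chan string)}}

	// Health-check caller: times out after 5s but never deregisters its id.
	go func() {
		select {
		case r := <-w.m[106840]:
			fmt.Println("reply:", r)
		case <-time.After(5 * time.Second):
			fmt.Println("timed out waiting 5s for monitor response")
		}
	}()

	time.Sleep(6 * time.Second)        // stand-in for a response delayed by CPU load
	w.deliver(106840, "HealthCheck")   // blocks here, holding w.mu
	fmt.Println("unreachable: deliver never returns, reconnect never happens")
}

Running this prints the timeout message and then the Go runtime reports "fatal error: all goroutines are asleep - deadlock!"; the real bug may be in a different spot, but the failure shape – the component silently going quiet after slow responses – matches what we saw.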

Expected result

The agent tries to reconnect to the vm-monitor inside the VM. If the connection fails, it should retry at least every minute.
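As a reference point, a minimal sketch of that expected behavior (hypothetical names, not the agent's actual reconnect code): retry the vm-monitor connection with a backoff capped at one minute, so a broken connection is always retried at least once per minute and the loop never goes silent.

// Hypothetical sketch of the expected retry behavior.
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// connectMonitor stands in for dialing the vm-monitor inside the VM.
func connectMonitor(ctx context.Context) error {
	return errors.New("timed out waiting 5s for monitor response")
}

func reconnectLoop(ctx context.Context) {
	backoff := time.Second
	const maxBackoff = time.Minute // never wait longer than a minute between attempts

	for {
		err := connectMonitor(ctx)
		if err == nil {
			return // connected; the dispatcher takes over from here
		}
		log.Printf("vm-monitor reconnect failed: %v; retrying in %v", err, backoff)

		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	// Run the loop briefly for demonstration.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	reconnectLoop(ctx)
}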

Actual result

The agent stopped reconnecting after the failure.

Logs

{
  "level": "error",
  "ts": 1743949264.3987582,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.028509356,
  "okSequence": 31,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "error",
  "ts": 1743949269.409596,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.010701255,
  "failSequence": 1,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "error",
  "ts": 1743949274.5960033,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.186049581,
  "failSequence": 2,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "warn",
  "ts": 1743949279.1002576,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "error",
  "ts": 1743949279.7036283,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.107415476,
  "failSequence": 3,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "warn",
  "ts": 1743949284.1001186,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "error",
  "ts": 1743949284.7244496,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.020609716,
  "failSequence": 4,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "warn",
  "ts": 1743949289.1036808,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "error",
  "ts": 1743949289.7318282,
  "logger": "autoscaler-agent.runner.vm-monitor.health-checks",
  "caller": "agent/dispatcher.go:198",
  "msg": "vm-monitor health check failed",
  "restarts": 0,
  "taskName": "vm-monitor health checks",
  "duration": 5.00723229,
  "failSequence": 5,
  "error": "timed out waiting 5s for monitor response",
  "stacktrace": "github.com/neondatabase/autoscaling/pkg/agent.NewDispatcher.func5\n\t/workspace/pkg/agent/dispatcher.go:198\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:468"
}
{
  "level": "warn",
  "ts": 1743949337.7526948,
  "logger": "autoscaler-agent.runner.vm-monitor.message-handler",
  "caller": "agent/dispatcher.go:577",
  "msg": "Received HealthCheck with id 106840 but id is unknown or already timed out waiting for a reply",
  "restarts": 0,
  "taskName": "vm-monitor message handler",
  "id": 106840
}
{
  "level": "warn",
  "ts": 1743949337.7549634,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "warn",
  "ts": 1743949366.509335,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "warn",
  "ts": 1743949371.4166734,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "warn",
  "ts": 1743949376.4159663,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "warn",
  "ts": 1743949381.4167573,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}
{
  "level": "warn",
  "ts": 1743949386.4167109,
  "logger": "autoscaler-agent.runner",
  "caller": "agent/globalstate.go:575",
  "msg": "Runner with endpoint is currently stuck",
  "restarts": 0,
  "taskName": "podStatus updater",
  "reasons": "monitor health check failed"
}

Other logs, links

More info in this slack thread: https://neondb.slack.com/archives/C03TN5G758R/p1744106583046229
