Streaming pings #3697

Merged
DrJosh9000 merged 18 commits into main from
pb-927-update-agent-to-consume-the-new-connectrpc-endpoint
Mar 3, 2026
@DrJosh9000 DrJosh9000 commented Feb 4, 2026

Description

Support the upcoming streaming-ping endpoint, for faster dispatch of jobs.

Context

https://linear.app/buildkite/issue/PB-927/update-agent-to-consume-the-new-connectrpc-endpoint

Changes

The plumbing stuff (adding the .proto, the new flag, the API client method, the E2E tests) is all hopefully clear on its own.

What we want is:

  • when the streaming loop works, we prefer it
  • when the streaming loop doesn't work, we fall back to the ping loop until it starts working again
  • when it's fallen back to the ping loop, and the ping loop is in the middle of something (say, a job), then the streaming loop doesn't take over again until the ping loop is finished with whatever it's doing

That part is implemented with a "toggle baton", which is... some channels and a mutex in a trenchcoat. The streaming side starts with the toggle baton, and the ping loop blocks to receive the toggle baton. If the streaming side becomes unhealthy, it gives the toggle baton back, and the ping loop should pick it up immediately. If the streaming side becomes healthy again, it waits to take the toggle baton back (the ping loop may be in the middle of a job).
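
To make the handoff concrete, here is a minimal sketch of the baton idea. It is not the code in this PR: the `baton` type, its `Acquire`/`Release`/`Holder` methods, and the `"stream"`/`"ping"` names are all hypothetical, and it assumes a single mutex plus one "freed" channel is enough to show the behaviour.

```go
package main

import (
	"fmt"
	"sync"
)

// baton is a hypothetical, simplified stand-in for the "toggle baton".
// At most one side holds it at a time.
type baton struct {
	mu     sync.Mutex
	holder string        // "" when nobody holds the baton
	freed  chan struct{} // closed and replaced each time the baton is released
}

func newBaton(initial string) *baton {
	return &baton{holder: initial, freed: make(chan struct{})}
}

// Acquire blocks until the baton is free, then takes it for who.
func (b *baton) Acquire(who string) {
	for {
		b.mu.Lock()
		if b.holder == "" {
			b.holder = who
			b.mu.Unlock()
			return
		}
		wait := b.freed
		b.mu.Unlock()
		<-wait // woken when the current holder releases
	}
}

// Release gives the baton up if who currently holds it, waking any waiters.
func (b *baton) Release(who string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.holder != who {
		return
	}
	b.holder = ""
	close(b.freed)
	b.freed = make(chan struct{})
}

// Holder reports who currently holds the baton.
func (b *baton) Holder() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.holder
}

func main() {
	b := newBaton("stream") // the streaming side starts with the baton

	pingHas := make(chan struct{})
	pingDone := make(chan struct{})
	go func() {
		b.Acquire("ping") // blocks until the stream gives the baton up
		close(pingHas)
		<-pingDone // stands in for the ping loop being busy with a job
		b.Release("ping")
	}()

	b.Release("stream") // the stream became unhealthy
	<-pingHas
	fmt.Println("holder:", b.Holder()) // holder: ping

	// The stream is healthy again, but has to wait for the ping loop.
	got := make(chan struct{})
	go func() {
		b.Acquire("stream")
		close(got)
	}()
	close(pingDone) // the job finishes
	<-got
	fmt.Println("holder:", b.Holder()) // holder: stream
}
```

One thing this sketch glosses over: nothing stops the ping loop from re-acquiring the baton right after releasing it, so a real version needs some way to give the healthy stream priority.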

The next main complication is that the ping loop and streaming loop operate in totally different ways. Pings don't get sent while executing a job, but the stream can (at least theoretically) continue receiving messages the whole time. The stream can also (theoretically) receive multiple contradictory messages one after another, e.g. "pause, idle, pause".

This is solved by breaking apart the loops into more loops - think "actors" that are passing messages between each other.

image

The ping loop and streaming loop generate actions. The ping loop can wait until the action is complete, so that's what it does. The streaming loop shouldn't be made to wait, so the waiting is delegated to the debouncer, which also coalesces multiple actions from the stream loop into 0 or 1 next actions, plus deals with the toggle baton. The action handler loop actually performs the actions. The action handler is closest to being "the agent" in the sense that "the ping loop" used to be.

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)
  • Get the E2E test working - pending buildkite/buildkite#27861
  • Confirm the final proto definition, particularly the removal of re-endpointing

Disclosures / Credits

No LLMs were used in the making of this PR

@DrJosh9000 DrJosh9000 force-pushed the pb-927-update-agent-to-consume-the-new-connectrpc-endpoint branch 11 times, most recently from a273a44 to dec01e1 Compare February 12, 2026 05:15
@DrJosh9000 DrJosh9000 force-pushed the pb-927-update-agent-to-consume-the-new-connectrpc-endpoint branch 19 times, most recently from 0fcf9b8 to e26a125 Compare February 17, 2026 23:36
@DrJosh9000 DrJosh9000 force-pushed the pb-927-update-agent-to-consume-the-new-connectrpc-endpoint branch 8 times, most recently from 97b6da8 to 163c33d Compare March 3, 2026 04:59
@DrJosh9000 DrJosh9000 requested a review from zhming0 March 3, 2026 04:59
@DrJosh9000 DrJosh9000 force-pushed the pb-927-update-agent-to-consume-the-new-connectrpc-endpoint branch from 163c33d to a1cdc75 Compare March 3, 2026 05:02

@zhming0 zhming0 left a comment


🚂 🚂 🚂 🚂 🚂 🚂

Comment on lines +427 to +435
// reconnInterval functions similarly to pingInterval, except we expect
// the resulting connection to last much longer. By default, attempt to
// reconnect no more than once every 10 seconds.
reconnInterval := time.Second * time.Duration(max(10, a.agent.PingInterval))
if a.agentConfiguration.PingMode == "stream-only" {
	// If it's only us, then allow reconnecting as though each stream was
	// a ping.
	reconnInterval = time.Second * time.Duration(a.agent.PingInterval)
}

Awesome, that's the best outcome 💯
