
Conversation

@delan delan commented Nov 7, 2025

currently our self-hosted runner system falls back to github-hosted runners if there’s no available capacity at the exact moment of the select-runner request. this is suboptimal: if the job would take 5x as long on a github-hosted runner, you could wait almost 80% of that time for a self-hosted runner and still come out ahead, since a job that takes T self-hosted and 5T github-hosted wins whenever the wait W satisfies W + T < 5T, i.e. W < 4T (80% of the github-hosted duration).

this patch implements a new global queue service that allows self-hosted runner jobs to wait for available capacity. the service will run on one server for now, as a single queue that dispatches to all available servers, like any efficient supermarket. queueing a job works like this (a rough client-side sketch follows the list):

  1. POST /profile/<profile_key>/enqueue?<unique_id>&<qualified_repo>&<run_id> (tokenful)
    or POST /enqueue?<unique_id>&<qualified_repo>&<run_id> (tokenless) to enqueue a job.

    • both endpoints return a random token that is used to authenticate the client in the next step.
    • the tokenless endpoint validates that the request came from an authorised job, using an artifact.
    • the request is rejected if no servers are configured to target non-zero runners for the requested profile, because we may never be able to satisfy it.
    • there are no limits to queue depth (at least not yet), but clients probably have better knowledge of the nature of their job anyway, and in theory, they could use that knowledge to decide how long to wait (see below).
  2. POST /take/<unique_id>?<token> to try to take the runner for the enqueued job. once capacity is available, this endpoint is effectively proxied to POST /profile/<profile_key>/take on one of the underlying servers.

    • if the client failed to provide the correct token from the previous step, the response is HTTP 403.
    • if the unique id is unknown, because it expired or the queue service restarted, the response is HTTP 404.
    • if there’s no capacity yet, the response is HTTP 503. repeat after waiting for ‘Retry-After’ seconds.
    • if taking the runner was successful, the response is HTTP 200, with the runner details as JSON.
    • if taking the runner was somehow unsuccessful (bug), the response is HTTP 200, with null as JSON. this sucks, to be honest, but it was also true for the underlying monitor API.
      • when we fix this, we should be careful about curl --retry.
    • clients are free to abandon a queued job without actually taking it, by simply not polling for 30 seconds (after which the entry expires). for now, the runner-select action client abandons a queued job if it has been waiting for one hour.
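
for reference, here’s a rough client-side sketch of steps 1 and 2 as a shell loop that honours ‘Retry-After’ and gives up after an hour. the address, profile, repo, and run id are borrowed from the local-testing example further down; this is an illustration only, not the real runner-select action code:

  # step 1: enqueue, keeping the random token that authenticates the later /take requests.
  queue=http://192.168.100.1:8002
  unique_id=$RANDOM
  token=$(curl --fail-with-body -sSX POST --oauth2-bearer "$SERVO_CI_MONITOR_API_TOKEN" \
    "$queue/profile/servo-windows10/enqueue?unique_id=$unique_id&qualified_repo=delan/servo&run_id=123")

  # step 2: poll /take until we get HTTP 200, sleeping for ‘Retry-After’ seconds on 503,
  # and abandon the queued job by simply stopping once the deadline passes.
  deadline=$(( $(date +%s) + 3600 ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    status=$(curl -sSX POST -D headers.txt -o runner.json -w '%{http_code}' \
      "$queue/take/$unique_id?token=$token")
    case "$status" in
      200) cat runner.json; exit 0 ;;   # runner details as JSON (or null if taking somehow failed)
      503) retry=$(awk 'tolower($1)=="retry-after:" {print $2+0}' headers.txt); sleep "${retry:-1}" ;;
      *)   echo "take failed with HTTP $status" >&2; exit 1 ;;   # 403 bad token, 404 unknown unique id
    esac
  done
  echo "gave up waiting for a self-hosted runner" >&2
  exit 1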

i’ve added a “self-test” workflow that can be manually dispatched to test the new flow (e.g. ok 1, ok 2, ok 3, unsatisfiable, unauthorised). you can also play around with this locally by spinning up a monitor and a queue on your own machine, then sending the requests by hand (so three separate terminals):

  • $ cargo build && sudo IMAGE_DEPS_DIR=$(nix eval --raw .\#image-deps) LIB_MONITOR_DIR=. $CARGO_TARGET_DIR/debug/monitor
  • $ cargo build && sudo IMAGE_DEPS_DIR=$(nix eval --raw .\#image-deps) LIB_MONITOR_DIR=. $CARGO_TARGET_DIR/debug/queue
  • $ unique_id=$RANDOM
    $ token=$(curl --fail-with-body -sSX POST --oauth2-bearer "$SERVO_CI_MONITOR_API_TOKEN" \
        'http://192.168.100.1:8002/profile/servo-windows10/enqueue?unique_id='"$unique_id"'&qualified_repo=delan/servo&run_id=123')
    $ curl --fail-with-body -sSX POST --retry-max-time 3600 --retry 3600 --retry-delay 1 \
        'http://192.168.100.1:8002/take/'"$unique_id"'?token='"$token"
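
to poke at the ‘unsatisfiable’ case from the self-test by hand, enqueue a profile that no server is configured to target. the profile name here is made up, and i’m not promising the exact error status — only that the request is refused rather than queued, per the enqueue rules above:

  $ curl --fail-with-body -sSX POST --oauth2-bearer "$SERVO_CI_MONITOR_API_TOKEN" \
      'http://192.168.100.1:8002/profile/no-such-profile/enqueue?unique_id='"$RANDOM"'&qualified_repo=delan/servo&run_id=123'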

@delan delan force-pushed the queueing branch 2 times, most recently from f5eca96 to ba815c8 on November 7, 2025, 15:28
@delan delan changed the base branch from fix-jitconfig-leak to shopping November 8, 2025 12:39
delan added a commit that referenced this pull request Nov 10, 2025
to prepare us for the queueing patch (#69), this patch does a bit of
refactoring and fixes a couple of bugs:
- we now flush the dashboard after taking runners, so we don’t mislead
clients into thinking runners that were taken are still idle. the queue
service relies on this to avoid prematurely dequeuing and forwarding a
queued job.
- the `destroy_all_non_busy_runners` setting now correctly zeroes out
the target counts for all profiles, since it implies
`dont_create_runners`. the queue service relies on this to reject
unsatisfiable requests.

while we’re at it, let’s make the dashboard tolerate and recover from
errors. you should never need to reload the page anymore, unless you’re
expecting a CSS/JS update. see below for what happens on HTTP 503 (a
normal consequence of flushing the dashboard), and what happens on other
request errors.

<img width="640" height="200" alt="image"
src="https://github.com/user-attachments/assets/6c5e2464-08ff-4f0d-8c5e-f1bc03121157"
/>

<img width="640" height="200" alt="image"
src="https://github.com/user-attachments/assets/c2acc24d-e85b-48e5-9b01-e9c0ba1e823a"
/>
@delan delan changed the base branch from shopping to main November 10, 2025 12:07
Comment on lines +471 to +472
// If we can find a server with idle runners for the requested profile, forward the
// request to the queue thread of that server.
delan (Member, Author) commented:

for this to work reliably and not prematurely dequeue jobs, the monitors should only accept requests from the queue, and stop accepting requests from workflows. i think we can do that by disabling the tokenless select endpoint in the monitor.

@delan delan requested a review from jschwe November 10, 2025 12:16
@delan delan marked this pull request as ready for review November 10, 2025 12:16
@delan delan requested a review from sagudev as a code owner November 10, 2025 12:16
Comment on lines +111 to +112
# Use the queue API to reserve a runner. If we get an object with
# runner details, we succeeded. If we get null, we failed.
delan (Member, Author) commented:

the object/null comments belong on the take request, not on the enqueue request

delan commented Nov 13, 2025

this should be ready for review! @sagudev @jschwe
