queue: initial queue service #69
f5eca96 to ba815c8
to prepare us for the queueing patch (#69), this patch does a bit of refactoring and fixes a couple of bugs:

- we now flush the dashboard after taking runners, so we don’t mislead clients into thinking runners that were taken are still idle. the queue service relies on this to avoid prematurely dequeuing and forwarding a queued job.
- the `destroy_all_non_busy_runners` setting now correctly zeroes out the target counts for all profiles, since it implies `dont_create_runners`. the queue service relies on this to reject unsatisfiable requests.

while we’re at it, let’s make the dashboard tolerate and recover from errors. you should never need to reload the page anymore, unless you’re expecting a CSS/JS update. see below for what happens on HTTP 503 (a normal consequence of flushing the dashboard), and what happens on other request errors.

<img width="640" height="200" alt="image" src="https://github.com/user-attachments/assets/6c5e2464-08ff-4f0d-8c5e-f1bc03121157" />
<img width="640" height="200" alt="image" src="https://github.com/user-attachments/assets/c2acc24d-e85b-48e5-9b01-e9c0ba1e823a" />
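as a rough illustration of the second fix, here is a minimal sketch of how the two settings could feed into target counts and the "unsatisfiable" check. the `Settings` struct and both function names are hypothetical and only the two setting names come from the patch description, so treat this as a model of the idea rather than the actual monitor code.

```rust
/// Hypothetical subset of the monitor settings named in this patch.
struct Settings {
    dont_create_runners: bool,
    destroy_all_non_busy_runners: bool,
}

/// Target runner count for one profile after applying the settings.
/// Assumption: either flag zeroes the target, because
/// `destroy_all_non_busy_runners` implies `dont_create_runners`.
fn effective_target_count(configured: usize, settings: &Settings) -> usize {
    if settings.dont_create_runners || settings.destroy_all_non_busy_runners {
        0
    } else {
        configured
    }
}

/// Assumption: a request is unsatisfiable when no server has a non-zero
/// target count for the requested profile.
fn is_satisfiable(target_counts_across_servers: &[usize]) -> bool {
    target_counts_across_servers.iter().any(|&count| count > 0)
}
```

with this model, a queue can reject a request up front when every server reports a target count of zero for the profile, instead of letting the job wait forever.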
// If we can find a server with idle runners for the requested profile, forward the
// request to the queue thread of that server.
for this to work reliably and not prematurely dequeue jobs, the monitors should only accept requests from the queue, and stop accepting requests from workflows. i think we can do that by disabling the tokenless select endpoint in the monitor.
# Use the queue API to reserve a runner. If we get an object with
# runner details, we succeeded. If we get null, we failed.
the object/null comments belong on the take request, not on the enqueue request
currently our self-hosted runner system falls back to github-hosted runners if there’s no available capacity at the exact moment of the select runner request. this is suboptimal, because if the job would take 5x as long on github-hosted runners, then you could wait up to 80% of that time for a self-hosted runner and still win.
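the 80% figure follows from simple arithmetic: if a job takes T on a self-hosted runner and 5T on a github-hosted runner, waiting up to 4T (80% of 5T) for a self-hosted runner still finishes no later. a small sketch of that calculation, with hypothetical function and parameter names:

```rust
/// Longest time a job can wait for a self-hosted runner and still finish
/// no later than it would on a github-hosted runner.
///
/// `self_hosted_secs`: expected job duration on a self-hosted runner.
/// `slowdown`: how many times longer the job takes on a github-hosted
/// runner (hypothetical parameter, e.g. 5.0 for the "5x" case).
fn break_even_wait_secs(self_hosted_secs: f64, slowdown: f64) -> f64 {
    // github-hosted duration minus self-hosted duration
    self_hosted_secs * slowdown - self_hosted_secs
}

/// The same limit, expressed as a fraction of the github-hosted duration.
fn break_even_wait_fraction(slowdown: f64) -> f64 {
    1.0 - 1.0 / slowdown
}
```

for example, a job that takes 600 seconds self-hosted and 3000 seconds github-hosted can wait up to 2400 seconds, which is 80% of the github-hosted time.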
this patch implements a new global queue service that allows self-hosted runner jobs to wait for available capacity. the service will run on one server for now, as a single queue that dispatches to all available servers, like any efficient supermarket. queueing a job works like this:
- `POST /profile/<profile_key>/enqueue?<unique_id>&<qualified_repo>&<run_id>` (tokenful) or `POST /enqueue?<unique_id>&<qualified_repo>&<run_id>` (tokenless) to enqueue a job.
- `POST /take/<unique_id>?<token>` to try to take the runner for the enqueued job. once capacity is available, this endpoint is effectively proxied to `POST /profile/<profile_key>/take` on one of the underlying servers.
`null` as JSON. this sucks, to be honest, but it was also true for the underlying monitor API.

i’ve added a “self-test” workflow that can be manually dispatched to test the new flow (e.g. ok 1, ok 2, ok 3, unsatisfiable, unauthorised). you can also play around with this locally by spinning up a monitor and a queue on your own machine, then sending the requests by hand (so three separate terminals):
$ cargo build && sudo IMAGE_DEPS_DIR=$(nix eval --raw .\#image-deps) LIB_MONITOR_DIR=. $CARGO_TARGET_DIR/debug/monitor
$ cargo build && sudo IMAGE_DEPS_DIR=$(nix eval --raw .\#image-deps) LIB_MONITOR_DIR=. $CARGO_TARGET_DIR/debug/queue
$ unique_id=$RANDOM; curl --fail-with-body -sSX POST --retry-max-time 3600 --retry 3600 --retry-delay 1 'http://192.168.100.1:8002/take/'"$unique_id"'?token='"$(curl --fail-with-body -sSX POST --oauth2-bearer "$SERVO_CI_MONITOR_API_TOKEN" 'http://192.168.100.1:8002/profile/servo-windows10/enqueue?unique_id='"$unique_id"'&qualified_repo=delan/servo&run_id=123')"
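
the enqueue-then-take flow exercised by the commands above can be modelled roughly as follows. this is an in-memory sketch under stated assumptions, not the queue service's actual code: the `Queue`, `QueuedJob`, and `TakeResult` types, the counter-based token, and the shape of the idle-runner map are all hypothetical, and ordering between queued jobs is left out. what it does take from the description is that enqueue hands back a token, and that take yields either runner details or `null`.

```rust
use std::collections::HashMap;

/// Outcome of a take request, mirroring the "object or null" responses.
#[derive(Debug, PartialEq)]
enum TakeResult {
    /// A runner was reserved on the named server (a JSON object in the API).
    Runner { server: String },
    /// No capacity yet, so the client should retry (JSON null in the API).
    NotYet,
    /// Unknown unique_id or wrong token.
    Unauthorised,
}

struct QueuedJob {
    profile_key: String,
    token: String,
}

#[derive(Default)]
struct Queue {
    /// unique_id -> queued job
    jobs: HashMap<String, QueuedJob>,
    /// server name -> profile key -> idle runner count (hypothetical shape)
    idle: HashMap<String, HashMap<String, usize>>,
    next_token: u64,
}

impl Queue {
    /// Enqueue a job and return the token needed to take its runner later.
    /// The counter-based token is a placeholder, not a secure token.
    fn enqueue(&mut self, unique_id: &str, profile_key: &str) -> String {
        self.next_token += 1;
        let token = format!("token-{}", self.next_token);
        let job = QueuedJob { profile_key: profile_key.to_string(), token: token.clone() };
        self.jobs.insert(unique_id.to_string(), job);
        token
    }

    /// Try to reserve a runner for a previously enqueued job.
    fn take(&mut self, unique_id: &str, token: &str) -> TakeResult {
        let profile_key = match self.jobs.get(unique_id) {
            Some(job) if job.token == token => job.profile_key.clone(),
            _ => return TakeResult::Unauthorised,
        };
        // Find any server with an idle runner for the requested profile.
        for (server, profiles) in self.idle.iter_mut() {
            if let Some(count) = profiles.get_mut(&profile_key) {
                if *count > 0 {
                    *count -= 1;
                    let server = server.clone();
                    self.jobs.remove(unique_id);
                    return TakeResult::Runner { server };
                }
            }
        }
        TakeResult::NotYet
    }
}
```

a client in this model calls `enqueue` once, then repeats `take` with the returned token until it stops getting `TakeResult::NotYet`, which is what the `--retry` flags in the curl command approximate.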