feat(metrics): expose queue guarantee resource metrics #5278
Aman-Cool wants to merge 1 commit into volcano-sh:master
Code Review
This pull request introduces new Prometheus metrics to track guaranteed resources (CPU, memory, and scalar resources) for queues in the Volcano scheduler. It adds the UpdateQueueGuarantee function to the metrics package, integrates it into the capacity and proportion plugins, and includes comprehensive unit tests to verify the recording and deletion of these metrics. I have no feedback to provide.
Pull request overview
This PR adds Prometheus gauge metrics to expose per-queue guarantee resources (CPU, memory, and scalar resources), so that the queue’s minimum/SLA-like floor is observable alongside other scheduling-driving queue attributes.
Changes:
- Add new metrics: `volcano_queue_guarantee_milli_cpu`, `volcano_queue_guarantee_memory_bytes`, `volcano_queue_guarantee_scalar_resources`.
- Emit guarantee metrics from both proportion and capacity scheduler plugins (including "no jobs in queue" paths).
- Ensure `DeleteQueueMetrics` also removes the new guarantee metrics, and extend unit tests to validate presence and cleanup.
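The update-and-cleanup lifecycle described in the list above can be sketched with a small in-memory stand-in for labeled gauges. This is not the Prometheus client and not the PR's code: `gaugeVec`, `queueGuaranteeMetrics`, `Update`, and `Delete` are hypothetical names, and resetting vanished scalar resources to zero is an assumption based on the "zeroing and cleanup" wording in the test description.

```go
package main

// gaugeVec is a minimal in-memory stand-in for a labeled gauge
// (hypothetical; the real metrics use Prometheus gauge vectors).
type gaugeVec map[string]float64

// queueGuaranteeMetrics holds one value per queue for CPU and memory,
// and one value per (queue, resource name) for scalar resources.
type queueGuaranteeMetrics struct {
	milliCPU gaugeVec            // queue name -> guaranteed milli-CPU
	memory   gaugeVec            // queue name -> guaranteed memory in bytes
	scalar   map[string]gaugeVec // queue name -> resource name -> value
}

func newQueueGuaranteeMetrics() *queueGuaranteeMetrics {
	return &queueGuaranteeMetrics{
		milliCPU: gaugeVec{},
		memory:   gaugeVec{},
		scalar:   map[string]gaugeVec{},
	}
}

// Update records the current guarantee for a queue.
func (m *queueGuaranteeMetrics) Update(queue string, milliCPU, memoryBytes float64, scalars map[string]float64) {
	m.milliCPU[queue] = milliCPU
	m.memory[queue] = memoryBytes

	series := m.scalar[queue]
	if series == nil {
		series = gaugeVec{}
		m.scalar[queue] = series
	}
	// Assumption: a scalar resource that is no longer guaranteed is reset
	// to zero so a stale value is not reported.
	for name := range series {
		if _, ok := scalars[name]; !ok {
			series[name] = 0
		}
	}
	for name, value := range scalars {
		series[name] = value
	}
}

// Delete removes every guarantee series belonging to a queue.
func (m *queueGuaranteeMetrics) Delete(queue string) {
	delete(m.milliCPU, queue)
	delete(m.memory, queue)
	delete(m.scalar, queue)
}
```

Calling `Update` once per scheduling session keeps the values current, and calling `Delete` when a queue is removed keeps stale series from lingering in the scrape output.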
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/scheduler/plugins/proportion/proportion.go | Emit queue guarantee metrics for queues with and without in-session attributes. |
| pkg/scheduler/plugins/capacity/capacity.go | Emit queue guarantee metrics for both flat and hierarchical queue attribute builders. |
| pkg/scheduler/metrics/queue.go | Define guarantee gauge vectors, add UpdateQueueGuarantee, and delete guarantee metrics in DeleteQueueMetrics. |
| pkg/scheduler/metrics/queue_test.go | Extend existing HTTP-scrape metrics test to assert guarantee metrics are emitted and deleted. |
| pkg/scheduler/metrics/queue_scalar_test.go | Add scalar-resource behavior test coverage for UpdateQueueGuarantee (zeroing and cleanup). |
[APPROVALNOTIFIER] This PR is NOT APPROVED.
@hajnalmt @JesseStutler @hzxuzhonghu, one thing I want to flag proactively: the tests only verify the metrics layer (…
/assign @hajnalmt
Add volcano_queue_guarantee_{milli_cpu,memory_bytes,scalar_resources}
gauges emitted each session by proportion and capacity plugins.
Signed-off-by: Aman-Cool <aman017102007@gmail.com>
/assign @JesseStutler
What type of PR is this?
/kind feature
What this PR does / why we need it:
`guarantee` is the one `queueAttr` field that actually represents a contractual SLA: it floors `deserved` in proportion and blocks reclaim from evicting a queue below its minimum in capacity. Yet it never made it into Prometheus. Every other field that drives a scheduling decision has a metric. This one didn't.
Adds `volcano_queue_guarantee_{milli_cpu,memory_bytes,scalar_resources}`, emitted each session from both plugins. Jobless queues still get the metric, read straight from the queue spec, the same pattern the no-attr branch in capacity already uses for `realCapacity`. Hierarchical queues get it for free, since `attr.guarantee` already accumulates child guarantees by the time we hit the metrics loop. `DeleteQueueMetrics` cleans it up.
Which issue(s) this PR fixes:
Fixes #NA
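To illustrate the two properties of `guarantee` mentioned in the description (it floors `deserved`, and a parent queue's guarantee accumulates its children's), here is a minimal sketch. The `Resource` type and the `Add`, `FloorAt`, and `SumGuarantees` helpers are hypothetical simplifications for illustration, not the scheduler's actual types or functions.

```go
package main

import "math"

// Resource is a simplified stand-in for the scheduler's resource type
// (hypothetical; the real type has more fields and helpers).
type Resource struct {
	MilliCPU float64
	Memory   float64
	Scalars  map[string]float64
}

// Add returns the per-dimension sum of r and o.
func (r Resource) Add(o Resource) Resource {
	out := Resource{
		MilliCPU: r.MilliCPU + o.MilliCPU,
		Memory:   r.Memory + o.Memory,
		Scalars:  map[string]float64{},
	}
	for name, v := range r.Scalars {
		out.Scalars[name] += v
	}
	for name, v := range o.Scalars {
		out.Scalars[name] += v
	}
	return out
}

// FloorAt returns r with every dimension raised to at least floor,
// which is how a guarantee acts as a lower bound on deserved.
func (r Resource) FloorAt(floor Resource) Resource {
	out := Resource{
		MilliCPU: math.Max(r.MilliCPU, floor.MilliCPU),
		Memory:   math.Max(r.Memory, floor.Memory),
		Scalars:  map[string]float64{},
	}
	for name, v := range r.Scalars {
		out.Scalars[name] = v
	}
	for name, v := range floor.Scalars {
		if v > out.Scalars[name] {
			out.Scalars[name] = v
		}
	}
	return out
}

// SumGuarantees accumulates child guarantees into a parent's guarantee.
func SumGuarantees(children []Resource) Resource {
	total := Resource{}
	for _, child := range children {
		total = total.Add(child)
	}
	return total
}
```

With a deserved share of 1000 milli-CPU and a guarantee of 2000, `FloorAt` yields 2000, so the exported guarantee metric shows the floor the queue is entitled to.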
Special notes for your reviewer:
The no-attr branch in proportion now does a small spec read to get the guarantee for queues with no jobs; capacity's no-attr branch already does the same thing for `realCapacity` at the equivalent spot, so the pattern is consistent. Everything else is a straight copy of the inqueue metric shape.
AI Disclosure: This change was developed with AI assistance (Claude). The author has reviewed and understands all changes.
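The fallback described in the reviewer notes can be sketched as a single lookup: use the in-session attribute when the queue has one, otherwise read the guarantee from the queue spec. All types and the `guaranteeFor` helper below are hypothetical stand-ins for illustration; the actual plugin code is structured differently.

```go
package main

// guarantee is a simplified resource pair (hypothetical stand-in).
type guarantee struct {
	MilliCPU float64
	Memory   float64
}

// queueSpec models the part of a queue's spec that carries its guarantee.
type queueSpec struct {
	Name      string
	Guarantee guarantee
}

// queueAttr models the in-session attributes built for queues with jobs.
type queueAttr struct {
	guarantee guarantee
}

// guaranteeFor picks the value to export for a queue: the in-session
// attribute if present, otherwise the value from the queue spec, so
// that jobless queues still report a guarantee.
func guaranteeFor(q queueSpec, attrs map[string]*queueAttr) guarantee {
	if attr, ok := attrs[q.Name]; ok && attr != nil {
		return attr.guarantee
	}
	return q.Guarantee
}
```

The point of the fallback is that a queue with no jobs never gets an in-session attribute, so without the spec read its guarantee would simply be missing from the metrics.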
Does this PR introduce a user-facing change?
Added Prometheus metrics for queue guarantee resources: `volcano_queue_guarantee_milli_cpu`, `volcano_queue_guarantee_memory_bytes`, and `volcano_queue_guarantee_scalar_resources`. Emitted by both proportion and capacity plugins.