feat(metrics): expose queue guarantee resource metrics #5278

Open
Aman-Cool wants to merge 1 commit into volcano-sh:master from Aman-Cool:feat/queue-guarantee-metrics
@Aman-Cool
Contributor

@Aman-Cool Aman-Cool commented May 6, 2026

What type of PR is this?
/kind feature

What this PR does / why we need it:

guarantee is the one queueAttr field that represents a contractual SLA: in proportion it sets the floor for deserved, and in capacity it stops reclaim from evicting a queue below its minimum. Yet it was never exported to Prometheus. Every other field that drives a scheduling decision has a metric; this one did not.

Adds volcano_queue_guarantee_{milli_cpu,memory_bytes,scalar_resources}, emitted each session from both plugins. Queues with no jobs still get the metric, read straight from the queue spec; this is the same pattern capacity's no-attr branch already uses for realCapacity. Hierarchical queues get it for free, since attr.guarantee has already accumulated child guarantees by the time the metrics loop runs. DeleteQueueMetrics cleans up the new series.
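To illustrate why hierarchical queues need no extra handling, here is a minimal, self-contained sketch of bottom-up guarantee accumulation. It is not the plugin's code: the `Resource` and `Queue` types and the `effectiveGuarantee` helper are hypothetical, and it assumes a parent's effective guarantee is simply the sum of its children's.

```go
package main

import "fmt"

// Resource is a simplified, hypothetical stand-in for the scheduler's resource type.
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// Add accumulates other into r.
func (r *Resource) Add(other Resource) {
	r.MilliCPU += other.MilliCPU
	r.Memory += other.Memory
}

// Queue is a hypothetical node in a queue hierarchy.
type Queue struct {
	Name      string
	Guarantee Resource // guarantee declared on this queue's own spec
	Children  []*Queue
}

// effectiveGuarantee walks the tree bottom-up and records one value per queue.
// Assumption: a leaf uses its declared guarantee, and a parent's effective
// guarantee is the sum of its children's effective guarantees.
func effectiveGuarantee(q *Queue, out map[string]Resource) Resource {
	if len(q.Children) == 0 {
		out[q.Name] = q.Guarantee
		return q.Guarantee
	}
	var total Resource
	for _, c := range q.Children {
		total.Add(effectiveGuarantee(c, out))
	}
	out[q.Name] = total
	return total
}

func main() {
	root := &Queue{Name: "root", Children: []*Queue{
		{Name: "team-a", Guarantee: Resource{MilliCPU: 2000, Memory: 4 << 30}},
		{Name: "team-b", Guarantee: Resource{MilliCPU: 1000, Memory: 2 << 30}},
	}}
	out := map[string]Resource{}
	effectiveGuarantee(root, out)
	// A metrics loop that runs after accumulation can export out[name] as-is.
	for _, name := range []string{"root", "team-a", "team-b"} {
		fmt.Printf("%s milli_cpu=%.0f memory_bytes=%.0f\n",
			name, out[name].MilliCPU, out[name].Memory)
	}
}
```

Because the accumulated value is already stored per queue, the metrics step only has to read it, which is the property the description relies on.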

Which issue(s) this PR fixes:
Fixes #NA

Special notes for your reviewer:

The no-attr branch in proportion now does a small spec read to get the guarantee for queues with no jobs. Capacity's no-attr branch already does the same for realCapacity at the equivalent spot, so the pattern is consistent. Everything else is a straight copy of the inqueue metric shape.
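The no-attr fallback described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: `QueueSpec`, `queueAttr`, and `guaranteeFor` are simplified, hypothetical names, and the real plugin types carry many more fields.

```go
package main

import "fmt"

// Resource is a simplified, hypothetical resource type.
type Resource struct{ MilliCPU, Memory float64 }

// QueueSpec is a hypothetical, reduced view of a queue object's spec.
type QueueSpec struct {
	Guarantee *Resource // nil when the queue declares no guarantee
}

// queueAttr is a simplified stand-in for the per-session queue attribute,
// which (by assumption here) exists only for queues that have jobs.
type queueAttr struct{ guarantee Resource }

// guaranteeFor picks the value to export for one queue:
// the session attribute if present, otherwise the spec, otherwise zero.
func guaranteeFor(name string, attrs map[string]*queueAttr, specs map[string]QueueSpec) Resource {
	if attr, ok := attrs[name]; ok {
		return attr.guarantee
	}
	if spec, ok := specs[name]; ok && spec.Guarantee != nil {
		return *spec.Guarantee
	}
	return Resource{}
}

func main() {
	attrs := map[string]*queueAttr{
		"busy": {guarantee: Resource{MilliCPU: 4000, Memory: 8 << 30}},
	}
	specs := map[string]QueueSpec{
		"busy": {Guarantee: &Resource{MilliCPU: 4000, Memory: 8 << 30}},
		"idle": {Guarantee: &Resource{MilliCPU: 1000, Memory: 1 << 30}},
		"bare": {}, // no guarantee declared
	}
	for _, name := range []string{"busy", "idle", "bare"} {
		fmt.Printf("%s -> %+v\n", name, guaranteeFor(name, attrs, specs))
	}
}
```

Falling back to a zero value for a queue without a declared guarantee is a choice made for this sketch; the PR text does not say how that case is handled.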

AI Disclosure: This change was developed with AI assistance (Claude). The author has reviewed and understands all changes.

Does this PR introduce a user-facing change?

Added Prometheus metrics for queue guarantee resources: volcano_queue_guarantee_milli_cpu, volcano_queue_guarantee_memory_bytes, and volcano_queue_guarantee_scalar_resources. Emitted by both proportion and capacity plugins.

Copilot AI review requested due to automatic review settings May 6, 2026 23:46
@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label May 6, 2026
@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 6, 2026
@Aman-Cool Aman-Cool force-pushed the feat/queue-guarantee-metrics branch from c861d14 to 387e46c Compare May 6, 2026 23:47
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces new Prometheus metrics to track guaranteed resources (CPU, memory, and scalar resources) for queues in the Volcano scheduler. It adds the UpdateQueueGuarantee function to the metrics package, integrates it into the capacity and proportion plugins, and includes comprehensive unit tests to verify the recording and deletion of these metrics. I have no feedback to provide.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds Prometheus gauge metrics to expose per-queue guarantee resources (CPU, memory, and scalar resources), so that the queue’s minimum/SLA-like floor is observable alongside other scheduling-driving queue attributes.

Changes:

  • Add new metrics: volcano_queue_guarantee_milli_cpu, volcano_queue_guarantee_memory_bytes, volcano_queue_guarantee_scalar_resources.
  • Emit guarantee metrics from both proportion and capacity scheduler plugins (including “no jobs in queue” paths).
  • Ensure DeleteQueueMetrics also removes the new guarantee metrics, and extend unit tests to validate presence and cleanup.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pkg/scheduler/plugins/proportion/proportion.go Emit queue guarantee metrics for queues with and without in-session attributes.
pkg/scheduler/plugins/capacity/capacity.go Emit queue guarantee metrics for both flat and hierarchical queue attribute builders.
pkg/scheduler/metrics/queue.go Define guarantee gauge vectors, add UpdateQueueGuarantee, and delete guarantee metrics in DeleteQueueMetrics.
pkg/scheduler/metrics/queue_test.go Extend existing HTTP-scrape metrics test to assert guarantee metrics are emitted and deleted.
pkg/scheduler/metrics/queue_scalar_test.go Add scalar-resource behavior test coverage for UpdateQueueGuarantee (zeroing and cleanup).
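The "zeroing and cleanup" behavior named in the last row can be pictured with an in-memory stand-in for a labelled gauge. This is only a sketch: `scalarGauge`, `update`, and `deleteQueue` are hypothetical, it does not use the Prometheus client, and it assumes "zeroing" means that a scalar resource reported earlier but absent from the current update is set to 0 rather than left stale.

```go
package main

import "fmt"

// key identifies one series by queue name and scalar resource name.
type key struct{ queue, resource string }

// scalarGauge is an in-memory stand-in for a gauge vector with two labels.
type scalarGauge struct{ values map[key]float64 }

func newScalarGauge() *scalarGauge {
	return &scalarGauge{values: map[key]float64{}}
}

// update writes the current scalar values for a queue and zeroes any series
// previously recorded for that queue that is missing from current.
func (g *scalarGauge) update(queue string, current map[string]float64) {
	for k := range g.values {
		if k.queue != queue {
			continue
		}
		if _, ok := current[k.resource]; !ok {
			g.values[k] = 0 // resource disappeared: report 0, not a stale value
		}
	}
	for res, v := range current {
		g.values[key{queue, res}] = v
	}
}

// deleteQueue removes every series that belongs to the queue.
func (g *scalarGauge) deleteQueue(queue string) {
	for k := range g.values {
		if k.queue == queue {
			delete(g.values, k)
		}
	}
}

func main() {
	g := newScalarGauge()
	g.update("q1", map[string]float64{"nvidia.com/gpu": 4})
	g.update("q1", map[string]float64{}) // the GPU guarantee was removed
	fmt.Println("after zeroing:", g.values)
	g.deleteQueue("q1")
	fmt.Println("series left after delete:", len(g.values))
}
```

Zeroing keeps dashboards from showing a guarantee that no longer exists, while deletion removes the series entirely once the queue itself is gone.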

@Aman-Cool Aman-Cool force-pushed the feat/queue-guarantee-metrics branch from 387e46c to ab3aedc Compare May 6, 2026 23:54
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign thor-wl for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Aman-Cool
Contributor Author

@hajnalmt @JesseStutler @hzxuzhonghu, one thing I want to flag proactively: the tests only verify the metrics layer (UpdateQueueGuarantee sets and deletes correctly); they don't exercise the actual call sites in proportion.go and capacity.go.
A typo at the plugin level (wrong field, wrong variable) wouldn't be caught. This is consistent with how every other queue metric is tested in the repo, so I kept it that way here, but it's a real gap. Happy to add plugin-level tests in a follow-up, or in this PR if you'd prefer it addressed now 😄.

@Aman-Cool Aman-Cool force-pushed the feat/queue-guarantee-metrics branch from ab3aedc to b2e2abf Compare May 7, 2026 00:10
@Aman-Cool Aman-Cool requested a review from Copilot May 7, 2026 00:10
@Aman-Cool
Contributor Author

/assign @hajnalmt

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread docs/design/metrics.md
Comment thread docs/design/metrics.md
Add volcano_queue_guarantee_{milli_cpu,memory_bytes,scalar_resources}
gauges emitted each session by proportion and capacity plugins.

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool Aman-Cool force-pushed the feat/queue-guarantee-metrics branch from b2e2abf to 8926895 Compare May 7, 2026 00:16
@Aman-Cool
Contributor Author

/assign @JesseStutler

Labels

kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants