feat(scheduler): support resource requirement in priority plugin#442
feat(scheduler): support resource requirement in priority plugin#442jinzhejz wants to merge 1 commit intoxflops:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a cluster-wide default resource_requirement for sessions and transitions the scheduling logic, specifically in the PriorityPlugin, from a single-dimensional slot-based approach to a multi-dimensional ResourceRequirement (CPU, Memory, GPU). It includes updates to the core API types, configuration parsing, CLI tools, and SDK to support explicit resource requests and their server-side resolution. Feedback was provided regarding potential integer overflow and wrapping issues in the newly added ResourceRequirement::mul method, recommending the use of saturating arithmetic and safer type casting.
| pub fn mul(&self, n: u32) -> Self { | ||
| Self { | ||
| cpu: self.cpu * (n as u64), | ||
| memory: self.memory * (n as u64), | ||
| gpu: self.gpu * (n as i32), | ||
| } | ||
| } |
There was a problem hiding this comment.
The mul method can panic on overflow in debug builds or wrap around in release builds, which can lead to incorrect resource calculations. Specifically:
- The multiplication
self.cpu * (n as u64)andself.memory * (n as u64)can overflow if the result exceedsu64::MAX. - The cast
n as i32forgpumultiplication can wrap forn > i32::MAX, resulting in a negative multiplier. This is likely not intended for resource calculations. - The multiplication
self.gpu * (n as i32)can also overflow.
I recommend using saturating_mul to prevent overflows by clamping the result at the maximum value of the type. Also, the cast from u32 to i32 should be handled safely to avoid wrapping.
| pub fn mul(&self, n: u32) -> Self { | |
| Self { | |
| cpu: self.cpu * (n as u64), | |
| memory: self.memory * (n as u64), | |
| gpu: self.gpu * (n as i32), | |
| } | |
| } | |
| pub fn mul(&self, n: u32) -> Self { | |
| let n_u64 = n as u64; | |
| // Safely convert u32 to i32 for multiplication, clamping at i32::MAX | |
| // to avoid overflow and negative results from wrapping. | |
| let n_i32 = i32::try_from(n).unwrap_or(i32::MAX); | |
| Self { | |
| cpu: self.cpu.saturating_mul(n_u64), | |
| memory: self.memory.saturating_mul(n_u64), | |
| gpu: self.gpu.saturating_mul(n_i32), | |
| } | |
| } |
39e5a36 to
a9a37a5
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |

feat(scheduler): replace slots with ResourceRequirement everywhere
The slots: u32 field is removed from SessionAttributes, Session, Executor (and their proto / SDK / DB-schema counterparts) and replaced by an unconditionally-populated resreq: ResourceRequirement. The cluster-wide default moves from cluster.slot to cluster.resreq, resolved server-side via a new three-level fallback (explicit → cluster.resreq → hardcoded {cpu:1, mem:1GiB, gpu:0}).
PriorityPlugin distributes per-resource-dimension capacity rather than a scalar total_slots, backed by a new set of arithmetic/comparison helpers on ResourceRequirement (mul, min, max, add, sub, equal, less[_equal], great[_equal]). The deprecated fairshare plugin is deleted and drf takes its place in the default policy list.
Breaking changes: proto field SessionSpec.slots/ExecutorSpec.slots reserved; flmctl create --slots removed; cluster.slot YAML key renamed to cluster.resreq; default policy fairshare → drf.