feat(scheduler): support resource requirement in priority plugin#442

Open
jinzhejz wants to merge 1 commit into xflops:main from jinzhejz:fsm_enhance

Conversation

@jinzhejz jinzhejz commented May 9, 2026

feat(scheduler): replace slots with ResourceRequirement everywhere

  • The slots: u32 field is removed from SessionAttributes, Session, Executor (and their proto / SDK / DB-schema counterparts) and replaced by an unconditionally-populated resreq: ResourceRequirement. The cluster-wide default moves from cluster.slot to cluster.resreq, resolved server-side via a new three-level fallback (explicit → cluster.resreq → hardcoded {cpu:1, mem:1GiB, gpu:0}).

  • PriorityPlugin distributes per-resource-dimension capacity rather than a scalar total_slots, backed by a new set of arithmetic/comparison helpers on ResourceRequirement (mul, min, max, add, sub, equal, less[_equal], great[_equal]). The deprecated fairshare plugin is deleted and drf takes its place in the default policy list.

  • Breaking changes: proto field SessionSpec.slots/ExecutorSpec.slots reserved; flmctl create --slots removed; cluster.slot YAML key renamed to cluster.resreq; default policy fairshare → drf.
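
The three-level fallback described above can be sketched as follows. This is a hypothetical illustration with assumed field names, types, and units, not the PR's actual code:

```rust
// Hypothetical sketch of the three-level resreq resolution
// (explicit -> cluster.resreq -> hardcoded default); field names,
// units, and types are assumptions, not the PR's actual code.
#[derive(Clone, Debug, PartialEq)]
pub struct ResourceRequirement {
    pub cpu: u64,    // number of CPUs (assumed unit)
    pub memory: u64, // bytes
    pub gpu: i32,
}

const GIB: u64 = 1 << 30;

pub fn resolve_resreq(
    explicit: Option<&ResourceRequirement>,
    cluster_default: Option<&ResourceRequirement>,
) -> ResourceRequirement {
    explicit
        .or(cluster_default)
        .cloned()
        // Hardcoded fallback from the PR description: {cpu:1, mem:1GiB, gpu:0}.
        .unwrap_or(ResourceRequirement { cpu: 1, memory: GIB, gpu: 0 })
}
```

Because resolution happens server-side, clients that send no explicit requirement still see a fully populated resreq on every session.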


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a cluster-wide default resource_requirement for sessions and transitions the scheduling logic, specifically in the PriorityPlugin, from a single-dimensional slot-based approach to a multi-dimensional ResourceRequirement (CPU, Memory, GPU). It includes updates to the core API types, configuration parsing, CLI tools, and SDK to support explicit resource requests and their server-side resolution. Feedback was provided regarding potential integer overflow and wrapping issues in the newly added ResourceRequirement::mul method, recommending the use of saturating arithmetic and safer type casting.
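
As a rough illustration of the shift from a scalar slot count to per-dimension checks, a request now fits capacity only if every dimension fits. A minimal sketch of what such comparison helpers might look like (assumed names and types, not necessarily the PR's code):

```rust
// Minimal sketch of multi-dimensional comparison helpers; the actual
// ResourceRequirement in common/src/apis/types.rs may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ResourceRequirement {
    pub cpu: u64,
    pub memory: u64,
    pub gpu: i32,
}

impl ResourceRequirement {
    // A request fits only if EVERY dimension fits, unlike a scalar
    // slots comparison where a single number decides.
    pub fn less_equal(&self, other: &Self) -> bool {
        self.cpu <= other.cpu && self.memory <= other.memory && self.gpu <= other.gpu
    }

    // Element-wise minimum, useful when clamping a request to capacity.
    pub fn min(&self, other: &Self) -> Self {
        Self {
            cpu: self.cpu.min(other.cpu),
            memory: self.memory.min(other.memory),
            gpu: self.gpu.min(other.gpu),
        }
    }
}
```

Note that `less_equal` induces only a partial order: two requirements can each exceed the other in different dimensions, which is why DRF-style policies compare dominant shares instead.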

Comment thread: common/src/apis/types.rs
Comment on lines +488 to +494
pub fn mul(&self, n: u32) -> Self {
    Self {
        cpu: self.cpu * (n as u64),
        memory: self.memory * (n as u64),
        gpu: self.gpu * (n as i32),
    }
}

Severity: high

The mul method can panic on overflow in debug builds or wrap around in release builds, which can lead to incorrect resource calculations. Specifically:

  1. The multiplication self.cpu * (n as u64) and self.memory * (n as u64) can overflow if the result exceeds u64::MAX.
  2. The cast n as i32 for gpu multiplication can wrap for n > i32::MAX, resulting in a negative multiplier. This is likely not intended for resource calculations.
  3. The multiplication self.gpu * (n as i32) can also overflow.

I recommend using saturating_mul to prevent overflows by clamping the result at the maximum value of the type. Also, the cast from u32 to i32 should be handled safely to avoid wrapping.

Suggested change — before:

pub fn mul(&self, n: u32) -> Self {
    Self {
        cpu: self.cpu * (n as u64),
        memory: self.memory * (n as u64),
        gpu: self.gpu * (n as i32),
    }
}

After:

pub fn mul(&self, n: u32) -> Self {
    let n_u64 = n as u64;
    // Safely convert u32 to i32 for multiplication, clamping at i32::MAX
    // to avoid overflow and negative results from wrapping.
    let n_i32 = i32::try_from(n).unwrap_or(i32::MAX);
    Self {
        cpu: self.cpu.saturating_mul(n_u64),
        memory: self.memory.saturating_mul(n_u64),
        gpu: self.gpu.saturating_mul(n_i32),
    }
}
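
To make the reviewer's point concrete, here is a small standalone demonstration of wrapping versus saturating behavior (illustrative only, separate from the PR's code):

```rust
// Illustrates the failure modes flagged above: plain `*` wraps in
// release builds, and `n as i32` wraps negative for large n, while
// saturating_mul and try_from-with-clamp stay well-defined.
pub fn scale_saturating(cpu: u64, n: u32) -> u64 {
    // Clamps at u64::MAX instead of wrapping to a small value.
    cpu.saturating_mul(n as u64)
}

pub fn gpu_multiplier(n: u32) -> i32 {
    // Clamp rather than wrap when converting u32 -> i32.
    i32::try_from(n).unwrap_or(i32::MAX)
}
```

For example, `(u64::MAX / 2 + 1) * 2` wraps to 0 under two's-complement arithmetic, whereas `saturating_mul` returns `u64::MAX`; likewise `u32::MAX as i32` wraps to -1, while the clamped conversion yields `i32::MAX`.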

@jinzhejz jinzhejz force-pushed the fsm_enhance branch 2 times, most recently from 39e5a36 to a9a37a5, on May 11, 2026 07:07
@jinzhejz (Author)

Update flmping perf:
5f551b8fe2a912730a0519edb24bf61b
