[Feat] Helm: add support for per-model tolerations #897
AlexanderSing wants to merge 2 commits into vllm-project:main
Signed-off-by: Alexander Sing <AlexanderSing@live.de>
Code Review
This pull request introduces per-model tolerations for the vLLM multi-deployment Helm chart, allowing users to either append model-specific tolerations to global ones or override them entirely using a new tolerationsPolicy field. The changes include updates to the deployment template, schema validation, and a comprehensive test suite. Feedback was provided to simplify the template logic for calculating effective tolerations to improve readability and maintainability.
Signed-off-by: Alexander Sing <AlexanderSing@live.de>
ruizhang0101 left a comment
Thanks for the PR! The Ray deployment (ray-cluster.yaml) also uses tolerations. Would it be worth changing that as well?
Also, could you add this to the Helm README too?
# - nodeName: (optional) Directly assigns a pod to a specific node (e.g., "192.168.56.5"). When both nodeName and nodeSelectorTerms are defined, the preference is given to nodeName.
# - shmSize: (optional, string) The size of the shared memory, e.g., "20Gi"
# - enableLoRA: (optional, bool) Whether to enable LoRA, e.g., true
# - nodeName: (optional) Directly assigns a pod to a specific node (e.g., "192.168.56.5"). When both nodeName and nodeSelectorTerms are defined, the preference is given to nodeName.
The indentation seems a little bit off. Could you align it with the others?
},
"containerPort": {
"required": [
"enabled"
The indentation seems a bit off.
Summary
- Adds tolerations in modelSpec, allowing individual model deployments to have their own Kubernetes tolerations in addition to or instead of the global servingEngineSpec.tolerations
- Adds a tolerationsPolicy field per model spec with two modes: append (default) merges global + model tolerations, and override replaces global tolerations entirely for that model
- Updates values.schema.json, values.yaml, and values-example.yaml with documentation and schema definitions for the new fields
- Adds a test suite (tolerations_test.yaml) covering all policy combinations and edge cases
Motivation
In multi-model deployments, different models may run on different node pools with distinct taints. Previously, tolerations could only be set globally for all models. This change enables operators to schedule models on specific node types (e.g., different GPU tiers) without requiring separate Helm releases.
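As a rough illustration of this use case, a values file might look like the sketch below. Only servingEngineSpec.tolerations, modelSpec, tolerations, and tolerationsPolicy (with the values append and override) come from this PR description; the placement of modelSpec under servingEngineSpec, the model names, and the taint keys and values are assumptions made for the example, and other per-model fields are left out.

```yaml
# Sketch only: model names and taint keys/values are hypothetical.
servingEngineSpec:
  # Global tolerations, shared by every model unless a model overrides them.
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  modelSpec:
    - name: "model-a"
      # tolerationsPolicy is omitted, so the default ("append") applies:
      # this model gets the global toleration plus the one below.
      tolerations:
        - key: "gpu-tier"
          operator: "Equal"
          value: "high"
          effect: "NoSchedule"
    - name: "model-b"
      # "override" drops the global tolerations for this model only.
      tolerationsPolicy: "override"
      tolerations:
        - key: "gpu-tier"
          operator: "Equal"
          value: "low"
          effect: "NoSchedule"
```

With a layout like this, both models can live in one Helm release while landing on differently tainted node pools.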
Behavior
append policy (default); override policy; tolerations field rendered
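To make the two policies concrete, the fragments below sketch the effective tolerations a pod spec might end up with, assuming one global toleration and one model-level toleration. The taint keys and values are hypothetical, and the ordering under append (global entries first, then model entries) is an assumption based on the description of model tolerations being appended to global ones, not something confirmed by the rendered chart.

```yaml
# Assumed inputs (hypothetical keys):
#   global: nvidia.com/gpu Exists NoSchedule
#   model:  gpu-tier=high NoSchedule

# Effective tolerations with tolerationsPolicy: append (the default)
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "gpu-tier"
    operator: "Equal"
    value: "high"
    effect: "NoSchedule"
---
# Effective tolerations with tolerationsPolicy: override
tolerations:
  - key: "gpu-tier"
    operator: "Equal"
    value: "high"
    effect: "NoSchedule"
```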