[ray-operator] Add Pod cache selector to limit informer cache to Kube…#4635
shihshaoni wants to merge 2 commits into ray-project:master
Conversation
Hi, thanks for the contribution. Please also make the lint pass. https://github.com/ray-project/kuberay/actions/runs/23402753349/job/68240318053?pr=4635
ray-operator/main.go
Outdated
	"os"
	"strings"

	corev1 "k8s.io/api/core/v1"
Thanks for contributing! You can install the pre-commit hook:
https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#pre-commit-hooks
so the linter will be run automatically on commits.
andrewsykim left a comment:
LGTM, thanks for the optimization! Can you also update the comment here to include Pods
	return map[client.Object]cache.ByObject{
		&batchv1.Job{}: {Label: selector},
		&corev1.Pod{}:  {Label: selector},
Overridable cache-critical label can hide Pods from operator
Medium Severity
The new Pod cache selector filters on KubernetesCreatedByLabelKey, but labelPod doesn't protect this label from user override — only ray.io/node-type, ray.io/group, and ray.io/cluster are guarded. If a user specifies a different value for app.kubernetes.io/created-by in their pod template labels, the created Pod becomes invisible to the informer cache. The operator would then repeatedly create new Pods it can never observe, causing unbounded Pod creation. Unlike the existing Job cache filter (where the label is set directly by the operator with no override path), the Pod path passes user-provided labels through labelPod's override loop.
		Should(WithTransform(RayClusterState, Equal(rayv1.Ready)))

	// Verify all Pods carry the kuberay-operator label
	pods, err := test.Client().Core().CoreV1().Pods(namespace.Name).List(test.Ctx(), metav1.ListOptions{})
I think this test would be improved significantly if we created a Pod that did not have the KubeRay label and proved that KubeRay does not see it in its informer cache. Otherwise we are just testing that the labels are applied correctly.
Why are these changes needed?
Currently, the KubeRay operator's informer cache watches and caches all Pod resources in the cluster when watching all namespaces, even though KubeRay only needs to manage Pods labeled with app.kubernetes.io/created-by=kuberay-operator. This causes unnecessary memory consumption, especially in large-scale clusters with thousands of Pods.
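For context, the change described here plugs a per-type label selector into the controller-runtime cache options, so the informer only lists and watches matching objects. The following is a minimal configuration sketch assuming a recent controller-runtime API (`cache.Options.ByObject`, v0.15+); it is not the actual KubeRay wiring and is not compiled here, since it depends on controller-runtime and the Kubernetes client libraries:

```go
package main

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Only cache Jobs and Pods created by the KubeRay operator; every
	// other Pod in the cluster is excluded from the informer cache,
	// which is where the memory saving comes from.
	selector := labels.SelectorFromSet(labels.Set{
		"app.kubernetes.io/created-by": "kuberay-operator",
	})
	opts := ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&batchv1.Job{}: {Label: selector},
				&corev1.Pod{}:  {Label: selector},
			},
		},
	}
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), opts)
	_ = mgr
	_ = err
}
```

The trade-off is that any Pod missing the label is invisible to the operator's cached client, which is exactly why the label must not be user-overridable (see the review finding above).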
Related issue number
Closes #4625
Checks