
[ray-operator] Add Pod cache selector to limit informer cache to Kube…#4635

Open
shihshaoni wants to merge 2 commits into ray-project:master from shihshaoni:perf/add-pod-cache-selector

Conversation

@shihshaoni

Why are these changes needed?

Currently, the KubeRay operator's informer cache watches and caches all Pod resources in the cluster when watching all namespaces, even though KubeRay only needs to manage Pods labeled with app.kubernetes.io/created-by=kuberay-operator. This causes unnecessary memory consumption, especially in large-scale clusters with thousands of Pods.

Related issue number

Closes #4625

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Contributor

@AndySung320 left a comment


LGTM, thanks for the improvement!
Would you mind adding a minimal e2e test to verify that the Pods returned from the cache all carry the expected label (app.kubernetes.io/created-by=kuberay-operator)?
Also, could you please fix the lint errors?

@rueian
Collaborator

rueian commented Mar 23, 2026

Hi, thanks for the contribution. Please also make the lint pass. https://github.com/ray-project/kuberay/actions/runs/23402753349/job/68240318053?pr=4635

"os"
"strings"

corev1 "k8s.io/api/core/v1"
Contributor


Thanks for contributing! You can install the pre-commit hook:

https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#pre-commit-hooks

so the linter will be run automatically on commits.

Member

@andrewsykim left a comment


LGTM, thanks for the optimization! Can you also update the comment here to include Pods?

https://github.com/shihshaoni/kuberay/blob/473da77436381f3e1d234896bc773e9d65df7287/ray-operator/main.go#L220


@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


return map[client.Object]cache.ByObject{
&batchv1.Job{}: {Label: selector},
&corev1.Pod{}: {Label: selector},


Overridable cache-critical label can hide Pods from operator

Medium Severity

The new Pod cache selector filters on KubernetesCreatedByLabelKey, but labelPod doesn't protect this label from user override — only ray.io/node-type, ray.io/group, and ray.io/cluster are guarded. If a user specifies a different value for app.kubernetes.io/created-by in their pod template labels, the created Pod becomes invisible to the informer cache. The operator would then repeatedly create new Pods it can never observe, causing unbounded Pod creation. Unlike the existing Job cache filter (where the label is set directly by the operator with no override path), the Pod path passes user-provided labels through labelPod's override loop.


Should(WithTransform(RayClusterState, Equal(rayv1.Ready)))

// Verify all Pods carry the kuberay-operator label
pods, err := test.Client().Core().CoreV1().Pods(namespace.Name).List(test.Ctx(), metav1.ListOptions{})
Member


I think this test would be improved significantly if we created a Pod that did not have the KubeRay label and proved that KubeRay does not see it in its informer cache. Otherwise we are just testing that the labels are applied correctly.



Development

Successfully merging this pull request may close these issues.

[perf] [raycluster] Add cache selector to limit Pod caching to KubeRay-managed Pods only
