
[ray-operator] Add Pod cache selector to limit informer cache to Kube…#4635

Open
shihshaoni wants to merge 2 commits into ray-project:master from shihshaoni:perf/add-pod-cache-selector

Conversation

@shihshaoni

Why are these changes needed?

Currently, the KubeRay operator's informer cache watches and caches all Pod resources in the cluster when watching all namespaces, even though KubeRay only needs to manage Pods labeled with app.kubernetes.io/created-by=kuberay-operator. This causes unnecessary memory consumption, especially in large-scale clusters with thousands of Pods.

Related issue number

Closes #4625

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Contributor

@AndySung320 left a comment


LGTM, thanks for the improvement!
Would you mind adding a minimal e2e test to verify that the Pods returned from the cache all carry the expected label (app.kubernetes.io/created-by=kuberay-operator)?
Also, could you please fix the lint errors?

@rueian
Collaborator

rueian commented Mar 23, 2026

Hi, thanks for the contribution. Please also make the lint pass. https://github.com/ray-project/kuberay/actions/runs/23402753349/job/68240318053?pr=4635

"os"
"strings"

corev1 "k8s.io/api/core/v1"
Contributor


Thanks for contributing! You can install the pre-commit hook:

https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#pre-commit-hooks

so the linter will be run automatically on commits.

Member

@andrewsykim left a comment


LGTM, thanks for the optimization! Can you also update the comment here to include Pods?

https://github.com/shihshaoni/kuberay/blob/473da77436381f3e1d234896bc773e9d65df7287/ray-operator/main.go#L220


@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


return map[client.Object]cache.ByObject{
&batchv1.Job{}: {Label: selector},
&corev1.Pod{}: {Label: selector},


Overridable cache-critical label can hide Pods from operator

Medium Severity

The new Pod cache selector filters on KubernetesCreatedByLabelKey, but labelPod doesn't protect this label from user override — only ray.io/node-type, ray.io/group, and ray.io/cluster are guarded. If a user specifies a different value for app.kubernetes.io/created-by in their pod template labels, the created Pod becomes invisible to the informer cache. The operator would then repeatedly create new Pods it can never observe, causing unbounded Pod creation. Unlike the existing Job cache filter (where the label is set directly by the operator with no override path), the Pod path passes user-provided labels through labelPod's override loop.


Should(WithTransform(RayClusterState, Equal(rayv1.Ready)))

// Verify all Pods carry the kuberay-operator label
pods, err := test.Client().Core().CoreV1().Pods(namespace.Name).List(test.Ctx(), metav1.ListOptions{})
Member


I think this test would be improved significantly if we created a Pod that did not have the KubeRay label and proved that KubeRay does not see it in its informer cache. Otherwise we are just testing that the labels are applied correctly.



Development

Successfully merging this pull request may close these issues.

[perf] [raycluster] Add cache selector to limit Pod caching to KubeRay-managed Pods only
