[ray-operator] Add Pod cache selector to limit informer cache to Kube… #4635
base: master
```diff
@@ -264,3 +264,32 @@ func TestRayClusterUpgradeStrategy(t *testing.T) {
 	g.Expect(err).NotTo(HaveOccurred())
 	g.Expect(newWorkerPods).To(HaveLen(1))
 }
+
+func TestRayClusterPodCacheSelector(t *testing.T) {
+	test := With(t)
+	g := NewWithT(t)
+
+	// Create a namespace
+	namespace := test.NewTestNamespace()
+
+	rayClusterAC := rayv1ac.RayCluster("raycluster-pod-cache", namespace.Name).
+		WithSpec(NewRayClusterSpec())
+
+	rayCluster, err := test.Client().Ray().RayV1().RayClusters(namespace.Name).Apply(test.Ctx(), rayClusterAC, TestApplyOptions)
+	g.Expect(err).NotTo(HaveOccurred())
+	LogWithTimestamp(test.T(), "Created RayCluster %s/%s successfully", rayCluster.Namespace, rayCluster.Name)
+
+	LogWithTimestamp(test.T(), "Waiting for RayCluster %s/%s to become ready", rayCluster.Namespace, rayCluster.Name)
+	g.Eventually(RayCluster(test, namespace.Name, rayCluster.Name), TestTimeoutMedium).
+		Should(WithTransform(RayClusterState, Equal(rayv1.Ready)))
+
+	// Verify all Pods carry the kuberay-operator label
+	pods, err := test.Client().Core().CoreV1().Pods(namespace.Name).List(test.Ctx(), metav1.ListOptions{})
```
Member
I think this test would be improved significantly if we created a Pod that did not have the KubeRay label and proved that KubeRay does not see it in its informer cache. Otherwise we are just testing that the labels are applied correctly.

Contributor
The current test uses the Kubernetes API client directly, which bypasses the informer cache, so it only proves that labels are applied correctly, not that the cache selector is actually filtering as intended. Rather than adding this to the e2e test suite (where we don't have access to the manager's internal cache), I'd suggest adding an integration test in
```diff
+	g.Expect(err).NotTo(HaveOccurred())
+	g.Expect(pods.Items).NotTo(BeEmpty())
+
+	for _, pod := range pods.Items {
+		g.Expect(pod.Labels).To(HaveKeyWithValue(utils.KubernetesCreatedByLabelKey, utils.ComponentName),
+			"Pod %s/%s should have label %s=%s", pod.Namespace, pod.Name, utils.KubernetesCreatedByLabelKey, utils.ComponentName)
+	}
+}
```
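To make the reviewers' point concrete, here is a self-contained sketch (hypothetical names, not KubeRay code) of why a direct API List cannot exercise the cache selector: the direct path returns every Pod, while only a cache-filtered List would drop the unlabeled one.

```go
package main

import "fmt"

const createdByKey = "app.kubernetes.io/created-by" // assumed cache-selector label

type pod struct {
	name   string
	labels map[string]string
}

// listDirect mimics listing through the API server: every Pod is returned,
// so a label check on the result never exercises the informer's selector.
func listDirect(pods []pod) []pod { return pods }

// listCached mimics listing through a cache whose selector requires
// createdByKey to be present, as the Pod cache selector in this PR does.
func listCached(pods []pod) []pod {
	var out []pod
	for _, p := range pods {
		if _, ok := p.labels[createdByKey]; ok {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{name: "ray-head", labels: map[string]string{createdByKey: "kuberay-operator"}},
		{name: "unrelated", labels: map[string]string{}},
	}
	// The unlabeled Pod is visible to the direct client but not to the cache.
	fmt.Println(len(listDirect(pods)), len(listCached(pods))) // 2 1
}
```

An integration test with access to the manager's cache could assert exactly this difference: the unlabeled Pod is retrievable via the API client but absent from the cached client.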
Overridable cache-critical label can hide Pods from operator
Medium Severity
The new Pod cache selector filters on KubernetesCreatedByLabelKey, but labelPod doesn't protect this label from user override: only ray.io/node-type, ray.io/group, and ray.io/cluster are guarded. If a user specifies a different value for app.kubernetes.io/created-by in their pod template labels, the created Pod becomes invisible to the informer cache. The operator would then repeatedly create new Pods it can never observe, causing unbounded Pod creation. Unlike the existing Job cache filter (where the label is set directly by the operator with no override path), the Pod path passes user-provided labels through labelPod's override loop.
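One possible fix is to add the created-by key to the guarded set so user-supplied values can never override it. The sketch below is a self-contained stand-in for labelPod's merge loop; the constants and map names are illustrative assumptions, not KubeRay's actual code.

```go
package main

import "fmt"

// Assumed stand-ins for utils.KubernetesCreatedByLabelKey and utils.ComponentName.
const (
	createdByLabelKey = "app.kubernetes.io/created-by"
	componentName     = "kuberay-operator"
)

// reservedLabels are operator-owned keys that user pod templates must not override.
// Adding createdByLabelKey mirrors the existing guard on the ray.io/* labels.
var reservedLabels = map[string]bool{
	"ray.io/node-type": true,
	"ray.io/group":     true,
	"ray.io/cluster":   true,
	createdByLabelKey:  true, // newly guarded so the cache selector stays reliable
}

// mergeLabels applies user-provided labels on top of operator defaults,
// skipping any key the operator reserves for itself.
func mergeLabels(operatorLabels, userLabels map[string]string) map[string]string {
	out := make(map[string]string, len(operatorLabels)+len(userLabels))
	for k, v := range operatorLabels {
		out[k] = v
	}
	for k, v := range userLabels {
		if reservedLabels[k] {
			continue // ignore user attempts to override cache-critical labels
		}
		out[k] = v
	}
	return out
}

func main() {
	operator := map[string]string{createdByLabelKey: componentName, "ray.io/cluster": "demo"}
	user := map[string]string{createdByLabelKey: "someone-else", "team": "ml"}
	merged := mergeLabels(operator, user)
	fmt.Println(merged[createdByLabelKey]) // kuberay-operator
	fmt.Println(merged["team"])            // ml
}
```

With the key guarded, every operator-created Pod is guaranteed to match the cache selector regardless of the user's pod template.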
Based on the code, the KubernetesCreatedByLabelKey label can indeed be overwritten by the user. Since the ray.io/node-type, ray.io/group, and ray.io/cluster labels cannot be overridden by the user's RayCluster labels, and they are set on both worker Pods and head Pods, the cache can filter on any one of them instead. This can significantly reduce the operator's cache size. An example of the modification is as follows:
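The example itself was cut off in the thread. A minimal sketch of the idea, assuming the cache filters on the mere presence of the ray.io/cluster label (which every operator-created head and worker Pod carries and which users cannot override), might model the visibility semantics like this; the function name is hypothetical:

```go
package main

import "fmt"

const clusterLabelKey = "ray.io/cluster" // guarded label set on every KubeRay-created Pod

// cacheSees mimics a label-existence selector on the informer cache:
// a Pod is cached iff it carries the ray.io/cluster label, regardless of value.
func cacheSees(podLabels map[string]string) bool {
	_, ok := podLabels[clusterLabelKey]
	return ok
}

func main() {
	// Even if the user overrides app.kubernetes.io/created-by, the Pod stays
	// visible, because ray.io/cluster cannot be overridden by the user.
	operatorPod := map[string]string{
		clusterLabelKey:                "demo",
		"app.kubernetes.io/created-by": "user-overridden",
	}
	unrelatedPod := map[string]string{"app": "nginx"}

	fmt.Println(cacheSees(operatorPod))  // true
	fmt.Println(cacheSees(unrelatedPod)) // false
}
```

In the operator itself this would presumably correspond to configuring controller-runtime's cache options with a label selector requiring ray.io/cluster to exist (for example, one built with apimachinery's labels.NewRequirement using selection.Exists); the sketch above only illustrates why an existence selector on a guarded label avoids the override hazard.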