feat: add scaling adapter #4699
base: main
Conversation
Walkthrough

This pull request removes the built-in autoscaling field from DynamoComponentDeployment and DynamoGraphDeployment CRDs, introducing a new dedicated DynamoGraphDeploymentScalingAdapter CRD and controller to manage per-service scaling instead. Autoscaling validation is removed from webhooks, the Autoscaling struct from API types, and HPA generation logic from controllers. Documentation is added for the new autoscaling architecture.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks: ❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 5
🧹 Nitpick comments (5)
deploy/cloud/operator/internal/controller/common.go (1)
66-95: Consider renaming for clarity.

The function name `servicesEqual` suggests full equality comparison, but it only compares replica counts. The comment correctly describes the behavior, but the name could be more precise to avoid future confusion.

-// servicesEqual compares two services maps to detect changes in replica counts
-func servicesEqual(old, new map[string]*v1alpha1.DynamoComponentDeploymentSharedSpec) bool {
+// servicesReplicasEqual compares two services maps to detect changes in replica counts
+func servicesReplicasEqual(old, new map[string]*v1alpha1.DynamoComponentDeploymentSharedSpec) bool {

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml (1)
98-108: Consider adding a default value for replicas.

The `replicas` field is required but has no default. While this is acceptable for a scaling adapter (the user/autoscaler should specify the value), consider whether a default of `1` would improve UX for initial creation.

If a default is desired, add it to the schema:

 replicas:
   description: |-
     Replicas is the desired number of replicas for the target service.
     This field is modified by external autoscalers (HPA/KEDA/Planner) or manually by users.
+  default: 1
   format: int32
   minimum: 0
   type: integer

deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (3)
78-85: Consider graceful handling when referenced DGD is deleted.

When the DGD is not found, the controller returns an error causing exponential backoff requeue. If the DGD was intentionally deleted, this will generate continuous reconciliation attempts.
Consider: (1) setting a status condition indicating the DGD is missing, or (2) not requeuing when the DGD is gone (the adapter may be orphaned intentionally).
 if err := r.Get(ctx, dgdKey, dgd); err != nil {
 	if errors.IsNotFound(err) {
 		logger.Error(err, "Referenced DGD not found", "dgd", dgdKey)
-		// DGD doesn't exist, can't proceed
-		return ctrl.Result{}, err
+		// DGD doesn't exist - don't requeue (avoids infinite retry)
+		r.Recorder.Eventf(adapter, corev1.EventTypeWarning, "DGDNotFound",
+			"Referenced DynamoGraphDeployment %s not found", dgdKey.Name)
+		return ctrl.Result{}, nil
 	}
 	return ctrl.Result{}, err
 }
103-120: Initial reconciliation may trigger false "out-of-band" sync.

On first reconcile of a new adapter, `adapter.Status.Replicas` will be `0` (zero value). If the DGD service has, say, `3` replicas, this logic will detect it as an "out-of-band DGD change" and sync the adapter spec to `3`.

While this might be acceptable behavior (adapter adopts DGD's current state), the log message "Detected out-of-band DGD change" is misleading for the initial sync case. Consider distinguishing between:
- Initial status population
- Actual out-of-band DGD edits
 // 4. Detect out-of-band DGD changes (Scenario 1: User manually edited DGD)
 // If DGD replicas differ from adapter status, DGD was modified externally
-if currentReplicas != adapter.Status.Replicas {
-	logger.Info("Detected out-of-band DGD change, syncing adapter from DGD",
+if adapter.Status.Replicas != 0 && currentReplicas != adapter.Status.Replicas {
+	logger.Info("Detected out-of-band DGD change, syncing adapter from DGD",

Or use a condition/annotation to track initialization state.
149-156: Minor: Comment numbering skips from step 5 to step 7.

Step 6 is missing in the comment sequence. This is a cosmetic issue but may cause confusion.
-// 7. Update adapter status
+// 6. Update adapter status
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (27)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (0 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (0 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2 hunks)
- deploy/cloud/operator/api/v1alpha1/common.go (0 hunks)
- deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (0 hunks)
- deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (1 hunks)
- deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1 hunks)
- deploy/cloud/operator/cmd/main.go (1 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (0 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (0 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml (1 hunks)
- deploy/cloud/operator/config/rbac/role.yaml (2 hunks)
- deploy/cloud/operator/internal/consts/consts.go (0 hunks)
- deploy/cloud/operator/internal/controller/common.go (1 hunks)
- deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (0 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (4 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (1 hunks)
- deploy/cloud/operator/internal/dynamo/graph.go (2 hunks)
- deploy/cloud/operator/internal/dynamo/graph_test.go (10 hunks)
- deploy/cloud/operator/internal/webhook/validation/dynamocomponentdeployment_test.go (0 hunks)
- deploy/cloud/operator/internal/webhook/validation/dynamographdeployment_test.go (0 hunks)
- deploy/cloud/operator/internal/webhook/validation/shared.go (0 hunks)
- deploy/cloud/operator/internal/webhook/validation/shared_test.go (0 hunks)
- docs/_sections/k8s_deployment.rst (1 hunks)
- docs/kubernetes/api_reference.md (17 hunks)
- docs/kubernetes/autoscaling.md (1 hunks)
💤 Files with no reviewable changes (12)
- deploy/cloud/operator/internal/webhook/validation/dynamographdeployment_test.go
- deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
- deploy/cloud/operator/api/v1alpha1/common.go
- deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
- deploy/cloud/operator/internal/consts/consts.go
- deploy/cloud/operator/internal/webhook/validation/shared.go
- deploy/cloud/operator/internal/webhook/validation/dynamocomponentdeployment_test.go
- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
- deploy/cloud/operator/internal/webhook/validation/shared_test.go
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Applied to files:
- deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go
- docs/kubernetes/api_reference.md
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/config/rbac/role.yaml
📚 Learning: 2025-07-18T16:04:47.465Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Applied to files:
- deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go
- docs/kubernetes/api_reference.md
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/internal/dynamo/graph.go
- deploy/cloud/operator/internal/dynamo/graph_test.go
📚 Learning: 2025-07-18T16:05:05.534Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Applied to files:
- docs/kubernetes/api_reference.md
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/internal/dynamo/graph.go
- deploy/cloud/operator/internal/dynamo/graph_test.go
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Applied to files:
- docs/kubernetes/api_reference.md
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
📚 Learning: 2025-11-05T20:23:29.539Z
Learnt from: nv-hwoo
Repo: ai-dynamo/dynamo PR: 4112
File: examples/backends/sglang/deploy/agg.yaml:76-78
Timestamp: 2025-11-05T20:23:29.539Z
Learning: In DynamoGraphDeployment (DGD) YAML manifests, the volumeMounts field uses `mountPoint` (not the standard Kubernetes `mountPath`) to specify where to mount volumes. This is defined in the DGD custom resource definition API.
Applied to files:
docs/kubernetes/api_reference.md
📚 Learning: 2025-06-04T13:09:53.416Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
Applied to files:
deploy/cloud/operator/internal/dynamo/graph_test.go
📚 Learning: 2025-05-29T16:29:45.152Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 1266
File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go:85-85
Timestamp: 2025-05-29T16:29:45.152Z
Learning: In the Dynamo codebase, ComponentTypePlanner constants with different cases ("Planner" vs "planner") are intentional and serve different purposes: component type in config vs component label. These should not be made consistent as they handle different contexts.
Applied to files:
deploy/cloud/operator/internal/dynamo/graph_test.go
🧬 Code graph analysis (6)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (1)
deploy/cloud/operator/api/v1alpha1/groupversion_info.go (1)
SchemeBuilder(35-35)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (4)
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1)
DynamoGraphDeployment(63-71)
deploy/cloud/operator/internal/controller_common/resource.go (1)
SyncResource(60-195)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (4)
DynamoGraphDeploymentScalingAdapter(83-89)
DynamoGraphDeploymentScalingAdapterSpec(25-35)
DynamoGraphDeploymentServiceRef(38-48)
DynamoGraphDeploymentScalingAdapterList(94-98)
deploy/cloud/operator/internal/consts/consts.go (2)
KubeLabelDynamoGraphDeploymentName(39-39)
KubeLabelDynamoComponent(40-40)
deploy/cloud/operator/internal/dynamo/graph.go (1)
deploy/cloud/operator/internal/consts/consts.go (1)
KubeLabelDynamoComponent(40-40)
deploy/cloud/operator/cmd/main.go (2)
deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (1)
DynamoGraphDeploymentScalingAdapterReconciler(45-50)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
Config(61-87)
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
deploy/cloud/operator/internal/consts/consts.go (3)
KubeLabelDynamoComponent(40-40)
KubeLabelDynamoSelector(32-32)
KubeLabelDynamoGraphDeploymentName(39-39)
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (5)
DynamoGraphDeploymentScalingAdapter(83-89)
DynamoGraphDeploymentScalingAdapterList(94-98)
DynamoGraphDeploymentScalingAdapterSpec(25-35)
DynamoGraphDeploymentScalingAdapterStatus(51-65)
DynamoGraphDeploymentServiceRef(38-48)
🪛 markdownlint-cli2 (0.18.1)
docs/kubernetes/autoscaling.md
18-18: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/kubernetes/api_reference.md
293-293: Link fragments should be valid
(MD051, link-fragments)
632-632: Link fragments should be valid
(MD051, link-fragments)
651-651: Link fragments should be valid
(MD051, link-fragments)
714-714: Link fragments should be valid
(MD051, link-fragments)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: sglang (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (29)
deploy/cloud/operator/internal/dynamo/graph.go (1)
1037-1046: Service-scoped labels for Grove cliques are consistent and adapter-friendly

Passing `serviceName` into `generateLabels` and adding `KubeLabelDynamoComponent = componentName` makes all cliques for a given service share the same selector (`KubeLabelDynamoSelector`) and component label, matching how `DynamoComponentDeployment` is labeled and simplifying per-service scaling selection. The behavior looks correct and is well-aligned with the scaling adapter design.

Also applies to: 1074-1084
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
1297-1303: Tests correctly track new component-level label and selector semantics

All updated expectations for `KubeLabelDynamoComponent` and `KubeLabelDynamoSelector` on cliques and component deployments line up with the implementation in `generateLabels` and `GenerateDynamoComponentsDeployments`: selectors are now per-service (`<graph>-<lower(serviceName)>`), and every resource carries a `KubeLabelDynamoComponent` equal to the service key. This keeps tests in sync with the new scaling/labeling model.

Also applies to: 1473-1481, 1873-1885, 2050-2062, 2191-2200, 2353-2361, 2775-2787, 2939-2949, 3079-3088, 3241-3249
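As a rough illustration of the labeling scheme described above, here is a hedged Go sketch. The constant names are taken from this review, but the label string values and the helper itself are assumptions, not the PR's actual `generateLabels` implementation:

```go
package dynamo

import (
	"fmt"
	"strings"
)

// Label keys as named in the review; the string values below are placeholders,
// not the real constants from internal/consts.
const (
	KubeLabelDynamoGraphDeploymentName = "nvidia.com/dynamo-graph-deployment-name"
	KubeLabelDynamoComponent           = "nvidia.com/dynamo-component"
	KubeLabelDynamoSelector            = "nvidia.com/selector"
)

// perServiceLabels sketches the per-service labels: every resource for a service
// carries the graph name, the component (service key), and a selector of the
// form <graph>-<lower(serviceName)>, as stated in the review above.
func perServiceLabels(graphName, serviceName string) map[string]string {
	return map[string]string{
		KubeLabelDynamoGraphDeploymentName: graphName,
		KubeLabelDynamoComponent:           serviceName,
		KubeLabelDynamoSelector:            fmt.Sprintf("%s-%s", graphName, strings.ToLower(serviceName)),
	}
}
```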
docs/_sections/k8s_deployment.rst (1)
13-13: LGTM!

The new toctree entry for the Autoscaling documentation follows the existing pattern and correctly references the new `autoscaling.md` file.

deploy/cloud/operator/internal/controller/common.go (1)
57-64: LGTM!

The `getServiceKeys` helper is correctly implemented with proper capacity pre-allocation for the slice.

docs/kubernetes/autoscaling.md (3)
18-39: Excellent architecture diagram.

The ASCII diagram clearly illustrates the relationship between DGD, scaling adapters, and autoscalers. The static analysis warning about missing language specification can be safely ignored here as this is an ASCII diagram, not executable code.
136-161: Well-documented HPA example.

The example correctly demonstrates targeting the DGDSA via `scaleTargetRef` and includes sensible defaults for stabilization windows. Good practice showing both scale-up and scale-down behavior configuration.
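For reference, a minimal Go sketch (not taken from the docs in this PR) of an HPA whose `scaleTargetRef` points at a DGDSA through its scale subresource, built with the upstream autoscaling/v2 types. The adapter name, namespace, and CPU target are assumptions:

```go
package main

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleHPA returns an HPA targeting a hypothetical
// DynamoGraphDeploymentScalingAdapter named "my-graph-frontend".
func exampleHPA() *autoscalingv2.HorizontalPodAutoscaler {
	minReplicas := int32(1)
	cpuTarget := int32(70)
	return &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "my-graph-frontend-hpa", Namespace: "dynamo"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			// The scale target is the adapter, not the DGD or a Deployment.
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "nvidia.com/v1alpha1",
				Kind:       "DynamoGraphDeploymentScalingAdapter",
				Name:       "my-graph-frontend",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 10,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceCPU,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &cpuTarget,
					},
				},
			}},
		},
	}
}

func main() { _ = exampleHPA() }
```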
363-395: Helpful troubleshooting section.

The troubleshooting guidance covers common issues with practical kubectl commands. Good inclusion of the raw API check for custom metrics availability.
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (6)
24-35: Well-designed spec structure.

The spec correctly uses `int32` for replicas (matching Kubernetes conventions) with `Minimum=0` validation to support scale-to-zero scenarios with KEDA. Required validation ensures the adapter always has a valid reference.
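A hedged sketch of what such a spec could look like, reconstructed from the field names and validation mentioned in this review; the real types live in `dynamographdeploymentscalingadapter_types.go` and may differ in field and JSON names:

```go
package v1alpha1

// DynamoGraphDeploymentServiceRef identifies one service inside a DGD.
// Field and JSON names here are inferred from the review, not copied verbatim.
type DynamoGraphDeploymentServiceRef struct {
	// Name of the DynamoGraphDeployment.
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinLength=1
	Name string `json:"name"`
	// Service is the service key within the DGD's spec.services map.
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinLength=1
	Service string `json:"service"`
}

// DynamoGraphDeploymentScalingAdapterSpec: an int32 replica count
// (Minimum=0 allows scale-to-zero with KEDA) plus a required reference
// to the target DGD service.
type DynamoGraphDeploymentScalingAdapterSpec struct {
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:Minimum=0
	Replicas int32 `json:"replicas"`
	// +kubebuilder:validation:Required
	DGDRef DynamoGraphDeploymentServiceRef `json:"dgdRef"`
}
```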
37-48: LGTM!

The `DynamoGraphDeploymentServiceRef` cleanly encapsulates the reference to a specific service within a DGD, with appropriate validation to prevent empty strings.
50-65: LGTM!

The status fields correctly support the scale subresource requirements: `Replicas` for current state, `Selector` for HPA pod discovery, and `LastScaleTime` for scaling observability.
67-74: Proper scale subresource configuration.

The kubebuilder markers correctly configure the scale subresource paths, enabling seamless integration with HPA, KEDA, and other scale-aware controllers. The print columns provide a good quick-view experience with `kubectl get dgdsa`.
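Extending the spec sketch a few comments above (same illustrative package), this is roughly what scale-subresource and print-column markers look like in standard kubebuilder syntax; the specific JSONPaths and columns used by the PR are assumptions:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch of a minimal status carrying what the scale subresource needs.
type DynamoGraphDeploymentScalingAdapterStatus struct {
	Replicas      int32        `json:"replicas,omitempty"`
	Selector      string       `json:"selector,omitempty"`
	LastScaleTime *metav1.Time `json:"lastScaleTime,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.selector
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=".spec.replicas"

// DynamoGraphDeploymentScalingAdapter exposes a per-service replica count that
// HPA/KEDA can drive through the scale subresource.
type DynamoGraphDeploymentScalingAdapter struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DynamoGraphDeploymentScalingAdapterSpec   `json:"spec,omitempty"`
	Status DynamoGraphDeploymentScalingAdapterStatus `json:"status,omitempty"`
}
```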
83-89: LGTM!

The main type follows standard Kubernetes CRD patterns with embedded TypeMeta/ObjectMeta and separate Spec/Status structs. The `omitempty` on Spec is safe since the required field validations will be enforced by the webhook.
100-102: LGTM!

The init function correctly registers both the resource and list types with the scheme builder.
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)
569-676: Auto-generated deepcopy functions look correct.

The controller-gen output correctly handles:

- Value copy for `Spec` (since `DGDRef` contains only string fields)
- Deep copy for `Status.LastScaleTime` pointer via `DeepCopy()`
- Standard `DeepCopyObject` for runtime.Object interface compliance

docs/kubernetes/api_reference.md (1)
297-373: Well-documented new scaling adapter API.

The documentation for `DynamoGraphDeploymentScalingAdapter` is comprehensive, covering the resource purpose, spec fields (`replicas`, `dgdRef`), and status fields (`replicas`, `selector`, `lastScaleTime`). The integration with HPA/KEDA via the scale subresource is clearly explained.

deploy/cloud/operator/cmd/main.go (1)
581-589: LGTM - Controller registration follows established pattern.

The `DynamoGraphDeploymentScalingAdapterReconciler` registration is consistent with other controllers in this file. All required dependencies (`Client`, `Scheme`, `Recorder`, `Config`) are properly wired, and error handling follows the same pattern.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml (2)
131-136: Well-designed scale subresource configuration.

The scale subresource is correctly configured for HPA/KEDA compatibility with proper paths for `specReplicasPath`, `statusReplicasPath`, and `labelSelectorPath`. This enables standard Kubernetes autoscalers to interact with the adapter.
1-136: LGTM - CRD definition is complete and follows Kubernetes conventions.

The CRD properly defines all required components for a scaling adapter: spec with target reference and desired replicas, status for observed state, and the scale subresource for HPA/KEDA integration. The printer columns provide useful `kubectl get` output.

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (4)
660-684: LGTM - Cleanup logic for orphaned adapters.

The cleanup logic correctly lists adapters by DGD name label and removes those referencing services that no longer exist in the DGD spec. Error handling properly ignores `NotFound` errors, and events are recorded for observability.
689-693: LGTM with minor note on edge case.

The naming function correctly lowercases service names for DNS subdomain compliance. Note that services differing only by case (e.g., "Frontend" vs "frontend") would produce the same adapter name, but this is unlikely in practice given typical service naming conventions.
714-720: Appropriate predicate configuration for owned adapters.

The predicates correctly:
- Ignore create events (avoids redundant reconciliation after creating adapter)
- React to delete events (ensures adapters are recreated if deleted)
- Ignore update events (adapter controller manages status; spec changes flow from DGD)
This design ensures the DGD controller creates adapters but doesn't fight with the adapter controller for updates.
229-234: LGTM - Scaling adapter reconciliation placement in flow.

The scaling adapter reconciliation is appropriately placed after PVC reconciliation and before service-specific resource reconciliation. This ensures adapters exist before any autoscaling-dependent logic runs.
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)
382-401: No action needed.

The DynamoGraphDeploymentScalingAdapter controller does not use finalizers, so the missing `dynamographdeploymentscalingadapters/finalizers` permission is not required. The RBAC configuration is correct.

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml (2)
1-51: LGTM! Well-structured CRD with proper HPA/KEDA integration.

The CRD metadata, naming conventions, and printer columns are appropriately configured. The scale subresource paths correctly reference the schema fields.
131-136: LGTM! Scale subresource is correctly configured for HPA compatibility.

The `labelSelectorPath`, `specReplicasPath`, and `statusReplicasPath` correctly reference the schema fields, enabling standard Kubernetes autoscalers to interact with this resource.

deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (4)
44-50: LGTM! Reconciler struct follows standard patterns.

The struct correctly embeds the Kubernetes client and includes the event recorder for observability.
122-147: LGTM! Scaling logic correctly updates DGD and records events.

The logic properly updates the DGD's service replicas and records both Kubernetes events and `LastScaleTime` for observability.
87-101: Likely an incorrect or invalid review comment.
171-200: LGTM! Controller setup with efficient watch predicates.

Good use of `GenerationChangedPredicate` for the primary resource and custom predicates for DGD watches. The `servicesEqual` comparison ensures reconciliation only on meaningful changes, properly comparing replica counts between service maps with nil-safe handling.
// 4. Detect out-of-band DGD changes (Scenario 1: User manually edited DGD)
// If DGD replicas differ from adapter status, DGD was modified externally
if currentReplicas != adapter.Status.Replicas {
It feels problematic and unexpected to allow for edits to the DGD service replica count to update the DGDSA replica count.
I think on the DGD there should be a field to enable the DGDSA for a DGD or a specific DGD service. When it's enabled, the DGDSA serves as the source of truth; when it's not enabled, the DGD service replica count is the source of truth. If a user wants to manually edit the DGD service replica count with the SA enabled, they can modify the DGDSA directly.
DGD is the source of truth, users should be directed to update the DGD if they want to do any modifications (replica number included).
DGDSA is just the reflection of what's in the DGD and should be in sync with what's in the DGD.
DGDSA is an adapter, not a state owner: it provides a translation layer for k8s-native autoscalers and should unconditionally be present. Adding an "enable DGDSA" flag would introduce conditional behavior and additional confusion (two sources of truth, depending on a flag).
I think in your current implementation, the DGD is the source of truth as well as the DGDSA. A user edit to the DGD service replicas will always win. If the DGDSA spec replicas is updated with no updates to the DGD, the DGD service replicas are updated. However, there is a subtle case where an update to the DGDSA is discarded if a concurrent user edit is made to the DGD. We can document this, but I don't think it's a good k8s pattern.
If we think about Deployments and ReplicaSets, the Deployment replica count is the source of truth. A user edit to the ReplicaSet replica count will always be overridden by the Deployment replica count.
Adding an enable-DGDSA flag determines which resource holds the source of truth. If I'm a user just deploying a DGD who doesn't care about autoscaling, DGDSA is disabled and I can update the DGD service replicas directly. Or, if I'm a user relying on Planner, HPA, or KEDA, DGDSA is enabled. In that case I don't think the user wants to override the DGD replica count with manual edits, since that would conflict with Planner/HPA/KEDA. If they want to edit the DGD service replica count manually (say, for testing), they can just update the DGDSA.
fixed in 7706428 as per our team discussion
// GetWarnings returns deprecation warnings for the spec.
// This should be called after Validate() to collect any deprecation notices.
func (v *SharedSpecValidator) GetWarnings() []string {
nit: I think it would be cleaner to have DynamoGraphDeploymentValidator.Validate() return (admission.Warnings, error).
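A minimal, hypothetical sketch of that shape; aside from `admission.Warnings` (from controller-runtime), the validator type, its fields, and the warning text are illustrative, not the actual code:

```go
package validation

import (
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// exampleValidator stands in for the real validator; fields are placeholders.
type exampleValidator struct {
	usesDeprecatedAutoscaling bool
	replicas                  *int32
}

// Validate returns warnings alongside the error, so callers don't need a
// separate GetWarnings() call after validation.
func (v *exampleValidator) Validate() (admission.Warnings, error) {
	var warnings admission.Warnings
	if v.usesDeprecatedAutoscaling {
		warnings = append(warnings, "spec.autoscaling is deprecated; use a DynamoGraphDeploymentScalingAdapter instead")
	}
	if v.replicas != nil && *v.replicas < 0 {
		return warnings, fmt.Errorf("replicas must be >= 0")
	}
	return warnings, nil
}
```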
fixed in 7706428
{
	name: "valid spec with all fields",
	spec: &nvidiacomv1alpha1.DynamoComponentDeploymentSharedSpec{
		Replicas: &validReplicas,
nit: unit test for warning validation?
updated
Force-pushed from 4cd75f8 to 7706428
// ValidateUpdate performs stateful validation comparing old and new DynamoGraphDeployment.
// Returns warnings and error.
func (v *DynamoGraphDeploymentValidator) ValidateUpdate(old *nvidiacomv1alpha1.DynamoGraphDeployment) (admission.Warnings, error) {
	return v.ValidateUpdateWithUserInfo(old, nil)
nit: it feels confusing that ValidateUpdate calls ValidateUpdateWithUserInfo. It seems that ValidateUpdate is called when we can't retrieve the user info.
For replica updates, if we fail to get the user info -> then only allow service replica updates where DGDSA is disabled. Right now though, if the user info can't be retrieved we are allowing requests that are modifying service replicas even if DGDSA is enabled. In this case we should reject.
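A hedged sketch of the rejection behavior being suggested; the function name, parameters, and surrounding wiring are hypothetical and only illustrate the decision logic, not the actual webhook code:

```go
package validation

import "fmt"

// validateServiceReplicaChange illustrates the suggested rule: when the DGDSA
// is enabled for a service and the requester's user info could not be
// retrieved, a change to that service's replica count is rejected rather than
// silently allowed.
func validateServiceReplicaChange(adapterEnabled, userInfoAvailable bool, oldReplicas, newReplicas int32) error {
	if oldReplicas == newReplicas {
		return nil // not a replica change; nothing to enforce
	}
	if !adapterEnabled {
		return nil // DGD replicas are the source of truth; manual edits allowed
	}
	if !userInfoAvailable {
		return fmt.Errorf("cannot verify requester: update the DynamoGraphDeploymentScalingAdapter instead of the DGD replica count")
	}
	return nil // requester identity is known; further checks would go here
}
```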
@@ -0,0 +1,733 @@
# Autoscaling
This autoscaling doc is awesome! Thanks for adding a section on HPA and KEDA.
Overview:
design proposal: https://docs.google.com/document/d/1qHDYqbfMO6WJY2vM30YurAh1Gm3Gk7eKnZWdsakFn6M/edit?usp=sharing
This MR introduces the DGD Scaling Adapter, which allows plugging in standard Kubernetes scaling mechanisms (HPA, KEDA, ...).
See the autoscaling user guide in this MR for more details.
Summary by CodeRabbit
New Features
Documentation