feat: add scaling adapter #4699
base: main
Conversation
Walkthrough

This pull request removes the built-in autoscaling field from DynamoComponentDeployment and DynamoGraphDeployment CRDs, introducing a new dedicated DynamoGraphDeploymentScalingAdapter CRD and controller to manage per-service scaling instead. Autoscaling validation is removed from webhooks, the Autoscaling struct from API types, and HPA generation logic from controllers. Documentation is added for the new autoscaling architecture.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks: ❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 5
🧹 Nitpick comments (5)
deploy/cloud/operator/internal/controller/common.go (1)
66-95: Consider renaming for clarity.

The function name `servicesEqual` suggests full equality comparison, but it only compares replica counts. The comment correctly describes the behavior, but the name could be more precise to avoid future confusion.

-// servicesEqual compares two services maps to detect changes in replica counts
-func servicesEqual(old, new map[string]*v1alpha1.DynamoComponentDeploymentSharedSpec) bool {
+// servicesReplicasEqual compares two services maps to detect changes in replica counts
+func servicesReplicasEqual(old, new map[string]*v1alpha1.DynamoComponentDeploymentSharedSpec) bool {

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml (1)
98-108: Consider adding a default value for replicas.

The `replicas` field is required but has no default. While this is acceptable for a scaling adapter (the user/autoscaler should specify the value), consider whether a default of `1` would improve UX for initial creation.

If a default is desired, add it to the schema:

 replicas:
   description: |-
     Replicas is the desired number of replicas for the target service.
     This field is modified by external autoscalers (HPA/KEDA/Planner) or manually by users.
+  default: 1
   format: int32
   minimum: 0
   type: integer

deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (3)
78-85: Consider graceful handling when referenced DGD is deleted.

When the DGD is not found, the controller returns an error causing exponential backoff requeue. If the DGD was intentionally deleted, this will generate continuous reconciliation attempts.
Consider: (1) setting a status condition indicating the DGD is missing, or (2) not requeuing when the DGD is gone (the adapter may be orphaned intentionally).
 if err := r.Get(ctx, dgdKey, dgd); err != nil {
 	if errors.IsNotFound(err) {
 		logger.Error(err, "Referenced DGD not found", "dgd", dgdKey)
-		// DGD doesn't exist, can't proceed
-		return ctrl.Result{}, err
+		// DGD doesn't exist - don't requeue (avoids infinite retry)
+		r.Recorder.Eventf(adapter, corev1.EventTypeWarning, "DGDNotFound",
+			"Referenced DynamoGraphDeployment %s not found", dgdKey.Name)
+		return ctrl.Result{}, nil
 	}
 	return ctrl.Result{}, err
 }
103-120: Initial reconciliation may trigger false "out-of-band" sync.

On first reconcile of a new adapter, `adapter.Status.Replicas` will be `0` (zero value). If the DGD service has, say, `3` replicas, this logic will detect it as an "out-of-band DGD change" and sync the adapter spec to `3`.

While this might be acceptable behavior (adapter adopts DGD's current state), the log message "Detected out-of-band DGD change" is misleading for the initial sync case. Consider distinguishing between:
- Initial status population
- Actual out-of-band DGD edits
 // 4. Detect out-of-band DGD changes (Scenario 1: User manually edited DGD)
 // If DGD replicas differ from adapter status, DGD was modified externally
-if currentReplicas != adapter.Status.Replicas {
-	logger.Info("Detected out-of-band DGD change, syncing adapter from DGD",
+if adapter.Status.Replicas != 0 && currentReplicas != adapter.Status.Replicas {
+	logger.Info("Detected out-of-band DGD change, syncing adapter from DGD",

Or use a condition/annotation to track initialization state.
149-156: Minor: Comment numbering skips from step 5 to step 7.

Step 6 is missing in the comment sequence. This is a cosmetic issue but may cause confusion.
-// 7. Update adapter status
+// 6. Update adapter status
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (27)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (0 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (0 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2 hunks)
- deploy/cloud/operator/api/v1alpha1/common.go (0 hunks)
- deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (0 hunks)
- deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (1 hunks)
- deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1 hunks)
- deploy/cloud/operator/cmd/main.go (1 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (0 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (0 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml (1 hunks)
- deploy/cloud/operator/config/rbac/role.yaml (2 hunks)
- deploy/cloud/operator/internal/consts/consts.go (0 hunks)
- deploy/cloud/operator/internal/controller/common.go (1 hunks)
- deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (0 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (4 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (1 hunks)
- deploy/cloud/operator/internal/dynamo/graph.go (2 hunks)
- deploy/cloud/operator/internal/dynamo/graph_test.go (10 hunks)
- deploy/cloud/operator/internal/webhook/validation/dynamocomponentdeployment_test.go (0 hunks)
- deploy/cloud/operator/internal/webhook/validation/dynamographdeployment_test.go (0 hunks)
- deploy/cloud/operator/internal/webhook/validation/shared.go (0 hunks)
- deploy/cloud/operator/internal/webhook/validation/shared_test.go (0 hunks)
- docs/_sections/k8s_deployment.rst (1 hunks)
- docs/kubernetes/api_reference.md (17 hunks)
- docs/kubernetes/autoscaling.md (1 hunks)
💤 Files with no reviewable changes (12)
- deploy/cloud/operator/internal/webhook/validation/dynamographdeployment_test.go
- deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
- deploy/cloud/operator/api/v1alpha1/common.go
- deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
- deploy/cloud/operator/internal/consts/consts.go
- deploy/cloud/operator/internal/webhook/validation/shared.go
- deploy/cloud/operator/internal/webhook/validation/dynamocomponentdeployment_test.go
- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
- deploy/cloud/operator/internal/webhook/validation/shared_test.go
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Applied to files:
- deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go
- docs/kubernetes/api_reference.md
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/config/rbac/role.yaml
📚 Learning: 2025-07-18T16:04:47.465Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Applied to files:
- deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go
- docs/kubernetes/api_reference.md
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/internal/dynamo/graph.go
- deploy/cloud/operator/internal/dynamo/graph_test.go
📚 Learning: 2025-07-18T16:05:05.534Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Applied to files:
- docs/kubernetes/api_reference.md
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/internal/dynamo/graph.go
- deploy/cloud/operator/internal/dynamo/graph_test.go
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Applied to files:
- docs/kubernetes/api_reference.md
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
📚 Learning: 2025-11-05T20:23:29.539Z
Learnt from: nv-hwoo
Repo: ai-dynamo/dynamo PR: 4112
File: examples/backends/sglang/deploy/agg.yaml:76-78
Timestamp: 2025-11-05T20:23:29.539Z
Learning: In DynamoGraphDeployment (DGD) YAML manifests, the volumeMounts field uses `mountPoint` (not the standard Kubernetes `mountPath`) to specify where to mount volumes. This is defined in the DGD custom resource definition API.
Applied to files:
docs/kubernetes/api_reference.md
📚 Learning: 2025-06-04T13:09:53.416Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
Applied to files:
deploy/cloud/operator/internal/dynamo/graph_test.go
📚 Learning: 2025-05-29T16:29:45.152Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 1266
File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go:85-85
Timestamp: 2025-05-29T16:29:45.152Z
Learning: In the Dynamo codebase, ComponentTypePlanner constants with different cases ("Planner" vs "planner") are intentional and serve different purposes: component type in config vs component label. These should not be made consistent as they handle different contexts.
Applied to files:
deploy/cloud/operator/internal/dynamo/graph_test.go
🧬 Code graph analysis (6)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (1)
deploy/cloud/operator/api/v1alpha1/groupversion_info.go (1)
SchemeBuilder(35-35)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (4)
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1)
DynamoGraphDeployment(63-71)
deploy/cloud/operator/internal/controller_common/resource.go (1)
SyncResource(60-195)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (4)
DynamoGraphDeploymentScalingAdapter(83-89)
DynamoGraphDeploymentScalingAdapterSpec(25-35)
DynamoGraphDeploymentServiceRef(38-48)
DynamoGraphDeploymentScalingAdapterList(94-98)
deploy/cloud/operator/internal/consts/consts.go (2)
KubeLabelDynamoGraphDeploymentName(39-39)
KubeLabelDynamoComponent(40-40)
deploy/cloud/operator/internal/dynamo/graph.go (1)
deploy/cloud/operator/internal/consts/consts.go (1)
KubeLabelDynamoComponent(40-40)
deploy/cloud/operator/cmd/main.go (2)
deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (1)
DynamoGraphDeploymentScalingAdapterReconciler(45-50)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
Config(61-87)
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
deploy/cloud/operator/internal/consts/consts.go (3)
KubeLabelDynamoComponent(40-40)
KubeLabelDynamoSelector(32-32)
KubeLabelDynamoGraphDeploymentName(39-39)
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (5)
DynamoGraphDeploymentScalingAdapter(83-89)
DynamoGraphDeploymentScalingAdapterList(94-98)
DynamoGraphDeploymentScalingAdapterSpec(25-35)
DynamoGraphDeploymentScalingAdapterStatus(51-65)
DynamoGraphDeploymentServiceRef(38-48)
🪛 markdownlint-cli2 (0.18.1)
docs/kubernetes/autoscaling.md
18-18: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/kubernetes/api_reference.md
293-293: Link fragments should be valid
(MD051, link-fragments)
632-632: Link fragments should be valid
(MD051, link-fragments)
651-651: Link fragments should be valid
(MD051, link-fragments)
714-714: Link fragments should be valid
(MD051, link-fragments)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: sglang (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (29)
deploy/cloud/operator/internal/dynamo/graph.go (1)
1037-1046: Service-scoped labels for Grove cliques are consistent and adapter-friendly

Passing `serviceName` into `generateLabels` and adding `KubeLabelDynamoComponent = componentName` makes all cliques for a given service share the same selector (`KubeLabelDynamoSelector`) and component label, matching how `DynamoComponentDeployment` is labeled and simplifying per-service scaling selection. The behavior looks correct and is well-aligned with the scaling adapter design.

Also applies to: 1074-1084
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
1297-1303: Tests correctly track new component-level label and selector semantics

All updated expectations for `KubeLabelDynamoComponent` and `KubeLabelDynamoSelector` on cliques and component deployments line up with the implementation in `generateLabels` and `GenerateDynamoComponentsDeployments`: selectors are now per-service (`<graph>-<lower(serviceName)>`), and every resource carries a `KubeLabelDynamoComponent` equal to the service key. This keeps tests in sync with the new scaling/labeling model.

Also applies to: 1473-1481, 1873-1885, 2050-2062, 2191-2200, 2353-2361, 2775-2787, 2939-2949, 3079-3088, 3241-3249
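As a rough illustration of the labeling scheme described above, here is a hedged Go sketch. The constant names are taken from this review, but the label string values and the helper itself are assumptions, not the PR's actual `generateLabels` implementation:

```go
package dynamo

import (
	"fmt"
	"strings"
)

// Label keys as named in the review; the string values below are placeholders,
// not the real constants from internal/consts.
const (
	KubeLabelDynamoGraphDeploymentName = "nvidia.com/dynamo-graph-deployment-name"
	KubeLabelDynamoComponent           = "nvidia.com/dynamo-component"
	KubeLabelDynamoSelector            = "nvidia.com/selector"
)

// perServiceLabels sketches the per-service labels: every resource for a service
// carries the graph name, the component (service key), and a selector of the
// form <graph>-<lower(serviceName)>, as stated in the review above.
func perServiceLabels(graphName, serviceName string) map[string]string {
	return map[string]string{
		KubeLabelDynamoGraphDeploymentName: graphName,
		KubeLabelDynamoComponent:           serviceName,
		KubeLabelDynamoSelector:            fmt.Sprintf("%s-%s", graphName, strings.ToLower(serviceName)),
	}
}
```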
docs/_sections/k8s_deployment.rst (1)
13-13: LGTM!

The new toctree entry for the Autoscaling documentation follows the existing pattern and correctly references the new `autoscaling.md` file.

deploy/cloud/operator/internal/controller/common.go (1)
57-64: LGTM!

The `getServiceKeys` helper is correctly implemented with proper capacity pre-allocation for the slice.

docs/kubernetes/autoscaling.md (3)
18-39: Excellent architecture diagram.

The ASCII diagram clearly illustrates the relationship between DGD, scaling adapters, and autoscalers. The static analysis warning about missing language specification can be safely ignored here as this is an ASCII diagram, not executable code.
136-161: Well-documented HPA example.

The example correctly demonstrates targeting the DGDSA via `scaleTargetRef` and includes sensible defaults for stabilization windows. Good practice showing both scale-up and scale-down behavior configuration.
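For reference, a minimal Go sketch (not taken from the docs in this PR) of an HPA whose `scaleTargetRef` points at a DGDSA through its scale subresource, built with the upstream autoscaling/v2 types. The adapter name, namespace, and CPU target are assumptions:

```go
package main

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleHPA returns an HPA targeting a hypothetical
// DynamoGraphDeploymentScalingAdapter named "my-graph-frontend".
func exampleHPA() *autoscalingv2.HorizontalPodAutoscaler {
	minReplicas := int32(1)
	cpuTarget := int32(70)
	return &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "my-graph-frontend-hpa", Namespace: "dynamo"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			// The scale target is the adapter, not the DGD or a Deployment.
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "nvidia.com/v1alpha1",
				Kind:       "DynamoGraphDeploymentScalingAdapter",
				Name:       "my-graph-frontend",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 10,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceCPU,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &cpuTarget,
					},
				},
			}},
		},
	}
}

func main() { _ = exampleHPA() }
```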
363-395: Helpful troubleshooting section.

The troubleshooting guidance covers common issues with practical kubectl commands. Good inclusion of the raw API check for custom metrics availability.
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (6)
24-35: Well-designed spec structure.

The spec correctly uses `int32` for replicas (matching Kubernetes conventions) with `Minimum=0` validation to support scale-to-zero scenarios with KEDA. Required validation ensures the adapter always has a valid reference.
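A hedged sketch of what such a spec could look like, reconstructed from the field names and validation mentioned in this review; the real types live in `dynamographdeploymentscalingadapter_types.go` and may differ in field and JSON names:

```go
package v1alpha1

// DynamoGraphDeploymentServiceRef identifies one service inside a DGD.
// Field and JSON names here are inferred from the review, not copied verbatim.
type DynamoGraphDeploymentServiceRef struct {
	// Name of the DynamoGraphDeployment.
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinLength=1
	Name string `json:"name"`
	// Service is the service key within the DGD's spec.services map.
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinLength=1
	Service string `json:"service"`
}

// DynamoGraphDeploymentScalingAdapterSpec: an int32 replica count
// (Minimum=0 allows scale-to-zero with KEDA) plus a required reference
// to the target DGD service.
type DynamoGraphDeploymentScalingAdapterSpec struct {
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:Minimum=0
	Replicas int32 `json:"replicas"`
	// +kubebuilder:validation:Required
	DGDRef DynamoGraphDeploymentServiceRef `json:"dgdRef"`
}
```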
37-48: LGTM!

The `DynamoGraphDeploymentServiceRef` cleanly encapsulates the reference to a specific service within a DGD, with appropriate validation to prevent empty strings.
50-65: LGTM!

The status fields correctly support the scale subresource requirements: `Replicas` for current state, `Selector` for HPA pod discovery, and `LastScaleTime` for scaling observability.
67-74: Proper scale subresource configuration.

The kubebuilder markers correctly configure the scale subresource paths, enabling seamless integration with HPA, KEDA, and other scale-aware controllers. The print columns provide a good quick-view experience with `kubectl get dgdsa`.
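Extending the spec sketch a few comments above (same illustrative package), this is roughly what scale-subresource and print-column markers look like in standard kubebuilder syntax; the specific JSONPaths and columns used by the PR are assumptions:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch of a minimal status carrying what the scale subresource needs.
type DynamoGraphDeploymentScalingAdapterStatus struct {
	Replicas      int32        `json:"replicas,omitempty"`
	Selector      string       `json:"selector,omitempty"`
	LastScaleTime *metav1.Time `json:"lastScaleTime,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.selector
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=".spec.replicas"

// DynamoGraphDeploymentScalingAdapter exposes a per-service replica count that
// HPA/KEDA can drive through the scale subresource.
type DynamoGraphDeploymentScalingAdapter struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DynamoGraphDeploymentScalingAdapterSpec   `json:"spec,omitempty"`
	Status DynamoGraphDeploymentScalingAdapterStatus `json:"status,omitempty"`
}
```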
83-89: LGTM!

The main type follows standard Kubernetes CRD patterns with embedded TypeMeta/ObjectMeta and separate Spec/Status structs. The `omitempty` on Spec is safe since the required field validations will be enforced by the webhook.
100-102: LGTM!

The init function correctly registers both the resource and list types with the scheme builder.
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)
569-676: Auto-generated deepcopy functions look correct.

The controller-gen output correctly handles:

- Value copy for `Spec` (since `DGDRef` contains only string fields)
- Deep copy for `Status.LastScaleTime` pointer via `DeepCopy()`
- Standard `DeepCopyObject` for runtime.Object interface compliance

docs/kubernetes/api_reference.md (1)
297-373: Well-documented new scaling adapter API.

The documentation for `DynamoGraphDeploymentScalingAdapter` is comprehensive, covering the resource purpose, spec fields (`replicas`, `dgdRef`), and status fields (`replicas`, `selector`, `lastScaleTime`). The integration with HPA/KEDA via the scale subresource is clearly explained.

deploy/cloud/operator/cmd/main.go (1)
581-589: LGTM - Controller registration follows established pattern.

The `DynamoGraphDeploymentScalingAdapterReconciler` registration is consistent with other controllers in this file. All required dependencies (`Client`, `Scheme`, `Recorder`, `Config`) are properly wired, and error handling follows the same pattern.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml (2)
131-136: Well-designed scale subresource configuration.

The scale subresource is correctly configured for HPA/KEDA compatibility with proper paths for `specReplicasPath`, `statusReplicasPath`, and `labelSelectorPath`. This enables standard Kubernetes autoscalers to interact with the adapter.
1-136: LGTM - CRD definition is complete and follows Kubernetes conventions.

The CRD properly defines all required components for a scaling adapter: spec with target reference and desired replicas, status for observed state, and the scale subresource for HPA/KEDA integration. The printer columns provide useful `kubectl get` output.

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (4)
660-684: LGTM - Cleanup logic for orphaned adapters.

The cleanup logic correctly lists adapters by DGD name label and removes those referencing services that no longer exist in the DGD spec. Error handling properly ignores `NotFound` errors, and events are recorded for observability.
689-693: LGTM with minor note on edge case.

The naming function correctly lowercases service names for DNS subdomain compliance. Note that services differing only by case (e.g., "Frontend" vs "frontend") would produce the same adapter name, but this is unlikely in practice given typical service naming conventions.
714-720: Appropriate predicate configuration for owned adapters.

The predicates correctly:
- Ignore create events (avoids redundant reconciliation after creating adapter)
- React to delete events (ensures adapters are recreated if deleted)
- Ignore update events (adapter controller manages status; spec changes flow from DGD)
This design ensures the DGD controller creates adapters but doesn't fight with the adapter controller for updates.
229-234: LGTM - Scaling adapter reconciliation placement in flow.

The scaling adapter reconciliation is appropriately placed after PVC reconciliation and before service-specific resource reconciliation. This ensures adapters exist before any autoscaling-dependent logic runs.
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)
382-401: No action needed.

The DynamoGraphDeploymentScalingAdapter controller does not use finalizers, so the missing `dynamographdeploymentscalingadapters/finalizers` permission is not required. The RBAC configuration is correct.

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml (2)
1-51: LGTM! Well-structured CRD with proper HPA/KEDA integration.

The CRD metadata, naming conventions, and printer columns are appropriately configured. The scale subresource paths correctly reference the schema fields.
131-136: LGTM! Scale subresource is correctly configured for HPA compatibility.

The `labelSelectorPath`, `specReplicasPath`, and `statusReplicasPath` correctly reference the schema fields, enabling standard Kubernetes autoscalers to interact with this resource.

deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (4)
44-50: LGTM! Reconciler struct follows standard patterns.

The struct correctly embeds the Kubernetes client and includes the event recorder for observability.
122-147: LGTM! Scaling logic correctly updates DGD and records events.

The logic properly updates the DGD's service replicas and records both Kubernetes events and `LastScaleTime` for observability.
87-101: Likely an incorrect or invalid review comment.
171-200: LGTM! Controller setup with efficient watch predicates.

Good use of `GenerationChangedPredicate` for the primary resource and custom predicates for DGD watches. The `servicesEqual` comparison ensures reconciliation only on meaningful changes, properly comparing replica counts between service maps with nil-safe handling.
// 4. Detect out-of-band DGD changes (Scenario 1: User manually edited DGD)
// If DGD replicas differ from adapter status, DGD was modified externally
if currentReplicas != adapter.Status.Replicas {
It feels problematic and unexpected to allow for edits to the DGD service replica count to update the DGDSA replica count.
I think on the DGD there should be a field to enable the DGDSA for a DGD or a specific DGD service. When it's enabled, the DGDSA serves as the source of truth; when it's not enabled, the DGD service replica count is the source of truth. If a user wants to manually edit the DGD service replica count with the SA enabled, they can modify the DGDSA directly.
DGD is the source of truth, users should be directed to update the DGD if they want to do any modifications (replica number included).
DGDSA is just the reflection of what's in the DGD and should be in sync with what's in the DGD.
DGDSA is an adapter, not a state owner: it provides a translation layer for k8s-native autoscalers and should unconditionally be present. Adding an "enable DGDSA" flag would introduce conditional behavior and additional confusion (two sources of truth, depending on a flag).
I think in your current implementation, the DGD is the source of truth as well as the DGDSA. A user edit to the DGD service replicas will always win. If the DGDSA spec replicas is updated with no updates to the DGD, the DGD service replicas are updated. However, there is a subtle case where an update to the DGDSA is discarded if a concurrent user edit is made to the DGD. We can document this, but I don't think it's a good k8s pattern.
If we think about Deployments and ReplicaSets, the Deployment replica count is the source of truth. A user edit to the ReplicaSet replica count will always be overridden by the Deployment replica count.
Adding an enable-DGDSA flag determines which resource holds the source of truth. If I'm a user just deploying a DGD who doesn't care about autoscaling, DGDSA is disabled and I can update the DGD service replicas directly. Or, if I'm a user relying on Planner, HPA, or KEDA, DGDSA is enabled. In that case I don't think the user wants to override the DGD replica count with manual edits, since that would conflict with Planner/HPA/KEDA. If they want to edit the DGD service replica count manually (say, for testing), they can just update the DGDSA.
fixed in 7706428 as per our team discussion
// GetWarnings returns deprecation warnings for the spec.
// This should be called after Validate() to collect any deprecation notices.
func (v *SharedSpecValidator) GetWarnings() []string {
nit: I think it would be cleaner to have DynamoGraphDeploymentValidator.Validate() return (admission.Warnings, error).
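A minimal, hypothetical sketch of that shape; aside from `admission.Warnings` (from controller-runtime), the validator type, its fields, and the warning text are illustrative, not the actual code:

```go
package validation

import (
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// exampleValidator stands in for the real validator; fields are placeholders.
type exampleValidator struct {
	usesDeprecatedAutoscaling bool
	replicas                  *int32
}

// Validate returns warnings alongside the error, so callers don't need a
// separate GetWarnings() call after validation.
func (v *exampleValidator) Validate() (admission.Warnings, error) {
	var warnings admission.Warnings
	if v.usesDeprecatedAutoscaling {
		warnings = append(warnings, "spec.autoscaling is deprecated; use a DynamoGraphDeploymentScalingAdapter instead")
	}
	if v.replicas != nil && *v.replicas < 0 {
		return warnings, fmt.Errorf("replicas must be >= 0")
	}
	return warnings, nil
}
```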
fixed in 7706428
{
	name: "valid spec with all fields",
	spec: &nvidiacomv1alpha1.DynamoComponentDeploymentSharedSpec{
		Replicas: &validReplicas,
nit: unit test for warning validation?
updated
Force-pushed from 4cd75f8 to 7706428
// ValidateUpdate performs stateful validation comparing old and new DynamoGraphDeployment.
// Returns warnings and error.
func (v *DynamoGraphDeploymentValidator) ValidateUpdate(old *nvidiacomv1alpha1.DynamoGraphDeployment) (admission.Warnings, error) {
	return v.ValidateUpdateWithUserInfo(old, nil)
nit: it feels confusing that ValidateUpdate calls ValidateUpdateWithUserInfo. It seems that ValidateUpdate is called when we can't retrieve the user info.
For replica updates, if we fail to get the user info -> then only allow service replica updates where DGDSA is disabled. Right now though, if the user info can't be retrieved we are allowing requests that are modifying service replicas even if DGDSA is enabled. In this case we should reject.
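A hedged sketch of the rejection behavior being suggested; the function name, parameters, and surrounding wiring are hypothetical and only illustrate the decision logic, not the actual webhook code:

```go
package validation

import "fmt"

// validateServiceReplicaChange illustrates the suggested rule: when the DGDSA
// is enabled for a service and the requester's user info could not be
// retrieved, a change to that service's replica count is rejected rather than
// silently allowed.
func validateServiceReplicaChange(adapterEnabled, userInfoAvailable bool, oldReplicas, newReplicas int32) error {
	if oldReplicas == newReplicas {
		return nil // not a replica change; nothing to enforce
	}
	if !adapterEnabled {
		return nil // DGD replicas are the source of truth; manual edits allowed
	}
	if !userInfoAvailable {
		return fmt.Errorf("cannot verify requester: update the DynamoGraphDeploymentScalingAdapter instead of the DGD replica count")
	}
	return nil // requester identity is known; further checks would go here
}
```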
@@ -0,0 +1,733 @@
# Autoscaling
This autoscaling doc is awesome! Thanks for adding a section on HPA and KEDA.
Overview:
design proposal: https://docs.google.com/document/d/1qHDYqbfMO6WJY2vM30YurAh1Gm3Gk7eKnZWdsakFn6M/edit?usp=sharing
This MR introduces the DGD Scaling Adapter, which allows plugging in standard Kubernetes scaling mechanisms (HPA, KEDA, ...).
See the autoscaling user guide in this MR for more details.
Summary by CodeRabbit
New Features
Documentation