Conversation

@julienmancuso (Contributor) commented Dec 2, 2025

Overview:

design proposal : https://docs.google.com/document/d/1qHDYqbfMO6WJY2vM30YurAh1Gm3Gk7eKnZWdsakFn6M/edit?usp=sharing

This MR introduces the DGD Scaling Adapter, which makes it possible to plug standard Kubernetes scaling mechanisms (HPA, KEDA, ...) into DynamoGraphDeployments.

See the autoscaling user guide added in this MR for more details.
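For illustration only (not copied from this MR's code, and the sub-field names under dgdRef are assumptions based on the design), a scaling adapter for one DGD service could look roughly like:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentScalingAdapter
metadata:
  name: my-graph-frontend          # hypothetical: <dgd-name>-<lowercased-service>
spec:
  dgdRef:
    name: my-graph                 # assumed field: the referenced DynamoGraphDeployment
    service: Frontend              # assumed field: the service key inside the DGD
  replicas: 2                      # written by HPA/KEDA/Planner or edited manually

The scale subresource on this resource is what HPA, KEDA, or the Planner target instead of the DGD itself.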

Summary by CodeRabbit

  • New Features

    • Introduced DynamoGraphDeploymentScalingAdapter resource for managing per-service scaling configuration in DynamoGraphDeployments.
  • Documentation

    • Added comprehensive autoscaling guide covering multiple scaling strategies and best practices for Kubernetes environments.


Signed-off-by: Julien Mancuso <[email protected]>
@julienmancuso requested a review from a team as a code owner December 2, 2025 18:31
github-actions bot added the feat label Dec 2, 2025
@coderabbitai (coderabbitai bot) commented Dec 2, 2025

Walkthrough

This pull request removes the built-in autoscaling field from DynamoComponentDeployment and DynamoGraphDeployment CRDs, introducing a new dedicated DynamoGraphDeploymentScalingAdapter CRD and controller to manage per-service scaling instead. Autoscaling validation is removed from webhooks, the Autoscaling struct from API types, and HPA generation logic from controllers. Documentation is added for the new autoscaling architecture.

Changes

  • CRD Autoscaling Field Removals (deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponent*.yaml, deploy/cloud/helm/crds/templates/nvidia.com_dynamograph*.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponent*.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamograph*.yaml): Removes the entire autoscaling nested object (behavior, scaleDown, scaleUp, policies, metrics, minReplicas, maxReplicas) from deployment CRD specs.

  • New Scaling Adapter CRD (deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml): Introduces a new CustomResourceDefinition for DynamoGraphDeploymentScalingAdapter with a scale subresource, dgdRef, replicas, and status fields.

  • API Type Definitions (deploy/cloud/operator/api/v1alpha1/common.go, deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go, deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go): Removes the Autoscaling struct and its import; removes the autoscaling field from DynamoComponentDeploymentSharedSpec; adds the new DynamoGraphDeploymentScalingAdapter, DynamoGraphDeploymentScalingAdapterSpec, DynamoGraphDeploymentScalingAdapterStatus, and DynamoGraphDeploymentServiceRef types.

  • Generated DeepCopy Methods (deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go): Removes DeepCopy methods for Autoscaling; adds DeepCopy/DeepCopyInto/DeepCopyObject methods for the new scaling adapter types.

  • RBAC and Permissions (deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml, deploy/cloud/operator/config/rbac/role.yaml): Adds the dynamographdeploymentscalingadapters resource with CRUD verbs, finalizers, and status subresource permissions.

  • Controller Setup and Registration (deploy/cloud/operator/cmd/main.go): Registers the new DynamoGraphDeploymentScalingAdapterReconciler with the manager.

  • Existing Controller Updates (deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go, deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go): Removes HPA generation logic and the generateHPA method from the component controller; adds scaling adapter reconciliation, cleanup, and generation logic to the graph controller.

  • New Scaling Adapter Reconciler (deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go): Implements a new reconciler for DynamoGraphDeploymentScalingAdapter with sync logic, event recording, status updates, and pod selector generation.

  • Controller Utilities (deploy/cloud/operator/internal/controller/common.go): Adds helper functions getServiceKeys and servicesEqual for service comparison.

  • Constants and Internal Updates (deploy/cloud/operator/internal/consts/consts.go, deploy/cloud/operator/internal/dynamo/graph.go): Removes the HPACPUDefaultAverageUtilization constant; updates graph generation to use serviceName for labels with the new KubeLabelDynamoComponent.

  • Test Updates (deploy/cloud/operator/internal/dynamo/graph_test.go, deploy/cloud/operator/internal/webhook/validation/dynamocomponentdeployment_test.go, deploy/cloud/operator/internal/webhook/validation/dynamographdeployment_test.go, deploy/cloud/operator/internal/webhook/validation/shared.go, deploy/cloud/operator/internal/webhook/validation/shared_test.go): Removes autoscaling test cases and validation; removes the Autoscaling field from test fixtures; updates graph test labels to include KubeLabelDynamoComponent.

  • Documentation (docs/_sections/k8s_deployment.rst, docs/kubernetes/api_reference.md, docs/kubernetes/autoscaling.md): Adds an autoscaling documentation link; documents the new DynamoGraphDeploymentScalingAdapter API types with full schema details; adds a comprehensive autoscaling guide covering HPA, KEDA, Planner, and custom controller strategies.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • New reconciler logic (dynamographdeploymentscalingadapter_controller.go): Requires careful review of reconciliation flow, event handling, status updates, and DGD reference validation
  • Controller integration in existing graph deployment controller: Verify scaling adapter creation, orphaned adapter cleanup, and label propagation
  • Pod selector generation and service reference handling in the new adapter
  • RBAC and CRD schema validation: Ensure proper scale subresource configuration and finalizer handling
  • Test coverage: Verify all autoscaling removal paths are properly tested and no orphaned validation logic remains
  • Label propagation changes: Validate KubeLabelDynamoComponent is correctly applied throughout graph generation

Poem

🐰 The autoscaler hops away,
Adapters bloom to save the day,
One CRD now scales per service,
No more HPA—sleek and nervous!
With labels fresh and logic clean,
The finest scaling I've e'er seen!

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 55.56%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check (❓ Inconclusive): The PR description is minimal and lacks required structural details. It references external documents but doesn't clearly explain changes, rationale, or testing in the template format. Resolution: add complete sections including Overview, Details of changes, specific files for review, and linked issues; the description should be self-contained and follow the template structure provided.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title 'feat: add scaling adapter' is clear and concise, directly referencing the main change. However, it is somewhat generic and doesn't specify that this concerns DynamoGraphDeployment or the new DGD Scaling Adapter resource.


@coderabbitai (coderabbitai bot) left a comment

Actionable comments posted: 5

🧹 Nitpick comments (5)
deploy/cloud/operator/internal/controller/common.go (1)

66-95: Consider renaming for clarity.

The function name servicesEqual suggests full equality comparison, but it only compares replica counts. The comment correctly describes the behavior, but the name could be more precise to avoid future confusion.

-// servicesEqual compares two services maps to detect changes in replica counts
-func servicesEqual(old, new map[string]*v1alpha1.DynamoComponentDeploymentSharedSpec) bool {
+// servicesReplicasEqual compares two services maps to detect changes in replica counts
+func servicesReplicasEqual(old, new map[string]*v1alpha1.DynamoComponentDeploymentSharedSpec) bool {
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml (1)

98-108: Consider adding a default value for replicas.

The replicas field is required but has no default. While this is acceptable for a scaling adapter (the user/autoscaler should specify the value), consider whether a default of 1 would improve UX for initial creation.

If a default is desired, add it to the schema:

              replicas:
                description: |-
                  Replicas is the desired number of replicas for the target service.
                  This field is modified by external autoscalers (HPA/KEDA/Planner) or manually by users.
+               default: 1
                format: int32
                minimum: 0
                type: integer
deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (3)

78-85: Consider graceful handling when referenced DGD is deleted.

When the DGD is not found, the controller returns an error causing exponential backoff requeue. If the DGD was intentionally deleted, this will generate continuous reconciliation attempts.

Consider: (1) setting a status condition indicating the DGD is missing, or (2) not requeuing when the DGD is gone (the adapter may be orphaned intentionally).

 	if err := r.Get(ctx, dgdKey, dgd); err != nil {
 		if errors.IsNotFound(err) {
 			logger.Error(err, "Referenced DGD not found", "dgd", dgdKey)
-			// DGD doesn't exist, can't proceed
-			return ctrl.Result{}, err
+			// DGD doesn't exist - don't requeue (avoids infinite retry)
+			r.Recorder.Eventf(adapter, corev1.EventTypeWarning, "DGDNotFound",
+				"Referenced DynamoGraphDeployment %s not found", dgdKey.Name)
+			return ctrl.Result{}, nil
 		}
 		return ctrl.Result{}, err
 	}

103-120: Initial reconciliation may trigger false "out-of-band" sync.

On first reconcile of a new adapter, adapter.Status.Replicas will be 0 (zero value). If the DGD service has, say, 3 replicas, this logic will detect it as an "out-of-band DGD change" and sync the adapter spec to 3.

While this might be acceptable behavior (adapter adopts DGD's current state), the log message "Detected out-of-band DGD change" is misleading for the initial sync case. Consider distinguishing between:

  1. Initial status population
  2. Actual out-of-band DGD edits
 	// 4. Detect out-of-band DGD changes (Scenario 1: User manually edited DGD)
 	// If DGD replicas differ from adapter status, DGD was modified externally
-	if currentReplicas != adapter.Status.Replicas {
-		logger.Info("Detected out-of-band DGD change, syncing adapter from DGD",
+	if adapter.Status.Replicas != 0 && currentReplicas != adapter.Status.Replicas {
+		logger.Info("Detected out-of-band DGD change, syncing adapter from DGD",

Or use a condition/annotation to track initialization state.


149-156: Minor: Comment numbering skips from step 5 to step 7.

Step 6 is missing in the comment sequence. This is a cosmetic issue but may cause confusion.

-	// 7. Update adapter status
+	// 6. Update adapter status
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44e8600 and 6746518.

📒 Files selected for processing (27)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (0 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (0 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2 hunks)
  • deploy/cloud/operator/api/v1alpha1/common.go (0 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (0 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (1 hunks)
  • deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • deploy/cloud/operator/cmd/main.go (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (0 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (0 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml (1 hunks)
  • deploy/cloud/operator/config/rbac/role.yaml (2 hunks)
  • deploy/cloud/operator/internal/consts/consts.go (0 hunks)
  • deploy/cloud/operator/internal/controller/common.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (0 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (4 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (2 hunks)
  • deploy/cloud/operator/internal/dynamo/graph_test.go (10 hunks)
  • deploy/cloud/operator/internal/webhook/validation/dynamocomponentdeployment_test.go (0 hunks)
  • deploy/cloud/operator/internal/webhook/validation/dynamographdeployment_test.go (0 hunks)
  • deploy/cloud/operator/internal/webhook/validation/shared.go (0 hunks)
  • deploy/cloud/operator/internal/webhook/validation/shared_test.go (0 hunks)
  • docs/_sections/k8s_deployment.rst (1 hunks)
  • docs/kubernetes/api_reference.md (17 hunks)
  • docs/kubernetes/autoscaling.md (1 hunks)
💤 Files with no reviewable changes (12)
  • deploy/cloud/operator/internal/webhook/validation/dynamographdeployment_test.go
  • deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
  • deploy/cloud/operator/api/v1alpha1/common.go
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/operator/internal/consts/consts.go
  • deploy/cloud/operator/internal/webhook/validation/shared.go
  • deploy/cloud/operator/internal/webhook/validation/dynamocomponentdeployment_test.go
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/operator/internal/webhook/validation/shared_test.go
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

  • deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go
  • docs/kubernetes/api_reference.md
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
  • deploy/cloud/operator/config/rbac/role.yaml
📚 Learning: 2025-07-18T16:04:47.465Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Applied to files:

  • deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go
  • docs/kubernetes/api_reference.md
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
  • deploy/cloud/operator/internal/dynamo/graph.go
  • deploy/cloud/operator/internal/dynamo/graph_test.go
📚 Learning: 2025-07-18T16:05:05.534Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Applied to files:

  • docs/kubernetes/api_reference.md
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
  • deploy/cloud/operator/internal/dynamo/graph.go
  • deploy/cloud/operator/internal/dynamo/graph_test.go
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

  • docs/kubernetes/api_reference.md
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml
📚 Learning: 2025-11-05T20:23:29.539Z
Learnt from: nv-hwoo
Repo: ai-dynamo/dynamo PR: 4112
File: examples/backends/sglang/deploy/agg.yaml:76-78
Timestamp: 2025-11-05T20:23:29.539Z
Learning: In DynamoGraphDeployment (DGD) YAML manifests, the volumeMounts field uses `mountPoint` (not the standard Kubernetes `mountPath`) to specify where to mount volumes. This is defined in the DGD custom resource definition API.

Applied to files:

  • docs/kubernetes/api_reference.md
📚 Learning: 2025-06-04T13:09:53.416Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.

Applied to files:

  • deploy/cloud/operator/internal/dynamo/graph_test.go
📚 Learning: 2025-05-29T16:29:45.152Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 1266
File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go:85-85
Timestamp: 2025-05-29T16:29:45.152Z
Learning: In the Dynamo codebase, ComponentTypePlanner constants with different cases ("Planner" vs "planner") are intentional and serve different purposes: component type in config vs component label. These should not be made consistent as they handle different contexts.

Applied to files:

  • deploy/cloud/operator/internal/dynamo/graph_test.go
🧬 Code graph analysis (6)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (1)
deploy/cloud/operator/api/v1alpha1/groupversion_info.go (1)
  • SchemeBuilder (35-35)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (4)
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1)
  • DynamoGraphDeployment (63-71)
deploy/cloud/operator/internal/controller_common/resource.go (1)
  • SyncResource (60-195)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (4)
  • DynamoGraphDeploymentScalingAdapter (83-89)
  • DynamoGraphDeploymentScalingAdapterSpec (25-35)
  • DynamoGraphDeploymentServiceRef (38-48)
  • DynamoGraphDeploymentScalingAdapterList (94-98)
deploy/cloud/operator/internal/consts/consts.go (2)
  • KubeLabelDynamoGraphDeploymentName (39-39)
  • KubeLabelDynamoComponent (40-40)
deploy/cloud/operator/internal/dynamo/graph.go (1)
deploy/cloud/operator/internal/consts/consts.go (1)
  • KubeLabelDynamoComponent (40-40)
deploy/cloud/operator/cmd/main.go (2)
deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (1)
  • DynamoGraphDeploymentScalingAdapterReconciler (45-50)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
  • Config (61-87)
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
deploy/cloud/operator/internal/consts/consts.go (3)
  • KubeLabelDynamoComponent (40-40)
  • KubeLabelDynamoSelector (32-32)
  • KubeLabelDynamoGraphDeploymentName (39-39)
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (5)
  • DynamoGraphDeploymentScalingAdapter (83-89)
  • DynamoGraphDeploymentScalingAdapterList (94-98)
  • DynamoGraphDeploymentScalingAdapterSpec (25-35)
  • DynamoGraphDeploymentScalingAdapterStatus (51-65)
  • DynamoGraphDeploymentServiceRef (38-48)
🪛 markdownlint-cli2 (0.18.1)
docs/kubernetes/autoscaling.md

18-18: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

docs/kubernetes/api_reference.md

293-293: Link fragments should be valid

(MD051, link-fragments)


632-632: Link fragments should be valid

(MD051, link-fragments)


651-651: Link fragments should be valid

(MD051, link-fragments)


714-714: Link fragments should be valid

(MD051, link-fragments)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: sglang (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (29)
deploy/cloud/operator/internal/dynamo/graph.go (1)

1037-1046: Service‑scoped labels for Grove cliques are consistent and adapter‑friendly

Passing serviceName into generateLabels and adding KubeLabelDynamoComponent = componentName makes all cliques for a given service share the same selector (KubeLabelDynamoSelector) and component label, matching how DynamoComponentDeployment is labeled and simplifying per‑service scaling selection. The behavior looks correct and is well‑aligned with the scaling adapter design.

Also applies to: 1074-1084

deploy/cloud/operator/internal/dynamo/graph_test.go (1)

1297-1303: Tests correctly track new component‑level label and selector semantics

All updated expectations for KubeLabelDynamoComponent and KubeLabelDynamoSelector on cliques and component deployments line up with the implementation in generateLabels and GenerateDynamoComponentsDeployments: selectors are now per‑service (<graph>-<lower(serviceName)>), and every resource carries a KubeLabelDynamoComponent equal to the service key. This keeps tests in sync with the new scaling/labeling model.

Also applies to: 1473-1481, 1873-1885, 2050-2062, 2191-2200, 2353-2361, 2775-2787, 2939-2949, 3079-3088, 3241-3249

docs/_sections/k8s_deployment.rst (1)

13-13: LGTM!

The new toctree entry for the Autoscaling documentation follows the existing pattern and correctly references the new autoscaling.md file.

deploy/cloud/operator/internal/controller/common.go (1)

57-64: LGTM!

The getServiceKeys helper is correctly implemented with proper capacity pre-allocation for the slice.

docs/kubernetes/autoscaling.md (3)

18-39: Excellent architecture diagram.

The ASCII diagram clearly illustrates the relationship between DGD, scaling adapters, and autoscalers. The static analysis warning about missing language specification can be safely ignored here as this is an ASCII diagram, not executable code.


136-161: Well-documented HPA example.

The example correctly demonstrates targeting the DGDSA via scaleTargetRef and includes sensible defaults for stabilization windows. Good practice showing both scale-up and scale-down behavior configuration.
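For context, a sketch of the kind of HPA manifest being described here; the resource names and numbers are placeholders, not values taken from the guide:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: my-graph-frontend            # the adapter, not the DGD itself
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # placeholder stabilization windows
    scaleDown:
      stabilizationWindowSeconds: 300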


363-395: Helpful troubleshooting section.

The troubleshooting guidance covers common issues with practical kubectl commands. Good inclusion of the raw API check for custom metrics availability.

deploy/cloud/operator/api/v1alpha1/dynamographdeploymentscalingadapter_types.go (6)

24-35: Well-designed spec structure.

The spec correctly uses int32 for replicas (matching Kubernetes conventions) with Minimum=0 validation to support scale-to-zero scenarios with KEDA. Required validation ensures the adapter always has a valid reference.


37-48: LGTM!

The DynamoGraphDeploymentServiceRef cleanly encapsulates the reference to a specific service within a DGD, with appropriate validation to prevent empty strings.


50-65: LGTM!

The status fields correctly support the scale subresource requirements: Replicas for current state, Selector for HPA pod discovery, and LastScaleTime for scaling observability.


67-74: Proper scale subresource configuration.

The kubebuilder markers correctly configure the scale subresource paths, enabling seamless integration with HPA, KEDA, and other scale-aware controllers. The print columns provide a good quick-view experience with kubectl get dgdsa.


83-89: LGTM!

The main type follows standard Kubernetes CRD patterns with embedded TypeMeta/ObjectMeta and separate Spec/Status structs. The omitempty on Spec is safe since the required field validations will be enforced by the webhook.


100-102: LGTM!

The init function correctly registers both the resource and list types with the scheme builder.

deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)

569-676: Auto-generated deepcopy functions look correct.

The controller-gen output correctly handles:

  • Value copy for Spec (since DGDRef contains only string fields)
  • Deep copy for Status.LastScaleTime pointer via DeepCopy()
  • Standard DeepCopyObject for runtime.Object interface compliance
docs/kubernetes/api_reference.md (1)

297-373: Well-documented new scaling adapter API.

The documentation for DynamoGraphDeploymentScalingAdapter is comprehensive, covering the resource purpose, spec fields (replicas, dgdRef), and status fields (replicas, selector, lastScaleTime). The integration with HPA/KEDA via the scale subresource is clearly explained.
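As a rough illustration of those status fields (the selector string format and label key are assumptions; the real values may differ):

status:
  replicas: 3
  selector: nvidia.com/dynamo-component=Frontend   # serialized label selector consumed by HPA (assumed key)
  lastScaleTime: "2025-12-04T10:15:00Z"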

deploy/cloud/operator/cmd/main.go (1)

581-589: LGTM - Controller registration follows established pattern.

The DynamoGraphDeploymentScalingAdapterReconciler registration is consistent with other controllers in this file. All required dependencies (Client, Scheme, Recorder, Config) are properly wired, and error handling follows the same pattern.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentscalingadapters.yaml (2)

131-136: Well-designed scale subresource configuration.

The scale subresource is correctly configured for HPA/KEDA compatibility with proper paths for specReplicasPath, statusReplicasPath, and labelSelectorPath. This enables standard Kubernetes autoscalers to interact with the adapter.
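For reference, the subresource stanza described above would look roughly like this in the CRD version definition (a sketch based only on the paths named in this comment):

subresources:
  scale:
    labelSelectorPath: .status.selector
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas
  status: {}    # assumed: a status subresource is also defined, matching the status RBAC permissions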


1-136: LGTM - CRD definition is complete and follows Kubernetes conventions.

The CRD properly defines all required components for a scaling adapter: spec with target reference and desired replicas, status for observed state, and the scale subresource for HPA/KEDA integration. The printer columns provide useful kubectl get output.

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (4)

660-684: LGTM - Cleanup logic for orphaned adapters.

The cleanup logic correctly lists adapters by DGD name label and removes those referencing services that no longer exist in the DGD spec. Error handling properly ignores NotFound errors, and events are recorded for observability.


689-693: LGTM with minor note on edge case.

The naming function correctly lowercases service names for DNS subdomain compliance. Note that services differing only by case (e.g., "Frontend" vs "frontend") would produce the same adapter name, but this is unlikely in practice given typical service naming conventions.


714-720: Appropriate predicate configuration for owned adapters.

The predicates correctly:

  • Ignore create events (avoids redundant reconciliation after creating adapter)
  • React to delete events (ensures adapters are recreated if deleted)
  • Ignore update events (adapter controller manages status; spec changes flow from DGD)

This design ensures the DGD controller creates adapters but doesn't fight with the adapter controller for updates.


229-234: LGTM - Scaling adapter reconciliation placement in flow.

The scaling adapter reconciliation is appropriately placed after PVC reconciliation and before service-specific resource reconciliation. This ensures adapters exist before any autoscaling-dependent logic runs.

deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)

382-401: No action needed. The DynamoGraphDeploymentScalingAdapter controller does not use finalizers, so the missing dynamographdeploymentscalingadapters/finalizers permission is not required. The RBAC configuration is correct.
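A sketch of what the added rules might look like; the verb list is assumed from the change summary and the actual generated RBAC may differ:

- apiGroups:
    - nvidia.com
  resources:
    - dynamographdeploymentscalingadapters
  verbs:
    - create
    - delete
    - get
    - list
    - patch
    - update
    - watch
- apiGroups:
    - nvidia.com
  resources:
    - dynamographdeploymentscalingadapters/status
  verbs:
    - get
    - patch
    - update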

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentscalingadapters.yaml (2)

1-51: LGTM! Well-structured CRD with proper HPA/KEDA integration.

The CRD metadata, naming conventions, and printer columns are appropriately configured. The scale subresource paths correctly reference the schema fields.


131-136: LGTM! Scale subresource is correctly configured for HPA compatibility.

The labelSelectorPath, specReplicasPath, and statusReplicasPath correctly reference the schema fields, enabling standard Kubernetes autoscalers to interact with this resource.

deploy/cloud/operator/internal/controller/dynamographdeploymentscalingadapter_controller.go (4)

44-50: LGTM! Reconciler struct follows standard patterns.

The struct correctly embeds the Kubernetes client and includes the event recorder for observability.


122-147: LGTM! Scaling logic correctly updates DGD and records events.

The logic properly updates the DGD's service replicas and records both Kubernetes events and LastScaleTime for observability.


87-101: > Likely an incorrect or invalid review comment.


171-200: LGTM! Controller setup with efficient watch predicates.

Good use of GenerationChangedPredicate for the primary resource and custom predicates for DGD watches. The servicesEqual comparison ensures reconciliation only on meaningful changes, properly comparing replica counts between service maps with nil-safe handling.

Signed-off-by: Julien Mancuso <[email protected]>
julienmancuso and others added 9 commits December 3, 2025 08:55

// 4. Detect out-of-band DGD changes (Scenario 1: User manually edited DGD)
// If DGD replicas differ from adapter status, DGD was modified externally
if currentReplicas != adapter.Status.Replicas {
Contributor

It feels problematic and unexpected that edits to the DGD service replica count are allowed to update the DGDSA replica count.

I think on the DGD there should be a field to enable the DGDSA for a DGD or specific DGD service. In the case where it's enabled the DGDSA serves as the state of truth whereas when it's not enabled, the DGD service replica count is the source of truth. If a user wants to manually edit the DGD service replica count with SA enabled - they can modify the DGDSA directly.

@julienmancuso (Contributor, Author) commented Dec 4, 2025

The DGD is the source of truth; users should be directed to update the DGD if they want to make any modifications (replica count included).
The DGDSA is just a reflection of what's in the DGD and should stay in sync with it.
The DGDSA is an adapter, not a state owner: it provides a translation layer for k8s-native autoscalers and should unconditionally be present. Adding an "enable DGDSA" flag would introduce conditional behavior and additional confusion (two sources of truth depending on a flag).

Contributor

I think in your current implementation both the DGD and the DGDSA act as sources of truth. A user edit to the DGD service replicas -> this will always win. If the DGDSA spec replicas are updated and there are no updates to the DGD -> the DGD service replicas are updated. However, there is a subtle case where an update to the DGDSA is discarded if a concurrent user edit is made to the DGD. We can document this, but I don't think it's a good k8s pattern.

If we think about Deployments and ReplicaSets, the Deployment replica count is the source of truth. A user edit to the ReplicaSet replica count will always be overridden by the Deployment replica count.

Adding an "enable DGDSA" flag determines which resource is the source of truth. I'm a user just deploying a DGD who doesn't care about auto-scaling -> DGDSA disabled; I can update the DGD service replicas directly. Or I'm a user who is using Planner, HPA, or KEDA -> DGDSA enabled. I don't think that user wants to be overriding the DGD replica count with manual edits -> this would conflict with Planner/HPA/KEDA. If they want to edit the DGD service replica count manually (say, for testing) -> they can just update the DGDSA.

Contributor Author

fixed in 7706428 as per our team discussion

Signed-off-by: Julien Mancuso <[email protected]>

// GetWarnings returns deprecation warnings for the spec.
// This should be called after Validate() to collect any deprecation notices.
func (v *SharedSpecValidator) GetWarnings() []string {
Contributor

nit: I think it would be cleaner to have DynamoGraphDeploymentValidator.Validate() return (admission.Warnings, error).

Contributor Author

fixed in 7706428

{
	name: "valid spec with all fields",
	spec: &nvidiacomv1alpha1.DynamoComponentDeploymentSharedSpec{
		Replicas: &validReplicas,
@tmonty12 (Contributor) commented Dec 8, 2025

nit: unit test for warning validation?

Contributor Author

updated


Signed-off-by: Julien Mancuso <[email protected]>
// ValidateUpdate performs stateful validation comparing old and new DynamoGraphDeployment.
// Returns warnings and error.
func (v *DynamoGraphDeploymentValidator) ValidateUpdate(old *nvidiacomv1alpha1.DynamoGraphDeployment) (admission.Warnings, error) {
	return v.ValidateUpdateWithUserInfo(old, nil)
Contributor

nit: it feels confusing that ValidateUpdate calls ValidateUpdateWithUserInfo. It seems that ValidateUpdate is called when we can't retrieve the user info.

For replica updates, if we fail to get the user info -> then only allow service replica updates where DGDSA is disabled. Right now though, if the user info can't be retrieved, we are allowing requests that modify service replicas even if DGDSA is enabled. In this case we should reject.

@@ -0,0 +1,733 @@
# Autoscaling
Contributor

This autoscaling doc is awesome! Thanks for adding in a section on HPA and KEDA.
