@tedzhouhk tedzhouhk commented Dec 9, 2025

Closes DYN-1532: Grafana Dashboard for Planner

Summary by CodeRabbit

  • New Features

    • Added a Grafana dashboard for real-time monitoring of SLA Planner performance metrics including worker counts, latency, throughput, and correction factors.
  • Documentation

    • Enhanced quick start guide with step-by-step deployment instructions for the monitoring dashboard.
    • Expanded configuration reference with detailed SLA planning settings and optional deployment parameters.

@tedzhouhk tedzhouhk requested a review from a team as a code owner December 9, 2025 01:32
@github-actions github-actions bot added the feat label Dec 9, 2025
coderabbitai bot commented Dec 9, 2025

Walkthrough

A new Grafana dashboard ConfigMap is added to monitor Dynamo Planner metrics, alongside documentation updates in the SLA Planner quickstart guide that explain dashboard deployment, access instructions, and configuration details.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Observability Infrastructure: deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml | New Kubernetes ConfigMap manifest defining a Grafana dashboard for Dynamo Planner monitoring, with panels tracking worker counts, latency, throughput, observed/predicted metrics, correction factors, and replica counts using Prometheus data source and PromQL expressions. |
| Documentation: docs/planner/sla_planner_quickstart.md | Extended SLA Planner quickstart with additional step for deploying and viewing the Planner Grafana dashboard, including access instructions, dashboard content overview, and expanded configuration reference table with DGDR and SLA-related fields. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Review the ConfigMap YAML structure, panel configurations, and PromQL expressions for accuracy
  • Verify documentation instructions for dashboard deployment and Grafana UI access are correct and complete

Poem

🐰 A dashboard hops into view,
Metrics dancing through and through,
Workers, latency, and throughput shine,
Prometheus whispers—all is fine!
Now quick-start guides show the way,
To monitor plans, hip-hop-hooray! 🎯

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is incomplete and largely missing required template sections. It lacks an Overview, Details explanation of changes, and guidance on where reviewers should start. | Expand the description to include Overview, detailed explanation of changes made, and guidance on which files reviewers should focus on initially. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a Grafana dashboard for the planner, which aligns with the primary changeset adding the dashboard ConfigMap and documentation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd4366d and 3855bac.

📒 Files selected for processing (2)
  • deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml (1 hunks)
  • docs/planner/sla_planner_quickstart.md (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: sglang (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
docs/planner/sla_planner_quickstart.md (1)

182-200: Verify relative documentation path is correct.

Line 190 references the relative path ../kubernetes/observability/metrics.md. Please verify that it resolves correctly from the document's location at docs/planner/sla_planner_quickstart.md.

The expected path should be docs/kubernetes/observability/metrics.md. If the file is located elsewhere, update the reference accordingly.
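
The resolution described above can be checked with a small sketch. It only normalizes the path strings quoted in this comment and does not touch the filesystem; the helper name `resolve_doc_link` is made up for illustration.

```python
import posixpath


def resolve_doc_link(source_doc: str, relative_link: str) -> str:
    """Resolve a relative link against the directory of the file that contains it."""
    base_dir = posixpath.dirname(source_doc)
    return posixpath.normpath(posixpath.join(base_dir, relative_link))


# Both paths are taken from the review comment above.
resolved = resolve_doc_link(
    "docs/planner/sla_planner_quickstart.md",
    "../kubernetes/observability/metrics.md",
)
print(resolved)  # docs/kubernetes/observability/metrics.md
```

Whether a file actually exists at that location still has to be confirmed in the repository, for example with `os.path.exists` run from the repo root.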

deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml (2)

1-11: Kubernetes manifest structure is correct.

The ConfigMap metadata and label grafana_dashboard: "1" are properly configured for Grafana auto-discovery in the monitoring namespace.
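
As a rough illustration of the structure being checked here, the sketch below builds a minimal ConfigMap manifest as a Python dict and tests it for the discovery label. The label `grafana_dashboard: "1"`, the `monitoring` namespace, and schema version 41 come from this review; the ConfigMap name, the data key, the dashboard title, and both helper functions are hypothetical. It also assumes the Grafana dashboard sidecar is configured to watch that label, which should be confirmed for the actual deployment.

```python
import json


def build_dashboard_configmap(name: str, dashboard: dict, namespace: str = "monitoring") -> dict:
    """Build a ConfigMap manifest that carries a Grafana dashboard as a JSON string."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,  # hypothetical name, not taken from the PR
            "namespace": namespace,
            # Label the review says is used for Grafana auto-discovery.
            "labels": {"grafana_dashboard": "1"},
        },
        # ConfigMap data values must be strings, so the dashboard is serialized.
        "data": {f"{name}.json": json.dumps(dashboard)},
    }


def is_discoverable(manifest: dict) -> bool:
    """Return True if the manifest is a ConfigMap carrying the discovery label."""
    labels = manifest.get("metadata", {}).get("labels", {})
    return manifest.get("kind") == "ConfigMap" and labels.get("grafana_dashboard") == "1"


# Placeholder dashboard body; the real one spans lines 13-1524 of the manifest.
manifest = build_dashboard_configmap(
    "planner-dashboard", {"title": "Planner", "schemaVersion": 41}
)
print(json.dumps(manifest, indent=2))
print(is_discoverable(manifest))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed manifest could be applied directly in a test cluster.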


13-1524: Verify planner metrics are exported and compatible with Grafana version before deployment.

The dashboard queries approximately 15 planner metrics (e.g., planner:num_p_workers, planner:observed_ttft, planner:predicted_num_d, planner:p_correction_factor). Before deployment, confirm:

  1. Metric availability: These metrics are actually exported by the planner component and available in Prometheus with the namespace label for filtering
  2. Grafana compatibility: The dashboard uses schema version 41 and plugin version 12.0.1. Verify your Grafana instance supports these versions
  3. Namespace assumption: The ConfigMap is deployed to the monitoring namespace. If your kube-prometheus-stack is deployed in a different namespace, update the Prometheus datasource configuration in Grafana accordingly

To verify metric availability, query your Prometheus instance:

curl "http://<prometheus-endpoint>:9090/api/v1/query?query=planner:num_p_workers"

Ensure the metrics are present and include the namespace label for the variable filtering to work correctly.
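
The same check can be scripted. The following is a minimal sketch, assuming the standard Prometheus instant-query endpoint `/api/v1/query` and its usual JSON response shape; the helper names and the sample payload are made up for illustration, and the base URL must be replaced with your own Prometheus endpoint.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL, percent-encoding the PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def series_missing_label(payload: dict, label: str = "namespace") -> list:
    """Return the label sets of all returned series that lack the given label."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload.get("data", {}).get("result", [])
    return [s.get("metric", {}) for s in results if label not in s.get("metric", {})]


def check_metric(base_url: str, metric: str, label: str = "namespace") -> bool:
    """Query Prometheus and report whether the metric exists with the label on every series."""
    with urllib.request.urlopen(build_query_url(base_url, metric), timeout=10) as resp:
        payload = json.load(resp)
    has_series = bool(payload.get("data", {}).get("result"))
    return has_series and not series_missing_label(payload, label)


# Offline example with an invented payload in the Prometheus response shape.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "planner:num_p_workers", "namespace": "demo"},
             "value": [0, "2"]},
            {"metric": {"__name__": "planner:num_p_workers"}, "value": [0, "1"]},
        ],
    },
}
print(build_query_url("http://localhost:9090", "planner:num_p_workers"))
print(series_missing_label(sample))
```

Calling `check_metric("http://<prometheus-endpoint>:9090", "planner:num_p_workers")` for each metric name used by the dashboard would flag any that are absent or unlabeled.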

