# BlogPost: Databricks Platform Observability Dashboard #74

sashankkotta-db wants to merge 9 commits into databricks-solutions:main from
## Conversation

TLC ticket having all the review comments: https://databricks.atlassian.net/browse/TLC-982
Pull request overview
Adds a new “Databricks Platform Observability Dashboard” package containing a Databricks notebook to materialize cost/usage/reliability/hygiene tables from System Tables, plus documentation and ownership updates to support maintaining and operating the dashboard assets.
Changes:
- Add a large Databricks notebook that creates/optimizes/vacuums multiple curated Delta tables for dashboarding, using parallel execution.
- Add project documentation and a Databricks-specific license for the new dashboard package.
- Update `CODEOWNERS` to include owners for the new `2026-02-platform-observability-dashboard` directory.
Reviewed changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `CODEOWNERS` | Adds code ownership for the new dashboard directory. |
| `2026-02-platform-observability-dashboard/materialize_dashboard_queries_run_parallely.py` | New materialization notebook for creating/optimizing/vacuuming dashboard tables from Databricks System Tables. |
| `2026-02-platform-observability-dashboard/README.md` | Usage and setup documentation for importing/running the dashboard assets. |
| `2026-02-platform-observability-dashboard/LICENSE.md` | Adds DB license text for this package. |
| `.DS_Store` | Adds an OS-generated metadata file (should not be committed). |
```python
    SELECT * FROM usage_cost;"""

spark.sql(query)

query_optimize=f"OPTIMIZE {destination_catalog}.{destination_schema}.ap_cluster_by_job_runs_total_job_runs;"
query_vaccum=f"VACUUM {destination_catalog}.{destination_schema}.ap_cluster_by_job_runs_total_job_runs;"
```
This query is executed immediately via spark.sql(query) and also appended to queries_to_be_executed_parallely, so it will run twice (and potentially race with itself). Additionally, the OPTIMIZE/VACUUM statements reference ap_cluster_by_job_runs_total_job_runs instead of the table created here (ap_cluster_by_job_runs_usage_cost). Remove the immediate execution and fix the OPTIMIZE/VACUUM target table names.
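A minimal sketch of what the corrected cell could look like, assuming the notebook's list-based pattern (`queries_to_be_executed_parallely` and the OPTIMIZE/VACUUM lists come from the PR; the catalog/schema values here are placeholders for illustration):

```python
# Placeholder values for illustration only; the notebook takes these from widgets.
destination_catalog = "main"
destination_schema = "observability"
table = "ap_cluster_by_job_runs_usage_cost"

query = f"""CREATE OR REPLACE TABLE {destination_catalog}.{destination_schema}.{table} AS
SELECT * FROM usage_cost;"""

queries_to_be_executed_parallely = []
optimize_queries_to_be_executed_parallely = []
vaccum_queries_to_be_executed_parallely = []

# Append only -- do NOT also call spark.sql(query) here, otherwise the
# table is materialized twice (once eagerly, once in the parallel pass).
queries_to_be_executed_parallely.append(query)

# OPTIMIZE/VACUUM must target the table created above,
# not ap_cluster_by_job_runs_total_job_runs.
optimize_queries_to_be_executed_parallely.append(
    f"OPTIMIZE {destination_catalog}.{destination_schema}.{table};"
)
vaccum_queries_to_be_executed_parallely.append(
    f"VACUUM {destination_catalog}.{destination_schema}.{table};"
)
```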
```python
# DBTITLE 1,Execute Optimize Spark SQL Queries in Parallel
# Run all OPTIMIZE commands in parallel to compact files and improve performance for each materialized dashboard table.
from concurrent.futures import ThreadPoolExecutor

def run_query(query):
    return spark.sql(query)

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(run_query, q) for q in optimize_queries_to_be_executed_parallely]
    results = [f.result() for f in futures]
```
OPTIMIZE is executed even when one or more materialization queries failed, and exceptions from f.result() are not caught. That can terminate the notebook during OPTIMIZE, masking the original failure and skipping the final aggregated failure check. Consider only running OPTIMIZE when failed_list is empty (or wrap per-table OPTIMIZE in try/except and record failures similarly to the main query loop).
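A sketch of the gated variant the reviewer suggests. `run_query` is stubbed here so the example is self-contained; in the notebook it wraps `spark.sql`, and `failed_list` is assumed to be the notebook's tracker of materialization failures:

```python
from concurrent.futures import ThreadPoolExecutor

failed_list = []  # assumed: populated earlier by the materialization loop

def run_query(query):
    # Stub for illustration; the notebook would call spark.sql(query).
    if "boom" in query:
        raise RuntimeError(f"query failed: {query}")
    return f"ok: {query}"

optimize_queries = ["OPTIMIZE t1;", "OPTIMIZE boom;", "OPTIMIZE t3;"]
optimize_failures = []

if not failed_list:  # only OPTIMIZE when all tables materialized cleanly
    with ThreadPoolExecutor() as executor:
        futures = {executor.submit(run_query, q): q for q in optimize_queries}
        for future, q in futures.items():
            try:
                future.result()
            except Exception as e:
                # Record the failure and keep optimizing the other tables,
                # so one bad table does not terminate the notebook.
                optimize_failures.append((q, str(e)))
else:
    print("Skipping OPTIMIZE due to prior materialization failures.")
```

Catching per-future exceptions preserves the final aggregated failure check instead of letting the first `f.result()` raise and mask the original error.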
```python
with ThreadPoolExecutor() as executor:
    futures = [executor.submit(run_query, q) for q in vaccum_queries_to_be_executed_parallely]
    results = [f.result() for f in futures]
```
VACUUM is executed regardless of whether the table-creation queries succeeded, and exceptions from f.result() are not caught. If a table wasn’t created, VACUUM can fail and stop the notebook before the final status check runs. Gate VACUUM on successful materialization (or capture VACUUM failures in the status log similarly to the create-table steps).
Suggested change:

```diff
-with ThreadPoolExecutor() as executor:
-    futures = [executor.submit(run_query, q) for q in vaccum_queries_to_be_executed_parallely]
-    results = [f.result() for f in futures]
+# Only run VACUUM if all materialization queries succeeded to avoid
+# failing on non-existent tables, and handle VACUUM errors per query.
+if vaccum_queries_to_be_executed_parallely:
+    if final_df_status.filter(final_df_status.status != "success").count() == 0:
+        with ThreadPoolExecutor() as executor:
+            futures = [executor.submit(run_query, q) for q in vaccum_queries_to_be_executed_parallely]
+            for future in futures:
+                try:
+                    _ = future.result()
+                except Exception as e:
+                    # Log the VACUUM failure and continue with remaining queries
+                    print(f"VACUUM query failed with error: {e}")
+    else:
+        print("Skipping VACUUM execution due to prior materialization failures.")
```
```sql
) AS success_rate_percent
FROM job_runs j
LEFT JOIN latest_clusters c ON c.cluster_id = j.cluster_id
    AND j.period_end_time BETWEEN current_timestamp() - INTERVAL 3 YEARS AND current_timestamp()
```
The 3-year time filter is applied in the LEFT JOIN condition to latest_clusters, not to the job_runs rows themselves. As written, job_runs includes all history and rows outside the time window will still be counted (they’ll just fail to match a cluster). Move the period_end_time filter into the job_runs CTE (or a WHERE in total_job_runs) so it actually restricts the job run population.
Suggested change:

```diff
-AND j.period_end_time BETWEEN current_timestamp() - INTERVAL 3 YEARS AND current_timestamp()
+WHERE j.period_end_time BETWEEN current_timestamp() - INTERVAL 3 YEARS AND current_timestamp()
```
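A toy illustration of the semantics behind this comment, with invented data: a predicate in a LEFT JOIN's ON clause never drops left-side rows, it only controls which right-side rows match, whereas the same predicate in WHERE actually filters the population.

```python
# (run_id, cluster_id, year) -- invented data; year stands in for period_end_time
job_runs = [("run_old", "c1", 2018), ("run_new", "c1", 2025)]
clusters = {"c1": "cluster-one"}
CUTOFF = 2023  # stands in for current_timestamp() - INTERVAL 3 YEARS

# Predicate simulated in the ON clause: every run survives; out-of-window
# runs just fail to match a cluster (NULL columns) but are still counted.
on_clause = [(r, clusters.get(c) if y >= CUTOFF else None) for r, c, y in job_runs]

# Predicate simulated in WHERE: out-of-window runs are actually excluded.
where_clause = [(r, clusters.get(c)) for r, c, y in job_runs if y >= CUTOFF]
```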
```sql
ROUND((COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS failure_rate_percent,
ROUND((COUNT(DISTINCT CASE WHEN result_state = 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS success_rate_percent,
```
failure_rate_percent / success_rate_percent use COUNT(*) as the denominator even though this query joins to system.billing.usage (potentially multiple usage rows per run). This will skew the rates; the denominator should be based on runs (e.g., COUNT(DISTINCT run_id) / total_job_runs) to match the numerator.
Suggested change:

```diff
-ROUND((COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS failure_rate_percent,
-ROUND((COUNT(DISTINCT CASE WHEN result_state = 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS success_rate_percent,
+ROUND((COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED' THEN run_id END) / COUNT(DISTINCT run_id)) * 100, 2) AS failure_rate_percent,
+ROUND((COUNT(DISTINCT CASE WHEN result_state = 'SUCCEEDED' THEN run_id END) / COUNT(DISTINCT run_id)) * 100, 2) AS success_rate_percent,
```
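A small numeric example of the skew, with invented data: after joining runs to billing usage, one run can appear on several rows, so `COUNT(*)` inflates the denominator while the `COUNT(DISTINCT run_id)` numerator does not grow with it.

```python
# (run_id, result_state) -- invented: run "r1" matched 3 usage rows after the join
rows = [
    ("r1", "SUCCEEDED"), ("r1", "SUCCEEDED"), ("r1", "SUCCEEDED"),
    ("r2", "FAILED"),
]

failed_runs = len({r for r, s in rows if s != "SUCCEEDED"})  # DISTINCT failed run_ids
total_rows = len(rows)                   # COUNT(*): 4 joined rows
total_runs = len({r for r, _ in rows})   # COUNT(DISTINCT run_id): 2 runs

skewed = round(failed_runs / total_rows * 100, 2)    # per-row denominator
correct = round(failed_runs / total_runs * 100, 2)   # per-run denominator
```

With one failed run out of two, the per-row rate reports 25% while the per-run rate correctly reports 50%.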
- …_queries_run_parallely.py (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
- …_queries_run_parallely.py (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
- …_queries_run_parallely.py (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
- …_queries_run_parallely.py (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
- …_queries_run_parallely.py (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
A comprehensive solution for monitoring, tracking, and optimizing Databricks platform costs, usage, and operational health. This repository provides ready-to-use dashboards and automation scripts to gain complete visibility into your Databricks environment.