
BlogPost: Databricks Platform Observability Dashboard #74

Open

sashankkotta-db wants to merge 9 commits into databricks-solutions:main from sashankkotta-db:main

Conversation

@sashankkotta-db

A comprehensive solution for monitoring, tracking, and optimizing Databricks platform costs, usage, and operational health. This repository provides ready-to-use dashboards and automation scripts to gain complete visibility into your Databricks environment.

@sashankkotta-db
Author

TLC ticket tracking all review comments: https://databricks.atlassian.net/browse/TLC-982

Contributor

Copilot AI left a comment

Pull request overview

Adds a new “Databricks Platform Observability Dashboard” package containing a Databricks notebook to materialize cost/usage/reliability/hygiene tables from System Tables, plus documentation and ownership updates to support maintaining and operating the dashboard assets.

Changes:

  • Add a large Databricks notebook that creates/optimizes/vacuums multiple curated Delta tables for dashboarding, using parallel execution.
  • Add project documentation and a Databricks-specific license for the new dashboard package.
  • Update CODEOWNERS to include owners for the new 2026-02-platform-observability-dashboard directory.
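The materialize-then-optimize-then-vacuum parallelism described above can be sketched generically. This is a minimal sketch, not the notebook's actual code: `run_parallel` and the stub `fake_run` are illustrative names, and in the notebook the runner would be `spark.sql`.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(queries, run_fn, max_workers=8):
    """Run each query via run_fn in a thread pool, collecting failures
    instead of letting the first exception kill the whole batch."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(run_fn, q): q for q in queries}
        for future, query in futures.items():
            try:
                results.append((query, future.result()))
            except Exception as exc:
                failures.append((query, exc))
    return results, failures

# In the notebook, run_fn would be spark.sql; a stub shows the flow here.
def fake_run(q):
    if "bad" in q:
        raise RuntimeError(f"query failed: {q}")
    return f"ok: {q}"

results, failures = run_parallel(["SELECT 1", "bad query"], fake_run)
```

Collecting per-query failures (rather than letting the first `f.result()` exception propagate) is what several of the review comments below ask for.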

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 10 comments.

Show a summary per file

| File | Description |
| --- | --- |
| CODEOWNERS | Adds code ownership for the new dashboard directory. |
| 2026-02-platform-observability-dashboard/materialize_dashboard_queries_run_parallely.py | New materialization notebook for creating/optimizing/vacuuming dashboard tables from Databricks System Tables. |
| 2026-02-platform-observability-dashboard/README.md | Usage and setup documentation for importing/running the dashboard assets. |
| 2026-02-platform-observability-dashboard/LICENSE.md | Adds Databricks license text for this package. |
| .DS_Store | Adds an OS-generated metadata file (should not be committed). |


Comment on lines +419 to +425
SELECT * FROM usage_cost;"""

spark.sql(query)

query_optimize=f"OPTIMIZE {destination_catalog}.{destination_schema}.ap_cluster_by_job_runs_total_job_runs;"
query_vaccum=f"VACUUM {destination_catalog}.{destination_schema}.ap_cluster_by_job_runs_total_job_runs;"

Copilot AI Mar 4, 2026
This query is executed immediately via spark.sql(query) and also appended to queries_to_be_executed_parallely, so it will run twice (and potentially race with itself). Additionally, the OPTIMIZE/VACUUM statements reference ap_cluster_by_job_runs_total_job_runs instead of the table created here (ap_cluster_by_job_runs_usage_cost). Remove the immediate execution and fix the OPTIMIZE/VACUUM target table names.
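The fix the comment asks for can be sketched as follows. This is illustrative only: the `destination_catalog`/`destination_schema` values are made up (the notebook parameterizes them), and the CREATE statement is abbreviated.

```python
destination_catalog = "main"            # illustrative value
destination_schema = "observability"    # illustrative value

table = f"{destination_catalog}.{destination_schema}.ap_cluster_by_job_runs_usage_cost"

# Abbreviated stand-in for the full CTE query in the notebook.
query = f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM usage_cost"

queries_to_be_executed_parallely = []
optimize_queries_to_be_executed_parallely = []
vaccum_queries_to_be_executed_parallely = []

# Do NOT also call spark.sql(query) here; queue it for the parallel
# runner only, so it executes exactly once.
queries_to_be_executed_parallely.append(query)

# OPTIMIZE/VACUUM must target the table created above,
# not ap_cluster_by_job_runs_total_job_runs.
optimize_queries_to_be_executed_parallely.append(f"OPTIMIZE {table};")
vaccum_queries_to_be_executed_parallely.append(f"VACUUM {table};")
```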

Comment on lines +2477 to +2486
# DBTITLE 1,Execute Optimize Spark SQL Queries in Parallel
# Run all OPTIMIZE commands in parallel to compact files and improve performance for each materialized dashboard table.
from concurrent.futures import ThreadPoolExecutor

def run_query(query):
    return spark.sql(query)

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(run_query, q) for q in optimize_queries_to_be_executed_parallely]
    results = [f.result() for f in futures]
Copilot AI Mar 4, 2026

OPTIMIZE is executed even when one or more materialization queries failed, and exceptions from f.result() are not caught. That can terminate the notebook during OPTIMIZE, masking the original failure and skipping the final aggregated failure check. Consider only running OPTIMIZE when failed_list is empty (or wrap per-table OPTIMIZE in try/except and record failures similarly to the main query loop).
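One way to implement the gating this comment describes, as a sketch: `failed_list` and the query list come from the notebook's context described in the review, while the stubbed `run_query` and the `optimize_failures` accumulator are hypothetical names used here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

failed_list = []  # populated by the materialization loop earlier in the notebook
optimize_queries_to_be_executed_parallely = ["OPTIMIZE t1;", "OPTIMIZE t2;"]
optimize_failures = []

def run_query(query):
    return query  # stands in for spark.sql(query) in this sketch

if not failed_list:
    # Only compact tables when every materialization succeeded.
    with ThreadPoolExecutor() as executor:
        futures = {executor.submit(run_query, q): q
                   for q in optimize_queries_to_be_executed_parallely}
    for future, query in futures.items():
        try:
            future.result()
        except Exception as exc:
            # Record the failure and keep optimizing the remaining tables,
            # so the final aggregated status check still runs.
            optimize_failures.append((query, exc))
else:
    print("Skipping OPTIMIZE due to prior materialization failures.")
```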

Comment on lines +2498 to +2500
with ThreadPoolExecutor() as executor:
    futures = [executor.submit(run_query, q) for q in vaccum_queries_to_be_executed_parallely]
    results = [f.result() for f in futures]
Copilot AI Mar 4, 2026

VACUUM is executed regardless of whether the table-creation queries succeeded, and exceptions from f.result() are not caught. If a table wasn’t created, VACUUM can fail and stop the notebook before the final status check runs. Gate VACUUM on successful materialization (or capture VACUUM failures in the status log similarly to the create-table steps).

Suggested change
with ThreadPoolExecutor() as executor:
    futures = [executor.submit(run_query, q) for q in vaccum_queries_to_be_executed_parallely]
    results = [f.result() for f in futures]
# Only run VACUUM if all materialization queries succeeded to avoid
# failing on non-existent tables, and handle VACUUM errors per query.
if vaccum_queries_to_be_executed_parallely:
    if final_df_status.filter(final_df_status.status != "success").count() == 0:
        with ThreadPoolExecutor() as executor:
            futures = [executor.submit(run_query, q) for q in vaccum_queries_to_be_executed_parallely]
        for future in futures:
            try:
                _ = future.result()
            except Exception as e:
                # Log the VACUUM failure and continue with remaining queries
                print(f"VACUUM query failed with error: {e}")
    else:
        print("Skipping VACUUM execution due to prior materialization failures.")

) AS success_rate_percent
FROM job_runs j
LEFT JOIN latest_clusters c ON c.cluster_id = j.cluster_id
AND j.period_end_time BETWEEN current_timestamp() - INTERVAL 3 YEARS AND current_timestamp()
Copilot AI Mar 4, 2026

The 3-year time filter is applied in the LEFT JOIN condition to latest_clusters, not to the job_runs rows themselves. As written, job_runs includes all history and rows outside the time window will still be counted (they’ll just fail to match a cluster). Move the period_end_time filter into the job_runs CTE (or a WHERE in total_job_runs) so it actually restricts the job run population.
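The behavior this comment describes is standard SQL outer-join semantics, and it is easy to reproduce standalone. In the sketch below, sqlite stands in for Spark SQL and the toy tables are made up: a left-table predicate in the ON clause of a LEFT JOIN only controls matching, while the same predicate in WHERE actually restricts the row population.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_runs (run_id INTEGER, cluster_id INTEGER, period_end_time TEXT);
CREATE TABLE latest_clusters (cluster_id INTEGER);
-- one old run (outside the window) and one recent run
INSERT INTO job_runs VALUES (1, 10, '2020-01-01'), (2, 10, '2026-01-01');
INSERT INTO latest_clusters VALUES (10);
""")

# Predicate on the left table inside ON: the old run still appears,
# it just fails to match a cluster (NULL columns on the right).
in_on = conn.execute("""
    SELECT COUNT(*) FROM job_runs j
    LEFT JOIN latest_clusters c ON c.cluster_id = j.cluster_id
        AND j.period_end_time >= '2024-01-01'
""").fetchone()[0]

# Same predicate in WHERE: the old run is actually excluded.
in_where = conn.execute("""
    SELECT COUNT(*) FROM job_runs j
    LEFT JOIN latest_clusters c ON c.cluster_id = j.cluster_id
    WHERE j.period_end_time >= '2024-01-01'
""").fetchone()[0]

print(in_on, in_where)  # 2 1
```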

Suggested change
AND j.period_end_time BETWEEN current_timestamp() - INTERVAL 3 YEARS AND current_timestamp()
WHERE j.period_end_time BETWEEN current_timestamp() - INTERVAL 3 YEARS AND current_timestamp()

Comment on lines +1260 to +1261
ROUND((COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS failure_rate_percent,
ROUND((COUNT(DISTINCT CASE WHEN result_state = 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS success_rate_percent,
Copilot AI Mar 4, 2026

failure_rate_percent / success_rate_percent use COUNT(*) as the denominator even though this query joins to system.billing.usage (potentially multiple usage rows per run). This will skew the rates; the denominator should be based on runs (e.g., COUNT(DISTINCT run_id) / total_job_runs) to match the numerator.
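The skew is easy to reproduce standalone: once a run joins to multiple billing rows, `COUNT(*)` counts usage rows, not runs. In this sketch sqlite stands in for Spark SQL and the toy data is made up (two runs, one of which has three usage rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (run_id INTEGER, result_state TEXT);
CREATE TABLE usage (run_id INTEGER, dbus REAL);
-- two runs: one failed, one succeeded...
INSERT INTO runs VALUES (1, 'FAILED'), (2, 'SUCCEEDED');
-- ...but the succeeded run fans out to 3 usage rows after the join
INSERT INTO usage VALUES (1, 1.0), (2, 1.0), (2, 2.0), (2, 3.0);
""")

row = conn.execute("""
    SELECT
      ROUND(COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED'
                                THEN r.run_id END) * 100.0 / COUNT(*), 2),
      ROUND(COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED'
                                THEN r.run_id END) * 100.0
            / COUNT(DISTINCT r.run_id), 2)
    FROM runs r JOIN usage u ON u.run_id = r.run_id
""").fetchone()

# COUNT(*) denominator reports 25% failed; COUNT(DISTINCT run_id)
# reports the true per-run rate of 50%.
print(row)  # (25.0, 50.0)
```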

Suggested change
ROUND((COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS failure_rate_percent,
ROUND((COUNT(DISTINCT CASE WHEN result_state = 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS success_rate_percent,
ROUND((COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED' THEN run_id END) / COUNT(DISTINCT run_id)) * 100, 2) AS failure_rate_percent,
ROUND((COUNT(DISTINCT CASE WHEN result_state = 'SUCCEEDED' THEN run_id END) / COUNT(DISTINCT run_id)) * 100, 2) AS success_rate_percent,

sashankkotta-db and others added 5 commits March 4, 2026 18:19
…_queries_run_parallely.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(four further commits applying review suggestions to the same file, each co-authored by Copilot)