
BlogPost: Databricks Platform Observability Dashboard #74

Open

sashankkotta-db wants to merge 9 commits into databricks-solutions:main from sashankkotta-db:main

Conversation

@sashankkotta-db

A comprehensive solution for monitoring, tracking, and optimizing Databricks platform costs, usage, and operational health. This repository provides ready-to-use dashboards and automation scripts to gain complete visibility into your Databricks environment.

@sashankkotta-db
Author

TLC ticket tracking all review comments: https://databricks.atlassian.net/browse/TLC-982

Contributor

Copilot AI left a comment

Pull request overview

Adds a new “Databricks Platform Observability Dashboard” package containing a Databricks notebook to materialize cost/usage/reliability/hygiene tables from System Tables, plus documentation and ownership updates to support maintaining and operating the dashboard assets.

Changes:

  • Add a large Databricks notebook that creates/optimizes/vacuums multiple curated Delta tables for dashboarding, using parallel execution.
  • Add project documentation and a Databricks-specific license for the new dashboard package.
  • Update CODEOWNERS to include owners for the new 2026-02-platform-observability-dashboard directory.
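The materialize-then-optimize-then-vacuum parallelism described above can be sketched generically. This is a minimal sketch, not the notebook's actual code: `run_parallel` and the stub `fake_run` are illustrative names, and in the notebook the runner would be `spark.sql`.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(queries, run_fn, max_workers=8):
    """Run each query via run_fn in a thread pool, collecting failures
    instead of letting the first exception kill the whole batch."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(run_fn, q): q for q in queries}
        for future, query in futures.items():
            try:
                results.append((query, future.result()))
            except Exception as exc:
                failures.append((query, exc))
    return results, failures

# In the notebook, run_fn would be spark.sql; a stub shows the flow here.
def fake_run(q):
    if "bad" in q:
        raise RuntimeError(f"query failed: {q}")
    return f"ok: {q}"

results, failures = run_parallel(["SELECT 1", "bad query"], fake_run)
```

Collecting per-query failures (rather than letting the first `f.result()` exception propagate) is what several of the review comments below ask for.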

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 10 comments.

Show a summary per file

| File | Description |
| --- | --- |
| CODEOWNERS | Adds code ownership for the new dashboard directory. |
| 2026-02-platform-observability-dashboard/materialize_dashboard_queries_run_parallely.py | New materialization notebook for creating/optimizing/vacuuming dashboard tables from Databricks System Tables. |
| 2026-02-platform-observability-dashboard/README.md | Usage and setup documentation for importing/running the dashboard assets. |
| 2026-02-platform-observability-dashboard/LICENSE.md | Adds Databricks license text for this package. |
| .DS_Store | Adds an OS-generated metadata file (should not be committed). |


Comment on lines +419 to +425
SELECT * FROM usage_cost;"""

spark.sql(query)

query_optimize=f"OPTIMIZE {destination_catalog}.{destination_schema}.ap_cluster_by_job_runs_total_job_runs;"
query_vaccum=f"VACUUM {destination_catalog}.{destination_schema}.ap_cluster_by_job_runs_total_job_runs;"

Copilot AI Mar 4, 2026
This query is executed immediately via spark.sql(query) and also appended to queries_to_be_executed_parallely, so it will run twice (and potentially race with itself). Additionally, the OPTIMIZE/VACUUM statements reference ap_cluster_by_job_runs_total_job_runs instead of the table created here (ap_cluster_by_job_runs_usage_cost). Remove the immediate execution and fix the OPTIMIZE/VACUUM target table names.
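The fix the comment asks for can be sketched as follows. This is illustrative only: the `destination_catalog`/`destination_schema` values are made up (the notebook parameterizes them), and the CREATE statement is abbreviated.

```python
destination_catalog = "main"            # illustrative value
destination_schema = "observability"    # illustrative value

table = f"{destination_catalog}.{destination_schema}.ap_cluster_by_job_runs_usage_cost"

# Abbreviated stand-in for the full CTE query in the notebook.
query = f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM usage_cost"

queries_to_be_executed_parallely = []
optimize_queries_to_be_executed_parallely = []
vaccum_queries_to_be_executed_parallely = []

# Do NOT also call spark.sql(query) here; queue it for the parallel
# runner only, so it executes exactly once.
queries_to_be_executed_parallely.append(query)

# OPTIMIZE/VACUUM must target the table created above,
# not ap_cluster_by_job_runs_total_job_runs.
optimize_queries_to_be_executed_parallely.append(f"OPTIMIZE {table};")
vaccum_queries_to_be_executed_parallely.append(f"VACUUM {table};")
```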

Comment on lines +2477 to +2486
# DBTITLE 1,Execute Optimize Spark SQL Queries in Parallel
# Run all OPTIMIZE commands in parallel to compact files and improve performance for each materialized dashboard table.
from concurrent.futures import ThreadPoolExecutor

def run_query(query):
    return spark.sql(query)

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(run_query, q) for q in optimize_queries_to_be_executed_parallely]
    results = [f.result() for f in futures]
Copilot AI Mar 4, 2026

OPTIMIZE is executed even when one or more materialization queries failed, and exceptions from f.result() are not caught. That can terminate the notebook during OPTIMIZE, masking the original failure and skipping the final aggregated failure check. Consider only running OPTIMIZE when failed_list is empty (or wrap per-table OPTIMIZE in try/except and record failures similarly to the main query loop).
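One way to implement the gating this comment describes, as a sketch: `failed_list` and the query list come from the notebook's context described in the review, while the stubbed `run_query` and the `optimize_failures` accumulator are hypothetical names used here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

failed_list = []  # populated by the materialization loop earlier in the notebook
optimize_queries_to_be_executed_parallely = ["OPTIMIZE t1;", "OPTIMIZE t2;"]
optimize_failures = []

def run_query(query):
    return query  # stands in for spark.sql(query) in this sketch

if not failed_list:
    # Only compact tables when every materialization succeeded.
    with ThreadPoolExecutor() as executor:
        futures = {executor.submit(run_query, q): q
                   for q in optimize_queries_to_be_executed_parallely}
    for future, query in futures.items():
        try:
            future.result()
        except Exception as exc:
            # Record the failure and keep optimizing the remaining tables,
            # so the final aggregated status check still runs.
            optimize_failures.append((query, exc))
else:
    print("Skipping OPTIMIZE due to prior materialization failures.")
```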

Comment on lines +2498 to +2500
with ThreadPoolExecutor() as executor:
    futures = [executor.submit(run_query, q) for q in vaccum_queries_to_be_executed_parallely]
    results = [f.result() for f in futures]
Copilot AI Mar 4, 2026

VACUUM is executed regardless of whether the table-creation queries succeeded, and exceptions from f.result() are not caught. If a table wasn’t created, VACUUM can fail and stop the notebook before the final status check runs. Gate VACUUM on successful materialization (or capture VACUUM failures in the status log similarly to the create-table steps).

Suggested change
with ThreadPoolExecutor() as executor:
    futures = [executor.submit(run_query, q) for q in vaccum_queries_to_be_executed_parallely]
    results = [f.result() for f in futures]
# Only run VACUUM if all materialization queries succeeded to avoid
# failing on non-existent tables, and handle VACUUM errors per query.
if vaccum_queries_to_be_executed_parallely:
    if final_df_status.filter(final_df_status.status != "success").count() == 0:
        with ThreadPoolExecutor() as executor:
            futures = [executor.submit(run_query, q) for q in vaccum_queries_to_be_executed_parallely]
        for future in futures:
            try:
                _ = future.result()
            except Exception as e:
                # Log the VACUUM failure and continue with remaining queries
                print(f"VACUUM query failed with error: {e}")
    else:
        print("Skipping VACUUM execution due to prior materialization failures.")

) AS success_rate_percent
FROM job_runs j
LEFT JOIN latest_clusters c ON c.cluster_id = j.cluster_id
AND j.period_end_time BETWEEN current_timestamp() - INTERVAL 3 YEARS AND current_timestamp()
Copilot AI Mar 4, 2026

The 3-year time filter is applied in the LEFT JOIN condition to latest_clusters, not to the job_runs rows themselves. As written, job_runs includes all history and rows outside the time window will still be counted (they’ll just fail to match a cluster). Move the period_end_time filter into the job_runs CTE (or a WHERE in total_job_runs) so it actually restricts the job run population.
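The behavior this comment describes is standard SQL outer-join semantics, and it is easy to reproduce standalone. In the sketch below, sqlite stands in for Spark SQL and the toy tables are made up: a left-table predicate in the ON clause of a LEFT JOIN only controls matching, while the same predicate in WHERE actually restricts the row population.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_runs (run_id INTEGER, cluster_id INTEGER, period_end_time TEXT);
CREATE TABLE latest_clusters (cluster_id INTEGER);
-- one old run (outside the window) and one recent run
INSERT INTO job_runs VALUES (1, 10, '2020-01-01'), (2, 10, '2026-01-01');
INSERT INTO latest_clusters VALUES (10);
""")

# Predicate on the left table inside ON: the old run still appears,
# it just fails to match a cluster (NULL columns on the right).
in_on = conn.execute("""
    SELECT COUNT(*) FROM job_runs j
    LEFT JOIN latest_clusters c ON c.cluster_id = j.cluster_id
        AND j.period_end_time >= '2024-01-01'
""").fetchone()[0]

# Same predicate in WHERE: the old run is actually excluded.
in_where = conn.execute("""
    SELECT COUNT(*) FROM job_runs j
    LEFT JOIN latest_clusters c ON c.cluster_id = j.cluster_id
    WHERE j.period_end_time >= '2024-01-01'
""").fetchone()[0]

print(in_on, in_where)  # 2 1
```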

Suggested change
AND j.period_end_time BETWEEN current_timestamp() - INTERVAL 3 YEARS AND current_timestamp()
WHERE j.period_end_time BETWEEN current_timestamp() - INTERVAL 3 YEARS AND current_timestamp()

Comment on lines +1260 to +1261
ROUND((COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS failure_rate_percent,
ROUND((COUNT(DISTINCT CASE WHEN result_state = 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS success_rate_percent,
Copilot AI Mar 4, 2026

failure_rate_percent / success_rate_percent use COUNT(*) as the denominator even though this query joins to system.billing.usage (potentially multiple usage rows per run). This will skew the rates; the denominator should be based on runs (e.g., COUNT(DISTINCT run_id) / total_job_runs) to match the numerator.
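The skew is easy to reproduce standalone: once a run joins to multiple billing rows, `COUNT(*)` counts usage rows, not runs. In this sketch sqlite stands in for Spark SQL and the toy data is made up (two runs, one of which has three usage rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (run_id INTEGER, result_state TEXT);
CREATE TABLE usage (run_id INTEGER, dbus REAL);
-- two runs: one failed, one succeeded...
INSERT INTO runs VALUES (1, 'FAILED'), (2, 'SUCCEEDED');
-- ...but the succeeded run fans out to 3 usage rows after the join
INSERT INTO usage VALUES (1, 1.0), (2, 1.0), (2, 2.0), (2, 3.0);
""")

row = conn.execute("""
    SELECT
      ROUND(COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED'
                                THEN r.run_id END) * 100.0 / COUNT(*), 2),
      ROUND(COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED'
                                THEN r.run_id END) * 100.0
            / COUNT(DISTINCT r.run_id), 2)
    FROM runs r JOIN usage u ON u.run_id = r.run_id
""").fetchone()

# COUNT(*) denominator reports 25% failed; COUNT(DISTINCT run_id)
# reports the true per-run rate of 50%.
print(row)  # (25.0, 50.0)
```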

Suggested change
ROUND((COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS failure_rate_percent,
ROUND((COUNT(DISTINCT CASE WHEN result_state = 'SUCCEEDED' THEN run_id END) / COUNT(*)) * 100, 2) AS success_rate_percent,
ROUND((COUNT(DISTINCT CASE WHEN result_state != 'SUCCEEDED' THEN run_id END) / COUNT(DISTINCT run_id)) * 100, 2) AS failure_rate_percent,
ROUND((COUNT(DISTINCT CASE WHEN result_state = 'SUCCEEDED' THEN run_id END) / COUNT(DISTINCT run_id)) * 100, 2) AS success_rate_percent,

sashankkotta-db and others added 5 commits March 4, 2026 18:19
…_queries_run_parallely.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(four further commits applying review suggestions to the same file, each co-authored by Copilot)