Various fixes in get_table_details() #165
Merged
calreynolds merged 6 commits into databricks-solutions:main on Feb 26, 2026
Conversation
- Restore min/max parsing for date columns in the `_parse_stats_results` else branch, which was lost during the refactor from constructing new `ColumnDetail` objects to updating them in place.
- Remove the redundant `column_details[col_name]` assignment inside the stats loop that overwrote the identical entry already created in the pre-stats-loop `DESCRIBE TABLE` population step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
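The populate-then-update flow this commit restores can be sketched roughly like this (the function name and row shapes are assumptions based on the commit message, not the repo's actual code):

```python
def build_column_details(describe_rows, stats_rows):
    # Step 1 (pre-stats-loop): populate column_details from DESCRIBE TABLE output,
    # so the schema survives even if the stats loop never runs.
    column_details = {
        row["col_name"]: {"data_type": row["data_type"], "min": None, "max": None}
        for row in describe_rows
    }
    # Step 2 (stats loop): update the existing entries in place; re-assigning
    # column_details[col_name] here would be redundant, since the entry already exists.
    for row in stats_rows:
        detail = column_details[row["col_name"]]
        detail["min"] = row.get("min")
        detail["max"] = row.get("max")
    return column_details
```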
Move the try/except inside the loop so a single table failing `tables.get()` (e.g. a typo or missing permissions) doesn't abort the entire request. Falls back to minimal metadata (`comment=None`, `updated_at=None`) for that table while still proceeding with stats collection for the rest.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
calreynolds
previously approved these changes
Feb 26, 2026
Contributor
Author
I saw your commits, and it seems that when there is a failure on a table it will now silently return empty info. Don't you think this is risky? I would prefer an explicit error in the tool call, or at least an explicit indication in the output for that table that an error occurred, rather than silently returning empty info. WDYT @calreynolds?
Instead of silently falling back to empty metadata (`comment=None`, `updated_at=None`) and proceeding with stats collection, failed tables now appear in the result with a populated `error` field. This makes failures visible to callers while still returning results for the tables that succeeded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
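The per-table error capture described above might look roughly like this (the `TableResult` model and the shape of `client.tables.get` are illustrative assumptions, not the repo's actual code):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableResult:
    # Hypothetical stand-in for the real result model.
    name: str
    comment: Optional[str] = None
    error: Optional[str] = None


def fetch_table_metadata(client, full_names):
    """One failing table no longer aborts the batch; it surfaces via `error`."""
    results = []
    for full_name in full_names:
        try:
            meta = client.tables.get(full_name)
            results.append(TableResult(name=full_name, comment=meta.comment))
        except Exception as exc:
            # The failure stays visible to callers instead of silently
            # returning empty info for that table.
            results.append(TableResult(name=full_name, error=str(exc)))
    return results
```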
Collaborator
Good call 👍. Adjusted!
calreynolds
approved these changes
Feb 26, 2026
Various fixes in get_table_details()

I observed some inconsistent behavior when using `get_table_details()`, and this PR is my best attempt to fix them. BTW this repo is amazing, congrats to the team behind it!
Summary
This PR fixes six issues in `databricks_tools_core.sql.table_stats` that produced incorrect or incomplete results depending on table shape and requested stat level.

Tracker

1. Table comment always `None` on direct table lookup. When calling `get_table_details` with exact table names (no glob), the code built `tables_to_fetch` with `comment: None` hardcoded, never consulting the Unity Catalog API, so the `comment` field in the returned `DataSourceInfo` was always `null`. Now `client.tables.get(full_name)` is called to retrieve the actual metadata, including `comment` and `updated_at`. This also makes the cache key accurate: previously `updated_at` was always `None` for direct lookups, making cache invalidation impossible.

2. Empty tables return `None` or `{}` for `column_details`. The `_collect_stats_for_ref` method only built `column_details` inside the stats loop (Step 3+). If the table had zero rows and `union_queries` was empty, it returned `{}` instead of the actual column schema, losing column names and types entirely. `column_details` is now populated from `DESCRIBE TABLE` immediately after the describe step (before the stats loop). The early-return path `if not union_queries` returns the already-populated dict, so column names and types are always present regardless of row count.

3. `MAP`, `STRUCT`, `ARRAY`, or `VARIANT` columns crashed stats collection or triggered a silent `GROUP BY` error. The old code only checked for `"array"` and set `col_type = "array"`; columns with types like `MAP<STRING,INT>`, `STRUCT<name:STRING,score:DOUBLE>`, or `VARIANT` fell through to the categorical branch. That branch ran `approx_count_distinct()`, which Databricks SQL rejects for MAP/STRUCT (crashing the entire stats query), and which happens to succeed for VARIANT but then assigns a `unique_count < 30`, causing `_fetch_value_counts` to attempt `GROUP BY variant_col`, which Databricks rejects because VARIANT is not an orderable type (SQLSTATE: 42822). Although `_fetch_value_counts` has its own `try/except`, this still produced a silent internal error and incorrect stats for those columns. The code now checks for `"struct"`, `"map"`, and `"variant"` (via `"variant" in data_type`) and maps all four to `col_type = "complex"`. The `_build_column_stats_query` `complex` branch emits only `COUNT(*)` and `null_count`, with no aggregations that require orderable or comparable types. Neither `approx_count_distinct` nor `GROUP BY` is ever attempted on these types.

4. `TableStatLevel.NONE` returned `column_details: None`, losing schema (especially for Views). `get_table_info` only called `collect_column_stats` when `collect_stats=True`, so for `NONE` level `column_details` remained `None`. Callers had to parse the `ddl` string to guess the schema, which is impossible for Views (their DDL is just the `SELECT` query, lacking column types). `get_table_info` now calls a new private `_describe_columns()` helper (using `DESCRIBE TABLE`) when `collect_stats=False`. This guarantees accurate schema retrieval (names and types) for both tables and views with zero stats overhead. `remove_stats()` was also updated to preserve these basic details.

5. The `SELECT * LIMIT N` sample query had no error handling. For tables with complex nested types (MAP, STRUCT, VARIANT), Databricks SQL may execute the query correctly but the Python SDK can fail to deserialize the result into plain dicts. This raised an exception that propagated up and discarded everything already collected (column schema, row count, stats). The sample query is now wrapped in `try/except`: on failure, a warning is logged and the method returns with empty `sample_data`, preserving the schema, row count, and stats already collected. Note: the stats `UNION ALL` query itself is intentionally not wrapped; if stats collection fails, the error propagates to the caller. Sample data is genuinely lower-stakes (nice-to-have context); stats and row count are not.

6. `get_table_ddl`, `collect_column_stats`, and `_fetch_value_counts` composed table references as bare `catalog.schema.table`. Table names that are SQL reserved words (e.g. `select`, `order`) or that contain special characters would cause SQL parse errors. Each part is now backtick-quoted: `` `catalog`.`schema`.`table` ``.

Open Discussion: Column Comments in ColumnDetail

Currently, the `ColumnDetail` model does not have a `comment` attribute. Column comments are only exposed to the LLM via the raw `ddl` string (`SHOW CREATE TABLE`). While this saves tokens by avoiding duplication (the comment is printed once in the DDL string rather than repeated in the JSON structure), it forces the LLM to cross-reference the JSON schema with the raw SQL string to understand the semantic meaning of a column.

I intentionally left this as-is to respect the original design and avoid modifying the Pydantic models. However, if the team feels that structured access to column comments is valuable for agent performance, we could consider adding `comment: Optional[str] = None` to `ColumnDetail` and populating it from the `DESCRIBE TABLE` output.
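If structured column comments are adopted, the change could look roughly like this (using a dataclass as a stand-in for the real Pydantic model; every field except the proposed `comment` is an assumption):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnDetail:
    # Stand-in for the real Pydantic model in databricks_tools_core.
    name: str
    data_type: str
    comment: Optional[str] = None  # the proposed addition


def attach_comments(columns, describe_rows):
    """DESCRIBE TABLE emits rows of (col_name, data_type, comment);
    copy each comment onto the matching ColumnDetail, if any."""
    for row in describe_rows:
        detail = columns.get(row["col_name"])
        if detail is not None:
            # Normalize empty-string comments to None.
            detail.comment = row.get("comment") or None
```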
Reproduction Cases

Issue 1: Table comment always `None` on direct lookup

Design note: The original code only retrieved `comment` when listing all tables (via `client.tables.list()`), which populates the field from the SDK response. For direct lookups it short-circuited that step. Now a single `client.tables.get()` per table fills both `comment` and `updated_at`. This is a minor extra API call per table (not per column), and it also fixes cache invalidation for direct lookups: previously `updated_at=None` meant the cache key never changed, so stale entries were never detected.

Issue 2: Empty tables return `None`/`{}` for `column_details`

Design note: The previous code built `column_details` entries inside the iteration that also accumulated `union_queries`. If there were no rows, or the table had only complex-type columns (pre-fix 3), `union_queries` was empty and the method returned `{}` before any `column_details` were ever stored. The fix separates the "populate column schema" step (cheap, always runs) from the "build stats query" step (only runs if there are queryable columns and rows).

Issue 3: `MAP`, `STRUCT`, `ARRAY`, and `VARIANT` columns cause stats failures

Design note: All four types now route to `col_type = "complex"` via a single check (`is_array or is_struct_or_map`, where `is_struct_or_map` covers `struct`, `map`, and `variant`). Only `COUNT(*)` and `null_count` are emitted; both work on any type. The distinction between MAP/STRUCT (hard crash) and VARIANT (silent wrong result) is an implementation detail; the fix and the correct classification are the same for all four.

Issue 4: `TableStatLevel.NONE` returns `column_details: None` (fixes schema for Views)

Design note: `NONE` is intended for fast lookups. Previously, callers had to rely on the `ddl` field (`SHOW CREATE TABLE`) to infer the schema. This was a critical flaw for Views, because a view's DDL only contains the underlying `SELECT` query, completely hiding the column data types. By using `DESCRIBE TABLE` instead, we guarantee accurate schema retrieval for both tables and views with zero stats overhead.

Note on MCP Server Integration: To fully benefit from this lightweight mode, the MCP server implementation (`databricks_mcp_server/tools/sql.py`) was also updated to use `result.model_dump(exclude_none=True)`. This prevents the JSON output from being cluttered with dozens of `null` stat fields, keeping context window usage efficient.

Issues 5 & 6: Sample data graceful fallback and unquoted identifiers

Design note on sample data resilience: The `SELECT * LIMIT N` sample query is wrapped in `try/except`. For tables with deeply nested complex types, Databricks SQL may execute the query but the SDK can fail to deserialize the rows into plain Python dicts. Sample data is genuinely lower-stakes (it's display context), so a warning plus empty `sample_data` is the right trade-off here. The stats `UNION ALL` query is intentionally not wrapped: if stats collection fails, the error propagates to the caller. The schema, row count, and stats are the facts; samples are optional context. Silently degrading facts would be worse than a clear error.
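The identifier quoting from fix 6 can be sketched as a small helper (the name `quote_table_ref` is hypothetical; escaping an embedded backtick by doubling it follows Databricks SQL identifier rules):

```python
def quote_table_ref(catalog: str, schema: str, table: str) -> str:
    """Backtick-quote each part so reserved words (e.g. `select`, `order`)
    and special characters parse correctly in Databricks SQL."""
    def quote(part: str) -> str:
        # An embedded backtick inside a quoted identifier is escaped by doubling it.
        return "`" + part.replace("`", "``") + "`"
    return ".".join(quote(p) for p in (catalog, schema, table))

# quote_table_ref("main", "sales", "order") -> "`main`.`sales`.`order`"
```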