
Various fixes in get_table_details() #165

Merged
calreynolds merged 6 commits into databricks-solutions:main from MigQ2:get-table-details-fixes
Feb 26, 2026

Conversation


@MigQ2 MigQ2 commented Feb 22, 2026

Various fixes in get_table_details()

I observed some inconsistent behaviors when using get_table_details(), and this PR is my best attempt at fixing them.

BTW this repo is amazing, congrats to the team behind it

Summary

This PR fixes six issues in databricks_tools_core.sql.table_stats that produced incorrect or incomplete results depending on table shape and requested stat level.


Tracker

| # | Issue | Previous behavior | Current implementation |
|---|-------|-------------------|------------------------|
| 1 | Table comment was always `None` on direct table lookup | When calling `get_table_details` with exact table names (no glob), the code built `tables_to_fetch` with `comment: None` hardcoded, never consulting the Unity Catalog API. The `comment` field in the returned `DataSourceInfo` was always null. | For each exact table name, the SDK's `client.tables.get(full_name)` is called to retrieve the actual metadata, including `comment` and `updated_at`. This also makes the cache key accurate (previously `updated_at` was always `None` for direct lookups, making cache invalidation impossible). |
| 2 | Empty tables returned `None` or `{}` for `column_details` | The `_collect_stats_for_ref` method only built `column_details` inside the stats loop (Step 3+). If the table had zero rows and `union_queries` was empty, it returned `{}` instead of the actual column schema, losing column names and types entirely. | `column_details` is now populated from `DESCRIBE TABLE` immediately after the describe step (before the stats loop). The early-return path `if not union_queries` returns the already-populated dict, so column names and types are always present regardless of row count. |
| 3 | Tables with `MAP`, `STRUCT`, `ARRAY`, or `VARIANT` columns crashed stats collection or triggered a silent `GROUP BY` error | The type classifier only checked for `"array"` and set `col_type = "array"`. Columns with types like `MAP<STRING,INT>`, `STRUCT<name:STRING,score:DOUBLE>`, or `VARIANT` fell through to the categorical branch. That branch ran `approx_count_distinct()`, which Databricks SQL rejects for `MAP`/`STRUCT` (crashing the entire stats query), and which happens to succeed for `VARIANT` but then assigns a `unique_count < 30`, causing `_fetch_value_counts` to attempt `GROUP BY variant_col`, which Databricks rejects because `VARIANT` is not an orderable type (SQLSTATE: 42822). Although `_fetch_value_counts` has its own try/except, this still produced a silent internal error and incorrect stats for those columns. | The classifier now also catches `"struct"`, `"map"`, and `"variant"` (via `"variant" in data_type`) and maps all four to `col_type = "complex"`. The `_build_column_stats_query` complex branch emits only `COUNT(*)` and `null_count`: no aggregations that require orderable or comparable types. Neither `approx_count_distinct` nor `GROUP BY` is ever attempted on these types. |
| 4 | `TableStatLevel.NONE` returned `column_details: None`, losing schema (especially for Views) | `get_table_info` only called `collect_column_stats` when `collect_stats=True`. For `NONE` level, `column_details` remained `None`. Callers had to parse the `ddl` string to guess the schema, which is impossible for Views (since their DDL is just the SELECT query, lacking column types). | `get_table_info` now calls a new private `_describe_columns()` helper (using `DESCRIBE TABLE`) when `collect_stats=False`. This guarantees accurate schema retrieval (names and types) for both tables and views with zero stats overhead. `remove_stats()` was also updated to preserve these basic details. |
| 5 | Sample data query failure had no graceful fallback | The `SELECT * LIMIT N` sample query had no error handling. For tables with complex nested types (`MAP`, `STRUCT`, `VARIANT`), Databricks SQL may execute the query correctly but the Python SDK can fail to deserialize the result into plain dicts. This would raise an exception that propagated up and discarded everything already collected (column schema, row count, stats). | The sample data query is now wrapped in try/except. On failure, a warning is logged and the method returns with empty `sample_data`, preserving the schema, row count, and stats already collected. Note: the stats `UNION ALL` query itself is intentionally not wrapped; if stats collection fails, the error propagates to the caller. Sample data is genuinely lower-stakes (nice-to-have context); stats and row count are not. |
| 6 | Unquoted table names failed for reserved words or mixed case | `get_table_ddl`, `collect_column_stats`, and `_fetch_value_counts` composed table references as bare `catalog.schema.table`. Table names that are SQL reserved words (e.g. `select`, `order`) or that contain special characters would cause SQL parse errors. | All UC table references in SQL statements now use backtick quoting: `` `catalog`.`schema`.`table` ``. |

Open Discussion: Column Comments in ColumnDetail

Currently, the ColumnDetail model does not have a comment attribute. Column comments are only exposed to the LLM via the raw ddl string (SHOW CREATE TABLE).

While this saves tokens by avoiding duplication (the comment is only printed once in the DDL string rather than repeated in the JSON structure), it forces the LLM to cross-reference the JSON schema with the raw SQL string to understand the semantic meaning of a column.

I intentionally left this as-is to respect the original design and avoid modifying the Pydantic models. However, if the team feels that having structured access to column comments is valuable for agent performance, we could consider adding comment: Optional[str] = None to ColumnDetail and populate it from the DESCRIBE TABLE output.
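If the team did want to go that route, the change would be small: one new field plus a populate step from `DESCRIBE TABLE` output. A minimal sketch of the idea; the `ColumnDetail` shape below is an illustrative stand-in (a plain dataclass), not the repo's actual Pydantic model:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for the repo's ColumnDetail model.
@dataclass
class ColumnDetail:
    name: str
    data_type: str
    comment: Optional[str] = None  # proposed new field

def populate_from_describe(rows: list[tuple[str, str, Optional[str]]]) -> dict[str, ColumnDetail]:
    """DESCRIBE TABLE yields (col_name, data_type, comment) rows; map them to ColumnDetail."""
    return {
        name: ColumnDetail(name=name, data_type=dtype, comment=comment)
        for name, dtype, comment in rows
    }

details = populate_from_describe([
    ("user_id", "bigint", "Primary key"),
    ("status", "string", None),
])
print(details["user_id"].comment)  # Primary key
```

The token cost only materializes for columns that actually have comments, since `None` values can be excluded during serialization.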


Reproduction Cases

Issue 1 — Table comment always None on direct lookup

```python
# Setup
# CREATE TABLE mycatalog.myschema.mytable (id BIGINT)
# COMMENT ON TABLE mycatalog.myschema.mytable IS 'My important table'

from databricks_tools_core.sql.table_stats import get_table_details
from databricks_tools_core.sql.sql_utils.models import TableStatLevel

result = get_table_details(
    "mycatalog", "myschema",
    table_names=["mytable"],          # exact name → triggers direct lookup path
    table_stat_level=TableStatLevel.SIMPLE,
)

# BEFORE: result.tables[0].comment was always None
# AFTER:  result.tables[0].comment == "My important table"
print(result.tables[0].comment)
```

Design note: The original code only retrieved comment when listing all tables (via client.tables.list()), which populates the field from the SDK response. For direct lookups it short-circuited that step. Now a single client.tables.get() per table fills both comment and updated_at. This is a minor extra API call per table (not per column), and it also fixes cache invalidation for direct lookups: previously updated_at was always None on this path, so the cache key never changed and stale entries could not be detected.
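Combined with the later review feedback about surfacing per-table failures, the fetch loop can be sketched like this. The function name and the result-dict shape are illustrative, and the `client` is any object with the SDK's `tables.get` method:

```python
from typing import Any

def fetch_table_metadata(client: Any, full_names: list[str]) -> list[dict]:
    """Fetch UC metadata per table; one failing table doesn't abort the rest."""
    results = []
    for full_name in full_names:
        try:
            t = client.tables.get(full_name)  # one extra API call per table, not per column
            results.append({"name": full_name, "comment": t.comment,
                            "updated_at": t.updated_at, "error": None})
        except Exception as exc:
            # Surface the failure explicitly instead of silently returning empty info.
            results.append({"name": full_name, "comment": None,
                            "updated_at": None, "error": str(exc)})
    return results
```

The try/except sits inside the loop, so a typo or missing permission on one table still lets stats collection proceed for the others, while the failed table carries a populated `error` field.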


Issue 2 — Empty tables return None/{} for column_details

```python
# Setup
# CREATE TABLE mycatalog.myschema.mytable (
#   user_id BIGINT,
#   status  STRING,
#   score   DOUBLE
# )
# -- Table is empty (zero rows)

from databricks_tools_core.sql.table_stats import get_table_details
from databricks_tools_core.sql.sql_utils.models import TableStatLevel

result = get_table_details(
    "mycatalog", "myschema",
    table_names=["mytable"],
    table_stat_level=TableStatLevel.SIMPLE,
)

t = result.tables[0]

# BEFORE: t.column_details was {} (empty dict) — schema invisible to callers
# AFTER:
#   t.column_details == {
#     "user_id": ColumnDetail(name="user_id", data_type="bigint", ...),
#     "status":  ColumnDetail(name="status",  data_type="string",  ...),
#     "score":   ColumnDetail(name="score",   data_type="double",  ...),
#   }
#   t.total_rows == 0
print(list(t.column_details.keys()))   # ['user_id', 'status', 'score']
print(t.total_rows)                    # 0
```

Design note: The previous code built column_details entries inside the iteration that also accumulated union_queries. If there were no rows—or the table had only complex-type columns (pre-fix 3)—union_queries was empty and the method returned {} before any column_details were ever stored. The fix separates the "populate column schema" step (cheap, always runs) from the "build stats query" step (only runs if there are queryable columns and rows).
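The reordering can be sketched as follows; the function name, the `DESCRIBE` row shape, and the dummy per-column query are all illustrative, not the repo's exact code:

```python
def collect_stats_for_ref(describe_rows: list[tuple[str, str]], total_rows: int) -> dict:
    """Populate the column schema first, then decide whether stats queries are worth running."""
    # Step 1 (cheap, always runs): record names and types from DESCRIBE TABLE output.
    column_details = {name: {"name": name, "data_type": dtype}
                      for name, dtype in describe_rows}

    # Step 2 (conditional): build per-column stats queries only when there is data to query.
    union_queries = []
    if total_rows > 0:
        union_queries = [f"-- stats sub-query for {name}" for name, _ in describe_rows]

    # Early return keeps the already-populated schema instead of dropping it.
    if not union_queries:
        return column_details

    # ... run the UNION ALL stats query and merge results into column_details ...
    return column_details
```

Because step 1 no longer lives inside the stats loop, an empty table (or one whose columns are all complex types) still comes back with every column name and type.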


Issue 3 — MAP, STRUCT, ARRAY, and VARIANT columns cause stats failures

```python
# CREATE TABLE mycatalog.myschema.mytable (
#   id       BIGINT,
#   metadata MAP<STRING, INT>,
#   payload  STRUCT<name: STRING, score: DOUBLE>,
#   tags     ARRAY<STRING>,
#   label    STRING
# )
# INSERT INTO mycatalog.myschema.mytable VALUES
#   (1, map('a',1), struct('Alice',9.5), array('x','y'), 'row1')

from databricks_tools_core.sql.table_stats import get_table_details
from databricks_tools_core.sql.sql_utils.models import TableStatLevel

result = get_table_details(
    "mycatalog", "myschema",
    table_names=["mytable"],
    table_stat_level=TableStatLevel.SIMPLE,
)
t = result.tables[0]

# BEFORE (MAP/STRUCT): t.error was set — the entire UNION ALL failed because
#   approx_count_distinct(MAP<...>) is not supported. All columns lost stats.
#
# BEFORE (VARIANT, low cardinality — e.g. only 3 distinct JSON shapes):
#   approx_count_distinct(VARIANT) succeeds and returns 3. Because 3 < MAX_CATEGORICAL_VALUES,
#   _fetch_value_counts attempted GROUP BY variant_col → SQLSTATE 42822 (not orderable).
#   The try/except in _fetch_value_counts silently absorbed the error.
#   No visible failure, but unique_count and value_counts were wrong.
#
# AFTER (all four types):
#   metadata → ColumnDetail(data_type="map<string,int>",  unique_count=None, null_count=0)
#   payload  → ColumnDetail(data_type="struct<...>",       unique_count=None, null_count=0)
#   tags     → ColumnDetail(data_type="array<string>",     unique_count=None, null_count=0)
#   label    → ColumnDetail(data_type="string", unique_count=1, ...)  ← full stats
for name, col in t.column_details.items():
    print(name, col.data_type, col.total_count, col.unique_count)
```

Design note: All four types now route to col_type = "complex" via a single check (is_array or is_struct_or_map, where is_struct_or_map covers struct, map, and variant). Only COUNT(*) and null_count are emitted — aggregations that work on any type. The distinction between MAP/STRUCT (hard crash) and VARIANT (silent wrong result) is an implementation detail; the fix and the correct classification are the same for all.
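A condensed version of that classification; the helper name is mine and the numeric/date branches are simplified, but the complex-type check mirrors the substrings the fix looks for:

```python
def classify_column_type(data_type: str) -> str:
    """Map a Databricks SQL type string to a stats-collection category."""
    dt = data_type.lower()
    if dt.startswith(("tinyint", "smallint", "int", "bigint", "float", "double", "decimal")):
        return "numeric"
    if dt.startswith(("date", "timestamp")):
        return "date"
    # MAP/STRUCT/ARRAY/VARIANT: only COUNT(*) and null_count are safe.
    # approx_count_distinct and GROUP BY are rejected (or silently wrong) on these.
    if dt.startswith(("array", "struct", "map")) or "variant" in dt:
        return "complex"
    return "categorical"

print(classify_column_type("MAP<STRING,INT>"))      # complex
print(classify_column_type("STRUCT<name:STRING>"))  # complex
print(classify_column_type("VARIANT"))              # complex
print(classify_column_type("STRING"))               # categorical
```

The key property is that no complex type can fall through to the categorical branch, which is what previously triggered `approx_count_distinct` on unsupported types.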


Issue 4 — TableStatLevel.NONE returns column_details: None (Fixes schema for Views)

```python
# Setup
# CREATE TABLE mycatalog.myschema.mytable (
#   order_id BIGINT,
#   customer STRING,
#   amount   DECIMAL(10,2)
# )

from databricks_tools_core.sql.table_stats import get_table_details
from databricks_tools_core.sql.sql_utils.models import TableStatLevel

result = get_table_details(
    "mycatalog", "myschema",
    table_names=["mytable"],
    table_stat_level=TableStatLevel.NONE,     # fast, no stats
)

t = result.tables[0]

# BEFORE:
#   t.column_details was None — completely invisible schema
#   t.ddl was present, but callers using column_details got nothing
#
# AFTER:
#   t.column_details == {
#     "order_id": ColumnDetail(name="order_id", data_type="bigint"),
#     "customer": ColumnDetail(name="customer", data_type="string"),
#     "amount":   ColumnDetail(name="amount",   data_type="decimal(10,2)"),
#   }
#   t.total_rows is None   ← no COUNT(*) issued
#   t.sample_data is None  ← no SELECT issued
print(list(t.column_details.keys()))    # ['order_id', 'customer', 'amount']
print(t.total_rows)                     # None — no stats queries run

Design note: NONE is intended for fast lookups. Previously, callers had to rely on the ddl field (SHOW CREATE TABLE) to infer the schema. This was a critical flaw for Views, because a view's DDL only contains the underlying SELECT query, completely hiding the column data types. By using DESCRIBE TABLE instead, we guarantee accurate schema retrieval for both tables and views with zero stats overhead.

Note on MCP Server Integration: To fully benefit from this lightweight mode, the MCP server implementation (databricks_mcp_server/tools/sql.py) was also updated to use result.model_dump(exclude_none=True). This prevents the JSON output from being cluttered with dozens of null stat fields, keeping the context window usage efficient.
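The effect of `exclude_none=True` can be approximated in plain Python: drop `None`-valued keys before serializing. This is a sketch of the idea (shallow, top-level only), not the server's actual code:

```python
import json

def dump_without_nones(record: dict) -> str:
    """Serialize a result dict, omitting top-level keys whose value is None,
    similar in spirit to Pydantic's model_dump(exclude_none=True)."""
    return json.dumps({k: v for k, v in record.items() if v is not None})

# Hypothetical NONE-level result: stat fields are all None.
table_info = {"name": "mytable", "column_details": {"id": "bigint"},
              "total_rows": None, "sample_data": None, "error": None}
print(dump_without_nones(table_info))
# {"name": "mytable", "column_details": {"id": "bigint"}}
```

For a NONE-level lookup, this turns a JSON payload full of null stat fields into just the name and schema, which is the point of the lightweight mode.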


Issues 5 & 6 — Sample data graceful fallback and unquoted identifiers

```python
# Issue 6: table with a reserved SQL keyword as name
# CREATE TABLE mycatalog.myschema.`select` (id BIGINT, `order` STRING)
# INSERT INTO mycatalog.myschema.`select` VALUES (1, 'first')

from databricks_tools_core.sql.table_stats import get_table_details
from databricks_tools_core.sql.sql_utils.models import TableStatLevel

# BEFORE (unquoted identifiers):
#   SHOW CREATE TABLE mycatalog.myschema.select  ← SQL parse error
#   result.tables[0].error was set, ddl was ""

# AFTER (backtick quoting):
#   SHOW CREATE TABLE `mycatalog`.`myschema`.`select`  ← succeeds
result = get_table_details(
    "mycatalog", "myschema",
    table_names=["select"],
    table_stat_level=TableStatLevel.SIMPLE,
)
print(result.tables[0].ddl)   # non-empty CREATE TABLE statement
print(result.tables[0].error) # None
```

Design note on sample data resilience: The SELECT * LIMIT N sample query is wrapped in try/except. For tables with deeply nested complex types, Databricks SQL may execute the query but the SDK can fail to deserialize the rows into plain Python dicts. Sample data is genuinely lower-stakes — it's display context — so a warning + empty sample_data is the right trade-off here. The stats UNION ALL query is intentionally not wrapped: if stats collection fails, the error propagates to the caller. The schema, row count, and stats are the facts; samples are optional context. Silently degrading facts would be worse than a clear error.
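The quoting fix amounts to a small helper applied wherever a table reference is interpolated into SQL. A sketch (the helper name is illustrative); per Databricks identifier rules, an embedded backtick is escaped by doubling it:

```python
def quote_table_ref(catalog: str, schema: str, table: str) -> str:
    """Build a fully backtick-quoted UC table reference."""
    def q(ident: str) -> str:
        # Double any embedded backtick, then wrap the identifier.
        return "`" + ident.replace("`", "``") + "`"
    return f"{q(catalog)}.{q(schema)}.{q(table)}"

print(quote_table_ref("mycatalog", "myschema", "select"))
# `mycatalog`.`myschema`.`select`
```

Centralizing this in one helper keeps `get_table_ddl`, `collect_column_stats`, and `_fetch_value_counts` from each hand-rolling the quoting.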

MigQ2 and others added 3 commits February 25, 2026 00:32
- Restore min/max parsing for date columns in _parse_stats_results else
  branch, which was lost during the refactor from constructing new
  ColumnDetail objects to updating them in-place.
- Remove redundant column_details[col_name] assignment inside the stats
  loop that overwrote the identical entry already created in the
  pre-stats-loop DESCRIBE TABLE population step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move try/except inside the loop so a single table failing tables.get()
(e.g. typo, missing permissions) doesn't abort the entire request. Falls
back to minimal metadata (comment=None, updated_at=None) for that table
while still proceeding with stats collection for the rest.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@calreynolds calreynolds self-requested a review February 26, 2026 15:02
calreynolds
calreynolds previously approved these changes Feb 26, 2026

MigQ2 commented Feb 26, 2026

I saw your commits, and it seems that when there is a failure in a table it will now silently return empty info.

Don't you think this is risky? I would probably prefer an explicit error in the tool call.

Or at least be explicit in the output of that table that an error occurred, rather than silently returning empty info.

WDYT @calreynolds ?

Instead of silently falling back to empty metadata (comment=None,
updated_at=None) and proceeding with stats collection, failed tables
now appear in the result with a populated error field. This makes
failures visible to callers while still returning results for the
tables that succeeded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@calreynolds (Collaborator)

> I saw your commits, and it seems that when there is a failure in a table it will now silently return empty info.
>
> Don't you think this is risky? I would probably prefer an explicit error in the tool call.
>
> Or at least be explicit in the output of that table that an error occurred, rather than silently returning empty info.
>
> WDYT @calreynolds ?

Good call 👍. Adjusted!

@calreynolds calreynolds merged commit f4d6b21 into databricks-solutions:main Feb 26, 2026
2 checks passed