
Various fixes in get_table_details() #165

Merged
calreynolds merged 6 commits into databricks-solutions:main from MigQ2:get-table-details-fixes
Feb 26, 2026

Conversation


@MigQ2 MigQ2 commented Feb 22, 2026

Various fixes in get_table_details()

I observed some inconsistent behaviors when using get_table_details(), and this PR is my best attempt at fixing them.

BTW this repo is amazing, congrats to the team behind it

Summary

This PR fixes six issues in databricks_tools_core.sql.table_stats that produced incorrect or incomplete results depending on table shape and requested stat level.


Tracker

| # | Issue | Previous behavior | Current implementation |
|---|-------|-------------------|------------------------|
| 1 | Table comment was always `None` on direct table lookup | When calling `get_table_details` with exact table names (no glob), the code built `tables_to_fetch` with `comment: None` hardcoded, never consulting the Unity Catalog API. The `comment` field in the returned `DataSourceInfo` was always null. | For each exact table name, the SDK's `client.tables.get(full_name)` is called to retrieve the actual metadata, including `comment` and `updated_at`. This also makes the cache key accurate (previously `updated_at` was always `None` for direct lookups, making cache invalidation impossible). |
| 2 | Empty tables returned `None` or `{}` for `column_details` | The `_collect_stats_for_ref` method only built `column_details` inside the stats loop (Step 3+). If the table had zero rows and `union_queries` was empty, it returned `{}` instead of the actual column schema, losing column names and types entirely. | `column_details` is now populated from `DESCRIBE TABLE` immediately after the describe step (before the stats loop). The early-return path `if not union_queries` returns the already-populated dict, so column names and types are always present regardless of row count. |
| 3 | Tables with `MAP`, `STRUCT`, `ARRAY`, or `VARIANT` columns crashed stats collection or triggered a silent `GROUP BY` error | The type classifier only checked for `"array"` and set `col_type = "array"`. Columns with types like `MAP<STRING,INT>`, `STRUCT<name:STRING,score:DOUBLE>`, or `VARIANT` fell through to the categorical branch. That branch ran `approx_count_distinct()`, which Databricks SQL rejects for `MAP`/`STRUCT` (crashing the entire stats query), and which happens to succeed for `VARIANT` but then assigns a `unique_count < 30`, causing `_fetch_value_counts` to attempt `GROUP BY variant_col`, which Databricks rejects because `VARIANT` is not an orderable type (SQLSTATE: 42822). Although `_fetch_value_counts` has its own try/except, this still produced a silent internal error and incorrect stats for those columns. | The classifier now also catches `"struct"`, `"map"`, and `"variant"` (via `"variant" in data_type`) and maps all four to `col_type = "complex"`. The `_build_column_stats_query` complex branch emits only `COUNT(*)` and `null_count`: no aggregations that require orderable or comparable types. Neither `approx_count_distinct` nor `GROUP BY` is ever attempted on these types. |
| 4 | `TableStatLevel.NONE` returned `column_details: None`, losing schema (especially for Views) | `get_table_info` only called `collect_column_stats` when `collect_stats=True`. For `NONE` level, `column_details` remained `None`. Callers had to parse the `ddl` string to guess the schema, which is impossible for Views (since their DDL is just the SELECT query, lacking column types). | `get_table_info` now calls a new private `_describe_columns()` helper (using `DESCRIBE TABLE`) when `collect_stats=False`. This guarantees accurate schema retrieval (names and types) for both tables and views with zero stats overhead. `remove_stats()` was also updated to preserve these basic details. |
| 5 | Sample data query failure had no graceful fallback | The `SELECT * LIMIT N` sample query had no error handling. For tables with complex nested types (`MAP`, `STRUCT`, `VARIANT`), Databricks SQL may execute the query correctly but the Python SDK can fail to deserialize the result into plain dicts. This would raise an exception that propagated up and discarded everything already collected (column schema, row count, stats). | The sample data query is now wrapped in try/except. On failure, a warning is logged and the method returns with empty `sample_data`, preserving the schema, row count, and stats already collected. Note: the stats `UNION ALL` query itself is intentionally not wrapped; if stats collection fails, the error propagates to the caller. Sample data is genuinely lower-stakes (nice-to-have context); stats and row count are not. |
| 6 | Unquoted table names failed for reserved words or mixed case | `get_table_ddl`, `collect_column_stats`, and `_fetch_value_counts` composed table references as bare `catalog.schema.table`. Table names that are SQL reserved words (e.g. `select`, `order`) or that contain special characters would cause SQL parse errors. | All UC table references in SQL statements now use backtick quoting: `` `catalog`.`schema`.`table` ``. |

Open Discussion: Column Comments in ColumnDetail

Currently, the ColumnDetail model does not have a comment attribute. Column comments are only exposed to the LLM via the raw ddl string (SHOW CREATE TABLE).

While this saves tokens by avoiding duplication (the comment is only printed once in the DDL string rather than repeated in the JSON structure), it forces the LLM to cross-reference the JSON schema with the raw SQL string to understand the semantic meaning of a column.

I intentionally left this as-is to respect the original design and avoid modifying the Pydantic models. However, if the team feels that having structured access to column comments is valuable for agent performance, we could consider adding comment: Optional[str] = None to ColumnDetail and populate it from the DESCRIBE TABLE output.
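If the team did want to go that route, the change would be small: one new field plus a populate step from `DESCRIBE TABLE` output. A minimal sketch of the idea; the `ColumnDetail` shape below is an illustrative stand-in (a plain dataclass), not the repo's actual Pydantic model:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for the repo's ColumnDetail model.
@dataclass
class ColumnDetail:
    name: str
    data_type: str
    comment: Optional[str] = None  # proposed new field

def populate_from_describe(rows: list[tuple[str, str, Optional[str]]]) -> dict[str, ColumnDetail]:
    """DESCRIBE TABLE yields (col_name, data_type, comment) rows; map them to ColumnDetail."""
    return {
        name: ColumnDetail(name=name, data_type=dtype, comment=comment)
        for name, dtype, comment in rows
    }

details = populate_from_describe([
    ("user_id", "bigint", "Primary key"),
    ("status", "string", None),
])
print(details["user_id"].comment)  # Primary key
```

The token cost only materializes for columns that actually have comments, since `None` values can be excluded during serialization.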


Reproduction Cases

Issue 1 — Table comment always None on direct lookup

```python
# Setup
# CREATE TABLE mycatalog.myschema.mytable (id BIGINT)
# COMMENT ON TABLE mycatalog.myschema.mytable IS 'My important table'

from databricks_tools_core.sql.table_stats import get_table_details
from databricks_tools_core.sql.sql_utils.models import TableStatLevel

result = get_table_details(
    "mycatalog", "myschema",
    table_names=["mytable"],          # exact name → triggers direct lookup path
    table_stat_level=TableStatLevel.SIMPLE,
)

# BEFORE: result.tables[0].comment was always None
# AFTER:  result.tables[0].comment == "My important table"
print(result.tables[0].comment)
```

Design note: The original code only retrieved comment when listing all tables (via client.tables.list()), which populates the field from the SDK response. For direct lookups it short-circuited that step. Now a single client.tables.get() per table fills both comment and updated_at. This is a minor extra API call per table (not per column), and it also fixes cache invalidation for direct lookups: previously updated_at was always None on this path, so the cache key never changed and stale entries could not be detected.
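Combined with the later review feedback about surfacing per-table failures, the fetch loop can be sketched like this. The function name and the result-dict shape are illustrative, and the `client` is any object with the SDK's `tables.get` method:

```python
from typing import Any

def fetch_table_metadata(client: Any, full_names: list[str]) -> list[dict]:
    """Fetch UC metadata per table; one failing table doesn't abort the rest."""
    results = []
    for full_name in full_names:
        try:
            t = client.tables.get(full_name)  # one extra API call per table, not per column
            results.append({"name": full_name, "comment": t.comment,
                            "updated_at": t.updated_at, "error": None})
        except Exception as exc:
            # Surface the failure explicitly instead of silently returning empty info.
            results.append({"name": full_name, "comment": None,
                            "updated_at": None, "error": str(exc)})
    return results
```

The try/except sits inside the loop, so a typo or missing permission on one table still lets stats collection proceed for the others, while the failed table carries a populated `error` field.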


Issue 2 — Empty tables return None/{} for column_details

```python
# Setup
# CREATE TABLE mycatalog.myschema.mytable (
#   user_id BIGINT,
#   status  STRING,
#   score   DOUBLE
# )
# -- Table is empty (zero rows)

from databricks_tools_core.sql.table_stats import get_table_details
from databricks_tools_core.sql.sql_utils.models import TableStatLevel

result = get_table_details(
    "mycatalog", "myschema",
    table_names=["mytable"],
    table_stat_level=TableStatLevel.SIMPLE,
)

t = result.tables[0]

# BEFORE: t.column_details was {} (empty dict) — schema invisible to callers
# AFTER:
#   t.column_details == {
#     "user_id": ColumnDetail(name="user_id", data_type="bigint", ...),
#     "status":  ColumnDetail(name="status",  data_type="string",  ...),
#     "score":   ColumnDetail(name="score",   data_type="double",  ...),
#   }
#   t.total_rows == 0
print(list(t.column_details.keys()))   # ['user_id', 'status', 'score']
print(t.total_rows)                    # 0
```

Design note: The previous code built column_details entries inside the iteration that also accumulated union_queries. If there were no rows—or the table had only complex-type columns (pre-fix 3)—union_queries was empty and the method returned {} before any column_details were ever stored. The fix separates the "populate column schema" step (cheap, always runs) from the "build stats query" step (only runs if there are queryable columns and rows).
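The reordering can be sketched as follows; the function name, the `DESCRIBE` row shape, and the dummy per-column query are all illustrative, not the repo's exact code:

```python
def collect_stats_for_ref(describe_rows: list[tuple[str, str]], total_rows: int) -> dict:
    """Populate the column schema first, then decide whether stats queries are worth running."""
    # Step 1 (cheap, always runs): record names and types from DESCRIBE TABLE output.
    column_details = {name: {"name": name, "data_type": dtype}
                      for name, dtype in describe_rows}

    # Step 2 (conditional): build per-column stats queries only when there is data to query.
    union_queries = []
    if total_rows > 0:
        union_queries = [f"-- stats sub-query for {name}" for name, _ in describe_rows]

    # Early return keeps the already-populated schema instead of dropping it.
    if not union_queries:
        return column_details

    # ... run the UNION ALL stats query and merge results into column_details ...
    return column_details
```

Because step 1 no longer lives inside the stats loop, an empty table (or one whose columns are all complex types) still comes back with every column name and type.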


Issue 3 — MAP, STRUCT, ARRAY, and VARIANT columns cause stats failures

```python
# CREATE TABLE mycatalog.myschema.mytable (
#   id       BIGINT,
#   metadata MAP<STRING, INT>,
#   payload  STRUCT<name: STRING, score: DOUBLE>,
#   tags     ARRAY<STRING>,
#   label    STRING
# )
# INSERT INTO mycatalog.myschema.mytable VALUES
#   (1, map('a',1), struct('Alice',9.5), array('x','y'), 'row1')

from databricks_tools_core.sql.table_stats import get_table_details
from databricks_tools_core.sql.sql_utils.models import TableStatLevel

result = get_table_details(
    "mycatalog", "myschema",
    table_names=["mytable"],
    table_stat_level=TableStatLevel.SIMPLE,
)
t = result.tables[0]

# BEFORE (MAP/STRUCT): t.error was set — the entire UNION ALL failed because
#   approx_count_distinct(MAP<...>) is not supported. All columns lost stats.
#
# BEFORE (VARIANT, low cardinality — e.g. only 3 distinct JSON shapes):
#   approx_count_distinct(VARIANT) succeeds and returns 3. Because 3 < MAX_CATEGORICAL_VALUES,
#   _fetch_value_counts attempted GROUP BY variant_col → SQLSTATE 42822 (not orderable).
#   The try/except in _fetch_value_counts silently absorbed the error.
#   No visible failure, but unique_count and value_counts were wrong.
#
# AFTER (all four types):
#   metadata → ColumnDetail(data_type="map<string,int>",  unique_count=None, null_count=0)
#   payload  → ColumnDetail(data_type="struct<...>",       unique_count=None, null_count=0)
#   tags     → ColumnDetail(data_type="array<string>",     unique_count=None, null_count=0)
#   label    → ColumnDetail(data_type="string", unique_count=1, ...)  ← full stats
for name, col in t.column_details.items():
    print(name, col.data_type, col.total_count, col.unique_count)
```

Design note: All four types now route to col_type = "complex" via a single check (is_array or is_struct_or_map, where is_struct_or_map covers struct, map, and variant). Only COUNT(*) and null_count are emitted — aggregations that work on any type. The distinction between MAP/STRUCT (hard crash) and VARIANT (silent wrong result) is an implementation detail; the fix and the correct classification are the same for all.
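A condensed version of that classification; the helper name is mine and the numeric/date branches are simplified, but the complex-type check mirrors the substrings the fix looks for:

```python
def classify_column_type(data_type: str) -> str:
    """Map a Databricks SQL type string to a stats-collection category."""
    dt = data_type.lower()
    if dt.startswith(("tinyint", "smallint", "int", "bigint", "float", "double", "decimal")):
        return "numeric"
    if dt.startswith(("date", "timestamp")):
        return "date"
    # MAP/STRUCT/ARRAY/VARIANT: only COUNT(*) and null_count are safe.
    # approx_count_distinct and GROUP BY are rejected (or silently wrong) on these.
    if dt.startswith(("array", "struct", "map")) or "variant" in dt:
        return "complex"
    return "categorical"

print(classify_column_type("MAP<STRING,INT>"))      # complex
print(classify_column_type("STRUCT<name:STRING>"))  # complex
print(classify_column_type("VARIANT"))              # complex
print(classify_column_type("STRING"))               # categorical
```

The key property is that no complex type can fall through to the categorical branch, which is what previously triggered `approx_count_distinct` on unsupported types.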


Issue 4 — TableStatLevel.NONE returns column_details: None (Fixes schema for Views)

```python
# Setup
# CREATE TABLE mycatalog.myschema.mytable (
#   order_id BIGINT,
#   customer STRING,
#   amount   DECIMAL(10,2)
# )

from databricks_tools_core.sql.table_stats import get_table_details
from databricks_tools_core.sql.sql_utils.models import TableStatLevel

result = get_table_details(
    "mycatalog", "myschema",
    table_names=["mytable"],
    table_stat_level=TableStatLevel.NONE,     # fast, no stats
)

t = result.tables[0]

# BEFORE:
#   t.column_details was None — completely invisible schema
#   t.ddl was present, but callers using column_details got nothing
#
# AFTER:
#   t.column_details == {
#     "order_id": ColumnDetail(name="order_id", data_type="bigint"),
#     "customer": ColumnDetail(name="customer", data_type="string"),
#     "amount":   ColumnDetail(name="amount",   data_type="decimal(10,2)"),
#   }
#   t.total_rows is None   ← no COUNT(*) issued
#   t.sample_data is None  ← no SELECT issued
print(list(t.column_details.keys()))    # ['order_id', 'customer', 'amount']
print(t.total_rows)                     # None — no stats queries run

Design note: NONE is intended for fast lookups. Previously, callers had to rely on the ddl field (SHOW CREATE TABLE) to infer the schema. This was a critical flaw for Views, because a view's DDL only contains the underlying SELECT query, completely hiding the column data types. By using DESCRIBE TABLE instead, we guarantee accurate schema retrieval for both tables and views with zero stats overhead.

Note on MCP Server Integration: To fully benefit from this lightweight mode, the MCP server implementation (databricks_mcp_server/tools/sql.py) was also updated to use result.model_dump(exclude_none=True). This prevents the JSON output from being cluttered with dozens of null stat fields, keeping the context window usage efficient.
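The effect of `exclude_none=True` can be approximated in plain Python: drop `None`-valued keys before serializing. This is a sketch of the idea (shallow, top-level only), not the server's actual code:

```python
import json

def dump_without_nones(record: dict) -> str:
    """Serialize a result dict, omitting top-level keys whose value is None,
    similar in spirit to Pydantic's model_dump(exclude_none=True)."""
    return json.dumps({k: v for k, v in record.items() if v is not None})

# Hypothetical NONE-level result: stat fields are all None.
table_info = {"name": "mytable", "column_details": {"id": "bigint"},
              "total_rows": None, "sample_data": None, "error": None}
print(dump_without_nones(table_info))
# {"name": "mytable", "column_details": {"id": "bigint"}}
```

For a NONE-level lookup, this turns a JSON payload full of null stat fields into just the name and schema, which is the point of the lightweight mode.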


Issues 5 & 6 — Sample data graceful fallback and unquoted identifiers

```python
# Issue 6: table with a reserved SQL keyword as name
# CREATE TABLE mycatalog.myschema.`select` (id BIGINT, `order` STRING)
# INSERT INTO mycatalog.myschema.`select` VALUES (1, 'first')

from databricks_tools_core.sql.table_stats import get_table_details
from databricks_tools_core.sql.sql_utils.models import TableStatLevel

# BEFORE (unquoted identifiers):
#   SHOW CREATE TABLE mycatalog.myschema.select  ← SQL parse error
#   result.tables[0].error was set, ddl was ""

# AFTER (backtick quoting):
#   SHOW CREATE TABLE `mycatalog`.`myschema`.`select`  ← succeeds
result = get_table_details(
    "mycatalog", "myschema",
    table_names=["select"],
    table_stat_level=TableStatLevel.SIMPLE,
)
print(result.tables[0].ddl)   # non-empty CREATE TABLE statement
print(result.tables[0].error) # None
```

Design note on sample data resilience: The SELECT * LIMIT N sample query is wrapped in try/except. For tables with deeply nested complex types, Databricks SQL may execute the query but the SDK can fail to deserialize the rows into plain Python dicts. Sample data is genuinely lower-stakes — it's display context — so a warning + empty sample_data is the right trade-off here. The stats UNION ALL query is intentionally not wrapped: if stats collection fails, the error propagates to the caller. The schema, row count, and stats are the facts; samples are optional context. Silently degrading facts would be worse than a clear error.
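The quoting fix amounts to a small helper applied wherever a table reference is interpolated into SQL. A sketch (the helper name is illustrative); per Databricks identifier rules, an embedded backtick is escaped by doubling it:

```python
def quote_table_ref(catalog: str, schema: str, table: str) -> str:
    """Build a fully backtick-quoted UC table reference."""
    def q(ident: str) -> str:
        # Double any embedded backtick, then wrap the identifier.
        return "`" + ident.replace("`", "``") + "`"
    return f"{q(catalog)}.{q(schema)}.{q(table)}"

print(quote_table_ref("mycatalog", "myschema", "select"))
# `mycatalog`.`myschema`.`select`
```

Centralizing this in one helper keeps `get_table_ddl`, `collect_column_stats`, and `_fetch_value_counts` from each hand-rolling the quoting.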

MigQ2 and others added 3 commits February 25, 2026 00:32
- Restore min/max parsing for date columns in _parse_stats_results else
  branch, which was lost during the refactor from constructing new
  ColumnDetail objects to updating them in-place.
- Remove redundant column_details[col_name] assignment inside the stats
  loop that overwrote the identical entry already created in the
  pre-stats-loop DESCRIBE TABLE population step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move try/except inside the loop so a single table failing tables.get()
(e.g. typo, missing permissions) doesn't abort the entire request. Falls
back to minimal metadata (comment=None, updated_at=None) for that table
while still proceeding with stats collection for the rest.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@calreynolds calreynolds self-requested a review February 26, 2026 15:02
calreynolds
calreynolds previously approved these changes Feb 26, 2026

MigQ2 commented Feb 26, 2026

I saw your commits, and it seems that when there is a failure in a table it will now silently return empty info.

Don't you think this is risky? I would probably prefer an explicit error in the tool call.

Or at least be explicit in the output of that table that an error occurred, rather than silently returning empty info.

WDYT @calreynolds ?

Instead of silently falling back to empty metadata (comment=None,
updated_at=None) and proceeding with stats collection, failed tables
now appear in the result with a populated error field. This makes
failures visible to callers while still returning results for the
tables that succeeded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@calreynolds (Collaborator)

> I saw your commits, and it seems that when there is a failure in a table it will now silently return empty info.
>
> Don't you think this is risky? I would probably prefer an explicit error in the tool call.
>
> Or at least be explicit in the output of that table that an error occurred, rather than silently returning empty info.
>
> WDYT @calreynolds ?

Good call 👍. Adjusted!

@calreynolds calreynolds merged commit f4d6b21 into databricks-solutions:main Feb 26, 2026
2 checks passed