Skip to content

DO NOT MERGE#79

Open
cmatKhan wants to merge 2 commits intoBrentLab:devfrom
cmatKhan:dealing_with_nonembedded
Open

DO NOT MERGE#79
cmatKhan wants to merge 2 commits intoBrentLab:devfrom
cmatKhan:dealing_with_nonembedded

Conversation

@cmatKhan
Copy link
Member

this is here to help get Kaiwen going again on the DTO analysis

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends VirtualDB/DataCard support for (1) datasets whose sample identifier column isn’t literally sample_id, and (2) datasets whose metadata is stored in a separate HuggingFace config via applies_to, by deriving a dataset schema from the DataCard YAML and adjusting view registration accordingly.

Changes:

  • Add DataCard helpers (get_metadata_fields, get_data_col_names, get_metadata_config_name, get_dataset_schema) and a DatasetSchema dataclass to summarize embedded vs external metadata.
  • Update VirtualDB view registration to JOIN external metadata parquet views and to JOIN/enrich using a configurable sample identifier column.
  • Add/extend unit tests and update documentation/tutorial artifacts to reflect the new configuration and behaviors.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tfbpapi/virtual_db.py Implements external-metadata parquet views, configurable sample-id joins, and revised meta view construction.
tfbpapi/datacard.py Adds DatasetSchema + DataCard APIs to derive schema/metadata fields from DataCard YAML (including applies_to).
tfbpapi/models.py Adds MetadataConfig.get_sample_id_field() used by VirtualDB to resolve the sample identifier column.
tfbpapi/tests/test_virtual_db.py Updates mocks and adds tests for dynamic sample-id and external metadata JOIN behavior.
tfbpapi/tests/test_datacard.py Adds test coverage for new DataCard schema/metadata helper APIs.
docs/virtual_db_configuration.md Updates configuration docs to clarify sample identifier column semantics.
docs/tutorials/virtual_db_tutorial.ipynb Regenerates tutorial outputs reflecting new behavior/logging and column ordering.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +734 to +755
# Add derived property expressions from the VirtualDB config
prop_result = self._resolve_property_columns(repo_id, config_name)
if prop_result is not None:
derived_exprs, prop_raw_cols = prop_result
# Ensure source columns needed by expressions are selected
for col in prop_raw_cols:
add_col(col)
# Qualify source column references inside CASE WHEN expressions
if is_join:
qualified_exprs = []
for expr in derived_exprs:
for raw_col in prop_raw_cols:
q = qualify(raw_col)
if q != raw_col:
# Replace bare column name in CASE WHEN patterns
expr = expr.replace(
f"CASE {raw_col} ", f"CASE {q} "
).replace(f" {raw_col} = ", f" {q} = ")
qualified_exprs.append(expr)
derived_exprs = qualified_exprs
select_parts.extend(derived_exprs)

Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JOIN mode, derived property expressions are appended as raw SQL strings. Only certain CASE patterns get column qualification, but simple alias expressions like "<raw_col> AS <alias>" (and arbitrary expression mappings) remain unqualified, which can cause DuckDB binder errors if the referenced column exists in both joined relations (or silently select the wrong side). Consider qualifying column references for all derived expressions in JOIN context (e.g., generate qualified expressions during _resolve_property_columns, or apply a safer identifier-aware rewrite).

Copilot uses AI. Check for mistakes.
Comment on lines +694 to +703
# Use the configured sample_id column as the join key.
# The DataCard feature intersection (schema.join_columns)
# is unreliable because a data config's feature list may
# document columns that are physically only in the metadata
# parquet (present conceptually after a join, not in the
# physical data parquet file).
from_clause = (
f"{parquet_view} d " f"JOIN {ext_meta_view} m " f"USING ({sample_col})"
)
is_join = True
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample_col (from YAML config) is interpolated directly into SQL identifiers (USING (...), m.<col>, etc.) without identifier validation/quoting. This can break if the configured column name is a reserved word or contains characters requiring quoting, and it also expands the SQL-injection surface if configs are not fully trusted. Consider validating sample_id.field against the same SQL-identifier regex as db_name, and/or consistently quoting identifiers when constructing SQL.

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a bad idea, need to think about it more

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments