Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends VirtualDB/DataCard support for (1) datasets whose sample identifier column isn’t literally sample_id, and (2) datasets whose metadata is stored in a separate HuggingFace config via applies_to, by deriving a dataset schema from the DataCard YAML and adjusting view registration accordingly.
Changes:
- Add
DataCardhelpers (get_metadata_fields,get_data_col_names,get_metadata_config_name,get_dataset_schema) and aDatasetSchemadataclass to summarize embedded vs external metadata. - Update
VirtualDBview registration to JOIN external metadata parquet views and to JOIN/enrich using a configurable sample identifier column. - Add/extend unit tests and update documentation/tutorial artifacts to reflect the new configuration and behaviors.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
tfbpapi/virtual_db.py |
Implements external-metadata parquet views, configurable sample-id joins, and revised meta view construction. |
tfbpapi/datacard.py |
Adds DatasetSchema + DataCard APIs to derive schema/metadata fields from DataCard YAML (including applies_to). |
tfbpapi/models.py |
Adds MetadataConfig.get_sample_id_field() used by VirtualDB to resolve the sample identifier column. |
tfbpapi/tests/test_virtual_db.py |
Updates mocks and adds tests for dynamic sample-id and external metadata JOIN behavior. |
tfbpapi/tests/test_datacard.py |
Adds test coverage for new DataCard schema/metadata helper APIs. |
docs/virtual_db_configuration.md |
Updates configuration docs to clarify sample identifier column semantics. |
docs/tutorials/virtual_db_tutorial.ipynb |
Regenerates tutorial outputs reflecting new behavior/logging and column ordering. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Add derived property expressions from the VirtualDB config | ||
| prop_result = self._resolve_property_columns(repo_id, config_name) | ||
| if prop_result is not None: | ||
| derived_exprs, prop_raw_cols = prop_result | ||
| # Ensure source columns needed by expressions are selected | ||
| for col in prop_raw_cols: | ||
| add_col(col) | ||
| # Qualify source column references inside CASE WHEN expressions | ||
| if is_join: | ||
| qualified_exprs = [] | ||
| for expr in derived_exprs: | ||
| for raw_col in prop_raw_cols: | ||
| q = qualify(raw_col) | ||
| if q != raw_col: | ||
| # Replace bare column name in CASE WHEN patterns | ||
| expr = expr.replace( | ||
| f"CASE {raw_col} ", f"CASE {q} " | ||
| ).replace(f" {raw_col} = ", f" {q} = ") | ||
| qualified_exprs.append(expr) | ||
| derived_exprs = qualified_exprs | ||
| select_parts.extend(derived_exprs) | ||
|
|
There was a problem hiding this comment.
In JOIN mode, derived property expressions are appended as raw SQL strings. Only certain CASE patterns get column qualification, but simple alias expressions like "<raw_col> AS <alias>" (and arbitrary expression mappings) remain unqualified, which can cause DuckDB binder errors if the referenced column exists in both joined relations (or silently select the wrong side). Consider qualifying column references for all derived expressions in JOIN context (e.g., generate qualified expressions during _resolve_property_columns, or apply a safer identifier-aware rewrite).
| # Use the configured sample_id column as the join key. | ||
| # The DataCard feature intersection (schema.join_columns) | ||
| # is unreliable because a data config's feature list may | ||
| # document columns that are physically only in the metadata | ||
| # parquet (present conceptually after a join, not in the | ||
| # physical data parquet file). | ||
| from_clause = ( | ||
| f"{parquet_view} d " f"JOIN {ext_meta_view} m " f"USING ({sample_col})" | ||
| ) | ||
| is_join = True |
There was a problem hiding this comment.
sample_col (from YAML config) is interpolated directly into SQL identifiers (USING (...), m.<col>, etc.) without identifier validation/quoting. This can break if the configured column name is a reserved word or contains characters requiring quoting, and it also expands the SQL-injection surface if configs are not fully trusted. Consider validating sample_id.field against the same SQL-identifier regex as db_name, and/or consistently quoting identifiers when constructing SQL.
There was a problem hiding this comment.
not a bad idea, need to think about it more
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
this is here to help get Kaiwen going again on the DTO analysis