DO NOT MERGE by cmatKhan · Pull Request #79 · BrentLab/tfbpapi

cmatKhan · 2026-02-18T19:54:49Z

this is here to help get Kaiwen going again on the DTO analysis

Copilot

Pull request overview

This PR extends VirtualDB/DataCard support for (1) datasets whose sample identifier column isn’t literally sample_id, and (2) datasets whose metadata is stored in a separate HuggingFace config via applies_to, by deriving a dataset schema from the DataCard YAML and adjusting view registration accordingly.

Changes:

Add DataCard helpers (get_metadata_fields, get_data_col_names, get_metadata_config_name, get_dataset_schema) and a DatasetSchema dataclass to summarize embedded vs external metadata.
Update VirtualDB view registration to JOIN external metadata parquet views and to JOIN/enrich using a configurable sample identifier column.
Add/extend unit tests and update documentation/tutorial artifacts to reflect the new configuration and behaviors.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`tfbpapi/virtual_db.py`	Implements external-metadata parquet views, configurable sample-id joins, and revised meta view construction.
`tfbpapi/datacard.py`	Adds `DatasetSchema` + DataCard APIs to derive schema/metadata fields from DataCard YAML (including `applies_to`).
`tfbpapi/models.py`	Adds `MetadataConfig.get_sample_id_field()` used by VirtualDB to resolve the sample identifier column.
`tfbpapi/tests/test_virtual_db.py`	Updates mocks and adds tests for dynamic sample-id and external metadata JOIN behavior.
`tfbpapi/tests/test_datacard.py`	Adds test coverage for new DataCard schema/metadata helper APIs.
`docs/virtual_db_configuration.md`	Updates configuration docs to clarify sample identifier column semantics.
`docs/tutorials/virtual_db_tutorial.ipynb`	Regenerates tutorial outputs reflecting new behavior/logging and column ordering.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tfbpapi/datacard.py

tfbpapi/virtual_db.py

Copilot · 2026-02-18T19:59:26Z

tfbpapi/virtual_db.py

+        # Add derived property expressions from the VirtualDB config
+        prop_result = self._resolve_property_columns(repo_id, config_name)
+        if prop_result is not None:
+            derived_exprs, prop_raw_cols = prop_result
+            # Ensure source columns needed by expressions are selected
+            for col in prop_raw_cols:
+                add_col(col)
+            # Qualify source column references inside CASE WHEN expressions
+            if is_join:
+                qualified_exprs = []
+                for expr in derived_exprs:
+                    for raw_col in prop_raw_cols:
+                        q = qualify(raw_col)
+                        if q != raw_col:
+                            # Replace bare column name in CASE WHEN patterns
+                            expr = expr.replace(
+                                f"CASE {raw_col} ", f"CASE {q} "
+                            ).replace(f" {raw_col} = ", f" {q} = ")
+                    qualified_exprs.append(expr)
+                derived_exprs = qualified_exprs
+            select_parts.extend(derived_exprs)
+


In JOIN mode, derived property expressions are appended as raw SQL strings. Only certain CASE patterns get column qualification, but simple alias expressions like "<raw_col> AS <alias>" (and arbitrary expression mappings) remain unqualified, which can cause DuckDB binder errors if the referenced column exists in both joined relations (or silently select the wrong side). Consider qualifying column references for all derived expressions in JOIN context (e.g., generate qualified expressions during _resolve_property_columns, or apply a safer identifier-aware rewrite).

Copilot · 2026-02-18T19:59:26Z

tfbpapi/virtual_db.py

+            # Use the configured sample_id column as the join key.
+            # The DataCard feature intersection (schema.join_columns)
+            # is unreliable because a data config's feature list may
+            # document columns that are physically only in the metadata
+            # parquet (present conceptually after a join, not in the
+            # physical data parquet file).
+            from_clause = (
+                f"{parquet_view} d " f"JOIN {ext_meta_view} m " f"USING ({sample_col})"
            )
+            is_join = True


sample_col (from YAML config) is interpolated directly into SQL identifiers (USING (...), m.<col>, etc.) without identifier validation/quoting. This can break if the configured column name is a reserved word or contains characters requiring quoting, and it also expands the SQL-injection surface if configs are not fully trusted. Consider validating sample_id.field against the same SQL-identifier regex as db_name, and/or consistently quoting identifiers when constructing SQL.

not a bad idea, need to think about it more

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

this seems to work

8939ee0

cmatKhan requested review from CalvinFang-code and Copilot February 18, 2026 19:54

Copilot started reviewing on behalf of cmatKhan February 18, 2026 19:55 View session

Copilot AI reviewed Feb 18, 2026

View reviewed changes

Update tfbpapi/datacard.py

f00c15a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DO NOT MERGE#79

DO NOT MERGE#79
cmatKhan wants to merge 2 commits intoBrentLab:devfrom
cmatKhan:dealing_with_nonembedded

cmatKhan commented Feb 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Feb 18, 2026

Uh oh!

Copilot AI Feb 18, 2026

Uh oh!

cmatKhan Feb 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

cmatKhan commented Feb 18, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

cmatKhan Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments