A Python CLI tool for importing h5ad single-cell files into cBioPortal's ClickHouse database, so that bulk and single-cell sequencing data can be queried together.
- cBioPortal Integration: Direct integration with cBioPortal ClickHouse database schema
- Sample/Patient Mapping: Flexible sample/patient mapping strategies with automatic fallbacks
- SPARSE Columns: Efficient storage using ClickHouse SPARSE columns for expression matrices
- Cell Type Harmonization: Map cell types to standard ontologies (Cell Ontology, UBERON)
- Cross-Analysis Queries: Compare bulk RNA-seq vs single-cell expression data
- Comprehensive Validation: Validate mappings, gene symbols, and data quality
- Flexible Configuration: YAML-based configuration with sensible defaults
- Production Ready: Comprehensive error handling, logging, and testing
# Clone the repository
git clone https://github.com/your-org/h5ad2cbioportaldb.git
cd h5ad2cbioportaldb
# Install with uv
uv pip install -e .
# Install development dependencies
uv pip install -e ".[dev]"
Copy the example configuration:
cp config.yaml.example config.yaml
Edit config.yaml with your cBioPortal ClickHouse connection details:
cbioportal:
  clickhouse:
    host: your-clickhouse-host
    port: 9000
    database: your-database
    username: your-username
    password: your-password
First, generate mapping templates to understand your data:
h5ad2cbioportaldb generate-mapping-template \
--file your_data.h5ad \
--sample-obs-column sample_id \
--patient-obs-column donor_id \
--study-id skcm_tcga \
--output-dir templates/
This creates:
- sample_mapping_template.csv: Map h5ad samples to cBioPortal samples
- patient_mapping_template.csv: Map h5ad patients to cBioPortal patients
- skcm_tcga_existing_samples.csv: Reference of existing cBioPortal samples
- skcm_tcga_existing_patients.csv: Reference of existing cBioPortal patients
- dataset_config.yaml: Complete configuration file with metadata and ready-to-use commands
Edit the template files to map your data:
sample_mapping.csv:
h5ad_sample_id,cbioportal_sample_id
MELANOMA_01,skcm_tcga_TCGA-BF-A1PU-01
MELANOMA_02,skcm_tcga_TCGA-BF-A1PV-01
MELANOMA_03, # Leave empty if no mapping exists
patient_mapping.csv:
h5ad_patient_id,cbioportal_patient_id
DONOR_01,skcm_tcga_TCGA-BF-A1PU
DONOR_02,skcm_tcga_TCGA-BF-A1PV
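Before running validation, it can help to see which rows of a mapping file are still empty. The following is a minimal sketch in plain Python, not part of the tool itself; the function name `split_mapping` is hypothetical, and the column names are taken from the example files above.

```python
import csv
import io


def split_mapping(csv_text, source_col, target_col):
    """Split a two-column mapping CSV into mapped pairs and unmapped source IDs."""
    mapped, unmapped = {}, []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Missing or short rows yield None, so normalize to an empty string first.
        source = (row.get(source_col) or "").strip()
        target = (row.get(target_col) or "").strip()
        if not source:
            continue  # skip blank lines
        if target:
            mapped[source] = target
        else:
            unmapped.append(source)
    return mapped, unmapped


example = """h5ad_sample_id,cbioportal_sample_id
MELANOMA_01,skcm_tcga_TCGA-BF-A1PU-01
MELANOMA_02,skcm_tcga_TCGA-BF-A1PV-01
MELANOMA_03,
"""

mapped, unmapped = split_mapping(example, "h5ad_sample_id", "cbioportal_sample_id")
print(len(mapped), unmapped)  # 2 ['MELANOMA_03']
```

Rows reported as unmapped are not necessarily errors; they are the ones that would rely on the tool's fallback behavior.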
You can now use the generated config file for validation:
h5ad2cbioportaldb validate-mappings \
--config templates/dataset_config.yaml
Or use individual mapping files:
h5ad2cbioportaldb validate-mappings \
--study-id skcm_tcga \
--sample-mapping sample_mapping.csv \
--patient-mapping patient_mapping.csv
Using the config file (recommended):
h5ad2cbioportaldb import prepare \
--config templates/dataset_config.yaml \
--output-dir parquets/
Or with individual arguments:
h5ad2cbioportaldb import prepare \
--file your_data.h5ad \
--dataset-id sc_skcm_001 \
--study-id skcm_tcga \
--cell-type-column leiden \
--sample-obs-column sample_id \
--sample-mapping sample_mapping.csv \
--patient-obs-column donor_id \
--patient-mapping patient_mapping.csv \
--description "Single-cell RNA-seq from SKCM patients" \
--output-dir parquets/
This generates compressed parquet files in the parquets/ directory.
h5ad2cbioportaldb import clickhouse \
--parquet-dir parquets/ \
--dataset-id sc_skcm_001 \
--study-id skcm_tcga
This loads the generated parquet files into ClickHouse efficiently.
The tool uses the following mapping strategies to handle various scenarios:
- h5ad sample → existing cBioPortal sample
  - Best case: Direct integration with existing bulk data
- h5ad sample → missing, but patient exists
  - Action: Creates synthetic sample ID (e.g., PATIENT_001-SC)
  - Benefit: Enables patient-level analysis
- Neither sample nor patient found
  - Action: Stores cells without cBioPortal links
  - Use case: Exploratory analysis of new cohorts
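The fallback order above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the function name `resolve_cell_links`, its arguments, and the outcome labels are hypothetical, and the `<patient>-<suffix>` format is assumed from the `PATIENT_001-SC` example.

```python
def resolve_cell_links(mapped_sample, mapped_patient,
                       known_samples, known_patients, suffix="SC"):
    """Return (sample_id, patient_id, outcome) following the three-step fallback."""
    # 1. The mapped sample already exists in cBioPortal: link to it directly.
    if mapped_sample and mapped_sample in known_samples:
        return mapped_sample, mapped_patient, "existing_sample"
    # 2. No usable sample, but the patient exists: build a synthetic sample ID.
    if mapped_patient and mapped_patient in known_patients:
        return f"{mapped_patient}-{suffix}", mapped_patient, "synthetic_sample"
    # 3. Neither is known: keep the cells without cBioPortal links.
    return None, None, "unmapped"


print(resolve_cell_links(None, "PATIENT_001", set(), {"PATIENT_001"}))
# ('PATIENT_001-SC', 'PATIENT_001', 'synthetic_sample')
```

Whether steps 2 and 3 are allowed at all is governed by the mapping settings, so a stricter strategy would presumably reject these cases instead of falling through.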
mapping:
  strategy: "flexible" # "strict", "patient_only", "flexible"
  create_synthetic_samples: true
  synthetic_sample_suffix: "SC"
  allow_unmapped_cells: true
The tool creates these tables in your cBioPortal database:
-- Dataset metadata
scRNA_datasets (dataset_id, name, cancer_study_identifier, ...)
-- Cell-level data with flexible mapping
scRNA_cells (dataset_id, cell_id, sample_unique_id, patient_unique_id, ...)
-- Gene mapping to cBioPortal genes
scRNA_dataset_genes (dataset_id, gene_idx, hugo_gene_symbol, entrez_gene_id)
-- Expression data using SPARSE columns
scRNA_expression_matrix (dataset_id, cell_id, gene_idx, matrix_type, count SPARSE)
-- Embeddings (UMAP, t-SNE, PCA)
scRNA_cell_embeddings (dataset_id, cell_id, embedding_type, dimension_idx, value)
-- Cell type harmonization
scRNA_cell_type_ontology (cell_type_id, cell_type_name, ontology, ...)
Harmonize cell types to Cell Ontology:
# Auto-harmonization using built-in mappings
h5ad2cbioportaldb harmonize \
--dataset sc_skcm_001 \
--ontology CL
# Custom harmonization with mapping file
h5ad2cbioportaldb harmonize \
--dataset sc_skcm_001 \
--ontology CL \
--mapping-file custom_cell_types.csv
Compare bulk vs single-cell expression:
h5ad2cbioportaldb query compare-expression \
--gene TP53 \
--study skcm_tcga \
--sc-dataset sc_skcm_001 \
--output tp53_comparison.csv
Get cell type summary:
h5ad2cbioportaldb query cell-type-summary \
--sc-dataset sc_skcm_001
Export filtered data back to h5ad:
h5ad2cbioportaldb export \
--dataset sc_skcm_001 \
--output t_cells_subset.h5ad \
--cell-types "T cell,CD4+ T cell,CD8+ T cell" \
--genes "TP53,BRCA1,EGFR"
# List datasets in study
h5ad2cbioportaldb list datasets --study skcm_tcga
# List all studies
h5ad2cbioportaldb list studies
# List cell types in study
h5ad2cbioportaldb list cell-types --study skcm_tcga
cbioportal:
  clickhouse:
    host: localhost
    port: 9000
    database: my_database
    username: default
    password: ""
    secure: false
    timeout: 30
import:
  table_prefix: "scRNA_"
  auto_map_genes: true
  validate_mappings: true
  batch_size: 10000
  max_memory_usage: "4GB"
mapping:
  strategy: "flexible" # "strict", "patient_only", "flexible"
  create_synthetic_samples: true
  synthetic_sample_suffix: "SC"
  allow_unmapped_cells: true
  require_patient_mapping: false
validation:
  check_study_exists: true
  warn_unmapped_genes: true
  warn_missing_mappings: true
  min_cells_per_sample: 10
  max_genes_per_dataset: 50000
# 1. Generate templates
h5ad2cbioportaldb generate-mapping-template \
--file new_study.h5ad \
--sample-obs-column sample \
--patient-obs-column patient \
--study-id new_study \
--output-dir mappings/
# 2. Complete mappings (manual step)
# Edit mappings/sample_mapping_template.csv
# Edit mappings/patient_mapping_template.csv
# 3. Validate
h5ad2cbioportaldb validate-mappings \
--config mappings/dataset_config.yaml
# 4. Import (two steps)
# Prepare parquet files
h5ad2cbioportaldb import prepare \
--config mappings/dataset_config.yaml \
--output-dir parquets/
# Load to ClickHouse
h5ad2cbioportaldb import clickhouse \
--parquet-dir parquets/ \
--dataset-id sc_new_001 \
--study-id new_study
# Import into existing study with patient-level mapping (two steps)
# Prepare parquet files
h5ad2cbioportaldb import prepare \
--file additional_samples.h5ad \
--dataset-id sc_skcm_002 \
--study-id skcm_tcga \
--cell-type-column leiden \
--patient-obs-column patient_id \
--patient-mapping patient_mapping.csv \
--description "Additional single-cell samples" \
--output-dir parquets/
# Load to ClickHouse
h5ad2cbioportaldb import clickhouse \
--parquet-dir parquets/ \
--dataset-id sc_skcm_002 \
--study-id skcm_tcga
# Harmonize cell types
h5ad2cbioportaldb harmonize \
--dataset sc_skcm_002 \
--ontology CL
# Compare with bulk data
h5ad2cbioportaldb query compare-expression \
--gene TP53 \
--study skcm_tcga \
--sc-dataset sc_skcm_002 \
--output analysis/tp53_bulk_vs_sc.csv
git clone https://github.com/your-org/h5ad2cbioportaldb.git
cd h5ad2cbioportaldb
uv pip install -e ".[dev]"
# Unit tests
pytest tests/unit/
# Integration tests (requires ClickHouse)
pytest tests/integration/ -m integration
# All tests
pytest
# With coverage
pytest --cov=h5ad2cbioportaldb --cov-report=html
# Format code
black src/ tests/
# Lint code
ruff check src/ tests/
# Type checking (if using mypy)
mypy src/
cd tests/fixtures/
python create_test_data.py
For datasets with >100k cells or >20k genes:
import:
  batch_size: 50000 # Increase batch size
  max_memory_usage: "8GB" # Increase memory limit
expression:
  min_expression_threshold: 0.1 # Filter low expression
  compression: "zstd" # Use compression
- Expression matrices: Use SPARSE columns automatically
- Batch processing: Configurable batch sizes
- Memory monitoring: Built-in memory usage tracking
- Indexed columns: All key columns are indexed
- Partitioning: Consider partitioning by study_id for large deployments
- Materialized views: Create for common query patterns
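To make the sparse-storage and batching ideas above concrete, here is a small plain-Python sketch. It is not how the tool is implemented; the helper names `nonzero_rows` and `batched` are hypothetical, and the `(cell_idx, gene_idx, value)` layout is only loosely modeled on the expression table described earlier.

```python
from itertools import islice


def nonzero_rows(matrix, threshold=0.0):
    """Yield (cell_idx, gene_idx, value) only for entries above the threshold."""
    for cell_idx, row in enumerate(matrix):
        for gene_idx, value in enumerate(row):
            if value > threshold:
                yield cell_idx, gene_idx, value


def batched(rows, batch_size):
    """Group an iterable into lists of at most batch_size items."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Toy matrix: 3 cells x 4 genes, mostly zeros, as is typical for scRNA-seq counts.
dense = [
    [0, 3, 0, 0],
    [0, 0, 0, 1],
    [2, 0, 0, 5],
]

batches = list(batched(nonzero_rows(dense), batch_size=3))
print([len(b) for b in batches])  # [3, 1]
```

Only 4 of the 12 entries are emitted, and they are written in two batches. A larger batch size means fewer, bigger inserts at the cost of more memory held at once, which is the trade-off behind batch_size and max_memory_usage.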
- Connection Failed
  Solution: Check the ClickHouse host, port, and credentials in config.yaml
- Study Not Found
  Solution: Verify that the study exists in cBioPortal: h5ad2cbioportaldb list studies
- Gene Mapping Issues
  Solution: Use --warn-unmapped-genes to see which genes weren't found
- Memory Issues
  Solution: Reduce batch_size or increase max_memory_usage in config
Enable debug logging:
h5ad2cbioportaldb --verbose import [options...]
Or in config.yaml:
logging:
  level: "DEBUG"
  file: "import.log"
Always run validation before import:
h5ad2cbioportaldb validate-mappings \
--study-id your_study \
--sample-mapping your_samples.csv \
--patient-mapping your_patients.csv
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
MIT License - see LICENSE file for details.