h5ad2cbioportaldb

⚠️ Work in Progress

A Python CLI tool for importing h5ad single-cell files into cBioPortal's ClickHouse database. This enables queries across bulk and single-cell sequencing data.

Features

cBioPortal Integration: Direct integration with cBioPortal ClickHouse database schema
Sample/Patient Mapping: Flexible sample/patient mapping strategies with automatic fallbacks
SPARSE Columns: Efficient storage using ClickHouse SPARSE columns for expression matrices
Cell Type Harmonization: Map cell types to standard ontologies (Cell Ontology, UBERON)
Cross-Analysis Queries: Compare bulk RNA-seq vs single-cell expression data
Comprehensive Validation: Validate mappings, gene symbols, and data quality
Flexible Configuration: YAML-based configuration with sensible defaults
Production Ready: Comprehensive error handling, logging, and testing

Installation

Using uv (recommended)

# Clone the repository
git clone https://github.com/your-org/h5ad2cbioportaldb.git
cd h5ad2cbioportaldb

# Install with uv
uv pip install -e .

# Install development dependencies
uv pip install -e ".[dev]"

Quick Start

1. Configure Database Connection

Copy the example configuration:

cp config.yaml.example config.yaml

Edit config.yaml with your cBioPortal ClickHouse connection details:

cbioportal:
  clickhouse:
    host: your-clickhouse-host
    port: 9000
    database: your-database
    username: your-username
    password: your-password

2. Generate Mapping Templates

First, generate mapping templates to understand your data:

h5ad2cbioportaldb generate-mapping-template \
  --file your_data.h5ad \
  --sample-obs-column sample_id \
  --patient-obs-column donor_id \
  --study-id skcm_tcga \
  --output-dir templates/

This creates:

sample_mapping_template.csv - Map h5ad samples to cBioPortal samples
patient_mapping_template.csv - Map h5ad patients to cBioPortal patients
skcm_tcga_existing_samples.csv - Reference of existing cBioPortal samples
skcm_tcga_existing_patients.csv - Reference of existing cBioPortal patients
dataset_config.yaml - Complete configuration file with metadata and ready-to-use commands

3. Complete the Mappings

Edit the template files to map your data:

sample_mapping.csv:

h5ad_sample_id,cbioportal_sample_id
MELANOMA_01,skcm_tcga_TCGA-BF-A1PU-01
MELANOMA_02,skcm_tcga_TCGA-BF-A1PV-01
MELANOMA_03,

Leave the cbioportal_sample_id field empty when no mapping exists (as for MELANOMA_03 above).

patient_mapping.csv:

h5ad_patient_id,cbioportal_patient_id
DONOR_01,skcm_tcga_TCGA-BF-A1PU
DONOR_02,skcm_tcga_TCGA-BF-A1PV
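
Before running validation, it can help to check which rows of a mapping file are still unmapped. The following is a minimal sketch using only the Python standard library; the helper name split_mapping is hypothetical and not part of the tool, and the column names are taken from the examples above.

```python
import csv
import io


def split_mapping(csv_text, source_col, target_col):
    """Split mapping rows into {source: target} pairs and unmapped source IDs."""
    mapped, unmapped = {}, []
    for row in csv.DictReader(io.StringIO(csv_text)):
        source = (row.get(source_col) or "").strip()
        target = (row.get(target_col) or "").strip()
        if not source:
            continue  # ignore rows without an h5ad identifier
        if target:
            mapped[source] = target
        else:
            unmapped.append(source)  # empty target means "no mapping exists"
    return mapped, unmapped


# Inline example data mirroring sample_mapping.csv
example = """h5ad_sample_id,cbioportal_sample_id
MELANOMA_01,skcm_tcga_TCGA-BF-A1PU-01
MELANOMA_02,skcm_tcga_TCGA-BF-A1PV-01
MELANOMA_03,
"""

mapped, unmapped = split_mapping(example, "h5ad_sample_id", "cbioportal_sample_id")
print(f"{len(mapped)} mapped, unmapped: {unmapped}")
```

To check a file on disk, pass the result of open("sample_mapping.csv").read() as csv_text.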

4. Validate Mappings

You can now use the generated config file for validation:

h5ad2cbioportaldb validate-mappings \
  --config templates/dataset_config.yaml

Or use individual mapping files:

h5ad2cbioportaldb validate-mappings \
  --study-id skcm_tcga \
  --sample-mapping sample_mapping.csv \
  --patient-mapping patient_mapping.csv

5. Import Dataset (Two-Step Process)

Step 5a: Prepare Parquet Files

Using the config file (recommended):

h5ad2cbioportaldb import prepare \
  --config templates/dataset_config.yaml \
  --output-dir parquets/

Or with individual arguments:

h5ad2cbioportaldb import prepare \
  --file your_data.h5ad \
  --dataset-id sc_skcm_001 \
  --study-id skcm_tcga \
  --cell-type-column leiden \
  --sample-obs-column sample_id \
  --sample-mapping sample_mapping.csv \
  --patient-obs-column donor_id \
  --patient-mapping patient_mapping.csv \
  --description "Single-cell RNA-seq from SKCM patients" \
  --output-dir parquets/

This generates compressed parquet files in the parquets/ directory.

Step 5b: Load to ClickHouse

h5ad2cbioportaldb import clickhouse \
  --parquet-dir parquets/ \
  --dataset-id sc_skcm_001 \
  --study-id skcm_tcga

This loads the generated parquet files from parquets/ into ClickHouse.

Mapping Strategies

The tool uses the following mapping strategies to handle various scenarios:

1. Direct Sample Match

h5ad sample → existing cBioPortal sample
Best case: Direct integration with existing bulk data

2. Patient-Only Match + Synthetic Samples

h5ad sample → missing, but patient exists
Action: Creates synthetic sample ID (e.g., PATIENT_001-SC)
Benefit: Enables patient-level analysis

3. No Mapping

Neither sample nor patient found
Action: Stores cells without cBioPortal links
Use case: Exploratory analysis of new cohorts

4. Configurable Behavior

mapping:
  strategy: "flexible"  # "strict", "patient_only", "flexible"
  create_synthetic_samples: true
  synthetic_sample_suffix: "SC"
  allow_unmapped_cells: true
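
The fallback order described above (direct sample match, then patient-only match with a synthetic sample, then no mapping) can be sketched as follows. This is an illustration under assumptions, not the tool's actual implementation: the names Link and resolve_link are hypothetical, the synthetic ID format follows the PATIENT_001-SC example, and the "strict" and "patient_only" strategies are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Link:
    sample_id: Optional[str]
    patient_id: Optional[str]
    kind: str  # "direct", "synthetic", or "unmapped"


def resolve_link(
    h5ad_sample: str,
    h5ad_patient: str,
    sample_map: Dict[str, str],
    patient_map: Dict[str, str],
    create_synthetic_samples: bool = True,
    synthetic_sample_suffix: str = "SC",
    allow_unmapped_cells: bool = True,
) -> Link:
    """Resolve one h5ad sample/patient pair to cBioPortal identifiers."""
    sample_id = sample_map.get(h5ad_sample)
    patient_id = patient_map.get(h5ad_patient)
    if sample_id:
        # 1. Direct sample match
        return Link(sample_id, patient_id, "direct")
    if patient_id and create_synthetic_samples:
        # 2. Patient-only match: derive a synthetic sample ID
        return Link(f"{patient_id}-{synthetic_sample_suffix}", patient_id, "synthetic")
    if allow_unmapped_cells:
        # 3. No usable mapping: keep the cells without a sample link
        return Link(None, patient_id, "unmapped")
    raise ValueError(f"no mapping for sample {h5ad_sample!r}")


samples = {"MELANOMA_01": "skcm_tcga_TCGA-BF-A1PU-01"}
patients = {"DONOR_01": "skcm_tcga_TCGA-BF-A1PU", "DONOR_02": "PATIENT_001"}

print(resolve_link("MELANOMA_01", "DONOR_01", samples, patients))
print(resolve_link("MELANOMA_02", "DONOR_02", samples, patients))
print(resolve_link("MELANOMA_03", "DONOR_03", samples, patients))
```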

Database Schema

The tool creates these tables in your cBioPortal database:

-- Dataset metadata
scRNA_datasets (dataset_id, name, cancer_study_identifier, ...)

-- Cell-level data with flexible mapping
scRNA_cells (dataset_id, cell_id, sample_unique_id, patient_unique_id, ...)

-- Gene mapping to cBioPortal genes
scRNA_dataset_genes (dataset_id, gene_idx, hugo_gene_symbol, entrez_gene_id)

-- Expression data using SPARSE columns
scRNA_expression_matrix (dataset_id, cell_id, gene_idx, matrix_type, count SPARSE)

-- Embeddings (UMAP, t-SNE, PCA)
scRNA_cell_embeddings (dataset_id, cell_id, embedding_type, dimension_idx, value)

-- Cell type harmonization
scRNA_cell_type_ontology (cell_type_id, cell_type_name, ontology, ...)
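
The column list of scRNA_expression_matrix suggests a long (one row per cell and gene) layout. As a rough sketch of that idea, the snippet below turns a small dense matrix into such rows and skips zero counts. The function name to_long_rows and the matrix_type value "raw" are assumptions made for illustration; the tool's real conversion may differ.

```python
def to_long_rows(dataset_id, cell_ids, matrix, matrix_type="raw"):
    """Yield (dataset_id, cell_id, gene_idx, matrix_type, count) for non-zero entries."""
    for cell_id, row in zip(cell_ids, matrix):
        for gene_idx, count in enumerate(row):
            if count != 0:  # zero entries are not emitted
                yield (dataset_id, cell_id, gene_idx, matrix_type, count)


# Two cells, three genes; most entries are zero, as is typical for scRNA-seq
cells = ["AAAC-1", "AAAG-1"]
counts = [
    [0, 3, 0],
    [1, 0, 0],
]

rows = list(to_long_rows("sc_skcm_001", cells, counts))
for r in rows:
    print(r)
```

Only two of the six matrix entries produce rows here, which is why a long layout tends to suit sparse expression data.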

Advanced Usage

Cell Type Harmonization

Harmonize cell types to Cell Ontology:

# Auto-harmonization using built-in mappings
h5ad2cbioportaldb harmonize \
  --dataset sc_skcm_001 \
  --ontology CL

# Custom harmonization with mapping file
h5ad2cbioportaldb harmonize \
  --dataset sc_skcm_001 \
  --ontology CL \
  --mapping-file custom_cell_types.csv

Cross-Analysis Queries

Compare bulk vs single-cell expression:

h5ad2cbioportaldb query compare-expression \
  --gene TP53 \
  --study skcm_tcga \
  --sc-dataset sc_skcm_001 \
  --output tp53_comparison.csv

Get cell type summary:

h5ad2cbioportaldb query cell-type-summary \
  --sc-dataset sc_skcm_001

Export Subsets

Export filtered data back to h5ad:

h5ad2cbioportaldb export \
  --dataset sc_skcm_001 \
  --output t_cells_subset.h5ad \
  --cell-types "T cell,CD4+ T cell,CD8+ T cell" \
  --genes "TP53,BRCA1,EGFR"

List Data

# List datasets in study
h5ad2cbioportaldb list datasets --study skcm_tcga

# List all studies
h5ad2cbioportaldb list studies

# List cell types in study
h5ad2cbioportaldb list cell-types --study skcm_tcga

Configuration Reference

Database Configuration

cbioportal:
  clickhouse:
    host: localhost
    port: 9000
    database: my_database
    username: default
    password: ""
    secure: false
    timeout: 30

Import Settings

import:
  table_prefix: "scRNA_"
  auto_map_genes: true
  validate_mappings: true
  batch_size: 10000
  max_memory_usage: "4GB"

Mapping Configuration

mapping:
  strategy: "flexible"  # "strict", "patient_only", "flexible"
  create_synthetic_samples: true
  synthetic_sample_suffix: "SC"
  allow_unmapped_cells: true
  require_patient_mapping: false

Validation Settings

validation:
  check_study_exists: true
  warn_unmapped_genes: true
  warn_missing_mappings: true
  min_cells_per_sample: 10
  max_genes_per_dataset: 50000
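
To make the two numeric limits concrete, here is a small sketch of the kind of check that min_cells_per_sample and max_genes_per_dataset imply. The function check_dataset is hypothetical, and whether the tool treats these as warnings or errors is not stated here; the sketch simply collects messages.

```python
from collections import Counter


def check_dataset(sample_per_cell, n_genes,
                  min_cells_per_sample=10, max_genes_per_dataset=50000):
    """Return a list of human-readable messages for violated limits."""
    messages = []
    cells_per_sample = Counter(sample_per_cell)  # one entry per cell
    for sample, n_cells in sorted(cells_per_sample.items()):
        if n_cells < min_cells_per_sample:
            messages.append(
                f"sample {sample} has {n_cells} cells (minimum {min_cells_per_sample})"
            )
    if n_genes > max_genes_per_dataset:
        messages.append(
            f"dataset has {n_genes} genes (maximum {max_genes_per_dataset})"
        )
    return messages


# 12 cells from one sample, 3 cells from another
labels = ["MELANOMA_01"] * 12 + ["MELANOMA_02"] * 3
for message in check_dataset(labels, n_genes=20000):
    print(message)
```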

Example Workflows

Workflow 1: New Study Integration

# 1. Generate templates
h5ad2cbioportaldb generate-mapping-template \
  --file new_study.h5ad \
  --sample-obs-column sample \
  --patient-obs-column patient \
  --study-id new_study \
  --output-dir mappings/

# 2. Complete mappings (manual step)
# Edit mappings/sample_mapping_template.csv
# Edit mappings/patient_mapping_template.csv

# 3. Validate
h5ad2cbioportaldb validate-mappings \
  --config mappings/dataset_config.yaml

# 4. Import (two steps)
# Prepare parquet files
h5ad2cbioportaldb import prepare \
  --config mappings/dataset_config.yaml \
  --output-dir parquets/

# Load to ClickHouse
h5ad2cbioportaldb import clickhouse \
  --parquet-dir parquets/ \
  --dataset-id sc_new_001 \
  --study-id new_study

Workflow 2: Existing Study Enhancement

# Import into existing study with patient-level mapping (two steps)
# Prepare parquet files
h5ad2cbioportaldb import prepare \
  --file additional_samples.h5ad \
  --dataset-id sc_skcm_002 \
  --study-id skcm_tcga \
  --cell-type-column leiden \
  --patient-obs-column patient_id \
  --patient-mapping patient_mapping.csv \
  --description "Additional single-cell samples" \
  --output-dir parquets/

# Load to ClickHouse
h5ad2cbioportaldb import clickhouse \
  --parquet-dir parquets/ \
  --dataset-id sc_skcm_002 \
  --study-id skcm_tcga

# Harmonize cell types
h5ad2cbioportaldb harmonize \
  --dataset sc_skcm_002 \
  --ontology CL

# Compare with bulk data
h5ad2cbioportaldb query compare-expression \
  --gene TP53 \
  --study skcm_tcga \
  --sc-dataset sc_skcm_002 \
  --output analysis/tp53_bulk_vs_sc.csv

Development

Setup Development Environment

git clone https://github.com/your-org/h5ad2cbioportaldb.git
cd h5ad2cbioportaldb
uv pip install -e ".[dev]"

Run Tests

# Unit tests
pytest tests/unit/

# Integration tests (requires ClickHouse)
pytest tests/integration/ -m integration

# All tests
pytest

# With coverage
pytest --cov=h5ad2cbioportaldb --cov-report=html

Code Quality

# Format code
black src/ tests/

# Lint code
ruff check src/ tests/

# Type checking (if using mypy)
mypy src/

Create Test Data

cd tests/fixtures/
python create_test_data.py

Performance Considerations

Large Datasets

For datasets with >100k cells or >20k genes:

import:
  batch_size: 50000  # Increase batch size
  max_memory_usage: "8GB"  # Increase memory limit

expression:
  min_expression_threshold: 0.1  # Filter low expression
  compression: "zstd"  # Use compression
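
The effect of min_expression_threshold and batch_size can be pictured with the sketch below. It assumes that values below the threshold are dropped and that rows are then written in fixed-size batches; both helper functions are hypothetical and only illustrate the idea.

```python
from itertools import islice


def filter_low_expression(rows, min_expression_threshold=0.1):
    """Keep rows whose last field (the expression value) meets the threshold."""
    return (row for row in rows if row[-1] >= min_expression_threshold)


def batched(rows, batch_size):
    """Yield lists of at most batch_size rows without loading everything at once."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# (cell_id, gene_idx, value) rows
expression_rows = [
    ("c1", 0, 0.05),
    ("c1", 1, 2.0),
    ("c2", 0, 0.1),
    ("c2", 3, 7.5),
    ("c3", 2, 0.0),
]

batches = list(batched(filter_low_expression(expression_rows), batch_size=2))
for i, batch in enumerate(batches):
    print(i, batch)
```

Because both helpers are lazy, peak memory is bounded by one batch rather than by the whole dataset.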

Memory Usage

Expression matrices: Use SPARSE columns automatically
Batch processing: Configurable batch sizes
Memory monitoring: Built-in memory usage tracking

Query Performance

Indexed columns: All key columns are indexed
Partitioning: Consider partitioning by study_id for large deployments
Materialized views: Create for common query patterns

Troubleshooting

Common Issues

Connection Failed

Solution: Check ClickHouse host, port, and credentials in config.yaml
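
A quick way to separate network problems from credential problems is to test whether the configured host and port accept TCP connections at all. The sketch below uses only the standard library; can_connect is a hypothetical helper, and a successful result says nothing about whether the username, password, or database are correct.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


if __name__ == "__main__":
    # Replace with the host and port from config.yaml
    print(can_connect("localhost", 9000, timeout=2.0))
```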

Study Not Found

Solution: Verify that the study exists in cBioPortal by running h5ad2cbioportaldb list studies

Gene Mapping Issues

Solution: Use --warn-unmapped-genes to see which genes weren't found

Memory Issues

Solution: Reduce batch_size or increase max_memory_usage in config

Logging

Enable debug logging:

h5ad2cbioportaldb --verbose import [options...]

Or in config.yaml:

logging:
  level: "DEBUG"
  file: "import.log"
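
For readers scripting around the tool, the level and file settings correspond closely to what Python's standard logging module offers. The sketch below shows an equivalent setup in plain Python; configure_logging is a hypothetical helper, and how the tool itself wires these settings is not shown in this README.

```python
import logging


def configure_logging(level="DEBUG", file=None):
    """Log to the console and, if file is given, also to that file."""
    handlers = [logging.StreamHandler()]
    if file:
        handlers.append(logging.FileHandler(file))
    logging.basicConfig(
        level=getattr(logging, level.upper()),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=handlers,
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )


configure_logging(level="DEBUG")
logging.getLogger("h5ad_import_demo").debug("debug logging is enabled")
```

Passing file="import.log" mirrors the YAML example by also writing messages to that file.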

Validation Errors

Always run validation before import:

h5ad2cbioportaldb validate-mappings \
  --study-id your_study \
  --sample-mapping your_samples.csv \
  --patient-mapping your_patients.csv

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

License

MIT License - see LICENSE file for details.
