Python CLI tool for importing h5ad single-cell files into cBioPortal's ClickHouse database

cBioPortal/h5ad2cbioportaldb

h5ad2cbioportaldb

⚠️ Work in Progress

A Python CLI tool for importing h5ad single-cell files into cBioPortal's ClickHouse database. This enables queries across bulk and single-cell sequencing data.

Features

  • cBioPortal Integration: Direct integration with cBioPortal ClickHouse database schema
  • Sample/Patient Mapping: Flexible sample/patient mapping strategies with automatic fallbacks
  • SPARSE Columns: Efficient storage using ClickHouse SPARSE columns for expression matrices
  • Cell Type Harmonization: Map cell types to standard ontologies (Cell Ontology, UBERON)
  • Cross-Analysis Queries: Compare bulk RNA-seq vs single-cell expression data
  • Comprehensive Validation: Validate mappings, gene symbols, and data quality
  • Flexible Configuration: YAML-based configuration with sensible defaults
  • Production Ready: Comprehensive error handling, logging, and testing

Installation

Using uv (recommended)

# Clone the repository
git clone https://github.com/cBioPortal/h5ad2cbioportaldb.git
cd h5ad2cbioportaldb

# Install with uv
uv pip install -e .

# Install development dependencies
uv pip install -e ".[dev]"

Quick Start

1. Configure Database Connection

Copy the example configuration:

cp config.yaml.example config.yaml

Edit config.yaml with your cBioPortal ClickHouse connection details:

cbioportal:
  clickhouse:
    host: your-clickhouse-host
    port: 9000
    database: your-database
    username: your-username
    password: your-password

2. Generate Mapping Templates

First, generate mapping templates to understand your data:

h5ad2cbioportaldb generate-mapping-template \
  --file your_data.h5ad \
  --sample-obs-column sample_id \
  --patient-obs-column donor_id \
  --study-id skcm_tcga \
  --output-dir templates/

This creates:

  • sample_mapping_template.csv - Map h5ad samples to cBioPortal samples
  • patient_mapping_template.csv - Map h5ad patients to cBioPortal patients
  • skcm_tcga_existing_samples.csv - Reference of existing cBioPortal samples
  • skcm_tcga_existing_patients.csv - Reference of existing cBioPortal patients
  • dataset_config.yaml - Complete configuration file with metadata and ready-to-use commands
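
As a rough illustration of what a sample mapping template contains, the sketch below builds one from a list of sample IDs using only the standard library. The function name build_sample_template is hypothetical and not part of the tool; the column names follow the example CSV shown in step 3, and the real templates may carry additional columns.

```python
import csv
import io


def build_sample_template(sample_ids):
    """Return CSV text with one row per unique h5ad sample and an empty target column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["h5ad_sample_id", "cbioportal_sample_id"])
    # Keep first-seen order while dropping duplicates (one row per sample, not per cell).
    for sample_id in dict.fromkeys(sample_ids):
        writer.writerow([sample_id, ""])
    return buffer.getvalue()


# Per-cell sample labels, as they might appear in an obs column (illustrative values).
template = build_sample_template(["MELANOMA_01", "MELANOMA_02", "MELANOMA_01"])
print(template)
```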

3. Complete the Mappings

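Edit the template files to map your data. The examples below assume the completed templates are saved as sample_mapping.csv and patient_mapping.csv, the file names used by the later commands: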

sample_mapping.csv:

h5ad_sample_id,cbioportal_sample_id
MELANOMA_01,skcm_tcga_TCGA-BF-A1PU-01
MELANOMA_02,skcm_tcga_TCGA-BF-A1PV-01
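MELANOMA_03,

Leave cbioportal_sample_id empty when no mapping exists, as for MELANOMA_03 above. Do not add inline comments to the CSV; standard CSV parsers read them as part of the field value.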

patient_mapping.csv:

h5ad_patient_id,cbioportal_patient_id
DONOR_01,skcm_tcga_TCGA-BF-A1PU
DONOR_02,skcm_tcga_TCGA-BF-A1PV

4. Validate Mappings

You can now use the generated config file for validation:

h5ad2cbioportaldb validate-mappings \
  --config templates/dataset_config.yaml

Or use individual mapping files:

h5ad2cbioportaldb validate-mappings \
  --study-id skcm_tcga \
  --sample-mapping sample_mapping.csv \
  --patient-mapping patient_mapping.csv
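
To make the idea of mapping validation concrete, here is a minimal sketch that classifies each row of a sample mapping as mapped, unmapped, or pointing at an ID that is not in the study. It is not the tool's implementation: check_sample_mapping is a hypothetical helper, and the second row deliberately uses an ID that is absent from the reference set.

```python
import csv
import io


def check_sample_mapping(csv_text, existing_ids):
    """Classify rows as mapped, unmapped (empty target), or unknown (target not in study)."""
    report = {"mapped": [], "unmapped": [], "unknown": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        source = row["h5ad_sample_id"].strip()
        target = (row["cbioportal_sample_id"] or "").strip()
        if not target:
            report["unmapped"].append(source)
        elif target in existing_ids:
            report["mapped"].append(source)
        else:
            report["unknown"].append(source)
    return report


mapping_csv = """h5ad_sample_id,cbioportal_sample_id
MELANOMA_01,skcm_tcga_TCGA-BF-A1PU-01
MELANOMA_02,skcm_tcga_TCGA-BF-A1PX-01
MELANOMA_03,
"""
# Stand-in for the IDs listed in skcm_tcga_existing_samples.csv.
existing = {"skcm_tcga_TCGA-BF-A1PU-01", "skcm_tcga_TCGA-BF-A1PV-01"}
report = check_sample_mapping(mapping_csv, existing)
print(report)
```

Rows in the unknown group are the ones worth fixing before import, since they usually indicate a typo rather than a genuinely missing sample.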

5. Import Dataset (Two-Step Process)

Step 5a: Prepare Parquet Files

Using the config file (recommended):

h5ad2cbioportaldb import prepare \
  --config templates/dataset_config.yaml \
  --output-dir parquets/

Or with individual arguments:

h5ad2cbioportaldb import prepare \
  --file your_data.h5ad \
  --dataset-id sc_skcm_001 \
  --study-id skcm_tcga \
  --cell-type-column leiden \
  --sample-obs-column sample_id \
  --sample-mapping sample_mapping.csv \
  --patient-obs-column donor_id \
  --patient-mapping patient_mapping.csv \
  --description "Single-cell RNA-seq from SKCM patients" \
  --output-dir parquets/

This generates compressed Parquet files in the parquets/ directory.

Step 5b: Load to ClickHouse

h5ad2cbioportaldb import clickhouse \
  --parquet-dir parquets/ \
  --dataset-id sc_skcm_001 \
  --study-id skcm_tcga

This loads the generated Parquet files into ClickHouse.

Mapping Strategies

The tool uses the following mapping strategies to handle various scenarios:

1. Direct Sample Match

  • h5ad sample → existing cBioPortal sample
  • Best case: Direct integration with existing bulk data

2. Patient-Only Match + Synthetic Samples

  • h5ad sample → missing, but patient exists
  • Action: Creates synthetic sample ID (e.g., PATIENT_001-SC)
  • Benefit: Enables patient-level analysis

3. No Mapping

  • Neither sample nor patient found
  • Action: Stores cells without cBioPortal links
  • Use case: Exploratory analysis of new cohorts

4. Configurable Behavior

mapping:
  strategy: "flexible"  # "strict", "patient_only", "flexible"
  create_synthetic_samples: true
  synthetic_sample_suffix: "SC"
  allow_unmapped_cells: true
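
The fallback order described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the tool's code: resolve_links is a hypothetical name, and the meaning of "strict" (accept direct sample matches only) and "patient_only" (ignore sample mappings and link at patient level) is assumed here, since the configuration comment lists the values without defining them.

```python
def resolve_links(mapped_sample, mapped_patient, strategy="flexible",
                  create_synthetic_samples=True, suffix="SC",
                  allow_unmapped_cells=True):
    """Return (sample_id, patient_id, kind) or raise ValueError if the case is not allowed."""
    # Case 1: the h5ad sample maps to an existing cBioPortal sample.
    if mapped_sample and strategy != "patient_only":
        return mapped_sample, mapped_patient, "direct"
    # Case 2: only the patient is known; optionally derive a synthetic sample ID.
    if mapped_patient and strategy != "strict":
        sample = f"{mapped_patient}-{suffix}" if create_synthetic_samples else None
        return sample, mapped_patient, "patient_only"
    # Case 3: nothing matched; keep the cells without cBioPortal links if permitted.
    if allow_unmapped_cells and strategy == "flexible":
        return None, None, "unmapped"
    raise ValueError("no usable mapping under the configured strategy")


print(resolve_links("SAMPLE_001", "PATIENT_001"))
print(resolve_links(None, "PATIENT_001"))
print(resolve_links(None, None))
```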

Database Schema

The tool creates these tables in your cBioPortal database:

-- Dataset metadata
scRNA_datasets (dataset_id, name, cancer_study_identifier, ...)

-- Cell-level data with flexible mapping
scRNA_cells (dataset_id, cell_id, sample_unique_id, patient_unique_id, ...)

-- Gene mapping to cBioPortal genes
scRNA_dataset_genes (dataset_id, gene_idx, hugo_gene_symbol, entrez_gene_id)

-- Expression data using SPARSE columns
scRNA_expression_matrix (dataset_id, cell_id, gene_idx, matrix_type, count SPARSE)

-- Embeddings (UMAP, t-SNE, PCA)
scRNA_cell_embeddings (dataset_id, cell_id, embedding_type, dimension_idx, value)

-- Cell type harmonization
scRNA_cell_type_ontology (cell_type_id, cell_type_name, ontology, ...)
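
The expression table is laid out in long format, one row per (cell, gene) value. The sketch below shows how a dense count matrix could be turned into such rows while skipping zeros. Whether the tool omits zero entries at this stage is an assumption, as are the helper name to_sparse_rows and the matrix_type value "raw".

```python
def to_sparse_rows(dataset_id, cell_ids, dense_counts, matrix_type="raw"):
    """Yield (dataset_id, cell_id, gene_idx, matrix_type, count) for non-zero entries only."""
    for cell_id, row in zip(cell_ids, dense_counts):
        for gene_idx, count in enumerate(row):
            if count != 0:
                yield (dataset_id, cell_id, gene_idx, matrix_type, count)


# Two cells by four genes; most entries are zero, as is typical for single-cell counts.
counts = [
    [0, 3, 0, 0],
    [1, 0, 0, 2],
]
rows = list(to_sparse_rows("sc_skcm_001", ["cell_a", "cell_b"], counts))
for row in rows:
    print(row)
```

Only three of the eight matrix entries produce rows here, which is the saving that makes a long-format layout practical for sparse data.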

Advanced Usage

Cell Type Harmonization

Harmonize cell types to Cell Ontology:

# Auto-harmonization using built-in mappings
h5ad2cbioportaldb harmonize \
  --dataset sc_skcm_001 \
  --ontology CL

# Custom harmonization with mapping file
h5ad2cbioportaldb harmonize \
  --dataset sc_skcm_001 \
  --ontology CL \
  --mapping-file custom_cell_types.csv
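
Conceptually, harmonization is a lookup from free-text labels to ontology term IDs. The following sketch shows one way such a lookup might normalize labels before matching. The three Cell Ontology IDs are real terms, but the lookup table, the normalization rules, and the function name harmonize are illustrative assumptions and do not reflect the tool's built-in mappings.

```python
# A tiny label -> Cell Ontology ID table (illustrative subset).
CL_LOOKUP = {
    "t cell": "CL:0000084",
    "b cell": "CL:0000236",
    "macrophage": "CL:0000235",
}


def harmonize(labels, lookup=CL_LOOKUP):
    """Map free-text cell type labels to ontology IDs; unmatched labels map to None."""
    result = {}
    for label in labels:
        # Lowercase, treat underscores as spaces, and collapse repeated whitespace.
        key = " ".join(label.lower().replace("_", " ").split())
        # Treat a simple plural ("T cells") as the singular term.
        if key not in lookup and key.endswith("s"):
            key = key[:-1]
        result[label] = lookup.get(key)
    return result


mapped = harmonize(["T cells", "B_cell", "Macrophage", "cluster_7"])
print(mapped)
```

Labels that come back as None, such as numeric cluster names from leiden clustering, are the ones a custom mapping file would need to cover.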

Cross-Analysis Queries

Compare bulk vs single-cell expression:

h5ad2cbioportaldb query compare-expression \
  --gene TP53 \
  --study skcm_tcga \
  --sc-dataset sc_skcm_001 \
  --output tp53_comparison.csv
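
One common way to compare the two data types is to average single-cell values per sample (a "pseudobulk" value) and set that next to the bulk measurement for the same sample. The sketch below illustrates that idea only; it is an assumption that the command works along these lines, and pseudobulk_vs_bulk and the sample IDs are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pseudobulk_vs_bulk(cell_values, bulk_values):
    """Average single-cell values per sample and pair each mean with the bulk value."""
    per_sample = defaultdict(list)
    for sample_id, value in cell_values:
        per_sample[sample_id].append(value)
    comparison = []
    for sample_id in sorted(per_sample):
        if sample_id in bulk_values:  # keep only samples present in both data types
            comparison.append((sample_id, mean(per_sample[sample_id]), bulk_values[sample_id]))
    return comparison


# (sample_id, expression of one gene in one cell) pairs, plus bulk values per sample.
cells = [("S1", 2.0), ("S1", 4.0), ("S2", 1.0), ("S3", 5.0)]
bulk = {"S1": 3.5, "S2": 0.5}
table = pseudobulk_vs_bulk(cells, bulk)
print(table)
```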

Get cell type summary:

h5ad2cbioportaldb query cell-type-summary \
  --sc-dataset sc_skcm_001

Export Subsets

Export filtered data back to h5ad:

h5ad2cbioportaldb export \
  --dataset sc_skcm_001 \
  --output t_cells_subset.h5ad \
  --cell-types "T cell,CD4+ T cell,CD8+ T cell" \
  --genes "TP53,BRCA1,EGFR"

List Data

# List datasets in study
h5ad2cbioportaldb list datasets --study skcm_tcga

# List all studies
h5ad2cbioportaldb list studies

# List cell types in study
h5ad2cbioportaldb list cell-types --study skcm_tcga

Configuration Reference

Database Configuration

cbioportal:
  clickhouse:
    host: localhost
    port: 9000
    database: my_database
    username: default
    password: ""
    secure: false
    timeout: 30
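
Configuration with defaults usually means overlaying the user's file on a built-in set of values. A minimal sketch of such a recursive merge follows, working on plain dictionaries as a YAML parser would return them. The DEFAULTS dictionary mirrors the values above, while merge_config and the host name ch.example.org are hypothetical.

```python
import copy

DEFAULTS = {
    "cbioportal": {
        "clickhouse": {
            "host": "localhost",
            "port": 9000,
            "username": "default",
            "password": "",
            "secure": False,
            "timeout": 30,
        }
    }
}


def merge_config(defaults, overrides):
    """Recursively overlay user settings on defaults without mutating either input."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# The user only sets what differs from the defaults.
user = {"cbioportal": {"clickhouse": {"host": "ch.example.org", "database": "my_database"}}}
config = merge_config(DEFAULTS, user)
print(config["cbioportal"]["clickhouse"])
```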

Import Settings

import:
  table_prefix: "scRNA_"
  auto_map_genes: true
  validate_mappings: true
  batch_size: 10000
  max_memory_usage: "4GB"

Mapping Configuration

mapping:
  strategy: "flexible"  # "strict", "patient_only", "flexible"
  create_synthetic_samples: true
  synthetic_sample_suffix: "SC"
  allow_unmapped_cells: true
  require_patient_mapping: false

Validation Settings

validation:
  check_study_exists: true
  warn_unmapped_genes: true
  warn_missing_mappings: true
  min_cells_per_sample: 10
  max_genes_per_dataset: 50000
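
The two numeric thresholds above lend themselves to a simple pre-import check. The sketch below shows how they might be applied to per-cell sample labels and a gene count. The helper quality_warnings and the message wording are hypothetical, and the tool may report these conditions differently.

```python
from collections import Counter


def quality_warnings(cell_sample_ids, n_genes, min_cells_per_sample=10,
                     max_genes_per_dataset=50000):
    """Return warnings for samples with too few cells and for datasets with too many genes."""
    warnings = []
    for sample_id, n_cells in sorted(Counter(cell_sample_ids).items()):
        if n_cells < min_cells_per_sample:
            warnings.append(f"sample {sample_id} has {n_cells} cells (< {min_cells_per_sample})")
    if n_genes > max_genes_per_dataset:
        warnings.append(f"dataset has {n_genes} genes (> {max_genes_per_dataset})")
    return warnings


# One sample label per cell: S1 has 12 cells, S2 only 3.
obs = ["S1"] * 12 + ["S2"] * 3
messages = quality_warnings(obs, n_genes=60000)
for message in messages:
    print(message)
```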

Example Workflows

Workflow 1: New Study Integration

# 1. Generate templates
h5ad2cbioportaldb generate-mapping-template \
  --file new_study.h5ad \
  --sample-obs-column sample \
  --patient-obs-column patient \
  --study-id new_study \
  --output-dir mappings/

# 2. Complete mappings (manual step)
# Edit mappings/sample_mapping_template.csv
# Edit mappings/patient_mapping_template.csv

# 3. Validate
h5ad2cbioportaldb validate-mappings \
  --config mappings/dataset_config.yaml

# 4. Import (two steps)
# Prepare parquet files
h5ad2cbioportaldb import prepare \
  --config mappings/dataset_config.yaml \
  --output-dir parquets/

# Load to ClickHouse
h5ad2cbioportaldb import clickhouse \
  --parquet-dir parquets/ \
  --dataset-id sc_new_001 \
  --study-id new_study

Workflow 2: Existing Study Enhancement

# Import into existing study with patient-level mapping (two steps)
# Prepare parquet files
h5ad2cbioportaldb import prepare \
  --file additional_samples.h5ad \
  --dataset-id sc_skcm_002 \
  --study-id skcm_tcga \
  --cell-type-column leiden \
  --patient-obs-column patient_id \
  --patient-mapping patient_mapping.csv \
  --description "Additional single-cell samples" \
  --output-dir parquets/

# Load to ClickHouse
h5ad2cbioportaldb import clickhouse \
  --parquet-dir parquets/ \
  --dataset-id sc_skcm_002 \
  --study-id skcm_tcga

# Harmonize cell types
h5ad2cbioportaldb harmonize \
  --dataset sc_skcm_002 \
  --ontology CL

# Compare with bulk data
h5ad2cbioportaldb query compare-expression \
  --gene TP53 \
  --study skcm_tcga \
  --sc-dataset sc_skcm_002 \
  --output analysis/tp53_bulk_vs_sc.csv

Development

Setup Development Environment

git clone https://github.com/cBioPortal/h5ad2cbioportaldb.git
cd h5ad2cbioportaldb
uv pip install -e ".[dev]"

Run Tests

# Unit tests
pytest tests/unit/

# Integration tests (requires ClickHouse)
pytest tests/integration/ -m integration

# All tests
pytest

# With coverage
pytest --cov=h5ad2cbioportaldb --cov-report=html

Code Quality

# Format code
black src/ tests/

# Lint code
ruff check src/ tests/

# Type checking (if using mypy)
mypy src/

Create Test Data

cd tests/fixtures/
python create_test_data.py

Performance Considerations

Large Datasets

For datasets with >100k cells or >20k genes:

import:
  batch_size: 50000  # Increase batch size
  max_memory_usage: "8GB"  # Increase memory limit

expression:
  min_expression_threshold: 0.1  # Filter low expression
  compression: "zstd"  # Use compression
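
To show how batch_size and min_expression_threshold interact, the sketch below filters rows lazily and groups the survivors into fixed-size batches, so that only one batch is held in memory at a time. It is an illustration, not the tool's code: filtered_batches is a hypothetical helper, and treating the threshold as inclusive (>=) is an assumption.

```python
from itertools import islice


def filtered_batches(rows, batch_size=50000, min_expression_threshold=0.1):
    """Yield lists of at most batch_size rows whose value meets the threshold."""
    # Generator expression: rows are filtered one at a time, never all at once.
    kept = (row for row in rows if row[2] >= min_expression_threshold)
    while True:
        batch = list(islice(kept, batch_size))
        if not batch:
            return
        yield batch


# (cell_id, gene_idx, value) rows; two fall below the 0.1 threshold.
rows = [("c1", 0, 0.05), ("c1", 1, 2.0), ("c2", 0, 0.3), ("c2", 2, 1.5), ("c3", 1, 0.0)]
batches = list(filtered_batches(rows, batch_size=2))
print(batches)
```

A larger batch size means fewer insert round trips at the cost of higher peak memory, which is why the two import settings are usually tuned together.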

Memory Usage

  • Expression matrices: Use SPARSE columns automatically
  • Batch processing: Configurable batch sizes
  • Memory monitoring: Built-in memory usage tracking

Query Performance

  • Indexed columns: All key columns are indexed
  • Partitioning: Consider partitioning by study_id for large deployments
  • Materialized views: Create for common query patterns

Troubleshooting

Common Issues

  1. Connection Failed

    Solution: Check ClickHouse host, port, and credentials in config.yaml
    
  2. Study Not Found

    Solution: Verify study exists in cBioPortal: h5ad2cbioportaldb list studies
    
  3. Gene Mapping Issues

    Solution: Use --warn-unmapped-genes to see which genes weren't found
    
  4. Memory Issues

    Solution: Reduce batch_size or increase max_memory_usage in config
    

Logging

Enable debug logging:

h5ad2cbioportaldb --verbose import [options...]

Or in config.yaml:

logging:
  level: "DEBUG"
  file: "import.log"
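
For readers unfamiliar with how such a logging section is typically consumed, here is a minimal sketch using Python's standard logging module. The function configure_logging and the logger name are hypothetical; the tool's actual handling of the level and file keys may differ.

```python
import logging


def configure_logging(settings):
    """Create a logger from a mapping with optional 'level' and 'file' keys."""
    logger = logging.getLogger("h5ad2cbioportaldb.example")
    logger.setLevel(getattr(logging, settings.get("level", "INFO").upper()))
    logger.handlers.clear()  # avoid duplicate output if called more than once
    # Write to the named file if one is given, otherwise to the console.
    if settings.get("file"):
        handler = logging.FileHandler(settings["file"])
    else:
        handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


logger = configure_logging({"level": "DEBUG"})
logger.debug("debug logging enabled")
```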

Validation Errors

Always run validation before import:

h5ad2cbioportaldb validate-mappings \
  --study-id your_study \
  --sample-mapping your_samples.csv \
  --patient-mapping your_patients.csv

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

MIT License - see LICENSE file for details.
