Roche/sample-matching-workflow

Sample Matching Workflow v2.0.0

CWL workflow for quantifying sample-relatedness and detecting incorrectly paired sequencing datasets from different donors (sample-swap). This workflow extracts fingerprints from BAM files and compares them either pairwise (using a CSV file) or all-vs-all.

Overview

This version has been refactored to run with standard CWL runners (any cwl-runner implementation, such as cwltool) on local machines or standard HPC environments.

Features

  • Fingerprint Extraction: Uses Picard ExtractFingerprint to generate VCF files from BAM files
  • Flexible Comparison: Supports both pairwise comparison (via CSV file) and all-vs-all comparison

Requirements

  • CWL runner (e.g., cwltool, toil-cwl-runner)
  • Docker (all tools are containerized)

Docker Images Used

All Docker images are publicly available:

  • broadinstitute/picard:3.4.0 - For Picard tools (ExtractFingerprint, CrosscheckFingerprints)
  • python:3.11-slim - For Python-based utility scripts

Workflow Structure

main.cwl                              # Main workflow
├── generate_scatter_initial          # Scans BAM directory for files
├── extract_fingerprint               # Extracts fingerprint VCF from each BAM
├── files2directory                   # Collects VCF files into a directory
├── generate_scatter_pairs            # Pairs files according to CSV (if provided)
└── crosscheck_fingerprints           # Compares fingerprints
    ├── pairs mode (multiple runs)    # One comparison per pair
    └── allvall mode (single run)     # Single all-vs-all comparison

Inputs

Required Inputs

  • bam_directory (Directory): Directory containing BAM files to compare
  • reference_genome (File): Reference genome FASTA file. The following secondary files must be present in the same directory as the FASTA:
    • .fai index (e.g., reference.fasta.fai) — generated with samtools faidx
    • .dict sequence dictionary (e.g., reference.dict) — generated with picard CreateSequenceDictionary
  • haplotype_map (File): Haplotype map file for Picard fingerprinting

Optional Inputs

  • regex_split (string): Regex pattern used to extract sample names from BAM filenames
    • If not provided: uses the basename (everything before the .bam extension)
    • If provided: splits the filename on the pattern and takes the first part
  • pairs_csv (File): CSV file with two columns (no header): left_sample,right_sample
    • Each entry can be a bare sample name (e.g., sample1) or a filename with extension (e.g., sample1.bam)
    • Must be under 64 KiB (CWL loadContents limitation)
    • If provided: performs pairwise comparisons only
    • If omitted: performs all-vs-all comparison
  • cpu (int): Number of CPU cores to allocate for crosscheck_fingerprints step
    • Default: 16 cores
  • ram (int): Amount of RAM in MiB to allocate for crosscheck_fingerprints step
    • Default: 38000 MiB (roughly 37 GiB)
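
The pairs_csv constraints listed above (two columns, no header, under 64 KiB, entries with or without an extension) can be checked before a run. The following is a minimal sketch, not the workflow's own parser; the function names are hypothetical, and it assumes that .bam is the only extension that needs to be stripped from an entry.

```python
import csv
import io

MAX_LOAD_CONTENTS_BYTES = 64 * 1024  # size limit stated for pairs_csv


def normalize_entry(cell: str) -> str:
    # Assumption: a trailing ".bam" is the only extension to remove.
    cell = cell.strip()
    return cell[:-4] if cell.endswith(".bam") else cell


def parse_pairs_csv(raw: bytes) -> list[tuple[str, str]]:
    """Return (left_sample, right_sample) pairs from a headerless CSV."""
    if len(raw) >= MAX_LOAD_CONTENTS_BYTES:
        raise ValueError("pairs CSV must be under 64 KiB")
    pairs = []
    reader = csv.reader(io.StringIO(raw.decode("utf-8")))
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        pairs.append((normalize_entry(row[0]), normalize_entry(row[1])))
    return pairs
```

Calling parse_pairs_csv(open("pairs.csv", "rb").read()) raises a ValueError on the first malformed line, which is usually cheaper to discover than a failed workflow step.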

Outputs

  • fingerprints: Extracted fingerprint VCF files for each input BAM
  • crosscheck_metrics_pairs: Crosscheck metrics for each pair (only if pairs_csv provided)
  • crosscheck_metrics_allvall: Single crosscheck metrics file for all-vs-all (only if pairs_csv not provided)
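
The crosscheck metrics outputs are Picard metrics files: comment lines starting with #, then a tab-separated table. A minimal sketch for reading one and flagging suspicious rows is shown below. The function names are hypothetical, and the column names (LEFT_GROUP_VALUE, RIGHT_GROUP_VALUE, RESULT) are taken from Picard's CrosscheckFingerprints metrics as an assumption; check them against the header row of your own output.

```python
import csv


def parse_picard_metrics(text: str) -> list[dict[str, str]]:
    """Parse the tab-separated table of a Picard metrics file."""
    # Drop blank lines and the '#'-prefixed header comments.
    rows = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    return list(csv.DictReader(rows, delimiter="\t"))


def unexpected_results(records: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep rows whose RESULT starts with UNEXPECTED (assumed column name)."""
    return [r for r in records if r.get("RESULT", "").startswith("UNEXPECTED")]
```

Rows returned by unexpected_results are candidates for a sample swap or mislabelled pairing and are worth inspecting by hand.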

Usage Examples

Example 1: Pairwise Comparison

Create a CSV file pairs.csv with pairs to compare (bare sample names or filenames with extensions):

sample1,sample2
sample3,sample4

Create an input YAML file inputs.yml:

bam_directory:
  class: Directory
  path: /path/to/bam/files

pairs_csv:
  class: File
  path: pairs.csv

reference_genome:
  class: File
  path: /path/to/reference.fasta

haplotype_map:
  class: File
  path: /path/to/haplotype_map.txt

regex_split: "[_\\.]"

cpu: 20       # Optional: Increase CPU for faster processing
ram: 50000    # Optional: Increase RAM for large datasets

Run the workflow:

cwltool main.cwl inputs.yml
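
Since the reference FASTA needs its .fai index and .dict dictionary alongside it, a quick pre-flight check can catch a missing file before the run starts. This is a sketch only; missing_reference_files is a hypothetical helper, and it assumes the naming shown in the Inputs section (reference.fasta.fai and reference.dict).

```python
from pathlib import Path


def missing_reference_files(fasta: str) -> list[Path]:
    """Return the reference files that are expected but not present."""
    ref = Path(fasta)
    expected = [
        ref,
        Path(str(ref) + ".fai"),   # index: extension appended, reference.fasta.fai
        ref.with_suffix(".dict"),  # dictionary: extension replaced, reference.dict
    ]
    return [path for path in expected if not path.is_file()]
```

An empty list means all three files were found; anything else names the files to create with samtools faidx or picard CreateSequenceDictionary.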

Example 2: All-vs-All Comparison

Create an input YAML file inputs_allvall.yml (without pairs_csv):

bam_directory:
  class: Directory
  path: /path/to/bam/files

reference_genome:
  class: File
  path: /path/to/reference.fasta

haplotype_map:
  class: File
  path: /path/to/haplotype_map.txt

regex_split: "[_\\.]"

# cpu: 16      # Optional: Customize CPU allocation
# ram: 38000   # Optional: Customize RAM allocation

Run the workflow:

cwltool main.cwl inputs_allvall.yml
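
When many runs need to be prepared, the input object can also be generated programmatically. CWL runners accept the input object as JSON as well as YAML, so the sketch below builds a dictionary that can be written with json.dump. build_job is a hypothetical helper; the keys mirror the inputs documented above, and pairs_csv is left out to select all-vs-all mode.

```python
import json


def build_job(bam_dir, reference, haplotype_map, pairs_csv=None, regex_split=None):
    """Build a CWL input object; omit pairs_csv for all-vs-all mode."""
    job = {
        "bam_directory": {"class": "Directory", "path": bam_dir},
        "reference_genome": {"class": "File", "path": reference},
        "haplotype_map": {"class": "File", "path": haplotype_map},
    }
    if pairs_csv is not None:
        job["pairs_csv"] = {"class": "File", "path": pairs_csv}
    if regex_split is not None:
        job["regex_split"] = regex_split
    return job


def write_job(job: dict, path: str) -> None:
    """Write the input object as JSON for use with a CWL runner."""
    with open(path, "w") as handle:
        json.dump(job, handle, indent=2)
```

The resulting file is passed to the runner in place of the YAML file, for example cwltool main.cwl inputs.json.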

Sample Name Extraction

The regex_split parameter is optional and controls how sample names are extracted from BAM filenames:

Default behavior (regex_split not provided):

  • Uses the entire basename before .bam as the sample name
  • Example: Sample123.bam → sample name is Sample123

Custom regex (regex_split provided):

  • Splits the filename using the provided regex pattern
  • Uses the first part as the sample name
  • Common patterns:
    • "[_\\.]" - Split on underscore or dot
    • "_" - Split only on underscore
    • "\\." - Split only on dot
    • "-" - Split on hyphen

Examples:

  • Filename: Sample123_L001_R1.bam
    • No regex: sample name = Sample123_L001_R1
    • Regex "[_\\.]": sample name = Sample123
    • Regex "_": sample name = Sample123
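
The rules above can be reproduced in a few lines to preview which sample names a given pattern will produce. This is a Python sketch of the documented behaviour, not the workflow's own code; sample_name is a hypothetical helper, and regex dialect differences between Python and the workflow's implementation are possible for more complex patterns.

```python
import re
from typing import Optional


def sample_name(filename: str, regex_split: Optional[str] = None) -> str:
    """Derive a sample name from a BAM filename as described above."""
    if regex_split is None:
        # Default: everything before the .bam extension.
        return filename[:-4] if filename.endswith(".bam") else filename
    # Custom pattern: split on it and keep the first part.
    return re.split(regex_split, filename)[0]
```

Note that the YAML value "[_\\.]" corresponds to the pattern [_\.] once the YAML escape is resolved, so the Python equivalent is r"[_\.]". A pattern that never matches leaves the whole filename, extension included, as the first part under this sketch.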

Component Details

extract_fingerprint

Runs Picard ExtractFingerprint on each BAM file to generate a VCF file.

Resources:

  • CPU: 1 core (default)
  • RAM: 4000 MiB (default)
  • Time limit: 2.5 hours

crosscheck_fingerprints

Runs Picard CrosscheckFingerprints to compare fingerprints.

Resources:

  • CPU: 16 cores (default, customizable via workflow input)
  • RAM: 38000 MiB (default, customizable via workflow input)
  • Time limit: 2.5 hours

Modes:

  • Pairs mode: Compares specific pairs from CSV file (scattered execution)
  • All-vs-all mode: Compares all samples against each other (single execution)
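
The difference between the two modes comes down to how many CrosscheckFingerprints runs are scheduled. The sketch below illustrates that decision only; plan_runs is a hypothetical helper and does not reproduce the CWL scatter logic.

```python
from typing import Iterable, Optional


def plan_runs(
    samples: Iterable[str],
    pairs: Optional[list[tuple[str, str]]] = None,
) -> list[tuple[str, list[str]]]:
    """Return one (mode, samples) entry per crosscheck run."""
    if pairs is not None:
        # Pairs mode: one scattered run per CSV row.
        return [("pairs", [left, right]) for left, right in pairs]
    # All-vs-all mode: a single run over every fingerprint.
    return [("allvall", sorted(samples))]
```

With five samples and two CSV rows this yields two runs of two samples each; without a CSV it yields a single run covering all five.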

File Organization

The workflow expects all BAM files to sit under a single input directory. The generate_scatter_initial step recursively finds every .bam file, including those in subdirectories.

Example directory structure:

bam_directory/
├── sample1.bam
├── sample2.bam
├── sample3.bam
└── subdirectory/
    ├── sample4.bam
    └── sample5.bam
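
To preview which files a recursive scan of this layout would pick up, the same search can be run locally. This is a minimal sketch using pathlib; find_bams is a hypothetical helper and not the workflow's scanning script.

```python
from pathlib import Path


def find_bams(bam_directory: str) -> list[Path]:
    """Recursively list .bam files under bam_directory, sorted by path."""
    root = Path(bam_directory)
    return sorted(path for path in root.rglob("*.bam") if path.is_file())
```

For the structure above this returns all five files, including the two in subdirectory/. Files with the same basename in different subdirectories would map to the same sample name, so it is worth checking the list for duplicates.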

About

A bioinformatics framework designed to verify the genetic identity of multi-omics samples by detecting inter-individual contamination and sample swaps.
