Sample Matching Workflow v2.0.0

A CWL workflow for quantifying sample relatedness and detecting sample swaps (sequencing datasets that were incorrectly paired across different donors). The workflow extracts fingerprints from BAM files and compares them either pairwise (pairs listed in a CSV file) or all-vs-all.

Overview

This workflow has been refactored to run with standard CWL runners and is designed for local or standard HPC execution environments.

Features

Fingerprint Extraction: Uses Picard ExtractFingerprint to generate VCF files from BAM files
Flexible Comparison: Supports both pairwise comparison (via CSV file) and all-vs-all comparison

Requirements

CWL runner (e.g., cwltool, toil-cwl-runner)
Docker (all tools are containerized)

Docker Images Used

All Docker images are publicly available:

broadinstitute/picard:3.4.0 - For Picard tools (ExtractFingerprint, CrosscheckFingerprints)
python:3.11-slim - For Python-based utility scripts

Workflow Structure

main.cwl                              # Main workflow
├── generate_scatter_initial          # Scans BAM directory for files
├── extract_fingerprint               # Extracts fingerprint VCF from each BAM
├── files2directory                   # Collects VCF files into a directory
├── generate_scatter_pairs            # Pairs files according to CSV (if provided)
└── crosscheck_fingerprints           # Compares fingerprints
    ├── pairs mode (multiple runs)    # One comparison per pair
    └── allvall mode (single run)     # Single all-vs-all comparison

Inputs

Required Inputs

bam_directory (Directory): Directory containing BAM files to compare
reference_genome (File): Reference genome FASTA file. The following secondary files must be present in the same directory as the FASTA:
- .fai index (e.g., reference.fasta.fai) — generated with samtools faidx
- .dict sequence dictionary (e.g., reference.dict) — generated with picard CreateSequenceDictionary
haplotype_map (File): Haplotype map file for Picard fingerprinting

Optional Inputs

regex_split (string): Regex pattern used to extract sample names from BAM filenames
- If not provided: Uses the basename (everything before the .bam extension)
- If provided: Splits the filename using the pattern and takes the first part
pairs_csv (File): CSV file with two columns (no header): left_sample,right_sample
- Each entry can be a bare sample name (e.g., sample1) or a filename with extension (e.g., sample1.bam)
- Must be under 64 KiB (CWL loadContents limitation)
- If provided: performs pairwise comparisons only
- If omitted: performs all-vs-all comparison
cpu (int): Number of CPU cores to allocate for crosscheck_fingerprints step
- Default: 16 cores
ram (int): Amount of RAM in MiB to allocate for crosscheck_fingerprints step
- Default: 38000 MiB (about 37 GiB)

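The pairs_csv rules above (two columns, no header, bare names or filenames, under 64 KiB) can be checked before launching a run. The following is a minimal Python sketch, not the workflow's own parsing code. The names parse_pairs and normalize_entry are hypothetical, and the sketch assumes .bam is the only extension that needs stripping.

```python
import csv
import io

MAX_PAIRS_CSV_BYTES = 64 * 1024  # size limit stated for pairs_csv


def normalize_entry(entry: str) -> str:
    """Trim whitespace and drop a trailing .bam extension (assumed convention)."""
    name = entry.strip()
    return name[:-4] if name.lower().endswith(".bam") else name


def parse_pairs(raw: bytes) -> list[tuple[str, str]]:
    """Parse header-less, two-column CSV content into (left, right) pairs."""
    if len(raw) >= MAX_PAIRS_CSV_BYTES:
        raise ValueError("pairs_csv must be under 64 KiB")
    pairs = []
    reader = csv.reader(io.StringIO(raw.decode("utf-8")))
    for line_no, row in enumerate(reader, start=1):
        if not row:  # csv.reader yields [] for blank lines
            continue
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        pairs.append((normalize_entry(row[0]), normalize_entry(row[1])))
    return pairs
```

Calling parse_pairs(open("pairs.csv", "rb").read()) on the example file from the Usage Examples section would yield the two pairs listed there.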
Outputs

fingerprints: Extracted fingerprint VCF files for each input BAM
crosscheck_metrics_pairs: Crosscheck metrics for each pair (only if pairs_csv provided)
crosscheck_metrics_allvall: Single crosscheck metrics file for all-vs-all (only if pairs_csv not provided)
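The crosscheck metrics outputs are Picard metrics files: comment lines starting with #, followed by a tab-separated table. A small Python sketch for loading such a file and listing the rows Picard flagged as unexpected is given below. It assumes the column names LEFT_GROUP_VALUE, RIGHT_GROUP_VALUE and RESULT from Picard's CrosscheckMetric, which should be verified against the files your Picard version produces. The function names are hypothetical.

```python
import csv


def read_crosscheck_metrics(text: str) -> list[dict[str, str]]:
    """Parse Picard metrics text into one dict per table row."""
    # Drop blank lines and '#' comment/header lines; the first remaining
    # line is taken to be the column header of the metrics table.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def flag_unexpected(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep rows whose RESULT starts with UNEXPECTED (assumed column name)."""
    return [r for r in rows if r.get("RESULT", "").startswith("UNEXPECTED")]
```

Whether a result counts as expected depends on how Picard grouped the samples, so flagged rows are a starting point for review rather than a verdict.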

Usage Examples

Example 1: Pairwise Comparison

Create a CSV file pairs.csv with pairs to compare (bare sample names or filenames with extensions):

sample1,sample2
sample3,sample4

Create an input YAML file inputs.yml:

bam_directory:
  class: Directory
  path: /path/to/bam/files

pairs_csv:
  class: File
  path: pairs.csv

reference_genome:
  class: File
  path: /path/to/reference.fasta

haplotype_map:
  class: File
  path: /path/to/haplotype_map.txt

regex_split: "[_\\.]"

cpu: 20       # Optional: Increase CPU for faster processing
ram: 50000    # Optional: Increase RAM for large datasets

Run the workflow:

cwltool main.cwl inputs.yml

Example 2: All-vs-All Comparison

Create an input YAML file inputs_allvall.yml (without pairs_csv):

bam_directory:
  class: Directory
  path: /path/to/bam/files

reference_genome:
  class: File
  path: /path/to/reference.fasta

haplotype_map:
  class: File
  path: /path/to/haplotype_map.txt

regex_split: "[_\\.]"

# cpu: 16      # Optional: Customize CPU allocation
# ram: 38000   # Optional: Customize RAM allocation

Run the workflow:

cwltool main.cwl inputs_allvall.yml
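Both examples rely on the reference FASTA having its .fai index and .dict sequence dictionary in the same directory, as described under Required Inputs. A minimal pre-flight check could look like the Python sketch below. The function name is hypothetical, and the naming scheme (reference.fasta.fai and reference.dict) is taken from the examples given in this README.

```python
from pathlib import Path


def missing_reference_companions(fasta: str) -> list[Path]:
    """Return the expected companion files that do not exist next to the FASTA."""
    path = Path(fasta)
    expected = [
        path.with_name(path.name + ".fai"),  # e.g. reference.fasta.fai
        path.with_suffix(".dict"),           # e.g. reference.dict
    ]
    return [p for p in expected if not p.is_file()]
```

An empty list means both companions were found. Otherwise the returned paths show which file to generate with samtools faidx or picard CreateSequenceDictionary.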

Sample Name Extraction

The regex_split parameter is optional and controls how sample names are extracted from BAM filenames:

Default behavior (regex_split not provided):

Uses the entire basename before .bam as the sample name
Example: Sample123.bam → sample name is Sample123

Custom regex (regex_split provided):

Splits the filename using the provided regex pattern
Uses the first part as the sample name
Common patterns:
- "[_\\.]" - Split on underscore or dot
- "_" - Split only on underscore
- "\\." - Split only on dot
- "-" - Split on hyphen

Examples:

Filename: Sample123_L001_R1.bam
- No regex: sample name = Sample123_L001_R1
- Regex "[_\\.]": sample name = Sample123
- Regex "_": sample name = Sample123

Component Details

extract_fingerprint

Runs Picard ExtractFingerprint on each BAM file to generate a VCF file.

Resources:

CPU: 1 core (default)
RAM: 4000 MiB (default)
Time limit: 2.5 hours

crosscheck_fingerprints

Runs Picard CrosscheckFingerprints to compare fingerprints.

Resources:

CPU: 16 cores (default, customizable via workflow input)
RAM: 38000 MiB (default, customizable via workflow input)
Time limit: 2.5 hours

Modes:

Pairs mode: Compares specific pairs from CSV file (scattered execution)
All-vs-all mode: Compares all samples against each other (single execution)
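The difference between the two modes is mainly how many Picard runs are scheduled and what each run covers. The Python sketch below models that under the assumption that pairs mode launches one run per CSV row and all-vs-all mode launches a single run over every sample. The function names are hypothetical, and the code only illustrates the scheduling, not the CWL implementation.

```python
from itertools import combinations


def plan_crosscheck(samples, pairs=None):
    """Return one list of sample names per planned CrosscheckFingerprints run."""
    if pairs is not None:
        # Pairs mode: scattered execution, one run per requested pair.
        return [[left, right] for left, right in pairs]
    # All-vs-all mode: a single run that sees every sample.
    return [sorted(samples)]


def pairs_covered(run):
    """List the unique sample pairs compared within one run."""
    return list(combinations(run, 2))
```

For n samples, the single all-vs-all run covers n(n-1)/2 unique pairs, which is why its resource needs grow faster than those of a short pairs list.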

File Organization

The workflow expects all BAM files to be located under the single directory passed as bam_directory. The generate_scatter component finds .bam files recursively, so files in subdirectories are included.

Example directory structure:

bam_directory/
├── sample1.bam
├── sample2.bam
├── sample3.bam
└── subdirectory/
    ├── sample4.bam
    └── sample5.bam
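To preview which files a recursive scan would pick up from a layout like the one above, a short Python sketch using pathlib is shown here. It approximates the described behavior and is not the generate_scatter code itself. The function name find_bams is hypothetical.

```python
from pathlib import Path


def find_bams(bam_directory: str) -> list[Path]:
    """Recursively collect .bam files under bam_directory, in sorted order."""
    return sorted(p for p in Path(bam_directory).rglob("*.bam") if p.is_file())
```

Running find_bams("bam_directory") on the example tree would return all five BAM files, including the two in subdirectory/. Files with the same name in different subdirectories would map to the same default sample name, so it may be worth checking for duplicates.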
