CWL workflow for quantifying sample-relatedness and detecting incorrectly paired sequencing datasets from different donors (sample-swap). This workflow extracts fingerprints from BAM files and compares them either pairwise (using a CSV file) or all-vs-all.
This workflow has been refactored to run with standard CWL runners and is designed for local or standard HPC execution environments.
- Fingerprint Extraction: Uses Picard ExtractFingerprint to generate VCF files from BAM files
- Flexible Comparison: Supports both pairwise comparison (via CSV file) and all-vs-all comparison
- CWL runner (e.g., cwltool, toil-cwl-runner)
- Docker (all tools are containerized)
All Docker images are publicly available:
- broadinstitute/picard:3.4.0 - For Picard tools (ExtractFingerprint, CrosscheckFingerprints)
- python:3.11-slim - For Python-based utility scripts
main.cwl # Main workflow
├── generate_scatter_initial # Scans BAM directory for files
├── extract_fingerprint # Extracts fingerprint VCF from each BAM
├── files2directory # Collects VCF files into a directory
├── generate_scatter_pairs # Pairs files according to CSV (if provided)
└── crosscheck_fingerprints # Compares fingerprints
    ├── pairs mode (multiple runs) # One comparison per pair
    └── allvall mode (single run) # Single all-vs-all comparison
- bam_directory (Directory): Directory containing BAM files to compare
- reference_genome (File): Reference genome FASTA file. The following secondary files must be present in the same directory as the FASTA:
  - .fai index (e.g., reference.fasta.fai), generated with samtools faidx
  - .dict sequence dictionary (e.g., reference.dict), generated with picard CreateSequenceDictionary
- haplotype_map (File): Haplotype map file for Picard fingerprinting
- regex_split (string, optional): Regex pattern to extract sample names from BAM filenames
  - If not provided: Uses the basename (everything before the .bam extension)
  - If provided: Splits the filename using the pattern and takes the first part
- pairs_csv (File): CSV file with two columns (no header): left_sample,right_sample
  - Each entry can be a bare sample name (e.g., sample1) or a filename with extension (e.g., sample1.bam)
  - Must be under 64 KiB (CWL loadContents limitation)
  - If provided: performs pairwise comparisons only
  - If omitted: performs all-vs-all comparison
- cpu (int): Number of CPU cores to allocate for the crosscheck_fingerprints step
  - Default: 16 cores
- ram (int): Amount of RAM in MiB to allocate for the crosscheck_fingerprints step
  - Default: 38000 MiB (approximately 37 GiB)
- fingerprints: Extracted fingerprint VCF files for each input BAM
- crosscheck_metrics_pairs: Crosscheck metrics for each pair (only if pairs_csv provided)
- crosscheck_metrics_allvall: Single crosscheck metrics file for all-vs-all (only if pairs_csv not provided)
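Before launching a run, the input constraints listed above (secondary files next to the reference FASTA, a two-column header-less pairs_csv within the 64 KiB limit) can be checked up front. The following is a minimal sketch, not part of the workflow itself: the function names `expected_secondary_files` and `read_pairs` are hypothetical, and the tolerance for blank lines is an assumption of this example.

```python
import csv
from pathlib import Path

# 64 KiB limit taken from the pairs_csv description above.
MAX_PAIRS_CSV_BYTES = 64 * 1024


def expected_secondary_files(fasta: Path) -> list[Path]:
    """Return the index and dictionary paths expected next to the FASTA."""
    fai = fasta.with_name(fasta.name + ".fai")  # reference.fasta -> reference.fasta.fai
    seq_dict = fasta.with_suffix(".dict")       # reference.fasta -> reference.dict
    return [fai, seq_dict]


def read_pairs(pairs_csv: Path) -> list[tuple[str, str]]:
    """Read a two-column, header-less CSV and return (left, right) pairs."""
    size = pairs_csv.stat().st_size
    if size > MAX_PAIRS_CSV_BYTES:
        raise ValueError(f"{pairs_csv} is {size} bytes and exceeds the 64 KiB limit")
    pairs = []
    with pairs_csv.open(newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle), start=1):
            if not row:
                continue  # blank lines are skipped (an assumption of this sketch)
            if len(row) != 2:
                raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs
```

A caller could, for example, report every path from `expected_secondary_files(Path("/path/to/reference.fasta"))` that does not exist before invoking the CWL runner.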
Create a CSV file pairs.csv with pairs to compare (bare sample names or filenames with extensions):
sample1,sample2
sample3,sample4

Create an input YAML file inputs.yml:
bam_directory:
  class: Directory
  path: /path/to/bam/files
pairs_csv:
  class: File
  path: pairs.csv
reference_genome:
  class: File
  path: /path/to/reference.fasta
haplotype_map:
  class: File
  path: /path/to/haplotype_map.txt
regex_split: "[_\\.]"
cpu: 20 # Optional: Increase CPU for faster processing
ram: 50000 # Optional: Increase RAM for large datasets

Run the workflow:
cwltool main.cwl inputs.yml

Create an input YAML file inputs_allvall.yml (without pairs_csv):
bam_directory:
  class: Directory
  path: /path/to/bam/files
reference_genome:
  class: File
  path: /path/to/reference.fasta
haplotype_map:
  class: File
  path: /path/to/haplotype_map.txt
regex_split: "[_\\.]"
# cpu: 16 # Optional: Customize CPU allocation
# ram: 38000 # Optional: Customize RAM allocation

Run the workflow:
cwltool main.cwl inputs_allvall.yml

The regex_split parameter is optional and controls how sample names are extracted from BAM filenames:
Default behavior (regex_split not provided):
- Uses the entire basename before .bam as the sample name
- Example: Sample123.bam → sample name is Sample123
Custom regex (regex_split provided):
- Splits the filename using the provided regex pattern
- Uses the first part as the sample name
- Common patterns:
  - "[_\\.]" - Split on underscore or dot
  - "_" - Split only on underscore
  - "\\." - Split only on dot
  - "-" - Split on hyphen
Examples:
- Filename: Sample123_L001_R1.bam
  - No regex: sample name = Sample123_L001_R1
  - Regex "[_\\.]": sample name = Sample123
  - Regex "_": sample name = Sample123
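The naming rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not the workflow's actual utility script; the function name `sample_name` is hypothetical, and applying the split to the full filename (including the extension) is an assumption.

```python
import re
from typing import Optional


def sample_name(bam_filename: str, regex_split: Optional[str] = None) -> str:
    """Derive a sample name from a BAM filename."""
    if regex_split is None:
        # Default: everything before the .bam extension.
        if bam_filename.endswith(".bam"):
            return bam_filename[: -len(".bam")]
        return bam_filename
    # Custom pattern: split once and keep the first part.
    return re.split(regex_split, bam_filename, maxsplit=1)[0]


print(sample_name("Sample123_L001_R1.bam"))            # Sample123_L001_R1
print(sample_name("Sample123_L001_R1.bam", "[_\\.]"))  # Sample123
print(sample_name("Sample123_L001_R1.bam", "_"))       # Sample123
```

Under this sketch, a pattern that never matches (such as "-" for the filename above) leaves the whole filename, extension included, as the sample name, so it is worth checking that the chosen pattern actually occurs in the filenames.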
Runs Picard ExtractFingerprint on each BAM file to generate a VCF file.
Resources:
- CPU: 1 core (default)
- RAM: 4000 MiB (default)
- Time limit: 2.5 hours
Runs Picard CrosscheckFingerprints to compare fingerprints.
Resources:
- CPU: 16 cores (default, customizable via workflow input)
- RAM: 38000 MiB (default, customizable via workflow input)
- Time limit: 2.5 hours
Modes:
- Pairs mode: Compares specific pairs from CSV file (scattered execution)
- All-vs-all mode: Compares all samples against each other (single execution)
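The difference between the two modes can be pictured as a small planning function: pairs mode yields one run per CSV row, while all-vs-all mode yields a single run over every sample. This is a sketch only; `plan_crosscheck_runs` is a hypothetical name, and it assumes pair entries have already been reduced to sample names.

```python
from typing import Optional, Sequence


def plan_crosscheck_runs(
    samples: Sequence[str],
    pairs: Optional[Sequence[tuple[str, str]]] = None,
) -> list[tuple[str, ...]]:
    """Return one tuple of sample names per crosscheck run."""
    if pairs is None:
        # All-vs-all mode: a single run that receives every sample.
        return [tuple(samples)]
    known = set(samples)
    runs = []
    for left, right in pairs:
        missing = sorted({left, right} - known)
        if missing:
            raise ValueError(f"pair ({left}, {right}) references unknown samples: {missing}")
        # Pairs mode: one scattered run per requested pair.
        runs.append((left, right))
    return runs
```

With n samples, the single all-vs-all run covers n * (n - 1) / 2 distinct sample pairs, so pairs mode is the cheaper choice when only a few known pairings need to be verified.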
The workflow expects all BAM files to be located under a single input directory. The generate_scatter component recursively finds all .bam files, including those in subdirectories.
Example directory structure:
bam_directory/
├── sample1.bam
├── sample2.bam
├── sample3.bam
└── subdirectory/
    ├── sample4.bam
    └── sample5.bam
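To preview which files a recursive scan of such a directory would pick up, a short Python sketch using pathlib is enough. The function name `find_bams` is hypothetical, and this is not the code of the generate_scatter component itself.

```python
from pathlib import Path


def find_bams(bam_directory: Path) -> list[Path]:
    """Recursively collect all regular files ending in .bam, sorted by path."""
    return sorted(p for p in bam_directory.rglob("*.bam") if p.is_file())
```

For the layout shown above, `find_bams(Path("bam_directory"))` would return the five BAM files, including the two under subdirectory/. Index files such as sample1.bam.bai do not match the `*.bam` pattern and are ignored.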