This module is designed to function as both a standalone MAG short-read quality control pipeline as well as a component of the larger CAMP metagenome analysis pipeline. As such, it is both self-contained (ex. instructions included for the setup of a versioned environment, etc.), and seamlessly compatible with other CAMP modules (ex. ingests and spawns standardized input/output config files, etc.).
There are two filtration steps in the module- i) for general poor quality (Phred scores, length, Ns, adapters, polyG/X) and ii) for host reads- followed by a sequencing error correction step. The properties of the QC-ed FastQs are summarized in aggregate by a MultiQC report.
If you don't already have conda handy, we recommend installing miniforge, which is a minimal conda installer that, by default, installs packages from open-source community-driven channels such as conda-forge.
# If you don't already have conda on your system...
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.shRun the following command to initialize Conda for your shell. This will configure your shell to recognize conda activate.
conda initRestart your terminal or run:
source ~/.bashrc # For bash users
source ~/.zshrc # For zsh users- Clone repo from Github.
git clone https://github.com/Meta-CAMP/camp_short-read-quality-control- Automated setup by downloading and indexing host reference genome(s). In the
setup.shscript, indexing is done using 20 threads. The user can modify this parameter based on the capacity of their machine.
This step also automatically updates configs/parameters.yaml and test_data/parameters.yaml, which are needed in running the module and the test respectively.
cd camp_short-read-quality-control/
source setup.shAlternatively, one can choose to manually download and index reference genomes and update the host_reference_database section in the respective parameters.yaml.
For instance, the human reference genome (GRCh38) can be downloaded and indexed as follows:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/GCA_000001405.15_GRCh38_genomic.fna.gz
bowtie2-build --threads 20 GCA_000001405.15_GRCh38_genomic.fna.gz GCA_000001405.15_GRCh38_genomicAs of now, setup.sh provides two download options: human and mouse.
- Make sure the installed pipeline works correctly. With 10 threads and a maximum of 40 GB allocated for a command (
tadpole), the test dataset should finish in approximately 3 minutes.
# Run tests on the included sample dataset
conda activate camp
python camp_short-read-quality-control/workflow/short-read-quality-control.py testInput: /path/to/samples.csv provided by the user.
Output: 1) An output config file summarizing the locations of the error-corrected FastQs, 2) summary statistics about the dataset after each error correction step indicating how many reads and/or bases were pruned, and 3) the MultiQC report summarizing the properties for the QC-ed FastQ. For more information, see the demo test output directory in test_data/test_out.
-
/path/to/work/dir/short_read_qc/final_reports/samples.csvfor ingestion by the next module -
/path/to/work/dir/short_read_qc/final_reports/read_stats.csv -
Optional:
/path/to/work/dir/short_read_qc/final_reports/*_multiqc_report.html, where*is 'pre' or 'post'-module
└── workflow
├── Snakefile
├── short-read-quality-control.py
├── utils.py
├── __init__.py
└── ext/
└── scripts/
workflow/short-read-quality-control.py: Click-based CLI that wraps thesnakemakeand other commands for clean management of parameters, resources, and environment variables.workflow/Snakefile: Thesnakemakepipeline.workflow/utils.py: Sample ingestion and work directory setup functions, and other utility functions used in the pipeline and the CLI.ext/: External programs, scripts, and small auxiliary files that are not conda-compatible but used in the workflow.
-
Make your own
samples.csvbased on the template inconfigs/samples.csv.ingest_samplesinworkflow/utils.pyexpects Illumina reads in FastQ (may be gzipped) formsamples.csvrequires either absolute paths or paths relative to the directory that the module is being run in
-
Update the relevant parameters in
configs/parameters.yaml. -
Update the computational resources available to the pipeline in
configs/resources.yaml.
To run CAMP on the command line, use the following, where /path/to/work/dir is replaced with the absolute path of your chosen working directory, and /path/to/samples.csv is replaced with your copy of samples.csv.
- The default number of cores available to Snakemake is 1 which is enough for test data, but should probably be adjusted to 10+ for a real dataset.
- Relative or absolute paths to the Snakefile and/or the working directory (if you're running elsewhere) are accepted!
- The parameters and resource config YAMLs can also be customized.
conda activate camp
python /path/to/camp_short-read-quality-control/workflow/short-read-quality-control.py \
(-c number_of_cores_allocated) \
(-p /path/to/parameters.yaml) \
(-r /path/to/resources.yaml) \
-d /path/to/work/dir \
-s /path/to/samples.csvTo run CAMP on a job submission cluster (for now, only Slurm is supported), use the following.
- --slurm is an optional flag that submits all rules in the Snakemake pipeline as sbatch jobs.
- In Slurm mode, the -c flag refers to the maximum number of sbatch jobs submitted in parallel, not the pool of cores available to run the jobs. Each job will request the number of cores specified by threads in configs/resources/slurm.yaml.
conda activate camp
sbatch -J jobname -o jobname.log << "EOF"
#!/bin/bash
python /path/to/camp_short-read-quality-control/workflow/short-read-quality-control.py --slurm \
(-c max_number_of_parallel_jobs_submitted) \
(-p /path/to/parameters.yaml) \
(-r /path/to/resources.yaml) \
-d /path/to/work/dir \
-s /path/to/samples.csv
EOF-
To quality-check the processed FastQs, download and compare the collated MultiQC reports, which can be found at
/path/to/work/dir/short_read_qc/final_reports/*_multiqc_report/html. Multiple rounds of preprocessing may be needed to fully get rid of low-quality bases, adapters, and duplicated sequences.- For example, the dataset I worked with required an additional round of
fastpto trim 10 low-quality bases from the 5' and 4 low-quality bases from the 3' end respectively. - I recommend creating a new directory, which I've called
/path/to/work/dir/short_read_qc/5_retrimmingand placing reprocessed reads inside them. - Afterwards, I reran FastQC and MultiQC and collated summary statistics (ex. numbers of reads, etc.) from the reprocessed datasets manually. I also updated the location of the reprocessed reads in
/path/to/work/dir/short_read_qc/final_reports/samples.csvto/path/to/work/dir/short_read_qc/5_retrimming.
- For example, the dataset I worked with required an additional round of
-
Each module includes a built-in Jupyter notebook designed to summarize data efficiently and visually. The following visualization generated by Dataviz.ipynb using the included test dataset. The test dataset is a toy human gut microbiome i.e.: simulated Illumina short reads covering 4 bacterial reference genomes from the Unified Human Gut Genome at 10X coverage per genome.
To quickly visualize where the greatest sequence losses are happening, plot grouped bar graph(s) of the number of reads and bases remaining after each quality control step in each sample, and follow the instructions in the Jupyter notebook:
jupyter notebook &
- After checking over
final_reports/and making sure you have everything you need, you can delete all intermediate files to save space.
python3 /path/to/camp_short-read-quality-control/workflow/short-read-quality-control.py cleanup \
-d /path/to/work/dir \
-s /path/to/samples.csv- If for some reason the module keeps failing, CAMP can print a script containing all of the remaining commands that can be run manually.
python3 /path/to/camp_short-read-quality-control/workflow/short-read-quality-control.py --dry_run \
-d /path/to/work/dir \
-s /path/to/samples.csv- This package was created with Cookiecutter as a simplified version of the project template.
- Free software: MIT
- Documentation: https://camp-documentation.readthedocs.io/en/latest/short-read-quality-control.html