This project provides scripts to demonstrate running Nextflow pipelines (specifically nf-core pipelines using Docker) in an environment without internet access, using a pre-populated cache stored in an S3 bucket.
To enable running Nextflow pipelines on an "offline" machine (e.g., an EC2 instance in a private subnet with no internet gateway) by:
- Using an "online" machine to download the pipeline assets and generate a list of required Docker images.
- Using the "online" machine again with the generated list to pull the Docker images and save them to a shared S3 location.
- Using the offline machine to load the assets and images from S3 and run the pipeline with the `-offline` flag.
- AWS Account & S3 Bucket: You need an AWS account and an S3 bucket accessible by both the online and offline machines.
- S3 Mount: The S3 bucket must be mounted on both the online and offline machines at the same path: `/mnt/s3` (this path is configurable in the scripts).
  - Tools like `s3fs-fuse` or the AWS Mountpoint for S3 can be used for this (see the example mount commands after this list).
- Online Machine: An internet-connected machine (e.g., an EC2 instance) with:
  - `bash`
  - `Nextflow` installed.
  - `nf-core` tools installed (`pip install nf-core`).
  - `Docker` installed and running.
  - `jq` installed (for parsing the JSON image list, e.g., `sudo apt-get install jq` or `sudo yum install jq`).
  - (Optional: the `AWS CLI` if using S3 sync within the scripts, though the current scripts assume a direct write to the mount point for images.)
- Offline Machine: A machine without internet access, but with access to the mounted S3 bucket (`/mnt/s3`), and with:
  - `bash`
  - `Nextflow` installed (can be transferred via S3 if necessary).
  - `Docker` installed and running (can be transferred via S3 if necessary).
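For reference, a mount on either machine might look like one of the following. This is a hedged sketch: the bucket name `my-nextflow-cache-bucket` is a placeholder, and the exact options depend on your chosen tool and credentials/IAM setup.

```bash
# Example only: mount an S3 bucket at /mnt/s3 (bucket name is a placeholder)

# Option A: Mountpoint for Amazon S3
mount-s3 my-nextflow-cache-bucket /mnt/s3

# Option B: s3fs-fuse, using an attached IAM role for credentials
s3fs my-nextflow-cache-bucket /mnt/s3 -o iam_role=auto
```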
This script (`scripts/setup_online_cache.sh`) prepares the pipeline assets and generates a list of required Docker images.
Usage:
```bash
# Ensure your S3 bucket is mounted at /mnt/s3
# Navigate to the project directory
cd /path/to/nextflow-offline

# Run the script
./scripts/setup_online_cache.sh
```

What it does:
- Configuration: Reads the pipeline (`nf-core/scrnaseq`) and the S3 mount point (`/mnt/s3`) for assets from variables.
- Creates Directories: Ensures the asset cache directory (`/mnt/s3/nextflow-offline-cache/assets/`) and the local list directory (`./pipeline_lists/`) exist.
- Downloads Pipeline Assets: Uses `nf-core download` to fetch the pipeline code, configuration, and test data into `/mnt/s3/nextflow-offline-cache/assets/`.
- Generates Image List: Uses `nextflow inspect` for the specified pipeline and profile (`docker`) to generate a JSON file (`./pipeline_lists/<pipeline_name>.list.json`) containing the URIs of all required Docker containers.
- Outputs Next Step: Prints the command needed to run the image-fetching script using the generated list.
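For orientation, the download and list-generation steps are roughly equivalent to the sketch below. This is not the script itself: flags such as `--compress none` depend on your nf-core tools version, `nextflow inspect` is only available in recent Nextflow releases, and the real script may pass additional options (e.g., a pipeline revision).

```bash
# Sketch only; the actual script adds error handling and prints the next step.
PIPELINE="nf-core/scrnaseq"
ASSET_DIR="/mnt/s3/nextflow-offline-cache/assets"
LIST_DIR="./pipeline_lists"
mkdir -p "$ASSET_DIR" "$LIST_DIR"

# Download pipeline code, configuration, and test data into the shared asset cache
nf-core download "$PIPELINE" --outdir "$ASSET_DIR/scrnaseq" --compress none

# Resolve all container URIs for the docker profile into a JSON list
nextflow inspect "$PIPELINE" -profile docker -format json > "$LIST_DIR/scrnaseq.list.json"
```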
This script (`scripts/fetch_and_save_images.sh`) reads the generated JSON list, pulls the Docker images, and saves them to the designated S3 image cache directory.
Usage (run after `setup_online_cache.sh`):
```bash
# Ensure your S3 bucket is mounted at /mnt/s3
# Navigate to the project directory
cd /path/to/nextflow-offline

# Make the script executable if you haven't already
# chmod +x ./scripts/fetch_and_save_images.sh

# Run the script, providing the list file and the target image directory
# (Use the exact command printed by the previous script)
./scripts/fetch_and_save_images.sh "./pipeline_lists/scrnaseq.list.json" "/mnt/s3/pipe/images"
```

What it does:
- Parses List: Reads the specified JSON file (e.g., `./pipeline_lists/scrnaseq.list.json`) using `jq` to extract unique container image URIs.
- Ensures Directory: Creates the target image directory (`/mnt/s3/pipe/images`) if it doesn't exist.
- Pulls & Saves Images: For each unique image URI:
  - Pulls the image using `docker pull`.
  - Sanitizes the image URI into a valid filename (replacing `/` and `:` with `_`).
  - Saves the pulled image as a `.tgz` file (e.g., `quay.io_biocontainers_fastqc_0.12.1--hdfd78af_0.tgz`) directly into the target directory (`/mnt/s3/pipe/images`).
After this script completes successfully, the `/mnt/s3/pipe/images` directory should contain the required Docker images saved as `.tgz` files.
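The pull-and-save loop described above can be approximated by the following sketch. Note that the `jq` path `.processes[].container` is an assumption about the shape of the `nextflow inspect` JSON output; adjust it to match the actual list file.

```bash
# Sketch only; the actual script takes these two values as arguments.
LIST_FILE="./pipeline_lists/scrnaseq.list.json"
IMAGE_DIR="/mnt/s3/pipe/images"
mkdir -p "$IMAGE_DIR"

# Extract unique container URIs, then pull each image and save it as a .tgz
jq -r '.processes[].container' "$LIST_FILE" | sort -u | while read -r image; do
  docker pull "$image"
  safe_name=$(echo "$image" | tr '/:' '__')   # e.g. quay.io_biocontainers_fastqc_0.12.1--hdfd78af_0
  docker save "$image" | gzip > "$IMAGE_DIR/${safe_name}.tgz"
done
```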
This script (`scripts/run_nextflow_offline.sh`) runs the Nextflow pipeline using the assets and images prepared by the online instance scripts.
Usage:
```bash
# Ensure your S3 bucket is mounted at /mnt/s3
# Ensure Nextflow and Docker are installed
# Navigate to the project directory (can be copied via S3)
cd /path/to/nextflow-offline

# Run the script
./scripts/run_nextflow_offline.sh
```

What it does:
- Configuration: Reads the S3 mount point, pipeline name, asset cache path (`/mnt/s3/nextflow-offline-cache/assets/`), and image cache path (`/mnt/s3/pipe/images`) from variables.
- Locates Assets: Finds the downloaded pipeline workflow (`main.nf`) and a test samplesheet within the asset cache directory.
- Loads Images: Iterates through all `.tgz` files in the image cache directory (`/mnt/s3/pipe/images`) and loads them into the local Docker daemon using `docker load`.
- Runs Nextflow: Executes the `nextflow run` command, which:
  - Targets the `main.nf` script found in the assets.
  - Uses `-profile docker`.
  - Uses the automatically located test `--input` sheet.
  - Specifies a local `--outdir` and `-work-dir`.
  - Includes `-c config/cache_override.config`.
  - Critically, uses the `-offline` flag.
  - Uses `-resume`.
- Checks Result: Exits with 0 if Nextflow completes successfully, otherwise exits with Nextflow's error code.
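Conceptually, the load-and-run steps amount to the sketch below. The `find` commands for the workflow entry point and samplesheet are illustrative; the actual script knows the layout produced by `nf-core download`.

```bash
# Sketch only; paths mirror the defaults described above.
S3_MOUNT_POINT="/mnt/s3"
ASSET_DIR="${S3_MOUNT_POINT}/nextflow-offline-cache/assets"
IMAGE_DIR="${S3_MOUNT_POINT}/pipe/images"

# Load every saved image into the local Docker daemon
for tgz in "$IMAGE_DIR"/*.tgz; do
  docker load -i "$tgz"
done

# Locate the downloaded workflow and a test samplesheet (illustrative)
MAIN_NF=$(find "$ASSET_DIR" -name main.nf | head -n 1)
SAMPLESHEET=$(find "$ASSET_DIR" -name 'samplesheet*.csv' | head -n 1)

# Run the pipeline fully offline
nextflow run "$MAIN_NF" \
  -profile docker \
  --input "$SAMPLESHEET" \
  --outdir ./results \
  -work-dir ./work \
  -c config/cache_override.config \
  -offline \
  -resume
```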
- `scripts/setup_online_cache.sh`: Contains variables for `PIPELINE`, `PROFILE`, the asset `S3_MOUNT_POINT`, etc. Generates the image list file.
- `scripts/fetch_and_save_images.sh`: Takes the image list file and output directory as arguments.
- `scripts/run_nextflow_offline.sh`: Contains variables for the asset and image cache paths on `S3_MOUNT_POINT`, `PIPELINE_NAME`, and the output/work directories.
- `config/cache_override.config`: A Nextflow configuration file used via `-c`. Currently minimal, but can be used to override specific settings for the offline environment if needed.
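If you need to adapt the defaults, the variables described above are edited at the top of the scripts, along the lines of this hypothetical snippet (only the names mentioned in this README are shown; check each script for the exact spelling and full set):

```bash
# Hypothetical variable block; adjust to match the actual scripts.
PIPELINE="nf-core/scrnaseq"          # setup_online_cache.sh
PROFILE="docker"                     # setup_online_cache.sh
S3_MOUNT_POINT="/mnt/s3"             # all scripts
PIPELINE_NAME="scrnaseq"             # run_nextflow_offline.sh
```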
- Error Handling: Add more robust error checking and dependency validation.
- Configuration: Make paths and pipeline names command-line arguments.
- Plugins: Handle offline Nextflow plugins.
- ECR: Explore using AWS ECR instead of saving/loading `.tgz` files.
- Singularity: Adapt the process for Singularity containers.