kumarlab-compomics/PhasedHapAssembly-nonB

PhasedHapAssembly-nonB

Overview of haplotype-resolved non-B DNA annotation

This repository contains analysis scripts used to characterize the landscape of non-B DNA motifs across 65 high-quality, phased human haplotype assemblies generated by the Human Genomic Structural Variation Consortium (HGSVC).

Using telomere-to-telomere (T2T) genome assemblies, this project systematically annotates six major classes of non-B DNA motifs: inverted repeats (IR), mirror repeats (MR), direct repeats (DR), A-phased repeats (APR), G-quadruplexes (G4), and Z-DNA. It then examines their distribution, predicted stability, and enrichment across diverse genomic contexts.

Analyses focus on regions of structural and functional complexity, including centromeres, segmental duplications, structural variant breakpoints, mobile element insertions, and candidate cis-regulatory elements. Together, these scripts support the results presented in the associated manuscript.

Scope and Use

This repository is not intended to function as a standalone software tool or automated pipeline.

Instead, it contains the scripts and analysis logic used during an exploratory, multi-stage research project. Scripts were executed independently or in small groups depending on the specific analysis (e.g., motif annotation, stability modeling, enrichment testing, visualization).

Tools Used

This project integrates several established tools for non-B DNA motif annotation, structural stability prediction, and centromere profiling.

  • Quadron
    Predicts G-quadruplex (G4)–forming sequences and assigns stability scores (Q-scores) based on polymerase stalling signatures derived from experimental G4-seq data.
    GitHub: https://github.com/aleksahak/Quadron

  • Non-B gfa (Non-B Motif Search Tool)
    Annotates canonical non-B DNA motif classes, including inverted repeats (IR), mirror repeats (MR), direct repeats (DR), A-phased repeats (APR), and Z-DNA, using sequence-based pattern definitions.
    GitHub: https://github.com/abcsFrederick/non-B_gfa

  • seqfold
    Computes minimum free energy (MFE) estimates for inverted repeat sequences, providing a thermodynamic proxy for cruciform-forming stability.
    GitHub: https://github.com/Lattice-Automation/seqfold

  • Genomic Centromere Profiling (GCP)
    Used for centromere-specific analyses, including CENP-B box annotation, alpha-satellite monomer classification, and higher-order repeat (HOR) organization.
    GitHub: https://github.com/GiuntaLab/GCP-Centeny

Haplotype Assembly Data Used for Annotations

Annotation Tool | Alignment Reference | Batch                  | Number of Haplotypes | Data Source
nBMST & Quadron | CHM13v2.0 & GRCh38  | 20230818_verkko_batch1 | 76                   | Verkko Batch 1
nBMST & Quadron | CHM13v2.0 & GRCh38  | 20230927_verkko_batch2 | 30                   | Verkko Batch 2
nBMST & Quadron | CHM13v2.0 & GRCh38  | 20240201_verkko_batch3 | 24                   | Verkko Batch 3
Total           |                     |                        | 130                  |

1. preprocessing_aligned_assemblies

Scripts used to preprocess the Verkko phased haplotype assemblies (batches 1, 2, and 3) aligned to both T2T-CHM13v2.0 and GRCh38 at the whole-genome level. Run this step first: it produces the FASTA files that the non-B annotation scripts below take as input.

  1. phased_haplotype_alignments_20230818_verkko_batch1.sh - Script that takes phased haplotype assemblies aligned to a reference (T2T-CHM13v2.0 or GRCh38) as BAM files, keeps only primary alignments, splits each BAM file by chromosome, and writes each per-chromosome BAM to a FASTA file
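
The primary-alignment filter described above can be illustrated with SAM flag bits. The sketch below is not the repository's script, which operates on BAM files; it only shows the flag logic, assuming that "primary" means neither the secondary (0x100) nor the supplementary (0x800) bit is set and that unmapped records (0x4) are dropped as well. The function names and the toy records are hypothetical.

```python
from collections import defaultdict

# SAM flag bits as defined by the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def is_primary(flag: int) -> bool:
    """Return True for a mapped, primary alignment (assumed definition)."""
    return not (flag & (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY))


def split_primary_by_chrom(records):
    """Group primary alignments by reference name.

    `records` is an iterable of (query_name, flag, reference_name) tuples,
    a hypothetical stand-in for parsed BAM records.
    """
    by_chrom = defaultdict(list)
    for name, flag, chrom in records:
        if is_primary(flag):
            by_chrom[chrom].append(name)
    return dict(by_chrom)


toy_records = [
    ("contig_a", 0, "chr1"),     # primary, forward strand
    ("contig_a", 2048, "chr2"),  # supplementary: dropped
    ("contig_b", 16, "chr1"),    # primary, reverse strand
    ("contig_c", 256, "chr3"),   # secondary: dropped
    ("contig_d", 4, "*"),        # unmapped: dropped
]
primary_by_chrom = split_primary_by_chrom(toy_records)
print(primary_by_chrom)
```

With a real BAM file the same flag test would typically be applied by an alignment toolkit before the per-chromosome split; the grouping step here mirrors that split in memory.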

2. annotation_scripts_wholegenome

Scripts for annotating and formatting haplotype-level fasta files with the non-B gfa and Quadron tools. Most scripts were built to run on high-performance computing clusters and use array jobs to process individual haplotypes in parallel.

  1. find_nonb_motifs_haplotype_array.sh - Script to run non-B gfa tool (APR, MR, DR, IR, Z) on fastas produced from aligned assemblies (genome-wide)
  2. nonoverlapping_motifs.py - Script to collapse and merge overlapping intervals for non-B gfa annotations
  3. run_quadron.sh - Script to run Quadron tool (G4s) on fastas produced from aligned assemblies
  4. process_quadron_files.py - Script to convert Quadron-produced .txt files into .csv files for easier manipulation
  5. quadron_to_bed.py - Script to merge and collapse overlapping G4 annotations and convert Quadron .csv files into .bed files
  6. process_gquad_beds_quadron.py - Script to produce genome-wide collapsed Quadron G4 metrics across all haplotypes
  7. process_bed_files.py - Script to produce genome-wide collapsed non-B gfa annotation metrics across all haplotypes
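
Collapsing overlapping annotations, as nonoverlapping_motifs.py and quadron_to_bed.py are described as doing, is a standard interval-merge operation. The sketch below shows one common way to do it and is not taken from the repository. It assumes BED-style 0-based half-open coordinates and also merges intervals that merely touch; the function name and toy intervals are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), BED-style half-open


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or book-ended intervals on the same chromosome."""
    merged: List[Interval] = []
    # Sorting by (chrom, start, end) puts candidates for merging side by side.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


toy = [
    ("chr1", 140, 200),
    ("chr1", 100, 150),
    ("chr1", 200, 210),
    ("chr1", 300, 320),
    ("chr2", 100, 120),
]
merged = merge_intervals(toy)
covered_bp = sum(end - start for _, start, end in merged)
print(merged, covered_bp)
```

Merging before counting covered bases avoids counting a position twice when two motifs of the same class overlap.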

3. IR_free_energy_processing_seqfold

Scripts for obtaining free energy predictions for IRs using the seqfold tool.

  1. MEIs_and_SVs/process_free_eneries_chunks_array.py - Script to calculate predicted free energy with seqfold for inverted repeats from MEI and SV annotations (uses non-B gfa annotation .tsv files)
  2. whole_genome/process_free_energies_chunks_array_wholegenome.py - Script to calculate predicted free energy with seqfold for inverted repeats from whole-genome annotations (uses non-B gfa annotation .tsv files)
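
The script names suggest that inverted repeats are split into chunks and scored by separate array tasks. The sketch below illustrates that pattern under stated assumptions and is not the repository's code: the strided chunking scheme, the function names, and the example sequence are hypothetical. The only seqfold call used is `dg(seq, temp=...)`, which returns a minimum free energy estimate; the import is guarded so the sketch still runs where seqfold is not installed.

```python
from typing import Callable, List, Sequence, Tuple


def chunk_for_task(items: Sequence[str], task_id: int, n_tasks: int) -> List[str]:
    """Return the strided subset of `items` handled by one array task."""
    if not 0 <= task_id < n_tasks:
        raise ValueError("task_id must be in [0, n_tasks)")
    return list(items[task_id::n_tasks])


def score_sequences(
    seqs: Sequence[str], energy_fn: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Pair each sequence with the value returned by `energy_fn`."""
    return [(seq, energy_fn(seq)) for seq in seqs]


# Hypothetical inverted repeat: two complementary arms around a short spacer.
example_irs = ["GCGCGCAATTTTGCGCGC"]

try:
    from seqfold import dg  # third-party package, optional for this sketch
except ImportError:
    dg = None

if dg is not None:
    my_chunk = chunk_for_task(example_irs, task_id=0, n_tasks=1)
    for seq, energy in score_sequences(my_chunk, lambda s: dg(s, temp=37.0)):
        print(seq, energy)
```

On a cluster, `task_id` would usually come from the scheduler's array index (an environment variable whose name depends on the scheduler).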

4. centromere_processing_scripts

Scripts for annotating non-B DNA structures in completely assembled centromeres using the non-B gfa and Quadron tools. Most scripts were built to run on high-performance computing clusters and use array jobs to process individual haplotypes in parallel.

  1. create_centromere_bed.py - Script to produce bed files for haplotype centromeres
  2. extract_haplotype_centromeres.sh - Script to extract completely and accurately assembled centromeres from full haplotype fasta files (not the fastas produced from alignments)
  3. find_nonb_motifs_haplotype_array.sh - Script to run non-B gfa tool (APR, MR, DR, IR, Z) on fastas produced from aligned assemblies (in centromeres)
  4. nonoverlapping_motifs.py - Script to collapse and merge overlapping intervals for non-B gfa annotations
  5. process_bed_files_chr.py - Script to produce centromere-level non-B gfa annotation metrics for each haplotype (results written to a single .csv file)
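
The README does not say which centromere-level metrics are computed. As an illustration only, the sketch below derives three plausible ones (motif count, covered base pairs, and motifs per Mb) per region and motif class from already-merged intervals. The function name, the row layout, and the choice of metrics are all assumptions.

```python
from collections import defaultdict


def summarize_motifs(rows, region_lengths):
    """Summarize motifs per (region, motif_class).

    `rows` holds (region, motif_class, start, end) tuples with non-overlapping,
    half-open intervals; `region_lengths` maps region name to its length in bp.
    Both layouts are hypothetical.
    """
    stats = defaultdict(lambda: {"count": 0, "covered_bp": 0})
    for region, motif_class, start, end in rows:
        entry = stats[(region, motif_class)]
        entry["count"] += 1
        entry["covered_bp"] += end - start

    summary = {}
    for (region, motif_class), entry in stats.items():
        length = region_lengths[region]
        summary[(region, motif_class)] = {
            "count": entry["count"],
            "covered_bp": entry["covered_bp"],
            "coverage_fraction": entry["covered_bp"] / length,
            "motifs_per_mb": entry["count"] * 1e6 / length,
        }
    return summary


toy_rows = [
    ("hap1_chr1_cen", "IR", 1000, 1040),
    ("hap1_chr1_cen", "IR", 5000, 5060),
    ("hap1_chr1_cen", "MR", 9000, 9025),
]
toy_lengths = {"hap1_chr1_cen": 2_000_000}
summary = summarize_motifs(toy_rows, toy_lengths)
for key, metrics in sorted(summary.items()):
    print(key, metrics)
```

Normalizing by region length makes centromeres of different sizes comparable across haplotypes.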

5. flank_extraction_processing_scripts

Scripts for extracting 2,000 bp flanks surrounding structural variants (SVs) and mobile element insertions (MEIs) from individual haplotypes.

  1. extract_flanks_newest_arrayjob.sh - Script to extract haplotype-specific 2,000 bp flanking regions around SVs and MEIs into a .csv file
  2. flanks_to_fasta.py - Script to convert the .csv file containing flank information into a .fasta file for non-B gfa and Quadron annotation
  3. add_metadata.py - Script to append SV and MEI metadata to flank information
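
The flank coordinates and the FASTA conversion can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes 0-based half-open coordinates, clips flanks at the ends of the contig, and uses a hypothetical header format and function names.

```python
from typing import List, Tuple


def flank_intervals(
    start: int, end: int, chrom_length: int, flank: int = 2000
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return (upstream, downstream) half-open intervals around [start, end)."""
    upstream = (max(0, start - flank), start)
    downstream = (end, min(chrom_length, end + flank))
    return upstream, downstream


def to_fasta_record(name: str, sequence: str, width: int = 60) -> str:
    """Format one FASTA record with lines wrapped at `width` characters."""
    lines: List[str] = [f">{name}"]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Toy contig and a hypothetical variant spanning positions 2500..2600.
contig = "ACGT" * 1500  # 6,000 bp
up, down = flank_intervals(2500, 2600, len(contig))
record = to_fasta_record(
    f"toy_sv|upstream|{up[0]}-{up[1]}", contig[up[0]:up[1]]
)
print(up, down)
print(record.splitlines()[0])
```

For an insertion represented as a single breakpoint, `start` and `end` would be equal and the two flanks would sit directly on either side of it.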

6. analysis_scripts

  1. all_analysis_notebooks - Contains Jupyter Notebooks (.ipynb) related to all analyses performed over the course of this project
  2. CDKN1A_cruciform_analysis - Code and associated files used to produce data pertaining to the CDKN1A inverted repeat stability analysis
  3. flanking_sv_mei_analysis - Code to create density plots for non-B DNA motifs located in the flanking regions around structural variant breakpoints.
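
Density plots around breakpoints are usually built from motif positions expressed relative to the breakpoint. The sketch below shows one way to bin such offsets across a 2,000 bp window on either side before plotting. The bin size, the function name, and the use of signed offsets (motif position minus breakpoint position) are assumptions rather than details taken from the notebooks.

```python
from typing import Iterable, List


def bin_relative_positions(
    offsets: Iterable[int], window: int = 2000, bin_size: int = 100
) -> List[int]:
    """Count offsets falling in fixed-width bins over [-window, window)."""
    if window % bin_size != 0:
        raise ValueError("window must be a multiple of bin_size")
    n_bins = 2 * window // bin_size
    counts = [0] * n_bins
    for offset in offsets:
        if -window <= offset < window:
            # Shift so that -window maps to bin 0.
            counts[(offset + window) // bin_size] += 1
    return counts


# Toy offsets; values outside the window are ignored.
toy_offsets = [-2000, -1, 0, 1999, 2000, -2001]
counts = bin_relative_positions(toy_offsets, window=2000, bin_size=1000)
print(counts)
```

Dividing each count by the number of flanks analysed would turn the counts into a per-flank density suitable for comparing variant classes of different sizes.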

7. main_figures_and_supplementary_notebooks

  1. HGSVC_nonB_figures.ipynb - Jupyter notebook (.ipynb) with the analyses included in the main paper.
  2. HGSVC_nonB_supplementary_figures - Jupyter notebooks (.ipynb) with the analyses included in the supplementary materials.
