This repository contains analysis scripts used to characterize the landscape of non-B DNA motifs across 65 high-quality, phased human haplotype assemblies generated by the Human Genomic Structural Variation Consortium (HGSVC).
Using telomere-to-telomere (T2T) genome assemblies, this project systematically annotates six major classes of non-B DNA motifs: inverted repeats (IR), mirror repeats (MR), direct repeats (DR), A-phased repeats (APR), G-quadruplexes (G4), and Z-DNA. It then examines their distribution, predicted stability, and enrichment across diverse genomic contexts.
Analyses focus on regions of structural and functional complexity, including centromeres, segmental duplications, structural variant breakpoints, mobile element insertions, and candidate cis-regulatory elements. Together, these scripts support the results presented in the associated manuscript.
This repository is not intended to function as a standalone software tool or automated pipeline.
Instead, it contains the scripts and analysis logic used during an exploratory, multi-stage research project. Scripts were executed independently or in small groups depending on the specific analysis (e.g., motif annotation, stability modeling, enrichment testing, visualization).
This project integrates several established tools for non-B DNA motif annotation, structural stability prediction, and centromere profiling.
- Quadron: Predicts G-quadruplex (G4)-forming sequences and assigns stability scores (Q-scores) based on polymerase stalling signatures derived from experimental G4-seq data.
  GitHub: https://github.com/aleksahak/Quadron
- Non-B gfa (Non-B Motif Search Tool): Annotates canonical non-B DNA motif classes, including inverted repeats (IR), mirror repeats (MR), direct repeats (DR), A-phased repeats (APR), and Z-DNA, using sequence-based pattern definitions.
  GitHub: https://github.com/abcsFrederick/non-B_gfa
- seqfold: Computes minimum free energy (MFE) estimates for inverted repeat sequences, providing a thermodynamic proxy for cruciform-forming stability.
  GitHub: https://github.com/Lattice-Automation/seqfold
- Genomic Centromere Profiling (GCP): Used for centromere-specific analyses, including CENP-B box annotation, alpha-satellite monomer classification, and higher-order repeat (HOR) organization.
  GitHub: https://github.com/GiuntaLab/GCP-Centeny

| Annotation Tool | Alignment Reference | Batch | Number of Haplotypes | Data Source |
|---|---|---|---|---|
| nBMST & Quadron | CHM13v2.0 & GRCh38 | 20230818_verkko_batch1 | 76 | Verkko Batch 1 |
| nBMST & Quadron | CHM13v2.0 & GRCh38 | 20230927_verkko_batch2 | 30 | Verkko Batch 2 |
| nBMST & Quadron | CHM13v2.0 & GRCh38 | 20240201_verkko_batch3 | 24 | Verkko Batch 3 |
| Total | | | 130 | |

Scripts used to preprocess Verkko phased haplotype assemblies (Batches 1, 2, and 3) aligned to both T2T-CHM13v2.0 and GRCh38 at the whole-genome level. Run this step first: it produces the FASTA files required by the non-B annotation scripts below.

- phased_haplotype_alignments_20230818_verkko_batch1.sh: Script that takes phased haplotype assemblies (BAM files) aligned to a reference (chm13 or hg38), filters them to keep only primary alignments, splits the BAM file by chromosome, and writes each chromosome BAM to a FASTA file
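
As an illustration of the filtering step, the sketch below keeps only primary records and groups them by chromosome. It is not the repository's script: it works on SAM text lines rather than BAM files, the function names are hypothetical, and it assumes that "primary" means a mapped record with neither the secondary (0x100) nor the supplementary (0x800) flag bit set, as defined in the SAM specification.

```python
# SAM flag bits, as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def is_primary(flag: int) -> bool:
    """True for a mapped record that is neither secondary nor supplementary."""
    return not flag & (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY)


def split_primary_by_chromosome(sam_lines):
    """Group the names of primary SAM records by reference name (column 3)."""
    by_chrom = {}
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, chrom = fields[0], int(fields[1]), fields[2]
        if is_primary(flag):
            by_chrom.setdefault(chrom, []).append(name)
    return by_chrom
```

Whether the original script also drops supplementary alignments is an assumption here; adjust the flag mask if only secondary alignments should be excluded.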
Scripts related to annotating and formatting haplotype-level FASTA files with the non-B gfa and Quadron tools. Most scripts were built to run on high-performance computing clusters, as they use array jobs to speed up the processing of individual haplotypes.

- find_nonb_motifs_haplotype_array.sh: Script to run the non-B gfa tool (APR, MR, DR, IR, Z) on FASTA files produced from aligned assemblies (genome-wide)
- nonoverlapping_motifs.py: Script to collapse and merge overlapping intervals for non-B gfa annotations
- run_quadron.sh: Script to run the Quadron tool (G4s) on FASTA files produced from aligned assemblies
- process_quadron_files.py: Script to convert Quadron-produced .txt files into .csv files for easier manipulation
- quadron_to_bed.py: Script to merge and collapse overlapping G4 annotations and convert Quadron .csv files into .bed files
- process_gquad_beds_quadron.py: Script to produce genome-wide collapsed Quadron G4 metrics across all haplotypes
- process_bed_files.py: Script to produce genome-wide collapsed non-B gfa annotation metrics across all haplotypes
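
The collapsing of overlapping intervals mentioned above can be sketched as a standard sort-and-sweep merge. This is a minimal illustration, not the code in nonoverlapping_motifs.py: the function name is hypothetical, coordinates are assumed to be BED-style (0-based, half-open), and intervals that merely touch are assumed to be merged as well.

```python
def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) intervals.

    Assumes BED-style half-open coordinates; intervals that touch
    (start == previous end) are merged too.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]
```

Merging before computing coverage avoids counting the same base twice when two annotations of the same motif class overlap.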
Scripts related to obtaining free energy predictions for IRs using the seqfold tool.

- MEIs_and_SVs/process_free_eneries_chunks_array.py: Script to calculate predicted free energy for inverted repeats using seqfold from MEI and SV annotations (uses non-B gfa annotation TSVs)
- whole_genome/process_free_energies_chunks_array_wholegenome.py: Script to calculate predicted free energy for inverted repeats using seqfold from whole-genome annotations (uses non-B gfa annotation TSVs)
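
The script names suggest that inverted repeats are processed in chunks by array-job tasks. The sketch below shows one way such chunking could work; it is an assumption about the design, not the repository's code, and both function names are hypothetical. The energy function is passed in as a parameter so the example runs without seqfold installed; in practice it could be seqfold's `dg` function.

```python
def chunk_for_task(records, n_tasks, task_id):
    """Return the slice of records handled by array task `task_id` (0-based)."""
    size = -(-len(records) // n_tasks)  # ceiling division
    return records[task_id * size:(task_id + 1) * size]


def score_chunk(sequences, energy_fn):
    """Apply `energy_fn` to each sequence and pair it with the result."""
    return [(seq, energy_fn(seq)) for seq in sequences]
```

Each array task would call `chunk_for_task` with its own task index and write its scores to a separate output file, so tasks never contend for the same file.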
Scripts related to annotating non-B DNA structures in completely assembled centromeres using the non-B gfa and Quadron tools. Most scripts were built to run on high-performance computing clusters, as they use array jobs to speed up the processing of individual haplotypes.

- create_centromere_bed.py: Script to produce BED files for haplotype centromeres
- extract_haplotype_centromeres.sh: Script to extract completely and accurately assembled centromeres from full haplotype FASTA files (not alignment-produced FASTA files)
- find_nonb_motifs_haplotype_array.sh: Script to run the non-B gfa tool (APR, MR, DR, IR, Z) on FASTA files produced from aligned assemblies (in centromeres)
- nonoverlapping_motifs.py: Script to collapse and merge overlapping intervals for non-B gfa annotations
- process_bed_files_chr.py: Script to produce centromere-level non-B gfa annotation metrics for each haplotype (results are written to a single .csv file)
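
Per-centromere metrics of the kind described above could be computed as in the sketch below, which counts motifs and sums covered bases per chromosome and motif class. The function name, record layout, and choice of metrics are assumptions made for illustration. Intervals are assumed to be half-open and already collapsed to non-overlapping form.

```python
from collections import defaultdict


def summarize_motifs(records):
    """Summarize (chrom, start, end, motif_class) records.

    Returns {(chrom, motif_class): {"count": n, "bp": covered_bases}}.
    Assumes half-open, non-overlapping intervals.
    """
    summary = defaultdict(lambda: {"count": 0, "bp": 0})
    for chrom, start, end, motif_class in records:
        entry = summary[(chrom, motif_class)]
        entry["count"] += 1
        entry["bp"] += end - start
    return dict(summary)
```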
Scripts related to extracting the 2,000 bp flanks surrounding structural variants (SVs) and mobile element insertions (MEIs) from individual haplotypes.

- extract_flanks_newest_arrayjob.sh: Script to extract haplotype-specific 2,000 bp flanking regions around SVs and MEIs into a .csv file
- flanks_to_fasta.py: Script to convert the .csv file containing flank information into a .fasta file for non-B gfa and Quadron annotation
- add_metadata.py: Script to append SV and MEI metadata to flank information
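
The flank coordinates can be derived as in the sketch below. It is an illustration under stated assumptions rather than the logic of extract_flanks_newest_arrayjob.sh: the function name is hypothetical, coordinates are assumed to be 0-based and half-open, and flanks are assumed to be truncated at contig ends instead of being discarded.

```python
def flank_intervals(start, end, chrom_length, flank=2000):
    """Return (upstream, downstream) half-open intervals around a variant.

    Flanks are clamped so they never extend past either end of the contig.
    """
    upstream = (max(0, start - flank), start)
    downstream = (end, min(chrom_length, end + flank))
    return upstream, downstream
```

For an insertion, where `start` equals `end`, the two flanks are adjacent and together span up to 4,000 bp centered on the insertion site.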
- all_analysis_notebooks: Contains Jupyter notebooks (.ipynb) for all analyses performed over the course of this project
- CDKN1A_cruciform_analysis: Code and associated files used to produce the data for the CDKN1A inverted repeat stability analysis
- flanking_sv_mei_analysis: Code to create density plots for non-B DNA motifs located in the flanking regions around structural variant breakpoints
- HGSVC_nonB_figures.ipynb: Jupyter notebook (.ipynb) with the analyses included in the main paper
- HGSVC_nonB_supplementary_figures: Contains Jupyter notebooks (.ipynb) with the analyses included in the supplementary materials
