PhasedHapAssembly-nonB

This repository contains analysis scripts used to characterize the landscape of non-B DNA motifs across 65 high-quality, phased human haplotype assemblies generated by the Human Genomic Structural Variation Consortium (HGSVC).

Using telomere-to-telomere (T2T) genome assemblies, this project systematically annotates six major classes of non-B DNA motifs: inverted repeats (IR), mirror repeats (MR), direct repeats (DR), A-phased repeats (APR), G-quadruplexes (G4), and Z-DNA. It then examines their distribution, predicted stability, and enrichment across diverse genomic contexts.

Analyses focus on regions of structural and functional complexity, including centromeres, segmental duplications, structural variant breakpoints, mobile element insertions, and candidate cis-regulatory elements. Together, these scripts support the results presented in the associated manuscript.

Scope and Use

This repository is not intended to function as a standalone software tool or automated pipeline.

Instead, it contains the scripts and analysis logic used during an exploratory, multi-stage research project. Scripts were executed independently or in small groups depending on the specific analysis (e.g., motif annotation, stability modeling, enrichment testing, visualization).

Tools Used

This project integrates several established tools for non-B DNA motif annotation, structural stability prediction, and centromere profiling.

Quadron
Predicts G-quadruplex (G4)-forming sequences and assigns stability scores (Q-scores) based on polymerase stalling signatures derived from experimental G4-seq data.
GitHub: https://github.com/aleksahak/Quadron

Non-B gfa (Non-B Motif Search Tool, abbreviated nBMST in the table below)
Annotates canonical non-B DNA motif classes, including inverted repeats (IR), mirror repeats (MR), direct repeats (DR), A-phased repeats (APR), and Z-DNA, using sequence-based pattern definitions.
GitHub: https://github.com/abcsFrederick/non-B_gfa

seqfold
Computes minimum free energy (MFE) estimates for inverted repeat sequences, providing a thermodynamic proxy for cruciform-forming stability.
GitHub: https://github.com/Lattice-Automation/seqfold

Genomic Centromere Profiling (GCP)
Used for centromere-specific analyses, including CENP-B box annotation, alpha-satellite monomer classification, and higher-order repeat (HOR) organization.
GitHub: https://github.com/GiuntaLab/GCP-Centeny

Haplotype Assembly Data Utilized for annotations

Annotation Tool	Alignment Reference	Batch	Number of Haplotypes	Data Source
nBMST & Quadron	CHM13v2.0 & GRCh38	20230818_verkko_batch1	76	Verkko Batch 1
nBMST & Quadron	CHM13v2.0 & GRCh38	20230927_verkko_batch2	30	Verkko Batch 2
nBMST & Quadron	CHM13v2.0 & GRCh38	20240201_verkko_batch3	24	Verkko Batch 3
Total			130

1. preprocessing_aligned_assemblies

Scripts used to preprocess Verkko phased haplotype assemblies (batches 1, 2, and 3) aligned to both T2T-CHM13v2.0 and GRCh38 at the whole-genome level. Run this step first to obtain the fasta files required by the non-B annotation scripts below.

phased_haplotype_alignments_20230818_verkko_batch1.sh - Script that takes phased haplotype assemblies (BAM files) aligned to a reference (chm13 or hg38), filters them to keep only primary alignments, splits the BAM file by chromosome, and writes each per-chromosome BAM to a fasta file
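
The filtering and splitting logic described above can be sketched in plain Python on SAM text records. This is an illustration only, not the repository's shell script: the function names and toy records are hypothetical, and treating supplementary alignments as non-primary is an assumption (the script may keep them).

```python
from collections import defaultdict

# SAM flag bits as defined in the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_primary(flag: int) -> bool:
    """True for mapped alignments that are neither secondary nor supplementary.

    Assumption: supplementary alignments are dropped along with secondary ones.
    """
    return not flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY)


def primary_by_chrom(sam_lines):
    """Group primary alignments by reference name as (query_name, sequence) pairs."""
    grouped = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname, seq = fields[0], int(fields[1]), fields[2], fields[9]
        if is_primary(flag):
            grouped[rname].append((qname, seq))
    return dict(grouped)


def to_fasta(records) -> str:
    """Render (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Toy SAM records (hypothetical contig names and sequences)
sam = [
    "@SQ\tSN:chr1\tLN:100",
    "ctgA\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\t*",
    "ctgA\t2048\tchr2\t5\t60\t2M\t*\t0\t0\tAC\t*",  # supplementary, dropped
    "ctgB\t16\tchr2\t9\t60\t3M\t*\t0\t0\tGGA\t*",
    "ctgC\t256\tchr1\t3\t0\t2M\t*\t0\t0\tTT\t*",  # secondary, dropped
]
per_chrom = primary_by_chrom(sam)
chr1_fasta = to_fasta(per_chrom["chr1"])
```

Grouping by reference name before writing mirrors the split-by-chromosome step, so each chromosome ends up with its own fasta text.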

2. annotation_scripts_wholegenome

Scripts for annotating and formatting haplotype-level fasta files with the non-B gfa and Quadron tools. Most scripts were built to run on high-performance computing clusters, where array jobs speed up the processing of individual haplotypes.

find_nonb_motifs_haplotype_array.sh - Script to run non-B gfa tool (APR, MR, DR, IR, Z) on fastas produced from aligned assemblies (genome-wide)
nonoverlapping_motifs.py - Script to collapse and merge overlapping intervals for non-B gfa annotations
run_quadron.sh - Script to run Quadron tool (G4s) on fastas produced from aligned assemblies
process_quadron_files.py - Script to convert Quadron-produced .txt files into .csv files for easier manipulation
quadron_to_bed.py - Script to merge and collapse overlapping G4 annotations and convert Quadron .csv files into .bed files
process_gquad_beds_quadron.py - Script to produce genome-wide collapsed Quadron G4 metrics across all haplotypes
process_bed_files.py - Script to produce genome-wide collapsed non-B gfa annotation metrics across all haplotypes
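
Several of the scripts above collapse overlapping annotations before computing metrics. A minimal sketch of that idea follows, assuming BED-style 0-based half-open coordinates and assuming that touching (book-ended) intervals are merged as well. The function names are hypothetical and are not taken from nonoverlapping_motifs.py or quadron_to_bed.py.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Collapse overlapping or book-ended (chrom, start, end) intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                # First interval on this chromosome opens a new merged block
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps or touches the open block: extend it
                current_end = max(current_end, end)
            else:
                # Gap found: close the open block and start a new one
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


def covered_bp(merged):
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start for _, start, end in merged)


# Toy annotations (hypothetical coordinates)
motifs = [("chr1", 10, 20), ("chr1", 15, 30), ("chr1", 40, 50), ("chr2", 5, 8)]
collapsed = merge_intervals(motifs)
total = covered_bp(collapsed)
```

Collapsing first means each base is counted once, so coverage-style metrics are not inflated by overlapping motif calls.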

3. IR_free_energy_processing_seqfold

Scripts related to obtaining free energy predictions for IRs using the seqfold tool.

MEIs_and_SVs/process_free_eneries_chunks_array.py - Script to calculate predicted free energy with seqfold for inverted repeats from MEI and SV annotations (takes non-B gfa annotation .tsv files as input)
whole_genome/process_free_energies_chunks_array_wholegenome.py - Script to calculate predicted free energy with seqfold for inverted repeats from whole-genome annotations (takes non-B gfa annotation .tsv files as input)
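
The script names suggest that inverted repeats are scored in chunks, one chunk per array task. The sketch below shows one way to split a list of sequences evenly across tasks and score only the chunk owned by a given task. All names here are hypothetical. The energy function is passed in as a parameter; the placeholder used below is not a thermodynamic model. In practice a function such as `dg` from seqfold could be supplied instead, which is an assumption about how the scripts call the library.

```python
def chunk_bounds(n_items: int, n_chunks: int, task_id: int):
    """Return the [start, stop) slice handled by one array task (0-based task_id).

    The first (n_items % n_chunks) tasks each receive one extra item.
    """
    base, extra = divmod(n_items, n_chunks)
    start = task_id * base + min(task_id, extra)
    stop = start + base + (1 if task_id < extra else 0)
    return start, stop


def score_chunk(sequences, n_chunks, task_id, energy_fn):
    """Apply energy_fn to the sequences belonging to this task's chunk."""
    start, stop = chunk_bounds(len(sequences), n_chunks, task_id)
    return [(seq, energy_fn(seq)) for seq in sequences[start:stop]]


def placeholder_energy(seq: str) -> float:
    """Stand-in scoring function for illustration only (NOT a free energy model)."""
    return -0.5 * len(seq)


# Toy inverted repeat sequences (hypothetical)
ir_seqs = ["GCGCATGCGC", "AATTAATT", "ACGTACGT", "TTAA", "CCGG"]
task1_scores = score_chunk(ir_seqs, n_chunks=2, task_id=1, energy_fn=placeholder_energy)
```

On a cluster, the task index would typically come from the scheduler's array job environment variable, and each task would write its own output file for later concatenation.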

4. centromere_processing_scripts

Scripts for annotating non-B DNA structures in completely assembled centromeres with the non-B gfa and Quadron tools. Most scripts were built to run on high-performance computing clusters, where array jobs speed up the processing of individual haplotypes.

create_centromere_bed.py - Script to produce bed files for haplotype centromeres
extract_haplotype_centromeres.sh - Script to extract completely and accurately assembled centromeres from full haplotype fasta files (not the fasta files produced from alignments)
find_nonb_motifs_haplotype_array.sh - Script to run non-B gfa tool (APR, MR, DR, IR, Z) on fastas produced from aligned assemblies (In Centromeres)
nonoverlapping_motifs.py - Script to collapse and merge overlapping intervals for non-B gfa annotations
process_bed_files_chr.py - Script to produce centromere-level non-B gfa annotation metrics for each haplotype (results written to a single .csv file)
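
Extracting centromere sequences from a haplotype fasta with a bed file comes down to slicing each record with 0-based half-open coordinates. The sketch below illustrates this with the standard library only. The function names, record name, and coordinates are hypothetical, and the repository's shell script may rely on an external tool instead.

```python
def read_fasta(text: str) -> dict:
    """Parse FASTA text into {record_name: sequence}; the name is the first word of the header."""
    records, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def extract_bed_regions(records, bed_lines):
    """Return {"chrom:start-end": subsequence} for each BED line (0-based, half-open)."""
    out = {}
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end = line.split("\t")[:3]
        start, end = int(start), int(end)
        out[f"{chrom}:{start}-{end}"] = records[chrom][start:end]
    return out


# Toy haplotype record and centromere interval (hypothetical)
fasta_text = ">hap1_chr1 demo\nACGTAC\nGTTTGA\n"
bed = ["hap1_chr1\t2\t8"]
regions = extract_bed_regions(read_fasta(fasta_text), bed)
```

Because BED intervals are half-open, the extracted length is simply end minus start, which makes downstream per-centromere density calculations straightforward.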

5. flank_extraction_processing_scripts

Scripts related to extracting 2000 bp flanks surrounding structural variants (SVs) and mobile element insertions (MEIs) from individual haplotypes.

extract_flanks_newest_arrayjob.sh - Script to extract haplotype-specific 2000 bp flanking regions around SVs and MEIs into a .csv file
flanks_to_fasta.py - Script to turn .csv file containing flank information into .fasta file for non-B gfa and Quadron annotations
add_metadata.py - Script to append SV and MEI metadata to flank information
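
The flank computation can be sketched as follows, assuming 0-based half-open coordinates and assuming that flanks are clipped at the sequence boundaries. Whether the repository's script clips or discards flanks near contig ends is not stated here, and the function name is hypothetical.

```python
def flank_coords(start: int, end: int, chrom_len: int, flank: int = 2000):
    """Return ((left_start, left_end), (right_start, right_end)) around a variant.

    Coordinates are 0-based half-open; flanks are clipped to [0, chrom_len].
    """
    left = (max(0, start - flank), start)
    right = (end, min(chrom_len, end + flank))
    return left, right


# Variant well inside the sequence: both flanks are the full 2000 bp
inner = flank_coords(5000, 5300, chrom_len=10_000)

# Variant near both ends of a short sequence: flanks are clipped
edge = flank_coords(1500, 1800, chrom_len=3000)
```

Slicing a haplotype sequence with these coordinate pairs yields the left and right flank sequences that can then be written to fasta for annotation.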

6. analysis_scripts

all_analysis_notebooks - Contains Jupyter Notebooks (.ipynb) related to all analyses performed over the course of this project
CDKN1A_cruciform_analysis - Code and associated files used to produce data pertaining to the CDKN1A inverted repeat stability analysis
flanking_sv_mei_analysis - Code to create density plots for non-B DNA motifs located in the flanking regions around structural variant breakpoints.

7. main_figures_and_supplementary_notebooks

HGSVC_nonB_figures.ipynb - Jupyter notebook (.ipynb) with the analyses that made it into the main paper.
HGSVC_nonB_supplementary_figures - Contains Jupyter notebooks (.ipynb) with the analyses that made it into the supplementary material.
