SPfast

Ultra-fast and highly sensitive protein structure alignment with segment-level representations and block-sparse optimization.

Demo notebooks

Notebook	Data	Description
Structure search	afdb-clu.db (6 GB), BFVD.db (670 MB)	Search a structure database of predicted cluster representatives
PFAM annotation	afdb-clu-annot.db (2 GB)	Annotate protein function by structure search over 375k curated AFDB clusters

Note: SPfast is designed for multi-core CPUs and is not optimized for Colab.

Setup

Commands below assume you are in the repository root.

Create an environment for preprocessing structures (utils/idealize.py) and PyMOL bindings:

conda create -n spfast python=3.8 -c conda-forge
conda activate spfast
conda install -c conda-forge pybind11 scikit-learn biopython numpy

Install the Python extension module (SPlib):

pip install .

Note: SPlib Python bindings are useful for preprocessing and the PyMOL plugin. High-throughput searches should use the compiled binaries below.

Build command-line binaries:

make -C src gnu

This produces:

src/SPfast.gnu
src/prepare_bin.gnu
src/extract_bin.gnu

Install DSSP for external secondary-structure labels:

wget https://github.com/PDB-REDO/dssp/releases/download/v4.4.0/mkdssp-4.4.0-linux-x64
chmod +x mkdssp-4.4.0-linux-x64

Paper results were produced with dssp-2.0.4-linux-amd64.

Quick start (example data)

This section runs the full pipeline on the files under example/.

The required starting inputs are a directory of structure files and a directory of corresponding DSSP secondary structure annotation files.

Generate .ideal files from structures:

python utils/idealize.py example/example_list \
  --sdir example/structures \
  --dssdir example/DSSP \
  --odir example/ideal \
  --structure_suffix ent

Choose one search input format:

Option A (default): convert .ideal files into per-structure .ideal.bin files:

src/prepare_bin.gnu -qlist example/ideal example/example_list .ideal

Option B (optional): pack structures into a single database (.db + .db.index):

src/prepare_bin.gnu -qlist example/ideal example/example_list .ideal -tdb example/example.db > example/example.db.index

Optional utility (Option B only): extract entries back out of the packed database:

src/extract_bin.gnu example/example.db example/ideal -q d1qo0d_.ideal
src/extract_bin.gnu example/example.db example/ideal -qlist example/example_list .ideal

Search modes

All commands use paths relative to the repository root.

All-vs-all between two lists:

src/SPfast.gnu -qlist example/ideal example/example_list .ideal.bin \
  -tlist example/ideal example/example_list .ideal.bin

Query against packed database:

src/SPfast.gnu -q example/ideal/d1ktga_.ideal.bin -tdb example/example.db

List of explicit pairs (query target per line):

src/SPfast.gnu -plist example/example_pairs -idir example/ideal

Unique pairwise all-vs-all within one list (N*(N-1)/2 comparisons):

src/SPfast.gnu -pairlist example/ideal example/example_list .ideal.bin
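
As a quick check on the comparison count, the sketch below enumerates unique unordered pairs with Python's itertools.combinations and confirms the total equals N*(N-1)/2. The helper name unique_pairs is hypothetical and not part of SPfast, and the order in which SPfast itself visits pairs is not assumed here. The IDs are the ones used in the example commands of this README.

```python
from itertools import combinations


def unique_pairs(ids):
    """Return every unordered pair of distinct IDs exactly once."""
    return list(combinations(ids, 2))


# IDs borrowed from the example commands; any list of names works.
ids = ["d1ktga_", "d1xria_", "d1qo0d_"]
pairs = unique_pairs(ids)

n = len(ids)
# The number of unique pairs matches the closed-form count.
assert len(pairs) == n * (n - 1) // 2
for query, target in pairs:
    print(query, target)
```

For a list of 1,000 structures this gives 499,500 comparisons, half of what a full all-vs-all of the list against itself (excluding self-comparisons) would run.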

Pairwise single comparison:

src/SPfast.gnu example/ideal/d1ktga_.ideal.bin example/ideal/d1xria_.ideal.bin

Important parameters

Most impactful sensitivity controls (roughly): -ssprefcut >> -coarsecut > -finalgap0 > -converge > -segcut ~ -riters.

-SPscore: use original SPscore parameters instead of optimized defaults.
-ssprefcut (default -1): threshold for SS-segment prefiltering; most useful with -singledom.
-coarsecut (default -1): coarse segment-based score cutoff.
-singledom: assume single-domain proteins; enables stricter SS-based prefiltering.
-finalgap0 (default 0.2): final-stage gap-open penalty (must be > 0).
-converge (default 0.05): stop criterion for iterative alignment/superposition refinement.
-segcut (default 5.0): maximum RMSD for seed fragments (higher = more sensitive, slower).
-riters (default 1): number of refinement iterations.
-fast: speed-focused preset (-converge 0.9 -coarsecut 5.5 -segcut 4.0).
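
When searches are driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of SPfast: the function build_search_command is hypothetical, it only uses flags listed in this README, and it assumes that options may be appended after the -q and -tdb arguments (the accepted argument order is not confirmed here).

```python
import shlex


def build_search_command(query, database, options=None, binary="src/SPfast.gnu"):
    """Assemble an argument list for one query against a packed database.

    `options` maps a flag to its value; use None for value-less switches
    such as -singledom. Argument order is an assumption of this sketch.
    """
    cmd = [binary, "-q", query, "-tdb", database]
    for flag, value in (options or {}).items():
        cmd.append(flag)
        if value is not None:
            cmd.append(str(value))
    return cmd


cmd = build_search_command(
    "example/ideal/d1ktga_.ideal.bin",
    "example/example.db",
    {"-singledom": None, "-segcut": 4.0, "-iprint": 1},
)
# Print a shell-quoted version for inspection before running it.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True, text=True) once the binaries have been built.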

Reporting options

-reportcutoff X: print only results with score >= X (for SPscore output).

`-iprint 1` (single-line summary)

Prints one result line per comparison.

query target SPscore Rawscore SSscore nA nB Le SeqID Nali seeds valid_seeds coarse

Field meanings:

SPscore: effective SPscore (main ranking score).
Rawscore: unnormalized SPscore.
SSscore: secondary-structure prefilter score.
nA, nB: query/target lengths.
Le: effective aligned length.
SeqID: sequence identity (%).
Nali: number of aligned residue pairs.
seeds, valid_seeds: sampled and retained seed counts.
coarse: coarse-stage score.
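
The field list above can be turned into a small parser for downstream filtering. This is a sketch under stated assumptions: fields are taken to be whitespace-separated and in the order shown, every numeric column is read as a float because the exact print format is not documented here, and the function name parse_hit is hypothetical. The sample line contains invented values for illustration only and is not real SPfast output.

```python
# Column names in the order given for -iprint 1.
FIELDS = (
    "query", "target", "SPscore", "Rawscore", "SSscore", "nA", "nB",
    "Le", "SeqID", "Nali", "seeds", "valid_seeds", "coarse",
)


def parse_hit(line):
    """Parse one -iprint 1 result line into a dict keyed by field name."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    hit = {"query": parts[0], "target": parts[1]}
    for name, text in zip(FIELDS[2:], parts[2:]):
        hit[name] = float(text)  # assumption: all remaining columns are numeric
    return hit


# Invented values, for illustration only.
sample = "d1ktga_ d1xria_ 0.61 95.2 0.80 120 135 110.5 18.3 102 40 12 7.9"
hit = parse_hit(sample)
print(hit["query"], hit["target"], hit["SPscore"])
```

A post-hoc filter equivalent in spirit to -reportcutoff X is then a one-liner such as [h for h in hits if h["SPscore"] >= X].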

`-iprint 2` (summary + transform)

Includes everything from -iprint 1, then appends a 3-line rigid transform:

t_x r11 r12 r13
t_y r21 r22 r23
t_z r31 r32 r33

Here t_* are the components of the translation vector and r** are the elements of the rotation matrix.
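
To illustrate how such a block could be consumed, the sketch below parses the three lines and applies the transform to a coordinate. It assumes the conventional form x' = R x + t with x as a column vector; this README does not state which structure is moved or confirm that convention, so verify against a known superposition before relying on it. The function names are hypothetical, and the example transform (a 90 degree rotation about z plus a shift) is made up.

```python
def parse_transform(lines):
    """Split three 't r1 r2 r3' rows into a translation vector and a 3x3 rotation."""
    rows = [[float(x) for x in line.split()] for line in lines]
    if len(rows) != 3 or any(len(row) != 4 for row in rows):
        raise ValueError("expected 3 lines of 4 numbers each")
    translation = [row[0] for row in rows]
    rotation = [row[1:] for row in rows]
    return translation, rotation


def apply_transform(point, translation, rotation):
    """Return R * point + t for one (x, y, z) coordinate (assumed convention)."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Made-up transform: rotate 90 degrees about z, then shift by (1, 2, 3).
block = [
    "1.0 0.0 -1.0 0.0",
    "2.0 1.0  0.0 0.0",
    "3.0 0.0  0.0 1.0",
]
t, R = parse_transform(block)
print(apply_transform((1.0, 0.0, 0.0), t, R))  # (1.0, 3.0, 3.0)
```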

`-iprint 3` (summary + transform + alignment)

Includes everything from -iprint 2, then appends a 3-line sequence alignment block:

<query_start> <query_aligned_sequence> <query_end>
             <match_quality_markers>
<target_start> <target_aligned_sequence> <target_end>

Marker legend:

: aligned pair distance <= 4 A
; aligned pair distance < 5 A
. aligned pair distance < 8 A
space: no close structural match at that position
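
The legend amounts to a set of distance thresholds checked from tightest to loosest. The sketch below restates it as a function, taking the comparison operators exactly as written in the legend; the name match_marker is hypothetical and this is not the code SPfast uses internally.

```python
def match_marker(distance):
    """Map an aligned-pair distance to its marker, per the legend above."""
    if distance <= 4.0:
        return ":"
    if distance < 5.0:
        return ";"
    if distance < 8.0:
        return "."
    return " "  # no close structural match at this position


# Build a marker row for a few example distances.
distances = [1.2, 4.0, 4.6, 7.9, 12.0]
print("".join(match_marker(d) for d in distances))
```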

PyMOL plugin

SPfast.py provides a pairwise alignment command for PyMOL (similar usage to align/cealign).

Install SPlib first (pip install .).
Install SPfast.py through the PyMOL Plugin Manager from this repository URL.
Run from the PyMOL terminal:

SPfast stationary_selection, moving_selection

References

If you use this tool, please cite:

Litfin, T, Zhou, Y, von Itzstein, M. (2025). Ultra-fast and highly sensitive protein structure alignment with segment-level representations and block-sparse optimization. bioRxiv. doi:10.1101/2025.03.14.643159.

Source code is adapted from:

Yang, Y, Zhan, J, Zhao, H, Zhou, Y. (2012). A new size-independent score for pairwise protein structure alignment and its application to structure classification and nucleic-acid binding prediction. Proteins. 80(8), 2080-2088.

Optimum rotations are computed using:

Theobald, D. (2005). Rapid calculation of RMSD using a quaternion-based characteristic polynomial. Acta Crystallographica A. 61(4), 478-480.
Liu, P, Agrafiotis, D, Theobald, D. (2009). Fast determination of the optimal rotational matrix for macromolecular superpositions. Journal of Computational Chemistry. 31(7), 1561-1563.

Reference clustering data:

Barrio-Hernandez, I., Yeo, J., Janes, J. et al. (2023). Clustering predicted structures at the scale of the known protein universe. Nature. 622, 637-645.
