Ultra-fast and highly sensitive protein structure alignment with segment-level representations and block-sparse optimization.
- Demo notebooks
- Setup
- Quick start (example data)
- Search modes
- Important parameters
- Reporting options
- PyMOL plugin
- References
| Notebook | Data | Description |
|---|---|---|
| Structure search | afdb-clu.db (6 GB), BFVD.db (670 MB) | Search a structure database of predicted cluster representatives |
| PFAM annotation | afdb-clu-annot.db (2 GB) | Annotate protein function by structure search over 375k curated AFDB clusters |
Note: SPfast is designed for multi-core CPUs and is not optimized for Colab.
Commands below assume you are in the repository root.
- Create an environment for preprocessing structures (`utils/idealize.py`) and PyMOL bindings:

      conda create -n spfast python=3.8 -c conda-forge
      conda activate spfast
      conda install -c conda-forge pybind11 scikit-learn biopython numpy

- Install the Python extension module (`SPlib`):

      pip install .

  Note: the `SPlib` Python bindings are useful for preprocessing and the PyMOL plugin. High-throughput searches should use the compiled binaries below.
- Build command-line binaries:

      make -C src gnu

  This produces:
  - `src/SPfast.gnu`
  - `src/prepare_bin.gnu`
  - `src/extract_bin.gnu`
- Install DSSP for external secondary-structure labels:

      wget https://github.com/PDB-REDO/dssp/releases/download/v4.4.0/mkdssp-4.4.0-linux-x64
      chmod +x mkdssp-4.4.0-linux-x64

  Paper results were produced with dssp-2.0.4-linux-amd64.
This reproduces the full pipeline on files under example/.
The required starting inputs are a directory of structure files and a directory of corresponding DSSP secondary structure annotation files.
- Generate `.ideal` files from structures:

      python utils/idealize.py example/example_list \
        --sdir example/structures \
        --dssdir example/DSSP \
        --odir example/ideal \
        --structure_suffix ent

- Choose one search input format:
  Option A (default): convert `.ideal` files into per-structure `.ideal.bin` files:

      src/prepare_bin.gnu -qlist example/ideal example/example_list .ideal

  Option B (optional): pack structures into a single database (`.db` + `.db.index`):

      src/prepare_bin.gnu -qlist example/ideal example/example_list .ideal -tdb example/example.db > example/example.db.index

- Optional utility (Option B only): extract entries back out of the packed database:
      src/extract_bin.gnu example/example.db example/ideal -q d1qo0d_.ideal
      src/extract_bin.gnu example/example.db example/ideal -qlist example/example_list .ideal

All commands assume paths from repository root.
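Before running a search over Option A files, it can help to confirm that every list entry has a converted file. The sketch below is not part of the repository. It assumes the list file holds one identifier per line (first whitespace-separated token) and that each file is named `<directory>/<identifier><suffix>`, mirroring the three arguments passed to `-qlist`. The helper names `expected_paths` and `missing_files` are hypothetical.

```python
from pathlib import Path


def expected_paths(directory, list_file, suffix):
    """Return the file paths implied by a (directory, list, suffix) triple.

    Assumption: one identifier per non-empty line, first token only.
    """
    lines = Path(list_file).read_text().splitlines()
    ids = [line.split()[0] for line in lines if line.strip()]
    return [Path(directory) / f"{entry}{suffix}" for entry in ids]


def missing_files(directory, list_file, suffix):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_paths(directory, list_file, suffix)
            if not p.is_file()]
```

Calling `missing_files("example/ideal", "example/example_list", ".ideal.bin")` from the repository root should return an empty list once the Option A conversion has finished, if the naming assumption holds.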
All-vs-all between two lists:

    src/SPfast.gnu -qlist example/ideal example/example_list .ideal.bin \
      -tlist example/ideal example/example_list .ideal.bin

Query against packed database:

    src/SPfast.gnu -q example/ideal/d1ktga_.ideal.bin -tdb example/example.db

List of explicit pairs (query target per line):

    src/SPfast.gnu -plist example/example_pairs -idir example/ideal

Unique pairwise all-vs-all within one list (N*(N-1)/2 comparisons):

    src/SPfast.gnu -pairlist example/ideal example/example_list .ideal.bin

Pairwise single comparison:

    src/SPfast.gnu example/ideal/d1ktga_.ideal.bin example/ideal/d1xria_.ideal.bin

Most impactful sensitivity controls (roughly): `-ssprefcut` >> `-coarsecut` > `-finalgap0` > `-converge` > `-segcut` ~ `-riters`.
- `-SPscore`: use original SPscore parameters instead of optimized defaults.
- `-ssprefcut` (default `-1`): threshold for SS-segment prefiltering; most useful with `-singledom`.
- `-coarsecut` (default `-1`): coarse segment-based score cutoff.
- `-singledom`: assume single-domain proteins; enables stricter SS-based prefiltering.
- `-finalgap0` (default `0.2`): final-stage gap-open penalty (must be > 0).
- `-converge` (default `0.05`): stop criterion for iterative alignment/superposition refinement.
- `-segcut` (default `5.0`): maximum RMSD for seed fragments (higher = more sensitive, slower).
- `-riters` (default `1`): number of refinement iterations.
- `-fast`: speed-focused preset (`-converge 0.9 -coarsecut 5.5 -segcut 4.0`).
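When driving searches from a script, the parameters above can be assembled into an argument list. The following is a minimal sketch, not a wrapper shipped with SPfast: `build_search_command` is a hypothetical helper, and spelling out the three values listed for `-fast` is assumed (not verified) to behave like the preset itself.

```python
import shlex

# Values listed for the -fast preset above; passing them explicitly is
# assumed, not confirmed, to be equivalent to -fast.
FAST_PRESET = {"-converge": "0.9", "-coarsecut": "5.5", "-segcut": "4.0"}


def build_search_command(query_bin, target_db, overrides=None,
                         binary="src/SPfast.gnu"):
    """Build an argv list for a query-vs-database search."""
    cmd = [binary, "-q", query_bin, "-tdb", target_db]
    for flag, value in (overrides or {}).items():
        cmd += [flag, str(value)]
    return cmd


cmd = build_search_command(
    "example/ideal/d1ktga_.ideal.bin", "example/example.db", FAST_PRESET
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to execute
```

Keeping the command as a list avoids shell-quoting problems when paths contain spaces.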
- `-reportcutoff X`: print only results with score >= X (for SPscore output).
`-iprint 1` prints one result line per comparison, with the following columns:

    query target SPscore Rawscore SSscore nA nB Le SeqID Nali seeds valid_seeds coarse
Field meanings:
- `SPscore`: effective SPscore (main ranking score).
- `Rawscore`: unnormalized SPscore.
- `SSscore`: secondary-structure prefilter score.
- `nA`, `nB`: query/target lengths.
- `Le`: effective aligned length.
- `SeqID`: sequence identity (%).
- `Nali`: number of aligned residue pairs.
- `seeds`, `valid_seeds`: sampled and retained seed counts.
- `coarse`: coarse-stage score.
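The one-line output can be loaded into a script for ranking or filtering. The sketch below assumes each result line is whitespace-separated and follows the column order shown above; the sample line uses made-up values purely for illustration, and `parse_result_line` is a hypothetical helper rather than part of SPfast.

```python
FIELDS = [
    "query", "target", "SPscore", "Rawscore", "SSscore", "nA", "nB",
    "Le", "SeqID", "Nali", "seeds", "valid_seeds", "coarse",
]


def parse_result_line(line):
    """Split one result line into a dict keyed by column name.

    Assumption: 13 whitespace-separated columns in the documented order.
    All numeric columns are read as float to avoid guessing their format.
    """
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    for name in FIELDS[2:]:
        record[name] = float(record[name])
    return record


# Hypothetical values, not real SPfast output.
sample = "d1ktga_ d1xria_ 0.75 120.5 0.6 150 160 140.2 23.5 130 40 12 5.1"
hit = parse_result_line(sample)
print(hit["query"], hit["target"], hit["SPscore"])
```

Sorting a list of such records by `SPscore` in descending order gives a ranking by the main score.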
`-iprint 2` includes everything from `-iprint 1`, then appends a 3-line rigid transform:

    t_x r11 r12 r13
    t_y r21 r22 r23
    t_z r31 r32 r33

Here `t_*` is the translation vector and `r**` are the elements of the rotation matrix.
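To reuse the reported transform outside SPfast, the three lines can be parsed and applied to coordinates. This sketch assumes the common convention x' = R·x + t; the text above does not state the order of operations or which structure is moved, so verify against a known pair before relying on it. Both function names are hypothetical.

```python
def parse_transform(lines):
    """Parse three 't r1 r2 r3' lines into (translation, rotation)."""
    translation, rotation = [], []
    for line in lines:
        values = [float(v) for v in line.split()]
        if len(values) != 4:
            raise ValueError("each transform line needs 4 numbers")
        translation.append(values[0])
        rotation.append(values[1:])
    if len(rotation) != 3:
        raise ValueError("expected exactly 3 transform lines")
    return translation, rotation


def apply_transform(point, translation, rotation):
    """Return R*point + t (assumed convention, see lead-in)."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Illustrative input: 90-degree rotation about z plus a shift of (1, 2, 3).
t, r = parse_transform(["1 0 -1 0", "2 1 0 0", "3 0 0 1"])
moved = apply_transform((1.0, 0.0, 0.0), t, r)
print(moved)
```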
The next level includes everything from `-iprint 2`, then appends a 3-line sequence alignment block:

    <query_start> <query_aligned_sequence> <query_end>
    <match_quality_markers>
    <target_start> <target_aligned_sequence> <target_end>

Marker legend:
- `:` aligned pair distance <= 4 Å
- `;` aligned pair distance < 5 Å
- `.` aligned pair distance < 8 Å
- space: no close structural match at that position
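The legend amounts to a threshold lookup on the distance between aligned residues. The function below is a sketch of that mapping as the legend reads, not SPfast's own code; `match_marker` is a hypothetical name, and `None` stands in for a position with no aligned partner.

```python
def match_marker(distance):
    """Map an aligned-pair distance (in angstroms) to its legend marker."""
    if distance is None:
        return " "      # no aligned partner at this position
    if distance <= 4.0:
        return ":"
    if distance < 5.0:
        return ";"
    if distance < 8.0:
        return "."
    return " "          # aligned, but not a close structural match


# Illustrative distances only.
markers = "".join(match_marker(d) for d in [1.2, 4.7, 6.0, 9.5, None])
print(repr(markers))
```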
SPfast.py provides a pairwise alignment command for PyMOL (similar usage to align/cealign).
- Install `SPlib` first (`pip install .`).
- Install `SPfast.py` through the PyMOL Plugin Manager from this repository URL.
- Run from the PyMOL terminal:

      SPfast stationary_selection, moving_selection

If you use this tool, please cite:
- Litfin, T, Zhou, Y, von Itzstein, M. (2025). Ultra-fast and highly sensitive protein structure alignment with segment-level representations and block-sparse optimization. bioRxiv. doi:10.1101/2025.03.14.643159.
Source code is adapted from:
- Yang, Y, Zhan, J, Zhao, H, Zhou, Y. (2012). A new size-independent score for pairwise protein structure alignment and its application to structure classification and nucleic-acid binding prediction. Proteins. 80(8), 2080-2088.
Optimum rotations are computed using:
- Theobald, D. (2005). Rapid calculation of RMSD using a quaternion-based characteristic polynomial. Acta Crystallographica A. 61(4), 478-480.
- Liu, P, Agrafiotis, D, Theobald, D. (2009). Fast determination of the optimal rotational matrix for macromolecular superpositions. Journal of Computational Chemistry. 31(7), 1561-1563.
Reference clustering data:
- Barrio-Hernandez, I., Yeo, J., Janes, J. et al. (2023). Clustering predicted structures at the scale of the known protein universe. Nature. 622, 637-645.
