AltaiR: alignment-free and temporal analysis of multi-FASTA data (C toolkit)
AltaiR is a fast, alignment-free toolkit for temporal analysis and characterization of multi-FASTA datasets, targeting large-scale collections such as genomes and proteomes.
It is particularly useful for scenarios with many sequences collected over time (e.g., epidemic/pandemic datasets), where alignment-based workflows can be slow, brittle, or unnecessary for the desired analyses. AltaiR is implemented in multi-threaded C, is highly flexible, and is designed to run without external dependencies (core toolkit). It accepts any sequence(s) in (multi-)FASTA format.
- ⚡ High speed (multi-threaded C implementation)
- 🧩 High flexibility (multiple independent analysis modules)
- 🧬 Alignment-free methods (compression-based and word-based analyses)
- 📦 No external dependencies for the core toolkit
- 🗂️ Works with any (multi-)FASTA input (DNA/RNA/protein)
Contents:
- 🧰 Commands
- ⚙️ Installation
- 🚀 Quickstart
- 🧾 Help and parameters
- 🧪 Reproducing experiments (pipelines)
- 📖 Citation
- 🐞 Issues
- 📜 License
AltaiR provides a single entry point (AltaiR) with six subcommands:
- 📉 average: moving average filter for a float column in a CSV file (column index is a parameter)
- 🧹 filter: filter FASTA records by alphabet, completeness, length, CG content, and presence/absence of string patterns
- 📊 frequency: compute alphabet frequencies per FASTA record (optionally with alphabet filtering)
- 🧮 nc: compute Normalized Compression (NC) per FASTA record (configurable compression level)
- 📐 ncd: compute Normalized Compression Distance (NCD) for each record relative to a reference
- 🧩 raw: compute Relative Absent Words (RAWs) with CG% estimation per RAW
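To make the per-record view of the frequency subcommand and the CG content measure concrete, here is a minimal Python sketch. It is not AltaiR's implementation: the function names (read_fasta, symbol_frequencies, cg_content) are hypothetical, and the choice to compute CG content over A/C/G/T symbols only is an assumption made for illustration.

```python
import io
from collections import Counter
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a (multi-)FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def symbol_frequencies(seq: str) -> Dict[str, float]:
    """Relative frequency of every symbol that occurs in one record."""
    counts = Counter(seq)
    total = sum(counts.values())
    return {s: c / total for s, c in sorted(counts.items())} if total else {}


def cg_content(seq: str) -> float:
    """Fraction of C and G among the A/C/G/T symbols (assumed definition)."""
    acgt = sum(seq.count(s) for s in "ACGT")
    return (seq.count("C") + seq.count("G")) / acgt if acgt else 0.0


# Hypothetical two-record input, used only for illustration.
demo = io.StringIO(">rec1\nACGT\nACGT\n>rec2\nGGCCNN\n")
for name, seq in read_fasta(demo):
    print(name, symbol_frequencies(seq), round(cg_content(seq), 3))
```

Each record is processed independently, which is what allows this kind of analysis to be tracked over time across a large collection without aligning the sequences.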
Tip: each subcommand has its own -h/--help describing the expected inputs/outputs.
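The two compression-based measures listed above, NC and NCD, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not AltaiR's code: zlib stands in for the compressor AltaiR actually uses, the function names are hypothetical, and the formulas are the commonly used definitions, NC(x) = C(x) / (|x| log2 |A|) and NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size in bits and |A| the alphabet size.

```python
import math
import random
import zlib


def compressed_bits(data: bytes, level: int = 9) -> int:
    """Compressed size in bits, using zlib as a stand-in compressor."""
    return 8 * len(zlib.compress(data, level))


def nc(seq: str, alphabet_size: int = 4, level: int = 9) -> float:
    """Normalized Compression: compressed bits per maximum-entropy bit."""
    if not seq:
        raise ValueError("sequence must not be empty")
    bits = compressed_bits(seq.encode("ascii"), level)
    return bits / (len(seq) * math.log2(alphabet_size))


def ncd(x: str, y: str, level: int = 9) -> float:
    """Normalized Compression Distance between two sequences."""
    cx = compressed_bits(x.encode("ascii"), level)
    cy = compressed_bits(y.encode("ascii"), level)
    cxy = compressed_bits((x + y).encode("ascii"), level)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic data for illustration only (fixed seed for repeatability).
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(5000))
mutated = list(reference)
for i in rng.sample(range(len(mutated)), 50):
    mutated[i] = rng.choice("ACGT")  # point substitutions
near = "".join(mutated)
unrelated = "".join(rng.choice("ACGT") for _ in range(5000))

print("NC  repetitive:", round(nc("A" * 5000), 3))
print("NC  random    :", round(nc(reference), 3))
print("NCD near      :", round(ncd(reference, near), 3))
print("NCD unrelated :", round(ncd(reference, unrelated), 3))
```

Low NC indicates a redundant record and low NCD indicates a record close to the reference. A general-purpose compressor such as zlib can give NC values above 1 on random DNA and only sees similarity within its 32 KB window, so the numbers from this sketch are not comparable to AltaiR's output.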
Create a dedicated environment and install from Bioconda:
mamba create -n altair -c conda-forge -c bioconda altair-mf
conda activate altair
AltaiR -h
To install into an existing environment:
conda install -y -c bioconda altair-mf
Requirements for building from source: cmake, git, and a C compiler toolchain.
sudo apt-get install -y cmake git build-essential
git clone https://github.com/cobilab/altair.git
cd altair
cmake -S src -B build
cmake --build build -j
./build/AltaiR -h
Alternative in-tree build (minimal setups):
cd altair/src
cmake .
make
Some scripts in pipelines/ require the GTO toolkit.
Conda:
conda install -c cobilab gto --yes
Manual:
git clone https://github.com/cobilab/gto.git
cd gto/src/
make
export PATH="$HOME/gto/bin:$PATH"
Quickstart:
AltaiR -h
AltaiR average -h
AltaiR filter -h
AltaiR frequency -h
AltaiR nc -h
AltaiR ncd -h
AltaiR raw -h
Top-level help:
AltaiR
# or
AltaiR -h
Per-subcommand help:
AltaiR average -h
AltaiR filter -h
AltaiR frequency -h
AltaiR nc -h
AltaiR ncd -h
AltaiR raw -h
The pipeline steps assume AltaiR is compiled and that you are working under pipelines/.
If you built with the in-tree method:
cp ../src/AltaiR .
If you built with the out-of-tree build/ directory:
cp ../build/AltaiR .
Some steps require python3, bash, and (optionally) gto (see “Additional tools”).
python3 Histogram.py
bash Filter.sh 29885 29921

bash Simulation.sh
bash Similarity.sh ORIGINAL.fa
bash SimProfile.sh sim-data.csv 2 0 1.2
mv NCDProfilesim-data.csv.pdf NCD_P1.pdf

python3 tree.py sim-data.csv -N 50

bash ComplexitySars.sh
python3 CompProfileSars.py comp-data.csv sorted_output.fa 0.961 0.9617
mv NCProfilecomp-data.csv.pdf NC.pdf

bash FrequencySars.sh
python3 combine_freq_and_date.py
mv base_frequencies_plot.pdf Freq.pdf

bash RawSars.sh
python3 RawSarsProfile.py sorted_output.fa
mv relativeSingularityProfile.pdf RAWProfiles.pdf

If you use AltaiR in your research, please cite:
Silva, Jorge M., Armando J. Pinho, and Diogo Pratas. “AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data.” GigaScience 13 (2024): giae086.
DOI: 10.1093/gigascience/giae086
Please report bugs and feature requests via GitHub Issues in the project repository (https://github.com/cobilab/altair).
AltaiR is licensed under GNU GPL v3. See LICENSE. More information: http://www.gnu.org/licenses/gpl-3.0.html
