AltaiR: alignment-free and temporal analysis of multi-FASTA data (C toolkit)

✨ What is AltaiR?

AltaiR is a fast, alignment-free toolkit for temporal analysis and characterization of multi-FASTA datasets, targeting large-scale collections such as genomes and proteomes.

It is particularly useful when many sequences are collected over time (e.g., epidemic/pandemic datasets) and alignment-based workflows are slow, brittle, or unnecessary for the desired analyses. AltaiR is implemented in multi-threaded C, is organized into independent analysis modules, and its core toolkit is designed to run without external dependencies. It accepts any sequence(s) in (multi-)FASTA format (DNA, RNA, or protein).

✅ Highlights

⚡ High speed (multi-threaded C implementation)
🧩 High flexibility (multiple independent analysis modules)
🧬 Alignment-free methods (compression-based and word-based analyses)
📦 No external dependencies for the core toolkit
🗂️ Works with any (multi-)FASTA input (DNA/RNA/protein)

📌 Contents

🧰 Commands
⚙️ Installation
🚀 Quickstart
🧾 Help and parameters
🧪 Reproducing experiments (pipelines)
📖 Citation
🐞 Issues
📜 License

🧰 Commands

AltaiR provides a single entry point (AltaiR) with six subcommands:

📉 average — moving average filter for a float column in a CSV file (column index is a parameter)
🧹 filter — filter FASTA records by alphabet, completeness, length, CG content, and presence/absence of string patterns
📊 frequency — compute alphabet frequencies per FASTA record (optionally with alphabet filtering)
🧮 nc — compute Normalized Compression (NC) per FASTA record (configurable compression level)
📐 ncd — compute Normalized Compression Distance (NCD) for each record relative to a reference
🧩 raw — compute Relative Absent Words (RAWs) with CG% estimation per RAW

Tip: each subcommand has its own -h/--help describing the expected inputs/outputs.
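
To illustrate what a moving average filter over a CSV column does, here is a minimal Python sketch. It is not AltaiR's implementation: the function names read_float_column and moving_average are hypothetical, and the window handling at the edges (a plain trailing window that yields fewer points than the input) is an assumption.

```python
import csv
import io


def read_float_column(csv_text, column, delimiter=","):
    """Return one zero-based column as floats, skipping rows that do not parse."""
    values = []
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        try:
            values.append(float(row[column]))
        except (IndexError, ValueError):
            continue  # header line or malformed row
    return values


def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window < 1:
        raise ValueError("window must be >= 1")
    if window > len(values):
        return []
    total = sum(values[:window])
    out = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        total += values[i] - values[i - window]
        out.append(total / window)
    return out


if __name__ == "__main__":
    demo = "name,nc\na,1.0\nb,2.0\nc,3.0\nd,4.0\n"
    print(moving_average(read_float_column(demo, 1), 2))  # [1.5, 2.5, 3.5]
```

Smoothing of this kind is what makes per-record values easier to read as a profile over time.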

⚙️ Installation

🐍 Option A — Conda (recommended)

Create a dedicated environment and install from Bioconda:

mamba create -n altair -c conda-forge -c bioconda altair-mf
conda activate altair
AltaiR -h

To install into an existing environment:

conda install -y -c bioconda altair-mf

🛠️ Option B — Build from source (CMake)

Requirements: cmake, git, and a C compiler toolchain.

sudo apt-get install -y cmake git build-essential
git clone https://github.com/cobilab/altair.git
cd altair
cmake -S src -B build
cmake --build build -j
./build/AltaiR -h

Alternative in-tree build (minimal setups), run from the directory that contains the clone:

cd altair/src
cmake .
make

🧩 Optional — Additional tools for pipeline scripts (GTO)

Some scripts in pipelines/ require the GTO toolkit.

Conda:

conda install -c cobilab gto --yes

Manual:

git clone https://github.com/cobilab/gto.git
cd gto/src/
make
# Assumes the repository was cloned into $HOME; adjust the path otherwise.
export PATH="$HOME/gto/bin:$PATH"

🚀 Quickstart

1) Sanity check

AltaiR -h

2) Discover available modules

AltaiR average -h
AltaiR filter -h
AltaiR frequency -h
AltaiR nc -h
AltaiR ncd -h
AltaiR raw -h

🧾 Help and parameters

Top-level help:

AltaiR
# or
AltaiR -h

Per-subcommand help:

AltaiR average -h
AltaiR filter -h
AltaiR frequency -h
AltaiR nc -h
AltaiR ncd -h
AltaiR raw -h

🧪 Reproducing experiments (pipelines)

The steps below assume that AltaiR is compiled and that you are working under pipelines/.

Make AltaiR available locally

If you built with the in-tree method:

cp ../src/AltaiR .

If you built with the out-of-tree build/ directory:

cp ../build/AltaiR .

The steps below need python3 and bash; some scripts also need gto (see “Optional — Additional tools for pipeline scripts (GTO)”).

🧹 Filtering sequences

python3 Histogram.py
bash Filter.sh 29885 29921
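
To make the filtering criteria concrete, here is a minimal Python sketch of a length-and-alphabet filter over (multi-)FASTA text. It is an illustration only and does not reproduce Filter.sh or the filter subcommand; the names parse_fasta and keep_record, and the default ACGT alphabet, are assumptions.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from (multi-)FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across several lines
    if header is not None:
        yield header, "".join(chunks)


def keep_record(seq, min_len, max_len, alphabet="ACGT"):
    """True if the length is in [min_len, max_len] and every symbol is in alphabet."""
    return min_len <= len(seq) <= max_len and set(seq.upper()) <= set(alphabet)


if __name__ == "__main__":
    demo = ">a\nACGT\nAC\n>b\nACNN\n"
    for name, seq in parse_fasta(demo):
        print(name, len(seq), keep_record(seq, 4, 10))
```

Records containing symbols outside the chosen alphabet (for example N) are rejected, which is one simple way to express a completeness requirement.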

🔗 Similarity profiles (NCD)

bash Simulation.sh
bash Similarity.sh ORIGINAL.fa
bash SimProfile.sh sim-data.csv 2 0 1.2
mv NCDProfilesim-data.csv.pdf NCD_P1.pdf
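
For readers unfamiliar with the measure, the Normalized Compression Distance is usually defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The Python sketch below uses zlib as a stand-in compressor. That choice is an assumption made for illustration: AltaiR uses its own compression models, so the values will differ from what the ncd subcommand reports.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two sequences (zlib stand-in)."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    rng = random.Random(0)
    reference = "".join(rng.choice("ACGT") for _ in range(2000))
    unrelated = "".join(rng.choice("ACGT") for _ in range(2000))
    print("reference vs itself:   ", round(ncd(reference, reference), 3))
    print("reference vs unrelated:", round(ncd(reference, unrelated), 3))
```

Values near 0 indicate that one sequence adds little information given the other, and values near 1 indicate unrelated sequences.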

🌳 Phylogenetic tree construction

python3 tree.py sim-data.csv -N 50

🧠 Complexity profiles (NC)

bash ComplexitySars.sh
python3 CompProfileSars.py comp-data.csv sorted_output.fa 0.961 0.9617
mv NCProfilecomp-data.csv.pdf NC.pdf
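
As a rough illustration of the quantity being profiled, Normalized Compression is commonly written as NC(x) = C(x) / (|x| log2 |A|), with C(x) the compressed size in bits and |A| the alphabet size. The Python sketch below uses zlib as a stand-in compressor, which is an assumption: AltaiR's own compression models will give different, typically lower, values.

```python
import math
import random
import zlib


def normalized_compression(seq: str, alphabet_size: int = 4) -> float:
    """Compressed size in bits divided by the size of a fixed-width encoding."""
    if not seq:
        raise ValueError("empty sequence")
    compressed_bits = 8 * len(zlib.compress(seq.encode("ascii"), 9))
    return compressed_bits / (len(seq) * math.log2(alphabet_size))


if __name__ == "__main__":
    rng = random.Random(0)
    repetitive = "A" * 2000
    shuffled = "".join(rng.choice("ACGT") for _ in range(2000))
    print("repetitive:", round(normalized_compression(repetitive), 3))
    print("random:    ", round(normalized_compression(shuffled), 3))
```

Lower values indicate more redundant (more compressible) records. With a general-purpose compressor the value can slightly exceed 1 for near-random input because of format overhead.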

📊 Frequency profiles

bash FrequencySars.sh
python3 combine_freq_and_date.py
mv base_frequencies_plot.pdf Freq.pdf
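
The per-record alphabet frequencies behind such a profile can be sketched in a few lines of Python. This is an illustration rather than the frequency subcommand itself: the name symbol_frequencies is hypothetical, and normalizing over the listed alphabet only (ignoring other symbols such as N) is an assumption.

```python
from collections import Counter


def symbol_frequencies(seq, alphabet="ACGT"):
    """Relative frequency of each alphabet symbol, ignoring all other symbols."""
    counts = Counter(seq.upper())
    total = sum(counts[s] for s in alphabet)
    if total == 0:
        return {s: 0.0 for s in alphabet}
    return {s: counts[s] / total for s in alphabet}


if __name__ == "__main__":
    print(symbol_frequencies("AACG"))
```

Computing one such dictionary per record and ordering the records by collection date gives the kind of series a frequency profile plots.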

🧩 Relative singularity (RAWs) profiles

bash RawSars.sh
python3 RawSarsProfile.py sorted_output.fa
mv relativeSingularityProfile.pdf RAWProfiles.pdf
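
One simple reading of a relative absent word is a word of length k that occurs in one sequence but never in another. The Python sketch below follows that reading and adds a CG% value per word. It is a simplified illustration under stated assumptions: the function names are hypothetical, the direction of the comparison (which sequence plays the reference) is assumed, and AltaiR's actual RAW computation may apply further constraints.

```python
def kmers(seq, k):
    """Set of all length-k words occurring in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def relative_absent_words(present_in, absent_from, k):
    """Words of length k found in present_in but not in absent_from."""
    return sorted(kmers(present_in, k) - kmers(absent_from, k))


def cg_percent(word):
    """Percentage of C and G symbols in a word."""
    return 100.0 * sum(c in "CG" for c in word.upper()) / len(word)


if __name__ == "__main__":
    for word in relative_absent_words("ACGTAC", "ACGTT", 3):
        print(word, round(cg_percent(word), 1))
```

Set difference over k-mers keeps the idea visible. For large genomes and longer k a memory-efficient index would be needed instead of Python sets.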

📖 Citation

If you use AltaiR in your research, please cite:

Silva, Jorge M., Armando J. Pinho, and Diogo Pratas. “AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data.” GigaScience 13 (2024): giae086.

DOI: 10.1093/gigascience/giae086

🐞 Issues

Please report bugs and feature requests via GitHub Issues:

https://github.com/cobilab/altair/issues

📜 License

AltaiR is licensed under GNU GPL v3. See LICENSE. More information: http://www.gnu.org/licenses/gpl-3.0.html
