AltaiR

AltaiR: alignment-free and temporal analysis of multi-FASTA data (C toolkit)


✨ What is AltaiR?

AltaiR is a fast, alignment-free toolkit for temporal analysis and characterization of multi-FASTA datasets, targeting large-scale collections such as genomes and proteomes.

It is particularly useful when many sequences are collected over time (e.g., epidemic/pandemic datasets) and alignment-based workflows are slow, brittle, or unnecessary for the desired analyses. AltaiR is implemented in multi-threaded C, is organized as independent analysis modules, and its core toolkit runs without external dependencies. It accepts any sequence or set of sequences in (multi-)FASTA format.

✅ Highlights

  • ⚡ High speed (multi-threaded C implementation)
  • 🧩 High flexibility (multiple independent analysis modules)
  • 🧬 Alignment-free methods (compression-based and word-based analyses)
  • 📦 No external dependencies for the core toolkit
  • 🗂️ Works with any (multi-)FASTA input (DNA/RNA/protein)

🧰 Commands

AltaiR provides a single entry point (AltaiR) with six subcommands:

  • 📉 average — moving average filter for a float column in a CSV file (column index is a parameter)
  • 🧹 filter — filter FASTA records by alphabet, completeness, length, CG content, presence/absence of string patterns
  • 📊 frequency — compute alphabet frequencies per FASTA record (optionally with alphabet filtering)
  • 🧮 nc — compute Normalized Compression (NC) per FASTA record (configurable compression level)
  • 📐 ncd — compute Normalized Compression Distance (NCD) for each record relative to a reference
  • 🧩 raw — compute Relative Absent Words (RAWs) with CG% estimation per RAW

Tip: each subcommand has its own -h/--help describing the expected inputs/outputs.
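
To make the idea behind the average subcommand concrete, here is a minimal Python sketch of a moving average over one float column of a CSV file. It is an illustration only, not AltaiR's implementation: the trailing window, the function names moving_average and smooth_csv_column, and the comma delimiter are all assumptions.

```python
import csv
import io


def moving_average(values, window):
    """Trailing moving average; early positions average over fewer values.

    Assumption: AltaiR's exact window shape is not described in this README,
    so a simple trailing window is used here for illustration.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def smooth_csv_column(text, column, window, delimiter=","):
    """Parse CSV text and smooth the float values found in one column."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    values = [float(row[column]) for row in rows]
    return moving_average(values, window)
```

The column index is zero-based in this sketch; check AltaiR average -h for the convention the real tool uses.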


⚙️ Installation

🐍 Option A — Conda (recommended)

Create a dedicated environment and install from Bioconda:

mamba create -n altair -c conda-forge -c bioconda altair-mf
conda activate altair
AltaiR -h

To install into an existing environment:

conda install -y -c bioconda altair-mf

🛠️ Option B — Build from source (CMake)

Requirements: cmake, git, and a C compiler toolchain.

sudo apt-get install -y cmake git build-essential
git clone https://github.com/cobilab/altair.git
cd altair
cmake -S src -B build
cmake --build build -j
./build/AltaiR -h

Alternative in-tree build (minimal setups):

cd altair/src
cmake .
make

🧩 Optional — Additional tools for pipeline scripts (GTO)

Some scripts in pipelines/ require the GTO toolkit.

Conda:

conda install -c cobilab gto --yes

Manual:

git clone https://github.com/cobilab/gto.git
cd gto/src/
make
export PATH="$HOME/gto/bin:$PATH"

🚀 Quickstart

1) Sanity check

AltaiR -h

2) Discover available modules

AltaiR average -h
AltaiR filter -h
AltaiR frequency -h
AltaiR nc -h
AltaiR ncd -h
AltaiR raw -h

🧾 Help and parameters

Top-level help:

AltaiR
# or
AltaiR -h

Per-subcommand help:

AltaiR average -h
AltaiR filter -h
AltaiR frequency -h
AltaiR nc -h
AltaiR ncd -h
AltaiR raw -h

🧪 Reproducing experiments (pipelines)

The steps below assume that AltaiR is already compiled and that you are working in the pipelines/ directory.

Make AltaiR available locally

If you built with the in-tree method:

cp ../src/AltaiR .

If you built with the out-of-tree build/ directory:

cp ../build/AltaiR .

Some steps require python3, bash, and (optionally) gto (see “Additional tools”).

🧹 Filtering sequences

python3 Histogram.py
bash Filter.sh 29885 29921
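
As a rough illustration of the kind of record filtering this step performs, the following Python sketch keeps FASTA records whose length falls in a range and whose symbols all belong to a given alphabet. The README does not say what the two arguments to Filter.sh mean; treating them as a minimum and maximum length is an assumption, and read_fasta and filter_records are hypothetical helper names, not part of AltaiR.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_records(records, min_len, max_len, alphabet="ACGT"):
    """Keep records within [min_len, max_len] that use only `alphabet`.

    Assumption: these criteria only mirror the description of the filter
    subcommand; the real tool's options may differ.
    """
    allowed = set(alphabet)
    for header, seq in records:
        if min_len <= len(seq) <= max_len and set(seq.upper()) <= allowed:
            yield header, seq
```

With a file on disk, this could be used as filter_records(read_fasta(open("input.fa")), 29885, 29921), where input.fa is a placeholder name.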

🔗 Similarity profiles (NCD)

bash Simulation.sh
bash Similarity.sh ORIGINAL.fa
bash SimProfile.sh sim-data.csv 2 0 1.2
mv NCDProfilesim-data.csv.pdf NCD_P1.pdf
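
For readers unfamiliar with the measure behind these profiles, the sketch below computes the standard Normalized Compression Distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), for each record against a reference. It uses Python's zlib as a stand-in compressor, which is an assumption made for illustration; AltaiR uses its own compression models, so values will not match. The names compressed_size, ncd, and ncd_profile are hypothetical.

```python
import zlib


def compressed_size(data: bytes, level: int = 9) -> int:
    """Size in bytes of the zlib-compressed data (stand-in for C(.))."""
    return len(zlib.compress(data, level))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two sequences."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_profile(reference, records):
    """Return (header, NCD to reference) for each (header, sequence) record."""
    return [(header, ncd(reference, seq)) for header, seq in records]
```

Values near 0 indicate records similar to the reference and values near 1 indicate dissimilar ones. Because zlib only looks back 32 KiB, this stand-in is meaningful only for short sequences.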

🌳 Phylogenetic tree construction

python3 tree.py sim-data.csv -N 50

🧠 Complexity profiles (NC)

bash ComplexitySars.sh
python3 CompProfileSars.py comp-data.csv sorted_output.fa 0.961 0.9617
mv NCProfilecomp-data.csv.pdf NC.pdf
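
The quantity plotted in a complexity profile can be sketched as follows, using the usual definition NC(x) = C(x) / (|x| log2 |A|), where C(x) is the compressed size in bits and |A| is the alphabet size. As a loudly stated assumption, zlib replaces AltaiR's own compressor here, so the numbers are only indicative; the function name nc is hypothetical.

```python
import math
import zlib


def nc(seq: str, alphabet_size: int = 4, level: int = 9) -> float:
    """Normalized Compression of one sequence.

    Assumption: zlib stands in for the compressor; `level` only loosely
    mirrors the configurable compression level mentioned for `AltaiR nc`.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    compressed_bits = 8 * len(zlib.compress(seq.encode("ascii"), level))
    return compressed_bits / (len(seq) * math.log2(alphabet_size))
```

Highly repetitive sequences give values close to 0, while sequences the compressor cannot model give values around 1 (a general-purpose compressor can exceed 1 slightly because of its overhead).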

📊 Frequency profiles

bash FrequencySars.sh
python3 combine_freq_and_date.py
mv base_frequencies_plot.pdf Freq.pdf
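
What a per-record frequency computation looks like can be shown with a few lines of Python. This is a sketch of the general idea only; the function name symbol_frequencies and the choice to ignore symbols outside the alphabet are assumptions, not a description of AltaiR's output format.

```python
from collections import Counter


def symbol_frequencies(seq: str, alphabet: str = "ACGT") -> dict:
    """Relative frequency of each alphabet symbol in one sequence.

    Symbols outside `alphabet` (for example N) are ignored, which is one
    possible reading of "alphabet filtering".
    """
    counts = Counter(s for s in seq.upper() if s in alphabet)
    total = sum(counts.values())
    return {s: (counts[s] / total if total else 0.0) for s in alphabet}
```

Applying this to every record of a date-sorted multi-FASTA file gives one row of frequencies per record, which is the kind of table a frequency profile is plotted from.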

🧩 Relative singularity (RAWs) profiles

bash RawSars.sh
python3 RawSarsProfile.py sorted_output.fa
mv relativeSingularityProfile.pdf RAWProfiles.pdf
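
To illustrate the notion of relative absent words, the sketch below lists the words of length k that occur in one sequence but not in another, and computes the CG percentage of a word. Which sequence plays the role of reference in AltaiR is not stated in this README, so the parameters are named present_in and absent_from to keep the direction explicit; all names and the fixed word length k are assumptions.

```python
def kmers(seq: str, k: int) -> set:
    """Set of all words of length k occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def relative_absent_words(present_in: str, absent_from: str, k: int) -> list:
    """Sorted k-mers that occur in `present_in` but never in `absent_from`."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return sorted(kmers(present_in, k) - kmers(absent_from, k))


def cg_percent(word: str) -> float:
    """Percentage of C and G symbols in a word."""
    if not word:
        raise ValueError("word must not be empty")
    return 100.0 * (word.count("C") + word.count("G")) / len(word)
```

Storing every k-mer in a set is fine for small examples but does not scale to large references; it is meant only to show the definition.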

📖 Citation

If you use AltaiR in your research, please cite:

Silva, Jorge M., Armando J. Pinho, and Diogo Pratas. “AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data.” GigaScience 13 (2024): giae086.

DOI: 10.1093/gigascience/giae086


🐞 Issues

Please report bugs and feature requests via GitHub Issues: https://github.com/cobilab/altair/issues


📜 License

AltaiR is licensed under GNU GPL v3. See LICENSE. More information: http://www.gnu.org/licenses/gpl-3.0.html
