openbioseq/seqx

seqx

seqx is an agent-friendly CLI for FASTA/FASTQ sequence processing.

It is designed around streaming I/O, predictable command behavior, and low-memory execution for large files.

Installation

PyPI

pip install seqx

cargo

cargo install seqx

prebuilt binaries

Prebuilt binaries for Linux and macOS are available on the releases page.

Quick Start

# Show help
seqx --help

# Show guide (agent-friendly help)
seqx guide
seqx guide filter

# Basic stats
seqx stats -i input.fa

# Convert FASTA -> FASTQ
seqx convert -i input.fa -T fastq -o output.fq

# Filter short sequences
seqx filter -i input.fa --min-len 100 -o filtered.fa

Commands

stats

seqx stats -i input.fa
seqx stats -i input.fa --gc
seqx stats -i input.fq --qual --min-len 50

convert

seqx convert -i input.fa -T fastq -Q 30 -o output.fq
seqx convert -i input.fq -T fasta -o output.fa

filter

seqx filter -i input.fa --min-len 100 --max-len 2000
seqx filter -i input.fa --pattern "ATG.*TAA"
seqx filter -i input.fa --exclude-pattern "N{10,}"
seqx filter -i input.fa --id-file ids.txt
seqx filter -i input.fq --min-qual 30

extract

seqx extract -i input.fa --id seq1
seqx extract -i input.fa --id-file ids.txt
seqx extract -i input.fa --range 1:100
seqx extract -i input.fa --bed regions.bed -F 20

search

seqx search -i input.fa "ATG"
seqx search -i input.fa "ATG.*TAA" --regex
seqx search -i input.fa "ATG" --mismatches 1 --threads 8
seqx search -i input.fa "ATG" --bed --strand

modify

seqx modify -i input.fa --upper
seqx modify -i input.fa --lower
seqx modify -i input.fa --slice 10:200
seqx modify -i input.fa --remove-gaps
seqx modify -i input.fa --reverse-complement

sample

seqx sample -i input.fa --count 1000 --seed 42
seqx sample -i input.fa --fraction 0.1

sort

seqx sort -i input.fa --by-name
seqx sort -i input.fa --by-len --desc
seqx sort -i input.fa --by-gc --max-memory 256 --threads 8

dedup

seqx dedup -i input.fa
seqx dedup -i input.fa --by-id
seqx dedup -i input.fa --prefix 12 --ignore-case
seqx dedup -i input.fa --buckets 256 --threads 8

merge

seqx merge a.fa b.fa c.fa -o merged.fa
seqx merge a.fa b.fa c.fa --add-prefix --sep ":" -o merged_with_source.fa

split

seqx split -i input.fa --parts 10 -o out_dir
seqx split -i input.fa --chunk-size 1000 -o out_dir
seqx split -i input.fa --by-id -o out_dir --prefix seq

compress

# Compress using pigz if available, otherwise built-in
seqx compress -i input.fa
seqx compress -i input.fa -o output.fa.gz -l 9

# Decompress
seqx compress -d -i input.fa.gz
seqx compress -d -i input.fa.gz -o output.fa

# Use stdin/stdout
cat input.fa | seqx compress > output.fa.gz
cat input.fa.gz | seqx compress -d > output.fa

# Force built-in implementation
seqx compress -i input.fa --no-pigz

guide

# List all commands
seqx guide

# Show detailed help for a specific command
seqx guide filter
seqx guide compress

# Output in JSON format (for programmatic use)
seqx guide --format json
seqx guide filter --format json

# Output in Markdown format
seqx guide --format markdown
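
For programmatic use, the JSON guide can be read from a script. The following is a minimal Python sketch, assuming seqx is on PATH and that `seqx guide --format json` prints a single JSON document to stdout. The schema is not described here, so the result is treated as an opaque value. `load_guide` and its `run` parameter are hypothetical names used only for illustration.

```python
import json
import subprocess


def load_guide(command=None, run=subprocess.run):
    """Return the parsed output of `seqx guide [command] --format json`.

    `run` is injectable so the function can be exercised without seqx
    installed; by default it is subprocess.run.
    """
    args = ["seqx", "guide"]
    if command is not None:
        args.append(command)          # e.g. "filter" or "compress"
    args += ["--format", "json"]
    # check=True raises CalledProcessError on a non-zero exit status.
    result = run(args, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Calling `load_guide()` corresponds to `seqx guide --format json`, and `load_guide("filter")` to `seqx guide filter --format json`.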

Behavior Notes

  • Input defaults to stdin where supported.
  • Output defaults to stdout where supported.
  • Format detection is extension-based (.fa/.fasta/.fq/.fastq, optional .gz).
  • FASTA/FASTQ parsing uses noodles.
  • extract currently supports FASTA extraction only.
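
Extension-based detection, as described in the notes above, can be sketched in a few lines of Python. This is an illustration rather than seqx's actual code: the function name `detect_format`, the case-insensitive matching, and the `None` result for unknown extensions are all assumptions.

```python
from pathlib import Path

FASTA_EXTS = {".fa", ".fasta"}
FASTQ_EXTS = {".fq", ".fastq"}


def detect_format(path):
    """Return (format, is_gzipped) using only the file name.

    format is "fasta", "fastq", or None when the extension is not
    recognized (assumed behavior; the tool may report an error instead).
    """
    suffixes = [s.lower() for s in Path(path).suffixes]
    # An optional trailing .gz marks a compressed file.
    gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if gzipped:
        suffixes = suffixes[:-1]
    ext = suffixes[-1] if suffixes else ""
    if ext in FASTA_EXTS:
        return "fasta", gzipped
    if ext in FASTQ_EXTS:
        return "fastq", gzipped
    return None, gzipped
```

Because only the name is inspected, a FASTQ file saved as `reads.txt` would not be recognized under this scheme, and data arriving on stdin has no extension to inspect at all.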

Nucleotide vs Protein Behavior

  • Protein FASTA records are supported by all commands.
  • Nucleotide-only operations are explicitly guarded:
    • filter --gc-min/--gc-max
    • modify --reverse-complement
    • reverse-complement matching in search (enabled only when both record and pattern are nucleotide)
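
The guard idea above can be illustrated with a small Python sketch. How seqx decides that a record is nucleotide is not documented here, so the alphabet check below (A/C/G/T/N in either case) is an assumption, and all function names are hypothetical.

```python
# Assumed nucleotide alphabet; the real tool may accept more IUPAC codes.
NUCLEOTIDE_ALPHABET = set("ACGTNacgtn")
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")  # N maps to itself


def is_nucleotide(seq):
    """True if seq is non-empty and uses only the assumed alphabet."""
    return len(seq) > 0 and set(seq) <= NUCLEOTIDE_ALPHABET


def reverse_complement(seq):
    """Reverse-complement seq, refusing input that looks like protein."""
    if not is_nucleotide(seq):
        raise ValueError("reverse complement requires a nucleotide sequence")
    return seq.translate(_COMPLEMENT)[::-1]


def can_search_reverse_strand(record_seq, pattern):
    """Reverse-strand matching only when both sides are nucleotide."""
    return is_nucleotide(record_seq) and is_nucleotide(pattern)
```

A purely alphabet-based check is ambiguous for short inputs: a peptide such as `ACG` (Ala-Cys-Gly) would pass it, which is one reason a real implementation may use a stricter or sampled heuristic.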

Performance Model

  • sort: external chunk sort + mmap merge, configurable with --max-memory and --threads.
  • dedup: disk bucket partitioning + per-bucket dedup + stable merge, configurable with --buckets and --threads.
  • split --parts: two-pass streaming split (stdin may be materialized to a temp file).
  • compress: uses pigz if available, otherwise uses gzp (parallel gzip in Rust) with automatic thread detection.
  • Temp binary record paths use packed_seq_io (2-bit packing for A/C/G/T when applicable).
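
To make the 2-bit packing idea concrete, here is a minimal Python sketch that stores four bases per byte. It is not the packed_seq_io format: the bit layout (first base in the low bits), the `None` result for sequences containing anything other than uppercase A/C/G/T, and the names `pack_2bit` and `unpack_2bit` are assumptions for illustration.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack_2bit(seq):
    """Pack an A/C/G/T string into (length, bytes), four bases per byte.

    Returns None when packing is not applicable, i.e. the sequence
    contains any other symbol (N, lowercase, amino acids, ...).
    """
    if any(base not in _CODE for base in seq):
        return None
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Base i occupies bits 2*(i % 4) .. 2*(i % 4) + 1 of byte i // 4.
        out[i // 4] |= _CODE[base] << (2 * (i % 4))
    return len(seq), bytes(out)


def unpack_2bit(length, data):
    """Inverse of pack_2bit; length is needed to drop padding bits."""
    return "".join(
        _BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )
```

The length is stored next to the bytes because the last byte may be only partly used. Under this layout a 200-base read takes 50 bytes instead of 200, which is where the saving for temp files would come from.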

Bench Script

./scripts/bench_packed_io.sh

# Custom workload
N_RECORDS=1000000 SEQ_LEN=200 DUP_RATE=40 ./scripts/bench_packed_io.sh

Developer Docs

License

MIT
