Encoderfile packages transformer encoders (optionally with classification heads) into a single, self-contained executable. No Python runtime, no dependencies, no network calls. Just a fast, portable binary that runs anywhere.
While Llamafile focuses on generative models, Encoderfile is purpose-built for encoder architectures with optional classification heads. It supports embedding, sequence classification, and token classification models, covering most encoder-based NLP tasks (from text similarity to classification and tagging), all within one compact binary.
Under the hood, Encoderfile uses ONNX Runtime for inference, ensuring compatibility with a wide range of transformer architectures.
Why?
- Smaller footprint: a single binary measured in tens-to-hundreds of megabytes, not gigabytes of runtime and packages
- Compliance-friendly: deterministic, offline, security-boundary-safe
- Integration-ready: drop into existing systems as a CLI, microservice, or API without refactoring your stack
Encoderfiles can run as:
- REST API
- gRPC microservice
- CLI for batch processing
- MCP server (Model Context Protocol)
flowchart LR
%% Styling
classDef asset fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000;
classDef tool fill:#fff8e1,stroke:#ff6f00,stroke-width:2px,stroke-dasharray: 5 5,color:#000;
classDef process fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000;
classDef artifact fill:#f5f5f5,stroke:#616161,stroke-width:2px,color:#000;
classDef service fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,color:#000;
classDef client fill:#e3f2fd,stroke:#0277bd,stroke-width:2px,stroke-dasharray: 5 5,color:#000;
subgraph Inputs ["1. Input Assets"]
direction TB
Onnx["ONNX Model<br/>(.onnx)"]:::asset
Tok["Tokenizer Data<br/>(tokenizer.json)"]:::asset
Config["Runtime Config<br/>(config.yml)"]:::asset
end
style Inputs fill:#e3f2fd,stroke:#0277bd,stroke-width:2px,stroke-dasharray: 5 5,color:#01579b
subgraph Compile ["2. Compile Phase"]
Compiler["Encoderfile Compiler<br/>(CLI Tool)"]:::asset
end
style Compile fill:#e3f2fd,stroke:#0277bd,stroke-width:2px,stroke-dasharray: 5 5,color:#01579b
subgraph Build ["3. Build Phase"]
direction TB
Builder["Wrapper Process<br/>(Embeds Assets + Runtime)"]:::process
end
style Build fill:#fff8e1,stroke:#ff8f00,stroke-width:2px,color:#e65100
subgraph Output ["4. Artifact"]
Binary["Single Binary Executable<br/>(Static File)"]:::artifact
end
style Output fill:#fafafa,stroke:#546e7a,stroke-width:2px,stroke-dasharray: 5 5,color:#546e7a
subgraph Runtime ["5. Runtime Phase"]
direction TB
%% Added fa:fa-server icons
Grpc["fa:fa-server gRPC Server<br/>(Protobuf)"]:::service
Http["fa:fa-server HTTP Server<br/>(JSON)"]:::service
MCP["fa:fa-server MCP Server<br/>(MCP)"]:::service
%% Added fa:fa-cloud icon
Client["fa:fa-cloud Client Apps /<br/>MCP Agent"]:::client
end
style Runtime fill:#f1f8e9,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
%% Connections
Onnx & Tok & Config --> Builder
Compiler -.->|"Orchestrates"| Builder
Builder -->|"Outputs"| Binary
%% Runtime Connections
Binary -.->|"Executes"| Grpc
Binary -.->|"Executes"| Http
Binary -.->|"Executes"| MCP
Grpc & Http & MCP -->|"Responds to"| Client
Encoderfile supports the following Hugging Face model classes (and their ONNX-exported equivalents):
| Task | Supported classes | Example models |
|---|---|---|
| Embeddings / Feature Extraction | AutoModel, AutoModelForMaskedLM | bert-base-uncased, distilbert-base-uncased |
| Sequence Classification | AutoModelForSequenceClassification | distilbert-base-uncased-finetuned-sst-2-english, roberta-large-mnli |
| Token Classification | AutoModelForTokenClassification | dslim/bert-base-NER, bert-base-cased-finetuned-conll03-english |
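Each of these task types corresponds to an optimum-cli export task. As a rough illustration (not part of the Encoderfile tooling itself), exports for the three rows above might look like the following, with output directory names chosen arbitrarily:

```bash
# Illustrative ONNX exports, one per supported task type (output directories are placeholders)
optimum-cli export onnx --model bert-base-uncased --task feature-extraction ./embedding-model
optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english --task text-classification ./sentiment-model
optimum-cli export onnx --model dslim/bert-base-NER --task token-classification ./ner-model
```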
- All architectures must be encoder-only transformers: no decoders, no encoder-decoder hybrids (so no T5, no BART).
- Models must have ONNX-exported weights (e.g. path/to/your/model/model.onnx).
- The ONNX graph input must include input_ids and, optionally, attention_mask (a quick way to check this is sketched below).
- Models relying on generation heads (AutoModelForSeq2SeqLM, AutoModelForCausalLM, etc.) are not supported.
- XLNet, Transformer-XL, and derivative architectures are not yet supported.
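Before building, you can roughly sanity-check an exported model against these constraints. The sketch below is not an official tool; it assumes the onnx Python package is installed and uses ./sentiment-model as a placeholder for your export directory:

```bash
# The ONNX weights and tokenizer data must exist
ls ./sentiment-model/model.onnx ./sentiment-model/tokenizer.json

# Print the ONNX graph inputs; input_ids must be present, attention_mask is optional
python -c "import onnx; print([i.name for i in onnx.load('./sentiment-model/model.onnx').graph.input])"
```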
Download the encoderfile CLI tool to build your own model binaries:
curl -fsSL https://raw.githubusercontent.com/mozilla-ai/encoderfile/main/install.sh | sh
chmod +x encoderfile

Note for Windows users: Pre-built binaries are not available for Windows. Please see BUILDING.md for instructions on building from source.
Move the binary to a location in your PATH:
# Linux/macOS
sudo mv encoderfile /usr/local/bin/
# Or add to your user bin
mkdir -p ~/.local/bin
mv encoderfile ~/.local/bin/
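To confirm the binary is on your PATH and runs:

```bash
which encoderfile
encoderfile --help
```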
See BUILDING.md for detailed instructions on building the CLI tool from source.

Quick build:
cargo build --bin encoderfile --release
./target/release/encoderfile --help

First, you need an ONNX-exported model. Export any HuggingFace model:
# Install optimum for ONNX export
pip install optimum[exporters]
# Export a sentiment analysis model
optimum-cli export onnx \
--model distilbert-base-uncased-finetuned-sst-2-english \
--task text-classification \
./sentiment-model
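If the export succeeds, the output directory contains the ONNX weights plus tokenizer and config files; the listing below is indicative only and varies by model and exporter version:

```bash
ls ./sentiment-model
# Typically something like:
# config.json  model.onnx  special_tokens_map.json  tokenizer.json  tokenizer_config.json  vocab.txt
```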
Create sentiment-config.yml:

encoderfile:
  name: sentiment-analyzer
  path: ./sentiment-model
  model_type: sequence_classification
  output_path: ./build/sentiment-analyzer.encoderfile

Use the downloaded encoderfile CLI tool:
encoderfile build -f sentiment-config.yml

This creates a self-contained binary at ./build/sentiment-analyzer.encoderfile.
Start the server:
./build/sentiment-analyzer.encoderfile serve

The server will start on http://localhost:8080 by default.
Sentiment Analysis:
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{
"inputs": [
"This is the cutest cat ever!",
"Boring video, waste of time",
"These cats are so funny!"
]
}'

Response:
{
"results": [
{
"logits": [0.00021549065, 0.9997845],
"scores": [0.00021549074, 0.9997845],
"predicted_index": 1,
"predicted_label": "POSITIVE"
},
{
"logits": [0.9998148, 0.00018516644],
"scores": [0.9998148, 0.0001851664],
"predicted_index": 0,
"predicted_label": "NEGATIVE"
},
{
"logits": [0.00014975034, 0.9998503],
"scores": [0.00014975043, 0.9998503],
"predicted_index": 1,
"predicted_label": "POSITIVE"
}
],
"model_id": "sentiment-analyzer"
}
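Given the response shape above, a small follow-up sketch (assuming jq is installed) extracts just the predicted labels:

```bash
curl -s -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["This is the cutest cat ever!", "Boring video, waste of time"]}' \
  | jq -r '.results[].predicted_label'
# POSITIVE
# NEGATIVE
```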
Embeddings:

curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{
"inputs": ["Hello world"],
"normalize": true
}'

Token Classification (NER):
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{
"inputs": ["Apple Inc. is located in Cupertino, California"]
}'

Start an HTTP server (default port 8080):
./my-model.encoderfile serve

Custom configuration:
./my-model.encoderfile serve \
--http-port 3000 \
--http-hostname 0.0.0.0

Disable gRPC (HTTP only):
./my-model.encoderfile serve --disable-grpc

Start with default gRPC server (port 50051):
./my-model.encoderfile serve

gRPC only (no HTTP):
./my-model.encoderfile serve --disable-http

Custom gRPC configuration:
./my-model.encoderfile serve \
--grpc-port 50052 \
--grpc-hostname localhost
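If the server exposes gRPC reflection (an assumption, not something this README confirms), a client like grpcurl can discover the available services; otherwise point grpcurl at the project's .proto definitions:

```bash
# Requires grpcurl; assumes server reflection is enabled
grpcurl -plaintext localhost:50051 list
```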
Run one-off inference without starting a server:

# Single input
./my-model.encoderfile infer "This is a test sentence"
# Multiple inputs
./my-model.encoderfile infer "First text" "Second text" "Third text"
# Save output to file
./my-model.encoderfile infer "Test input" -o results.jsonRun as a Model Context Protocol server:
./my-model.encoderfile mcp --hostname 0.0.0.0 --port 9100

# Custom HTTP port
./my-model.encoderfile serve --http-port 3000
# Custom gRPC port
./my-model.encoderfile serve --grpc-port 50052
# Both
./my-model.encoderfile serve --http-port 3000 --grpc-port 50052

# Custom hostnames
./my-model.encoderfile serve \
--http-hostname 127.0.0.1 \
--grpc-hostname localhost

# HTTP only
./my-model.encoderfile serve --disable-grpc
# gRPC only
./my-model.encoderfile serve --disable-http

- Getting Started Guide - Step-by-step tutorial
- Building Guide - Build encoderfiles from ONNX models
- CLI Reference - Complete command-line documentation
- API Reference - REST, gRPC, and MCP API docs
Once you have the encoderfile CLI tool installed, you can build binaries from any compatible HuggingFace model.
See BUILDING.md for detailed instructions including:
- How to export models to ONNX format
- Configuration file options
- Advanced features (Lua transforms, custom paths, etc.)
- Troubleshooting tips
Quick workflow:
- Export your model to ONNX: optimum-cli export onnx ...
- Create a config file: config.yml
- Build the binary: encoderfile build -f config.yml
- Deploy anywhere: ./build/my-model.encoderfile serve
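Strung together, the workflow is only a handful of commands. The sketch below simply reuses the sentiment-analysis example from the Quick Start above:

```bash
# 1. Export the model to ONNX
pip install "optimum[exporters]"
optimum-cli export onnx \
  --model distilbert-base-uncased-finetuned-sst-2-english \
  --task text-classification \
  ./sentiment-model

# 2. Create the config file
cat > sentiment-config.yml <<'EOF'
encoderfile:
  name: sentiment-analyzer
  path: ./sentiment-model
  model_type: sequence_classification
  output_path: ./build/sentiment-analyzer.encoderfile
EOF

# 3. Build the self-contained binary
encoderfile build -f sentiment-config.yml

# 4. Deploy: serve over HTTP/gRPC
./build/sentiment-analyzer.encoderfile serve
```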
We welcome contributions! See CONTRIBUTING.md for guidelines.
# Clone the repository
git clone https://github.com/mozilla-ai/encoderfile.git
cd encoderfile
# Set up development environment
make setup
# Run tests
make test
# Build documentation
make docs-serve

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Built with ONNX Runtime
- Inspired by Llamafile
- Powered by the Hugging Face model ecosystem
- Discord - Join our community
- GitHub Issues - Report bugs or request features
- GitHub Discussions - Ask questions and share ideas