LLMc is a language-model–powered compressor for natural language text. It encodes token ranks instead of raw token IDs and stores them in a compact binary format.
The core idea of LLMc is rank-based encoding. During inference, the LLM produces a probability distribution over possible next tokens, and in most cases the true next token ranks among the top few candidates. Instead of storing the token's identity, LLMc stores its rank within that distribution. These ranks are usually small integers and are therefore cheap to encode.
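As a rough illustration, the sketch below computes ranks with a Hugging Face causal LM. The `transformers` API and the argsort-based ranking here are illustrative assumptions, not the LLMc implementation, which runs on the project's modified vLLM backend.

```python
# Minimal sketch of rank-based encoding, assuming a Hugging Face causal LM.
# The model name matches the README examples; any causal LM would do.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def encode_ranks(text: str) -> list[int]:
    """Replace each token ID with its rank under the model's next-token distribution."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]  # (seq_len, vocab)
    ranks = [ids[0].item()]  # no context for the first token: keep its raw ID
    for pos in range(1, len(ids)):
        # Candidates for position `pos`, sorted by the model's predicted probability.
        order = torch.argsort(logits[pos - 1], descending=True)
        rank = (order == ids[pos]).nonzero(as_tuple=True)[0].item()
        ranks.append(rank)  # small for predictable text, hence compressible
    return ranks
```

Decompression reverses the mapping: rerun the model on the already-decoded prefix, sort the candidates the same way, and pick the token at the stored rank. This only works if both sides see bitwise-identical logits, which is presumably why the backend relies on batch_invariant_ops.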
(Demo video: llm-compression.1.mp4)
We recommend using uv to manage the virtual environment.
```bash
uv venv -p 3.11
# We use a modified version of vLLM and batch_invariant_ops as backend.
git submodule update --init --recursive --depth 1
export VLLM_USE_PRECOMPILED=1
uv sync
```

The CLI exposes three subcommands. The examples below mirror the arguments defined in llmc/entrypoint/cli.py.
Write a .llmc binary (varint+brotli):
```bash
llmc compress input.txt output.llmc \
  --model Qwen/Qwen3-8B \
  --threshold 256 \
  --chunk-size 4096 \
  --gpu-mem 0.5
```

Notes:

- `--threshold` and `--chunk-size` are required.
- `output.llmc` is a raw byte stream.
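For intuition, here is a minimal sketch of the "varint + brotli" idea: ranks are packed as variable-length integers, and the resulting byte stream is Brotli-compressed. The exact `.llmc` layout (header, chunk boundaries, how ranks beyond the threshold are handled) is defined by LLMc and is not reproduced here.

```python
# Sketch of a varint + Brotli payload: LEB128-style varints for the ranks,
# then Brotli over the whole byte stream. Not the actual .llmc layout.
import brotli  # pip install brotli

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def pack_ranks(ranks: list[int]) -> bytes:
    """Small ranks become single bytes, so the stream compresses well."""
    payload = b"".join(encode_varint(r) for r in ranks)
    return brotli.compress(payload)
```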
Turn a .llmc file back into text:
```bash
llmc decompress output.llmc restored.txt \
  --model Qwen/Qwen3-8B \
  --threshold 256 \
  --chunk-size 4096 \
  --gpu-mem 0.5
```

Run the FastAPI server and web frontend:
```bash
llmc serve \
  --model Qwen/Qwen3-8B \
  --max-threshold 256 \
  --max-chunk-size 4096 \
  --gpu-mem 0.5 \
  --host 0.0.0.0 --port 8000
```

The server enforces upper bounds via `--max-threshold` and `--max-chunk-size`. The web UI sends per-request `threshold` and `chunk_size` values that must be ≤ these limits.
Open http://<host>:8000 in a browser after starting the server, then enter Threshold and Chunk size for each operation.
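If you prefer to call the server without the web UI, a request along the following lines should work. The endpoint path and JSON field names below are assumptions for illustration; only the per-request `threshold` and `chunk_size` parameters are documented above, so check the FastAPI app in llmc/entrypoint for the actual routes and schema.

```python
# Hypothetical client call: the "/compress" route and field names are guesses
# for illustration; inspect the FastAPI app for the real schema.
import requests

resp = requests.post(
    "http://localhost:8000/compress",  # hypothetical endpoint
    json={
        "text": "hello world",
        "threshold": 256,    # must be <= --max-threshold
        "chunk_size": 4096,  # must be <= --max-chunk-size
    },
    timeout=300,
)
resp.raise_for_status()
```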
Compression ratio compared with various traditional compression algorithms.
- Built on a custom vLLM fork and batch_invariant_ops.
- Thanks to the open-source ecosystem for models and tooling.


