GitHub - microsoft/llm-42: Fast Deterministic LLM Inference

LLM-42: Enabling Determinism in LLM Inference with Verified Speculation

LLM-42 enables deterministic LLM inference via a decode–verify–rollback protocol, without rewriting GPU kernels. Built on SGLang v0.5.3.

Raja Gond‡, Aditya K Kamath†, Ramachandran Ramjee‡, Ashish Panwar‡
‡Microsoft Research †University of Washington

How it works

Standard LLM serving is non-deterministic: dynamic batching changes the order of GPU reductions, and because floating-point addition is not associative, the same prompt can produce different outputs across runs. LLM-42 fixes this with a lightweight verify-rollback loop:

1. **Decode**: generate tokens using fast, unmodified kernels with dynamic batching.
2. **Verify**: replay a window of tokens under a fixed-shape schedule to check consistency.
3. **Rollback**: on mismatch, discard inconsistent tokens and resume from the last verified position.
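The three steps above can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the LLM-42 implementation: `reference_token`, `fast_token`, `Request`, and `decode_verify_rollback` are hypothetical names, the "model" is simple integer arithmetic, and the occasional divergence of the fast path is injected with a random number generator.

```python
import random
from dataclasses import dataclass, field


def reference_token(context):
    """Toy stand-in for the fixed-shape verification pass (always the same answer)."""
    return (sum(context) * 31 + len(context)) % 100


def fast_token(context, rng, flip_prob):
    """Toy stand-in for fast decoding: usually matches the reference,
    but sometimes diverges, mimicking batch-dependent numerics."""
    token = reference_token(context)
    if rng.random() < flip_prob:
        token = (token + 1) % 100
    return token


@dataclass
class Request:
    prompt: list
    max_new_tokens: int
    is_deterministic: bool = True
    tokens: list = field(default_factory=list)
    verified: int = 0  # count of generated tokens that passed verification


def decode_verify_rollback(req, window_size=4, flip_prob=0.2, seed=0):
    rng = random.Random(seed)
    rollbacks = 0
    while len(req.tokens) < req.max_new_tokens:
        # Decode: append up to one window of tokens from the fast path.
        budget = min(window_size, req.max_new_tokens - len(req.tokens))
        for _ in range(budget):
            req.tokens.append(fast_token(req.prompt + req.tokens, rng, flip_prob))
        if not req.is_deterministic:
            req.verified = len(req.tokens)  # unverified requests skip the check
            continue
        # Verify: replay the unverified window against the reference path.
        for i in range(req.verified, len(req.tokens)):
            if req.tokens[i] != reference_token(req.prompt + req.tokens[:i]):
                # Rollback: drop the first mismatching token and all later ones.
                del req.tokens[i:]
                rollbacks += 1
                break
        req.verified = len(req.tokens)
    return req.tokens, rollbacks


tokens_a, _ = decode_verify_rollback(Request([1, 2, 3], 16), seed=1)
tokens_b, _ = decode_verify_rollback(Request([1, 2, 3], 16), seed=2)
print(tokens_a == tokens_b)  # True: verified output is independent of the noise seed
```

In this sketch, every token that survives has been checked against the reference path, so the final output is the same regardless of how often the fast path diverged; divergence only costs extra decode work.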

Only requests marked `is_deterministic=True` incur verification; the rest run at full speed.
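As background on why reduction order matters at all, the following short example shows that floating-point addition is not associative, so summing the same values in a different order can give a different result. It is a plain-Python illustration, not GPU code; `reduce_in_order` is a hypothetical helper, and an explicit loop is used so that the accumulation order is exactly what is written.

```python
import math


def reduce_in_order(values):
    """Accumulate left to right, one addition at a time."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = reduce_in_order(values)                   # (0.1 + 0.2) + 0.3
backward = reduce_in_order(list(reversed(values)))  # (0.3 + 0.2) + 0.1

print(forward == backward)              # False: the two orders round differently
print(math.isclose(forward, backward))  # True: the difference is tiny
```

A difference this small is harmless in isolation, but during decoding it can tip a near-tie between two candidate tokens, after which the generated text diverges.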

Quick start

```bash
# Create and attach to a GPU-enabled Docker container (uses lmsysorg/sglang:v0.5.4)
./run_container.sh create && ./run_container.sh attach

# Inside the container: workspace is mounted at /workspace
cd /workspace
apt update; apt upgrade -y
git config --global --add safe.directory /workspace

# Build sgl-kernel and install sglang in editable mode
./build_all.sh

# Authenticate with Hugging Face to download gated models (e.g., Llama)
huggingface-cli login --token <HF_TOKEN>

# Terminal 1: Start the LLM-42 server (waits for model to load)
bash llm42_benchmarks/basic/launch_server.sh

# Terminal 2: Once the server is ready, run the determinism-check client
python3 llm42_benchmarks/basic/client.py
```
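For readers who want to see what a determinism check might look like, here is a minimal sketch. It is not the repository's `client.py`. It assumes the server exposes SGLang's native `/generate` endpoint on the default port 30000 and returns a JSON object with a `text` field; `http_generate` and `is_repeatable` are hypothetical helpers. How `is_deterministic` is passed in a request is not shown in this README, so the sketch leaves extra payload fields to the caller via `extra`.

```python
import json
import urllib.request


def http_generate(prompt, url="http://localhost:30000/generate", extra=None):
    """Send one greedy-decoding request and return the generated text."""
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0.0, "max_new_tokens": 64},
    }
    payload.update(extra or {})  # caller-supplied fields (assumption)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]


def is_repeatable(generate, prompt, runs=5):
    """Return True if `generate(prompt)` yields the same output on every run."""
    outputs = {generate(prompt) for _ in range(runs)}
    return len(outputs) == 1


# With a running server:
#   print(is_repeatable(http_generate, "Explain floating-point rounding."))
```

Passing the generation function as an argument keeps the comparison logic independent of the transport, so it can be exercised without a server. A more demanding check would send the repeated requests concurrently with other traffic, since batching is what perturbs the numerics.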

Configuration

| Flag | Default | Description |
|---|---|---|
| `--enable-llm42` | `0` | Set to `3` to enable LLM-42 decode-verify-rollback (DVR) |
| `--llm42-window-size` | `64` | Tokens decoded before verification |
| `--llm42-verify-batch-size` | `8` | Requests per verification batch (grouped verification) |

Additional flags for benchmarking: `--enable-deterministic-inference 2` (global batch-invariant baseline), `--llm42-skip-mismatch` (mismatch rate control / synthetic mismatch injection).
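As an illustration of how the flags in the table might be combined, the sketch below assembles a server command line. It assumes the server is started through the `sglang.launch_server` module with SGLang's `--model-path` option and that each LLM-42 flag takes its value as a separate argument; `build_launch_command` is a hypothetical helper and `<MODEL_PATH>` is a placeholder. The script `llm42_benchmarks/basic/launch_server.sh` remains the authoritative reference for the actual invocation.

```python
import shlex


def build_launch_command(model_path, window_size=64, verify_batch_size=8):
    """Assemble an argv list that enables LLM-42 with the given settings."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--enable-llm42", "3",  # per the table, 3 enables DVR
        "--llm42-window-size", str(window_size),
        "--llm42-verify-batch-size", str(verify_batch_size),
    ]


command = build_launch_command("<MODEL_PATH>", window_size=32)
print(shlex.join(command))
```

Building the command as a list and printing it with `shlex.join` keeps quoting correct if the model path contains spaces or shell metacharacters.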

Hardware

4× NVIDIA H100 PCIe (80 GB HBM3), 64-core CPU, ~1.65 TB DRAM.

Project Structure

```
├── python/sglang/
│   ├── srt/
│   │   ├── llm42/               # Core LLM-42 decode–verify–rollback logic
│   │   ├── batch_invariant_ops/ # Batch-invariant kernel wrappers
│   │   ├── layers/              # Model layers (attention, MoE, etc.)
│   │   ├── models/              # Supported model architectures
│   │   ├── managers/            # Request scheduling & memory management
│   │   └── sampling/            # Sampling strategies
│   ├── launch_server.py         # Server entry point
│   └── bench_serving.py         # Serving benchmark client
├── sgl-kernel/                  # Custom CUDA/Triton kernels
│   ├── csrc/                    # C++/CUDA sources
│   └── python/                  # Python bindings
├── llm42_benchmarks/            # LLM-42 benchmark scripts
├── llm42-plots/                 # Plotting scripts for paper figures
├── benchmark/                   # Upstream SGLang benchmarks
├── docker/                      # Dockerfiles & Kubernetes manifests
├── scripts/                     # CI, utility, and helper scripts
├── build_all.sh                 # Build sgl-kernel + install sglang
└── run_container.sh             # Create/attach to dev container
```

Citation

@article{gond2025llm42,
  title   = {{LLM-42}: Enabling Determinism in {LLM} Inference with Verified Speculation},
  author  = {Gond, Raja and Kamath, Aditya K and Ramjee, Ramachandran and Panwar, Ashish},
  journal = {arXiv preprint arXiv:2601.17768},
  year    = {2026},
  url     = {https://arxiv.org/abs/2601.17768}
}

License

This project is licensed under the terms in the LICENSE file. It is built on SGLang, which is licensed under the Apache License 2.0.

Trademark Notice

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
