A tool for systematically benchmarking Lean 4 "hammer" tactics (grind, simp_all, aesop, omega, etc.) across Mathlib and downstream projects to measure their effectiveness at replacing existing proof tactics.
- Python 3.10+
- Git
- Elan/Lake (Lean toolchain)
cd ~/hammer-bench
# Set up Python virtual environment (required)
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
# Initialize mathlib4 worktree
./bin/hammer-bench initThe bin/hammer-bench script automatically uses .venv if it exists.
This clones mathlib4 to ~/hammer-bench/worktrees/mathlib4/.
Edit queue.yaml:
source: mathlib4@master
queue:
- grind@quick_testThen run:
./bin/hammer-bench run# List completed runs
./bin/hammer-bench list
# Show details for a run
./bin/hammer-bench show <run_id>
# Compare two runs
./bin/hammer-bench compare <run1> <run2>The queue is a YAML file (queue.yaml):
# Source repository and ref
source: mathlib4-nightly-testing@hammer_measurements
queue:
# String shorthand: preset@targets:provider/fraction
- grind@quick_test
- omega@logic/100
# Or explicit format
- preset: simp_all
targets: algebra_group
provider: sineQuaNon
fraction: 50
# Completed runs (managed automatically)
completed:
- preset: grind
targets: quick_test
completed_at: 2025-12-04T10:30:00
run_id: 2025-12-04T10-30-00_grind_abc1234Two formats are supported:
-
Short name (from
config/repos.yaml):source: mathlib4@master source: mathlib4-nightly-testing@hammer_measurements
-
Full GitHub path:
source: leanprover-community/[email protected] source: my-org/my-lean-project@main
Short names must be defined in config/repos.yaml.
Presets define which tactic to run and build parameters:
grind:
customTactic: grind
fraction: 1
timing_mode: true
build_timeout_hours: 6
description: Run grind at every proof step
omega:
customTactic: omega
fraction: 1
timing_mode: true
build_timeout_hours: 6
grind_suggestions:
customTactic: grind +suggestions
fraction: 1
timing_mode: true
build_timeout_hours: 8All presets use the generic TRY_AT_EACH_STEP_* mechanism - no Mathlib code changes needed per tactic.
Target collections for partial builds:
all:
description: Full Mathlib build (default)
targets:
- Mathlib
quick_test:
description: Minimal targets for quick testing
targets:
- Mathlib.Logic.Basic
logic:
description: Core logic modules
targets:
- Mathlib.Logic.Basic
- Mathlib.Logic.Function.Basic
- Mathlib.Logic.Relation
algebra_group:
description: Group theory basics
targets:
- Mathlib.Algebra.Group.Basic
- Mathlib.Algebra.Group.DefsRepository definitions for short names:
repos:
mathlib4:
url: https://github.com/leanprover-community/mathlib4.git
default_ref: master
patch_file: Mathlib/Init.lean
mathlib4-nightly-testing:
url: https://github.com/leanprover-community/mathlib4-nightly-testing.git
default_ref: nightly-testing
patch_file: Mathlib/Init.lean
# Add downstream projects:
# my-project:
# url: https://github.com/user/my-lean-project.git
# patch_file: MyProject/Init.leanThe patch_file is used when applying custom suggestion providers.
Suggestion providers for +suggestions variants:
providers:
default:
command: null
description: Use default suggestion provider
sineQuaNon:
command: Lean.LibrarySuggestions.sineQuaNonSelector
description: Use the Sine Qua Non suggestion selector
disabled:
command: "fun _ _ => pure #[]"
description: Disable suggestions entirely# Initialize (clone mathlib4)
hammer-bench init
# Queue management
hammer-bench queue # List queue
hammer-bench queue add grind@quick_test
hammer-bench queue clear
# Run benchmarks
hammer-bench run # Process entire queue
hammer-bench run --once # Process one entry
hammer-bench run --dry-run # Show what would run
# View results
hammer-bench list # List runs
hammer-bench show <run_id> # Show run details
hammer-bench compare <r1> <r2> # Compare runs
hammer-bench validate <r1> <r2> # Check consistency
# Repository management
hammer-bench check-base # Show current checkout
hammer-bench rebase <tag> # Checkout different ref
# Testing
hammer-bench selftest # Run self-tests
hammer-bench selftest --dry-runEach run creates a directory in ~/hammer-bench/runs/<run_id>/:
metadata.json- Configuration, timing, machine info, sourcemessages.jsonl- All "can be replaced with" messagesbuild.log.gz- Compressed build output
{
"run_id": "2025-12-04T10-30-00_grind_abc1234",
"machine": "hostname",
"base_commit": "abc1234...",
"base_ref": "master",
"lean_toolchain": "leanprover/lean4:v4.14.0",
"source": {"repo": "mathlib4", "ref": "master"},
"config": {...},
"status": "completed",
"duration_seconds": 3600,
"message_count": 1234
}{"file": "Mathlib/Data/List/Basic.lean", "row": 123, "col": 5, "original": "simp", "replacement": "grind", "time_ms": 45}~/hammer-bench/
├── bin/hammer-bench # Entry point script
├── config/
│ ├── presets.yaml # Tactic configurations
│ ├── providers.yaml # Suggestion providers
│ ├── repos.yaml # Repository definitions
│ └── targets.yaml # Target collections
├── queue.yaml # Run queue
├── runs/ # Benchmark results (gitignored)
├── worktrees/ # Repository checkouts (gitignored)
│ ├── mathlib4/
│ └── mathlib4-nightly-testing/
├── scripts/ # Python implementation
└── tests/
├── expected/ # Expected outputs for selftest
└── README.md
hammer-bench can target any Lean project downstream of Mathlib:
-
Add the repository to
config/repos.yaml:repos: my-project: url: https://github.com/user/my-lean-project.git default_ref: main patch_file: MyProject/Init.lean # or null
-
Use it in your queue:
source: my-project@main queue: - grind
The project must depend on Mathlib to have the tactic analysis linters available.
See MATHLIB.md for the changes required in Mathlib to support this tool. These add a generic tryAtEachStepFromEnv linter that reads the tactic to run from environment variables.
Run the test suite to verify the tool works:
hammer-bench selftestThis checks out a known commit and runs small benchmarks, comparing against expected outputs.
MIT License