This is the official code repository for the paper:
MADE: Benchmark Environments for Closed-Loop Materials Discovery
arXiv preprint (https://arxiv.org/abs/2601.20996) | NeurIPS AI4Mat 2025 Workshop Paper
MADE (MAterials Discovery Environments) provides dynamic benchmark environments for evaluating end-to-end autonomous materials discovery pipelines. MADE simulates closed-loop discovery campaigns where agents propose, evaluate, and refine candidate materials under constrained oracle budgets.
- Closed-loop evaluation: Agents iteratively propose structures and receive feedback
- Modular agents: Compose generators, planners, scorers, and filters
- Flexible environments: Define your own convex hull discovery tasks with any oracle
- Discovery metrics: AF, EF, AUDC, mSUN for comparing strategies
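The closed-loop protocol in the list above can be sketched in a few lines. This is a toy illustration rather than MADE's API: `toy_oracle`, `RandomAgent`, and `run_episode` are hypothetical names, and a single number in [0, 1] stands in for a candidate structure.

```python
import random


def toy_oracle(candidate: float) -> float:
    """Hypothetical stand-in for an energy oracle (lower is better)."""
    return (candidate - 0.3) ** 2


class RandomAgent:
    """Hypothetical agent that proposes candidates uniformly at random."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.history: list[tuple[float, float]] = []

    def propose(self) -> float:
        return self.rng.random()

    def observe(self, candidate: float, energy: float) -> None:
        # Oracle feedback; a smarter agent would use it to refine later proposals.
        self.history.append((candidate, energy))


def run_episode(agent, oracle, budget: int) -> list[float]:
    """Run one campaign and return the best energy seen after each query."""
    best = float("inf")
    best_so_far = []
    for _ in range(budget):  # the oracle budget caps the number of evaluations
        candidate = agent.propose()
        energy = oracle(candidate)
        agent.observe(candidate, energy)
        best = min(best, energy)
        best_so_far.append(best)
    return best_so_far


curve = run_episode(RandomAgent(seed=0), toy_oracle, budget=5)
print(curve)
```

The returned curve is non-increasing by construction, which is the kind of per-query trace that budget-aware discovery metrics summarize.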
Install dependencies using uv:
uv sync
Set up API keys, either by exporting them or by adding them to a .env file:
export MATERIALS_PROJECT_API_KEY=your_key
export WANDB_API_KEY=your_key # optional
export ANTHROPIC_API_KEY=your_key # for Anthropic agents
export OPENAI_API_KEY=your_key # for OpenAI agents
If running on Modal, add these API keys to Modal secrets as well.
Update the wandb entity in configs/logger/wandb.yaml to your wandb username/team if you want to save progress to wandb. Otherwise, set use_wandb: false.
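Before launching a run, it can help to confirm which keys are visible to the process. The sketch below uses only the variable names from the export lines above; the split into required and optional keys is an assumption made for illustration, and `check_api_keys` is a hypothetical helper, not part of the repository.

```python
import os

# Assumption: only the Materials Project key is always needed; the others
# depend on whether wandb logging or LLM-based agents are used.
REQUIRED_KEYS = ["MATERIALS_PROJECT_API_KEY"]
OPTIONAL_KEYS = ["WANDB_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]


def check_api_keys(env=None):
    """Return (missing_required, missing_optional) key names."""
    env = os.environ if env is None else env
    missing_required = [key for key in REQUIRED_KEYS if not env.get(key)]
    missing_optional = [key for key in OPTIONAL_KEYS if not env.get(key)]
    return missing_required, missing_optional


missing_required, missing_optional = check_api_keys()
for key in missing_required:
    print(f"missing required key: {key}")
for key in missing_optional:
    print(f"missing optional key: {key}")
```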
# Run a single benchmark locally (Li-O system, random agent, ORB oracle, 3 episodes, 5 queries)
uv run scripts/run_benchmark.py
# Run on Modal (parallel episodes)
uv run scripts/run_benchmark.py experiment.infra=modal
# Run with custom config
uv run scripts/run_benchmark.py dataset.elements='[Fe,O]' experiment.num_episodes=5
- Environment (ConvexHullEnvironment): Defines the discovery task on a phase diagram
- Oracle (ORB, MACE, Analytic): Evaluates the formation energy of proposed structures
- Agent: A Pipeline or Orchestrator that proposes structures for evaluation. See src/made/agents/README.md for more details on available agents and how to extend them.
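To make the convex hull task concrete, the following sketch computes the energy above the hull for a binary A-B system in pure Python. It is a simplified illustration under stated assumptions, not the logic of ConvexHullEnvironment: compositions are reduced to the fraction of B, energies are formation energies in eV/atom, and `lower_hull` and `energy_above_hull` are hypothetical helper names.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points):
    """Lower convex hull of (fraction_B, formation_energy) points."""
    # Keep only the lowest-energy entry at each composition.
    lowest = {}
    for x, energy in points:
        lowest[x] = min(energy, lowest.get(x, energy))
    hull = []
    for point in sorted(lowest.items()):
        # Drop the previous vertex while it lies on or above the new segment.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], point) <= 0:
            hull.pop()
        hull.append(point)
    return hull


def energy_above_hull(points, x, energy):
    """Distance of (x, energy) above the hull built from the known points."""
    hull = lower_hull(points)
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            hull_energy = e0 + (x - x0) / (x1 - x0) * (e1 - e0)
            return energy - hull_energy
    raise ValueError("composition lies outside the range covered by the hull")


# Elemental references at 0 eV/atom plus three hypothetical compounds.
entries = [(0.0, 0.0), (1.0, 0.0), (0.5, -0.30), (0.25, -0.10), (0.75, -0.10)]
print(lower_hull(entries))
print(energy_above_hull(entries, 0.25, -0.10))
```

A candidate with a value of zero lies on the hull, and a negative value for a new candidate means it would lower the hull, which is the event a discovery campaign is trying to produce.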
All environment, oracle, and agent components are defined via Hydra config files in configs/. These can be combined in a variety of ways to create different agents and environments.
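Command-line overrides such as `dataset.elements='[Fe,O]'` address nested config keys by dotted path. The sketch below mimics that idea in plain Python to show how the overrides map onto a config tree. Hydra does this itself with a much richer grammar, so treat `apply_overrides` and `parse_value` as hypothetical, simplified stand-ins; the base config values are assumptions based on the defaults mentioned in the run comments.

```python
import copy


def parse_value(raw: str):
    """Parse a scalar or a flat [a,b,c] list from an override string."""
    raw = raw.strip()
    if raw.startswith("[") and raw.endswith("]"):
        inner = raw[1:-1].strip()
        return [parse_value(part) for part in inner.split(",")] if inner else []
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with dotted key=value overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


# Assumed defaults: Li-O system and 3 episodes, as in the default run.
base = {"dataset": {"elements": ["Li", "O"]}, "experiment": {"num_episodes": 3}}
resolved = apply_overrides(
    base,
    ["dataset.elements=[Fe,O]", "experiment.num_episodes=5", "experiment.infra=modal"],
)
print(resolved)
```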
We provide the configs used to run the baseline experiments in the paper in the agents config folder. These can be run using the scripts in scripts/. For example:
# Local (sequential)
uv run scripts/run_baseline_experiments.py \
--agent-configs "random_generator_baseline chemeleon_generative_baseline" \
--systems-file ./data/systems_10_mp_20/systems_ternary_n10_maxatoms20_intermetallic_smact.json
# Modal (parallel)
uv run modal run --detach scripts/run_baseline_experiments_modal.py \
--agent-configs "random_generator_baseline chemeleon_generative_baseline" \
--systems-file ./data/systems_10_mp_20/systems_ternary_n10_maxatoms20_intermetallic_smact.json
These commands run the random generator and Chemeleon generative baselines on ternary intermetallic systems.
This will save results to ./results/baselines/, or on a Modal volume if running on Modal.
results/<timestamp>-<oracle>-<agent>/
├── .hydra/ # Hydra config files
│ ├── config.yaml # Full resolved config
│ └── overrides.yaml # CLI overrides used
├── trajectories/
│ ├── episode_000.json # Full trajectory for episode 0
│ ├── episode_001.json # Full trajectory for episode 1
│ └── phase_diagram_*.png # Phase diagram visualizations
├── summary/
│ ├── summary.json # Aggregated metrics (mean/std across episodes)
│ ├── episodes.json # Per-episode metrics
│ ├── episodes.csv # Per-episode metrics (CSV format)
│ └── phase_diagram_gt.png # Ground truth phase diagram
└── run_benchmark.log # Execution log
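Given the single-run layout above, results can be read with the standard library alone. The sketch below relies only on the file names shown in the tree; the contents and column names of `summary.json` and `episodes.csv` are not documented here, so `numeric_column_means` simply averages whichever columns parse as numbers. Both function names are hypothetical.

```python
import csv
import json
from pathlib import Path
from statistics import mean


def load_run(run_dir):
    """Load summary/summary.json and summary/episodes.csv from one run directory."""
    summary_dir = Path(run_dir) / "summary"
    summary = json.loads((summary_dir / "summary.json").read_text())
    with open(summary_dir / "episodes.csv", newline="") as handle:
        episodes = list(csv.DictReader(handle))
    return summary, episodes


def numeric_column_means(rows):
    """Average every column whose values all parse as floats."""
    means = {}
    if not rows:
        return means
    for column in rows[0]:
        try:
            values = [float(row[column]) for row in rows]
        except (TypeError, ValueError):
            continue  # skip non-numeric or incomplete columns
        means[column] = mean(values)
    return means
```

Calling `load_run` on a `results/<timestamp>-<oracle>-<agent>/` directory returns the aggregated summary and the per-episode rows, and `numeric_column_means(episodes)` gives a quick cross-check against the aggregated values.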
results/baselines_<date>/
└── <agent_config>_<systems_file>_<N>systems_<B>queries_<T>stabilitymeV/
    ├── experiment_metadata.json # Experiment configuration
    ├── progress.json # Progress tracking (status, completed systems)
    ├── overall_summary/ # Aggregated metrics across all systems
    │   ├── summary.json # Summary statistics (mean/std/sem)
    │   └── per_system_summary.csv # Per-system breakdown
    └── systems/
        └── <system_id>/ # e.g., Co-Mg-Na
            ├── trajectories/
            │   ├── episode_000.json
            │   ├── episode_001.json
            │   └── phase_diagram_episode_*.png
            └── summary/
                ├── summary.json
                ├── episodes.json
                ├── episodes.csv
                └── phase_diagram_gt.png
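For the baseline layout above, per-system summaries sit one level deeper, under `systems/<system_id>/summary/`. A minimal sketch for gathering them, assuming only the directory and file names shown in the tree (`collect_system_summaries` is a hypothetical helper, and the JSON contents are returned as-is without assuming any keys):

```python
import json
from pathlib import Path


def collect_system_summaries(experiment_dir):
    """Map each system id to the parsed contents of its summary/summary.json."""
    systems_dir = Path(experiment_dir) / "systems"
    summaries = {}
    for system_dir in sorted(systems_dir.iterdir()):
        summary_path = system_dir / "summary" / "summary.json"
        if summary_path.is_file():  # skip systems without a finished summary
            summaries[system_dir.name] = json.loads(summary_path.read_text())
    return summaries
```

The `experiment_dir` argument is the `<agent_config>_..._stabilitymeV/` directory, so two baselines can be compared by calling the function once per experiment and joining on the system id.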
See notebooks/basic_analysis.ipynb for a basic example of loading and analyzing results from a single benchmark run. notebooks/results_analysis_utils.py contains utility functions for loading and analyzing results from a single benchmark run and comparing baseline experiments.
The script scripts/generate_systems.py generates the systems used in the baseline experiments. For example:
uv run scripts/generate_systems.py --output-dir ./data/systems_10_mp_20 --filter-by-smact --system-sizes [3,4,5] --only-intermetallics
MADE is designed to be extensible. You can create custom components by subclassing the base classes and adding Hydra configs:
- Oracles: Subclass Oracle from made.oracles.base, implement evaluate(structure) -> dict
- Environments: Subclass Environment from made.envs.base, implement reset(), step(), get_state()
- Agents: See src/made/agents/README.md for detailed documentation on creating new agents and components (planners, generators, filters, scorers)
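The subclassing pattern from the list above might look like the sketch below. To keep it runnable on its own, it defines stand-in `Oracle` and `Environment` base classes with only the methods named above; in the repository you would import the real ones from made.oracles.base and made.envs.base, which may require additional constructor arguments or return fields. `AtomCountOracle`, `BudgetedEnvironment`, and the dictionary keys are all hypothetical.

```python
from abc import ABC, abstractmethod


class Oracle(ABC):
    """Stand-in for made.oracles.base.Oracle (assumed interface)."""

    @abstractmethod
    def evaluate(self, structure) -> dict: ...


class Environment(ABC):
    """Stand-in for made.envs.base.Environment (assumed interface)."""

    @abstractmethod
    def reset(self): ...

    @abstractmethod
    def step(self, action): ...

    @abstractmethod
    def get_state(self): ...


class AtomCountOracle(Oracle):
    """Toy oracle: a fake energy that depends only on the number of atoms."""

    def __init__(self, energy_per_atom: float = -1.0) -> None:
        self.energy_per_atom = energy_per_atom

    def evaluate(self, structure) -> dict:
        n_atoms = len(structure)
        return {"energy": self.energy_per_atom * n_atoms, "n_atoms": n_atoms}


class BudgetedEnvironment(Environment):
    """Toy environment that forwards proposals to an oracle under a query budget."""

    def __init__(self, oracle: Oracle, budget: int) -> None:
        self.oracle = oracle
        self.budget = budget
        self.reset()

    def reset(self):
        self.queries_used = 0
        self.observations = []
        return self.get_state()

    def step(self, action):
        if self.queries_used >= self.budget:
            raise RuntimeError("oracle budget exhausted")
        result = self.oracle.evaluate(action)
        self.queries_used += 1
        self.observations.append(result)
        return result

    def get_state(self):
        return {
            "queries_remaining": self.budget - self.queries_used,
            "observations": list(self.observations),
        }


env = BudgetedEnvironment(AtomCountOracle(), budget=2)
print(env.step(["Li", "Li", "O"]))  # a list of element symbols stands in for a structure
print(env.get_state())
```

A matching Hydra config would still be needed to make such components selectable from the command line.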
MIT License - see LICENSE
If you use MADE in your research, please cite our paper:
@misc{malik2026made,
title={MADE: Benchmark Environments for Closed-Loop Materials Discovery},
author={Shreshth A Malik and Tiarnan Doherty and Panagiotis Tigas and Muhammed Razzak and Stephen J. Roberts and Aron Walsh and Yarin Gal},
year={2026},
eprint={2601.20996},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.20996},
}