MADE: MAterials Discovery Environments

This is the official code repository for the paper:

MADE: Benchmark Environments for Closed-Loop Materials Discovery (arXiv preprint; NeurIPS AI4Mat 2025 workshop paper)

Overview

MADE (MAterials Discovery Environments) provides dynamic benchmark environments for evaluating end-to-end autonomous materials discovery pipelines. MADE simulates closed-loop discovery campaigns where agents propose, evaluate, and refine candidate materials under constrained oracle budgets.
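The propose, evaluate, refine loop under a fixed oracle budget can be sketched in a few lines. This is a toy illustration rather than MADE's API: `toy_oracle`, `random_agent`, and `run_campaign` are hypothetical names, and candidates are plain numbers instead of crystal structures.

```python
import random

def toy_oracle(x: float) -> float:
    """Stand-in for an expensive evaluator; lower values are better."""
    return (x - 0.3) ** 2

def random_agent(history, rng):
    """Propose a candidate. A real agent would use `history` to refine its proposals."""
    return rng.random()

def run_campaign(agent, oracle, budget: int, seed: int = 0):
    """Run one closed-loop episode with at most `budget` oracle queries."""
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        candidate = agent(history, rng)       # propose
        energy = oracle(candidate)            # evaluate (consumes one query)
        history.append((candidate, energy))   # feedback available next round
    return history

history = run_campaign(random_agent, toy_oracle, budget=5)
best = min(history, key=lambda pair: pair[1])
print(f"queries used: {len(history)}, best candidate: {best}")
```

The budget is enforced by the loop itself, so strategies are compared on equal numbers of oracle calls.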

Key Features

Closed-loop evaluation: Agents iteratively propose structures and receive feedback
Modular agents: Compose generators, planners, scorers, and filters
Flexible environments: Define your own convex hull discovery tasks with any oracle
Discovery metrics: AF, EF, AUDC, mSUN for comparing strategies

Installation

Install dependencies using uv:

uv sync

Set the API keys as environment variables, or add them to a .env file:

export MATERIALS_PROJECT_API_KEY=your_key
export WANDB_API_KEY=your_key  # optional
export ANTHROPIC_API_KEY=your_key  # for Anthropic agents
export OPENAI_API_KEY=your_key  # for OpenAI agents

Add these API keys to Modal secrets too if running on Modal.
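Before launching a run, it can help to check which keys are visible to the process. The sketch below is not part of MADE; `missing_keys` is a hypothetical helper, and the split into required and optional keys is an assumption based on the comments in the export lines above.

```python
import os

# Assumption: only the Materials Project key is always needed; the W&B key is
# marked optional above and the LLM keys are only used by the matching agents.
REQUIRED_KEYS = ["MATERIALS_PROJECT_API_KEY"]
OPTIONAL_KEYS = ["WANDB_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]

def missing_keys(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

for name in missing_keys(REQUIRED_KEYS):
    print(f"missing required key: {name}")
for name in missing_keys(OPTIONAL_KEYS):
    print(f"optional key not set: {name}")
```

This only inspects the current environment. Keys that exist solely in a .env file will be reported as missing unless something has already loaded that file.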

Update the wandb entity in configs/logger/wandb.yaml to your wandb username/team if you want to save progress to wandb. Otherwise, set use_wandb: false.

Quick Start

# Run a single benchmark locally (Li-O system, random agent, ORB oracle, 3 episodes, 5 queries)
uv run scripts/run_benchmark.py

# Run on Modal (parallel episodes)
uv run scripts/run_benchmark.py experiment.infra=modal

# Run with custom config
uv run scripts/run_benchmark.py dataset.elements='[Fe,O]' experiment.num_episodes=5
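The overrides above use dotted `key=value` paths into a nested config. The following pure-Python sketch shows roughly how such overrides map onto nested keys. It is not Hydra's parser (which supports far more syntax), `apply_overrides` and `parse_value` are hypothetical helpers, and the `base` values are illustrative, with `infra: local` being an assumed default.

```python
import copy

def parse_value(raw: str):
    """Parse a tiny subset of override values: integers, bare strings, flat lists."""
    raw = raw.strip()
    if raw.startswith("[") and raw.endswith("]"):
        inner = raw[1:-1].strip()
        return [parse_value(part) for part in inner.split(",")] if inner else []
    try:
        return int(raw)
    except ValueError:
        return raw

def apply_overrides(config: dict, overrides: list) -> dict:
    """Return a copy of `config` with each dotted `a.b=value` override applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return result

# Illustrative config only; the real defaults live in configs/.
base = {
    "dataset": {"elements": ["Li", "O"]},
    "experiment": {"num_episodes": 3, "infra": "local"},
}
cfg = apply_overrides(base, ["dataset.elements=[Fe,O]", "experiment.num_episodes=5"])
print(cfg)
```

The quotes around `'[Fe,O]'` on the command line keep the shell from interpreting the brackets; the program still receives the literal string `[Fe,O]`.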

Architecture

Environment (ConvexHullEnvironment): Defines discovery task on a phase diagram
Oracle (ORB, MACE, Analytic): Evaluates formation energy of proposed structures
Agent: A Pipeline or Orchestrator that proposes structures for evaluation. See src/made/agents/README.md for details on the available agents and how to extend them.

All environment, oracle, and agent components are defined via Hydra config files in configs/. These can be combined in a variety of ways to create different agents and environments.
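To make the convex hull task concrete, here is a minimal sketch for a binary A-B system: it builds the lower convex hull of (composition, formation energy) points and computes a candidate's energy above that hull. This is an illustration under simplifying assumptions, not MADE's implementation; the function names are hypothetical and the energies are made-up values.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    """Lower convex hull of (x, energy) points via the monotone chain method."""
    best = {}
    for x, energy in points:
        # Keep only the lowest energy at each composition.
        best[x] = min(energy, best.get(x, energy))
    hull = []
    for p in sorted(best.items()):
        # Drop the previous vertex while it lies on or above the new segment.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def hull_energy(hull, x):
    """Linearly interpolate the hull energy at composition x."""
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            return e0 + (x - x0) / (x1 - x0) * (e1 - e0)
    raise ValueError("composition outside the hull range")

def energy_above_hull(hull, x, energy):
    """Positive: above the hull. Negative: the point would become a new hull vertex."""
    return energy - hull_energy(hull, x)

# x is the fraction of element B; pure elements sit at zero formation energy.
known = [(0.0, 0.0), (1.0, 0.0), (0.5, -1.0), (0.75, -0.8)]
hull = lower_hull(known)
above = energy_above_hull(hull, 0.25, -0.2)   # candidate above the hull
below = energy_above_hull(hull, 0.25, -0.7)   # candidate that would lower the hull
print(hull, above, below)
```

Systems with three or more elements need a higher-dimensional hull, which is where a dedicated phase diagram library becomes useful.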

Running Baseline Experiments

We provide the configs used to run the baseline experiments in the paper in the agents config folder. These can be run using the scripts in scripts/. For example:

# Local (sequential)
uv run scripts/run_baseline_experiments.py \
    --agent-configs "random_generator_baseline chemeleon_generative_baseline" \
    --systems-file ./data/systems_10_mp_20/systems_ternary_n10_maxatoms20_intermetallic_smact.json

# Modal (parallel)
uv run modal run --detach scripts/run_baseline_experiments_modal.py \
    --agent-configs "random_generator_baseline chemeleon_generative_baseline" \
    --systems-file ./data/systems_10_mp_20/systems_ternary_n10_maxatoms20_intermetallic_smact.json

These commands run the random generator and Chemeleon generative baselines on ternary intermetallic systems.

Results are saved to ./results/baselines/, or to a Modal volume when running on Modal.

Results Format

Single Run (`run_benchmark.py`)

results/<timestamp>-<oracle>-<agent>/
├── .hydra/                      # Hydra config files
│   ├── config.yaml              # Full resolved config
│   └── overrides.yaml           # CLI overrides used
├── trajectories/
│   ├── episode_000.json         # Full trajectory for episode 0
│   ├── episode_001.json         # Full trajectory for episode 1
│   └── phase_diagram_*.png      # Phase diagram visualizations
├── summary/
│   ├── summary.json             # Aggregated metrics (mean/std across episodes)
│   ├── episodes.json            # Per-episode metrics
│   ├── episodes.csv             # Per-episode metrics (CSV format)
│   └── phase_diagram_gt.png     # Ground truth phase diagram
└── run_benchmark.log            # Execution log
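Given the layout above, a single run can be loaded with the standard library alone. The sketch below assumes only the file names shown in the tree; `load_run` is a hypothetical helper, and the demo writes placeholder files with made-up keys, since the actual JSON and CSV schemas are not described here.

```python
import csv
import json
import tempfile
from pathlib import Path

def load_run(run_dir):
    """Load summary, per-episode rows, and trajectories from one run directory."""
    run_dir = Path(run_dir)
    summary = json.loads((run_dir / "summary" / "summary.json").read_text())
    with open(run_dir / "summary" / "episodes.csv", newline="") as fh:
        episodes = list(csv.DictReader(fh))  # values are read as strings
    trajectories = [
        json.loads(path.read_text())
        for path in sorted((run_dir / "trajectories").glob("episode_*.json"))
    ]
    return summary, episodes, trajectories

# Demo on a throwaway directory with placeholder contents (hypothetical keys).
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "example-run"
    (run / "summary").mkdir(parents=True)
    (run / "trajectories").mkdir()
    (run / "summary" / "summary.json").write_text(json.dumps({"placeholder_metric": 0.5}))
    (run / "summary" / "episodes.csv").write_text("episode,placeholder_metric\n0,0.4\n1,0.6\n")
    for i in range(2):
        (run / "trajectories" / f"episode_{i:03d}.json").write_text(json.dumps({"episode": i}))
    summary, episodes, trajectories = load_run(run)

print(summary, len(episodes), len(trajectories))
```

For real analysis, notebooks/results_analysis_utils.py is the better starting point, since it knows the actual field names.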

Baseline Experiments (`run_baseline_experiments.py`)

results/baselines_<date>/
└── <agent_config>_<systems_file>_<N>systems_<B>queries_<T>stabilitymeV/
    ├── experiment_metadata.json   # Experiment configuration
    ├── progress.json              # Progress tracking (status, completed systems)
    ├── overall_summary/           # Aggregated metrics across all systems
    │   ├── summary.json           # Summary statistics (mean/std/sem)
    │   └── per_system_summary.csv # Per-system breakdown
    └── systems/
        └── <system_id>/           # e.g., Co-Mg-Na
            ├── trajectories/
            │   ├── episode_000.json
            │   ├── episode_001.json
            │   └── phase_diagram_episode_*.png
            └── summary/
                ├── summary.json
                ├── episodes.json
                ├── episodes.csv
                └── phase_diagram_gt.png
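Because every system follows the same sub-layout, per-system summaries can be gathered with a single glob. This is a sketch based only on the directory tree above; `collect_system_summaries` is a hypothetical helper, and the demo uses placeholder system IDs and keys.

```python
import json
import tempfile
from pathlib import Path

def collect_system_summaries(experiment_dir):
    """Map each system ID to the parsed contents of its summary/summary.json."""
    summaries = {}
    pattern = "systems/*/summary/summary.json"
    for path in sorted(Path(experiment_dir).glob(pattern)):
        system_id = path.parents[1].name  # systems/<system_id>/summary/summary.json
        summaries[system_id] = json.loads(path.read_text())
    return summaries

# Demo on a throwaway directory with placeholder systems and metrics.
with tempfile.TemporaryDirectory() as tmp:
    experiment = Path(tmp) / "example-experiment"
    for system_id, value in [("Co-Mg-Na", 0.1), ("Al-Fe-Ni", 0.2)]:
        summary_dir = experiment / "systems" / system_id / "summary"
        summary_dir.mkdir(parents=True)
        (summary_dir / "summary.json").write_text(json.dumps({"placeholder_metric": value}))
    summaries = collect_system_summaries(experiment)

print(summaries)
```

Systems that have not finished yet simply have no summary.json and are skipped by the glob.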

Analyzing Results

See notebooks/basic_analysis.ipynb for a basic example of loading and analyzing results from a single benchmark run. Utility functions for loading single-run results and for comparing baseline experiments are in notebooks/results_analysis_utils.py.

Generating Systems

The systems used in the baseline experiments can be regenerated with scripts/generate_systems.py:

uv run scripts/generate_systems.py --output-dir ./data/systems_10_mp_20 --filter-by-smact --system-sizes '[3,4,5]' --only-intermetallics

Extending MADE

MADE is designed to be extensible. You can create custom components by subclassing the base classes and adding Hydra configs:

Oracles: Subclass Oracle from made.oracles.base, implement evaluate(structure) -> dict
Environments: Subclass Environment from made.envs.base, implement reset(), step(), get_state()
Agents: See src/made/agents/README.md for detailed documentation on creating new agents and components (planners, generators, filters, scorers)
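As a rough sketch of the oracle extension point, the example below subclasses a local stand-in for the base class so that it runs without MADE installed. Everything beyond the `evaluate(structure) -> dict` signature is an assumption: the stand-in `Oracle`, the `LookupOracle` class, the use of formula strings as structures, and the `formation_energy_per_atom` key are all hypothetical, and the energy is a made-up value.

```python
from abc import ABC, abstractmethod

class Oracle(ABC):
    """Local stand-in for made.oracles.base.Oracle (the real class may differ)."""

    @abstractmethod
    def evaluate(self, structure) -> dict:
        """Return a dict of computed properties for one structure."""

class LookupOracle(Oracle):
    """Toy oracle that reads energies from a precomputed table."""

    def __init__(self, table, default=0.0):
        self.table = dict(table)
        self.default = default
        self.num_queries = 0  # lets a caller track oracle usage

    def evaluate(self, structure) -> dict:
        self.num_queries += 1
        # Hypothetical convention: `structure` is a formula string, and the
        # result key below is illustrative, not MADE's actual output schema.
        energy = self.table.get(structure, self.default)
        return {"formation_energy_per_atom": energy}

oracle = LookupOracle({"Li2O": -2.0})  # made-up energy value
result = oracle.evaluate("Li2O")
print(result, oracle.num_queries)
```

In the real package, the subclass would inherit from `made.oracles.base.Oracle` and be paired with a Hydra config under configs/ so it can be selected like the built-in oracles.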

License

MIT License - see LICENSE

Citation

If you use MADE in your research, please cite our paper:

@misc{malik2026made,
      title={MADE: Benchmark Environments for Closed-Loop Materials Discovery}, 
      author={Shreshth A Malik and Tiarnan Doherty and Panagiotis Tigas and Muhammed Razzak and Stephen J. Roberts and Aron Walsh and Yarin Gal},
      year={2026},
      eprint={2601.20996},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.20996}, 
}
