This is the official codebase for Towards Robust Zero-shot Reinforcement Learning by Kexin Zheng*, Lauriane Teyssier*, Yinan Zheng, Yu Luo, and Xianyuan Zhan.
*Equal contribution
BREEZE is an FB-based framework that simultaneously enhances learning stability, policy extraction capability, and representation learning quality, through three key designs:
- Behavioral regularization in zero-shot RL policy learning, transforming policy optimization into a stable in-sample learning paradigm.
- Task-conditioned diffusion model policy extraction, enabling the generation of high-quality and multimodal action distributions in zero-shot RL settings.
- Attention-based architectures for representation modeling, capturing the complex relationships underlying the environment dynamics.
BREEZE achieves the best or near-best returns with faster convergence and enhanced stability.
Within 400k training steps, BREEZE can match or exceed baselines trained for 1M steps.

Requirements
- Python 3.9
- MuJoCo - required by the DM Control suite. Note: while our experiments used standalone MuJoCo binaries, the latest mujoco pip package now includes them.
- Wandb - for experiment tracking. Set `WANDB_API_KEY` before launching experiments, or pass `--wandb_logging False`.
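For example, in a bash shell (the key value is a placeholder):

```bash
export WANDB_API_KEY=<your_api_key>
```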
Install dependencies
```bash
conda create -n breeze python=3.9
conda activate breeze
pip install -r requirements.txt
```

All experiments rely on offline datasets from ExORL. Our repository includes a script to automatically download and reformat the datasets for the tasks and algorithms below.
ExORL Download & Reformat
```bash
bash data_prepare.sh
```

Domains and Tasks
| Domain | Eval Tasks | Dimensionality | Type | Reward | Command Line Argument |
|---|---|---|---|---|---|
| Walker | stand walk run flip | Low | Locomotion | Dense | walker |
| Quadruped | stand walk run jump | High | Locomotion | Dense | quadruped |
| Jaco | reach_top_left reach_top_right reach_bottom_left reach_bottom_right | High | Goal-reaching | Sparse | jaco |
Exploration Algorithms for Dataset Collection
| Exploration Algorithm | Command Line Argument |
|---|---|
| Random Network Distillation (RND) | rnd |
| Diversity is All You Need (DIAYN) | diayn |
| Active Pretraining with Successor Features (APS) | aps |
| Reinforcement Learning with Prototypical Representations (PROTO) | proto |
We provide the repository structure in repository_structure.md.
The main entry point is main_offline.py, which takes the algorithm name, domain, and exploration policy that generated the dataset. Key flags:
```
usage: main_offline.py <algorithm> <domain_name> <exploration_algorithm> \
       --eval_tasks TASK [TASK ...] [--train_task TASK]
       [--seed INT] [--learning_steps INT]
       [--z_inference_steps INT]
       [--wandb_logging {True,False}]
```
- `algorithm`: one of `breeze`, `fb`, `cfb`, `vcfb`, `mcfb`, `cql`, `sac`, `td3`, `sf-lap` (see table below).
- `domain_name`: DMC domain (`walker`, `quadruped`, `jaco`, `point_mass_maze`, ...).
- `exploration_algorithm`: dataset source tag (`proto`, `rnd`, `aps`, etc.).
- `--eval_tasks`: list of downstream tasks for zero-shot evaluation.
Example
```bash
# BREEZE on Quadruped with RND exploration data
python main_offline.py breeze quadruped rnd \
    --eval_tasks stand run walk jump \
    --seed 42 --learning_steps 1000000
```

Configuration defaults (network sizes, optimizers, diffusion settings, etc.) are stored in agents/<algo>/config.yaml. Override any value via CLI flags or edit the YAML.
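A second illustrative invocation showing the remaining flags from the usage string above; the numeric values here are placeholders, not recommended settings:

```bash
# FB baseline on Walker with PROTO exploration data,
# custom reward-inference budget, wandb logging disabled
python main_offline.py fb walker proto \
    --eval_tasks stand walk run flip \
    --z_inference_steps 10000 \
    --wandb_logging False
```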
| Algorithm | Authors | Type | Command Line Argument |
|---|---|---|---|
| BREEZE | Zheng et al. (2025) | Zero-shot RL | breeze |
| FB Representations | Touati et al. (2023) | Zero-shot RL | fb |
| Conservative FB Representations (VCFB/MCFB) | Jeen et al. (2024) | Zero-shot RL | mcfb/vcfb |
| Conservative Q-learning | Kumar et al. (2020) | Single-task Offline RL | cql |
| Soft Actor-Critic (SAC) | Haarnoja et al. (2018) | Online RL | sac |
| Twin Delayed DDPG (TD3) | Fujimoto et al. (2018) | Online RL | td3 |
| Successor Features with Laplacian Eigenfunctions (SF-LAP) | Borsa et al. (2018) | Zero-shot RL | sf-lap |
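Any baseline in the table can be launched with the same CLI pattern as BREEZE; for instance, a sketch for VCFB on Walker with PROTO exploration data:

```bash
python main_offline.py vcfb walker proto \
    --eval_tasks stand walk run flip \
    --seed 42
```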
We provide the domain-specific hyperparameters used in our experiments in domain_specific_hyp.md.
We gratefully acknowledge the prior works this repository builds on:
- This implementation is based on the Zero-Shot Reinforcement Learning from Low Quality Data codebase.
- The diffusion model implementation is based on IDQL.
If you find this repository helpful, please consider citing our paper:
```bibtex
@inproceedings{zheng2025towards,
  title={Towards Robust Zero-Shot Reinforcement Learning},
  author={Kexin Zheng and Lauriane Teyssier and Yinan Zheng and Yu Luo and Xianyuan Zhan},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}
```

This project is licensed under the MIT License. See LICENSE for the full text.

