MultiNet is a collaborative initiative with contributions from research teams at several leading institutions.
Need to run evaluations on a production multimodal, computer use, or robotics AI system? We can help!
- 2025-10-13: Multinet v1.0 - We release our most comprehensive benchmark yet, evaluating a SoTA VLM, VLA, and generalist model on a wide variety of multimodal understanding and action datasets. Read more here
- 2025-06-10: Paper accepted at ICML 2025! Our paper detailing the open-source contributions of Multinet that benefit the AI community has been accepted at the CodeML Workshop at ICML 2025! Read our paper here.
- 2025-05-22: Multinet v0.2 - We systematically profile state-of-the-art VLAs and VLMs to understand how they perform in procedurally generated OOD game environments! Read more about our release here
- 2024-11-08: Multinet v0.1 - We release the first version of MultiNet, profiling SoTA VLMs and VLAs on real-world robotics tasks! Check our release page for more details.
- 2024-03-22: Introducing Multinet! A new generalist benchmark to evaluate Vision-Language & Action models. Learn more here
This repo provides the following:
- Ability to profile VLMs, VLAs, and generalist models on our generalist evaluation framework, with comprehensive coverage of open-source physical commonsense reasoning, image classification, visual question answering, control/action (RL, robotics), gameplay, and function calling tasks
- Ability to translate control data of various formats and from various sources to a unified TensorFlow Datasets (TFDS) format.
- Ability to evaluate the performance of SoTA VLMs and VLAs such as GPT-5, Pi0, and Magma in a zero-shot setting on a wide variety of tasks detailed here.
- A general framework (GenESIS) for mapping VLMs to other modality classes, with particular emphasis on action spaces. This framework lets a wide range of models be adapted to multiple types of tasks or datasets, scaling effectively while reducing the engineering effort required. In MultiNet v1.0, GenESIS is used to evaluate GPT-5 on the OpenX, Overcooked, PIQA, ODINW, and SQA3D datasets; a minimal illustration of the idea follows this list.
- Sample datasets and clear guidelines to test your model locally and submit for official benchmark evaluation; leaderboard results are generated by the MultiNet team.
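As a minimal illustration of the pattern that GenESIS generalizes (not the actual GenESIS implementation), the sketch below prompts a VLM to choose from a discretized action set and parses the reply back into an action index. The `query_vlm` callable, the prompt wording, and the parsing logic are assumptions made purely for illustration; substitute whatever VLM API and action discretization your setup uses.

```python
import re
from typing import Callable, Sequence


def map_vlm_to_discrete_action(
    query_vlm: Callable[[str, bytes], str],  # hypothetical VLM call: (prompt, image) -> text
    image: bytes,
    action_labels: Sequence[str],            # e.g. ["move left", "move right", "pick up", "place"]
) -> int:
    """Prompt a VLM to pick one action from a discretized action set and
    parse its text reply back into an action index (illustrative sketch only)."""
    menu = "\n".join(f"{i}: {label}" for i, label in enumerate(action_labels))
    prompt = (
        "You control an agent. Given the image, choose the single best action.\n"
        f"Actions:\n{menu}\n"
        "Reply with only the action number."
    )
    reply = query_vlm(prompt, image)
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"Could not parse an action index from: {reply!r}")
    index = int(match.group())
    if not 0 <= index < len(action_labels):
        raise ValueError(f"Action index {index} is out of range")
    return index
```

Continuous action spaces follow the same pattern once each dimension is binned into a discrete vocabulary the VLM can emit as text.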
Also related to the MultiNet effort is a small, simple, open-source implementation of what is described in DeepMind's GATO paper. This project marks our initial step towards building a multimodal generalist action model.
conda create -n multinet python=3.10
conda activate multinet
git clone https://github.com/ManifoldRG/MultiNet.git
cd MultiNet/src
pip install -r requirements.txt

To download your desired dataset:

cd MultiNet/src/v1
python centralized_downloader.py --download <name of dataset you would like to download>

To translate one file/shard of your desired control dataset (downloaded using the downloader script in this repo) to the TFDS format:

cd MultiNet/src/v1
python centralized_processor.py --input_dir <path to the downloaded dataset> --output_dir <directory where you would like to store the translated file>

To translate multiple files/shards of your desired control dataset (downloaded using the downloader script in this repo) to the TFDS format:

Note: Make sure to modify the way the multiple files are traversed for translation in translate_multiple.py in MultiNet/src/control_translation according to your local file structure.

cd MultiNet/src/v1
python wrapper_centralized_processor.py --input_dir <path to the downloaded dataset> --output_dir <directory where you would like to store the translated file>
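To sanity-check a translated shard, something like the sketch below may help. It assumes the processor writes standard TFDS builder directories; adjust the path and loading call to match the actual output layout of centralized_processor.py, which this example does not claim to document.

```python
# Minimal inspection sketch, assuming the translated output is a standard
# TFDS builder directory (adjust to centralized_processor.py's real layout).
import tensorflow_datasets as tfds

builder_dir = "/path/to/translated/output"  # the --output_dir you passed above
builder = tfds.builder_from_directory(builder_dir)
ds = builder.as_dataset(split=list(builder.info.splits)[0])

for example in tfds.as_numpy(ds.take(1)):
    # Print the feature keys and shapes of the first translated example.
    for key, value in example.items():
        print(key, getattr(value, "shape", type(value)))
```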
We provide comprehensive evaluation guides for different models:

- Magma Model Evaluation: For detailed instructions on evaluating Magma on ODINW, PIQA, SQA3D, RoboVQA, Overcooked, BFCL, and OpenX datasets, see the Magma Evaluation Guide.
- Pi0 Base Model Evaluation: For detailed instructions on evaluating Pi0 Base on ODINW, PIQA, SQA3D, RoboVQA, BFCL, Overcooked, and OpenX datasets, see the Pi0 Evaluation Guide.
- GPT Model Evaluation (GenESIS Framework): For detailed instructions on evaluating GPT-5 using the GenESIS framework on ODINW, PIQA, SQA3D, RoboVQA, Overcooked, and OpenX datasets, see the GenESIS Evaluation Guide.
We provide a submission toolkit and comprehensive instructions to benchmark your model on MultiNet datasets:
- Standardized Interface: Create model adapters that inherit from the base `ModelAdapter` class
- Dockerized Evaluation: Reproducible evaluations in isolated containers
- Various Task Types: Support for datasets that span VQA, action prediction, function calling, and more

Quick Start:
- Create your model adapter(s) by inheriting from `ModelAdapter` in `src/eval_harness/model_adapter.py` (an illustrative sketch follows this list)
- Test your model adapter using the `scripts/eval_harness/evaluate.py` entrypoint, which loads sample data in a standard format
- Configure `harness_dataset_config.txt` and `Dockerfile` with your adapter settings
- Run `./build_and_run_eval_container.sh DATASET_NAME` to test containerized evaluation
- Open a PR with your code
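To give a feel for what a submission looks like, here is a minimal, hypothetical adapter sketch. The base class and its actual abstract methods are defined in `src/eval_harness/model_adapter.py`; the module import path and the `predict` method name below are illustrative assumptions, not the harness's real interface, so treat the Model Submission Guide as the authoritative contract.

```python
# Hypothetical adapter sketch: the real abstract interface lives in
# src/eval_harness/model_adapter.py, and the method name `predict` here
# is an assumption for illustration, not the harness's actual API.
from src.eval_harness.model_adapter import ModelAdapter  # assumed module path


class EchoAdapter(ModelAdapter):
    """Toy adapter returning a fixed answer, showing the shape of a submission:
    load your model in __init__, then map harness-formatted inputs to predictions."""

    def __init__(self):
        super().__init__()
        # Real adapters would load model weights and tokenizers here.
        self.default_answer = "unknown"

    def predict(self, batch):
        # Return one prediction per example, in whatever format the task
        # expects (answer string, action vector, function call, ...).
        return [self.default_answer for _ in batch]
```

Once your adapter is in place, point `harness_dataset_config.txt` and the `Dockerfile` at it and run the container script as described in the Quick Start above.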
Official benchmark runs are executed by the MultiNet team using your submitted Dockerfile and adapters. Local runs operate on the provided sample datasets to validate your setup.
For complete instructions, see the Model Submission Guide.
If you're experiencing any issues, open a GitHub issue or contact [email protected] directly.
If you use MultiNet in your research, please cite our work:
ICML CodeML Paper Submission - An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models
@misc{guruprasad2025opensourcesoftwaretoolkit,
title={An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models},
author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Jaewoo Song and Harshvardhan Sikka},
year={2025},
eprint={2506.09172},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.09172},
}
Multinet v0.2 - Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments
@misc{guruprasad2025benchmarkingvisionlanguage,
title={Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments},
author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Harshvardhan Sikka},
year={2025},
eprint={2505.05540},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.05540},
}
Multinet v0.1 - Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks
@misc{guruprasad2024benchmarkingvisionlanguage,
title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
author={Pranav Guruprasad and Harshvardhan Sikka and Jaewoo Song and Yangyue Wang and Paul Pu Liang},
year={2024},
eprint={2411.05821},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2411.05821},
}
Multinet Vision and Dataset specification
@misc{guruprasad2024benchmarking,
author={Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
DOI={10.20944/preprints202411.0494.v1},
year={2024},
}

