MultiNet is a collaborative initiative with contributions from research teams at several leading institutions.
Need to run evaluations on a production multimodal, computer use, or robotics AI system? We can help!
- 2025-10-13: Multinet v1.0 - We release our most comprehensive benchmark yet, evaluating a SoTA VLM, VLA, and generalist model on a wide variety of multimodal understanding and action datasets. Read more here
- 2025-06-10: Paper accepted at ICML 2025! Our paper detailing the open-source contributions of Multinet that benefit the AI community has been accepted at the CodeML Workshop at ICML 2025! Read our paper here.
- 2025-05-22: Multinet v0.2 - We systematically profile state-of-the-art VLAs and VLMs to understand how they perform in procedurally generated OOD game environments! Read more about our release here
- 2024-11-08: Multinet v0.1 - We release the first version of MultiNet, profiling SoTA VLMs and VLAs on real-world robotics tasks! Check our release page for more details.
- 2024-03-22: Introducing Multinet! A new generalist benchmark to evaluate Vision-Language & Action models. Learn more here
This repo provides the following:
- Ability to profile VLMs, VLAs, and generalist models on our generalist evaluation framework, with comprehensive coverage of open-source physical commonsense reasoning, image classification, visual question answering, control/action (RL, robotics), gameplay, and function calling tasks
- Ability to translate control data of various formats and from various sources to a unified TensorFlow Datasets (TFDS) format.
- Ability to evaluate the performance of SoTA VLMs and VLAs such as GPT-5, Pi0, and Magma in a zero-shot setting on a wide variety of tasks detailed here.
- A general framework (GenESIS) for mapping VLMs to other modality classes, with particular emphasis on action spaces. This framework lets a wide range of models be adapted to multiple types of tasks or datasets, scaling effectively while reducing the engineering effort required. In MultiNet v1.0, GenESIS is used to evaluate GPT-5 on the OpenX, Overcooked, PIQA, ODINW, and SQA3D datasets; a minimal illustration of the idea follows this list.
- Sample datasets and clear guidelines to test your model locally and submit for official benchmark evaluation; leaderboard results are generated by the MultiNet team.
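As a minimal illustration of the pattern that GenESIS generalizes (not the actual GenESIS implementation), the sketch below prompts a VLM to choose from a discretized action set and parses the reply back into an action index. The `query_vlm` callable, the prompt wording, and the parsing logic are assumptions made purely for illustration; substitute whatever VLM API and action discretization your setup uses.

```python
import re
from typing import Callable, Sequence


def map_vlm_to_discrete_action(
    query_vlm: Callable[[str, bytes], str],  # hypothetical VLM call: (prompt, image) -> text
    image: bytes,
    action_labels: Sequence[str],            # e.g. ["move left", "move right", "pick up", "place"]
) -> int:
    """Prompt a VLM to pick one action from a discretized action set and
    parse its text reply back into an action index (illustrative sketch only)."""
    menu = "\n".join(f"{i}: {label}" for i, label in enumerate(action_labels))
    prompt = (
        "You control an agent. Given the image, choose the single best action.\n"
        f"Actions:\n{menu}\n"
        "Reply with only the action number."
    )
    reply = query_vlm(prompt, image)
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"Could not parse an action index from: {reply!r}")
    index = int(match.group())
    if not 0 <= index < len(action_labels):
        raise ValueError(f"Action index {index} is out of range")
    return index
```

Continuous action spaces follow the same pattern once each dimension is binned into a discrete vocabulary the VLM can emit as text.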
Also related to the MultiNet effort is a small, simple, open-source implementation of what is described in DeepMind's GATO paper. This project marks our initial step towards building a multimodal generalist action model.
conda create -n multinet python=3.10
conda activate multinet
git clone https://github.com/ManifoldRG/MultiNet.git
cd MultiNet/src
pip install -r requirements.txt

To download your desired dataset:

cd MultiNet/src/v1
python centralized_downloader.py --download <name of dataset you would like to download>

To translate one file/shard of your desired control dataset (downloaded using the downloader script in this repo) to the TFDS format:

cd MultiNet/src/v1
python centralized_processor.py --input_dir <path to the downloaded dataset> --output_dir <directory where you would like to store the translated file>

To translate multiple files/shards of your desired control dataset (downloaded using the downloader script in this repo) to the TFDS format:

Note: Make sure to modify the way the multiple files are traversed for translation in translate_multiple.py in MultiNet/src/control_translation according to your local file structure.

cd MultiNet/src/v1
python wrapper_centralized_processor.py --input_dir <path to the downloaded dataset> --output_dir <directory where you would like to store the translated file>
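To sanity-check a translated shard, something like the sketch below may help. It assumes the processor writes standard TFDS builder directories; adjust the path and loading call to match the actual output layout of centralized_processor.py, which this example does not claim to document.

```python
# Minimal inspection sketch, assuming the translated output is a standard
# TFDS builder directory (adjust to centralized_processor.py's real layout).
import tensorflow_datasets as tfds

builder_dir = "/path/to/translated/output"  # the --output_dir you passed above
builder = tfds.builder_from_directory(builder_dir)
ds = builder.as_dataset(split=list(builder.info.splits)[0])

for example in tfds.as_numpy(ds.take(1)):
    # Print the feature keys and shapes of the first translated example.
    for key, value in example.items():
        print(key, getattr(value, "shape", type(value)))
```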
We provide comprehensive evaluation guides for different models:

- Magma Model Evaluation: For detailed instructions on evaluating Magma on ODINW, PIQA, SQA3D, RoboVQA, Overcooked, BFCL, and OpenX datasets, see the Magma Evaluation Guide.
- Pi0 Base Model Evaluation: For detailed instructions on evaluating Pi0 Base on ODINW, PIQA, SQA3D, RoboVQA, BFCL, Overcooked, and OpenX datasets, see the Pi0 Evaluation Guide.
- GPT Model Evaluation (GenESIS Framework): For detailed instructions on evaluating GPT-5 using the GenESIS framework on ODINW, PIQA, SQA3D, RoboVQA, Overcooked, and OpenX datasets, see the GenESIS Evaluation Guide.
We provide a submission toolkit and comprehensive instructions to benchmark your model on MultiNet datasets:
- Standardized Interface: Create model adapters that inherit from the base `ModelAdapter` class
- Dockerized Evaluation: Reproducible evaluations in isolated containers
- Various Task Types: Support for datasets that span VQA, action prediction, function calling, and more

Quick Start:
- Create your model adapter(s) by inheriting from `ModelAdapter` in `src/eval_harness/model_adapter.py` (an illustrative sketch follows this list)
- Test your model adapter using the `scripts/eval_harness/evaluate.py` entrypoint, which loads sample data in a standard format
- Configure `harness_dataset_config.txt` and `Dockerfile` with your adapter settings
- Run `./build_and_run_eval_container.sh DATASET_NAME` to test containerized evaluation
- Open a PR with your code
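To give a feel for what a submission looks like, here is a minimal, hypothetical adapter sketch. The base class and its actual abstract methods are defined in `src/eval_harness/model_adapter.py`; the module import path and the `predict` method name below are illustrative assumptions, not the harness's real interface, so treat the Model Submission Guide as the authoritative contract.

```python
# Hypothetical adapter sketch: the real abstract interface lives in
# src/eval_harness/model_adapter.py, and the method name `predict` here
# is an assumption for illustration, not the harness's actual API.
from src.eval_harness.model_adapter import ModelAdapter  # assumed module path


class EchoAdapter(ModelAdapter):
    """Toy adapter returning a fixed answer, showing the shape of a submission:
    load your model in __init__, then map harness-formatted inputs to predictions."""

    def __init__(self):
        super().__init__()
        # Real adapters would load model weights and tokenizers here.
        self.default_answer = "unknown"

    def predict(self, batch):
        # Return one prediction per example, in whatever format the task
        # expects (answer string, action vector, function call, ...).
        return [self.default_answer for _ in batch]
```

Once your adapter is in place, point `harness_dataset_config.txt` and the `Dockerfile` at it and run the container script as described in the Quick Start above.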
Official benchmark runs are executed by the MultiNet team using your submitted Dockerfile and adapters. Local runs operate on the provided sample datasets to validate your setup.
For complete instructions, see the Model Submission Guide.
If you're experiencing any issues, open a GitHub issue or contact [email protected] directly.
If you use MultiNet in your research, please cite our work:
ICML CodeML Paper Submission - An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models
@misc{guruprasad2025opensourcesoftwaretoolkit,
title={An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models},
author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Jaewoo Song and Harshvardhan Sikka},
year={2025},
eprint={2506.09172},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.09172},
}
Multinet v0.2 - Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments
@misc{guruprasad2025benchmarkingvisionlanguage,
title={Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments},
author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Harshvardhan Sikka},
year={2025},
eprint={2505.05540},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.05540},
}
Multinet v0.1 - Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks
@misc{guruprasad2024benchmarkingvisionlanguage,
title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
author={Pranav Guruprasad and Harshvardhan Sikka and Jaewoo Song and Yangyue Wang and Paul Pu Liang},
year={2024},
eprint={2411.05821},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2411.05821},
}
Multinet Vision and Dataset specification
@misc{guruprasad2024benchmarking,
author={Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
DOI={10.20944/preprints202411.0494.v1},
year={2024},
}

