
Multinet Logo

MultiNet: A Generalist Benchmark for the Next Generation of Multimodal Models

Website Multinet v1.0 release Multinet v0.2 paper Multinet v0.1 paper GenESIS framework Contribute

MultiNet is a collaborative initiative with contributions from leading research teams at institutions like:

Fig Logo Manifold Research Logo MIT Logo Georgia Tech Logo Tufts Logo

Need to Run Evaluations on a Production Multimodal, Computer Use, or Robotics AI System? We can help!

📒 Updates

  • 🌟 2025-10-13: Multinet v1.0 - We release our most comprehensive benchmark yet, evaluating a SoTA VLM, VLA, and generalist model on a wide variety of multimodal understanding and action datasets. Read more here
  • 🏅 2025-06-10: Paper accepted at ICML 2025! Our paper detailing the open-source contributions of Multinet that benefit the AI community has been accepted at the CodeML Workshop at ICML 2025! Read our paper here.
  • 🏆 2025-05-22: Multinet v0.2 - We systematically profile state-of-the-art VLAs and VLMs to understand how they perform in procedurally generated OOD game environments! Read more about our release here
  • 🎉 2024-11-08: We release the first version of MultiNet, where we profiled SoTA VLMs and VLAs on real-world robotics tasks - Multinet v0.1! Check our release page for more details.
  • 🚀 2024-03-22: Introducing Multinet! A new generalist benchmark to evaluate Vision-Language & Action models. Learn more here

🔍 Overview

This repo provides the following:

  1. Ability to profile VLMs, VLAs, and generalist models on our generalist evaluation framework, with comprehensive coverage of open-source physical commonsense reasoning, image classification, visual question answering, control/action (RL, robotics), gameplay, and function calling tasks
  2. Ability to translate control data of various formats, and from various sources, into a unified TensorFlow Datasets (TFDS) format.
  3. Ability to evaluate the performance of SoTA VLMs and VLAs such as GPT-5, Pi0, and Magma in a zero-shot setting on a wide variety of tasks detailed here.
  4. A general framework for mapping VLMs to other modality classes, with particular emphasis on action spaces. This framework lets you adapt a wide range of models to multiple types of tasks or datasets, scaling effectively while reducing the required engineering effort (a toy sketch of the idea follows this list). In MultiNet v1.0, GenESIS is used to evaluate GPT-5 on the OpenX, Overcooked, PIQA, ODINW, and SQA3D datasets.
  5. Sample datasets and clear guidelines to test your model locally and submit for official benchmark evaluation; leaderboard results are generated by the MultiNet team.
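As a rough illustration of what item 4 means by mapping a VLM to an action space, the sketch below discretizes a continuous action range into integer bins that a text-only model can emit, then decodes the model's reply back into an action vector. The function names and binning scheme are illustrative only and are not the actual GenESIS API; see the GenESIS framework link above for the real design.

```python
# Toy sketch only: GenESIS's real interface lives in this repo and differs.
import numpy as np

def build_action_prompt(action_dim: int, low: float, high: float) -> str:
    """Ask a text-only VLM to emit one integer bin per action dimension."""
    return (
        f"Respond with {action_dim} comma-separated integers in [0, 255], "
        f"one per action dimension, where 0 means {low} and 255 means {high}."
    )

def decode_action(vlm_reply: str, low: float, high: float) -> np.ndarray:
    """Map the VLM's text reply back into a continuous action vector."""
    bins = np.array([int(tok) for tok in vlm_reply.split(",")], dtype=np.float64)
    return low + (bins / 255.0) * (high - low)

# Example: decode_action("128, 0, 255", low=-1.0, high=1.0) ≈ [0.004, -1.0, 1.0]
```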

Also related to the MultiNet effort is μGATO on GitHub - a small, simple, open-source implementation of what is described in DeepMind's GATO paper. This project marks our initial step towards building a multimodal generalist action model.


Multinet v1.0 Figure

🚀 Getting Started

To set up the environment for Multinet:

conda create -n multinet python=3.10
conda activate multinet
git clone https://github.com/ManifoldRG/MultiNet.git
cd MultiNet/src
pip install -r requirements.txt

To download the datasets in v1:

cd MultiNet/src/v1
python centralized_downloader.py --download <name of dataset you would like to download>

To translate one file/shard of your desired control dataset (downloaded using the downloader script in this repo) to the TFDS format:

cd MultiNet/src/v1
python centralized_processor.py --input_dir <path to the downloaded dataset> --output_dir <directory where you would like to store the translated file>
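To sanity-check a translated shard, you can load it back with tensorflow_datasets. The snippet below is a minimal sketch that assumes the processor writes a standard TFDS builder directory under your --output_dir; if the output layout differs, adapt the loading step accordingly.

```python
# Minimal inspection sketch; assumes --output_dir holds a standard TFDS build.
import tensorflow_datasets as tfds

output_dir = "/path/to/translated/output"  # the --output_dir passed above
builder = tfds.builder_from_directory(output_dir)
split_name = list(builder.info.splits)[0]
ds = builder.as_dataset(split=split_name)

for example in ds.take(1):
    # Control data typically carries observation/action fields per timestep.
    print({key: getattr(value, "shape", None) for key, value in example.items()})
```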

To translate multiple files/shards of your desired control dataset (downloaded using the downloader script in this repo) to the TFDS format:

Note: Make sure to modify the way multiple files are traversed for translation in translate_multiple.py in MultiNet/src/control_translation according to your local file structure.

cd MultiNet/src/v1
python wrapper_centralized_processor.py --input_dir <path to the downloaded dataset> --output_dir <directory where you would like to store the translated file>

To evaluate models on MultiNet datasets

We provide comprehensive evaluation guides for different models:

Magma Model Evaluation: For detailed instructions on evaluating Magma on ODINW, PIQA, SQA3D, RoboVQA, Overcooked, BFCL, and OpenX datasets, see the Magma Evaluation Guide.

Pi0 Base Model Evaluation: For detailed instructions on evaluating Pi0 Base on ODINW, PIQA, SQA3D, RoboVQA, BFCL, Overcooked, and OpenX datasets, see the Pi0 Evaluation Guide.

GPT Model Evaluation (GenESIS Framework): For detailed instructions on evaluating GPT-5 using the GenESIS framework on ODINW, PIQA, SQA3D, RoboVQA, Overcooked, and OpenX datasets, see the GenESIS Evaluation Guide.

📊 Process for Submission to the MultiNet Benchmark

We provide a submission toolkit and comprehensive instructions to benchmark your model on MultiNet datasets:

  • Standardized Interface: Create model adapters that inherit from the base ModelAdapter class
  • Dockerized Evaluation: Reproducible evaluations in isolated containers
  • Various Task Types: Support for datasets that span VQA, action prediction, function calling, and more

Quick Start:

  1. Create your model adapter(s) by inheriting from ModelAdapter in src/eval_harness/model_adapter.py (see the sketch after this list)
  2. Test your model adapter using the scripts/eval_harness/evaluate.py entrypoint, which loads sample data in a standard format
  3. Configure harness_dataset_config.txt and Dockerfile with your adapter settings
  4. Run ./build_and_run_eval_container.sh DATASET_NAME to test containerized evaluation
  5. Open a PR with your code
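As a starting point for step 1, here is a hypothetical, minimal adapter. The abstract interface is defined in src/eval_harness/model_adapter.py; the class and method shown below are placeholders, so match your implementation to the actual abstract methods and the Model Submission Guide.

```python
# Hypothetical sketch; the real abstract methods in model_adapter.py may differ.
from src.eval_harness.model_adapter import ModelAdapter  # assumed import path

class ConstantBaselineAdapter(ModelAdapter):
    """Toy adapter that returns a fixed answer, just to exercise the harness."""

    def predict(self, batch):
        # `batch` holds sample data in the harness's standard format; return one
        # prediction per example in the format the task expects (answer text,
        # action vector, function call, ...).
        return ["unknown" for _ in batch]
```

Once this runs against the sample data via scripts/eval_harness/evaluate.py, swap the placeholder logic for calls into your actual model.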

Official benchmark runs are executed by the MultiNet team using your submitted Dockerfile and adapters. Local runs operate on the provided sample datasets to validate your setup.

For complete instructions, see the Model Submission Guide.

If you're experiencing any issues, open a GitHub issue or contact [email protected] directly.

📜 Citation

If you use MultiNet in your research, please cite our work:

ICML CodeML Paper Submission - An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models

@misc{guruprasad2025opensourcesoftwaretoolkit,
      title={An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models}, 
      author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Jaewoo Song and Harshvardhan Sikka},
      year={2025},
      eprint={2506.09172},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.09172}, 
      }

Multinet v0.2 - Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments

@misc{guruprasad2025benchmarkingvisionlanguage,
      title={Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments}, 
      author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Harshvardhan Sikka},
      year={2025},
      eprint={2505.05540},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05540}, 
      }

Multinet v0.1 - Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

@misc{guruprasad2024benchmarkingvisionlanguage,
      title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks}, 
      author={Pranav Guruprasad and Harshvardhan Sikka and Jaewoo Song and Yangyue Wang and Paul Pu Liang},
      year={2024},
      eprint={2411.05821},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2411.05821},
      }

Multinet Vision and Dataset specification

@misc{guruprasad2024benchmarking,
      author={Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
      title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
      DOI={10.20944/preprints202411.0494.v1},
      year={2024},
      }    
