FutureOmni Logo

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Project Page | Paper | Dataset

Predicting the future requires listening as well as seeing.

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio–visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding.

FutureOmni is the first benchmark designed to evaluate omni-modal future forecasting from audio–visual environments. To succeed, models must perform cross-modal causal and temporal reasoning while effectively leveraging internal knowledge to predict future events.


🌟 Key Features

  • 🎯 First Omni-Modal Forecasting Benchmark: Dedicated exclusively to assessing whether models can predict future states based on audio-visual causal logic.
  • 🏗️ Scalable Construction Pipeline: Constructed via a robust LLM-assisted, human-in-the-loop pipeline ensuring high-quality causal pairs.
  • 📊 Comprehensive & Original: Contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. All videos are newly collected, ensuring zero data contamination (100% original video rate).
  • 🧩 Challenging Evaluation: Evaluations on 13 omni-modal and 7 video-only models reveal that current systems struggle significantly, particularly in speech-heavy scenarios.
  • 🚀 OFF Training Strategy: We propose the Omni-Modal Future Forecasting (OFF) strategy and provide a 7K-sample instruction-tuning dataset, which enhances both future forecasting and general perception capabilities.

📊 Leaderboard

Snapshot of zero-shot performance (accuracy %). See the paper for full results.

| Model | Size | Speech | Sound | Music | Avg |
|---|---|---|---|---|---|
| Gemini 3 Flash 🏆 | - | 60.52 | 67.13 | 68.31 | 64.80 |
| Gemini 2.5 Pro | - | 48.23 | 61.89 | 63.38 | 56.77 |
| Qwen3-Omni | 30B | 47.99 | 55.44 | 57.54 | 53.05 |
| Claude Haiku 4.5 | - | 55.56 | 64.00 | 43.82 | 51.52 |
| Video-SALMONN 2 | 7B | 40.28 | 48.95 | 54.15 | 47.00 |
| Qwen2.5-Omni | 7B | 37.83 | 54.55 | 53.85 | 47.48 |
| GPT-4o | - | 54.41 | 59.80 | 45.05 | 52.29 |

🌰 Data Example


📂 Data Format

Each entry in the annotation file is formatted as follows:

{
    "id": 0,
    "question": "Given the premise event: 'The man repeatedly demonstrates a new rhythm pattern, playing the guitar on \"One, two, three\" and explicitly pausing on the \"Four\" count while vocally counting', which event is its most direct conclusion?",
    "options": [
        "A. He continues to demonstrate the same pattern for several more minutes",
        "B. He stops playing the guitar and says 'Okay' and 'I hope everyone understood.'",
        "C. He introduces a completely different, more complex strumming pattern",
        "D. He puts down the guitar and begins to explain music theory concepts",
        "E. He asks the viewers to play along with him and checks their progress"
    ],
    "answer": "B",
    "original_video": "uu8c_EH8VPE.mp4",
    "split_point": 227,
    "video_domain": "education",
    "audio_type": "Sound",
    "forecasting_pattern": "Routine Sequences"
}
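
Below is a minimal sketch of how an annotation entry might be consumed in Python; the annotation file name (annotations.json) and the prompt wording are assumptions, not part of the official release.

import json

# Load the annotation file (the name "annotations.json" is an assumption).
with open("annotations.json", "r", encoding="utf-8") as f:
    entries = json.load(f)

def build_prompt(entry: dict) -> str:
    """Turn one FutureOmni entry into a multiple-choice prompt."""
    options = "\n".join(entry["options"])
    return (
        f"{entry['question']}\n{options}\n"
        "Answer with the letter of the most likely future event."
    )

entry = entries[0]
prompt = build_prompt(entry)
# `original_video` names the source clip and `split_point` marks where it is cut,
# so the model only sees the context before the event to be forecast.
video_path = f"videos/{entry['original_video']}"
print(video_path, entry["split_point"], entry["answer"])
print(prompt)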

🧪 Experiments

We conduct extensive evaluations on 13 omni-modal and 7 video-only models.

Overall Performance

Overall Performance on Categories

🔍 Fine-grained Results

Fine-grained Results on Audio Type

Fine-grained Results on Video Duration

🧠 In-depth Analysis

Modality Ablation Results


🚀 Omni-Modal Future Forecasting (OFF) Strategy

To mitigate current limitations, we curate a 7K-sample instruction-tuning dataset.

Key Finding: Evaluations on FutureOmni and popular audio-visual (e.g., WorldSense, DailyOmni) and video-only (e.g., Video-MME) benchmarks demonstrate that the OFF strategy significantly enhances both future forecasting and general perception.

Fine-grained Audio Performance

Fine-grained Video Category Performance

General Capability


🛠️ Quick Start

1. Prepare Data

Download the split test videos from Hugging Face and extract them into the videos/ folder.
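
If the split test videos are hosted as a dataset repo on the Hugging Face Hub, a minimal download sketch looks like the following; the repo id "OpenMOSS/FutureOmni" and the local directory are assumptions, so substitute the actual dataset link.

from huggingface_hub import snapshot_download

# Download the whole dataset repo (the repo id is assumed, not confirmed here).
snapshot_download(
    repo_id="OpenMOSS/FutureOmni",
    repo_type="dataset",
    local_dir="futureomni_data",
)
# Afterwards, extract the downloaded video archives into the videos/ folder
# expected by the evaluation scripts.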

2. Evaluation

We provide two inference implementations: one using DDP (example code in eval/infer_ddp.py) and one using vLLM (example code in eval/infer_vllm.py). We strongly recommend preprocessing the input features to speed up inference; the feature extraction code is in feature/extract.py.
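
The snippet below is a minimal accuracy-scoring sketch, independent of the DDP and vLLM scripts; run_model is a hypothetical stand-in for whichever backend produces a response per question, and the annotation file name is an assumption.

import json

def extract_choice(response: str) -> str:
    """Pull the first option letter (A-E) out of a free-form model response."""
    for ch in response.strip().upper():
        if ch in "ABCDE":
            return ch
    return ""

def accuracy(entries: list, predictions: list) -> float:
    """entries: annotation dicts; predictions: raw model responses in the same order."""
    correct = sum(
        extract_choice(pred) == entry["answer"]
        for entry, pred in zip(entries, predictions)
    )
    return correct / len(entries)

with open("annotations.json", "r", encoding="utf-8") as f:  # assumed file name
    entries = json.load(f)
# predictions = [run_model(e) for e in entries]  # hypothetical inference call
# print(f"Zero-shot accuracy: {accuracy(entries, predictions):.2%}")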

3. Train

You can download our training videos from Google Drive or Baidu Yun. The adaptation code is in train/LLaMA-Factory.
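
If you adapt a model with LLaMA-Factory, the training annotations need to be in a format it can read. The sketch below shows one plausible conversion into a ShareGPT-style record with a video field, modeled on LLaMA-Factory's multimodal demo data; the file names here are assumptions, and the authoritative format is whatever train/LLaMA-Factory expects.

import json

def to_sharegpt(entry: dict, video_root: str = "videos") -> dict:
    """Convert one FutureOmni-style entry into a ShareGPT-style multimodal record."""
    options = "\n".join(entry["options"])
    return {
        "messages": [
            {"role": "user", "content": f"<video>{entry['question']}\n{options}"},
            {"role": "assistant", "content": entry["answer"]},
        ],
        "videos": [f"{video_root}/{entry['original_video']}"],
    }

with open("train_annotations.json", "r", encoding="utf-8") as f:  # assumed file name
    entries = json.load(f)
with open("futureomni_sft.json", "w", encoding="utf-8") as f:
    json.dump([to_sharegpt(e) for e in entries], f, ensure_ascii=False, indent=2)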


📧 Contact

For any questions, please open an issue or contact [email protected].


📝 Citation

If you find FutureOmni useful for your research, please cite our paper:

@article{chen2026futureomni,
    title={FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs},
    author={Chen, Qian and Fu, Jinlan and Li, Changsong and Ng, See-Kiong and Qiu, Xipeng},
    journal={To be updated},
    year={2026}
}
