Qwen2.5-VL Video GRPO Training

🎯 Project Overview

This repository provides Video GRPO (Group Relative Policy Optimization) training capabilities for Qwen2.5-VL models, specifically designed for multiple-choice video understanding tasks. While based on the original Qwen2.5-VL codebase, this project focuses on implementing advanced preference optimization techniques for video comprehension.

🚀 Key Features & New Implementations

🎬 Video GRPO Training

Group Relative Policy Optimization: Advanced preference learning for multiple-choice video questions
Comprehensive Metrics Logging: Detailed wandb integration with policy probabilities, KL divergence, and reward tracking
Format-Aware Training: Automatic detection and reward for proper response formatting (<think>...</think> <answer>...</answer>)
Multi-Choice Optimization: Specialized handling of video understanding tasks with multiple answer options

🔧 New Components Added

GRPO Trainer (qwenvl/train/grpo_trainer.py): Custom trainer implementing group-wise preference optimization
GRPO Dataset Class (qwenvl/data/data_qwen_grpo.py): Specialized data loader for multiple-choice video tasks
GRPO Training Script (qwenvl/train/train_grpo.py): Main training entry point with GRPO-specific configurations
Comprehensive Logging: Real-time metrics tracking for training optimization
Ready-to-Use Scripts: Shell scripts for single and multi-GPU GRPO training

📊 GRPO Metrics Dashboard

Track detailed training metrics in real-time:

Policy Probabilities: grpo/policy_seq_log_prob
KL Divergence: grpo/kl_div
Format Rewards: grpo/format_reward
Accuracy Rewards: grpo/accuracy_reward
Total Rewards: grpo/total_reward
Advantage Values: grpo/advantage
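
The metric names above can be gathered into one dictionary per logging step. The following is a minimal sketch, assuming per-sample values are already available as plain Python lists. `summarize_grpo_metrics` is a hypothetical helper, not a function from this repository, and averaging is an assumption about how values are reduced before logging.

```python
from statistics import fmean

# Metric keys exactly as listed in this README.
METRIC_KEYS = (
    "grpo/policy_seq_log_prob",
    "grpo/kl_div",
    "grpo/format_reward",
    "grpo/accuracy_reward",
    "grpo/total_reward",
    "grpo/advantage",
)


def summarize_grpo_metrics(per_sample):
    """Average per-sample values into one scalar per metric key.

    `per_sample` maps a metric key to a list of floats. Keys outside
    METRIC_KEYS are rejected so that typos do not create stray panels.
    Empty lists are skipped.
    """
    unknown = set(per_sample) - set(METRIC_KEYS)
    if unknown:
        raise KeyError(f"unknown metric keys: {sorted(unknown)}")
    return {key: fmean(values) for key, values in per_sample.items() if values}


summary = summarize_grpo_metrics({
    "grpo/format_reward": [1.0, 1.0, 0.0, 1.0],
    "grpo/accuracy_reward": [1.0, 0.0, 0.0, 1.0],
})
print(summary)
```

A dictionary of this shape could then be handed to `wandb.log(summary, step=...)` inside the training loop.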

🎯 Use Cases

This implementation is ideal for:

Video Question Answering: Training models to answer multiple-choice questions about video content
Video Reasoning Tasks: Improving logical reasoning capabilities for video understanding
Preference Learning: Optimizing model responses based on correctness and format preferences
Research Applications: Exploring advanced preference optimization techniques for multimodal models

🛠️ Quick Start

Installation

# Clone the repository
git clone https://github.com/your-username/Qwen2.5-VL-Video-GRPO.git
cd Qwen2.5-VL-Video-GRPO

# Install dependencies
pip install -r requirements.txt
pip install "qwen-vl-utils[decord]"

# Install specific versions for best compatibility
pip install transformers==4.51.3 accelerate

Basic GRPO Training

# Navigate to the training directory
cd qwen-vl-finetune

# Make training script executable
chmod +x scripts/train_grpo.sh

# Run GRPO training
./scripts/train_grpo.sh

Custom GRPO Training

python qwenvl/train/train_grpo.py \
    --model_name_or_path "Qwen/Qwen2.5-VL-3B-Instruct" \
    --dataset_use "your_video_dataset_grpo" \
    --output_dir "./output/grpo_model" \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --grpo_alpha 0.5 \
    --grpo_beta 0.1 \
    --format_reward_weight 0.3 \
    --accuracy_reward_weight 0.7 \
    --grpo_sample_size 4 \
    --report_to wandb \
    --run_name "video-grpo-experiment"

📚 Documentation

GRPO Training Guide

Key GRPO Parameters

grpo_alpha: Reward scaling factor (default: 0.5)
grpo_beta: KL penalty weight (default: 0.1)
format_reward_weight: Weight for format compliance (default: 0.3)
accuracy_reward_weight: Weight for answer accuracy (default: 0.7)
grpo_sample_size: Number of samples per example (default: 4)
grpo_logging_steps: Metrics logging frequency (default: 50)

📊 Dataset Format

Your video dataset should contain multiple-choice questions:

{
    "video": "videos/sample.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nWhat caused the blue object to move?\nOptions:\n(A) Gravity\n(B) Collision\n(C) Magnetic force"
        },
        {
            "from": "gpt",
            "value": "<think>I can see in the video that there's a red ball that rolls into the blue object, causing it to move. This is clearly a collision scenario.</think> <answer>(B) Collision</answer>"
        }
    ]
}
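
Before training, it can help to check each record against the layout shown above. The following is a minimal sketch, assuming options are written one per line as `(A) ...` and the reference reply contains `<answer>(X) ...</answer>`, as in the example record. `check_record` is a hypothetical helper, not part of this repository's data loader.

```python
import re

# Conventions assumed from the example record in this README.
OPTION_RE = re.compile(r"^\(([A-Z])\)\s+\S", re.MULTILINE)
ANSWER_RE = re.compile(r"<answer>\s*\(([A-Z])\)")


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if not isinstance(record.get("video"), str):
        problems.append("missing 'video' path")
    turns = record.get("conversations") or []
    roles = [turn.get("from") for turn in turns[:2]]
    if roles != ["human", "gpt"]:
        problems.append("expected a human turn followed by a gpt turn")
        return problems
    question = turns[0].get("value", "")
    reply = turns[1].get("value", "")
    if "<video>" not in question:
        problems.append("question lacks the <video> placeholder")
    options = OPTION_RE.findall(question)
    if len(options) < 2:
        problems.append("fewer than two '(X) ...' options")
    match = ANSWER_RE.search(reply)
    if match is None:
        problems.append("reply has no '<answer>(X) ...' tag")
    elif match.group(1) not in options:
        problems.append(f"answer ({match.group(1)}) is not among the options")
    return problems


record = {
    "video": "videos/sample.mp4",
    "conversations": [
        {"from": "human",
         "value": "<video>\nWhat caused the blue object to move?\nOptions:\n"
                  "(A) Gravity\n(B) Collision\n(C) Magnetic force"},
        {"from": "gpt",
         "value": "<think>A red ball rolls into the blue object.</think> "
                  "<answer>(B) Collision</answer>"},
    ],
}
print(check_record(record))
```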

🔄 Format Types

Thinking Format (Default)

Dataset: Any name not containing "no_think"
Expected Output: <think>...</think> <answer>...</answer>
Use Case: When you want the model to show reasoning

No-Thinking Format

Dataset: Names containing "no_think"
Expected Output: <answer>...</answer> only
Use Case: When you want concise answers
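
The two formats can be checked with a pair of regular expressions. The following is a minimal sketch, assuming a binary reward (1.0 for a compliant response, 0.0 otherwise) and the dataset-name rule described above. `expected_pattern` and `format_reward` are hypothetical names, and the actual reward in the trainer may be graded differently.

```python
import re

THINK_FORMAT = re.compile(
    r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL
)
NO_THINK_FORMAT = re.compile(r"\s*<answer>.+?</answer>\s*", re.DOTALL)
TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def expected_pattern(dataset_name):
    """Pick the format from the dataset name, as described in this README."""
    return NO_THINK_FORMAT if "no_think" in dataset_name else THINK_FORMAT


def format_reward(response, dataset_name):
    """Return 1.0 if the whole response matches the expected format."""
    # Each tag may appear at most once; this rejects repeated blocks that a
    # backtracking regex would otherwise accept.
    if any(response.count(tag) > 1 for tag in TAGS):
        return 0.0
    matched = expected_pattern(dataset_name).fullmatch(response)
    return 1.0 if matched else 0.0


print(format_reward("<think>r</think> <answer>(B) Collision</answer>", "video_mcq"))
print(format_reward("<answer>(B) Collision</answer>", "video_mcq_no_think"))
```

Using `fullmatch` rather than `search` means that text before or after the tags also counts as a format violation.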

📈 Monitoring Training

The GRPO trainer provides comprehensive monitoring:

Real-time loss tracking
Format compliance metrics
Answer accuracy measurements
KL divergence from reference model
Detailed reward breakdowns
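
For readers who want to know what a KL divergence metric of this kind typically measures, here is a minimal sketch of the per-token estimator commonly paired with GRPO, exp(d) - d - 1 with d = log p_ref - log p_policy. It assumes the per-token log-probabilities of the sampled tokens are available under both models. Whether this repository's trainer uses this exact estimator is not stated here, and the function names are hypothetical.

```python
import math


def k3_kl_per_token(policy_logprobs, ref_logprobs):
    """Per-token KL estimate: exp(d) - d - 1, where d = log p_ref - log p_policy.

    The estimate is non-negative and equals zero when both models assign
    the same probability to a token.
    """
    estimates = []
    for log_p, log_ref in zip(policy_logprobs, ref_logprobs):
        d = log_ref - log_p
        estimates.append(math.exp(d) - d - 1.0)
    return estimates


def sequence_kl(policy_logprobs, ref_logprobs):
    """Mean of the per-token estimates over one generated sequence."""
    per_token = k3_kl_per_token(policy_logprobs, ref_logprobs)
    return sum(per_token) / len(per_token)


policy = [-1.0, -0.2, -2.3]
reference = [-1.5, -0.2, -2.0]
kl = sequence_kl(policy, reference)
grpo_beta = 0.1  # KL penalty weight, default value from this README
print(kl, grpo_beta * kl)
```

The last line shows the penalty under the assumption that `grpo_beta` multiplies the KL term directly.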

🏆 Advantages over Standard Fine-tuning

Better Preference Learning: Direct optimization for choosing correct answers
Robust Training: The KL penalty (grpo_beta) keeps the policy close to the reference model, which makes distribution collapse less likely
Format Enforcement: Built-in rewards for proper response structure
Detailed Monitoring: Comprehensive metrics for training insights
Video-Specific: Optimized for video understanding tasks

💡 Tips for Best Results

Start Small: Use small batch sizes (1-2) due to generation overhead
Sample Diversity: 4-8 samples per example provides good balance
Learning Rate: Use lower rates (2e-5) for stability
Reward Balance: Adjust format vs accuracy weights based on priorities
KL Control: Increase grpo_beta if model diverges too much

🔧 Troubleshooting

Out of Memory

Reduce grpo_sample_size
Reduce generation_max_length
Enable gradient checkpointing
Use smaller batch sizes

Poor Generation Quality

Verify flash attention configuration
Check tokenizer chat template
Ensure proper video frame sampling

Low Rewards

Verify dataset format matches expectations
Check ground truth answer formatting
Adjust reward weights
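
A common cause of low accuracy rewards is a mismatch between how the ground-truth answer is written and how the model's answer is parsed. The following is a minimal sketch of letter extraction and comparison, assuming answers look like `<answer>(B) Collision</answer>` as in the dataset example. `extract_choice` and `accuracy_reward` are hypothetical names, not this repository's functions.

```python
import re

ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# A single letter, optionally parenthesized, followed by a separator or the end.
LEADING_LETTER = re.compile(r"\(?([A-Za-z])\)?(?:[\s.:]|$)")


def extract_choice(text):
    """Return the option letter inside <answer>...</answer>, or None."""
    tag = ANSWER_TAG.search(text)
    if tag is None:
        return None
    letter = LEADING_LETTER.match(tag.group(1).strip())
    return letter.group(1).upper() if letter else None


def accuracy_reward(response, ground_truth):
    """Return 1.0 when both texts yield the same option letter, else 0.0."""
    predicted = extract_choice(response)
    expected = extract_choice(ground_truth)
    return 1.0 if predicted is not None and predicted == expected else 0.0


truth = "<think>The red ball hits it.</think> <answer>(B) Collision</answer>"
print(extract_choice(truth))
print(accuracy_reward("<answer>B</answer>", truth))
print(accuracy_reward("<answer>Collision</answer>", truth))
```

Running `extract_choice` over every ground-truth reply and counting the `None` results is a quick way to spot records the reward function cannot score. Note that this heuristic treats an answer beginning with a one-letter word (for example "A ball") as option A.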

🐳 Docker Support

# Use the provided Docker image
docker run --gpus all --ipc=host --network=host --rm --name qwen2.5-grpo -it qwenllm/qwenvl:2.5-cu121 bash

# Or build your own
docker build -f Dockerfile -t qwen2.5-vl-grpo .

📖 Original Qwen2.5-VL Information

This project is built upon the excellent Qwen2.5-VL foundation. Below are the key capabilities inherited from the original model:

Key Enhancements from Qwen2.5-VL:

Powerful Document Parsing Capabilities: Upgrade text recognition to omnidocument parsing, excelling at multi-scene and multilingual documents as well as documents with embedded content such as handwriting, tables, charts, chemical formulas, and music sheets.
Precise Object Grounding Across Formats: Unlock improved accuracy in detecting, pointing, and counting objects, accommodating absolute coordinate and JSON formats for advanced spatial reasoning.
Ultra-long Video Understanding and Fine-grained Video Grounding: Extend native dynamic resolution to the temporal dimension, enhancing the ability to understand videos lasting hours while extracting event segments in seconds.
Enhanced Agent Functionality for Computer and Mobile Devices: Leverage advanced grounding, reasoning, and decision-making abilities, boosting the model with superior agent functionality on smartphones and computers.

Model Sizes Available:

Qwen2.5-VL-3B-Instruct: Compact model for efficient deployment
Qwen2.5-VL-7B-Instruct: Balanced performance and efficiency
Qwen2.5-VL-32B-Instruct: Enhanced capabilities with improved alignment
Qwen2.5-VL-72B-Instruct: Flagship model with maximum performance

Performance Highlights

Dataset	Qwen2.5-VL-3B	Qwen2.5-VL-7B	Qwen2.5-VL-32B	Qwen2.5-VL-72B	GPT-4o	Claude3.5 Sonnet
MMMU	53.1	58.6	70.0	70.2	70.3	70.4
VideoMME	61.5/67.6	65.1/71.6	70.5/77.9	73.3/79.1	71.9/77.2	60/62.9
DocVQA	93.9	95.7	94.8	96.4	91.1	95.2
MathVista	62.3	68.2	74.7	74.8	63.8	65.4

🎯 Cookbooks & Examples

We provide comprehensive examples for various capabilities:

Cookbook	Description
Universal Recognition	Identify animals, plants, people, and objects
Document Parsing	Advanced document parsing with layout information
Video Understanding	Video OCR, long video understanding, and grounding
Mobile Agent	Mobile phone control and interaction

🚀 Using Base Qwen2.5-VL Models

Basic Inference

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Process input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "path/to/video.mp4",
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Generate response
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text)

Video Processing Configuration

# Configure video processing parameters
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://example.com/video.mp4",
                "min_pixels": 4 * 28 * 28,
                "max_pixels": 256 * 28 * 28,
                "total_pixels": 20480 * 28 * 28,
                "fps": 1.0,
            },
            {"type": "text", "text": "What happens in this video?"},
        ],
    }
]
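
The pixel values above are written as multiples of 28 * 28 because, in the Qwen2-VL family's documented convention, a 28 x 28 pixel area corresponds to roughly one visual token. The following is a small arithmetic sketch under that assumption. `to_tokens` is a hypothetical helper, and the way the video loader distributes the total budget across frames is not modeled here.

```python
# Assumption: one visual token per 28 x 28 pixel area.
PIXELS_PER_TOKEN = 28 * 28


def to_tokens(pixels):
    """Convert a pixel budget to an approximate visual-token budget."""
    return pixels // PIXELS_PER_TOKEN


video_config = {
    "min_pixels": 4 * 28 * 28,
    "max_pixels": 256 * 28 * 28,
    "total_pixels": 20480 * 28 * 28,
}

token_budget = {
    name.replace("_pixels", "_tokens"): to_tokens(value)
    for name, value in video_config.items()
}
print(token_budget)
```

Read this way, the configuration allows between 4 and 256 tokens per frame and caps the whole video at about 20480 visual tokens, which is useful when estimating sequence length and memory during GRPO generation.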

🔬 Research & Development

This project extends the capabilities of Qwen2.5-VL for research in:

Preference Learning: Advanced techniques for aligning model outputs with human preferences
Video Understanding: Specialized training for temporal visual reasoning
Multi-Modal Reasoning: Integration of visual and textual information processing
Reward Modeling: Sophisticated reward functions for training optimization

📄 Citation

If you use this Video GRPO implementation, please cite both this work and the original Qwen2.5-VL. BibTeX entries for the Qwen-VL model family appear below, after the Support section.

🤝 Contributing

We welcome contributions to improve the Video GRPO implementation:

Fork the repository
Create a feature branch
Make your improvements
Submit a pull request

📝 License

This project inherits the license from the original Qwen2.5-VL repository. Please see the LICENSE file for details.

🆘 Support

For questions about:

GRPO Implementation: Open an issue in this repository
Base Qwen2.5-VL: Refer to the original repository
General Discussion: Join the Discord community

Note: This is a specialized implementation focusing on Video GRPO training. For general Qwen2.5-VL usage, please refer to the original repository.

@article{Qwen2.5-VL,
  title={Qwen2.5-VL Technical Report},
  author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025}
}

@article{Qwen2-VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
