Light, Efficient, Omni-modal & Reward-model Driven Reinforcement Fine-Tuning Framework
English | 简体中文
LightRFT is a lightweight post-training framework for LLMs, designed to support post-training exploration in the safety and multimodal domains.
LightRFT (Light Reinforcement Fine-Tuning) is an advanced reinforcement learning fine-tuning framework designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). The framework provides efficient and scalable RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards) training capabilities, supporting multiple state-of-the-art algorithms and distributed training strategies.
- 🚀 High-Performance Inference Engines
- Integrated vLLM and SGLang for efficient sampling and inference
- FP8 inference optimization for significantly reduced latency and memory usage
- Flexible engine sleep/wake mechanisms for optimal resource utilization
- 🧠 Rich Algorithm Ecosystem
- Policy Optimization: GRPO, GSPO, GMPO, Dr.GRPO
- Advantage Estimation: REINFORCE++, CPGD
- Reward Processing: Reward Norm/Clip
- Sampling Strategy: FIRE Sampling, Token-Level Policy
- Stability Enhancement: DAPO, select_high_entropy_tokens
- 🔧 Flexible Training Strategies
- FSDP (Fully Sharded Data Parallel) v2 support
- DeepSpeed ZeRO (Stage 1/2/3) support
- Gradient checkpointing and mixed precision training (BF16/FP16)
- Adam Offload and memory optimization techniques
- Support for sample packing
- 🎯 Innovative Resource Collaboration
- Colocate Anything: Co-locate reward models with training models to maximize GPU utilization
- Support multiple reward models for parallel inference on the same device
- Dynamic memory management with automatic training/inference phase switching
- Reduced cross-device communication overhead for improved end-to-end training efficiency
- Balance Anything 🚧 (Under Development): Intelligent load balancing system
- Adaptive task scheduling and resource allocation
- Automatic load balancing for multi-node training
- Performance optimization for heterogeneous hardware environments
- 🌐 Comprehensive Multimodal Support
- Native Vision-Language Model (VLM) Training
- Support for mainstream VLMs like Qwen-VL, InternVL
- Parallel processing of multimodal image-text data
- Support for sequence parallelism
- Efficient multimodal tokenization and batching
- Support for processing a single text with multiple images
- Multimodal Reward Modeling
- Support for multiple visual reward models working in collaboration
- Joint optimization of image understanding and text generation
- Support for deploying reward models as a service
- Complete Vision-Language Alignment Training Pipeline
- Optimized for multimodal RLVR/RLHF training
- Built-in support for vision-language model fine-tuning
- 📊 Complete Experimental Toolkit
- Weights & Biases (W&B) integration
- Math capability benchmarking (GSM8K, Geo3K, etc.)
- Trajectory saving and analysis tools
- Automatic checkpoint management
- ✈️ Efficient Performance Optimization Strategies
- Support data load balancing during mixed text-image training, reducing training time by 30%
- Support for dynamic batch sizes
- Support for memory-optimized logprobs calculation
For detailed algorithm descriptions, implementation details, and usage guides, see the Algorithm Documentation.
| Algorithm | Type | Key Improvement | Paper |
|---|---|---|---|
| GRPO | Policy Optimization | Group normalized advantage estimation | arXiv:2402.03300 |
| GSPO | Policy Optimization | Group sequence policy optimization | arXiv:2507.18071 |
| GMPO (WIP) | Policy Optimization | Geometric-mean policy optimization | arXiv:2507.20673 |
| Dr.GRPO | Policy Optimization | Length bias mitigation | arXiv:2503.20783 |
| DAPO | Policy Optimization | Decoupled clip and dynamic sampling policy optimization | arXiv:2503.14476 |
| REINFORCE++ | Advantage Estimation | Improved baseline estimation | arXiv:2501.03262 |
| CPGD | Advantage Estimation | KL-based drift constraint | arXiv:2505.12504 |
| FIRE Sampling | Sampling Strategy | Filtering and ranking strategies | arXiv:2410.21236 |
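In practice, the algorithm family is selected through command-line flags rather than separate entry points. A minimal sketch, assuming a placeholder launcher name (train_ppo.py is not a confirmed script) and using only flag names and values that appear in the configuration reference below:

```bash
# Sketch only: the launcher name is a placeholder; flag names and values mirror the
# configuration section of this README (--advantage_estimator: group_norm, reinforce, cpgd).

# GRPO-style training with group-normalized advantages
python train_ppo.py --advantage_estimator group_norm --n_samples_per_prompt 8

# REINFORCE++-style advantage estimation
python train_ppo.py --advantage_estimator reinforce --n_samples_per_prompt 8

# CPGD with a KL-based drift constraint
python train_ppo.py --advantage_estimator cpgd --init_kl_coef 0.001
```

The example scripts under examples/gsm8k_geo3k show the full argument set actually used in training.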
- Python >= 3.10
- CUDA >= 12.8
- PyTorch >= 2.5.1
TO BE DONE
Clone and install LightRFT:
# Clone the repository
git clone https://github.com/DeepLink-org/LightRFT.git
cd LightRFT
# Install dependencies
pip install -r requirements.txt
# Install LightRFT
pip install -e .

# Single node, 8 GPU training example
cd LightRFT
# Run GRPO training (GSM8K math reasoning task)
bash examples/gsm8k_geo3k/run_grpo_gsm8k_qwen2.5_0.5b.sh
# Or run Geo3K geometry problem training (VLM multimodal)
bash examples/gsm8k_geo3k/run_grpo_geo3k_qwen2.5_vl_7b.sh

LightRFT/
├── lightrft/ # Core library
│ ├── strategy/ # Training & inference strategies
│ │ ├── fsdp/ # FSDP implementation
│ │ ├── deepspeed/ # DeepSpeed implementation
│ │ ├── vllm_utils/ # vLLM utilities
│ │ ├── sglang_utils/ # SGLang utilities
│ │ └── utils/ # Strategy utilities
│ ├── models/ # Model definitions
│ │ ├── actor_al.py # Audio-language model actor
│ │ ├── actor_language.py # Language model actor
│ │ ├── actor_vl.py # Vision-language model actor
│ │ ├── grm_vl.py # Generative reward model (Vision-Language)
│ │ ├── srm_al.py # Scalar reward model (Audio-Language)
│ │ ├── srm_vl.py # Scalar reward model (Vision-Language)
│ │ ├── loss.py # Loss functions
│ │ ├── monkey_patch/ # Model adaptation patches for distributed training
│ │ ├── tests/ # Model tests
│ │ └── utils.py # Model utilities
│ ├── trainer/ # Trainer implementations
│ │ ├── ppo_trainer.py # LLM PPO trainer
│ │ ├── ppo_trainer_vl.py # VLM PPO trainer
│ │ ├── spmd_ppo_trainer.py # SPMD PPO trainer Extension (Core)
│ │ ├── grm_trainer_vl.py # Generative reward model trainer (Vision-Language)
│ │ ├── srm_trainer_al.py # Scalar reward model trainer (Audio-Language)
│ │ ├── srm_trainer_vl.py # Scalar reward model trainer (Vision-Language)
│ │ ├── fast_exp_maker.py # Fast experience generator (Core)
│ │ ├── experience_maker.py # Base experience generator
│ │ ├── experience_maker_vl.py # Base experience generator for VLM
│ │ ├── replay_buffer.py # Replay buffer
│ │ ├── replay_buffer_vl.py # VLM replay buffer
│ │ ├── replay_buffer_utils.py # Replay buffer utilities
│ │ ├── kl_controller.py # KL divergence controller
│ │ └── utils.py # Trainer utilities
│ ├── datasets/ # Dataset processing
│ │ ├── audio_alpaca.py # Audio Alpaca dataset
│ │ ├── grm_dataset.py # Generative reward model dataset
│ │ ├── hpdv3.py # HPDv3 reward model dataset
│ │ ├── image_reward_db.py # Image reward database
│ │ ├── imagegen_cot_reward.py # Image generation CoT generative reward
│ │ ├── omnirewardbench.py # OmniRewardBench dataset
│ │ ├── process_reward_dataset.py # Reward dataset processing
│ │ ├── prompts_dataset.py # LLM Prompts dataset
│ │ ├── prompts_dataset_vl.py # Vision-language prompts dataset
│ │ ├── rapidata.py # Rapidata reward model dataset
│ │ ├── sft_dataset.py # SFT dataset
│ │ ├── sft_dataset_vl.py # VLM SFT dataset
│ │ ├── srm_dataset.py # Scalar reward model base dataset
│ │ └── utils.py # Dataset utilities
│ └── utils/ # Utility functions
│ ├── ckpt_scripts/ # Checkpoint processing scripts
│ ├── cli_args.py # CLI argument parsing
│ ├── distributed_sampler.py # Distributed sampler
│ ├── logging_utils.py # Logging utilities
│ ├── processor.py # Data processor for HF model
│ ├── remote_rm_utils.py # Remote reward model utilities
│ ├── timer.py # Timer utilities
│ ├── trajectory_saver.py # Trajectory saver
│ └── utils.py # General utilities
│
├── examples/ # Usage examples
│ ├── gsm8k_geo3k/ # GSM8K/Geo3K math reasoning training examples
│ ├── grm_training/ # Generative reward model training examples
│ ├── srm_training/ # Scalar reward model training examples
│ └── chat/ # Model dialogue examples
│
├── docs/ # 📚 Sphinx documentation
│ ├── Makefile # Documentation build Makefile
│ ├── make.bat # Documentation build batch file
│ └── source/ # Documentation source
│ ├── _static/ # Static files (CSS, etc.)
│ ├── api_doc/ # API documentation
│ ├── best_practice/ # Best practices & resources
│ ├── installation/ # Installation guides
│ └── quick_start/ # Quick start & user guides
│
├── assets/ # Assets
│ └── logo.png # Project logo
│
├── CHANGELOG.md # Changelog
├── LICENSE # License file
├── Makefile # Project Makefile
├── README.md # Project documentation (English)
├── README_zh.md # Project documentation (Chinese)
├── requirements.txt # Python dependencies
├── requirements-dev.txt # Development dependencies
├── requirements-doc.txt # Documentation dependencies
└── setup.py # Package setup script
- lightrft/: LightRFT core library, providing training strategies, model definitions, and trainer implementations
- examples/: Complete training examples and scripts
  - gsm8k_geo3k/: GSM8K and Geo3K math reasoning training examples
  - grm_training/: Generative reward model training examples
  - srm_training/: Scalar reward model training examples
  - chat/: Model dialogue examples
- docs/: Sphinx documentation with complete user guides and API documentation
TBS=128 # Training batch size
RBS=128 # Rollout batch size
micro_train_batch_size=1 # Micro batch size per GPU
micro_rollout_batch_size=2 # Rollout micro batch size

--advantage_estimator group_norm # Advantage estimator: group_norm, reinforce, cpgd
--n_samples_per_prompt 8 # Number of samples per prompt
--max_epochs 1 # Training epochs per episode
--num_episodes 3 # Total training episodes
--kl_estimator k3 # KL estimator type
--init_kl_coef 0.001 # KL penalty coefficient

--fsdp # Enable FSDP
--zero_stage 3 # DeepSpeed ZeRO Stage
--gradient_checkpointing # Gradient checkpointing
--adam_offload # Adam optimizer offload
--bf16 # BF16 mixed precision

--rm_use_engine # Use inference engine (vLLM/SGLang) for reward model
--engine_mem_util 0.4 # Engine memory utilization
--engine_tp_size 1 # Engine tensor parallelism degree
--enable_engine_sleep # Enable engine sleep mechanism

- Use gradient checkpointing
--gradient_checkpointing
- Use Adam offload
--adam_offload
- Adjust engine memory utilization
--engine_mem_util 0.4
- Environment variable optimization
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
- FP8 rollout (inference stage only)
- Reduce inference latency and VRAM usage, while maintaining BF16 precision during training.
--fp8_rollout
--enable_vllm_is_correction

- Flash Attention
--flash_attn
- Batch size optimization
Recommended: train_batch_size >= rollout_batch_size × n_samples_per_prompt
See training scripts for detailed parameter validation logic.
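The knobs above are normally combined in a single launch command. A minimal sketch, assuming a placeholder launcher name and using only flags documented in this README; the scripts under examples/gsm8k_geo3k remain the authoritative reference:

```bash
# Sketch only: train_ppo.py is a placeholder entry point; flag names and values
# are copied from the configuration reference above.
python train_ppo.py \
    --advantage_estimator group_norm \
    --n_samples_per_prompt 8 \
    --max_epochs 1 \
    --num_episodes 3 \
    --kl_estimator k3 \
    --init_kl_coef 0.001 \
    --fsdp \
    --gradient_checkpointing \
    --adam_offload \
    --bf16 \
    --rm_use_engine \
    --engine_mem_util 0.4 \
    --engine_tp_size 1 \
    --enable_engine_sleep
```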
If you run out of GPU memory (OOM), try the following solutions:
- Reduce micro_train_batch_size and micro_rollout_batch_size
- Enable --gradient_checkpointing
- Lower --engine_mem_util
- Use ZeRO Stage 3
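Combined, a hedged example of the memory-saving options above (a placeholder launcher name; the flags themselves are documented in this README):

```bash
# Illustrative only: placeholder launcher; reduce micro_train_batch_size /
# micro_rollout_batch_size in your training script first, then add:
python train_ppo.py \
    --gradient_checkpointing \
    --adam_offload \
    --engine_mem_util 0.3 \
    --zero_stage 3
```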
If training is unstable, try the following solutions:
- Enable reward normalization: --normalize_reward
- Lower the learning rate
- Use --advantage_estimator group_norm
- Try the DAPO algorithm
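Similarly, a hedged combination of the stability-related flags above (the learning rate itself is configured in the training script; the launcher name is again a placeholder):

```bash
# Illustrative only: both flags appear elsewhere in this README.
python train_ppo.py \
    --normalize_reward \
    --advantage_estimator group_norm
```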
Quick Start:
- Installation Guide - Docker images, installation methods, and troubleshooting
- Supported Algorithms - Comprehensive algorithm guide with implementation details
- Configuration Reference - Complete parameter documentation
Best Practices:
- Training Strategy Usage - FSDP, DeepSpeed, and inference engine configuration
- FAQ - Frequently asked questions and solutions
- Troubleshooting Guide - Common issues and debugging
- Contributing Guide - How to contribute to LightRFT
Install documentation dependencies:
pip install -r requirements-doc.txt

Generate HTML documentation:
make docs
# Open docs/build/index.html to view documentation

Live documentation preview:
make docs-live
# Visit http://localhost:8000

We welcome community contributions! Please follow these steps:
- Fork this repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
# Install development dependencies
pip install -r requirements-dev.txt
# Code formatting (YAPF)
make format
# Code linting (Flake8)
make fcheck

If you use this codebase in your research or applications, please cite it as follows:
@misc{lightrft,
title={LightRFT},
author={Niu, Yazhe and Pu, Yuan and Shi, Dongxing and Lu, Yudong and Xiong, Yingtong and Ge, Ruijun and Sun, Jiaxuan and Wan, Zunian and Zhang, Shaoang and others},
publisher={GitHub},
howpublished={\url{https://github.com/DeepLink-org/LightRFT}},
year={2025},
}

- Business Team Framework Group: Responsible for the development of the algorithm ecosystem, training strategies, multimodal support, and the experimental toolchain, focusing on algorithm innovation and the enhancement of model training capabilities.
- System Team DeepLink: Responsible for high-performance inference engines, resource coordination mechanisms, system-level performance optimization, and underlying infrastructure, focusing on the optimization of system performance and resource utilization efficiency.
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
LightRFT is collaboratively developed by The RL Team of the Safe and Trustworthy Center and The System Platform Center DeepLink Team at Shanghai AI Laboratory. We extend our sincere thanks to the contributors from both teams.
- The reinforcement learning component of this project is developed based on OpenRLHF, and we express our heartfelt gratitude to its development team for their outstanding work.
- We thank the Qwen, InternVL, and DeepSeek teams for providing excellent open-source foundation models.
- We also acknowledge the powerful tools provided by open-source communities such as DeepSpeed, PyTorch, vLLM, and SGLang.
This project builds upon the following outstanding open-source projects (including but not limited to):
- OpenRLHF, verl - Core RL framework foundation (parts of key components adapted and reused)
- vLLM - High-performance inference engine
- SGLang - Structured generation language runtime
- DeepSpeed - Distributed training optimization
- PyTorch FSDP - Fully Sharded Data Parallel
Thanks to all contributors and supporters!
We are actively working on the following improvements and features:
- Trajectory Functionality Extension
- Add more analysis metrics
- Enhanced trajectory saving and analysis capabilities
- Reward Mechanism Refactoring
- Refactor rule-based and model-based reward computation
- Optimize reward dataset processing pipeline
- More Algorithm Integration
- Entropy-based token selection
- GMPO (Geometric-Mean Policy Optimization)
- GSPO (Group Sequence Policy Optimization)
- Advantage Computation Refactoring
- Optimize advantage estimation module architecture
- Unify advantage computation interface across algorithms
- Loss-Filter Mechanism Optimization
- Refactor loss filtering implementation
- Complete GSM8K/Geo3K benchmark experiments
- Document experimental results and analysis
Community contributions and feedback are welcome!
For questions or suggestions, please contact us via:
- Issues: GitHub Issues
- Email: [email protected]
⭐ If this project helps you, please give us a star!
Made with ❤️ by LightRFT Team