Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
Paper | Project Page | Model & Data | TimeLens-Bench Leaderboard
TimeLens rethinks video temporal grounding (VTG) with MLLMs along two axes:
- Data Quality. We expose critical quality issues in existing VTG benchmarks and propose quality-assured datasets for both training and evaluation.
- Algorithmic Design. Building upon reliable data, we explore effective timestamp encoding strategies and training recipes, achieving state-of-the-art performance among open-source models.
In this repository, we release:
- TimeLens Models: State-of-the-art open-source models for video temporal grounding.
- TimeLens-Bench: A comprehensive, high-quality evaluation benchmark for video temporal grounding.
- TimeLens-100K: A large-scale, diverse, high-quality training dataset for video temporal grounding, annotated with Gemini-2.5-Pro.
Clone this repository and navigate to the folder:

```bash
git clone https://github.com/TencentARC/TimeLens.git
cd TimeLens
```

Create a Conda environment and install the required packages:

```bash
conda create -n timelens python=3.11 -y
conda activate timelens
pip install -r requirements.txt -f https://download.pytorch.org/whl/cu124  # We use CUDA 12.4
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```

TimeLens models are a family of MLLMs with state-of-the-art video temporal grounding performance. They are built upon the Qwen2.5-VL and Qwen3-VL baselines by training on our high-quality TimeLens-100K dataset, using our carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and improved timestamp encoding strategy.
All models are available on Hugging Face and support out-of-the-box inference using the Transformers library. For detailed usage instructions and code examples, please refer to the specific model's Hugging Face page linked below.
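The snippet below is a minimal inference sketch for TimeLens-7B, assuming the model keeps the standard Qwen2.5-VL processor/model interface in Transformers; the prompt wording and video path are illustrative placeholders, and the model's Hugging Face page remains the authoritative reference:

```python
# Minimal sketch: video temporal grounding inference with TimeLens-7B,
# assuming the standard Qwen2.5-VL interface in Transformers.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper from the Qwen2.5-VL ecosystem

model_id = "TencentARC/TimeLens-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Illustrative grounding query; "path/to/video.mp4" is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4", "fps": 2.0},
        {"type": "text", "text": "When does the man in the blue jacket approach the car? "
                                 "Answer with the start and end time in seconds."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # answer text containing the predicted time span
```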
The following table lists our models with their Hugging Face links and grounding performance:
| Model (with Hugging Face link) | Charades-TimeLens | | | | ActivityNet-TimeLens | | | | QVHighlights-TimeLens | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | R1@0.3 | R1@0.5 | R1@0.7 | mIoU | R1@0.3 | R1@0.5 | R1@0.7 | mIoU | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
| Qwen2.5-VL-7B-Instruct | 59.7 | 37.8 | 16.6 | 39.3 | 44.1 | 31.0 | 16.1 | 31.4 | 41.5 | 27.8 | 15.2 | 31.6 |
| TimeLens-7B | 70.5 | 55.6 | 28.4 | 48.8 | 62.8 | 51.0 | 32.6 | 46.2 | 74.1 | 62.7 | 43.1 | 56.0 |
| Qwen3-VL-8B-Instruct | 69.2 | 53.4 | 27.5 | 48.3 | 62.1 | 51.2 | 34.4 | 46.8 | 74.2 | 64.6 | 49.3 | 59.4 |
| TimeLens-8B | 76.6 | 63.0 | 35.2 | 55.2 | 68.9 | 58.4 | 40.6 | 53.2 | 80.2 | 71.6 | 55.5 | 65.5 |
TimeLens-7B is fine-tuned from Qwen2.5-VL-7B-Instruct, and TimeLens-8B is fine-tuned from Qwen3-VL-8B-Instruct.
**Note**: For a detailed comparison with other models, please refer to the Leaderboard.
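For reference, R1@m is the percentage of queries whose predicted span has a temporal IoU of at least m with the ground-truth span, and mIoU is the mean temporal IoU. The sketch below is illustrative only; the official metric code is `evaluation/compute_metrics.py`:

```python
# Illustrative computation of R1@{0.3,0.5,0.7} and mIoU, assuming one
# predicted [start, end] span per query. Not the official implementation.
def temporal_iou(pred, gt):
    """IoU between two [start, end] spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    metrics = {f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    metrics["mIoU"] = 100.0 * sum(ious) / len(ious)
    return metrics


# Example: one query where the model predicts [0.0, 5.0] against a
# ground-truth span of [0.0, 4.0] (IoU = 4 / 5 = 0.8).
print(grounding_metrics([[0.0, 5.0]], [[0.0, 4.0]]))
```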
Download the TimeLens-Bench dataset from Hugging Face and place it in the `data/TimeLens-Bench` directory:

```bash
hf download TencentARC/TimeLens-Bench \
    --repo-type=dataset \
    --local-dir data/TimeLens-Bench
```

Extract the compressed videos:

```bash
mkdir -p data/TimeLens-Bench/videos
find data/TimeLens-Bench/video_shards -name "*.tar.gz" | \
    xargs -P 4 -I {} tar -xzf {} -C data/TimeLens-Bench/videos  # Parallel extraction with 4 processes
```

The folder structure should look like this:
```
TimeLens/
└── data/
    └── TimeLens-Bench/
        ├── activitynet-timelens.json
        ├── charades-timelens.json
        ├── qvhighlights-timelens.json
        ├── videos/        # extracted videos
        │   ├── activitynet/
        │   ├── charades/
        │   └── qvhighlights/
        └── video_shards/  # compressed videos (can be deleted after extraction)
```
Our codebase supports evaluation of the following models:
| Model | Supported |
|---|---|
| TimeLens-7B | ✅ |
| TimeLens-8B | ✅ |
| Qwen2.5-VL | ✅ |
| Qwen3-VL | ✅ |
The evaluation script is `scripts/eval_timelens_bench.sh`. You can set the following environment variables:

- `model_path`: Path or Hugging Face ID of the model to evaluate. Default: `TencentARC/TimeLens-8B`
- `datasets`: Comma-separated list of datasets to evaluate. Default: `charades-timelens,activitynet-timelens,qvhighlights-timelens`
- `CUDA_VISIBLE_DEVICES`: GPU indices to use (e.g., `0,1,2,3`). Default: auto-detect all available GPUs
- `pred_path`: Directory to save results. Default: `./logs`
- `min_tokens`: Minimum tokens for video encoding. Default: `64`
- `total_tokens`: Total tokens for video encoding. Default: `14336`
- `FPS`: Frames per second for video sampling. Default: `2`
Example 1: Evaluate TimeLens-8B (default settings)

```bash
model_path="TencentARC/TimeLens-8B" bash scripts/eval_timelens_bench.sh
```

Example 2: Evaluate TimeLens-7B on specific datasets with specific GPUs

```bash
CUDA_VISIBLE_DEVICES=0,1 \
datasets="activitynet-timelens,qvhighlights-timelens" \
model_path="TencentARC/TimeLens-7B" \
bash scripts/eval_timelens_bench.sh
```

Example 3: Evaluate Qwen3-VL with a local model path and a custom path to save results

```bash
pred_path="./path/to/results" \
model_path="path/to/Qwen3-VL-8B-Instruct" \
bash scripts/eval_timelens_bench.sh
```

**Tip: Faster Evaluation with DataLoader**
Our evaluation script `evaluation/eval_dataloader.py` supports multi-GPU inference. More importantly, it uses a PyTorch `DataLoader` with multiple workers to prefetch and preprocess video data in parallel while the GPU handles model inference. This significantly accelerates evaluation for long-video tasks like video temporal grounding. It is also more research-friendly than inference engines such as vLLM, since the model inference code remains easy to customize. A minimal sketch of this pattern is shown below.
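The sketch uses a dummy dataset and no real model, purely to illustrate how worker prefetching overlaps with inference; the actual code lives in `evaluation/eval_dataloader.py`:

```python
# Illustrative sketch of DataLoader-based prefetching: CPU workers decode and
# preprocess videos in parallel while the main process (the GPU during real
# evaluation) consumes the prepared batches. Dummy data only.
import torch
from torch.utils.data import Dataset, DataLoader


class DummyVideoDataset(Dataset):
    """Stands in for a dataset that decodes and resizes video frames per sample."""

    def __init__(self, num_samples: int = 16):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # In the real dataset, video decoding and preprocessing happen here,
        # inside a CPU worker process.
        frames = torch.randn(32, 3, 224, 224)  # 32 dummy frames
        return {"frames": frames, "idx": idx}


if __name__ == "__main__":
    loader = DataLoader(
        DummyVideoDataset(),
        batch_size=1,
        num_workers=4,       # 4 worker processes preprocess videos in parallel
        prefetch_factor=2,   # each worker keeps 2 batches ready ahead of time
        collate_fn=lambda batch: batch[0],  # one video per step, no padding needed
    )

    with torch.inference_mode():
        for sample in loader:
            # Model inference would run here, overlapping with the workers
            # that are already decoding the next videos in the background.
            _ = sample["frames"].mean()
```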
Evaluating TimeLens-7B on ActivityNet-TimeLens with 8× H20 GPUs:
| Method | Time |
|---|---|
| Without DataLoader | 1h23min |
| With DataLoader | ~34min (~2.4x faster) |
To evaluate your own model on TimeLens-Bench, follow these steps:
1. **Load annotations**: Use our provided `timelens_data.py` for loading annotations.

2. **Run inference and save results**: Run inference with your model and save the results in a JSON or JSONL file with the following format:

   ```python
   {
       f'{video_name}>>>{query}>>>{ground_truth_span}': {
           "timestamps": timestamps,  # the predicted time span from the model
           "answers": answer,         # the full answer text from the model
       }
   }
   ```

   An example of a correctly saved JSON file:

   ```json
   {
       "v_BrgYIg6UXhU.mp4>>>A man wearing a blue jacket approaches a blue car>>>[0.0, 4.0]": {
           "timestamps": [[0.0, 5.0]],
           "answers": "The event happens in 0.0 - 5.0 seconds."
       },
       ...
   }
   ```

   In your inference results, you can provide either `timestamps` or `answers`. In the next step (Step 3, compute metrics), `evaluation/compute_metrics.py` applies the following logic:

   - If `timestamps` is provided, IoU metrics are computed directly from it.
   - If only `answers` is provided, the script automatically extracts the timestamp pair from the answer text.

3. **Compute metrics**: Use our provided `evaluation/compute_metrics.py` to compute metrics.

   ```bash
   python evaluation/compute_metrics.py -f /path/to/your_result.json
   ```

For more details on implementing the above steps, you can refer to the evaluation scripts of our supported models.
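As a concrete illustration of Step 2, here is a minimal sketch that writes results in the required key format; `my_model_predict` and the sample field names are hypothetical placeholders, not part of the TimeLens codebase:

```python
# Illustrative sketch of Step 2: build the result dict keyed by
# "{video_name}>>>{query}>>>{ground_truth_span}" and save it as JSON.
import json

def my_model_predict(video_path, query):
    """Hypothetical stand-in for your own model's inference."""
    return [[0.0, 5.0]], "The event happens in 0.0 - 5.0 seconds."

# In practice, load the benchmark samples via timelens_data.py (Step 1);
# the field names below are illustrative assumptions, not the actual schema.
samples = [{
    "video": "v_BrgYIg6UXhU.mp4",
    "query": "A man wearing a blue jacket approaches a blue car",
    "span": [0.0, 4.0],
}]

results = {}
for s in samples:
    timestamps, answer = my_model_predict(s["video"], s["query"])
    key = f'{s["video"]}>>>{s["query"]}>>>{s["span"]}'
    results[key] = {"timestamps": timestamps, "answers": answer}

with open("your_result.json", "w") as f:
    json.dump(results, f, indent=2)
```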
Download the TimeLens-100K dataset from Hugging Face and place it in the `data/TimeLens-100K` directory:

```bash
hf download TencentARC/TimeLens-100K \
    --repo-type=dataset \
    --local-dir data/TimeLens-100K
```

Extract the compressed videos:

```bash
mkdir -p data/TimeLens-100K/videos
find data/TimeLens-100K/video_shards -name "*.tar.gz" | \
    xargs -P 4 -I {} tar -xzf {} -C data/TimeLens-100K/videos  # Parallel extraction with 4 processes
```

The folder structure should look like this:
```
TimeLens/
└── data/
    └── TimeLens-100K/
        ├── README.md
        ├── timelens-100k.jsonl
        ├── videos/        # extracted videos
        │   ├── cosmo_cap/
        │   ├── didemo/
        │   ├── hirest/
        │   ├── internvid_vtime/
        │   └── queryd/
        └── video_shards/  # compressed videos (can be deleted after extraction)
```
We provide an example script `timelens_data.py` for loading TimeLens-100K annotations. You can refer to this code to integrate TimeLens-100K into your own training codebase.
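If you only need the raw records, the annotation file is standard JSON Lines and can be read line by line; the sketch below is a minimal illustration (use the provided `timelens_data.py` for the supported loading logic):

```python
# Minimal sketch: read the raw TimeLens-100K annotation file (JSON Lines).
import json

records = []
with open("data/TimeLens-100K/timelens-100k.jsonl", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip any blank lines
            records.append(json.loads(line))

print(f"Loaded {len(records)} training annotations")
```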
Our training code will be released soon! Stay tuned!
If you find our paper, code, model, and data helpful for your research and applications, please consider giving a star ⭐ and a citation :)
```bibtex
@article{zhang2025timelens,
  title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
  author={Zhang, Jun and Wang, Teng and Ge, Yuying and Ge, Yixiao and Li, Xinhao and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2512.14698},
  year={2025}
}
```

Our project is built upon the following awesome works: