Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
Paper | Project Page | Model & Data | TimeLens-Bench Leaderboard
TimeLens rethinks video temporal grounding (VTG) with MLLMs along two axes:
- Data Quality. We expose critical quality issues in existing VTG benchmarks and propose quality-assured datasets for both training and evaluation.
- Algorithmic Design. Building upon reliable data, we explore effective timestamp encoding strategies and training recipes, achieving state-of-the-art performance among open-source models.
In this repository, we release:
- TimeLens Models: State-of-the-art open-source models for video temporal grounding.
- TimeLens-Bench: A comprehensive, high-quality evaluation benchmark for video temporal grounding.
- TimeLens-100K: A large-scale, diverse, high-quality training dataset for video temporal grounding, annotated with Gemini-2.5-Pro.
Clone this repository and navigate to the folder:

```bash
git clone https://github.com/TencentARC/TimeLens.git
cd TimeLens
```

Create a Conda environment and install the required packages:

```bash
conda create -n timelens python=3.11 -y
conda activate timelens
pip install -r requirements.txt -f https://download.pytorch.org/whl/cu124  # We use CUDA 12.4
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```

TimeLens models are a family of MLLMs with state-of-the-art video temporal grounding performance. They are built upon the Qwen2.5-VL and Qwen3-VL baselines by training on our high-quality TimeLens-100K dataset, using our carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and improved timestamp encoding strategy.
All models are available on Hugging Face and support out-of-the-box inference using the Transformers library. For detailed usage instructions and code examples, please refer to the specific model's Hugging Face page linked below.
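The snippet below is a minimal inference sketch for TimeLens-7B, assuming the model keeps the standard Qwen2.5-VL processor/model interface in Transformers; the prompt wording and video path are illustrative placeholders, and the model's Hugging Face page remains the authoritative reference:

```python
# Minimal sketch: video temporal grounding inference with TimeLens-7B,
# assuming the standard Qwen2.5-VL interface in Transformers.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper from the Qwen2.5-VL ecosystem

model_id = "TencentARC/TimeLens-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Illustrative grounding query; "path/to/video.mp4" is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4", "fps": 2.0},
        {"type": "text", "text": "When does the man in the blue jacket approach the car? "
                                 "Answer with the start and end time in seconds."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # answer text containing the predicted time span
```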
The following table lists our models with their Hugging Face links and grounding performance:
| Model (with Hugging Face link) | Charades-TimeLens | | | | ActivityNet-TimeLens | | | | QVHighlights-TimeLens | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | R1@0.3 | R1@0.5 | R1@0.7 | mIoU | R1@0.3 | R1@0.5 | R1@0.7 | mIoU | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
| Qwen2.5-VL-7B-Instruct | 59.7 | 37.8 | 16.6 | 39.3 | 44.1 | 31.0 | 16.1 | 31.4 | 41.5 | 27.8 | 15.2 | 31.6 |
| TimeLens-7B | 70.5 | 55.6 | 28.4 | 48.8 | 62.8 | 51.0 | 32.6 | 46.2 | 74.1 | 62.7 | 43.1 | 56.0 |
| Qwen3-VL-8B-Instruct | 69.2 | 53.4 | 27.5 | 48.3 | 62.1 | 51.2 | 34.4 | 46.8 | 74.2 | 64.6 | 49.3 | 59.4 |
| TimeLens-8B | 76.6 | 63.0 | 35.2 | 55.2 | 68.9 | 58.4 | 40.6 | 53.2 | 80.2 | 71.6 | 55.5 | 65.5 |
TimeLens-7B is fine-tuned from Qwen2.5-VL-7B-Instruct, and TimeLens-8B is fine-tuned from Qwen3-VL-8B-Instruct.
**Note**: For a detailed comparison with other models, please refer to the Leaderboard.
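For reference, R1@m is the percentage of queries whose predicted span has a temporal IoU of at least m with the ground-truth span, and mIoU is the mean temporal IoU. The sketch below is illustrative only; the official metric code is `evaluation/compute_metrics.py`:

```python
# Illustrative computation of R1@{0.3,0.5,0.7} and mIoU, assuming one
# predicted [start, end] span per query. Not the official implementation.
def temporal_iou(pred, gt):
    """IoU between two [start, end] spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    metrics = {f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    metrics["mIoU"] = 100.0 * sum(ious) / len(ious)
    return metrics


# Example: one query where the model predicts [0.0, 5.0] against a
# ground-truth span of [0.0, 4.0] (IoU = 4 / 5 = 0.8).
print(grounding_metrics([[0.0, 5.0]], [[0.0, 4.0]]))
```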
Download the TimeLens-Bench dataset from Hugging Face and place it in the `data/TimeLens-Bench` directory:

```bash
hf download TencentARC/TimeLens-Bench \
    --repo-type=dataset \
    --local-dir data/TimeLens-Bench
```

Extract the compressed videos:

```bash
mkdir -p data/TimeLens-Bench/videos
find data/TimeLens-Bench/video_shards -name "*.tar.gz" | \
    xargs -P 4 -I {} tar -xzf {} -C data/TimeLens-Bench/videos  # Parallel extraction with 4 processes
```

The folder structure should look like this:
```
TimeLens/
└── data/
    └── TimeLens-Bench/
        ├── activitynet-timelens.json
        ├── charades-timelens.json
        ├── qvhighlights-timelens.json
        ├── videos/        # extracted videos
        │   ├── activitynet/
        │   ├── charades/
        │   └── qvhighlights/
        └── video_shards/  # compressed videos (can be deleted after extraction)
```
Our codebase supports evaluation of the following models:
| Model | Supported |
|---|---|
| TimeLens-7B | ✅ |
| TimeLens-8B | ✅ |
| Qwen2.5-VL | ✅ |
| Qwen3-VL | ✅ |
The evaluation script is `scripts/eval_timelens_bench.sh`. You can set the following environment variables:

- `model_path`: Path or Hugging Face ID of the model to evaluate. Default: `TencentARC/TimeLens-8B`
- `datasets`: Comma-separated list of datasets to evaluate. Default: `charades-timelens,activitynet-timelens,qvhighlights-timelens`
- `CUDA_VISIBLE_DEVICES`: GPU indices to use (e.g., `0,1,2,3`). Default: auto-detect all available GPUs
- `pred_path`: Directory to save results. Default: `./logs`
- `min_tokens`: Minimum tokens for video encoding. Default: `64`
- `total_tokens`: Total tokens for video encoding. Default: `14336`
- `FPS`: Frames per second for video sampling. Default: `2`
Example 1: Evaluate TimeLens-8B (default settings)

```bash
model_path="TencentARC/TimeLens-8B" bash scripts/eval_timelens_bench.sh
```

Example 2: Evaluate TimeLens-7B on specific datasets with specific GPUs

```bash
CUDA_VISIBLE_DEVICES=0,1 \
datasets="activitynet-timelens,qvhighlights-timelens" \
model_path="TencentARC/TimeLens-7B" \
bash scripts/eval_timelens_bench.sh
```

Example 3: Evaluate Qwen3-VL with a local model path and a custom path to save results

```bash
pred_path="./path/to/results" \
model_path="path/to/Qwen3-VL-8B-Instruct" \
bash scripts/eval_timelens_bench.sh
```

**Tip: Faster Evaluation with DataLoader**
Our evaluation script `evaluation/eval_dataloader.py` supports multi-GPU inference. More importantly, it uses a PyTorch `DataLoader` with multiple workers to prefetch and preprocess video data in parallel while the GPU handles model inference. This significantly accelerates evaluation for long-video tasks like video temporal grounding. It is also more research-friendly than inference engines such as vLLM, since the model inference code remains easy to customize. A minimal sketch of this pattern is shown below.
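The sketch uses a dummy dataset and no real model, purely to illustrate how worker prefetching overlaps with inference; the actual code lives in `evaluation/eval_dataloader.py`:

```python
# Illustrative sketch of DataLoader-based prefetching: CPU workers decode and
# preprocess videos in parallel while the main process (the GPU during real
# evaluation) consumes the prepared batches. Dummy data only.
import torch
from torch.utils.data import Dataset, DataLoader


class DummyVideoDataset(Dataset):
    """Stands in for a dataset that decodes and resizes video frames per sample."""

    def __init__(self, num_samples: int = 16):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # In the real dataset, video decoding and preprocessing happen here,
        # inside a CPU worker process.
        frames = torch.randn(32, 3, 224, 224)  # 32 dummy frames
        return {"frames": frames, "idx": idx}


if __name__ == "__main__":
    loader = DataLoader(
        DummyVideoDataset(),
        batch_size=1,
        num_workers=4,       # 4 worker processes preprocess videos in parallel
        prefetch_factor=2,   # each worker keeps 2 batches ready ahead of time
        collate_fn=lambda batch: batch[0],  # one video per step, no padding needed
    )

    with torch.inference_mode():
        for sample in loader:
            # Model inference would run here, overlapping with the workers
            # that are already decoding the next videos in the background.
            _ = sample["frames"].mean()
```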
Evaluating TimeLens-7B on ActivityNet-TimeLens with 8× H20 GPUs:
| Method | Time |
|---|---|
| Without DataLoader | 1h23min |
| With DataLoader | ~34min (~2.4x faster) |
To evaluate your own model on TimeLens-Bench, follow these steps:
1. **Load annotations**: Use our provided `timelens_data.py` for loading annotations.

2. **Run inference and save results**: Run inference with your model and save the results in a JSON or JSONL file with the following format:

   ```python
   {
       f'{video_name}>>>{query}>>>{ground_truth_span}': {
           "timestamps": timestamps,  # the predicted time span from the model
           "answers": answer,         # the full answer text from the model
       }
   }
   ```

   An example of a correctly saved JSON file:

   ```json
   {
       "v_BrgYIg6UXhU.mp4>>>A man wearing a blue jacket approaches a blue car>>>[0.0, 4.0]": {
           "timestamps": [[0.0, 5.0]],
           "answers": "The event happens in 0.0 - 5.0 seconds."
       },
       ...
   }
   ```

   In your inference results, you can provide either `timestamps` or `answers`. In the next step (Step 3, compute metrics), `evaluation/compute_metrics.py` applies the following logic:

   - If `timestamps` is provided, IoU metrics are computed directly from it.
   - If only `answers` is provided, the script automatically extracts the timestamp pair from the answer text.

3. **Compute metrics**: Use our provided `evaluation/compute_metrics.py` to compute metrics.

   ```bash
   python evaluation/compute_metrics.py -f /path/to/your_result.json
   ```

For more details on implementing the above steps, you can refer to the evaluation scripts of our supported models.
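As a concrete illustration of Step 2, here is a minimal sketch that writes results in the required key format; `my_model_predict` and the sample field names are hypothetical placeholders, not part of the TimeLens codebase:

```python
# Illustrative sketch of Step 2: build the result dict keyed by
# "{video_name}>>>{query}>>>{ground_truth_span}" and save it as JSON.
import json

def my_model_predict(video_path, query):
    """Hypothetical stand-in for your own model's inference."""
    return [[0.0, 5.0]], "The event happens in 0.0 - 5.0 seconds."

# In practice, load the benchmark samples via timelens_data.py (Step 1);
# the field names below are illustrative assumptions, not the actual schema.
samples = [{
    "video": "v_BrgYIg6UXhU.mp4",
    "query": "A man wearing a blue jacket approaches a blue car",
    "span": [0.0, 4.0],
}]

results = {}
for s in samples:
    timestamps, answer = my_model_predict(s["video"], s["query"])
    key = f'{s["video"]}>>>{s["query"]}>>>{s["span"]}'
    results[key] = {"timestamps": timestamps, "answers": answer}

with open("your_result.json", "w") as f:
    json.dump(results, f, indent=2)
```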
Download the TimeLens-100K dataset from Hugging Face and place it in the `data/TimeLens-100K` directory:

```bash
hf download TencentARC/TimeLens-100K \
    --repo-type=dataset \
    --local-dir data/TimeLens-100K
```

Extract the compressed videos:

```bash
mkdir -p data/TimeLens-100K/videos
find data/TimeLens-100K/video_shards -name "*.tar.gz" | \
    xargs -P 4 -I {} tar -xzf {} -C data/TimeLens-100K/videos  # Parallel extraction with 4 processes
```

The folder structure should look like this:
```
TimeLens/
└── data/
    └── TimeLens-100K/
        ├── README.md
        ├── timelens-100k.jsonl
        ├── videos/        # extracted videos
        │   ├── cosmo_cap/
        │   ├── didemo/
        │   ├── hirest/
        │   ├── internvid_vtime/
        │   └── queryd/
        └── video_shards/  # compressed videos (can be deleted after extraction)
```
We provide an example script `timelens_data.py` for loading TimeLens-100K annotations. You can refer to this code to integrate TimeLens-100K into your own training codebase.
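If you only need the raw records, the annotation file is standard JSON Lines and can be read line by line; the sketch below is a minimal illustration (use the provided `timelens_data.py` for the supported loading logic):

```python
# Minimal sketch: read the raw TimeLens-100K annotation file (JSON Lines).
import json

records = []
with open("data/TimeLens-100K/timelens-100k.jsonl", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip any blank lines
            records.append(json.loads(line))

print(f"Loaded {len(records)} training annotations")
```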
Our training code will be released soon! Stay tuned!
If you find our paper, code, model, and data helpful for your research and applications, please consider giving a star ⭐ and a citation :)
```bibtex
@article{zhang2025timelens,
  title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
  author={Zhang, Jun and Wang, Teng and Ge, Yuying and Ge, Yixiao and Li, Xinhao and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2512.14698},
  year={2025}
}
```

Our project is built upon the following awesome works: