TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang

📑 Paper  |  🏠 Project Page  |  🤗 Model & Data  |  🏆 TimeLens-Bench Leaderboard

🔎 Overview

TimeLens rethinks video temporal grounding (VTG) with MLLMs along two axes:

  • Data Quality. We expose critical quality issues in existing VTG benchmarks and propose quality-assured datasets for both training and evaluation.
  • Algorithmic Design. Building upon reliable data, we explore effective timestamp encoding strategies and training recipes, achieving state-of-the-art performance among open-source models.

📚 Quick Navigation

In this repository, we release:

  • 🤖 TimeLens Models: state-of-the-art open-source models for video temporal grounding.
  • 📊 TimeLens-Bench: a comprehensive, high-quality evaluation benchmark for video temporal grounding.
  • 🏋️ TimeLens-100K: a large-scale, diverse, high-quality training dataset for video temporal grounding, annotated with Gemini-2.5-Pro.

📦 Installation

Clone this repository and navigate to the folder:

git clone https://github.com/TencentARC/TimeLens.git
cd TimeLens

Create a Conda environment and install the required packages:

conda create -n timelens python=3.11 -y
conda activate timelens
pip install -r requirements.txt -f https://download.pytorch.org/whl/cu124 # We use CUDA Version 12.4
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir

🤖 Using TimeLens Models

TimeLens models are a family of MLLMs with state-of-the-art video temporal grounding performance. They are built upon the Qwen2.5-VL and Qwen3-VL baselines by training on our high-quality TimeLens-100K dataset, leveraging our carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and improved timestamp encoding strategy.

🚀 Quick Start

All models are available on Hugging Face and support out-of-the-box inference using the 🤗 Transformers library. For detailed usage instructions and code examples, please refer to the specific model's Hugging Face page linked below.
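As a rough illustration, inference with TimeLens-7B might look like the following. This is a minimal sketch that assumes the model follows the standard Qwen2.5-VL chat interface in 🤗 Transformers (with qwen_vl_utils installed); the video path, query, and prompt wording are placeholders, so please follow the model card for the exact recommended usage.

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "TencentARC/TimeLens-7B"  # built on Qwen2.5-VL-7B-Instruct
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder video and grounding query; see the model card for the exact prompt format.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4", "fps": 2.0},
        {"type": "text", "text": "When does the man in the blue jacket approach the blue car? "
                                 "Answer with a time span in seconds."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])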

πŸ† Model Zoo & Performance

The following table lists our models with their Hugging Face links and grounding performance:

Model (🤗 HuggingFace link)  | Charades-TimeLens            | ActivityNet-TimeLens         | QVHighlights-TimeLens
                             | R1@0.3 R1@0.5 R1@0.7  mIoU   | R1@0.3 R1@0.5 R1@0.7  mIoU   | R1@0.3 R1@0.5 R1@0.7  mIoU
Qwen2.5-VL-7B-Instruct       |  59.7   37.8   16.6   39.3   |  44.1   31.0   16.1   31.4   |  41.5   27.8   15.2   31.6
TimeLens-7B 🚀               |  70.5   55.6   28.4   48.8   |  62.8   51.0   32.6   46.2   |  74.1   62.7   43.1   56.0
Qwen3-VL-8B-Instruct         |  69.2   53.4   27.5   48.3   |  62.1   51.2   34.4   46.8   |  74.2   64.6   49.3   59.4
TimeLens-8B 🚀               |  76.6   63.0   35.2   55.2   |  68.9   58.4   40.6   53.2   |  80.2   71.6   55.5   65.5

TimeLens-7B is fine-tuned from Qwen2.5-VL-7B-Instruct, and TimeLens-8B is fine-tuned from Qwen3-VL-8B-Instruct.

Note

For detailed comparison with other models, please refer to the 🏆 Leaderboard.

📊 Evaluation on TimeLens-Bench

Download TimeLens-Bench

Download the TimeLens-Bench dataset from Hugging Face and place it in the data/TimeLens-Bench directory:

hf download TencentARC/TimeLens-Bench \
  --repo-type=dataset \
  --local-dir data/TimeLens-Bench

Extract the compressed videos:

mkdir -p data/TimeLens-Bench/videos
find data/TimeLens-Bench/video_shards -name "*.tar.gz" | \
  xargs -P 4 -I {} tar -xzf {} -C data/TimeLens-Bench/videos # Parallel extraction with 4 processes

The folder structure should look like this:

TimeLens/
└── data/
    └── TimeLens-Bench/
        ├── activitynet-timelens.json
        ├── charades-timelens.json
        ├── qvhighlights-timelens.json
        ├── videos/              # extracted videos
        │   ├── activitynet/
        │   ├── charades/
        │   └── qvhighlights/
        └── video_shards/        # compressed videos (can be deleted after extraction)

Evaluate with Our Codebase (TimeLens / Qwen-VL Models)

Our codebase supports evaluation of the following models:

  • TimeLens-7B
  • TimeLens-8B
  • Qwen2.5-VL
  • Qwen3-VL

The evaluation script is scripts/eval_timelens_bench.sh. You can set the following environment variables:

  • model_path: Path or HuggingFace ID of the model to evaluate. Default: TencentARC/TimeLens-8B
  • datasets: Comma-separated list of datasets to evaluate. Default: charades-timelens,activitynet-timelens,qvhighlights-timelens
  • CUDA_VISIBLE_DEVICES: GPU indices to use (e.g., 0,1,2,3). Default: Auto-detect all available GPUs
  • pred_path: Directory to save results. Default: ./logs
  • min_tokens: Minimum tokens for video encoding. Default: 64
  • total_tokens: Total tokens for video encoding. Default: 14336
  • FPS: Frames per second for video sampling. Default: 2

Example 1: Evaluate TimeLens-8B (default settings)

model_path="TencentARC/TimeLens-8B" bash scripts/eval_timelens_bench.sh

Example 2: Evaluate TimeLens-7B on specific datasets with specific GPUs

CUDA_VISIBLE_DEVICES=0,1 \
datasets="activitynet-timelens,qvhighlights-timelens" \
model_path="TencentARC/TimeLens-7B" \
bash scripts/eval_timelens_bench.sh

Example 3: Evaluate Qwen3-VL with a local model path and a custom path to save results:

pred_path="./path/to/results" \
model_path="path/to/Qwen3-VL-8B-Instruct" \
bash scripts/eval_timelens_bench.sh

Tip

Faster Evaluation with DataLoader 🚀

Our evaluation script evaluation/eval_dataloader.py supports multi-GPU inference. More importantly, we use PyTorch DataLoader with multiple workers to prefetch and preprocess video data in parallel, while the GPU handles model inference. This significantly accelerates evaluation for long-video tasks like video temporal grounding. Additionally, this approach is more research-friendly compared to inference engines like vLLM, as it allows easy customization of the model inference code.

Evaluating TimeLens-7B on ActivityNet-TimeLens with 8× H20 GPUs:

Method             | Time
Without DataLoader | 1h23min
With DataLoader    | ~34min (~2.4× faster)
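The pattern behind this speedup is roughly as follows. This is a simplified sketch, not the actual implementation: VideoEvalDataset, load_and_preprocess_video, and run_model are placeholders, and the real logic lives in evaluation/eval_dataloader.py.

import torch
from torch.utils.data import Dataset, DataLoader

class VideoEvalDataset(Dataset):
    """Sketch of an evaluation dataset over (video_path, query) pairs."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        video_path, query = self.samples[idx]
        # Video decoding and frame sampling run in CPU worker processes,
        # overlapping with GPU inference on previously prefetched batches.
        frames = load_and_preprocess_video(video_path)  # placeholder
        return frames, query

loader = DataLoader(
    VideoEvalDataset(samples),  # `samples` built from the benchmark annotations
    batch_size=1,
    num_workers=8,              # parallel video preprocessing
    pin_memory=True,
    prefetch_factor=2,
)

for frames, queries in loader:
    with torch.no_grad():
        preds = run_model(frames.cuda(non_blocking=True), queries)  # placeholder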

Evaluate Your Own Model

To evaluate your own model on TimeLens-Bench, follow these steps:

  1. Load annotations: Use our provided timelens_data.py for loading annotations.

  2. Run inference and save results: Run inference with your model and save results in a JSON or JSONL file with the following format:

    {
        "{video_name}>>>{query}>>>{ground_truth_span}": {
            "timestamps": timestamps,  # the predicted time span from the model
            "answers": answer,         # the full answer text from the model
        }
    }

    An example of a correctly saved JSON file:

    {
        "v_BrgYIg6UXhU.mp4>>>A man wearing a blue jacket approaches a blue car>>>[0.0, 4.0]":
        {
            "timestamps": [[0.0, 5.0]],
            "answers": "The event happens in 0.0 - 5.0 seconds."
        },
        ...
    }

    In your inference results, you can provide either timestamps or answers. In the next step (Step 3, compute metrics), evaluation/compute_metrics.py applies the following logic:

    • If timestamps is provided, IoU metrics are computed directly from it.
    • If only answers is provided, the script will automatically extract the timestamp pair from the answer text.
  3. Compute metrics: Use our provided evaluation/compute_metrics.py to compute metrics.

python evaluation/compute_metrics.py -f /path/to/your_result.json

For more details on implementing the above steps, you can refer to the evaluation scripts of our supported models.
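As a rough sketch of step 2, the result file can be assembled like this. The annotation field names and run_my_model below are illustrative placeholders, not the actual API; use the provided timelens_data.py for authoritative annotation loading.

import json

results = {}
for sample in annotations:  # annotations loaded via timelens_data.py
    # Illustrative field names; adapt to the loader's actual output.
    video_name, query, gt_span = sample["video"], sample["query"], sample["span"]
    answer = run_my_model(video_name, query)  # your model's full answer text (placeholder)
    key = f"{video_name}>>>{query}>>>{gt_span}"
    results[key] = {
        # Either provide "timestamps" ([[start, end]] in seconds) parsed yourself,
        # or only "answers" and let compute_metrics.py extract the span from the text.
        "answers": answer,
    }

with open("my_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)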

πŸ‹οΈ Training on TimeLens-100K

Download TimeLens-100K

Download the TimeLens-100K dataset from Hugging Face and place it in the data/TimeLens-100K directory:

hf download TencentARC/TimeLens-100K \
  --repo-type=dataset \
  --local-dir data/TimeLens-100K

Extract the compressed videos:

mkdir -p data/TimeLens-100K/videos
find data/TimeLens-100K/video_shards -name "*.tar.gz" | \
  xargs -P 4 -I {} tar -xzf {} -C data/TimeLens-100K/videos # Parallel extraction with 4 processes

The folder structure should look like this:

TimeLens/
└── data/
    └── TimeLens-100K/
        ├── README.md
        ├── timelens-100k.jsonl
        ├── videos/              # extracted videos
        │   ├── cosmo_cap/
        │   ├── didemo/
        │   ├── hirest/
        │   ├── internvid_vtime/
        │   └── queryd/
        └── video_shards/        # compressed videos (can be deleted after extraction)

Train with Your Own Codebase

We provide an example script timelens_data.py for loading TimeLens-100K annotations. You can refer to this code to integrate TimeLens-100K into your own training codebase.
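If you just want to inspect the raw annotation file, a minimal sketch is shown below; the record schema is defined by timelens_data.py, so this snippet only peeks at the fields rather than assuming them.

import json

ann_path = "data/TimeLens-100K/timelens-100k.jsonl"  # JSON Lines: one annotation per line

with open(ann_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        if i == 0:
            print("Available fields:", sorted(record.keys()))
        # ... convert each record into your codebase's sample format here
        if i >= 4:
            break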

Use Our Training Code

Our training code will be released soon! Stay tuned!

πŸ“ Citation

If you find our paper, code, model, and data helpful for your research and applications, please consider giving a star ⭐ and citation 📝 :)

@article{zhang2025timelens,
  title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
  author={Zhang, Jun and Wang, Teng and Ge, Yuying and Ge, Yixiao and Li, Xinhao and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2512.14698},
  year={2025}
}

🙏 Acknowledgement

Our project is built upon the following awesome works:
