[ICCV'25] When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning


VisionXLab/LRS-VQA



  • 2026/2/14 🌟 Code and model weights are released.
  • 2025/6/26 🌟 Our paper is accepted by ICCV 2025!
  • 2025/3/11 🌟 LRS-VQA Benchmark is now released!

This project focuses on efficient perception of Large Remote Sensing Images (RSIs) using Large Vision-Language Models (LVLMs) under limited resources, covering the following key aspects:

  • Coarse-to-Fine Focusing & Pruning: An iterative process that zooms from coarse, low-resolution overviews into fine-grained, high-resolution views to focus and analyze text-related regions.
  • Region Focus Module (RFM): Learns text-aware key region localization capabilities from LVLM through attention distillation, enabling focused analysis on critical image tiles.
  • LRS-VQA: A new benchmark for Large RSI perception, featuring 7,333 QA pairs across 8 categories, with images reaching up to 27,328 pixels in length and an average size of 7,099×6,329 pixels.

πŸ“ TODO

  • Release benchmark.
  • Release code and model weights.
  • Release training script.

πŸ› οΈ Method

Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on a dynamic image pyramid (DIP), guided by RFM outputs, which avoids directly processing the entire large image.

1. Region Focus Module (RFM)

Region Focus Module

Schematic illustration of the Region Focus Module (RFM).

The RFM learns text-aware key vision token localization from the LLM part of the LVLM via attention distillation, allowing it to focus on the parts of an image most relevant to the query for detailed analysis.

This text-driven attention convergence phenomenon was first observed in LVLMs on small general-domain images:

Text-based attention convergence in general images.
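As an illustration of attention distillation (a minimal sketch, not the released implementation; the attention values, loss form, and function names are hypothetical), the RFM's predicted attention over vision tokens can be trained to match the LLM's text-to-vision attention via a KL-divergence objective:

```python
import math

def softmax(scores):
    """Convert raw scores to a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete attention distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Teacher: text-to-vision attention from selected LLM layers (values made up).
teacher_attn = softmax([0.2, 2.5, 0.1, 1.8])   # peaks on vision tokens 1 and 3

# Student: the lightweight RFM's predicted attention over the same vision tokens.
student_attn = softmax([0.0, 1.0, 0.0, 1.0])

# Distillation loss: push the RFM's distribution toward the LLM's.
loss = kl_divergence(teacher_attn, student_attn)
print(round(loss, 4))
```

Minimizing this loss teaches the small RFM to reproduce the LLM's text-conditioned focus without running the full LLM over every tile.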

2. Coarse-to-Fine Token Pruning

Pipeline Overview

The overall pipeline of our proposed method.

Initially, the DIP is constructed from the input large RSI. At the low-resolution DIP level, the RFM provides an attention distribution over the initial vision tokens, which guides either the retrieval of corresponding image tiles from higher-resolution DIP levels or token pruning at the current level. This iterative process continues through the pyramid until the original resolution is reached.
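One coarse-to-fine step can be sketched as follows (an illustrative simplification, not the released code; the 2×2 grid, keep ratio, and attention values are made up): tiles with the highest RFM attention at the current level are kept and expanded into their children at the next, finer level, while the rest are pruned:

```python
def select_tiles(attn, keep_ratio):
    """Indices of the top-attention tiles to keep for finer-level retrieval."""
    k = max(1, round(len(attn) * keep_ratio))
    return sorted(sorted(range(len(attn)), key=attn.__getitem__, reverse=True)[:k])

def children(tile_idx, grid, factor=2):
    """Child tile indices of a coarse tile on the next pyramid level
    (each tile splits into factor x factor finer tiles)."""
    r, c = divmod(tile_idx, grid)
    fine_grid = grid * factor
    return [(r * factor + dr) * fine_grid + (c * factor + dc)
            for dr in range(factor) for dc in range(factor)]

# One step on a 2x2 coarse grid with hypothetical RFM attention per tile:
coarse_attn = [0.05, 0.70, 0.10, 0.15]
kept = select_tiles(coarse_attn, keep_ratio=0.25)
fine_tiles = [ch for t in kept for ch in children(t, grid=2)]
print(kept, fine_tiles)   # -> [1] [2, 3, 6, 7]
```

Only tile 1 survives, so only its four children on the 4×4 level are ever tokenized; the other three quarters of the image are never processed at high resolution.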


📚 LRS-VQA Benchmark

Dataset Examples

Construction process of LRS-VQA.

MME-RealWorld provides a high-quality benchmark across multiple domains. In remote sensing, we aim to further enrich the task types and reflect the challenges of large RSI perception. LRS-VQA includes 1,657 images ranging in length from 1,024 to 27,328 pixels, covering 8 different question types with 7,333 QA pairs.

Resolution vs Accuracy

The accuracy trend of Qwen2-VL across varying maximum input pixels. Accuracy on both the manually annotated MME-RealWorld-RS and our proposed LRS-VQA correlates positively with resolution, demonstrating the effectiveness of LRS-VQA in evaluating LVLMs' high-resolution RSI perception capabilities.

Download and Evaluation

To get started with the dataset and evaluation scripts, see the Getting Started section below.

Results


Leaderboard and performance comparison. Average accuracy is reported for each dataset.


Efficiency comparison.

Detailed Results

Detailed results in LRS-FAIR.

Detailed results in LRS-Bridge.

Detailed results in LRS-STAR.


🚀 Getting Started

Model Weights

We provide the model weights on ModelScope and Hugging Face.

| Model | Base | ModelScope | Hugging Face |
|---|---|---|---|
| LLaVA-NeXT-7B | Qwen2 | ModelScope | Hugging Face |
| LLaVA-1.5-7B | Vicuna | ModelScope | Hugging Face |

1. Environment Setup

We recommend using Conda to manage the environment. This project requires Python 3.10 and is tested on NVIDIA A100/A800 GPUs.

# Create and activate the conda environment
conda create -n lrsvqa python=3.10 -y
conda activate lrsvqa

# Upgrade pip
pip install --upgrade pip

2. Install PyTorch

The required PyTorch version depends on your NVIDIA Driver Version (check with nvidia-smi).

  • For Modern Drivers (Version >= 525.60):
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
  • For Older Drivers (Version < 525.60, e.g., 470.xx): If you encounter a RuntimeError related to an old NVIDIA driver, use the CUDA 11.8 build:
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

3. Install Dependencies

cd LRS-VQA-Code
# Install base and training requirements
pip install -e .
pip install -e ".[train]"

# Install specific versions of other key packages
pip install flash-attn==2.6.3 --no-build-isolation
pip install bitsandbytes==0.43.3 safetensors==0.4.5 pydantic==2.9.1 peft==0.3.0

βš™οΈ Inference

Step 1: Apply Environment Patch (Crucial)

⚠️ Important: Our model includes custom modifications that require patching your local transformers library installation. Before running inference, please copy our modified modeling files into your conda environment.

# Get the site-packages path of your conda environment
SITE_PACKAGES=$(python -c 'import site; print(site.getsitepackages()[0])')

# Copy the patched files (Execute from project root)
cp ./LRS-VQA-Code/llava/model/multimodal_encoder/transformers/models/llama/modeling_llama.py $SITE_PACKAGES/transformers/models/llama/
cp ./LRS-VQA-Code/llava/model/multimodal_encoder/transformers/models/qwen2/modeling_qwen2.py $SITE_PACKAGES/transformers/models/qwen2/
cp ./LRS-VQA-Code/llava/model/multimodal_encoder/transformers/modeling_outputs.py $SITE_PACKAGES/transformers/
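To confirm the patch took effect, you can check that each copied file is byte-identical to its source (a quick sanity sketch; the `patched` helper is our own, and the paths assume the commands above were run from the project root):

```python
import filecmp
import os
import site

def patched(src: str, dst: str) -> bool:
    """True if dst exists and is byte-identical to the patched source file."""
    return (os.path.exists(src) and os.path.exists(dst)
            and filecmp.cmp(src, dst, shallow=False))

site_packages = site.getsitepackages()[0]
checks = [
    ("./LRS-VQA-Code/llava/model/multimodal_encoder/transformers/models/llama/modeling_llama.py",
     os.path.join(site_packages, "transformers", "models", "llama", "modeling_llama.py")),
    ("./LRS-VQA-Code/llava/model/multimodal_encoder/transformers/models/qwen2/modeling_qwen2.py",
     os.path.join(site_packages, "transformers", "models", "qwen2", "modeling_qwen2.py")),
    ("./LRS-VQA-Code/llava/model/multimodal_encoder/transformers/modeling_outputs.py",
     os.path.join(site_packages, "transformers", "modeling_outputs.py")),
]
for src, dst in checks:
    print(dst, "patched" if patched(src, dst) else "NOT patched")
```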

Step 2: Prepare Data

Depending on which benchmark you wish to evaluate, follow the corresponding instructions below.

Option A: For MME-RealWorld (Remote Sensing) Evaluation

  1. Download: Obtain the dataset from the official MME-RealWorld repository.
  2. Organize: Arrange your files as shown below. The evaluation script will need the path to this main data directory.
/path/to/your/MME_RealWorld/
├── MME_RealWorld_RS.json
├── evaluation/
│   └── eval_your_results.py
└── remote_sensing/
    ├── 03553_Toronto.png
    ├── 03555_Toronto.png
    └── ...

Option B: For LRS-VQA Benchmark Evaluation

  1. Download: Download the dataset from our Hugging Face repository.
  2. Extract: Place all downloaded parts in the same directory. Use 7-Zip (Windows) or p7zip (Linux) to extract the archive by running the command on the first file (.001).
# Example using p7zip on Linux
7z x LRS_VQA.7z.001
  3. Verify Structure: After extraction, you will have an LRS_VQA folder. The evaluation script expects the path to its parent directory. Ensure your final structure looks like this:
/path/to/dataset_parent_directory/
└── LRS_VQA/
    ├── LRS_VQA_merged.jsonl
    └── image/
        ├── 15565.tif
        ├── 9043.tif
        └── ...
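As a quick sanity check on the extracted data (a sketch; `count_qa_pairs` is our own helper, assuming the annotation file is standard JSON Lines, and the root path should be adjusted to your download location):

```python
import json
import os

def count_qa_pairs(jsonl_path):
    """Count QA entries in a JSON Lines annotation file, skipping blank lines."""
    n = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)   # raises ValueError if a line is not valid JSON
                n += 1
    return n

root = "/path/to/dataset_parent_directory/LRS_VQA"
ann = os.path.join(root, "LRS_VQA_merged.jsonl")
if os.path.exists(ann):
    # The full benchmark contains 7,333 QA pairs per the paper.
    print(count_qa_pairs(ann), "QA pairs")
```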

Step 3: Run Inference

We provide shell scripts to simplify the evaluation process.

To evaluate on LRS-VQA:

bash LRS-VQA-Code/eval_lrs_vqa.sh

To evaluate on MME-RealWorld-RS:

bash LRS-VQA-Code/eval_mme-realworld-rs.sh

Citation

If you find this work helpful for your research, please consider giving this repo a star ⭐ and citing our paper:

@InProceedings{Luo_2025_ICCV,
    title={When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning},
    author={Luo, Junwei and Zhang, Yingying and Yang, Xue and Wu, Kang and Zhu, Qi and Liang, Lei and Chen, Jingdong and Li, Yansheng},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month={October},
    year={2025},
    pages={9206-9217}
}

@article{li2024scene,
    title={STAR: A First-Ever Dataset and A Large-Scale Benchmark for Scene Graph Generation in Large-Size Satellite Imagery},
    author={Li, Yansheng and Wang, Linlin and Wang, Tingzhu and Yang, Xue and Luo, Junwei and Wang, Qi and Deng, Youming and Wang, Wenbin and Sun, Xian and Li, Haifeng and Dang, Bo and Zhang, Yongjun and Yu, Yi and Yan, Junchi},
    journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year={2024},
    publisher={IEEE}}

@article{li2024learning,
    title={Learning to Holistically Detect Bridges From Large-Size VHR Remote Sensing Imagery},
    author={Li, Yansheng and Luo, Junwei and Zhang, Yongjun and Tan, Yihua and Yu, Jin-Gang and Bai, Song},
    journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
    volume={44},
    number={11},
    pages={7778--7796},
    year={2024},
    publisher={IEEE}
}

Acknowledgement

We thank the authors of MME-RealWorld and PyramidDrop for their great works and codebases.
