MiliLab/Text-Before-Vision

Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in UHR Remote Sensing

Fengxiang Wang1,2, Mingshuo Chen3, Yueying Li1, Yajie Yang4, Yuhao Zhou5, Di Wang6
Yifan Zhang7, Haoyu Wang8, Haiyan Zhao8, Hongda Sun9, Long Lan1, Jun Song
Yulin Wang8*, Jing Zhang6*, Wenlong Zhang2*, Bo Du6

1National University of Defense Technology, 2Shanghai AI Laboratory
3Beijing University of Posts and Telecommunications, 4University of Chinese Academy of Sciences
5Sichuan University, 6Wuhan University, 7Chinese Academy of Sciences
8Tsinghua University, 9Renmin University of China

paper dataset checkpoint

📚 Contents

  • 🔍Overview
  • 🌐Earth-Science Text QA Dataset
  • 🛠️Methodology & Training
  • 🚀Evaluation
  • 🤝Acknowledgement

🔍Overview


Fig 1. The impact of domain data and staged training on UHR Remote Sensing Understanding.

We introduce a novel Staged Knowledge Injection framework for Ultra-High-Resolution (UHR) Remote Sensing (RS) understanding. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) offers a path for navigating massive pixel spaces, we find that standard RL struggles without structured domain priors.

Our controlled studies yield a counter-intuitive finding: high-quality Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Based on this, we propose a "Text-Before-Vision" recipe that achieves 60.40% Pass@1 on XLRS-Bench, establishing a new state-of-the-art.

The key contributions are:

  • Mechanistic Insights: We demonstrate that the reasoning boundary (Pass@32) in UHR tasks is primarily governed by domain-prior coverage. Text-only QA instills reasoning structures that facilitate visual evidence retrieval.
  • Earth-Science Text QA Pipeline: We release an automated pipeline and a dataset of 148,777 high-quality Text CoT QA pairs, constructed from 8.8k textbooks and 200k scientific papers, rigorously verified by a domain-specific Knowledge Graph.
  • Staged Knowledge Injection Recipe: We propose a two-stage training strategy: (1) cold-starting with text QA to instill reasoning structures, followed by (2) "pre-warming" on hard UHR image-text examples during SFT to stabilize subsequent tool-based Agentic RLVR.

🌐Earth-Science Text QA Dataset

We construct a large-scale, domain-specialized text QA dataset using a fully automated pipeline with "Active Pre-emptive Validation." The data generation process (Fig 2) utilizes a Knowledge Graph (built via LightRAG) to filter hallucinations before generation.
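The released snippet does not include the validation code, but the idea of "Active Pre-emptive Validation" can be illustrated with a toy sketch: a generated QA candidate is accepted only if every entity it mentions is grounded in the knowledge graph. All names below are hypothetical and not the released pipeline.

```python
# Toy sketch of pre-emptive validation: reject a generated QA candidate
# if it mentions entities absent from the knowledge graph.
# Hypothetical illustration only, not the released LightRAG-based pipeline.

def validate_candidate(candidate_entities, kg_triples):
    """Accept a QA candidate only if every entity it mentions
    appears in the knowledge graph (as a head or a tail)."""
    kg_entities = {h for h, _, _ in kg_triples} | {t for _, _, t in kg_triples}
    unknown = set(candidate_entities) - kg_entities
    return len(unknown) == 0, unknown

kg = [("NDVI", "derived_from", "red/NIR reflectance"),
      ("Landsat-8", "carries", "OLI sensor")]

ok, missing = validate_candidate({"NDVI", "Landsat-8"}, kg)
print(ok)  # True: all entities grounded in the KG
ok2, missing2 = validate_candidate({"NDVI", "Sentinel-9"}, kg)
print(ok2)  # False: "Sentinel-9" is a hallucinated entity
```

In the actual pipeline this check happens before generation is accepted, so hallucinated facts are filtered rather than post-hoc corrected.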


Fig 2. Automated pipeline for Earth-science text QA generation and verification.

Dataset Statistics

| Statistic | Value |
|---|---|
| Total QA Pairs | 148,777 |
| Avg. Question Length | 64.0 tokens |
| Avg. CoT + Answer Length | 256.9 tokens |
| Reasoning Steps (Avg.) | 2.6 steps |
| Question Types | MCQ (24%), Fill (7%), T/F (4%), Free-form (65%) |
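Statistics like the average lengths above can be recomputed from the released JSON records along the following lines. The field names (`question`, `cot_answer`) are assumptions; adapt them to the released schema, which also uses tokenizer counts rather than whitespace splitting.

```python
# Recompute rough dataset statistics from QA records.
# Field names ("question", "cot_answer") are hypothetical; the released
# statistics use proper tokenizer counts, not whitespace splitting.

def avg_token_len(records, field):
    # Whitespace tokenization as a cheap proxy for tokenizer counts.
    lengths = [len(r[field].split()) for r in records]
    return sum(lengths) / len(lengths)

records = [
    {"question": "What does NDVI measure?",
     "cot_answer": "NDVI contrasts NIR and red reflectance. Answer: vegetation vigor."},
    {"question": "Name the sensor on Landsat-8.",
     "cot_answer": "Landsat-8 carries OLI. Answer: OLI."},
]
print(avg_token_len(records, "question"))  # 4.5
```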

🛠️Methodology & Training

Our approach (Agentic RLVR) utilizes Qwen2.5-VL as the base model and integrates GRPO with zoom-in tools.

1. Prepare Data

  • Earth-Science Text QA: Download our constructed dataset from Hugging Face.
  • SuperRS-VQA: Make sure you have the SuperRS-VQA images for the SFT stage.
  • General RL Data: We use DeepEyes-47K for general reasoning stability.

2. Training Stages

We use a staged training process for Text-Before-Vision. Please check the training folder first.

Stage 1: Cold-Start SFT

We utilize LLaMA-Factory to perform SFT.
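LLaMA-Factory looks up datasets by name in its `dataset_info.json`. A hypothetical registration for the two SFT files, `geollava_superrs.json` and `text_148k_sft.json`, might look like the fragment below; the exact `formatting` and `columns` values depend on how the released JSONs are structured, so check them before training.

```json
{
  "text_148k_sft": {
    "file_name": "/path/to/text_148k_sft.json",
    "formatting": "sharegpt",
    "columns": {"messages": "conversations"}
  },
  "geollava_superrs": {
    "file_name": "/path/to/geollava_superrs.json",
    "formatting": "sharegpt",
    "columns": {"messages": "conversations", "images": "images"}
  }
}
```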

```bash
# 1. Download and prepare the SFT data from Hugging Face.
#    Make sure to fix the absolute image paths in SuperRS-VQA first.
#    Two JSON files are required: geollava_superrs.json (SuperRS-VQA) and
#    text_148k_sft.json (text knowledge injection).
# 2. Run SFT with LLaMA-Factory.
#    We use this specific commit: https://github.com/hiyouga/LlamaFactory/tree/2a822178dea4d1c05f595521dd883a8e4f4e2e77
#    If you hit a TypeError during dataset preprocessing, see https://github.com/hiyouga/LlamaFactory/issues/5613
#    Update the JSON paths in dataset_info.json and in the YAML config.
llamafactory-cli train yamls/base.yaml
```
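Fixing the absolute image paths can be scripted. The sketch below assumes a LLaVA-style record layout with an `images` list per sample (an assumption; inspect the actual JSON first):

```python
# Rewrite absolute image paths in a LLaVA-style SFT JSON so they point at
# the local SuperRS-VQA image directory. The "images" field name is an
# assumption; check the actual file before running.
import os

def rewrite_image_paths(records, new_root):
    for rec in records:
        rec["images"] = [os.path.join(new_root, os.path.basename(p))
                         for p in rec.get("images", [])]
    return records

sample = [{"images": ["/old/cluster/path/img_0001.png"]}]
fixed = rewrite_image_paths(sample, "/data/SuperRS-VQA/images")
print(fixed[0]["images"][0])  # /data/SuperRS-VQA/images/img_0001.png
```

Load the JSON with `json.load`, run the rewrite, and dump it back before pointing `dataset_info.json` at it.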

Stage 2: Agentic RLVR (with Tools)

Perform Group Relative Policy Optimization (GRPO) with zoom-in tools enabled. We use DeepEyes as the training framework.
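GRPO scores each rollout relative to its sampling group instead of using a learned critic. A minimal sketch of the group-relative advantage (not the DeepEyes/verl implementation):

```python
# Group-relative advantage used by GRPO: normalize each rollout's reward
# by the mean and std of its sampling group. Minimal sketch, not the
# DeepEyes/verl implementation.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct and two incorrect rollouts in a group of four:
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # correct rollouts get ~+1, incorrect ~-1
```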

```bash
# 1. Install DeepEyes from https://github.com/Visual-Agent/DeepEyes
# 2. Download the RL data and update the parquet file paths in the training script.
#    There are 3 parquet files from DeepEyes-47K and 1 from our Hugging Face repo.
# 3. Replace verl/utils/reward_score/__init__.py with our provided __init__.py.
# 4. Follow DeepEyes to set up the LLM judge, then start training:
bash train_rq_general.sh
```
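The "verifiable" part of RLVR is a programmatic check of the final answer. As a toy stand-in for the rule-based component of such a reward (the `<answer>` tag format is an assumption, and the actual setup additionally routes free-form answers to an LLM judge per DeepEyes):

```python
# Toy verifiable reward: 1.0 if the extracted final answer matches the
# ground truth (case/whitespace-insensitive), else 0.0. The <answer> tag
# format is an assumption; the repo's reward also uses an LLM judge.
import re

def rule_based_reward(response, ground_truth):
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if not m:
        return 0.0  # no parseable final answer
    pred = m.group(1).strip().lower()
    return 1.0 if pred == ground_truth.strip().lower() else 0.0

print(rule_based_reward("<answer> B </answer>", "b"))  # 1.0
```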

🚀Evaluation

We evaluate primarily on XLRS-Bench to measure both average performance (Pass@1) and reasoning boundary (Pass@32).
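Pass@k is conventionally reported with the unbiased estimator of Chen et al. (2021): with n samples per question of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). A direct implementation:

```python
# Unbiased pass@k estimator: probability that at least one of k samples
# drawn (without replacement) from n generations, c of them correct,
# is correct.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # fewer wrong samples than draws: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(32, 1, 1))   # 0.03125 (= 1/32)
print(pass_at_k(32, 1, 32))  # 1.0
```

Averaging `pass_at_k(n, c, 1)` and `pass_at_k(n, c, 32)` over questions gives Pass@1 and the Pass@32 reasoning boundary, respectively.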

Evaluation Steps

To replicate the results in our paper:

```bash
# 0. Run prepare_xlrs_data.ipynb to preprocess the evaluation data.
# 1. Convert the checkpoint from PyTorch format to a Hugging Face model.
bash s1.sh
# 2. Deploy the model with vLLM (or with Ray via `serve run ray.yaml`).
bash s21.sh
# 3. Query the vLLM server; this may take 1-2 days for the Pass@32 evaluation.
bash s22.sh
# 4. Compute the metrics.
bash s232.sh
```

Expected Results

We also provide our trained model checkpoints here.

| Method | Pass@1 | Pass@32 |
|---|---|---|
| Baseline (RLVR) | 50.01 | 82.58 |
| + Pre-warming (SuperRS-VQA) | 52.39 | 91.85 |
| + Text Cold Start (Ours) | 60.40 | 96.25 |

🤝Acknowledgement

This repo benefits from DeepEyes and LLaMA-Factory. Thanks for their wonderful work.
