Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Linketic
[2025.10.26] Code of VGGT-X has been released!
[2025.09.30] Paper release of our VGGT-X on arXiv!
VGGT-X takes dense multi-view images as input. It first uses memory-efficient VGGT to losslessly predict 3D key attributes. Then, a fast global alignment module refines the predicted camera poses and point clouds. Finally, a robust joint pose and 3DGS training pipeline is applied to produce high-fidelity novel view synthesis.
First, clone this repository to your local machine, and install the dependencies.
git clone --recursive https://github.com/Linketic/VGGT-X.git
cd VGGT-X
conda create -n vggt_x python=3.10
conda activate vggt_x
pip install -r requirements.txt
Now, put your image collection under path/to/your/scene/images. Please ensure that the images are stored in /YOUR/SCENE_DIR/images/ and that this folder contains only the images.
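For example, assuming your captures are JPEG files in a hypothetical folder ~/captures/scene1, the expected layout can be prepared with:
mkdir -p /YOUR/SCENE_DIR/images
cp ~/captures/scene1/*.jpg /YOUR/SCENE_DIR/images/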
Then run the model and get the COLMAP results:
python demo_colmap.py --scene_dir /YOUR/SCENE_DIR --shared_camera --use_ga
The reconstruction result (camera parameters and 3D points) will be automatically saved under /YOUR/SCENE_DIR_vggt_x/sparse/ in the COLMAP format (currently only the PINHOLE camera type is supported), for example:
SCENE_DIR/
└── images/
SCENE_DIR_vggt_x/
├── images/
└── sparse/
    ├── cameras.bin
    ├── images.bin
    └── points3D.bin
Note that everything in /YOUR/SCENE_DIR/ is soft-linked into the new folder /YOUR/SCENE_DIR_vggt_x/, except for the sparse/ folder. This minimizes additional storage usage and makes the reconstruction results easy to use. If /YOUR/SCENE_DIR/sparse/ exists, it is taken as ground truth and the pose and point-map evaluation results are saved to /YOUR/SCENE_DIR_vggt_x/sparse/eval_results.txt. If you have multiple scenes, don't hesitate to try our provided colmap_parallel.sh for parallel running and automatic metrics gathering.
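The saved binary model can also be inspected with the standard COLMAP command-line tools. As a minimal sketch, assuming the colmap CLI is installed separately (it is not part of this repo's requirements), the .bin files can be converted to human-readable text (use sparse/0 instead of sparse if the model is stored in a numbered subfolder, as in the visualization command below):
mkdir -p /YOUR/SCENE_DIR_vggt_x/sparse_txt
colmap model_converter --input_path /YOUR/SCENE_DIR_vggt_x/sparse --output_path /YOUR/SCENE_DIR_vggt_x/sparse_txt --output_type TXT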
Script Parameters
Suffix for the output folder (_vggt_x by default). You can set any desired suffix for the output folder.
Random seed for reproducibility.
If specified, global alignment will be applied to the VGGT output for better reconstruction. The matching results will be saved to /YOUR/SCENE_DIR_vggt_x/matches.pt.
If specified, the estimated depth and confidence maps will be saved to /YOUR/SCENE_DIR_vggt_x/estimated_depths/ and /YOUR/SCENE_DIR_vggt_x/estimated_confs/ as .npy files.
If specified, only the first total_frame_num images will be used for reconstruction. Otherwise, all images are processed.
Chunk size for frame-wise operations in VGGT. The default value is 512. You can specify a smaller value to reduce the computational burden of VGGT.
Maximum number of query points for XFeat matching. For each image pair, XFeat generates up to max_query_pts matches. If not specified, it is set to 4096 when the number of images is less than 500 and 2048 otherwise. You can specify a smaller value to reduce the computational burden of global alignment.
Maximum number of points in the COLMAP point cloud. The default value is 500000.
If specified, a shared camera is used for all images. An example invocation combining these options is sketched after this list.
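As a concrete illustration, here is a possible invocation for a large scene. Only --scene_dir, --shared_camera, and --use_ga are confirmed by the command above; the remaining flag names are hypothetical guesses derived from the parameter descriptions, so check python demo_colmap.py --help for the exact names:
# --total_frame_num, --chunk_size, and --max_query_pts below are hypothetical flag names (verify with --help)
python demo_colmap.py --scene_dir /YOUR/SCENE_DIR --shared_camera --use_ga --total_frame_num 300 --chunk_size 256 --max_query_pts 2048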
NVS Dataset Preparation
For novel view synthesis on MipNeRF360, please download the 360_v2.zip and 360_extra_scenes.zip from MipNeRF360.
cd data
mkdir MipNeRF360
unzip 360_v2.zip -d MipNeRF360
unzip 360_extra_scenes.zip -d MipNeRF360
For reconstruction on the TnT dataset, please download the preprocessed TnT_data. More details can be found here.
cd data
unzip TNT_GOF.zip
Following CF-3DGS and HT-3DGS, we select 5 scenes from CO3Dv2. They can be downloaded from here. Then run:
cd data
unzip CO3Dv2.zip
With the datasets prepared, you can replace $dir in colmap_parallel.sh with your dataset directory and run it to efficiently obtain the inferred 3D key attributes in COLMAP format. The results can be directly used for joint pose and 3DGS reconstruction.
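For example, after setting $dir inside colmap_parallel.sh to data/MipNeRF360, the whole dataset can be processed with (assuming the script is run from the repo root and takes no extra arguments):
bash colmap_parallel.sh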
Since loading and processing hundreds of images takes a while, we provide offline visualization through viser.
python colmap_viser.py \
-c /YOUR/SCENE_DIR_vggt_x/sparse/0 \
-i /YOUR/SCENE_DIR_vggt_x/images \
-r /YOUR/SCENE_DIR/sparse/0
Note that -i and -r are optional. Set them when you want to show images along with the camera frustums or compare against the ground-truth camera poses.
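For a quick look at only the predicted cameras and point cloud, -c alone is sufficient:
python colmap_viser.py -c /YOUR/SCENE_DIR_vggt_x/sparse/0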
The exported COLMAP files can be directly used with CityGaussian for Gaussian Splatting training. That repo is based on Gaussian Lightning and supports various baselines such as MipSplatting and MCMC-3DGS. The guidance for joint pose and 3DGS optimization has also been incorporated into our repo.
If you find this repository useful for your research, please use the following BibTeX entry for citation.
@misc{liu2025vggtxvggtmeetsdense,
title={VGGT-X: When VGGT Meets Dense Novel View Synthesis},
author={Yang Liu and Chuanchen Luo and Zimo Tang and Junran Peng and Zhaoxiang Zhang},
year={2025},
eprint={2509.25191},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.25191},
}
This repo benefits from VGGT, VGGT-Low-Ram, MASt3R, PoseDiffusion, 3RGS and many other inspiring works in the community. Thanks for their great work!
See the LICENSE file for details about the license under which this code is made available.


